REN

Ph.D. in Computer Science at Rutgers University

Three kinds of Char array in C

Char arrays are used all the time in C, but newcomers to the language are often confused about how to declare and use them. In this post, I'll show the difference between three kinds of char array in C. First of all, let's take a look at the code below:

/* test.c */
#include <stdio.h>
#include <stdlib.h>   /* malloc, free */
#include <string.h>   /* memset, strcpy */

int main(void) {
	// Declaration of four arrays
	const char *r = "hello world!";
	char *s = "hello world!";
	char t[20] = "hello world!";
	char *u = malloc(20 * sizeof(char));
	if (u == NULL)
		return 1;
	memset(u, 0, 20);
	strcpy(u, "hello world!");
	// print the address of each variable itself
	printf("%p\n", (void *)&r);
	printf("%p\n", (void *)&s);
	printf("%p\n", (void *)&t);
	printf("%p\n", (void *)&u);
	printf("-------------\n");
	// print the address of the first element in each array
	printf("%p\n", (void *)r);
	printf("%p\n", (void *)s);
	printf("%p\n", (void *)t);
	printf("%p\n", (void *)u);
	printf("-------------\n");
	// print the content of arrays (correct)
	printf("%s\n", r);
	printf("%s\n", s);
	printf("%s\n", t);
	printf("%s\n", u);
	printf("-------------\n");
	// print the content of arrays (incorrect on purpose: wrong argument
	// type for %s, which is undefined behavior and triggers a warning)
	printf("%s\n", &r);
	printf("%s\n", &s);
	printf("%s\n", &t);
	printf("%s\n", &u);
	free(u);
	return 0;
}

Let's compile this program with GCC, using the -m32 option so that the addresses are short 32-bit values:

gcc -m32 -o test test.c

The program prints:

0xffffd22c
0xffffd230
0xffffd238
0xffffd234
-------------
0x8048770
0x804877d
0xffffd238
0x804b008
-------------
hello world!
hello world!
hello world!
hello world!
-------------
p�} 	hello world!
} 	hello world!
hello world!
	hello world!

From the output of this program, we can clearly see the difference between these three kinds of char arrays:

1. Defined as const char array using pointer

const char *r = "hello world!";
char *s = "hello world!";

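For this kind of char array, the pointer variable itself (r or s) lives on the stack, while the characters it points to live in a read-only data section (.rodata with GCC on Linux, not the writable .data section). Leaving out const, as with s, does not make the string writable: the literal is still stored in read-only memory, and modifying it is undefined behavior. In this example, r and s behave the same way, although they point to two separate copies of the literal (0x8048770 and 0x804877d in the output above).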
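Since writing through s (for example s[0] = 'H') is undefined behavior and usually crashes on Linux, here is a minimal sketch of the safe alternative: copy the literal into memory you own before changing it. The helper names writable_copy and capitalize are my own, made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of a string that the caller
 * may modify and must free(). Returns NULL if the allocation fails. */
char *writable_copy(const char *literal)
{
	assert(literal != NULL);
	size_t n = strlen(literal) + 1;   /* +1 for the terminating '\0' */
	char *copy = malloc(n);
	if (copy != NULL)
		memcpy(copy, literal, n);
	return copy;
}

/* Hypothetical helper: upper-cases the first character in place.
 * Only valid for writable memory, never for a string literal. */
void capitalize(char *text)
{
	assert(text != NULL);
	if (text[0] >= 'a' && text[0] <= 'z')
		text[0] = (char)(text[0] - 'a' + 'A');
}
```

Calling capitalize on the result of writable_copy(r) changes only the copy; the literal that r points to stays untouched.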


2. Defined as char array

char t[20] = "hello world!";

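This kind of char array is stored entirely on the stack. There is no separate pointer variable, so the address of the array (&t) and the address of its first element (t) are the same, and the contents can be modified freely.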
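To make the point concrete, here is a small sketch showing that t can be modified in place and that sizeof and strlen measure different things. The struct sizes and the function inspect_array are names I made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical struct, used only to report the two sizes side by side. */
struct sizes {
	size_t capacity;  /* bytes reserved for the array, from sizeof */
	size_t length;    /* characters before the '\0', from strlen */
};

struct sizes inspect_array(void)
{
	char t[20] = "hello world!";  /* the unused tail is zero-filled */

	t[0] = 'H';                   /* legal: t is ordinary writable memory */
	assert(strcmp(t, "Hello world!") == 0);

	struct sizes result = { sizeof(t), strlen(t) };
	return result;
}
```

Here capacity is the full 20 bytes of the array, while length counts only the 12 characters of the text.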


3. Defined as a dynamically allocated char array

char* u = (char*)malloc(20*sizeof(char));

For this kind of char array, the pointer variable u lives on the stack, while the contents it points to live on the heap.
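A heap buffer needs a bit more care than the other two kinds: the allocation can fail, the text has to fit, and the memory must be released with free. Below is a minimal sketch under those assumptions; make_heap_string and BUF_SIZE are hypothetical names, and I use calloc instead of malloc plus memset because it zero-fills in one step.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE 20  /* same capacity as in the example above */

/* Hypothetical helper: heap-allocates a zeroed buffer and copies text
 * into it. Returns NULL if the text does not fit or allocation fails.
 * The caller owns the result and must free() it. */
char *make_heap_string(const char *text)
{
	assert(text != NULL);
	if (strlen(text) >= BUF_SIZE)
		return NULL;                      /* no room for text plus '\0' */

	char *u = calloc(BUF_SIZE, sizeof(char));
	if (u == NULL)
		return NULL;

	strcpy(u, text);                      /* safe: length checked above */
	return u;
}
```

Note that sizeof(u) gives the size of the pointer, not the 20 bytes that were allocated, so the capacity has to be tracked separately.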


From this program, we can see that for the array declared without dynamic memory, the address of the array (&t) is the same as the address of its first element (t). Remember that t plays two roles: it names the whole array (sizeof(t) evaluates to 20), and in most other expressions it converts to a pointer to the first element.

For the dynamically allocated array, the address of the pointer variable (&u) differs from the address of the first element (u): &u is on the stack, while u points into the heap. Likewise, for the arrays declared as pointers to string literals, the addresses of the pointer variables (&r and &s) differ from the addresses of the first elements (r and s): &r and &s are on the stack, while r and s point into the read-only data area. The table below illustrates how they exist in main memory.
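The double role of t is easy to check in code. In the sketch below (function names are mine, for illustration only), t and &t hold the same address but have different types, so pointer arithmetic on them moves by different amounts.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if t and &t refer to the same address, 0 otherwise. */
int same_address(void)
{
	char t[20] = "hello world!";
	assert(sizeof(t) == 20);              /* here t names the whole array */
	return (void *)t == (void *)&t;
}

/* Bytes covered by one step on t, which converts to a char *. */
size_t step_of_t(void)
{
	char t[20] = "hello world!";
	return (size_t)((t + 1) - t);
}

/* Bytes covered by one step on &t, which has type char (*)[20]. */
size_t step_of_addr_t(void)
{
	char t[20] = "hello world!";
	return (size_t)((char *)(&t + 1) - (char *)&t);
}
```

So t + 1 points at the second character, while &t + 1 points just past the end of the whole 20-byte array.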

Simple virtual memory image for the code example

// recap of our sample code
const char *r = "hello world!";            // const char array
char *s = "hello world!";                  // const char array
char t[20] = "hello world!";               // char array
char* u = (char*)malloc(20*sizeof(char));  // dynamic char array
memset(u, 0, 20);
strcpy(u, "hello world!");

In theory, the incorrect code in the last four printf calls should not produce a meaningful result. So how could all of them print "hello world!" as well? Let's analyze how the incorrect code produces "correct" or "partially correct" output, using GDB to see how the variables are arranged:

(gdb) x/10x r
0x8048770:	0x6c6c6568	0x6f77206f	0x21646c72	0x6c656800
0x8048780:	0x77206f6c	0x6c64726f	0x78300021	0x0a786c25
0x8048790:	0x2d2d2d00	0x2d2d2d2d
(gdb) x/10x &r
0xffffd22c:	0x08048770	0x0804877d	0x0804b008	0x6c6c6568
0xffffd23c:	0x6f77206f	0x21646c72	0x00000000	0x00000000
0xffffd24c:	0x80ad5600	0xf7fb43dc

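As we can see in the GDB dump above and in the earlier output, r sits at 0xffffd22c, s at 0xffffd230, t at 0xffffd238, and u at 0xffffd234. Looking carefully, t comes after u on the stack even though we defined t before u in the code. The C language does not fix the order of local variables, so the compiler is free to rearrange them; my guess is that GCC reordered them to keep the data aligned and save a fetch cycle. As a result, t is the last of the four variables, and its content sits on the stack directly after the three pointers. None of the bytes in those pointers is zero, so a %s print that starts at &r, &s, or &u emits the raw pointer bytes as garbage and then runs on into the contents of t until it reaches the terminating '\0'. The weird output of the last four lines of code makes sense now.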
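To check this reading without relying on undefined behavior, here is a small simulation. I copied the bytes from the GDB dump of &r into an ordinary array, assuming the little-endian 32-bit layout shown there, and count how many bytes a %s print would emit from each starting point. The names stack_image and bytes_printed_from are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes taken from the GDB dump of &r, in little-endian order:
 * r = 0x08048770, s = 0x0804877d, u = 0x0804b008, then the array t. */
static const unsigned char stack_image[32] = {
	0x70, 0x87, 0x04, 0x08,   /* r at 0xffffd22c */
	0x7d, 0x87, 0x04, 0x08,   /* s at 0xffffd230 */
	0x08, 0xb0, 0x04, 0x08,   /* u at 0xffffd234 */
	'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!',
	/* t starts at 0xffffd238; its remaining 8 bytes are zero */
};

/* Number of bytes a %s print would emit when started at this offset. */
size_t bytes_printed_from(size_t offset)
{
	assert(offset < sizeof(stack_image));
	return strlen((const char *)stack_image + offset);
}
```

Starting at offset 0 (where &r is) gives 24 bytes: 12 bytes of pointer garbage plus the 12 characters of the text. Offsets 4 (&s), 8 (&u), and 12 (&t) give 20, 16, and 12, which matches the shrinking garbage prefix in the last four lines of the program output.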