sizeof(char) is 1

2007-04-08

I found this piece of code, apparently in Mozilla, using Google’s codesearch facility. The code in question is:

group->text = (char *) malloc(sizeof(char) * strlen(params->text));

If you’re not careful you see code like this all the time. sizeof(char) is 1. The standard says so. There is no point in writing sizeof(char) in code like this. It adds clutter and it suggests that you don’t know enough C. In particular, you don’t know that sizeof(char) is 1, guaranteed. If you don’t know that, what else don’t you know about C? Maybe you don’t know that malloc returns void * and that this type can be correctly converted to any other (object) pointer type without an explicit cast.

Aside: There are more serious problems with this code. The use of strlen inside a malloc without a +1 should always fire off alarm bells. And sure enough if we look at the surrounding code:

group->text = (char *) malloc(sizeof(char) * strlen(params->text));
if (group->text == NULL) {
	res = MP_MEM;
}
strcpy(group->text, params->text);

We see that the code is copying a string, but it doesn’t allocate enough memory for it. Recall that in C you need strlen(s)+1 bytes of storage to copy the string s. C invites you to make this error. Daily.
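Here is a minimal corrected sketch: drop the redundant cast and sizeof(char), allocate the extra byte, and don't copy into a failed allocation (I'm assuming the caller checks res):

group->text = malloc(strlen(params->text) + 1);
if (group->text == NULL) {
	res = MP_MEM;
} else {
	strcpy(group->text, params->text);
}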

Some people might think it’s reasonable to use sizeof(char) because char might be #defined to something else (such as wchar_t). You can’t. The standard says it’s illegal to do so.

In C a byte is the amount of storage used to represent a char. When you’re talking about storage space the two are essentially synonymous. What about those weird architectures like the TI C54x DSP range? This comment in the FreeType sources says that it has 16-bit chars. That’s right, sizeof(char) is still 1, but a char has more than 8 bits. It’s still a byte; not all bytes are 8 bits (if you really do mean 8-bit byte then say octet). Does your bit-twiddling code work when CHAR_BIT is not 8?
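As a quick self-test, here is a minimal sketch using CHAR_BIT from <limits.h>; on a 16-bit-char DSP it would report 16 bits per byte:

#include <limits.h>
#include <stdio.h>

int main(void)
{
	/* CHAR_BIT is the number of bits in a byte; the standard guarantees at least 8. */
	printf("bits per byte: %d\n", CHAR_BIT);
	/* The portable bit width of any object is its sizeof times CHAR_BIT. */
	printf("bits in a long: %u\n", (unsigned)(sizeof(long) * CHAR_BIT));
	return 0;
}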

Zlib gives an example of how to code for the wrong assumptions. The code is:

#define Buf_size (8 * 2*sizeof(char))
/* Number of bits used within bi_buf. (bi_buf might be implemented on
 * more than 16 bits on some systems.) */

See how they understand that a 2-char buffer might be bigger than 16 bits? Unfortunately they assume that sizeof always returns its answer in 8-bit chunks (octets) and that it is sizeof(char) that varies. Wrong.
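Assuming the intent is “the number of bits in two chars”, a sketch of the fix makes CHAR_BIT the varying factor, since sizeof(char) never varies:

#include <limits.h>

/* Bits used within bi_buf: sizeof(char) is always 1; CHAR_BIT is what varies. */
#define Buf_size (2 * CHAR_BIT)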

sizeof and malloc

sizeof and malloc go together. sizeof yields results in bytes and malloc requires bytes for its argument.

A classic use is in code like this
(a more or less randomly found example):

file = (URL_FILE *)malloc(sizeof(URL_FILE));

This is correct use of sizeof and malloc, but I don’t recommend it. There are better ways to write this code.

The first issue I have with the code is that it assumes the file object is of type URL_FILE in order to calculate the right size. As it happens it is, but the assumption isn’t necessary. sizeof can operate on an expression and should do so in this case:

file = (URL_FILE *)malloc(sizeof *file);

The other issue I have is that there is simply no need to explicitly cast the result of malloc to the type URL_FILE *. malloc returns void *; the standard defines it to do so, and that should be relied upon. void * will correctly convert, without an explicit cast, to any object pointer type. If the header files supplied by your vendor do not declare malloc correctly then your C implementation is not remotely standards compliant and you should complain to your vendor and probably fix the headers (yes, by editing the system-supplied header files). If your implementation of malloc really does return int, so that you must supply an explicit cast, then you have Special Needs and you should probably think very carefully before writing any C code at all.

[Added on 2007-06-26: Apparently K&R’s The C Programming Language contains the advice to explicitly coerce the return value of malloc; perhaps that’s why lots of programmers do it. In the list of errata for that book they say (in the erratum for page 142) that, in the light of the Standard’s conversion rules, the explicit coercion is “not necessary […], and possibly harmful if malloc, or a proxy for it, fails to be declared as returning void *”.]

So now we have:

file = malloc(sizeof *file);

Much nicer. If you need persuading that this code is better, consider what happens if the type of file is changed. In my suggested code, the call to malloc does not need changing. In the example code as first presented, both the cast and the sizeof will need changing. Will your programmer change both? If they forget to change the sizeof then the wrong size block of memory will get allocated. Oopsie. Another reason why the code without the explicit cast is better is if you forget to #include <stdlib.h> to declare malloc. With the explicit cast many compilers will simply assume that you know what you’re doing, namely that you intended to call an undeclared function which implicitly returns int and convert the result to a pointer type, and they won’t warn you. Without the explicit cast most compilers will generate a warning even on their most feeble settings.
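To make the maintenance hazard concrete, suppose the type of file is changed to a hypothetical LOCAL_FILE:

/* The programmer updates the cast but forgets the sizeof: */
file = (LOCAL_FILE *)malloc(sizeof(URL_FILE));	/* allocates the wrong size */

/* With the recommended style there is nothing extra to update: */
file = malloc(sizeof *file);	/* always allocates sizeof(LOCAL_FILE) */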

By the way, the syntax for sizeof is either sizeof(type) or sizeof expression. You only need parentheses around the operand of sizeof if it is a type. If the expression needs protecting from adjacent expressions then put the sizeof itself inside parentheses: malloc((sizeof *p) + (sizeof *q)). If you don’t need the parentheses then don’t use them; it shows that you know the syntax of sizeof and generates confidence that you know what you’re doing. p = malloc(sizeof *p) should be idiomatic.

For more examples of cruelly finding somebody else’s bugs using codesearch, see The perils of multiple-value-prog1. Maybe you’ll learn some Lisp too.


21 Responses to “sizeof(char) is 1”

  1. glorkspangle Says:

    Just conceivably, and trying to look on these code fragments as generously as possible, some of these programmers know what they are doing, and are writing code which has to be processed by a compiler for a language other than C, or other than standard C.
    An erstwhile colleague of yours and mine once long ago worked on a “C” compiler, which believed that sizeof(int) == sizeof(char) != 1. I found this out when we worked together and the subject arose of the problems that follow (in the standard library) if sizeof(int) == sizeof(char).
    I do not know, and wish that I never had to care, what C++ compilers think about any of these issues.

  2. drj11 Says:

    Ah yes, the land where sizeof(char) is not 1. A parallel universe where everything almost works the same way, but not quite. That’s definitely Special Needs territory. Some people will need to write code that works both in C and in a C-like language where sizeof(char) != 1. A chosen few. I would expect to see such code littered with explosive comments.

    I think you’re being overly charitable, but I agree your interpretation is possible.

    Incidentally Jack Klein reckons, here and here, that the SHARC DSP has (or had) sizeof(int) == sizeof(char) == 1; CHAR_BIT == 32.

    When sizeof(char) == sizeof(int) the standard library does indeed have problems. Text I/O still seems possible since even if you believe that all representations of char are also representations of int you can rob a value to use for EOF because no-one expects to reliably read and write all char values. Binary I/O seems impossible to do reliably using the standard library since every unsigned char is also an int so we can’t distinguish EOF.

    But wait, we can’t use the return value of fgetc to detect EOF, but we can still use feof and ferror. What about:

    int c;
    while (1) {
        c = fgetc(f);
        if (feof(f) || ferror(f)) break;
        /* use c */
    }

    Are there other problems?

    C++, Meh. Incidentally wordpress seems to aggressively merge my C and C++ tags so that I can’t distinguish them. Sorry about that.

  3. H3g3m0n Says:

    Firstly a lot of projects use programs like lint or splint to ensure the code is correct. These programs are extremely strict about how you do things and often require odd stuff like /*@NULL@*/ being declared for any functions that have the possibility of returning null.

    In regards to the sizeof(char):
    Personally I find sizeof(char) is much more readable, and clearly shows what is being allocated even if it is redundant. sizeof(char) clearly shows to people that you want to allocate space for chars.

    It also makes it painfully obvious to someone reading the code that the type hasn’t accidentally been left out. The same bit of code can also be cut and pasted into an area that allocates a different type.

    Some of those might not be that useful, but any half-optimized compiler should turn sizeof(char) into an integer constant anyway (with gcc the two binaries have different checksums, but it’s possible a checksum of the source file is included in the binary or some such). Either way you shouldn’t sacrifice code readability for extremely minor optimizations that the compiler should handle.

    It’s also possible that they might decide to do a find-and-replace of char with a unicode char in that particular file.

    In regards to the malloc:
    The extra level of casting is actually needed in some cases. It will work without it 99% of the time, but eventually you will hit one of the odd situations where it doesn’t work correctly. Even if you hit one of those situations it’s extremely hard to tell that you have, since you can often corrupt the stack in a minor way without doing any noticeable damage. I had a situation when coding a SunRPC program that required such a cast for a linked-list pointer; the entire program was crashing, but not until it received a second request from a client, and the bit that was crashing was part of the RPC libs, not my code, so debugging was futile.

    Secondly malloc used to return a char* rather than a void*, so it’s more portable to outdated systems if you cast it. http://en.wikipedia.org/wiki/Malloc#Dynamic_memory_allocation_in_C

  4. drj11 Says:

    @H3g3m0n:

    Thanks for taking the time to reply.

    On using static code checking tools like lint: first decide what you want your code to look like (coding style), then find a tool that checks it. Do not blindly pick a tool and then massage your code to satisfy the tool. Sometimes you will have to change your code to satisfy a tool (redundant initialisation and return statements are common); when you can’t avoid it, add a comment that explains why you added the code: /* avoids compiler warning */. Do not add rules to your coding standard simply because they keep a tool happy. In other words you should have a good reason for preferring a certain style of code without relying on the “tool X says Y” argument.

    There is a communicative aspect, I agree. And I do slight the communicative aspect. But the communication argument can go either way. malloc allocates a number of chars, therefore by using it I’m already communicating that I intend to allocate chars. Really I’d rather not see types in calls to malloc; similar to the URL_FILE example, an array of char should be allocated using an expression: buf = malloc(len * sizeof *buf). This code also works when the element type of buf is changed.

    I don’t advocate cut-and-paste as a way of creating code, so I’m not concerned about coding style practices that make it easier.

    I’m not concerned about the optimisation aspect (it hadn’t even occurred to me until you mentioned it), and in any case I agree, readability should only be sacrificed on the altar of optimisation when proven to be absolutely necessary (next year, your code will run faster anyway, but it won’t have got any more readable). My very brief experiments with gcc suggest that an extra «* sizeof(char)» has no effect on the resulting binary, as I would expect.

    So you want to make your code unicode compatible and you think that at some point you’ll be doing a find-and-replace of «char»? I hope not.

    Yes, some people’s malloc requires a cast, I already mentioned this. Those people have Special Needs and should not be looking to this sort of article for advice; they need to think very carefully about how to write code in a non-standard environment. The C standard, which formalised the notion that malloc returns void *, has now been around for about 17 years. It’s time that we started to code as if vendors implemented it.

    I already hinted that adding a cast can be dangerous, but let’s go over that. If I write file = (URL_FILE*)malloc(sizeof *file); then this can mask a problem if I’ve forgotten to include <stdlib.h>. If I forget to include the header file then the compiler will assume that malloc returns an int. I’m using an explicit cast so the compiler generates the code to convert from an int return value to a pointer. No warning. This mostly works, because on a lot of architectures int and a pointer have more or less the same representation and use more or less the same registers (for returned values for example). It will go wrong when you port your code to a 68K architecture because pointers are returned in A registers and int is returned in a D register so the code that converts from int to pointer will take a value in a D register and convert to a pointer in an A register. Unfortunately malloc returns its value in an A register (the implementation of malloc isn’t incorrect, just the fact that we forgot to declare it). The value in a D register that is being converted will be nonsense. It will also go wrong on an architecture where int is 32-bits and pointers are 64-bits. For example Win64.
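    To make that concrete, a minimal sketch of the hazard (the diagnostic wording here is typical, not any particular compiler’s):

    /* Note: no #include <stdlib.h> anywhere in this file. */
    char *p = (char *)malloc(10);   /* cast converts the assumed int return;
                                       compiles silently, may crash at run time */
    char *q = malloc(10);           /* most compilers warn: assignment makes
                                       pointer from integer without a cast */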

    By the way, did you know that sizeof(char) is 1?

  5. Laurie Cheers Says:

    Worth mentioning: in C++ you need an explicit cast to convert void* into char*.

  6. drj11 Says:

    @Laurie: Thanks for your suggestion, but I’m a C programmer, not a C++ programmer, so I wouldn’t know anything about that. If it is true that in C++ you need an explicit cast to convert from void * to char * then I can see why I often hear C++ programmers say you should avoid malloc.

    I suspect there’s a whole load of bad C style recommended by C++ programmers who just don’t know C well enough. Meh.

  7. Mick Bythell Says:

    I would argue that the knowledge that sizeof(char) == 1 is not the issue here.
    It’s usually good practice (coding standards) that magic numbers not be placed in source code, irrespective of any code analysis tool.

    >If it is true that in C++ you need an explicit cast to convert from void * to char * then I can see why I often hear C++ programmers say you should avoid malloc.

    Probably a different issue, related to mixing malloc/free with new/delete and problems with object initialisation/uninitialisation when constructing/destructing.

  8. drj11 Says:

    Mick: I’m not sure I understand what you’re saying about magic numbers. Of course I agree that it’s good practice to avoid magic numbers in source code, but the discussion isn’t about that. I’m not advocating replacing sizeof(char) with 1 in the code, I’m advocating writing code as if you knew that sizeof(char) is 1. All but one of the pieces of code I discuss avoids using magic numbers; magic numbers simply aren’t an issue here.


  9. […] sizeof(char) is 1 – but a char doesn’t necessarily occupy just 1 byte, so the arithmetic is harder than it looks. […]


  10. Using sizeof on strings caught me out a few times when I started programming in C:

    http://www.pixelbeat.org/programming/c_c++_notes.html#sizeof_strings

  11. Wesko Says:

    sizeof(char) can be != 1 byte in ANSI C! Its size is 1, yes; that means you need 1 char to store a char. It’s not defined as 1 byte in the standard, that’s the trouble. Something we call a byte does not exist.

    Moreover, a char is not even necessarily signed. Of course, all popular compilers DO assume a char is just 1 signed byte.

    We all healthily assume it’s 1 and the first sucker that creates a compiler that uses 2 bytes as a size has lots of problems of course.

    sizeof(int) would return the number of chars you need to store an int. I hope I added some value to this nice discussion.

    • drj11 Says:

      @Wesko:

      Allow me to quote from:

      ISO 9899:1999 “Programming Languages — C”

      Section 6.5.3.4 “The sizeof operator”

      “The sizeof operator yields the size (in bytes) of its operand” … “When applied to an operand that has type char, unsigned char, or signed char, (or a qualified version thereof) the result is 1”

      1 char, 1 byte. In standard C.

      A “byte” in C is not always 8 bits. There are compilers out there where CHAR_BIT > 8. But still sizeof(char) is 1.

      The habit of thinking “byte is 8 bits” is relatively modern. Historically a byte used to be all sorts of different widths in different situations (example: 7- and 9-bit bytes on the PDP). In the Lisp community byte isn’t even a fixed sized quantity.

  12. Will Says:

    I actually do this myself. Not because I don’t know sizeof(char) == 1, but because I think it’s better self-documenting code. I only ever do this when allocating space that’s to be used as a string, so including this statement (which will be removed by the optimiser anyway) aids in identifying the point of an allocation.

    Yes it’s superfluous to the compiler, but it makes it clearer to anyone else reading the code (just like brackets in expressions do).

    • Derek Says:

      I agree. The argument that “It adds clutter and it suggests that you don’t know enough C” is nonsense.

      The standard also allows goto statements and global variables that can reduce “clutter”, so let’s all move to using those as well..

      • N Says:

        Well, if you’re writing optimized code, you better loose any strong beliefs you might have against global variables and go to.
        There are times when they are the better option and do not damage readability if properly used.

        About goto, see:

        http://pplab.snu.ac.kr/courses/adv_pl05/papers/p261-knuth.pdf

      • xilun Says:

        Well, if you are dynamically allocating and releasing resources, I would rather see some gotos in your source code (to handle errors during multiple allocations) than not.

        On the other hand if you are writing code for a semi-critical system where dynamic allocation is forbidden, you will probably use global variables (probably most of some struct type if the code is well organized).

        Indeed to produce code of decent quality and correct maintainability, you should probably forget about dogmatism and start thinking again on a case by case basis about what will improve your code for rational reasons…

    • David Jones Says:

      Think about that in the context of using strlen, as in the first example in the article.

      • willcannings Says:

        In that context I probably wouldn’t use sizeof(char), but I also very rarely need to base the length of one string on the length of another.


  13. […] me how often I run across programming language elitists who express opinions such as those found in sizeof(char) is 1. The prevailing opinion here is that if you write extraneous code, it’s obvious that you […]

  14. punkatux Says:

    group->text = (char *) malloc(sizeof(char) * strlen(params->text));
    if (group->text == NULL) {
        res = MP_MEM;
    }
    strcpy(group->text, params->text);
    ———————————————————–
    Isn’t this just this?
    group->text = strdup(params->text);

    • drj11 Says:

      Well, aside from the bug, it’s the same as strdup, except for the fact that strdup isn’t an ISO Standard C function.
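      If you want the convenience in portable code, here’s a minimal sketch of a strdup-alike using only ISO Standard C (my_strdup is just an illustrative name):

      #include <stdlib.h>
      #include <string.h>

      char *my_strdup(const char *s)
      {
          size_t n = strlen(s) + 1;  /* +1 for the terminating NUL */
          char *p = malloc(n);
          if (p != NULL)
              memcpy(p, s, n);
          return p;
      }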

