C Language Gotchas
This document describes some easy-to-make mistakes in C.
- 1.1 <ctype.h> And Character Types
- 1.2 <stdio.h> And Character Types
- 2. printf Format Strings
- 3. Unbounded String Operations
- 4. Type Assumptions
- 5. Null Pointer Constants
- 6. I/O Error Checking
1. Character Data
1.1 <ctype.h> And Character Types
The char data type might, in any implementation, be signed or unsigned. However, the functions defined in <ctype.h> don't automatically work well with this: the standard says:
In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall be equal to the value of EOF.This means that if you write code like this, then you run the risk of passing illegal values to the ctype functions:
size_t count_whitespace(const char *s) { const char *t = s; while(isspace(*t)) /* possibly *t < 0 */ ++t; return t - s; }
One approach would be to pass all your character strings around explicitly using unsigned characters. But this is inconvenient when you have to interwork with other functions that haven't been designed with this approach in mind - for example the string handling functions found in the standard library.
The approach I adopt is to use an explicit cast. The above example would thus become:
size_t count_whitespace(const char *s) {
const char *t = s;
while(isspace((unsigned char)*t))
++t;
return t - s;
}
There are a couple of other points worth making here:
If you assign a character picked out of a string to an integer variable, the cast should be applied at that point, not just before passing the character to a ctype function.
You can see the idea that you're supposed to use the unsigned representation of characters surface in the definition of functions such as strcmp and memcmp, which are defined as interpreting the characters they compare as unsigned char values.
1.2 And Character Types
Characters obtained directly from stdio functions such as getc don't require casting before being used in a
Characters returned from stdio functions are usually assigned to int variables, not char or unsigned char variables, since if you do the latter you can no longer correctly check for EOF. Consider for example this code:
char c;
c = getchar();
while(c != EOF) {
...
c = getchar();
}
The loop may never terminate: if char is an unsigned type then EOF will be converted to some positive value. On systems with where char is signed, there is a more subtle bug. Suppose for example that EOF is -1 - then if character 255 is read it will be converted to the value -1 and terminate the input prematurely.
The conventional way to write this code is:
int c;
c = getchar();
while(c != EOF) {
...
c = getchar();
}
There is still a problem. Although EOF is guaranteed to be negative, and thus distinct from the value of any unsigned char, it is not guaranteed to be different from any such value when converted to an int. This is never a problem on many common systems; but it is at least a theoretical possibility permitted by the C standard. Imagine for instance a 16-bit word-addressed machine with a compiler that didn't try to pack multiple characters to the word but instead made char and int identical types. Then we would have (int)(unsigned char)65535 == -1.
(There are certainly word-addressed machines. The second half of the condition might be a reasonable design decision for a compiler vendor with customers who weren't very interested in text processing).
You might feel that you would be in good company if you ignored systems where the natural character and integer types were the same size, but this article would certainly be incomplete if it did so. To work correctly on such systems, the proper way to write the above loop would be:
int c;
c = getchar();
while(!feof(stdin) && !ferror(stdin)) {
...
c = getchar();
}
You should be careful to consider the effect of end of file or error on any tests you make on these values. Consider this loop, intended to scan all characters up to the next whitespace character received:
int c;
c = getchar();
while(!isspace(c)) {
...
c = getchar();
}
If EOF is returned before any whitespace is detected then this loop may never terminate (since it is not a whitespace character). A better way to write this would be:
int c;
c = getchar();
while(!feof(stdin) && !ferror(stdin) && !isspace(c)) {
...
c = getchar();
}
Finally, it is worth noting that although EOF is usually -1, all the standard promises is that it is a negative integral constant with type int.
2. printf Format Strings
Conversions contained in printf format strings require specific types to appear in the argument list: for example %d requires an int argument, and %.*s requires an int followed by a character pointer.
Unfortunately, the nature of the printf interface is such that the compiler can only do very limited type-checking, and is forbidden from doing the implicit conversion to the correct type that ordinary function calls have in C.
If you have something like, say, a pid_t object, and want to print out its numeric value, then the temptation is to just write %d and rely on the fact that pid_t objects do indeed often promote to a type compatible with int. For example, on my system, gcc compiles this without complaint, and it runs fine:
#include <stdio.h>
#include <unistd.h>
int main(void) {
printf("$=%d\n", getpid());
return 0;
}
But such facts are rarely guaranteed, and may not be true on some other system that your code will one day be ported to. My usual approach is to cast explicitly to the widest available type of the appropriate signedness, and used the correct conversion specification for the resulting type. For example:
printf("$ = %ld\n", (long)getpid());
This points up a possible problem: what if PIDs are bigger than longs? Of course this problem was here in the previous version - only worse, as the problem was with the (potentially) smaller type. Answers might be:
Use a pair of macros; but this make the code hard to read, and therefore hard to maintain.
Use intmax_t; but this relies on a C99 implementation, or an implementation that supports this type as an extension, which may limit portability.
Use long long; but this relies on a C99 implementation, or an implementation that supports this type as an extension, which may limit portability, and there's no sure guarantee that the type you're trying to print won't be even wider than long long.
Ignore the problem and hope that nobody ever violates your assumptions (but this might not even be safe with PIDs, never mind any other data type). Increasingly risky as 64-bit machines become more widespread.
2.1 %.*s And Pointer Differences
A common idiom used to print a substring delimited by two character pointers is to use %.*s with the ".*" part being satisfied by the difference of the two pointers. This, for example, compiles without complaint on my system:
#include <stdio.h>
void t(const char *p1, const char *p2) {
printf("%.*s\n", p2 - p1, p1);
}
But the pointer difference actually has type ptrdiff_t, not the int that printf will be expecting. On a machine with 64-bit pointers but 32-bit ints, this becomes a serious problem; a compiler than can check the format strings would, hopefully, at least produce a warning. The correct way to write this is to explicitly cast the pointer difference to an int:
printf("%.*s\n", (int)(p2 - p1), p1);
This points up the other problem with this approach - what if the pointer difference is so large that it won't fit an int (which might be as small as 16 bits)? This problem was always there; it's just that it was less noticeable before. Answers might be:
Check that the pointer difference is in range before casting it.
Limit your strings to a suitable size earlier in the code.
Ignore the problem, and assume that available memory will limit your strings to a suitable size (which is not a bad bet when int and the machine's address space are both 32 bits, but falls over if you port to machines with larger address spaces or smaller ints.) Increasingly risky as 64-bit machines become more widespread.
2.2 Format String Vulnerabilities
A "format string vulnerability" is where the programmer passes a user-supplied string to some function (typically e.g. printf and syslog) that will then interpret it in a way that can potentially be exploited to trick the program into doing something the author didn't intend. In some programs this is just a bug, but if the program runs with more privilege than the user than it becomes a security hole.
For example, suppose that the program has read some data from the user into the buffer somevar and now wants to print it out. The wrong way to do it would be:
printf(somevar);
A hostile user can choose the string they send in such a way as to extract information from the program or even to modify its data and internal workings. (Ask Google about "exploiting format string vulnerabilities" for more information about this. The point of this document is to teach you how not to be vulnerable to these exploits, not to teach you how to construct them.)
The correct way to do what the programmer usually intends in this situation is:
printf("%s", somevar);
or:
fputs(somevar, stdout);
3. Unbounded String Operations
Functions such as strcpy, strcat and sprintf have a well-known misfeature: they don't bound the size of the resulting string, which means that whatever buffer has been allocated for the result can be overrun if some check is not made.
Formally, overrunning the buffer just results in undefined behaviour. This might just make a program unreliable, but it might also be a security hole.
Various solutions exist.
Explicitly check that the result string will fit prior to the operation. If it doesn't, don't perform the operation but instead signal an error somehow. Example:
if(strlen(buffer) + strlen(str) >= sizeof buffer)
return -1;
strcat(buffer, str);(This assumes that buffer is an array, not a char *. If it is just a pointer, sizeof will return the wrong value, and you must find the size some other way.)
Use snprintf instead of sprintf. This ensures that only a buffer of known size will be updated, and returns the number of characters (excluding the terminating 0) that would be required. The disadvantage is that you have to rely on having C99 or a C library which supports snprintf as an extension.
Another worry with snprintf is that not every implementation chooses the same conventions for the return value: some old versions return -1 on overflow instead of the number of characters that would have been written if space were available. See also http://www.unixpapa.com/incnote/stdio.html.
Use strncpy instead of strcpy. However, you should be aware of two things; firstly strncpy does not guarantee to null-terminate the result if it is too long, merely to truncate it; and secondly that if there is space for a terminator, it fills up the rest of the buffer with 0s, not just the one byte that is necessary. To use this function safely, it is therefore necessary to explicitly set the last byte of the buffer to 0 after calling it; for example:
strncpy(buffer, str, sizeof buffer);
buffer[sizeof buffer - 1] = 0;The same warning regarding sizeof as above applies.
Use strncat instead of strcat. Be careful: the size parameter is the number of characters to copy, not the buffer size; and a 0 character is always appended. So there must be enough space in the buffer for the original string, the number of characters specified, and the terminator. For example:
strncat(buffer, str, sizeof buffer - strlen(buffer) - 1);
Use strlcpy and strlcat instead of strcpy and strcat. Unfortunately these functions are not yet defined in any standard.
Dynamically allocate sufficient memory for all result strings, rather than relying on fixed-size buffers and bound checking. The downside of this is that you then have much more memory allocation to worry about, and (particularly for sprintf) it can be tiresome to calculate all the lengths in advance. Example:
if(!(buffer = malloc(strlen(s1) + strlen(s2) + 1)))
return -1; /* malloc failed */
strcpy(buffer, s1);
strcat(buffer, s2);Use a string handling library which does most of the memory allocation work for you behind the scenes.
4. Type Assumptions
4.1 time_t
It's fairly common practice, especially in programs written under some variant of UNIX, to assume that time_t is a count of non-leap seconds from the start of 1970. In fact the C standard doesn't guarantee this - it only guarantees that it be an arithmetic type (which includes float and double). It doesn't guarantee that it has a resolution of seconds, nor even that it increases with time.
Even some UNIX standards don't guarantee a starting point of 1970.
The answer to this is to avoid making any unguaranteed assumptions about time_t. This might be quite hard work, as the functions defined by the C standard are not powerful enough for many applications.
Also, there's quite a bit of code available for free that already assumes the traditional UNIX time_t, and if you want to use it, that means you need a matching time representation.
One answer is to define your own time type, which you can guarantee to have the required representation, and use conversion functions to convert between your type and the system's time_t. On most (all?) UNIX systems, and anything that has a compatible time_t, this type can be a synonym for time_t and the conversion functions need do nothing. On other systems the type should be defined as a suitable integer type, and the conversion functions depend on what the system's time_t is like.
If you define your type as long, then, since that type may be only 32 bits wide, your program may stop working in 2038.
Another alternative is not to care and declare that your program will never work on systems with unusual time representations. But you may come to regret doing this.
5. Null Pointer Constants
The comp.lang.c FAQ has an entire section on null pointers. There are some points worth making here, though.
The standard says:
An integral constant expression with the value 0, or such an expression cast to void *, is called a null pointer constant. If a null pointer constant is assigned to or compared for equality to a pointer, the constant is converted to a pointer of that type. Such a pointer, called a null pointer, is guaranteed to compared unequal to a pointer to any object or function.Firstly, despite the name a "null pointer constant" need not actually be a null pointer according to the definition above; it will only become one in a context where a pointer is expected. For instance outside a pointer context the constant 0 has type int despite being a "null pointer constant".
This matters when the programmer knows that a pointer is expected but the compiler does not. Variadic functions are one situation in which this case arises. Consider this code fragment:
printf("%p\n", 0);
%p expects a pointer argument (specifically void *) but the compiler does not know this - it is worked out at runtime instead. Consequently the argument is not converted to a pointer at all (despite technically being a null pointer constant) and so the wrong type of argument is passed. The resulting behaviour is undefined (it might crash, for instance). The correct way to write this is to use an explicit cast:
printf("%p\n", (void *)0);
Secondly, NULL is guaranteed only to be a null pointer constant - not necessarily to be a null pointer. Again, this does matter. Consider this code fragment:
printf("%p\n", NULL);
If NULL is defined as 0, as is legitimate, then this is no more correct than the original version. If you want to use NULL then you must still cast it:
printf("%p\n", (void *)NULL);
You might reasonably wonder what the point of NULL is in that case. I generally do not use it in my own code, but some people like it - it makes it clearer that a null pointer is intended.
Thirdly, only constants are null pointer constants. (The name makes more sense here.) A variable that happens to have the value 0 might not become a null pointer when converted to an appropriate pointer type. So this fragment:
int x = 0;
printf("%p\n", (void *)x);
...might perfectly legitimately produce different output to the previous examples.
Finally, beware of using the word "null" to describe things other than null pointers or null pointer constants in the context of C; it can lead to confusion. I use "character 0" or '\0' instead of "the null character" or "NUL", and "the empty string" instead "the null string". This is a point about human language rather than what you write in a C program, though.
6. I/O Error Checking
There are a few points worth making regarding checking for I/O errors.
Firstly, a common mistake is to fail to close output streams before calling exit. Consider for instance this program:
#include <stdio.h>
#include <stdlib.h>
int main(void) {
if(printf("Hello, world\n") < 0) {
perror("writing to stdout");
exit(EXIT_FAILURE);
}
exit(EXIT_SUCCESS);
}
This program correctly checks for an error when calling printf; however if standard output is buffered then the greeting will not actually be written until the stream is closed by exit. If the write succeeds all is well, but if it fails, then there is no way for the C library to report the failure. This applies when main returns, as well as when exit is called.
The library could change the successful termination indication to a failed termination indication in this situation. However, the standard does not require this and most existing C libraries do not do it, so although it would be a good idea if they did, we can't rely on it as application authors.
One possible corrected version of the program is therefore this:
#include <stdio.h>
#include <stdlib.h>
int main(void) {
if(printf("Hello, world\n") < 0
|| fclose(stdout) < 0) {
perror("writing to stdout");
exit(EXIT_FAILURE);
}
exit(EXIT_SUCCESS);
}
Secondly, it is worth re-iterating here the point about EOF made in section 1.2 above; it is possible that some valid character may compare equal to EOF, so calls to getc, getchar, fgetc, putc, putchar and fputc must use explicit calls to ferror and (for the read functions) feof to check for an error or the end of the file, rather than checking the return value.
ungetc doesn't set the error indicator even when it fails, so it's not possible to reliably check for errors unless you know that the argument isn't equal to EOF. On the other hand, the standard says "one character of pushback is guaranteed", so this shouldn't be an issue if you only ever push at most one character back before reading again.
These issues are only a practical problem when the char and int types have the same representation, but portable code cannot assume that they do not.
Finally, the error returns from the various <stdio.h> functions are completely inconsistent. The table below summarizes:
Function | Successful Return | Error Return |
---|---|---|
fclose | zero | EOF (negative) |
fflush | zero | EOF (negative) |
fgetc | character read | use ferror/feof |
fgetpos | zero | nonzero |
fgets | pointer to array | null pointer |
fprintf | number of characters (non-negative) | negative |
fputc | character written | use ferror |
fputs | non-negative | EOF (negative) |
fopen | pointer to stream | null pointer |
fread | elements read | elements read |
freopen | pointer to stream | null pointer |
fscanf | number of conversions (non-negative) | EOF or not all conversions |
fseek | zero [1] | nonzero |
fsetpos | zero | nonzero |
ftell | file position | -1L |
fwrite | elements written | elements written |
getc | character read | use ferror/feof |
getchar | character read | use ferror/feof |
gets | never use this function | |
printf | number of characters (non-negative) | negative |
putc | character written | use ferror |
puts | non-negative | EOF (negative) |
remove | zero | nonzero |
rename | zero | nonzero |
setvbuf | zero | nonzero |
scanf | number of conversions (non-negative) | EOF or not all conversions |
snprintf | number of characters that would be written (non-negative) [2] | negative |
sprintf | number of characters (non-negative) | negative |
sscanf | number of conversions (non-negative) | EOF or not all conversions |
tmpfile | pointer to stream | null pointer |
tmpnam | non-null pointer | null pointer |
ungetc | character pushed back | EOF (see above) |
vfprintf | number of characters (non-negative) | negative |
vfscanf | number of conversions (non-negative) | EOF or not all conversions |
vprintf | number of characters (non-negative) | negative |
vscanf | number of conversions (non-negative) | EOF or not all conversions |
vsnprintf | number of characters that would be written (non-negative) [2] | negative |
vsprintf | number of characters (non-negative) | negative |
[1] the standard text is "The fseek function returns nonzero only for a request that cannot be satisfied.", which seems unambiguous but is oddly different in style from most of the other return value specifications.
[2] early versions of Glibc had a broken implementation of snprintf and vsnprintf: they would return -1 on overflow rather than the number of characters that would be written. Broken versions of these functions are quite common, especially in older C libraries.
Afterword
Quite a few of the examples here focus on variadic functions. In particular they look at printf - but this is just because that is one of the more complex variadic functions that C programmers usually encounter. The underlying problem is that C's compile-time type checking mechanisms do not apply in the case of variadic functions.
The same problems occur with unprototyped functions. However, it is usually reasonable to regard these as obsolete and always avoid them. The one exception is for code that must be portable to very old machines; personally I take the view that this consideration only ever applies to code necessary to bring up a modern compiler targetted on such machines.
It would be useful to do away with these problems even in the context of variadic functions, though. There is some hope. In the case of functions with the same interface as printf, scanf or strftime, GNU C can be told to do type checking based on the format string.
Obviously this doesn't help for variadic functions with different requirements, though. execlp and similar functions in UNIX could have compile-time type checking in the same way (a common bug is to fail to make the terminating null pointer actually a null pointer) but gcc does not currently support this.
Even if it did, though, building knowledge of each special format into the compiler is not a very general solution. It's not clear how this would be sensibly fixed without very radical changes to C.
Administrivia
Copyright © 2001, 2002, 2007 Richard Kettlewell. Please send feedback to richard+cfu@@sfere.greenend.org.uk.
Please link to this document as http://www.greenend.org.uk/rjk/2001/02/cfu.html, rather than copying it (or linking to any other URL that might happen to produce it.)
No comments:
Post a Comment