Thursday, February 15, 2007

Mike's Programming Peeves or A growing list of things to think about when coding

I wrote this while I worked for Sun Microsystems from 1989-1999.
-Michael Grant

People don't always consider that programming is often a group effort. Many people never consider that snippet of code they are writing alone in front of their computer will be read or modified by someone else. Programming tends to be a very personal thing. As the programmer, we make decisions as to the flow of the code, the variable names, the comments, etc. We are in complete control. We can use some clever trick which saves a few lines of code here or there or decide to write something sloppily and come back later and fix it. At first, only the author knows exactly how all the functions written work and what their limitations are. What we don't often think about or want to believe is that someone else might come along and extend our code or fix bugs in it.

This document is written with C code in mind, but many of the ideas can be extended to any programming language. There are several general areas that should be kept in mind when coding:

  • Layout - where to put things
  • Style - ways of doing things
  • Playing it safe - paranoid programming
  • Using the proper tool for the job
  • Document it
  • Performance issues
  • Compiling

Layout

Layout is where you put things and how you put them there. It's essentially how things are formatted or laid out. By changing the layout, we may not change the code, but it's easier to read.

Functions should fit on a page (or nearly so)

A single function should, in theory, not span more than a page on the screen. With a reasonable font you can display about 70 lines on today's. In theory, this should be enough to write a high level function which calls other functions.

Certainly there are occasions where it's impossible to stick to this. Usually any function has some initialization or checking stuff in the beginning, then the real work, then some cleanup stuff in the end. One should strive to keep the real work section on a single page.

Where it is not possible to keep the real work on a single page, the reader should be easily guided through the code with comments. There should be a comment block at the top of the work section or at the top of the function which describes from a high level what and how the function works.

Group similar functions into a file

Functions which are similar or work on a similar object or towards a similar goal should be grouped into a single file. Do not put other completely random functions into this file as they will become lost.

Indenting should not be very deep

The deeper the indention, the harder it is for the eye to follow what's going on. Also, as a human reading the code, we only have a memory stack so deep before we forget what's going on. When indention gets too deep, it's a good sign that you probably should be creating a function.

Eliminating needless indention

There should not be too many indentations, if you are indenting too much, perhaps you should eliminate some of the cases first. The reader has a stack too, if you push too many things onto it, they will get confused too.


if (X != NULL) {
for(j=0; j<50; j++) {
if (y[j] < 10) {
foo();
bar();
}
}
return baz();
} else {
return(NULL)
}

can be rewritten as:


if (X == NULL) {
return(NULL);
}

for(j=0; j < 50; j++) {
if (y[j] < 10) {
foo();
bar();
}
return baz();
}

Not only is the indention less here, but the flow is more clear. In the rewritten code, the return(NULL) is closer to the if, it's much easier to see that if X is NULL that the function returns. The reader doesn't need to wait until the end of the if clause to find that out. When the clause is long, the reader may have completely forgotten what the else is for.

White space

White space is your friend. Leaving a space between groups of lines can make it clear to the reader that you've finished with one small thing and you're now going on to another. For example:


/* find x */
x=g(z);
y=x*4;
x=x+y;


/* now calculate c */
c=f(x);

{ ... }

Always use { } around the action part of an if statement if the action part contains more than one line (not necessarily more than one statement). It's too easy to add another statement after the while and not notice if it's to be in the if or not:


if (good) {
while (something) {
....
}
}

In the above example, the action part of the if statement contains a single statement, the while loop. Even though the { } are not necessary, it's better to add them.

However, for simple one-line things, it's ok to do this:


if (x==NULL)
return (NULL);

#ifdef ... #else ... #endif

Avoid using lots of #ifdef / #else / #endif especially nesting them, it is confusing to read.

If you have an #ifdef that is more than a few lines from the #endif, mark the #else and #endif with the identifier from the #ifdef:


#ifdef FOO
....
#endif FOO

Indenting

Always use a real tab character to indent.

The standard width of a tab (ctrl-I) is 8 spaces. When you use a tab to indent, your editor of choice must insert a real tab for each level of indent. If it does not, choose a different editor. Both emacs and vi will gladly insert real tab.

Some people like (even insist upon) changing the width of a tab in their editor. If everyone uses tabs instead of spaces for indentation, then when someone adjusts the width of a tab in their editor, the indenting should look good.

Do not set your default tab width to something other than 8, then use spaces to make things look correct on your screen. This really annoys others who have set the width of a tab to something different (or left it at 8). Always reformat it with tabs. There are several programs for converting spaces to tabs. M-x tabify in emacs will do this for you.

One minor problem exists with always using tabs for indenting. emacs's C-style indention indents lines which are continued up to the line above it, for example:


my_function(foo,
bar,
baz)

It's impossible to guarantee that the indent is always going to be as wide as a tab stop. If someone comes along and edits the above with a different width tab, the continued line will not line up in the correct place. Therefore, Bill Shannon recommends you indent a continued line is with 4 spaces:


my_function(foo,
bar,
baz)

With this fairly minor inconvenience, everyone can then adjust the width of a tab stop in their editor and the code looks acceptably good on everybody's screen.

Emacs users may want to stick this at the end of their files to alter the tab-width of a particular file:


# Local Variables:
# tab-width: 4
# End:
Or add this to your .emacs init file:

(set-variable 'tab-width 4)

Other obvious things

Don't pass a file descriptor into a function and close the descriptor inside the function!


foo(fp);
...
foo(FILE *fp) {
....
close(fp);
}

And, of course, it's always clearer to pass in a parameter than to use a global variable.


Style

Style is the particular manner of how you write or build the code. Some styles are more efficient than others. Some are easier to read. Changing the style generally changes the code. Either by putting things into functions or by rearranging things.

Basic style

Basic style should be one that leads the reader through at a high level, then lower down functions lead reader through the details.


while (input = get_input()) {
good = parse(input)
if (!good)
break;
analyze_data();
print_results();
}

This is clearer than if we had written parse, analyze_data, and print_results directly into loop.

Avoid checking the result in an if immediately if the condition is going to be complex:


if (parse(input)!=NULL && checking==strong) {
...
Better to write it like this:

good = parse(input);
if (good!=NULL && checking==strong) {
...

Never use a goto to jump back

Use a loop instead. Gotos are extremely disconcerting when they jump back into code. Humans read code from top to bottom. When we see a label for a goto, it's very difficult to imagine where the code jumps to this label, especially since it could be jumped to from more than one place.

Gotos can sometimes (read: very rarely) be useful for error conditions to work like break, use only where necessary and only to one place and always forward. Should almost never be used.

Avoid structures within structures within structures

When you have pointers to structures with pointers to structures, set a local variable when working on sub-structures. It makes it much easier to read and avoids long arguments like:

entries_res[i]->Val[0]->Val = 1;

Better to write:


entry = entries_res[i];
values = entry->Val;
first_value = values[0];
first_value->value = 1;

Avoid duplicating a name of a field in a structure in a sub-structure like above.

Use block local variables

Block local variables like this:


...

if (y>0) {
int x;

for (x=1; x<10; x++)
....
}

From the above, you can easily see that x is a temporary variable, it's life doesn't extend outside the if. This means less variables to keep track of when reading the code.

Functions with more than one purpose should be separate

A function which does multiple things should be split into separate functions. For example:


void *do_work(int type, void *thing) {
switch(type) {
USER:
/* do stuff with user */
GROUP:
/* do stuff with group */
MAILBOX:
/* do stuff with mailbox */
}
}

This seems very innocent and perhaps even a good idea, but as soon as the "do stuff" becomes at all complicated (more than a few lines of code), this function gets unwieldy fast. Furthermore, the fact that you must pass a void pointer into this function (or perhaps get one out) is error prone because no type checking is done by the compiler. It is better to have several functions. This allows the compiler to do the type checking for you and avoid needless errors. It further adds to the documentation because the name of the function more accurately tells what it does. It also makes for smaller, more readable functions:


int do_user(char *user) {
/* do stuff with user */
}


char *do_group(int group) {
/* do stuff with group */
}


char *do_mailbox(char *mailbox) {
/* do stuff with mailbox */
}

Building interfaces that are easy to use

Often when we program in groups, we write libraries which other people in the group use. This is where using techniques to make your code clearer becomes much more important.

Simplicity

The simple interfaces work best. The best interfaces are the ones you can call only once to do something. The next best ones are ones which you need to call some sort of initialization function and receive back a pointer to something. Then you pass this pointer into all future functions that need it. There should be only one initialization function. If the user of the interface has to wade through the entire documentation to figure out which sequence of initialization functions to call, it's too complex. A simple example of how to use the interface should be provided in the man page. If there is no man page, document it in the .h file at the very least.

Return Values

Don't use crazy return values. We are used to character strings which are null terminated. Don't invent your own string with a length unless there's a damn good reason. Arrays of strings are another good example. One needs to know the length of an array of strings. It's common to use an array of pointers to strings with the last pointer null. This makes it easy to return a pointer to char ** rather than having to create a bizarre and non-standard structure which contains the number of valid strings.

.h header files

All libraries and interfaces should have a .h file. That means if you have some functions in a .c file that are used in another .c file, they should be prototyped in a .h file somewhere. Either have one .h file for the entire program or one .h file for each .c file. I find that I prefer one .h file for each .c file. It makes it easier to reuse the functions in other places if I want to.

Never prototype a function directly in the file it's used in unless it's static (local to the file).

extern f();

This is just too dangerous. It's too easy to define the function in multiple places, then the function might change, but you'll need to chase down all the prototypes. If you don't, you will not get an error when you compile. Furthermore, never prototype a function without it's arguments. No type checking will be done for you in this case.

It's best to prototype your functions in one place only, in the .h file, then include this .h file in the .c file where you write the function. All libraries should have a .h file, if you use a library, include it's .h file. This forces the compiler to check that the prototype is the same as what the function expects.

Ludovic Poitou (of Sun) says:


#include: <...> must be used for standard includes (/usr/include).
#include "..." must be used for project includes.

The later must be declared after the former because you do not want to redefine a system include.

Thread safe

Keep in mind that someone may be calling your interface from a multi-threaded environment. This means don't store any static data inside your functions. If you need static data, allow the user to pass a pointer which they re-pass the next time they call your interface. Try to use the *_r routines if they exist, for example strtok_r() in place of strtok().

fork()

I cannot think of many reasons to use fork() within a library. You don't have any idea how large the process is or how many threads or lwps it has that you are about to duplicate. Think twice about this! If it's still necessary to use fork, consider fork1() instead.

Error messages

Library routines should not print error messages. Either use a standard error message number and use errno or return a number which corresponds to a documented error in your interface. Errors in the form of a string should be avoided because it makes a program impossible to internationalize.

Using objects in C

You can create an interface similar to C++ objects in C to create a relatively clean way to represent data and it's member functions. Methods like this help unclutter code, especially when the higher level code shouldn't be concerned with the inner workings of a link list, a tree, or memory allocation. You can and should use this even if you are not building a library for someone else to use. An example of how to do this is here.

Speed vs. Clarity

Often when we program, we see some tricks or ways to take advantage of something to save a few clock cycles. In many cases, the extra few clock cycles you save by creating a very tight piece of code is not worth the savings. The amount of time spent thinking about and debugging such tricks never equals the clock cycles saved. This is especially true when someone else must read the code later and step through it with a debugger for hours to understand it. Obviously if you are writing a piece of real-time code such as a driver, this does not apply.


Playing it safe

This is related to style. There are ways of doing things which are safer than other ways. There are some easy tricks you can use to cross-check yourself.

Return values

If a function has a return value, check it. This is especially true for functions which allocate storage such as malloc, calloc, realloc. If you don't want to clutter your code by checking the value, write the code in such a way that if the function returned NULL, the code wouldn't break. Recall that strdup uses malloc internally and could return NULL if there is insufficient memory.

It's a pain to check each return value and plus, it adds code which clutters things up. It's perfectly reasonable to create small helper functions which check the error code and exit gracefully while logging an error appropriately.


char *ch_malloc(unsigned long size){
char *new;

if ( (new = (char *) malloc( size )) == NULL ) {
perror("malloc");
exit(1);
}
return( new );
}

Always use function prototypes

Always prototype your functions at the top of a file in which they are used and always use meaningful variable names in them. Never prototype a function inside another function without it's arguments. This means never do this:


my_function(char *x)
{
int foo(); <----- don't do this!
int y;
...
y = foo(atoi(x));
...
}

int foo(int x)
{
...
}

As stated above in the .h section, using prototypes in a .h file will force you to never have a function being called with the incorrect types or numbers of values.

Checking for null values in an if

Avoid the following because it's confusing:


if (good = parse(input))
...

Better to write:


good = parse(input);
if (good != NULL)
...

The former is perfectly valid C, but it can be confusing, it looks like we are comparing the value of good to the return value of parse.

Use strcmp()==0 instead of !strcmp()

It's just too easy to miss the "!" at the beginning and furthermore, it's confusing. It's only 2 extra characters to type "==0" and you can think of the operator "==" or "!-" as how things are being compared. Furthermore, operators like "<" and ">" also work.

Some things from David Robinson's lint tips:

Placing (void) in front of functions who's return value is ignored

When calling a function with a return value but not using it, place a (void) cast in front. Common functions are sprintf and strcpy. Many times ignoring a function return value can hide error conditions. Functions that always have the values ignored might need to consider having their specification redeclared void.

Always use the proper format string

  • Format strings for long integers use one of "%ld", %lx", or "%lu"
  • Format strings for unsigned integers use one of "%u" or "%x"
  • Format strings for long long integers use "%lld" or "%llx"
  • Format strings for pointers should use "%p", with the pointer cast to (void *).
  • Using %d or %x and casting to an (int) will break in a 64-bit kernel.

Using the proper tools for the job

Always use getopt()

Using getopt is just a good idea. It's too easy to do something non-standard when you parse command line arguments by hand. Furthermore, the code is already written for you in the man page for getopt.

When parsing command line options, I prefer to store all the options in a structure. This makes it obvious when I use the option that it was set on the command line or had a default. I usually call this struct options:


struct options {
int lines; /* -n number of lines */
int debug; /* -d debug level */
int list; /* -l list flag */
...
} options;

Use lex/yacc to read complicated files, do not parse by hand

For even slightly complex files, using lex and yacc to build a parser increases the readability and the extensibility. For files which are essentially one line per "record", reading these files with lex/yacc is definitely overkill. A good candidate for a file to read with lex/yacc would be files which:

  • each record spans multiple lines
  • have a format which can be described easily by a grammar
  • there are optional parts of the record

A good example is a configuration file that has a syntax like this:


foo = "123"
bar = { 456, "xyz" }
baz = {
foo,
bar,
{ "abc", 123 }
}

The grammar for reading such files is not difficult to create whereas the code to create such a parser by hand would likely not be easily extendible without rewriting it. It would depend on the initial design.

Since I originally wrote this several years ago, XML is becoming far more popular as a markup language for configuration files.

Use enumerations instead of #defines in C

#defined constants do not work in most debuggers. There is another mechanism which is seldom used called an enumeration. enumerations can't replace all uses of #defined constants, but where possible, use them. (Note that some compilers treat enumeration constants as a separate memory space as #define constants).

Avoid using #ifdef comments

Using #ifdef for comments makes the code hard to read, better to define a DEBUG macro instead of this:


#ifdef COMMENTS
printf("got here\n");
#endif

use a macro something like this:


#define DEBUG fprintf(stderr, message)
...
DEBUG("got here");

Avoid using large macros

Large macros cannot be stepped in the debugger. It's better to use a function. Some of the newer compilers have the ability to automatically inline certain functions.

Avoid using macros inconsistently

The following macros were defined in the php source code:


#ifdef ZTS
# define CLS_D zend_compiler_globals *compiler_globals
# define CLS_DC , CLS_D
# define CLS_C compiler_globals
# define CLS_CC , CLS_C
# define CG(v) (((zend_compiler_globals *) compiler_globals)->v)
# define CLS_FETCH() zend_compiler_globals *compiler_globals = (zend_compiler_globals
*) ts_resource(compiler_globals_id)
BEGIN_EXTERN_C()
int zendparse(void *compiler_globals);
END_EXTERN_C()
#else
# define CLS_D void
# define CLS_DC
# define CLS_C
# define CLS_CC
# define CG(v) (compiler_globals.v)
# define CLS_FETCH()
extern ZEND_API struct _zend_compiler_globals compiler_globals;
int zendparse(void);
#endif

Then, they were used like this:


ZEND_API int zend_execute_scripts(int type CLS_DC ELS_DC, int file_count, ...)
{
...
}

and


ZEND_API void zend_bailout()
{
CLS_FETCH();
ELS_FETCH();

CG(unclean_shutdown) = 1;
longjmp(EG(bailout), FAILURE);
}

I leave it as an exercise for the reader to understand the above.

Never redefine a library function

Never redefine a library's function without giving it a new name. Also, never create a macro which covers a library function. This is just too confusing. Even if it looks nice, don't do it.

There is no need to printf a final NULL in a string

The following is useless because these functions stop when reading the first null (\0):


sprintf(s, "my string\0");
strcat(s, "some more\0");
strcpy(s, "and more\0");

About the only time it's needed to manually make sure that a null is at the end of a string is when you use strncat or strncpy. These functions will not null terminate a string if there are n or more characters in what you are trying to copy.

Use the ctype functions

Use functions like isspace() instead of checking if the character is a space or a tab directly. These functions are both less error prone and also work with internationalization. You just never know what might be considered white space or a digit these days. There are several of these functions. See the man page for ctype(3c).


Document it

Names should be meaningful

Identifiers within the code communicate a lot to the reader who is trying to figure out what a program does. Choosing good identifiers is not easy. Often compromises must be made between a name being too long or too cryptic. I prefer to have names on the longer side rather than the shorter side.

When variable names have to have spaces in them, I prefer to use an _, for example "list_head".

With variables like list_head, "list" is telling us that this variable has to do with lists, "head" tells us that it's specifically the head of the list. I prefer to see the type in front of the specifier. i.e. list_head as opposed to head_list. This is not a general rule, there are times when it makes sense to do it the other way around like listp for a list pointer. Many people use the xxxp convention.

There are some common practices in use, for example i,j,k and x,y,z are commonly used as index variables. t is often used for a time. Appending p to the end of a name is often used for a pointer.

Product names should definitely never be stored in variables. Names of products have a tendency to change.

Comments, lots of them

They should be meaningful and walk the reader through the code.

Documentation everything

  • Each function should be documented to say what it's purpose is and what's its input and output.
  • Each file should contain a description of what the general purpose of the functions are for in the file.
  • Complicated functions should contain an overview of how they work.
  • There should be some file which describes the basic architecture.
  • This file should contain some basic pseudo code showing an overview of how it works and how it fits together.

  • If you do something bizarre, comment it where it's used and where it's declared.

Declare global variables and functions static which are not used outside a file

Global variables which are only global within a file and functions which are only used within a single .c file should be declared static. Declaring them static prohibits them from being exported to the linker and therefore makes them local to that file. This makes the code easier to read because when a reader sees something that is declared as static, the reader knows that it's not used elsewhere.

README

A simple README file in the directory where your program exists is a wonderful place to keep notes on the design, list of features, even a basic description as to how the code works and how it's laid out. A file like this becomes invaluable to someone who has to come along later and document the mess you have created.

Don't forget to keep the README up to date as you add/change features.


Performance considerations

Memory

Memory allocation takes time. Memory allocated with malloc must be freed.

It's quickest to use memory allocated on the stack at function call time. Memory allocated like this is essentially free as it merely only means setting a stack pointer higher. Obviously there are limits to how much memory you can allocate like this and it's automatically freed upon return.


f() {
char x[4096];
...
}

It's second quickest to use memory allocated by alloca (10%-15% slower than if it was allocated at function call time). Again, same restrictions apply because this is just allocating memory on the stack.

One major drawback of alloca() is that you don't know when it's failed. It doesn't return NULL upon failure! You can probably use alloca() for small values if you are going to return it relatively soon, but it shouldn't be used for large values which will be kept around a long time or you risk running out of space and dumping core, just as if you recursed too deeply.


f() {
char *x;
...
x = (char *)alloca(64);
...
}

When using threads, it's often useful to use a global variable which is specific to a thread. It is quite a bit slower to use pthread_getspecific with a global variable. Took on average 5 times longer than the allocating it on the stack.


pthread_key_t x_key;

f() {
char *x;

x = (char *)pthread_getspecific(x_key);
...
}

Worst -- and by a long-long-long-shot -- is to use malloc() This took on average 20 times longer than the allocation on the stack (the first method).


f() {
char *x;
...
x = (char *)malloc(4096);
...
free(x);
}

Each method has it's uses. Sometimes you just have to use malloc(). If I had a choice when I was writing the initial program, I would attempt to use malloc at the beginning and avoid it later on. If I was writing a threaded program, I would attempt to allocate memory on the stack within each thread rather than using a global variable and pthread_getspecific(). If I need some space temporarily that I will not need upon return from the function, I would use alloca. alloca has a drawback that it may not be portable, but I have not found any cases where it wouldn't work on Solaris.

When you malloc memory (remember that strdup uses malloc too!), this memory should be freed when no longer used. Memory that was malloced and has nothing pointing to it is known as leaked (or garbage). If you don't want to be bothered with freeing memory, use alloca() or create it at function call time.

Passing things on the stack

If you have a function like this:


myfunction(struct foo *barp) {
int x;
x = barp->x
...
}

If you pass a structure to the function like this:


main() {
struct foo bar;


myfunction(bar);
...

You are actually passing a copy of bar on the stack. Within myfunction(), the compiler gives you a pointer to bar pointing into the stack. If the foo structure is large, it could use up the entire remaining stack space.

It's far better to do this:


struct foo bar;
...
myfunction(&bar);

myfunction() is unchanged except instead of pointing to a copy of bar on the stack, you have a pointer to the original bar. Only a pointer to bar is passed on the stack. As you can imagine, this can be much more efficient.

TCP/IP

Compose the response packet in a buffer and do a write all at once to the socket, otherwise, you stand a good chance that your packet will go out in pieces.

Locks in multithreaded programs

It's tempting to use pthread_mutex_lock() around every critical section of code. Consider using rw_lock() if the situation merits a read/write lock. This allows multiple threads to use a block of code for "reading" while only one thread is allowed through for "writing".

Performance Tradeoffs for Readability

Sometimes writing tight code that squeezes every last cycle out of the CPU isn't necessary. The number of cycles you will save versus the amount of time and headaches you will have later on sometimes far out-weigh the savings. I have seen lots of wonderful examples of this sort of programming. Many examples I have seen are programmers trying to show off their skills at how "smart" they are by trying to do something clever which isn't clear (but may work fine). Until someone else comes along that needs to debug it or modify it that is.

You just have to use common sense. It doesn't make sense to squeeze every last cycle out of the machine for certain tasks that it just doesn't matter. Hardware is fast enough these days that wasting some cycles isn't often noticed. Unless you're writing a game or something that the user will notice, I would program to clarity rather than speed. Don't get me wrong, you have to use good judgment. Many things are worth writing efficient code, you just have to know when to be reasonable.

On the same grounds, you should try to avoid complex logic. Think about the problem and try to solve it in such a way that is simple and elegant. The more complex and convoluted you have to make your program, the harder it will be for someone else to understand.

Compiling

Linking

Ludovic Poitou says: when linking, the order of things to link have some importance (especially when the linker is changed for a checking tool like insight). First specify the .o files then the project specific libs then the solaris libs. And try to avoid lib duplication.

LD_LIBRARY_PATH

Your program should not rely on LD_LIBRARY_PATH being set properly to find it's libraries. If your program uses dynamic libraries which are not in /usr/lib, you must add the default directory for the libraries when you link in the -R option.

Other Resources

  • Bill Shannon's C-style guide
  • At Sun we had a cmode.el auto-indent mode for emacs which complied with Bill Shannon's C-style document, but I don't know where this file is publicly
  • Deep C Secrets by Peter Van Der Linden