Getting The Least Out of Your C Compiler: Class #363, Embedded Systems Conference San Francisco 2002
Jakob Engblom
IAR Systems Box 23051 SE-750 23 Uppsala Sweden
email: [email protected]
Dept. of Computer Systems Uppsala University Box 325 SE-751 05 Uppsala Sweden
email: [email protected]
Writing portable and readable C code that still compiles efficiently is very important for developers of all types of computer systems. For embedded systems, the size of the generated code and data is especially important, since using smaller external memory, or on-chip memory only, can decrease the cost and power consumption of a system significantly. This article discusses how to help a modern, highly optimizing C compiler generate small and efficient code, while writing maintainable, readable, and portable C code. In order to build an understanding of what a compiler likes and does not like, we give an introduction to modern compiler technology. We show how to help the optimizer generate good code, and how to prevent optimizations when they are undesirable. Many established truths and tricks are invalidated when using modern compilers. We demonstrate some of the more common mistakes and how to avoid them, and give a catalogue of good coding techniques. The presentation is illustrated by snippets of real-world code that demonstrate important concepts. An important conclusion is that code that is easy for a human to understand is usually also compiler friendly, contrary to hacker tradition. This text has been partially updated since ESC SF 2001, especially in Section 3.
1 Introduction
A C compiler is a basic tool for most embedded systems programmers. It is the tool by which the ideas and algorithms in your application (expressed as C source code) are transformed into machine code executable by your target processor. To a large extent, the C compiler determines how large the executable code for the application will be.

The C language is well suited for low-level programming. It was designed for coding operating systems, which has left its imprint: powerful pointer handling, bit-manipulation facilities not found in other high-level languages, and target-dependent type sizes that allow the best possible code to be generated for a certain target. The semantics of C are specified by the ISO/ANSI C standard. The standard does an admirable job of specifying the language without unduly constraining an implementation of the language. For example, compared to Java, the C standard gives the compiler writer some flexibility in the size of types and the precise order and implementation of calculations. The result is that there are C compilers for all available processors, from the humblest 8-bitter to the proudest supercomputer.

A compiler performs many transformations on a program in order to generate the best possible code. Examples of such transformations are storing values in registers instead of memory, removing code which does nothing useful, reordering computations in a more efficient order, and replacing arithmetic operations by cheaper operations.

To most programmers of embedded systems, a program that does not quite fit into the available memory is a familiar phenomenon. Recoding parts of an application in assembly language or throwing out functionality may seem to be the only alternatives, while the solution could be as simple as rewriting the C code in a more compiler-friendly manner. In order to write code that is compiler friendly, you need a working understanding of compilers.
Some simple changes to a program, like changing the data type of a frequently accessed variable, can have a big impact on code size, while other changes have no effect at all. Having an idea of what a compiler can and cannot do makes such optimization work much easier.
2 Modern C Compilers
Assembly programs specify what calculations should be carried out, how, and in what precise order. A C program, on the other hand, only specifies the calculations that should be performed. With some restrictions, the order and the technique used to realize the calculations are up to the compiler. The compiler will look at the code and try to understand what is being calculated. It will then generate the best possible code, given the information it has managed to obtain, locally within a single statement, across entire functions, and sometimes even across whole programs.
2.1
The Structure of a Compiler
In general, a program is processed in six main steps in a modern compiler (not all compilers follow this blueprint completely, but as a conceptual guide it is sufficient):

- Parser: The conversion from C source code to an intermediate language.
- High-level optimization: Optimizations on the intermediate code.
- Code generation: Generation of target machine code from the intermediate code.
- Low-level optimization: Optimizations on the machine code.
- Assembly: Generation of an object file that can be linked from the target machine code.
- Linking: Linking of all code for a program into an executable or downloadable file.
The parser parses the C source code, checking the syntax and generating error messages if syntactical errors are found in the source. If no errors are found, the parser then generates intermediate code (an internal representation of the parsed code), and compilation proceeds with the first optimization pass.

The high-level optimizer transforms the code to make it better. The optimizer has a large number of transformations available and will perform them in various passes, possibly repeating some passes. Note that we use the word transformation and not optimization. Optimization is a bit of a misnomer. It conveys the intuition that a change always improves a program and that we actually find optimal solutions, while in fact optimal solutions are very expensive or even impossible to find (undecidable, in computer science lingo). To ensure reasonable compilation times and termination of the compilation process, the compiler has to use heuristic methods (good guesses). Transforming a program is a highly non-linear activity, where different orderings of transformations will yield different results, and some transformations may actually make the code worse. Piling on more optimizations will not necessarily yield better code.

When the high-level optimizer is done, the code generator transforms the intermediate code to the target processor instruction set. This stage is performed, piece-by-piece, on the intermediate code from the optimizer, and the compiler will try to do smart things on the level of a single expression or statement, but not across several statements. The code generator will also have to account for any differences between the C language and the target processor. For example, 32-bit arithmetic will have to be broken down to 8-bit arithmetic for a small embedded target (like an Intel 8051, Motorola 68HC08, Samsung SAM8, or Microchip PIC). A very important part of code generation is allocating registers to variables.
The goal is to keep as many values as possible in registers, since register-based operations are typically faster and smaller than memory-based operations. After the code generator is done, another phase of optimization takes place, where transformations are performed on the target code. The low-level optimizer will clean up after the code generator (which sometimes makes suboptimal choices), and perform more transformations. There are many transformations which can only be applied on the target code, and some which are repeated from the high-level phase, but on a lower level. For example, transformations like removing a clear carry instruction if we already know that the carry flag is zero are only possible at the target code level, since the flags are not visible before code generation. After the low-level optimizer is finished, the code is sent to an assembler and output to an object file.
All the object files of a program are then linked to produce a final binary executable ROM image (in some format appropriate for the target). The linker may also perform some optimizations, for example by discarding unused functions.

Thus, one can see that the seemingly simple task of compiling a C program is actually a rather long and winding road through a highly complex system. Different transformations may interact, and a local improvement may be worse for the whole program. For example, an expression can typically be evaluated more efficiently if given more temporary registers. Taking a local view, it thus seems to be a good idea to provide as many registers as necessary. A global effect, however, is that variables in registers may have to be spilled to memory, which could be more expensive than evaluating the expression with fewer registers.
2.2
C Semantics and Side Effects
Before the compiler can apply transformations to a program, it must analyze the code to determine which transformations are legal and likely to result in improvements. The legality of transformations is determined by the semantics laid down in the C standard. The most basic interpretation of a C program is that only statements that have side effects, or compute values used for performing side effects, need to be kept in the program. Side effects are any effects of an expression that change the global state of the program. Examples of what is generally considered a side effect are writing to a screen, accessing global variables, reading volatile variables, and calling unknown functions. The calculations between the side effects are carried out according to the principle of "do what I say, not what I mean". The compiler will try to rewrite each expression into the most efficient form possible, but a rewrite is only possible if the result of the rewritten code is the same as that of the original expression. The C standard defines what is considered "the same", and thereby sets the limits of allowable optimization.
2.3
Basic Transformations
A modern compiler performs a large number of basic transformations that act locally, like folding constant expressions, replacing expensive operations by cheaper ones (strength reduction), finding and removing redundant calculations, and moving invariant calculations outside of loops. The compiler can do most mechanical improvements just as well as a human programmer, but without tiring or making a mistake. The examples below show (in C form for readability) some typical basic transformations performed by a modern C compiler, before and after each transformation:

    /* Strength reduction: expensive operations become cheaper ones */
    unsigned short int a;
    a /= 8;                    =>   a >>= 3;   /* shift replaces divide */
    a *= 2;                    =>   a += a;    /* multiply to add       */
                                    /* or: a <<= 1; using a shift       */

    /* Common subexpression elimination: saves one multiply */
    a = b + c * d;             =>   temp = c * d;
    e = f + c * d;                  a = b + temp;
                                    e = f + temp;

    /* Algebraic identity: x * 1 == x */
    a = b * 1;                 =>   a = b;

    /* Constant value evaluated at compile time */
    a = 17;                    =>   a = 17;
    b = 56 + ( 2 * a );             b = 90;

    /* Constant expression folded */
    #define BITNO 4
    port |= ( 1 << BITNO );    =>   port |= 0x10;

    /* Unreachable code removed (a cannot be both > 10 and < 5) */
    if (a > 10) {              =>   if (a > 10) {
        b = b * c + k;                  b = b * c + k;
        if (a < 5) a += 6;          }
    }

    /* Useless computation removed (first value of a is never used) */
    a = b * c + k;             =>   a = k + 7;
    a = k + 7;

    /* Loop-invariant code moved outside the loop */
    for (i=0; i<10; i++) {     =>   b = k * c;
        b = k * c;                  for (i=0; i<10; i++) {
        p[i] = b;                       p[i] = b;
    }                               }
All code that is not considered useful (according to the definition in the previous section) is removed. This removal of unreachable or useless computations can cause some unexpected effects. An important example is that empty loops are completely discarded, making empty delay loops useless. The code shown below stopped working properly when upgrading to a modern compiler that removed useless computations:

Code that Stopped Working:

    void delay(unsigned int time)
    {
        unsigned int i;
        for (i = 0; i < time; i++)
            ;    /* empty loop: removed by the compiler */
        return;
    }

    void InitHW(void)
    {
        /* Highly timing-dependent code:    */
        /* delays do not last long here     */
        OUT_SIGNAL(0x20);  delay(120);
        OUT_SIGNAL(0x21);  delay(121);
        OUT_SIGNAL(0x19);  delay(120);
        OUT_SIGNAL(0x38);  delay(147);
        OUT_SIGNAL(0x0C);
    }
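One common repair (a sketch; the function shape follows the example above) is to declare the loop counter volatile, which makes every access to it count as observable and so keeps the loop body alive:

```c
/* Busy-wait delay that survives optimization: the volatile   */
/* qualifier forces every access to i to actually happen, so  */
/* the loop cannot be discarded as useless code.              */
void delay(unsigned int time)
{
    volatile unsigned int i;
    for (i = 0; i < time; i++)
        ;   /* each increment is now an observable access */
}
```

Even so, the length of the delay still depends on the compiler and the clock frequency; a hardware timer is normally a more robust solution.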
Note that a compiler cannot in general make function calls into common subexpressions. Two subsequent calls to the same function with the same argument will generate two function calls, since the compiler does not in general know what happens inside the called function (for example, it might perform side effects). If you intend to evaluate a function only once, write it once!

Bad Example:

    void bad(void)
    {
        /* Two calls to foo */
        if (foo(12) && SomeCondition)       { ... }
        if (!foo(12) || SomeOtherCondition) { ... }
    }

Good Example:

    void good(void)
    {
        /* One call to foo */
        r = foo(12);
        if (r && SomeCondition)       { ... }
        if (!r || SomeOtherCondition) { ... }
    }
2.4
Register Allocation
Processors usually get better performance and smaller code when calculations are performed using registers instead of memory. This means that the compiler will try to assign the variables in a function to registers. A local variable or parameter will not need any RAM allocated at all if the variable can be kept in registers for the duration of the function.

If there are more variables than registers available, the compiler needs to decide which of the variables to keep in registers, and which to put in memory. This is the problem of register allocation, and it cannot be solved optimally. Instead, heuristic techniques are used. The algorithms used can be quite sensitive, and even small changes to a function may considerably alter the register allocation.

Note that a variable need only occupy a register while it is being used. If a variable is only used in part of a function, it will be register allocated in that part, but it will not exist in the rest of the function. This explains why a debugger sometimes tells you that a variable is "optimized away at this point".

The register allocator is limited by the language rules of C. For example, global variables have to be written back to memory when calling other functions, since they can be accessed by the called function, and all changes must be visible to all functions. Between the function calls, the variables can be kept in registers. The same applies to local variables that have their address taken (for instance as parameters to functions like scanf(), which require pointers to local variables to store the values read).

Note that there are times when you do not want variables to be register allocated. For example, when reading an I/O port or spinning on a lock, you want each read in the source code to be made from memory, since the variable can be changed outside the control of your program. This is where the volatile keyword is to be used. It signals to the compiler that the variable should never be allocated in registers, but read from (or written to) memory each time it is accessed.

In general, only simple values like integers, floats, and pointers are considered for register allocation. Arrays have to reside in memory since they are designed to be accessed through pointers, and structures are usually too large. Also, on small processors, large values like 32-bit integers and floats may be hard to allocate to registers.
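As a sketch of volatile in action (the register layout and the "ready" bit are invented for illustration), a busy-wait on a status register should go through a volatile-qualified pointer:

```c
#include <stdint.h>

/* Spin until bit 0 (the hypothetical "ready" bit) of a status  */
/* register is set. Because the pointer is volatile-qualified,  */
/* every test performs a fresh load from the register instead   */
/* of reusing a value cached in a CPU register.                 */
void wait_for_ready(volatile uint8_t *status_reg)
{
    while ((*status_reg & 0x01u) == 0)
        ;   /* re-read the hardware register on each iteration */
}
```

In a real driver, the argument would be the register's fixed address, for example `(volatile uint8_t *)0x4000` (an address made up here for illustration).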
2.5
Function Calls
As assembly programmers well know, calling a function written in a high-level language can be rather burdensome. The calling function must save global variables back to memory, make sure to move local variables to the registers that survive the call (or save them to the stack), and some parameters may have to be pushed on the stack. Inside the called function, registers will have to be saved, parameters taken off the stack, and space allocated on the stack for local variables. For large functions with many parameters and variables, the effort required for a call can be quite large.

Modern compilers do their best, however, to reduce the cost of a function call, especially the use of stack space. A number of registers will be designated for parameters, so that short parameter lists will most likely be passed entirely in registers. Likewise, the return value will be put in a register, and local variables will only be put on the stack if they cannot be allocated to registers. The number of register parameters varies wildly between different compilers and architectures. Note, however, that research has found that passing just 3 bytes in registers would cover about 87% of all functions for a set of embedded 8- and 16-bit programs.

Note also that, just as for register allocation, only small parameter types will be passed in registers. Arrays are always passed as pointers to the array (C semantics dictate that), and structures are usually copied to the stack, with the structure parameter changed to a pointer to the structure. That pointer might be passed in a register, however. It is a good rule to always use pointers to structures as parameters, and not the structures themselves.

C supports functions with variable numbers of arguments. This is used in standard library functions like printf() and scanf() to provide a convenient interface. However, the implementation of variable numbers of arguments incurs significant overhead: all arguments have to be put on the stack, since the function must be able to step through the parameter list using pointers to arguments, and the code accessing the arguments is much less efficient than for fixed parameter lists. There is also no type checking on the arguments, which increases the risk of bugs. Variable numbers of arguments should not be used in embedded systems!
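The struct-parameter advice can be sketched as follows (the struct and function names are invented for illustration):

```c
struct Position {
    long x;
    long y;
    long z;
};

/* Pass-by-value: the whole struct is copied, typically to the   */
/* stack, on every call.                                         */
long sum_by_value(struct Position p)
{
    return p.x + p.y + p.z;
}

/* Pass-by-pointer-to-const: only a pointer travels (often in a  */
/* register), and const documents that the struct is unchanged.  */
long sum_by_pointer(const struct Position *p)
{
    return p->x + p->y + p->z;
}
```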
2.6
Function Inlining
It is good programming practice to break out common pieces of computation and accesses to shared data structures into (small) functions. This, however, brings with it the cost of calling a function each time something should be done. To mitigate this cost, the compiler transformation of function inlining has been developed. Inlining a function means that a copy of the code for the function is placed in the calling function, and the call is removed. Inlining is a very efficient method for speeding up the code, since the function call overhead is avoided while the same computations are carried out. Many programmers do this manually by using preprocessor macros for common pieces of code instead of functions, but macros lack the type checking of functions and produce harder-to-find bugs.

The executable code will often grow as a result of inlining, since code is being copied into several places. But inlining may also help shrink the code: for small functions, the code size cost of a function call might be bigger than the code for the function body itself. In this case, inlining a function will actually save code size (as well as speed up the program). The main problem when inlining for size is to estimate the gain in code size (when optimizing for speed, a gain is almost guaranteed). Since inlining in general increases the code size, the inliner has to be quite conservative. The effect of inlining on code size cannot be exactly determined, since the code of the calling function is disturbed, with non-linear effects.

To reduce the code size, the ideal would be to inline all calls to a function, which allows us to remove the function from the program altogether. This is only possible if all calls are known, i.e. are placed in the same source file as the function, and the function is marked static, so that it cannot be seen from other files. Otherwise, the function will have to be kept (even though it might still be inlined at some calls), and we rely on the linker to remove it if it is not called. Since this decreases the likely gain from inlining, the compiler is less likely to inline such a function.
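A sketch of inlining-friendly code (the function names are invented; whether the call is actually inlined is up to the compiler):

```c
/* A small accessor marked static: all calls are visible in this */
/* file, so the compiler may inline every call and then drop the */
/* function body from the output entirely.                       */
static unsigned int low_nibble(unsigned int value)
{
    return value & 0x0Fu;
}

unsigned int checksum_step(unsigned int acc, unsigned int byte)
{
    return acc + low_nibble(byte);   /* call likely inlined */
}
```

Where available, the C99 inline keyword (or a compiler-specific equivalent) can make the hint explicit.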
2.7
Code Compression
A common transformation on the target code level is to find common sequences of instructions in several functions, and break them out into subroutines. This transformation can be very effective at shrinking the executable code of a program, at the cost of performing more jumps (note that this transformation only introduces machine-level subroutine calls and not full-strength function calls). Experience shows a gain of 10 to 30% from this transformation.
2.8
Linker
The linker should be considered an integral part of the compilation system, since there are some optimizations that are performed in the linker. The most basic embedded-systems linker should remove all unused functions and variables from a program, and only add those parts of the standard libraries that are actually used. The granularity at which program parts are discarded varies, from files or library modules down to individual functions or even snippets of code. The smaller the granularity, the better the linker. Some linkers also perform post-compilation transformations on the program. Common transformations are the removal of unnecessary bank and page switches (which cannot be done at compile time, since the exact allocation of variable addresses is unknown then) and code compression, as discussed above, extended to the entire program.
2.9
Optimization Settings
A compiler can be instructed to compile a program with different goals, usually speed or size. For each setting, a set of transformations has been selected that tend to work towards the goal: maximal speed (minimal execution time) or minimal size. The settings should be considered approximate. To give better control, some compilers allow individual transformations to be enabled or disabled.

For size optimization, the compiler uses a combination of transformations that tend to generate smaller code, but it might fail in some cases, due to the characteristics of the compiled program. This means that one should always explore the effects of the optimization settings on a given program. As an example, the fact that an inliner is more aggressive for speed optimization makes some programs smaller on the speed setting than on the size setting. The following example data demonstrates this; the two programs were compiled with the same version of the same compiler, using the same memory and data model settings, but optimizing for speed or size:

                 Speed optimization    Size optimization
    Program 1    1301 bytes            1493 bytes
    Program 2    20432 bytes           16830 bytes
Program 1 gets slightly smaller with speed optimization, while program 2 gets considerably larger, an effect we traced to the fact that the inliner was lucky on program 1. The conclusion is that one should always try to compile a program with different optimization settings and see what happens. Some compilers allow you to adjust the aggressiveness of individual optimizations; explore that possibility, especially for inlining. It is often worthwhile to use different compilation settings for different files in a project: put the code that must run very quickly into a separate file and compile that for minimal execution time (maximum speed), and the rest of the code for minimal code size. This will give a small program that is still fast enough where it matters. Some compilers allow different optimization settings for different functions in the same source file, using #pragma directives.
2.10
Memory Model
An embedded micro is usually available in several variants (derivatives), each with a different amount of program and data memory. For smaller chips, the fact that the amount of memory that can be addressed is limited can be exploited by the compiler to generate smaller code. An 8-bit pointer uses less code memory than a 24-bit banked pointer where software has to switch banks before each access. This goes for code as well as data.
For example, some Atmel AVR chips have a code area of only 8 kB, which allows a small jump with an offset of +/- 4 kB to reach all code memory, using wrap-around to jump from high addresses to low addresses. Taking advantage of this yields smaller and faster code. The capability of the target chip can be communicated to the compiler using a memory model option. There are usually several different memory models available, ranging from small up to huge. In general, function calls get more expensive as the amount of code allowed increases, and data pointers get bigger and more expensive as the amount of accessible data increases. Use the smallest model that fits your target chip and application; this might give you large savings in code size. Bigger is not always better.
3 Coding Techniques
3.1
Data Sizes
The semantics of C state that all calculations should have the same result as if all operands were cast to int and the operation performed on int (unless some operand has a larger range than int, in which case that larger type, for example long int, is used). If the result is to be stored in a variable of a smaller type like char, the result is cast down. On any decent 8-bit micro compiler, this process is short-circuited where appropriate, and the entire expression is calculated using char or short.

Thus, the size of a data item to be processed should be appropriate for the CPU used. If an unnatural size is chosen, the code generated might get much worse. For example, on an 8-bit micro, accessing and calculating 8-bit data is very efficient. Working with 32-bit values will generate much bigger code and run more slowly, and should only be considered when the data being manipulated needs all 32 bits. Using big values also increases the demand for registers in register allocation, since a 32-bit value will require four 8-bit registers to store.

On a 32-bit processor, working with smaller data might be inefficient, since the registers are 32 bits. The results of calculations will need to be cast down if the storing variable type is smaller than 32 bits, which introduces shift, mask, and sign-extend operations in the code (depending on how smaller types are represented). On such machines, 32-bit integers should be used for as many variables as possible. chars and shorts should only be used when the precise number of bits is needed (like when doing I/O), or when big types would use too much memory (for example, an array with a large number of elements).

The int type in C is intended to be the best choice when one does not care about the range of a variable. However, since the C language defines int to be at least 16 bits, int might not be the best choice for a small 8-bit micro. A portable solution to the data size problem is to define a set of types with minimal guaranteed range, and then change the definition depending on the target. For example, best_int8_t is guaranteed to cover the range -128 to +127, but might be bigger than 8 bits if that is more efficient.

For an 8-bit machine:
    /* char is the best type for 8 bits */
    typedef signed char  best_int8_t;
    typedef signed short best_int16_t;
This solution should not be used when the actual size of the data is important (reading and writing I/O, for example).
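For a 32-bit machine, the same header might instead read as follows (a sketch following the same naming scheme; C99's <stdint.h> later standardized this idea as int_fast8_t and int_fast16_t):

```c
/* For a 32-bit machine: int is the natural register size, so it  */
/* is the fastest carrier even for 8- or 16-bit ranges.           */
typedef signed int best_int8_t;    /* covers -128..+127, and more */
typedef signed int best_int16_t;   /* covers -32768..+32767       */
```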
3.2
Signed and Unsigned Types
The signedness of a variable will affect the code generated. For operations like division, which may not be directly supported in the instruction set, there might be significant differences in the code generated. The language rules make it implementation-dependent how division involving negative numbers and right-shifting of signed values work, but everybody expects division to truncate towards zero and right-shift to be an arithmetic shift (preserving the sign), which in effect locks the semantics.
Considering this, when replacing constant division by a shift, special consideration has to be given to signed variables. The result is an extra test and jump, which would not be necessary for an unsigned variable, as in the following example:

Source code:

    int a;
    a /= 2;

    unsigned int a;
    a /= 2;

Optimized code:

    int a;
    if (a < 0)    /* compensate for */
        a++;      /* sign of a      */
    a >>= 1;

    unsigned int a;
    a >>= 1;
The conclusion is that, if you think about a value as never going below zero, make it unsigned. If the purpose of a variable is to manipulate it as bits, make it unsigned. Otherwise, operations like right shifting and masking might do strange things to the value of the variable.
3.3
Pointer Types
A typical embedded micro has several different pointer types, allowing access to memory in a variety of ways, from small zero-page pointers to software-emulated generic pointers. Obviously, using a smaller pointer type is better than using a bigger one, since both the data space required to store pointers and the code that manipulates them are smaller. However, there may be several pointers of the same size but with different properties. For example, a compiler may offer two banked 24-bit pointer types, huge and far, with the sole difference that huge allows objects to cross bank boundaries. This difference makes the code to manipulate huge pointers much bigger, since each increment or decrement must check for a bank boundary. Unless you really require very large objects, using the smaller pointer variant will save a lot of code space.

For machines with many disjoint memory spaces (like the Microchip PIC and Intel 8051), there might be generic pointers that can point to all memory spaces. These pointers might be tempting, since they are simple to use, but they carry a cost: special code is needed before each pointer access to check which memory the pointer points to and to perform the appropriate actions. Also note that using generic pointers typically brings in some library functions (see Section 3.24).

In summary: use the smallest pointers you can, and avoid any form of generic pointer unless necessary. Remember to check the compiler's default pointer type (used for unqualified pointers). In many cases it is a rather large pointer type.
3.4
Casts
C performs some implicit casts (for example, between floating point and integers, and between different sizes of integers), and C programs often contain explicit casts. Performing a cast is in general not free (except in some special cases), and casts should be avoided wherever possible. Casting between a smaller and a bigger signed integer type will introduce sign-extend operations, and casting to and from floating point will force calls to the floating-point library. These types of casts are easy to introduce by mistake, for example by mixing variables of different sizes, or floating point and integers, in the same expression. A good general guideline is to avoid mixing types in an expression.

Casting gets more dangerous when function pointers are involved. People used to desktop systems often consider int and function pointers (and other pointers) to be interchangeable, which they are not. If you want to store a function pointer, use a variable of function pointer type. It is not uncommon for function pointers to be larger than int: on a 16-bit machine, int will be 16 bits, while code pointers may well be 24 bits. Casting an int to a function pointer will lose information in this case. The code below shows an example of this mistake.
    /* Macro to call a function (really?)                     */
    /* The argument (function name) is cast to an integer     */
    /* and then back to a function pointer, and then called   */
    #define Call(pF) ((FuncPtr)(uint16_t)(*(&pF)))();

    char foo( void );   /* a function to call */

    void main( void )
    {
        Call(foo);  /* can jump to strange places if FuncPtr is > 16 bits */
    }
A good way to avoid implicit casts is to use consistent typedefs for all types used in the application, which helps you avoid mixed types by accident. Some lint-type tools can also be used to check for type consistency.
3.5
Alignment and Struct Padding
Alignment requirements are rare on 8- and 16-bit CPUs, but quite common on 32-bit CPUs. Some CPUs (like the Motorola 680x0 and NEC V850) will generate errors for misaligned loads, while other will only lose performance (Intel x86). Padding will be inserted at the end of a structure if necessary, to align the size of the structure with the biggest alignment requirement of the machine (typically 4 bytes for a 32-bit machine). This is because every element in an array of structures must start at an aligned boundary in memory. The sizeof() operator will reveal the total size of a struct, including padding at the end. Incrementing a pointer to a structure will move the pointer sizeof() bytes forward in memory, thus reflecting possible padding at the end. In some cases, the compiler offers the possibility to pack structures in memory (by #pragma, special keywords, or command-line options), removing the padding. This will save data space, but might cost code size, since the code to load misaligned members is potentially much bigger and more complex than the code required to load aligned members. To make better use of memory, sort the members of the struct in order of decreasing size: 32-bit values first, the 16-bit values, and finally 8-bit values. This will make internal padding unnecessary, since each member will be naturally aligned (there will still be padding at end of the struct if the size of the struct is not an even multiple of the machine word size). When a struct contains another struct, all padding of the member structure is maintained, and the member structure is aligned at the greatest alignment, as in the example below. This can cause lots of memory to wasted on padding, since the amount of padding can get very large. Note that the compiler's padding can break code that uses structs to decode information from external units or to manage memory-mapped I/O areas. 
This is especially dangerous when code is ported from an architecture without alignment requirements to one with.
Another caveat is that when casting other pointers to pointers to structures, make sure that the pointer is aligned on an even word boundary (if required by the machine). Casting a misaligned pointer into a structure pointer would generate addressing errors when structure elements are accessed.
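The member-sorting advice above can be sketched as follows. This is an illustrative example, not from the original text; the sizes in the comments assume 4-byte alignment for int, as on a typical 32-bit machine, and the exact figures are target-dependent:

```c
/* Hypothetical example of member ordering.  Sizes assume 4-byte  */
/* alignment for int; actual padding is target-dependent.         */

struct Unsorted {      /* mixed ordering forces internal padding  */
    char  a;           /* 1 byte + 3 bytes padding                */
    int   b;           /* 4 bytes                                 */
    char  c;           /* 1 byte + 1 byte padding                 */
    short d;           /* 2 bytes                                 */
};                     /* typically 12 bytes in total             */

struct Sorted {        /* biggest members first:                  */
    int   b;           /* 4 bytes                                 */
    short d;           /* 2 bytes                                 */
    char  a;           /* 1 byte                                  */
    char  c;           /* 1 byte                                  */
};                     /* typically 8 bytes, no padding needed    */
```

On a typical 32-bit target, sizeof(struct Unsorted) is 12 while sizeof(struct Sorted) is 8, purely because of member order.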
Casting Away const

/* This example is NOT const-correct.   */
const int k = 57;    /* variable intended to be const */
int *p = (int *) &k; /* const gets cast away          */
*p = 10;             /* p can now be used to write k  */
If a program violates const-correctness, the results can be very strange. Many embedded compilers do allocate const variables to read-only memory (ROM) in order to save precious RAM, and writing to such a variable could generate hardware errors or have no effect. Consts are very useful for function parameters: declaring a parameter as const tells the calling function that the called function will not change the variable. Declaring a pointer to const data tells the calling function that the data pointed to will not change, which opens up better optimization opportunities.

const Function Arguments
void calc( int result[], const int source1[], const int source2[] )
{
    /* calc can assume that the result array does not overlap source1
       or source2, since they are pointers to const data. */
}
3.7 Use Parameters
Some programmers use global variables to pass information between functions. As discussed above, register allocation has a hard time with global variables. If you want to improve register allocation, use parameters to pass information to a called function. They will often be allocated to registers both in the calling and called function, leading to very efficient calls.
Note that the calling conventions of some architectures and compilers limit the number of available registers for parameters (often unnecessarily), which makes it a good idea to keep the number of parameters down for code that needs to be portable and efficient across a wide range of platforms. It might pay off to split a very complex function into several smaller ones, or to reconsider the data being passed into a function.
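The contrast between passing data through a global and through a parameter can be sketched like this (a made-up example, not from the original text):

```c
/* Hypothetical example: the same computation fed through a global  */
/* variable and through a parameter.                                */

int gOperand;                  /* global: lives in memory, and must  */
                               /* be loaded on every access          */

int square_global(void)
{
    return gOperand * gOperand;   /* reads the global from memory    */
}

int square_param(int operand)     /* parameter: typically passed and */
{                                 /* kept in a register              */
    return operand * operand;
}
```

The parameter version gives the register allocator a free hand in both the caller and the callee.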
3.8 Do Not Take Addresses
If you take the address of a local variable (the &var construction), it is not likely to be allocated to a register, since it has to have an address and thus a place in memory (usually on the stack). It also has to be written back to memory before each function call, just like a global variable, since some other function might have gotten hold of the address and expect the latest value. Taking the address of a global variable does not hurt as much, since globals have to have a memory address anyway. Thus, you should only take the address of a local variable if you really must (it is very seldom necessary).

If addresses are taken to receive return values from called functions (from scanf(), for example), introduce a temporary variable to receive the result, and then copy the value from the temporary to the real variable. This should allow the real variable to be register allocated. Making a global variable static is a good idea (unless it is referred to in another file), since this allows the compiler to know all places where its address is taken, potentially leading to better code.

An example of when not to use the address-of operator is the following, where the use of addresses to access the high byte of a variable forces the variable to the stack. The good way is to use shifts to access parts of values.

Bad example
#define highbyte(x) (*((char *)(&x)+1))
short a;
char b = highbyte(a);
Good example
#define highbyte(x) ((x>>8)&0xFF)
short a;
char b = highbyte(a);
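The temporary-variable technique for return values mentioned above might look like this. The function parse_and_scale is a made-up example for illustration:

```c
#include <stdio.h>

/* Hypothetical sketch: receive a scanned value into a short-lived  */
/* temporary so that the variable actually used in the computation  */
/* (value) never has its address taken and can stay in a register.  */

long parse_and_scale(const char *str)
{
    long value;
    long temp;                    /* only temp has its address taken */

    if (sscanf(str, "%ld", &temp) != 1)
        return 0;
    value = temp;                 /* copy into the "real" variable   */

    value = value * 10 + 1;       /* further computations on value   */
    return value;
}
```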
3.9 Use ANSI Prototypes
Function prototypes were introduced in ANSI C as a way to improve type checking. The old style of calling functions without first declaring them was considered unsafe, and it is also a hindrance to efficient function calls. If a function is not properly prototyped, the compiler has to fall back on the language rules dictating that all arguments be promoted to int (or double, for floating-point arguments). This means that the function call will be much less efficient, since type casts will have to be inserted to convert the arguments. For a desktop machine, the effect is not very noticeable (most things are the size of int or double already), but for small embedded systems, the effect is potentially great: register parameter passing is ruined (larger values use more registers), and lots of unnecessary type conversion code is generated. In many cases, the compiler will give you a warning when a function without a prototype is called. Make sure that no such warnings are present when you compile!

The old way to declare a function before calling it (Kernighan & Ritchie or K&R style) was to leave the parameter list empty, like extern void foo(). This is not a proper ANSI prototype and will not help code generation. Unfortunately, few compilers warn about this by default.

The register-to-parameter assignment for a function can always be inferred from the type of the function, i.e. the complete list of parameter types (as given in the prototype). This means that all calls to a function will use the same registers for parameters, which is necessary in order to generate correct code. The code inside a function does not in any way affect the assignment of registers to parameters.
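The difference between a K&R-style declaration and an ANSI prototype can be shown with a small made-up example (bad_sum and good_sum are hypothetical names):

```c
/* K&R-style declaration: an empty parameter list is NOT a prototype, */
/* so arguments in a call like bad_sum('a','b') are promoted to int.  */
extern int bad_sum();

/* Proper ANSI prototype: the compiler knows the exact parameter      */
/* types and can pass the char arguments efficiently.                 */
extern int good_sum(char a, char b);

int good_sum(char a, char b)   /* definition matching the prototype   */
{
    return a + b;
}
```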
3.10 Use Local Copies of Global Variables
If you are accessing a global variable several times over the life of a function, it might pay to copy the value of the global into a local temporary. This temporary has a much higher chance of being register allocated, and if you call functions, the temporary might remain in registers while a global variable would have to be written to memory. Note that this assumes that you know that the functions called will not modify your variables.
Note that in C++, reference parameters (foo(int &)) can introduce pointers to variables in a calling function without the syntax of the call showing that the address of a variable is taken.
Example
unsigned char gGlobal;   /* global variable */

void foo(int x)
{
    unsigned char ctemp;
    ctemp = gGlobal;     /* should go into register */
    ...                  /* calculations involving ctemp, i.e. gGlobal */
    bar(z);              /* does not read or write gGlobal, otherwise error */
    /* More calculations on ctemp */
    ...
    gGlobal = ctemp;     /* make sure to remember the result */
}
3.11 Group Function Calls
Function calls are bad for register allocation, since they force write-back of global variables to memory and increase the demand for registers (registers are used for parameters and return values, and the called function is allowed to scratch certain registers). For this reason, it is a good idea to keep long stretches of code free from function calls; to minimize the effects of the calls you do need, try to group them together. For example, the code below shows two different ways of initializing a structure. The second variant will most likely generate better code, since the function calls are grouped together, making the simple values assigned to the other fields much more likely to survive in registers.

Bad example
void foo(char a, int b, char c, char d)
{
    struct S s;
    s.A = a;
    s.B = bar(b);
    s.C = c;
    s.D = c;
    s.E = baz(d);
}
Good example
void foo(char a, int b, char c, char d)
{
    struct S s;
    s.A = a;
    s.C = c;
    s.D = c;
    s.B = bar(b);
    s.E = baz(d);
}
Note that grouping function calls has no effect if the functions become inlined.
3.12 Facilitate Inlining
Function inlining is a very effective optimization for reducing the run time, and sometimes the code size, of a program. To facilitate inlining, structure your code so that small functions are located in the same source code file (module) as their callers, and make the functions static. Making a function static tells the compiler that the function will not be called from outside the module, giving the inliner more information and enabling it to be more aggressive and make better decisions about when to inline. Note that only functions located within the same module can be inlined (otherwise, the source code of the function is not available). A risky technique to take advantage of inlining is to declare a number of small helper functions in a header file and make them static. This will give each compiled module its own copy of every function, but since they are small and static, they are quite likely to be inlined. If the inlining succeeds, you might save a lot of code space and gain speed. This approach is similar to declaring inline functions in C++ header files, which is also just a hint to the compiler.
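A static header-file helper of the kind described above might look like this (swap_nibbles and the header name are made up for illustration):

```c
/* Hypothetical helper that would live in a header file, say        */
/* "byteutils.h".  Because it is static, every module including the */
/* header gets its own copy, which the compiler is likely to inline */
/* at each call site instead of emitting a call.                    */

static unsigned char swap_nibbles(unsigned char x)
{
    return (unsigned char)((x << 4) | (x >> 4));
}
```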
3.13 Minimize Live Variables
In order to help register allocation, it is a good idea to keep the number of values or variables that are simultaneously live to a minimum. This can be achieved by assigning variables close to where they are used, and by reducing the complexity of expressions: evaluating a complex expression often forces the compiler to introduce temporary variables to hold intermediate values, which can overload the register set and force variables to memory.
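The effect of live ranges can be sketched with a small made-up example. Both functions compute the same result, but in the first, all four input values are live at once, while in the second, each value dies right after it is used:

```c
/* Hypothetical sketch: same computation, long vs short live ranges. */

int weighted_sum_long_lived(const int *p)
{
    int a = p[0];            /* all four values ...                  */
    int b = p[1];
    int c = p[2];
    int d = p[3];            /* ... are now live at the same time    */
    return a * b + c * d;
}

int weighted_sum_short_lived(const int *p)
{
    int sum = p[0] * p[1];   /* p[0] and p[1] die here               */
    sum += p[2] * p[3];      /* p[2] and p[3] are live only here     */
    return sum;
}
```

Fewer simultaneously live values means less pressure on the register allocator.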
3.14 Avoid Inline Assembly
Using inline assembly is a very efficient way of hampering the optimizer. Since there is a block of code that the compiler knows nothing about, it cannot optimize across that block. In many cases, variables will be forced to memory and most optimizations turned off. Instruction scheduling (especially important on DSPs) has a hard time coping with inline assembly, since the hand-written assembly should not be rescheduled: the programmer has made a very strong statement about the code he or she wants to be generated. The output of a function containing inline assembly should be inspected after each compilation run to make sure that the assembly code still works as intended. In addition, the portability of inline assembly is very poor, both across machines (obviously) and across different compilers for the same target. If you need to use assembly language, the best solution is to split it out into assembly source files, or at least into functions containing only inline assembly. Do not mix C code and assembly code in the same function!
3.15 Do Not Write Clever Code
Some C programmers believe that writing fewer source code characters and making clever use of C constructions will make the code smaller or faster. The result is code that is harder to read, and which is also harder to compile. Writing things in a straightforward way helps both humans and compilers understand your code, giving you better results.

For example, conditional expressions gain from being clearly expressed as conditions. Consider the two ways shown below to set the lowest bit of variable b if the lower 21 bits of another (32-bit) variable are non-zero. The clever code uses the ! operator in C, which returns zero if the argument is non-zero (true in C is any value except zero), and one if the argument is zero. The straightforward solution is easy to compile into a conditional followed by a set-bit instruction, since the bit-setting operation is obvious and the masking is likely to be more efficient than the shift. Ideally, the two solutions should generate the same code. The clever code, however, may result in more code, since it performs two ! operations, each of which may be compiled into a conditional.

Clever solution
unsigned long int a;
unsigned char b;

/* Move bits 0..20 to positions 11..31.  */
/* If non-zero, the first ! gives 0 and  */
/* the second ! gives 1.                 */
b |= !!(a << 11);
Straightforward solution
unsigned long int a;
unsigned char b;

/* Straightforward if statement */
if( (a & 0x1FFFFF) != 0 )
    b |= 0x01;
Another example is the use of conditional values in calculations. The clever code will result in larger machine code, since the generated code will contain the same test as the straightforward code, and adds a temporary variable to hold the one or zero to add to str. The straightforward code can use a simple increment operation rather than a full addition, and does not require the generation of any intermediate results. Clever solution
int bar(char *str)
{
    /* Calculating with the result */
    /* of a comparison.            */
    return foo(str + (*str == '+'));
}
Straightforward solution
int bar(char *str)
{
    if( *str == '+' )
        str++;
    return foo(str);
}
Since clever code almost never compiles better than straightforward code, why write clever code? From a maintenance standpoint, writing simpler and more understandable code is definitely the method of choice.
3.16 Use Compiler Support for Hardware Access
Accessing special hardware registers and peripheral units is a fundamental feature of embedded software. How you access these registers may have a big impact on the code quality of your application; the best approach is usually to use the special features available in your compiler for hardware access, since this will generate better code. This breaks portability across compilers, but only for the small pieces of the code that interact with the hardware, which are project- and hardware-specific anyway.
The best approach is to place a variable on top of the I/O register. This allows the variable to be type-checked by the compiler, including const, volatile, and keywords for memory attributes (__near, etc.). The common idiom of casting a constant address to a pointer, *((char *) 0x437F), is bad, since it makes the code harder to read and may generate worse target code. Also, using a variable allows you to compile the code on your PC and try it there, simulating the peripheral using the variable; with an explicit pointer dereference, you are almost guaranteed to get a segmentation fault. Placing a variable at a particular address is done using compiler-specific syntax, or by declaring an extern variable, never providing a definition, and then setting the value of the symbol in the linker file. The latter solution is portable across compilers and will enable optimization of the source, but will require work on the link file for each target.

Compiler-specific features
#pragma location=0x4000
volatile char Port;
Linker use
/* C source: */
extern volatile char Port;

/* Linker file: */
DEFINE Port=0x437F
For standard derivatives, there are usually header files available describing the I/O ports. Use them: they save you some work, and they are written in the most appropriate way for your compiler.
3.17 Use switch for Jump Tables
If you want a jump table, see if you can use a switch statement to achieve the same effect. It is quite likely that the compiler will generate better and smaller code for the switch than for a series of indirect function calls through a table. Also, using a switch makes the program flow explicit, helping the compiler optimize the surrounding code better. It is very likely that the compiler will generate a jump table, at least for a small, dense switch (where all or most values are used). Using a switch is also more portable across machines: the table layout that is optimal on one CPU may not be optimal on another, but the compiler for each target knows how to make the best possible jump table. The switch statement was put into the C language to facilitate multiway jumps: use it!
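A switch-based dispatcher of the kind described above might look like this (the command codes and dispatch function are made up for illustration):

```c
/* Hypothetical sketch: dispatching on a command code.  A table of  */
/* function pointers forces indirect calls; a dense switch lets the */
/* compiler build the jump table itself and keeps the control flow  */
/* visible to the optimizer.                                        */

enum Command { CMD_STOP, CMD_START, CMD_RESET };

int dispatch(enum Command cmd)
{
    switch (cmd) {             /* likely compiled into a jump table */
    case CMD_STOP:  return 0;
    case CMD_START: return 1;
    case CMD_RESET: return 2;
    default:        return -1; /* explicit handling of bad codes    */
    }
}
```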
3.18 Split Structures
Structures are hard to allocate to registers due to their copy semantics and large size (compared to simple values like integers and pointers). If you are using a structure that is just a convenient aggregate of smaller values, consider splitting it up into several variables. This will make the pieces much easier to allocate to registers, and might improve your code. Note that it is especially difficult to allocate structures or structure members to registers if the address of the struct or of one of its members has been taken (just as for ordinary variables, as discussed above). Splitting should be considered a desperate measure, however, since it is good programming style to group related values into structures.
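A made-up sketch of the splitting technique, using a hypothetical Range bundle:

```c
/* Hypothetical sketch: a struct used only as a convenient bundle   */
/* of two values, and the same code with the bundle split into      */
/* scalars that are easy to allocate to registers.                  */

struct Range { int lo; int hi; };

int width_struct(int lo, int hi)
{
    struct Range r;     /* the whole struct may end up on the stack */
    r.lo = lo;
    r.hi = hi;
    return r.hi - r.lo;
}

int width_split(int lo, int hi)
{
    int rlo = lo;       /* plain scalars: easy to register allocate */
    int rhi = hi;
    return rhi - rlo;
}
```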
3.19 Use Casts to Extract Bytes
It is quite common that a program extracts a single byte from a larger value. This is usually done with shifting and masking, but if the destination variable is a byte-sized value, the final masking can be replaced with a simple cast to the destination type, since casting an unsigned type to a smaller unsigned type is basically the same as removing the upper bits of the value. The code below illustrates the concept:

Shift and Mask

uint32_t word;
uint8_t  byte3;

byte3 = (word >> 24) & 0xFF;

Shift and Cast

uint32_t word;
uint8_t  byte3;

byte3 = (uint8_t)(word >> 24);
An advantage of the cast-based approach is that it is easier for the compiler to replace the explicit mask with a simple byte copy. Note that masking before shifting is a bad idea, since it requires a large mask to be formed and then another mask to truncate the value into an 8-bit value (unless the compiler is quite smart).
3.20 Consider Conditional Execution
On processors with conditional execution, like the ARM and HP-PA, it might pay off to make the bodies of if statements (and else branches) small, since this allows the use of conditional execution for these bodies. If the bodies are large, compilers will probably choose to use jumps in the implementation, since this will run faster than a long sequence of conditionally executed instructions.
3.21 Be Careful with Bitfields
Bitfields are one of the more obscure features of the C language. They offer a very readable way to address small groups of bits as integers, but the bit layout is implementation-defined, which makes them a problem for portable code. Since bitfields are seldom used, the quality of the code generated for them varies greatly: some compilers generate incredibly poor code, since they do not consider bitfields worth optimizing, while others optimize the operations so that there is no difference compared to manual masking and shifting. The advice is to test a few bitfield variables and check that the bit layout is as expected, and that the operations are efficiently implemented. If several compilers are being used, check that they use the same bit layout. In general, using explicit masks and shifts will generate more reliable code across more targets and compilers.
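The two styles can be compared with a small made-up example. The bitfield's position is implementation-defined, while the masked version encodes an assumed layout (mode in bits 0..2) explicitly and therefore behaves the same on every target:

```c
/* Hypothetical sketch: a 3-bit "mode" field read through a bitfield */
/* and through explicit shift-and-mask.                              */

struct Flags {
    unsigned int mode : 3;   /* placement is implementation-defined  */
    unsigned int on   : 1;
};

unsigned int get_mode_bitfield(struct Flags f)
{
    return f.mode;
}

unsigned int get_mode_masked(unsigned int reg)
{
    return reg & 0x07u;      /* assumed layout: mode in bits 0..2    */
}
```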
3.22 Evaluate Compilers
Different compilers for the same chip are different. Some are better at generating fast code, others at generating small code, and some may be no good at all. The best way to evaluate a compiler is to use a demo version to compile small portions of your own typical code. Some chip vendors also provide benchmark tests of various compilers for their chips, usually targeted towards the intended application area of the chips. The compiler vendors' own benchmarks should be taken with some skepticism: it is (almost) always possible to find a program where a certain compiler performs better than the competition.
3.23 Give the Compiler Hints
Some compilers allow the programmer to specify useful information that the compiler cannot deduce itself, in order to help optimize the code. For example, DSP compilers often allow users to specify that two pointer or array arguments are unaliased, which helps the compiler optimize code that accesses the two arrays simultaneously. Other examples are the specification of a function as pure (without side effects) or as a task (will loop forever, thus no need to save registers on entry). A common example is inline, which may be treated either as a hint or as an order by the compiler. This information is usually introduced using non-portable keywords and should be put in tuned header files (if possible). It can give great benefits in code efficiency, however.
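For the no-aliasing case, the C99 standard added a portable way to make this promise: the restrict qualifier. A sketch (the function is made up for illustration):

```c
/* C99's restrict qualifier promises the compiler that dst, src1    */
/* and src2 point to distinct objects, so loads and stores in the   */
/* loop can be reordered and overlapped freely.                     */

void add_arrays(int * restrict dst,
                const int * restrict src1,
                const int * restrict src2,
                int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = src1[i] + src2[i];
}
```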
3.24 Watch Out for Libraries
As discussed above, the linker has to bring all library functions used by a program into the program. This is obvious for C standard library functions like printf() and strcat(), but there are also large parts of the library that are brought in implicitly when certain types of arithmetic are needed, most notably floating point. Due to the way in which C performs implicit type conversions inside expressions, it is quite easy to inadvertently bring in floating point, even if no floating-point variables are being used. For example, the following code will bring in floating point, since the ImportantRatio constant is of floating-point type, even though its value happens to be an integer (1.95*20 == 39) and all variables are integers:

Example of Accidental Floating Point
#define ImportantRatio (1.95*Other)

int temp = a * b + CONSTANT * ImportantRatio;
If a small change to a program causes a big change in program size, look at the library functions included after linking. Floating-point and 32-bit integer libraries in particular can be insidious, creeping in due to C's implicit casts. Another way to shrink your program is to use limited versions of standard functions. For instance, the standard printf() is a very big function. Unless you really need the full functionality, use a limited version that only handles basic formatting or ignores floating point. Note that this substitution is done at link time: the source code is the same, but a simpler version is linked. Because the first argument to printf() is a string, which can be provided as a variable, it is not possible for the compiler to automatically figure out which parts of the function your program needs.
4 Summary
This paper has tried to give an idea of how a modern C compiler works. Based on this, we have also given practical tips on how you can write code that is easy to compile and that will allow your executable code to be made smaller. A compiler is a very complex system with highly non-linear behavior, where a seemingly small change in the source code can have big effects on the assembly code generated. The basis of the compilation process is that the compiler should be able to understand what your code is supposed to do, in order to perform the operations in the best possible way for a given target. As a general rule, code that is easy to understand for a fellow human programmer, and thus easy to maintain and port, is also easier to compile efficiently. For more tips on C programming and embedded systems programming in general, check out the IAR web site section with white papers and articles: http://www.iar.com/Press/Articles.