From Nutwiki
Exploring arm-elf-gcc

About this Document


Computer languages offer a useful mechanism for splitting code into subroutines and reusing them throughout the program, even recursively. Parameters may be passed to a subroutine, and upon return a result may be passed back to the caller. Typically an abstract data structure called a stack is used to implement this. In general a stack is a LIFO (last in, first out) buffer, where items can only be added to the top or taken from the top. Before jumping to a subroutine, the caller puts the address of the following instruction (the return address) and a number of parameters on top of the stack. This operation is called push. The subroutine then removes the parameters from the top of the stack using pop (sometimes also called pull) operations. During execution, it may further use the stack for storing local variables. When all processing is done, it pops the return address from the stack, pushes the result on the stack and jumps to the return address. This way, subroutines may be nested to any level, provided that enough memory is available for the stack.
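The theoretical model can be sketched in a few lines of C. This is only a simulation: the names push, pop, add_subroutine and call_add, the fixed capacity and the number standing in for a return address are assumptions made for illustration, and no real CPU stack is involved.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_WORDS 64          /* hypothetical capacity, chosen for this sketch */

static long stack_mem[STACK_WORDS];
static size_t stack_top = 0;    /* number of items currently on the stack */

/* push: add an item to the top of the stack */
void push(long item)
{
    assert(stack_top < STACK_WORDS);    /* enough memory must be available */
    stack_mem[stack_top++] = item;
}

/* pop: take the most recently pushed item from the top */
long pop(void)
{
    assert(stack_top > 0);
    return stack_mem[--stack_top];
}

/* Subroutine in the theoretical model: pops its parameters, then the
 * return address, pushes the result and reports where to "jump" to. */
long add_subroutine(void)
{
    long b = pop();
    long a = pop();
    long return_address = pop();

    push(a + b);
    return return_address;
}

/* Caller: pushes return address and parameters, "jumps" to the
 * subroutine and finally pops the result. */
long call_add(long a, long b)
{
    const long return_address = 1234;   /* stand-in for a code address */

    push(return_address);
    push(a);
    push(b);
    long resume_at = add_subroutine();
    assert(resume_at == return_address);
    return pop();
}
```

Calling call_add(2, 3) leaves the simulated stack empty again, which is what allows such calls to be nested.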

That's the theoretical approach. In the real world, it is much faster to pass parameters and return results via CPU registers. However, the number of registers is limited, so the stack is still used when no more registers are available.

A special register, the stack pointer, keeps the current memory address of the stack's top. This way parameters need not be removed from the stack. Instead, they can be accessed directly by adding a specific offset to the stack pointer. Finally, the stack pointer is simply restored to the value it had before parameters or local variables were added.
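A minimal sketch of this idea, again as a plain C simulation: the array stack_mem, the index sp and both function names are hypothetical. The callee reads its parameters in place at fixed offsets from the stack pointer, and the caller releases them by restoring the saved value.

```c
#include <assert.h>

#define STACK_WORDS 32          /* hypothetical size for this sketch */

int stack_mem[STACK_WORDS];
int sp = STACK_WORDS;           /* index of the stack's top; the stack grows downwards */

/* The callee does not pop anything. It reads its parameters in place,
 * at fixed offsets from the stack pointer. */
int subtract_callee(void)
{
    int b = stack_mem[sp + 0];  /* pushed last, so it sits at the top */
    int a = stack_mem[sp + 1];  /* pushed first, one word above */

    return a - b;
}

int call_subtract(int a, int b)
{
    int saved_sp = sp;          /* value before adding the parameters */

    assert(sp >= 2);            /* room for two more words */
    stack_mem[--sp] = a;
    stack_mem[--sp] = b;
    int result = subtract_callee();
    sp = saved_sp;              /* release both parameters in a single step */
    return result;
}
```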

In the C language, subroutines are called functions and parameters passed to them are called arguments. C allows specific functions to be called with a variable number of arguments. Probably the best known is the printf function. In this case, restoring the stack pointer requires two steps. The function itself does not know how many arguments are actually available. Thus, the stack pointer needs to be restored by the caller. However, the called function may allocate extra space on the stack for local variables, which it has to release before returning to the caller.
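For illustration, here is a small function with a variable number of arguments, written with the standard stdarg.h macros. The name sum_ints is made up for this example. Note that the function has no way of finding out how many arguments were passed, so the caller tells it explicitly in the first argument.

```c
#include <assert.h>
#include <stdarg.h>

/* Sums `count` int arguments. Only the caller knows how many
 * arguments it passed, so the count is given explicitly. */
int sum_ints(int count, ...)
{
    va_list args;
    int total = 0;

    assert(count >= 0);
    va_start(args, count);
    for (int i = 0; i < count; i++) {
        total += va_arg(args, int);     /* fetch the next int argument */
    }
    va_end(args);
    return total;
}
```

A call like sum_ints(3, 1, 2, 3) yields 6. Passing a count that does not match the real number of arguments is undefined behaviour, which shows why the callee alone cannot clean up after such a call.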

The ARM Stack

The ARM CPU has no dedicated stack pointer register. Any register may be used except r14 (link register) and r15 (program counter). By convention, GCC uses r13.

Unlike many other CPUs, the ARM instruction set doesn't provide specific push or pop operations. Instead, the instructions store multiple and load multiple can be used to implement a stack. Four different implementations are available:

stmfd / ldmfd Stack pointer points to the last occupied address. Descending stack, grows towards low memory addresses.
stmed / ldmed Stack pointer points to the next available address. Descending stack, grows towards low memory addresses.
stmfa / ldmfa Stack pointer points to the last occupied address. Ascending stack, grows towards high memory addresses.
stmea / ldmea Stack pointer points to the next available address. Ascending stack, grows towards high memory addresses.
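The difference between the four variants above is easier to see in a small simulation. The following C sketch is not ARM code; it models memory as an array of words and the stack pointer as an index, and all names are assumptions made for this example. Bounds checks are left out to keep it short.

```c
#include <assert.h>

#define MEM_WORDS 16            /* hypothetical memory size in words */

enum stack_kind {
    FULL_DESCENDING,            /* stmfd / ldmfd */
    EMPTY_DESCENDING,           /* stmed / ldmed */
    FULL_ASCENDING,             /* stmfa / ldmfa */
    EMPTY_ASCENDING             /* stmea / ldmea */
};

struct sim_stack {
    enum stack_kind kind;
    int mem[MEM_WORDS];
    int sp;                     /* word index playing the role of the stack pointer */
};

void sim_init(struct sim_stack *s, enum stack_kind kind)
{
    s->kind = kind;
    switch (kind) {
    case FULL_DESCENDING:  s->sp = MEM_WORDS;     break; /* just above the highest word */
    case EMPTY_DESCENDING: s->sp = MEM_WORDS - 1; break; /* highest word is the next free one */
    case FULL_ASCENDING:   s->sp = -1;            break; /* just below the lowest word */
    case EMPTY_ASCENDING:  s->sp = 0;             break; /* lowest word is the next free one */
    }
}

void sim_push(struct sim_stack *s, int value)
{
    switch (s->kind) {
    case FULL_DESCENDING:  s->mem[--s->sp] = value; break; /* move down, then store */
    case EMPTY_DESCENDING: s->mem[s->sp--] = value; break; /* store, then move down */
    case FULL_ASCENDING:   s->mem[++s->sp] = value; break; /* move up, then store */
    case EMPTY_ASCENDING:  s->mem[s->sp++] = value; break; /* store, then move up */
    }
}

int sim_pop(struct sim_stack *s)
{
    switch (s->kind) {
    case FULL_DESCENDING:  return s->mem[s->sp++];
    case EMPTY_DESCENDING: return s->mem[++s->sp];
    case FULL_ASCENDING:   return s->mem[s->sp--];
    case EMPTY_ASCENDING:  return s->mem[--s->sp];
    }
    assert(0 && "unknown stack kind");
    return 0;
}
```

In the full variants the word at index sp always holds the most recently pushed value, while in the empty variants it is the slot the next push will use.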

[[File:../../img/arm-sp.png|ARM stack implementations]]

Traditionally the first implementation is used by GCC. The following assembly statement pushes r0-r12 plus the link register on the stack.

stmfd   r13!, {r0-r12, r14}

The next statement pulls r0-r12 plus the program counter from the stack.

ldmfd   r13!, {r0-r12, r15}

Note that loading r15 with the value that had been pushed from r14 can be used to return to the location following a BL operation.

Most Simple C Program

In the following code examples we will prepend line numbers for reference. They are, of course, not part of the original code.

Here is the simplest C program we can think of:

   1 int main(void)
   2 {
   3     return 0;
   4 }

Recent versions of the GNU compiler do not accept main as a void function, so returning a value is required to avoid a warning message. With all optimizations turned off, the GNU compiler translates it to

   1 main:
   2         mov     ip, sp
   3         stmfd   sp!, {fp, ip, lr, pc}
   4         sub     fp, ip, #4
   5         mov     r3, #0
   6         mov     r0, r3
   7         ldmfd   sp, {fp, sp, pc}

Line 1: Label for function main.

Line 2: Original stack pointer location is copied in scratch register ip.

Line 3: The frame pointer, the original stack pointer (held in ip), the link register and the program counter are pushed on the stack. This modifies the stack pointer. However, its original value, saved via register ip, will be restored in line 7.

Line 4: Set frame pointer to the stack base. Actually the frame pointer is not used in this simple program. It will be discussed below in the next example.

Line 5: Load integer constant (function result) into scratch register r3.

Line 6: Pass the constant to the function result register r0, where it is expected by the caller.

Line 7: Restore original frame pointer and stack pointer. Finally the program counter is loaded with the link register contents that had been pushed on the stack in line 3. This results in a return to the caller.

You probably noticed that the compiler is using synonyms for some registers.

fp r11 Frame pointer register
ip r12 Intra procedure call scratch register
sp r13 Stack pointer register
lr r14 Link register
pc r15 Program counter

Function Calls

We will now include a function with local variables, which is called from main.

   1 int inc(int);
   2
   3 int main(void)
   4 {
   5     return inc(10);
   6 }
   7
   8 int inc(int value)
   9 {
  10     int step = 5;
  11     int result;
  12
  13     result = value + step;
  14
  15     return result;
  16 }

Compared to our first example, this is already complex and the compiler creates the following assembly code, again with all optimizations turned off:

   1 main:
   2         mov     ip, sp
   3         stmfd   sp!, {fp, ip, lr, pc}
   4         sub     fp, ip, #4
   5         mov     r0, #10
   6         bl      inc
   7         mov     r3, r0
   8         mov     r0, r3
   9         ldmfd   sp, {fp, sp, pc}
  10 inc:
  11         mov     ip, sp
  12         stmfd   sp!, {fp, ip, lr, pc}
  13         sub     fp, ip, #4
  14         sub     sp, sp, #12
  15         str     r0, [fp, #-24]
  16         mov     r3, #5
  17         str     r3, [fp, #-20]
  18         ldr     r2, [fp, #-24]
  19         ldr     r3, [fp, #-20]
  20         add     r3, r2, r3
  21         str     r3, [fp, #-16]
  22         ldr     r3, [fp, #-16]
  23         mov     r0, r3
  24         sub     sp, fp, #12
  25         ldmfd   sp, {fp, sp, pc}

As in the first example, the frame pointer is set but not used in the main function, simply because there are no function arguments or local variables defined. This changes in our new function inc, where the frame pointer is heavily used to access local variables.

Let's look at the code of function inc in more detail. Lines 10 to 13 are familiar: some registers are saved on the stack to be restored later, before returning to main, and the frame pointer is set to the stack base. Line 14 is new. It adjusts the stack pointer to make room for the local variables. In total, three of them are used: the two that are explicitly defined in the C program, step and result, and the function argument value. The latter is passed in register r0, and with all optimizations turned off the compiler handles it like a local variable. All three variables are of type int, which occupies 32 bits or 4 bytes on the ARM platform, resulting in 12 bytes of stack space reserved for local variables.

Here is a snapshot of the stack layout immediately before line 15 is executed.

[[File:../../img/arm-sp-func.png|C function stack snapshot]]

Line 15: The argument passed in register r0 is stored in the local variable value (frame pointer offset -24).

Lines 16 and 17: The local variable step (offset -20) is initialized to 5.

Lines 18 to 21: step is added to value and the sum is stored in result (offset -16).

Lines 22 to 23: The contents of result are loaded into register r0, where the caller expects the function result.

Line 24: The local variables are no longer required and the stack pointer is adjusted.

Line 25: Frame and stack pointer registers, which had been saved in line 12, are restored. The saved value of the link register is now moved to the program counter. The execution will continue with line 7.
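To check the offsets, lines 11 to 25 of the listing can be replayed on a simulated machine. The C sketch below is an illustration under simplifying assumptions: memory is a small word array, the register set is reduced to what the function touches, and the value pushed for pc is simply the current content of the simulated pc field. The names mem, load, store, regs and simulate_inc are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_BYTES 256           /* hypothetical size of the simulated memory */

static uint32_t mem[MEM_BYTES / 4];

uint32_t load(uint32_t addr)
{
    assert(addr % 4 == 0 && addr < MEM_BYTES);
    return mem[addr / 4];
}

void store(uint32_t addr, uint32_t value)
{
    assert(addr % 4 == 0 && addr < MEM_BYTES);
    mem[addr / 4] = value;
}

struct regs { uint32_t r0, r2, r3, fp, ip, sp, lr, pc; };

/* Replays lines 11 to 25 of the unoptimized listing for inc. */
void simulate_inc(struct regs *r)
{
    r->ip = r->sp;                      /* 11: mov   ip, sp */
    r->sp -= 16;                        /* 12: stmfd sp!, {fp, ip, lr, pc} */
    store(r->sp + 0, r->fp);            /*     lowest register, lowest address */
    store(r->sp + 4, r->ip);
    store(r->sp + 8, r->lr);
    store(r->sp + 12, r->pc);
    r->fp = r->ip - 4;                  /* 13: sub   fp, ip, #4 */
    r->sp -= 12;                        /* 14: sub   sp, sp, #12 */
    store(r->fp - 24, r->r0);           /* 15: value = r0 */
    r->r3 = 5;                          /* 16: mov   r3, #5 */
    store(r->fp - 20, r->r3);           /* 17: step = 5 */
    r->r2 = load(r->fp - 24);           /* 18: r2 = value */
    r->r3 = load(r->fp - 20);           /* 19: r3 = step */
    r->r3 = r->r2 + r->r3;              /* 20: add   r3, r2, r3 */
    store(r->fp - 16, r->r3);           /* 21: result = r3 */
    r->r3 = load(r->fp - 16);           /* 22: r3 = result */
    r->r0 = r->r3;                      /* 23: mov   r0, r3 */
    r->sp = r->fp - 12;                 /* 24: sub   sp, fp, #12 */
    uint32_t base = r->sp;              /* 25: ldmfd sp, {fp, sp, pc} */
    r->fp = load(base + 0);             /*     caller's frame pointer */
    r->sp = load(base + 4);             /*     caller's stack pointer (saved ip) */
    r->pc = load(base + 8);             /*     return address (saved lr) */
}
```

Starting with r0 = 10 and an arbitrary word-aligned stack pointer, the simulation ends with r0 = 15, the caller's fp and sp restored and pc holding the former lr value.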

It's noticeable that the compiler spends an additional register, the frame pointer, to access the local variables on the stack. Why doesn't it simply use the stack pointer? The reason is that in more complex functions additional items need to be pushed on the stack later, for example when calling a function with a large number of arguments. When the stack pointer changes, the offsets used to access the variables change as well, so the compiler would have to keep track of the items currently stored on the stack. This is not a big problem in simple functions, but it may result in larger code, possibly needing more registers. There are even situations where this is not possible. The frame pointer, in contrast, does not change within a function, so the offsets for accessing variables are constants.
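The effect can be demonstrated with a few lines of C. In this sketch, which uses a simulated descending stack and hypothetical names throughout, one local variable is reserved and two further items are pushed afterwards. The offset of the local variable relative to the stack pointer changes, while its offset relative to the frame pointer stays the same.

```c
#include <assert.h>

#define STACK_WORDS 32          /* hypothetical size for this sketch */

int stack_mem[STACK_WORDS];

struct frame_demo {
    int sp_offset_before;       /* local variable relative to sp, before the pushes */
    int sp_offset_after;        /* the same, after pushing two more items */
    int fp_offset_before;       /* local variable relative to fp, before the pushes */
    int fp_offset_after;        /* the same, after pushing two more items */
};

struct frame_demo demo_offsets(void)
{
    struct frame_demo d;
    int sp = STACK_WORDS;       /* descending stack, index of the top item */
    int fp = sp;                /* frame pointer, fixed on function entry */

    int local = --sp;           /* reserve one word for a local variable */
    stack_mem[local] = 42;

    d.sp_offset_before = local - sp;
    d.fp_offset_before = local - fp;

    stack_mem[--sp] = 7;        /* push two outgoing arguments */
    stack_mem[--sp] = 8;

    d.sp_offset_after = local - sp;
    d.fp_offset_after = local - fp;

    /* The fp-relative access still finds the local variable. */
    assert(stack_mem[fp + d.fp_offset_after] == 42);
    return d;
}
```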

To summarize: Upon entry, the following tasks must be done by a function:

  • Saving the caller's frame pointer and stack pointer.
  • Saving the return address.
  • Setting up its own frame pointer.
  • Reserving stack space for local variables.

This part is called the function's prolog.

Upon exit, the following tasks need to be done:

  • Releasing any reserved stack space.
  • Restoring the caller's frame pointer and stack pointer.
  • Loading the program counter with the return address.

This part is called the function's epilog.

Well, let's face it. The results presented by the compiler so far are not impressive. Here's the resulting code, when the compiler has been instructed to optimize for size:

   1 inc:
   2         add     r0, r0, #5
   3         bx      lr
   4 main:
   5         mov     r0, #15
   6         bx      lr

The compiler did not even bother to call the subroutine, because it can do the required calculation by itself. The main code simply loads the result into the function result register r0 and returns.

The only reason for not completely removing our function inc is that it hasn't been declared static and therefore may be referenced externally. With this high level of optimization it should become clear that it is very difficult to predict what the compiler will do in certain situations.
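As a variation of the example, inc can be declared static. The function is then invisible outside its translation unit, so an optimizing compiler is free to drop the stand-alone copy once all calls have been inlined. Whether it really does so depends on the compiler version and options and was not verified for this sketch. The function name compute is a hypothetical stand-in for main.

```c
/* Declared static: cannot be referenced from other translation units. */
static int inc(int value)
{
    int step = 5;
    int result;

    result = value + step;

    return result;
}

/* Hypothetical stand-in for main from the example above. */
int compute(void)
{
    return inc(10);
}
```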

Interfering with Inline Assembly

At the time of this writing a severe bug had been detected in Nut/OS. In order to disable and re-enable global interrupts, two macros are defined, NutEnterCritical and NutExitCritical.

Disabling and enabling interrupts is quite simple. To disable all interrupts, both the IRQ and the FIQ bit in the CPSR (Current Program Status Register) need to be set to 1. To re-enable them, these bits must be cleared to 0.

Problems arise when interrupts are disabled and re-enabled in a function that is called by another function doing the same. When the lower level function returns, interrupts will be enabled, which the calling function may not be aware of.

To solve this nesting problem, Nut/OS pushes the current status of the CPSR on the stack before interrupts are disabled. Instead of explicitly enabling them, the saved CPSR is restored.
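The save and restore idea can be sketched in portable C with a simulated status register. This is not the Nut/OS implementation: the variable sim_cpsr and the helper functions are hypothetical, and the saved state is kept in a local variable instead of being pushed with inline assembly. On real hardware the CPSR would be read and written with the MRS and MSR instructions.

```c
#include <assert.h>
#include <stdint.h>

#define CPSR_IRQ_BIT 0x80u      /* I bit: IRQ disabled when set */
#define CPSR_FIQ_BIT 0x40u      /* F bit: FIQ disabled when set */

static uint32_t sim_cpsr = 0;   /* simulated status register, interrupts enabled */

/* Hypothetical helper: remember the current state, then disable IRQ and FIQ. */
uint32_t enter_critical(void)
{
    uint32_t saved = sim_cpsr;

    sim_cpsr |= CPSR_IRQ_BIT | CPSR_FIQ_BIT;
    return saved;
}

/* Hypothetical helper: restore the remembered state instead of enabling blindly. */
void exit_critical(uint32_t saved)
{
    sim_cpsr = saved;
}

int interrupts_enabled(void)
{
    return (sim_cpsr & (CPSR_IRQ_BIT | CPSR_FIQ_BIT)) == 0;
}

void lower_level(void)
{
    uint32_t saved = enter_critical();

    assert(!interrupts_enabled());      /* protected work would go here */
    exit_critical(saved);               /* restores "disabled" if the caller had disabled them */
}

/* Returns 1 if interrupts are still disabled after the nested call. */
int upper_level(void)
{
    uint32_t saved = enter_critical();

    lower_level();
    int still_protected = !interrupts_enabled();
    exit_critical(saved);
    return still_protected;
}
```

Because lower_level restores the state it found, upper_level stays protected after the nested call, and interrupts are enabled again only when the outermost function restores its own saved state.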

However, as it turned out, the compiler does not expect embedded assembly code to manipulate the stack pointer, so the stack may become corrupted. This happens only under very rare conditions, but when it does, the system typically crashes.


For a more thorough discussion of compiler options, see the gcc user manual. The latest version is always available at: