Firmware/Linker Scripts


GNU ARM Linker Scripts

No question, creating linker scripts is black magic for many developers. In the early days, when Nut/OS supported a single CPU, the ATmega103, linker scripts were nothing to worry about. The Nut/OS build simply used the one supplied by the runtime library, avr-libc. For later 8-bit AVR devices, this didn't change.

A new situation came up with the first ARM port for the Game Boy Advance, which was created without any runtime library. Luckily the Internet offered a collection of ready-made scripts, most of which were apparently derived from the same source. Later on, newlib was used, which comes without linker scripts. As a result, Nut/OS must provide all linker scripts for ARM-based boards.

For many target boards several scripts are offered. Still, the one you need for your specific memory layout may not be available. We will see later why this collection keeps growing and how such problems may be avoided in the future.

Until another solution is provided, this document shall help to adapt existing linker scripts to your needs.

What does a linker script do?

Reading this question, you may first ask yourself: What does a linker actually do? Simple answer: It links object files. Object files are created by a compiler or assembler by translating source code into machine instructions. In almost all cases, these object files are relocatable, which means that their final location in the address space is not yet fixed. While it is generally possible to create position-independent code by avoiding instructions with absolute addresses, such code will not be optimal. Instead, the compiler will try to create the best code and specially mark all absolute references. The actual addresses are not known until the linker collects and combines all object files into the final binary. It will then fix up the references that have been marked by the compiler.

Of course, there are certain rules that the linker must follow when processing the object files. Today, almost all embedded systems offer different types of memory, like SRAM, SDRAM, ROM or Flash, which the firmware can use for program code, constant data and variables. Naturally, variables will be placed in RAM, code and constant data in ROM or Flash.

When the compiler generates the object files, it keeps code, constants and variables separated in so-called segments. The linker is then responsible for placing the segments in the right physical memory. Of course, the linker has no idea how to do this unless a linker script is provided.
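The fix-up step can be pictured with a small C sketch. It is a strongly simplified model, not the real ELF relocation format: the struct reloc record and the function apply_relocs are made up for illustration, and real linkers handle many different relocation types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified relocation record: it marks one 32-bit word
 * of the image that holds an address relative to the image start. */
struct reloc {
    size_t word_index;   /* which 32-bit word of the image to patch */
};

/* Once the final base address is known, add it to every marked word. */
void apply_relocs(uint32_t *image, size_t image_words,
                  const struct reloc *relocs, size_t count,
                  uint32_t base)
{
    for (size_t i = 0; i < count; i++) {
        assert(relocs[i].word_index < image_words);
        image[relocs[i].word_index] += base;
    }
}
```

Words that are not listed in the relocation table stay untouched, which is why the compiler has to mark every absolute reference.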

Selecting the right script

Default linker scripts are defined in the board configuration, e.g. for Ethernut 5 the related entry in nut/conf/ethernut50f.conf is

LDSCRIPT = "at91sam9xe512_ram"

You can change this default in the Nut/OS Configurator.

[Figure: Selecting a linker script]

Many ARM-based boards used with Nut/OS have CPUs with internal Flash memory and RAM. Obviously, the linker script will tell the linker to place variables in RAM and the rest in ROM. Quite easy, so why make such a mountain out of a molehill?

The tricky part is that some code needs to be placed at fixed memory addresses. For example, the ARM interrupt vectors must start at a fixed physical memory address, typically address zero. Another problem, which we overlooked in the simple description of segments: Variables may be initialized to a constant value and then change at runtime. For this we need a special segment placed in Flash, which will be copied to RAM when the system starts running.

But there is a lot more to consider. While the 8-bit AVRs with Harvard architecture were not able to run program code from anything but Flash memory, ARM CPUs may freely mix code and data within the same memory device. Why would someone want to do this? There are many reasons. For example, during development it is preferable to run code in RAM. Programming Flash memory is slow and wears it out. Thus, if enough RAM is available, this is the best place for testing and debugging your firmware. Another reason is that ARM exception vectors are preferably placed in RAM, which allows them to be modified at runtime. Many ARM targets can remap memories. While the system starts with ROM code at address zero, RAM may be remapped to this address at runtime. Last but not least, different types of memory provide different access times. Typically, internal SRAM is faster than external Flash. During startup, some routines may be copied from Flash to RAM, where they can be executed at maximum speed. You may now get an idea why one or two linker scripts cannot fulfill all requirements.

Just for the complete picture: Most embedded CPUs come in various incarnations with all kinds of memory sizes. At the time of this writing, Nut/OS offers 8 different linker scripts for the AT91SAM7S, and these are just basic setups.

The conclusion is that you need to make up your mind where you want to place the different segments of your firmware. If you are lucky, a ready-to-use linker script already exists. Too often this will not be the case and you need to roll your own.

Nut/OS memory segments

By default, the compiler will place executable code in the .text segment, uninitialized, non-auto variables in the .bss segment, values for explicitly initialized variables in the .data segment and constant data in the .rodata segment. If you look into a linker map file, you may also notice a segment named COMMON, which holds uninitialized global variables and is treated like a .bss segment.

Besides these, the source code may freely define additional segments. The runtime library defines several, most of which are variations of the basic segments listed above.

Last but not least, Nut/OS itself defines a number of additional segments. Most important are the .init segments, which contain early initialization code that must be executed first. Then there is the segment .vectors which, you guessed it, contains the exception (interrupt) vector table. Some functions are located in a segment named .ramfunc, either because they will fail when running in external memory or because they are time critical and therefore should not run in slow Flash memory.
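As a rough guide, the following C fragment shows where typical declarations usually end up with GCC. All names are invented for illustration, and the exact placement can vary with compiler version and options (for example, whether uninitialized globals go to COMMON or directly to .bss depends on the -fcommon setting).

```c
#include <stdint.h>

uint32_t tick_count;              /* no initializer: .bss (or COMMON) */
static uint32_t call_count;       /* static, no initializer: .bss */
uint32_t baud_rate = 115200;      /* explicit initializer: .data */
const char banner[] = "Nut/OS";   /* constant data: .rodata */

/* Executable code: .text */
uint32_t next_tick(void)
{
    uint32_t step = 1;            /* auto variable: stack or register, no segment */

    call_count += 1;
    tick_count += step;
    return tick_count;
}
```

The map file written by the linker is the place to check where each symbol really went.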
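With GCC, a function is typically moved into a named segment by a section attribute. The sketch below assumes this mechanism. The macro name MY_RAMFUNC and the function checksum32 are hypothetical, and the real Nut/OS definition of RAMFUNC may look different.

```c
#include <stdint.h>

/* Hypothetical macro for illustration, not the Nut/OS definition. */
#if defined(__GNUC__) && defined(__ELF__)
#define MY_RAMFUNC __attribute__((section(".ramfunc")))
#else
#define MY_RAMFUNC
#endif

/* The compiler emits this function into .ramfunc instead of .text. */
MY_RAMFUNC uint32_t checksum32(const uint32_t *words, uint32_t count)
{
    uint32_t sum = 0;

    while (count--) {
        sum += *words++;
    }
    return sum;
}
```

The attribute only names the segment. Whether the code actually ends up in RAM is decided by the linker script and the startup code.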

Learn linker scripting in a few minutes

Nut/OS linker scripts for ARM CPUs are found in subdirectory nut/arch/arm/ldscripts. Among the first directives, they contain a list of available memory areas. Here is one for the AT91SAM7X128:

MEMORY
{
  rom(rx): org = 0x00000000, len = 128k
  ram(rw): org = 0x00200000, len = 32k
}

The device contains 128 kB of Flash (r=readable, x=executable) located at address 0x00000000 and 32 kB of RAM (r=readable, w=writable) located at address 0x00200000. Note that the names of the memory areas, ram and rom in our case, may be freely chosen.

The following statements will now tell the linker which segment must be placed in which of these memory areas. More precisely, one or more sections will be defined, which contain one or more segments. And each section will be placed in a specified memory area. Usually, the exception vectors are placed at the start of Flash memory, followed by the initialization code (.init segment), application code (.text segment) and read-only data (.rodata segment).

.text :
{
  *(.vectors);
  *(.init);
  *(.text);
  *(.rodata);
  __etext = .;
} > rom

It is worth noting the difference between segments and sections as used in this document. Sections are created by the linker and contain collections of segments created by the compiler. (The GNU linker documentation calls them output sections and input sections, respectively.) For historical reasons, sections often get the same name as their most prominent segment. While you are free to choose different names, I wouldn't recommend doing so. Tools as well as developers might become confused. Better stick with the conventions.

If you wonder about __etext: We will soon come back to this.

Then we have not-explicitly-initialized static and global variables, located in the segments .bss and COMMON, which should go to RAM.

.bss :
{
  *(.bss)
  *(COMMON)
} > ram

The C program expects the contents of these variables to be initialized to zero. The C runtime initialization contains code to do this during startup.

Finally we have explicitly initialized variables, which require special handling. When the system starts, the contents of the RAM are undefined. Again, the C runtime initialization will initialize these variables, but unlike the .bss and COMMON segments, we cannot simply clear this memory area, but must fill it with defined values. So what we have to do is define an area in Flash, which will be copied to RAM by the startup code.

.data : AT (__etext)
{
  *(.data)
} > ram

AT (__etext) does all the magic. While all references of the .data segment will use addresses in RAM, the contents will be placed in Flash, starting at __etext.

We have referred to C runtime initialization at least twice. First, it will clear the .bss and COMMON segments. Second, it will transfer the .data segment to RAM. For this, the initialization code must know at which addresses these segments were placed by the linker. Fortunately, symbols that are declared in the linker script (like __etext) are visible to the initialization code. Of course, a few more will be required than just the Flash address of the .data contents.

Although newer ARM architectures allow runtime initialization to be written in C, the majority is still written in assembly language. Many C programmers are afraid of assembly language and try to avoid it, for no reason, really. The next chapter will prove this.

Learn ARM assembly language in a few minutes

Runtime initialization routines are for simple tasks, which can be done with simple code. Assembly code gurus are not required.

As you probably know, the ARM CPU has a number of registers, almost all of them 32 bits wide. Registers like r0, r1, r2 and so on are general purpose registers, others like sp (stack pointer), lr (link register) or pc (program counter) serve special purposes. That's trivial, right?

Registers may be loaded with immediate values or may load or store their contents from or to memory locations. Or, register contents may be moved to other registers. Instead of just moving, contents may be added, subtracted or modified by all kinds of binary operations. Still trivial.

Let's become more specific. The following code will load all ones into register r0. The equal sign specifies a constant value.

ldr     r0, =0xFFFFFFFF

Actually, the ARM processor cannot encode an arbitrary 32-bit immediate value in a single instruction. But this is nothing we have to worry about, because the assembler will place the value into memory and instead create an instruction which fetches the value from that memory location. The resulting code will be something like

ldr     r0, [pc, #20]
... more code ...
.word   0xFFFFFFFF

In this very special case, however, it will even optimize it further, because

mvn     r0, #0

will give the same result. It means: Move the bitwise complement of zero into register r0. The CPU can do this without using an additional memory location for the value. Isn't that great? We created perfectly working assembly instructions we were not even aware of. Seriously, what I want to show is that a few simple instructions will do the job. There is no need to know all available variants to figure out which one fits best.

Now let's do something useful. To disable the watchdog of the AT91SAM9XE CPU, we need to set bit 15 in the watchdog timer mode register (address 0xFFFFFD44).
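For the curious, the rule behind this choice can be sketched in C. In classic ARM mode, as on the ARM7TDMI, a data-processing instruction can hold an 8-bit value rotated right by an even number of bits. The function and enum names below are made up for illustration. This is not the GNU assembler's implementation, and newer cores offer further options.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (0 <= n < 32). */
uint32_t rotl32(uint32_t v, unsigned n)
{
    return n == 0 ? v : (v << n) | (v >> (32u - n));
}

/* True if v is an 8-bit value rotated right by an even number of bits. */
bool arm_imm_encodable(uint32_t v)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Undo a rotate-right by rot; the result must fit in 8 bits. */
        if (rotl32(v, rot) <= 0xFFu) {
            return true;
        }
    }
    return false;
}

enum load_kind { LOAD_MOV, LOAD_MVN, LOAD_LITERAL };

/* Decide how "ldr rX, =v" could be translated. */
enum load_kind classify_load(uint32_t v)
{
    if (arm_imm_encodable(v)) {
        return LOAD_MOV;        /* mov rX, #v */
    }
    if (arm_imm_encodable(~v)) {
        return LOAD_MVN;        /* mvn rX, #~v */
    }
    return LOAD_LITERAL;        /* ldr rX, [pc, #offset] */
}
```

With this rule, 0xFFFFFFFF ends up as mvn, 0x8000 fits a plain mov, and an address like 0xFFFFFD44 needs the detour via a memory location.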

ldr     r0, =0x8000
ldr     r1, =0xFFFFFD44
str     r0, [r1]

The new item is the str instruction, which stores the contents of a register in a memory location pointed to by a second register. This may look weird at first, but here the same problem with 32-bit values appears. The CPU cannot handle the 32-bit address as an immediate value and therefore needs secondary storage for it. In this case an additional register (r1) will be used.

Although this works, after a few days we won't have any clue what the sequence above is used for. The GNU assembler allows us to preprocess assembly source code in the same way as C programs. This lets us use C style comments, which are removed by the preprocessor. But, even better, we can use C header files containing register definitions, which already exist in Nut/OS.

#include <arch/arm.h>

ldr     r0, =WDT_WDDIS
ldr     r1, =WDT_MR
str     r0, [r1]

While 32-bit immediate values cannot be used, smaller offsets are no problem. Often we deal with a specific group of registers, whose base address may be loaded into one register. So we can use immediate offsets to access a specific register within this group.

ldr     r1, =WDT_BASE
ldr     r0, =WDT_WDDIS
str     r0, [r1, #WDT_MR_OFF]

If the last line looks strange, you should become familiar with it. It instructs the CPU to store the contents of register r0 into the memory location contained in register r1 plus a fixed offset WDT_MR_OFF.
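C programmers may recognize this as indexing from a base pointer. The following sketch mimics the three instructions in C. The offset value 0x04 is my assumption, the bit position is taken from the text above, and the function name wdt_disable is hypothetical. In real code, take the definitions from the Nut/OS headers.

```c
#include <stdint.h>

/* Assumed values for illustration only. */
#define WDT_MR_OFF  0x04u          /* assumed offset of the mode register */
#define WDT_WDDIS   (1u << 15)     /* bit 15, as described in the text */

/* Store WDT_WDDIS at base + WDT_MR_OFF, like str r0, [r1, #WDT_MR_OFF]. */
void wdt_disable(volatile uint32_t *wdt_base)
{
    wdt_base[WDT_MR_OFF / sizeof(uint32_t)] = WDT_WDDIS;
}
```

On the target the argument would be the register block's base address cast to a pointer. For a dry run on a PC, any small array will do.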

You want to do something more useful like clearing the .bss segment? No problem.

        ldr     r0, =0
        ldr     r1, =__bss_start
        ldr     r2, =__bss_end
clrnxt: cmp     r1, r2
        strne   r0, [r1], #4
        bne     clrnxt

Nothing new in the first three lines, at least not instruction-wise. The fourth line compares the contents of registers r1 and r2 and sets or clears specific bits in the CPU status register. The next statement is a variation of the str instruction we already used above. The contents of register r0 will be stored in the memory location pointed to by register r1. The trailing part #4 is a really nice feature. It will automatically increment the pointer register r1 by 4 (bytes) after the value has been stored. So it will then point to the next 32-bit memory location. The other modification is strne instead of str. It is still an str instruction, but it is executed only if the zero flag (Z) in the status register is not set. The cmp instruction sets this flag when both operands are equal. In other words, if the cmp instruction figures out that r1 and r2 are not equal, then the str instruction is executed. Otherwise this instruction is skipped.

The last instruction is new, but contains a known part. The branch instruction b is similar to goto in C, it will jump to the specified label. If ne is appended, it will work in the same way as with str. The branch is executed only if the last compare was done with unequal values.

The full picture is that we created a loop, which clears all 32-bit memory locations starting at __bss_start up to but not including __bss_end. If you are wondering where __bss_start and __bss_end are defined, you are perfectly right to ask. We have to add them to the linker script.

.bss :
{
  __bss_start = .;
  *(.bss)
  *(COMMON)
  __bss_end = .;
} > ram
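For comparison, here is the same clearing loop as a C sketch. The function name clear_words is made up. It takes the two boundaries as parameters, so it can be tried on a PC with an ordinary array, while on the target the addresses of __bss_start and __bss_end would be passed.

```c
#include <assert.h>
#include <stdint.h>

/* Zero all 32-bit words in [start, end), like the clrnxt loop. */
void clear_words(uint32_t *start, const uint32_t *end)
{
    assert(start <= end);
    while (start != end) {
        *start++ = 0;
    }
}
```

Like the assembly version, this loop tests for inequality and therefore assumes that both boundaries are word aligned. Adding ". = ALIGN(4);" before each symbol in the linker script is one way to make sure of that.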

Are you puzzled? We have more. The following code snippet moves the contents of the .data segment from Flash to RAM:

        ldr     r1, =__etext
        ldr     r2, =__data_start
        ldr     r3, =__data_end
1:      cmp     r2, r3
        ldrlo   r0, [r1], #4
        strlo   r0, [r2], #4
        blo     1b

The first three lines load registers in a familiar way. Register r1 is used as a pointer into Flash memory, r2 as a pointer into RAM. Register r3 contains the loop end value. Instead of the ne (if not equal) modifier, we use lo (if unsigned lower) this time. This is just for variety, the loop works in the same way.

The only really new feature is the branch instruction, more precisely the label. The GNU assembler supports numeric labels. Unlike labels in C, they may be re-used in the same source file as often as you like. When referencing a numeric label, you append either b (backward) or f (forward) to refer to the previous or next label with the specified numeric value.
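Again, a C sketch of the same loop may help. The function name copy_words is hypothetical, and the three parameters stand for the values that the assembly code loads from __etext, __data_start and __data_end.

```c
#include <stdint.h>

/* Copy words from src to [dst, dst_end), like the ldrlo/strlo loop. */
void copy_words(const uint32_t *src, uint32_t *dst, const uint32_t *dst_end)
{
    while (dst < dst_end) {
        *dst++ = *src++;
    }
}
```

The "lower than" test makes the loop stop even if the end address is not an exact multiple of four bytes away from the start.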

I leave it to you as an exercise to figure out how to define the symbols __etext, __data_start and __data_end in the linker script.

OK, I agree, we just scratched the surface. One cannot learn ARM assembly in a few minutes, but you got a head start.

For C programmers I have a valuable hint. If you need a more complicated initialization routine and can't get it working in assembly language, why not program it in C? Actually it might even be possible to call your C routine from existing initialization code written in assembly language. But be aware of traps. In early stages there may be no stack available, and variables may not be initialized yet or may be located in memory areas that are not yet available. Additional measures may be needed to stop the compiler from optimizing out essential parts like delay loops. Under such circumstances, writing C routines will become a nightmare. Anyway, write it, compile it, preferably with optimization switched off, and then look into the compiler listing. The Nut/OS default compile options instruct the compiler to include assembly instructions in the listing. You can use them as a template for your first assembly routine. It's definitely worth the effort.

Loading and running code in RAM

As mentioned previously, during debugging it is preferable to load the binary image directly into RAM and start execution there. Typically, the image will be uploaded to the target via JTAG. Another possibility is to use a boot loader, which may load the application binary from serial or NAND Flash, over an RS-232 or network interface, from an SD card etc. In all these cases the linker will simply place all segments in the same memory device.

This is a (simplified) linker script for the Elektor Internet Radio (EIR):

MEMORY
{
  xram : org = 0x20000000, len = 256M
}

SECTIONS
{
  .text :
  {
    *(.vectors);
    *(.init);
    *(.text);
    *(.rodata);
    *(.data);
    __bss_start = .;
    *(.bss)
    *(COMMON)
    __bss_end = .;
  } > xram
}

The full script is available in nut/arch/arm/ldscripts/at91sam7se512_xram.ld.

There is no need to initialize variables in the startup code, because their values will be directly loaded into the right place. Note that it is still necessary to clear the .bss and COMMON segments.

When loading the code into internal SRAM, the loader (JTAG or boot loader) can be quite simple. All required hardware initialization may be done in the application code. Actually, I recommend doing it the other way round, keeping the application's startup code simple and letting the loader do all the hardware initialization. How to do this with JTAG uploading for Ethernut 5 is explained [[../hardware/enut5/openocd.html|on this page]]. The advantage is that you can interactively try various settings by using the OpenOCD TELNET interface, without time consuming edit-compile-link-upload cycles. The advantage of letting boot loaders do the work is that they are typically long-lived, reliable parts that do not require much updating. If you ever had to modify an old firmware application, which built fine with the toolchain available at that time, but refuses to cooperate with the latest tools, you know what I'm talking about. In these cases it is good to know that the old boot loader will at least make sure that the hardware is properly initialized.

Typical applications will not fit in the internal 32k SRAM of the EIR. The linker script sample given above is made for code running in external SDRAM. In this case the JTAG tool or the boot loader must first enable the external bus interface and initialize the SDRAM chips, which in turn may require initializing the CPU clocks as well. Anyway, our linker script and startup code remain the same.

For performance reasons you may want to run some routines in internal RAM, which provides 32-bit access without wait states. When uploading binaries in elf or hex format, this is no big deal. Following the current Nut/OS conventions, you declare such functions as

RAMFUNC void MyFastCode(int xlen);

The compiler will put this function in a new segment named .ramfunc and we extend the linker script by a new section .fast, which will be loaded into internal RAM.

MEMORY
{
  iram : org = 0x00200000, len = 32k
  xram : org = 0x20000000, len = 256M
}

SECTIONS
{
  .text :
  {
    *(.vectors);
    *(.init);
    *(.text);
    *(.rodata);
    *(.data);
    __bss_start = .;
    *(.bss)
    *(COMMON)
    __bss_end = .;
  } > xram

  .fast :
  {
    *(.ramfunc);
  } > iram
}

Things do not always go in the right direction, though. When calling the default make procedure to build your application, you will notice that it takes much longer than usual. Having a look into the application's build directory, you will further notice that the application's .bin file has become large, extremely large.

The cause is that binary files do not allow gaps. While hex and elf formats contain address information, bin files are simple raw images. In our case, the gap between internal RAM and external SDRAM (about 500MB) will be filled with zeroes. To solve this, you may either exclude the .bin file from the build or place the .ramfunc segment in SDRAM and move it to the internal RAM in the startup code, as is typically done with code running in Flash, demonstrated in the next chapter.
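The size of the gap follows directly from the two origins in the MEMORY block. A small C sketch of the arithmetic, with the hypothetical helper raw_image_span and the origins copied from the script above:

```c
#include <stdint.h>

#define IRAM_ORG  0x00200000u   /* iram origin from the script above */
#define XRAM_ORG  0x20000000u   /* xram origin from the script above */

/* Bytes a raw image must cover to reach from its lowest address to `highest`. */
uint32_t raw_image_span(uint32_t lowest, uint32_t highest)
{
    return highest - lowest;
}
```

For the two origins this gives 0x1FE00000 bytes, which is 510 MiB of padding before the first byte of SDRAM contents appears in the file.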

Running code in Flash

When debugging and testing in RAM gives you the impression that your application is running stably, you may want to burn it into Flash memory, so it will start automatically when powering up the board. For this you need a different linker script and new startup code as well.

Here is a simplified script for the eNet-sam7X module:

MEMORY
{
  rom(rx) : org = 0x00000000, len = 512k
  ram(rw) : org = 0x00200000, len = 128k
}

SECTIONS
{
  .text :
  {
    *(.vectors);
    *(.init);
    *(.text);
    *(.rodata);
    __etext = .;
  } > rom

  .data : AT (__etext)
  {
    __data_start = .;
    *(.data)
    *(.ramfunc)
    __data_end = .;
  } > ram

  .bss :
  {
    __bss_start = .;
    *(.bss)
    *(COMMON)
    __bss_end = .;
  } > ram
}

The full script is available in nut/arch/arm/ldscripts/at91sam7x512_rom.ld.

If you were able to follow the previous chapters, you won't find many new items here. But note the .ramfunc segment, which is included together with the .data segment in the .data section. This whole section will have to be copied from Flash to RAM by the runtime initialization, which uses the symbols __etext, __data_start and __data_end. Furthermore, the startup code will now have to deal with a virgin CPU, because no loader was involved. That means that CPU clocks, watchdog timer, reset controller and possibly other essential hardware functions must be properly configured in the startup code.

Loading or running code elsewhere

You can probably think of other combinations of memory usage. But do they make sense? Of course they do. Let's have a look at Ethernut 3, which has been designed with time critical applications in mind.

On this board we have slow, 16-bit NOR Flash and a large internal SRAM with 32-bit access and no wait states, running applications at about 74 MIPS on a low power ARM7TDMI. During development, a boot loader running in Flash memory will load the application via TFTP into internal RAM. This is quite convenient and many developers miss it when moving to a different target without a TFTP boot loader. All hardware initialization is done in the boot loader and the binary images are linked for execution and variable storage in RAM. Naturally, the related linker script and the startup code are quite simple.

For the final system, the same binary image can be used. This time, the image will be placed in NOR Flash, somewhere above the boot loader code. When started, the boot loader will first check this area and, if a valid image is found, it will move it to RAM and start it, instead of initiating the TFTP transfer.

In both cases, the boot loader may be interrupted via RS-232, where it offers a simple dialog to change essential system settings like the IP address etc. When shipping your application to the final customer, this is not always what you want. The alternative is to overwrite the boot loader with your application, so that it will be started immediately on power up.

No question, this requires a different linker script and startup code, because no boot loader will take care of early hardware initialization anymore. Your application can still run at full speed in internal RAM. Nut/OS offers a special script/code combination, where the application code moves itself from Flash to RAM during runtime initialization.

Although 256kB RAM is a lot for Nut/OS, large applications may not fit. For example, the UROM file system for HTML contents is wasting valuable resources when located in RAM. Thus, it makes sense to use a different linker script, which places the .rodata segment in Flash.

But even the executable code may grow further, making the 4MB of Flash memory attractive for functions which need not run that fast. For this we may create a new .slow segment, which is excluded from the loop that moves the application to RAM.

Luckily, all these requirements and much more can be achieved with a specialized linker script and its related runtime initialization.

What should be offered in the future?

The number of supported targets is constantly growing. If we want to cover all kinds of memory layouts, the number of linker scripts will multiply. Right now, the Configurator will offer all scripts to the user, even if they don't make sense for the selected platform. This can be limited by some smart Lua scripts.

However, the real problem is maintenance. From time to time new features are added to Nut/OS, which may require a change in almost all linker scripts. Similar problems sometimes appear when upgrading the compiler or the runtime library. A maintainer who is not familiar with the specific platform may introduce unforeseen problems. Besides that, the sheer number of linker scripts makes changes in this area a time consuming and error-prone business.

When looking into the scripts, you will notice that most of the code is just copied from other scripts, including outdated comments, bad style and other oddities. With big (virtual) letters above all: Never touch a running system. No wonder that Nut/OS developers shy away from modifications in this area.

One proposal to solve this might be the C preprocessor. Instead of selecting a ready-to-use linker script, one may be created based on the user's configuration. The final linker script can be defined as a make target, and make will then invoke the preprocessor to create it, based on some templates and configuration files.