Chapter 5 Bare-metal firmware build and boot process (CMSIS Core(M))

5.1 Introduction

The following chapter examines a bare-metal firmware example based on the Cortex Microcontroller Software Interface Standard (CMSIS). CMSIS is a “vendor-independent hardware abstraction layer for microcontrollers that are based on Arm Cortex processors.” Chances are high that you will encounter CMSIS in your embedded research projects, and even if not, the concepts shown here carry over largely unchanged.

Since the linker file is crucial for understanding the firmware layout, we look at the linker file and the CMSIS source code in parallel and examine how they work together to form the firmware. We examine a secure world firmware, but again, the concepts for non-secure and secure firmware are the same; the non-secure firmware merely lacks the non-secure callable (NSC) region.

Topics and concepts covered in this chapter:

The boot process
Initialization of the RAM layout (.data, .bss)
Setup of the stack and heap
Setup of the non-secure callable region

The following example is based on an STM32L5 SoC; it only serves as a descriptive, real-world example with real memory addresses to explain the concepts. Everything covered here is very similar for other systems and vendors.

5.2 Memory segments

In the following sections we walk through the actual linker script used in example-stm32-p1. The linker scripts for the non-secure and secure world are very similar, so we examine only the secure world linker script and leave the non-secure one as an exercise for the reader.

Let’s assume that we have decided to use the following memory map for our firmware:

Table 5.1: secure / non-secure callable regions in an example system
Memory type	Segment name	Security attribute	Address range	Details
Flash	ROM_NSC	NSC	0x0C03 E000 - 0x0C03 FFFF	Explored in chapter 5.2.3
Flash	ROM	S	0x0C00 0000 - 0x0C03 DFFF	Explored in chapter 5.2.1
SRAM	RAM	S	0x3000 0000 - 0x3001 7FFF	Explored in chapter 5.2.2

The system memory map is predefined at a high level by Arm for M-profile architectures: see chapter 3.5. The further partitioning of the memory map into secure, non-secure and NSC regions depends on the SoC design and configuration. An example where bit[28] of the address defines the security attribution of the respective memory address is shown in chapter 4.5.
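The address ranges in table 5.1 follow directly from ORIGIN and LENGTH in the MEMORY command of the linker script (shown in chapter 5.6): the last address of a region is ORIGIN + LENGTH - 1. The following host-side sketch, assuming exactly the values from that MEMORY command, derives the inclusive end addresses and looks up which region an address belongs to. The names mem_region, region_last and find_region are hypothetical; on the real device the attribution is done in hardware by the IDAU and SAU, not by software like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One MEMORY region: ORIGIN and LENGTH as in STM32L552ZETXQ_FLASH.ld. */
typedef struct {
    const char *name;
    const char *attr;     /* "S" or "NSC", as listed in table 5.1 */
    uint32_t origin;
    uint32_t length;      /* in bytes */
} mem_region;

/* Values taken from the MEMORY command of the secure linker script. */
static const mem_region regions[] = {
    { "ROM_NSC", "NSC", 0x0C03E000u,   8u * 1024u },
    { "ROM",     "S",   0x0C000000u, 248u * 1024u },
    { "RAM",     "S",   0x30000000u,  96u * 1024u },
};

/* Inclusive last address of a region: ORIGIN + LENGTH - 1. */
static uint32_t region_last(const mem_region *r)
{
    assert(r->length > 0);
    return r->origin + r->length - 1u;
}

/* Return the table 5.1 region that contains addr, or NULL if none does. */
static const mem_region *find_region(uint32_t addr)
{
    for (size_t i = 0; i < sizeof regions / sizeof regions[0]; i++) {
        if (addr >= regions[i].origin && addr <= region_last(&regions[i]))
            return &regions[i];
    }
    return NULL;
}
```

For example, region_last(&regions[1]) yields 0x0C03DFFF, the end of ROM in table 5.1, and find_region(0x0C03E000) returns the ROM_NSC entry because that address is the first byte after ROM.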

5.2.1 ROM segment

The following output sections are written into the ROM segment:

Table 5.2: ROM output sections
output section	explanation
`.isr_vector`	Vector table. Important part of the boot process. See chapter 5.3.
`.text`	This output section combines the following input sections: - `.text`: compiled and assembled code. - `.glue_7` and `.glue_7t`: see chapter 2.12. - `.init` and `.fini`: C runtime initialization and finalization code, see chapter 5.3.
`.rodata`	Constants in the code.
`.ARM` and `.ARM.extab`	These sections contain information for unwinding the stack when an exception is thrown in C++. For C projects they are typically empty.
`.preinit_array`, `.init_array`, `.fini_array`	Libc related functions. See chapter 5.3 for details.
`.data`	Initial values of initialized objects with static storage duration, copied into RAM at startup. Explored in chapter 5.3

5.2.2 RAM segment

The following output sections are written into the RAM segment:

Table 5.3: RAM output sections
output section	explanation
`.data`	Details discussed in chapter 5.3 and chapter 5.6
`.bss`	Details discussed in chapter 5.3
`._user_heap_stack`	Details discussed in chapter 5.6

5.2.3 ROM_NSC segment

The following output sections are written into the ROM_NSC segment:

Table 5.4: ROM_NSC output sections
section	explanation
`.gnu.sgstubs`	See chapter 5.4

5.3 ROM segment and Boot Process

Terms explained in this chapter:

.word
.weak
Load Memory Address (LMA)
Virtual Memory Address (VMA)

We defined the secure boot address, where the core looks for the vector table on reset, to be 0x0C00.0000, which is the start address of the ROM region. Since the linker script is processed from top to bottom, the first output section the linker places into the ROM segment is .isr_vector:

→ example-stm32-p1(P1_TZEN/Secure/STM32L552ZETXQ_FLASH.ld)

SECTIONS
{
  /* The startup code into "ROM" Rom type memory */
  .isr_vector :
  {
    . = ALIGN(8);
    KEEP(*(.isr_vector)) /* Startup code */
    . = ALIGN(8);
  } >ROM

More on Vector Table:

Chapter 3.6: Arm Architecture:Exceptions, Interrupts, Faults
Chapter 5.3: CMSIS: ROM segment and Boot Process

Now that we know how the exception vectors are set up, let’s look at what happens on reset. The core is set up to boot from 0x0C00 0000, where the vector table is located. On reset, the core first loads the first entry of the vector table (vector_table[0]), the initial value of the main stack pointer, into the SP register. It then loads the second entry (vector_table[1]), the address of Reset_Handler, into the PC and starts executing there.
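The reset behaviour boils down to reading two words from the vector table. The following host-side model is a sketch, not how the hardware is implemented: it reads entry 0 as the initial MSP and entry 1 as the reset vector. Handler addresses in the vector table carry bit 0 set (the Thumb bit), which the core moves into EPSR.T instead of the PC. Entry 0 uses _estack, which the linker script defines as ORIGIN(RAM) + LENGTH(RAM) (chapter 5.6); the Reset_Handler address and the names reset_state and simulate_reset are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Register state the core derives from the vector table on reset. */
typedef struct {
    uint32_t msp;    /* main stack pointer, from vector_table[0] */
    uint32_t pc;     /* program counter, vector_table[1] with bit 0 cleared */
    int thumb;       /* EPSR.T, taken from bit 0 of vector_table[1] */
} reset_state;

/* Model of the reset sequence: only the first two words are read. */
static reset_state simulate_reset(const uint32_t *vector_table)
{
    reset_state s;
    s.msp = vector_table[0];
    s.pc = vector_table[1] & ~1u;          /* strip the Thumb bit */
    s.thumb = (int)(vector_table[1] & 1u);
    assert(s.thumb == 1);                  /* M-profile cores only execute Thumb code */
    return s;
}

/* Example table: entry 0 is _estack (0x3000 0000 + 96 KB), entry 1 is a
   hypothetical Reset_Handler address as the linker emits it (bit 0 set). */
static const uint32_t example_vector_table[] = {
    0x30018000u,   /* _estack */
    0x0C0001E5u,   /* hypothetical &Reset_Handler | 1 */
};
```

With this table, simulate_reset() yields an MSP of 0x3001 8000 and starts execution at 0x0C00 01E4.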

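The System Control Block register VTOR (Vector Table Offset Register) can be used to relocate the vector table from its default location to a new base address. Armv8-M Mainline cores, such as the Cortex-M33 in the STM32L5, support this feature. In Armv6-M, VTOR is optional: on cores without it, such as the Cortex-M0, the vector table is fixed at 0x0000.0000 and can’t be moved. Some SoC designers work around this with a memory-remapping feature, which a bootloader can use to map a different memory to address 0x0000.0000.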
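When relocating the vector table, the new base address must be suitably aligned. As we understand the Armv7-M and Armv8-M reference manuals, the table must be aligned to the next power of two that is at least 4 × the number of entries, with a minimum of 128 bytes (VTOR's TBLOFF field starts at bit 7). Please verify this against your core's manual. The sketch below only does that arithmetic on the host; the entry counts are examples, not the STM32L552's actual count. On the target, the relocation itself would then be a CMSIS register write such as `SCB->VTOR = new_base;`. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum alignment (bytes) of a vector table with num_entries 32-bit
   entries: next power of two >= num_entries * 4, but at least 128. */
static uint32_t vtor_alignment(uint32_t num_entries)
{
    uint32_t size = num_entries * 4u;
    uint32_t align = 128u;
    while (align < size)
        align <<= 1;
    return align;
}

/* A candidate base address is usable if it is a multiple of that alignment. */
static int vtor_base_ok(uint32_t base, uint32_t num_entries)
{
    return (base % vtor_alignment(num_entries)) == 0u;
}
```

For instance, a table with 125 entries (500 bytes) needs 512-byte alignment, so 0x0C00 0000 is a valid base, whereas 0x3000 0100 is not.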

→ example-stm32-p1(P1_TZEN/Secure/Core/Startup/startup_stm32l552zetxq.s)

  .section  .text.Reset_Handler
    .weak   Reset_Handler
    .type   Reset_Handler, %function
Reset_Handler:
  ldr   sp, =_estack    /* set stack pointer */

/* Copy the data segment initializers from flash to SRAM */
  movs  r1, #0
  b LoopCopyDataInit

CopyDataInit:
    ldr r3, =_sidata
    ldr r3, [r3, r1]
    str r3, [r0, r1]
    adds    r1, r1, #4

LoopCopyDataInit:
    ldr r0, =_sdata
    ldr r3, =_edata
    adds    r2, r0, r1
    cmp r2, r3
    bcc CopyDataInit
    ldr r2, =_sbss
    b   LoopFillZerobss
/* Zero fill the bss segment. */
FillZerobss:
    movs    r3, #0
    str r3, [r2], #4

LoopFillZerobss:
    ldr r3, = _ebss
    cmp r2, r3
    bcc FillZerobss

/* Call the clock system intitialization function.*/
    bl  SystemInit
/* Call static constructors */
    bl __libc_init_array
/* Call the application's entry point.*/
    bl  main

LoopForever:
    b LoopForever

.size   Reset_Handler, .-Reset_Handler

Reset_Handler first sets up the .data and .bss sections:

copy the .data section from flash into SRAM (CopyDataInit: and LoopCopyDataInit:)
initialize .bss with zeros in SRAM (FillZerobss: and LoopFillZerobss:)

The symbol _sidata holds the flash address at which the initial contents of .data are stored when the device is programmed. The symbols _sdata and _edata, on the other hand, hold the start and end addresses of the region in SRAM into which Reset_Handler copies .data. The labels CopyDataInit: and LoopCopyDataInit: mark the loop that performs this copy from flash into SRAM.

→ example-stm32-p1(P1_TZEN/Secure/STM32L552ZETXQ_FLASH.ld)

  /* Used by the startup to initialize data */
  _sidata = LOADADDR(.data);

  /* Initialized data sections into "RAM" Ram type memory */
  .data : 
  {
    . = ALIGN(8);
    _sdata = .;        /* create a global symbol at data start */
    *(.data)           /* .data sections */
    *(.data*)          /* .data* sections */

    . = ALIGN(8);
    _edata = .;        /* define a global symbol at data end */
    
  } >RAM AT> ROM

Many systems have different types of memory, most often flash and RAM. Firmware is programmed into flash, and during reset the system is initialized and some data, e.g. the .data section, is copied from flash into RAM. The linker supports this by distinguishing between a Virtual Memory Address (VMA) and a Load Memory Address (LMA). The VMA is the address a section has while the program runs (for .data: in RAM, where the data is copied to); the LMA is the address at which the section is stored in the image (for .data: in flash, where the data is copied from). In systems where the whole ELF file is loaded into RAM, LMA and VMA do not differ; this is the case for typical executables on your personal computer at home, for example.

By specifying >RAM AT> ROM, we place the VMA of .data in RAM and its LMA in ROM: .data is stored in ROM and copied into RAM at startup. LOADADDR(.data) returns the LMA of the .data section. The linker script assigns this value to _sidata, and the startup code in the reset handler reads the initial values of .data starting from that address.
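Symbols such as _sidata, _sdata and _edata are defined by the linker, not by the C code. They have an address but no storage, so C code declares them as extern objects and only ever uses their address (e.g. `extern uint32_t _sidata;` and then `&_sidata`). Because the STM32 symbols only exist when linking with STM32L552ZETXQ_FLASH.ld, the following sketch assumes a Linux host with GNU ld. As stand-ins, it uses the etext, edata and end symbols that the default linker script provides (see the end(3) man page). The function name host_layout_is_ordered is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Linker-defined symbols: only their addresses are meaningful.
   Assumption: Linux host, GNU ld default linker script (see end(3)). */
extern char etext;   /* first address past the text segment */
extern char edata;   /* first address past the initialized data segment */
extern char end;     /* first address past the uninitialized data (bss) */

/* Check the ordering text < initialized data <= end of bss. */
static int host_layout_is_ordered(void)
{
    uintptr_t t = (uintptr_t)&etext;
    uintptr_t d = (uintptr_t)&edata;
    uintptr_t e = (uintptr_t)&end;
    return t < d && d <= e;
}
```

The symbol end also shows up in _sbrk() in chapter 5.6. There, the STM32 linker script PROVIDEs it as the start of the heap.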

→ example-stm32-p1(P1_TZEN/Secure/Core/Startup/startup_stm32l552zetxq.s)

  b LoopCopyDataInit
  
CopyDataInit:
    ldr r3, =_sidata
    ldr r3, [r3, r1]
    str r3, [r0, r1]
    adds    r1, r1, #4

LoopCopyDataInit:
    ldr r0, =_sdata
    ldr r3, =_edata
    adds    r2, r0, r1
    cmp r2, r3
    bcc CopyDataInit

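The two routines work with the LMA and the VMA of .data. LoopCopyDataInit: loads R0 with _sdata (the VMA in RAM) and R3 with _edata. It then computes the next destination address R2 = R0 + R1 and branches to CopyDataInit: as long as R2 is below _edata (bcc: branch if unsigned lower). CopyDataInit: loads R3 with _sidata (the LMA in flash), reads the word at offset R1 from there and stores it at the same offset relative to R0 in RAM. R1 holds the byte offset and is incremented by 4 per word, so the loop ends as soon as all words of .data have been copied. Since execution enters at LoopCopyDataInit:, nothing is copied if .data is empty.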
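The same copy loop can be written in C. The following sketch mirrors the assembly step by step and runs on a host, with two arrays standing in for flash and RAM. The names copy_data, flash_image and ram are hypothetical. The four-word size matches .data in the example ELF (0x10 bytes, see chapter 5.7).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C rendering of CopyDataInit/LoopCopyDataInit: copy words from the load
   address (LMA, _sidata) to the run address (VMA, _sdata up to _edata). */
static void copy_data(const uint32_t *sidata, uint32_t *sdata, const uint32_t *edata)
{
    size_t off = 0;                   /* role of R1, counted in words (R1 counts bytes) */
    while (sdata + off < edata) {     /* cmp r2, r3 / bcc CopyDataInit */
        sdata[off] = sidata[off];     /* ldr r3, [r3, r1] / str r3, [r0, r1] */
        off++;                        /* adds r1, r1, #4 */
    }
}

/* Host stand-ins: "flash" holds the .data image, "ram" receives it. */
static const uint32_t flash_image[4] = { 0x11111111u, 0x22222222u, 0x33333333u, 0x44444444u };
static uint32_t ram[4];
```

On the target, the equivalent call would be copy_data(&_sidata, &_sdata, &_edata), with the three symbols declared as extern uint32_t. The word-wise copy relies on the ALIGN(8) statements around .data in the linker script.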

Next up is the .bss section:

→ example-stm32-p1(P1_TZEN/Secure/STM32L552ZETXQ_FLASH.ld)

  /* Uninitialized data section into "RAM" Ram type memory */
  . = ALIGN(8);
  .bss :
  {
    /* This is used by the startup in order to initialize the .bss section */
    _sbss = .;         /* define a global symbol at bss start */
    __bss_start__ = _sbss;
    *(.bss)
    *(.bss*)
    *(COMMON)

    . = ALIGN(8);
    _ebss = .;         /* define a global symbol at bss end */
    __bss_end__ = _ebss;
  } >RAM

We already learned in chapter 2 that .bss does not hold any data in the firmware image; it only reserves space in RAM. All variables in .bss must start out as zero, so the startup code zero-fills the section, as you can see for yourself in the routines labeled FillZerobss and LoopFillZerobss:

→ example-stm32-p1(P1_TZEN/Secure/Core/Startup/startup_stm32l552zetxq.s)

    ldr r2, =_sbss
    b   LoopFillZerobss
/* Zero fill the bss segment. */
FillZerobss:
    movs    r3, #0
    str r3, [r2], #4

LoopFillZerobss:
    ldr r3, = _ebss
    cmp r2, r3
    bcc FillZerobss

The startup code uses the symbols defined in the linker script for the start (_sbss) and end (_ebss) of the output section .bss to fill the section in RAM with zeros. Nothing is loaded from flash.
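A C version of the zero-fill loop is even shorter. The sketch below runs on a host: an array pre-filled with arbitrary values stands in for the RAM between _sbss and _ebss, since SRAM contents after power-up are not defined. The names zero_bss and bss_area are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* C rendering of FillZerobss/LoopFillZerobss: store zero words from _sbss
   up to (excluding) _ebss. Nothing is read from flash. */
static void zero_bss(uint32_t *sbss, const uint32_t *ebss)
{
    for (uint32_t *p = sbss; p < ebss; p++)   /* cmp r2, r3 / bcc FillZerobss */
        *p = 0u;                              /* str r3, [r2], #4 */
}

/* Host stand-in for the RAM between _sbss and _ebss, holding garbage. */
static uint32_t bss_area[6] = { 0xDEADBEEFu, 1u, 2u, 3u, 4u, 5u };
```

On the target, the call would be zero_bss(&_sbss, &_ebss).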

After initializing .data and .bss, Reset_Handler calls SystemInit, which initializes the SAU (details in chapter 7.2) and VTOR for the NS world. Before branching to main, it calls __libc_init_array. This in turn calls all initialization functions registered in the sections .preinit_array and .init_array (for example C++ static constructors), as well as _init from the .init section. These sections were also placed by the linker into the ROM segment:

→ example-stm32-p1(P1_TZEN/Secure/STM32L552ZETXQ_FLASH.ld)

  .preinit_array     :
  {
    . = ALIGN(8);
    PROVIDE_HIDDEN (__preinit_array_start = .);
    KEEP (*(.preinit_array*))
    PROVIDE_HIDDEN (__preinit_array_end = .);
    . = ALIGN(8);
  } >ROM
  
  .init_array :
  {
    . = ALIGN(8);
    PROVIDE_HIDDEN (__init_array_start = .);
    KEEP (*(SORT(.init_array.*)))
    KEEP (*(.init_array*))
    PROVIDE_HIDDEN (__init_array_end = .);
    . = ALIGN(8);
  } >ROM

The symbols defined here, e.g. __init_array_start, are used by the C library (newlib, init.c) to calculate the number of entries in each array:

  #ifdef HAVE_INITFINI_ARRAY

  /* These magic symbols are provided by the linker.  */
  extern void (*__preinit_array_start []) (void) __attribute__((weak));
  extern void (*__preinit_array_end []) (void) __attribute__((weak));
  extern void (*__init_array_start []) (void) __attribute__((weak));
  extern void (*__init_array_end []) (void) __attribute__((weak));

  extern void _init (void);

  /* Iterate over all the init routines.  */
  void
  __libc_init_array (void)
  {
    size_t count;
    size_t i;

    count = __preinit_array_end - __preinit_array_start;
    for (i = 0; i < count; i++)
      __preinit_array_start[i] ();

    _init ();

    count = __init_array_end - __init_array_start;
    for (i = 0; i < count; i++)
      __init_array_start[i] ();
  }
  #endif

Each function pointer placed into the sections .preinit_array and .init_array is called via __preinit_array_start[i]() and __init_array_start[i](), respectively.
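The following host-side sketch models the init-array mechanism. run_init_array() performs the same pointer arithmetic and loop as __libc_init_array over a hand-built array. A real `__attribute__((constructor))` function shows how the compiler registers code to run before main(): on typical ELF targets, including arm-none-eabi, GCC emits a pointer to it into .init_array. On a Linux host it is glibc rather than newlib that walks .init_array. The names run_init_array, first, second, fake_init_array and mark_ctor are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*init_fn)(void);

/* Model of __libc_init_array's loop: call every pointer in [start, end). */
static void run_init_array(init_fn *start, init_fn *end)
{
    size_t count = (size_t)(end - start);   /* same arithmetic as newlib */
    for (size_t i = 0; i < count; i++)
        start[i]();
}

static int call_log[3];
static int calls;

static void first(void)  { call_log[calls++] = 1; }
static void second(void) { call_log[calls++] = 2; }

/* Hand-built stand-in for the contents the linker collects in .init_array. */
static init_fn fake_init_array[] = { first, second };

/* A real constructor: the C runtime calls it before main() via .init_array. */
static int ctor_ran;
static void mark_ctor(void) __attribute__((constructor));
static void mark_ctor(void) { ctor_ran = 1; }
```

By the time main() runs, ctor_ran is already 1. Running run_init_array() over fake_init_array calls first() and then second(), in array order.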

A graphical summary of this chapter:

Figure 5.2: boot process graphical summary

5.4 Non-Secure Callable segment

Terms explained in this chapter:

SG veneer
CMSE import library

This chapter applies only to secure world firmware. As discussed in chapter 4, the security state (secure or non-secure) is determined by the security attribution of the memory region the processor executes from. When the core executes code from non-secure memory, it is in Non-secure state; when it executes code from a region attributed as secure, it is in Secure state. The security attribution of memory regions is defined by the IDAU and the SAU.

More on Implementation Defined Attribution Unit:

Chapter 4.5: IDAU and SAU: Security attribution
Chapter 10.2.4: STM32L5: Memory Map

5.5 Calling non-secure world from secure world

This section is here for reference only. The linker does not need to generate special regions for S to NS transitions. Everything you need to know is covered in chapter 4.6.2.

5.6 Heap and Stack

From the linker file we can also learn a lot about the RAM layout of the device, for example where the heap and the stack are located. The output sections listed in table 5.3 are written sequentially into the RAM segment, in the order in which they are defined in the linker script:

Table 5.3: RAM output sections
section	explanation
`.data`	See chapter 5.3 and chapter 5.6
`.bss`	See chapter 5.3
`._user_heap_stack`	See chapter 5.6

The output section ._user_heap_stack sets up the heap and stack regions. Three important symbols are defined at the beginning of the linker script:

_estack
_Min_Heap_Size
_Min_Stack_Size

→ example-stm32-p1(P1_TZEN/Secure/STM32L552ZETXQ_FLASH.ld)

/* Highest address of the user mode stack */
_estack = ORIGIN(RAM) + LENGTH(RAM);    /* end of "RAM" Ram type memory */

_Min_Heap_Size = 0x200 ;    /* required amount of heap  */
_Min_Stack_Size = 0x400 ;   /* required amount of stack */

/* Memories definition */
MEMORY
{
  RAM   (xrw)   : ORIGIN = 0x30000000,  LENGTH = 96K    /* Memory is divided. Actual start is 0x30000000 and actual length is 256K */
  ROM   (rx)    : ORIGIN = 0x0C000000,  LENGTH = 248K    /* Memory is divided. Actual start is 0x0C000000 and actual length is 512K */
  ROM_NSC   (rx)    : ORIGIN = 0x0C03E000,  LENGTH = 8K    /* Non-Secure Call-able region */
}

They are used to set up the size of the output section ._user_heap_stack:

→ example-stm32-p1(P1_TZEN/Secure/STM32L552ZETXQ_FLASH.ld)

 /* User_heap_stack section, used to check that there is enough "RAM" Ram  type memory left */
  ._user_heap_stack :
  {
    . = ALIGN(8);
    PROVIDE ( end = . );
    PROVIDE ( _end = . );
    . = . + _Min_Heap_Size;
    . = . + _Min_Stack_Size;
    . = ALIGN(8);
  } >RAM

The location counter . in a linker script always holds the current output location. Assigning to it, as done here, advances the location and reserves free space in the output section.

The final layout of the RAM segment:

The segment will start at 0x3000.0000, as defined in the MEMORY command of the linker script
First .data, then .bss is written into RAM.
Then the location counter is advanced by _Min_Heap_Size and _Min_Stack_Size.

This results in the RAM layout shown in figure 5.3.

Figure 5.3: RAM layout

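If the combined size of .data, .bss, heap and stack exceeds the LENGTH of RAM, the linker reports an error (GNU ld prints a message such as "region `RAM' overflowed by N bytes"). Note that ._user_heap_stack only reserves space for this check: at run time the heap starts at end and grows upwards, while the stack starts at _estack, the top of RAM, and grows downwards.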
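The layout rules can be checked with a little arithmetic. The sketch below replays the linker script's ALIGN(8) and location-counter steps on the host and evaluates the overflow condition the linker enforces. As an example, it uses the sizes implied by the readelf output in chapter 5.7. There, .data is 0x10 bytes and the segment holding .bss and ._user_heap_stack is 0x658 bytes, so .bss is 0x58 bytes, assuming no padding. The names ram_layout, compute_layout and ALIGN8 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ALIGN8(x) (((x) + 7u) & ~(uint32_t)7u)   /* like ALIGN(8) in ld */

typedef struct {
    uint32_t sdata, edata;     /* .data */
    uint32_t sbss, ebss;       /* .bss */
    uint32_t end;              /* heap start, PROVIDE(end = .) */
    uint32_t heap_stack_end;   /* end of ._user_heap_stack */
    uint32_t estack;           /* initial SP = ORIGIN + LENGTH; first push stores at estack - 4 */
    int fits;                  /* 0 if the linker would report a RAM overflow */
} ram_layout;

static ram_layout compute_layout(uint32_t origin, uint32_t length,
                                 uint32_t data_size, uint32_t bss_size,
                                 uint32_t min_heap, uint32_t min_stack)
{
    ram_layout l;
    l.sdata = ALIGN8(origin);
    l.edata = ALIGN8(l.sdata + data_size);
    l.sbss = ALIGN8(l.edata);
    l.ebss = ALIGN8(l.sbss + bss_size);
    l.end = ALIGN8(l.ebss);
    l.heap_stack_end = ALIGN8(l.end + min_heap + min_stack);
    l.estack = origin + length;
    l.fits = l.heap_stack_end <= l.estack;
    return l;
}
```

With ORIGIN 0x3000 0000, LENGTH 96 KB, _Min_Heap_Size 0x200 and _Min_Stack_Size 0x400, the heap starts at 0x3000 0068 and the reservation ends at 0x3000 0668, matching segment 03 in chapter 5.7. The initial stack pointer is 0x3001 8000.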

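The symbol _estack is set to the end of RAM: ORIGIN(RAM) + LENGTH(RAM) = 0x3000 0000 + 96 KB = 0x3001 8000, one byte past the highest RAM address 0x3001 7FFF. This is the correct initial value because the Cortex-M stack is full-descending: a push first decrements SP and then stores, so the first word lands at 0x3001 7FFC. The stack therefore grows downwards (towards lower addresses) as data is pushed, i.e. towards the heap. The beginning of the heap is defined by the symbol end (or _end). These symbols are provided to the C library and used to initialize the heap. When memory is allocated on the heap, malloc() calls _sbrk(), the system-call stub implemented in sysmem.c, to acquire additional heap space: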

→ example-stm32-p1(P1_TZEN/Secure/Core/Src/sysmem.c)

#include <errno.h>
#include <sys/types.h>   /* caddr_t */

register char * stack_ptr asm("sp");

/* Functions */

/**
 _sbrk
 Increase program data space. Malloc and related functions depend on this
**/
caddr_t _sbrk(int incr)
{
    extern char end asm("end");
    static char *heap_end;
    char *prev_heap_end;

    if (heap_end == 0)
        heap_end = &end;

    prev_heap_end = heap_end;
    if (heap_end + incr > stack_ptr)
    {
        errno = ENOMEM;
        return (caddr_t) -1;
    }

    heap_end += incr;

    return (caddr_t) prev_heap_end;
}

_sbrk() uses the linker-provided symbol end to initialize heap_end, the current top of the heap, on its first call. To make sure that heap and stack do not collide, it rejects a request (setting errno to ENOMEM and returning (caddr_t) -1) if the new heap end, heap_end + incr, would lie above the current stack pointer. This only compares against the stack pointer at the time of the call; later stack growth into the heap is not detected.
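To experiment with this logic off-target, the following sketch runs _sbrk()'s algorithm on a host. A static array plays the RAM between end and the top of the stack, and a plain pointer plays the stack pointer. It uses offsets (ptrdiff_t) instead of comparing raw addresses, so the host version never forms pointers outside the array. It also returns void * instead of caddr_t. The names sim_sbrk, ram_window and sim_sp are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Host stand-ins: ram_window[0] plays `end`, sim_sp plays the stack pointer. */
static char ram_window[0x600];
static char *sim_sp = ram_window + sizeof ram_window;

/* Same algorithm as _sbrk() in sysmem.c. */
static void *sim_sbrk(ptrdiff_t incr)
{
    static char *heap_end;            /* zero-initialized, lives in .bss */
    char *prev_heap_end;

    if (heap_end == NULL)
        heap_end = ram_window;        /* first call: heap starts at `end` */

    assert(incr >= ram_window - heap_end);   /* host guard, not in the original */
    prev_heap_end = heap_end;
    if (incr > sim_sp - heap_end) {   /* new heap end would pass the SP */
        errno = ENOMEM;
        return (void *)-1;
    }
    heap_end += incr;
    return prev_heap_end;
}
```

After a first sim_sbrk(0x200), lowering sim_sp to model a deeper stack makes later requests fail with ENOMEM once they would cross it.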

5.7 ELF to .bin

When flashing an SoC, we do not write an ELF file into flash; instead, an additional step creates the firmware image. How the final firmware image is built from an ELF file depends on the target, but for STM32L5 it is simply:

arm-none-eabi-objcopy -O binary "P1_TZEN_Secure.elf" "P1_TZEN_Secure.bin"

objcopy -O binary produces a raw memory image from the ELF file. It starts at the load memory address (PhysAddr) of the lowest segment, which becomes file offset 0x0. It then copies the contents of each segment to the file offset given by its PhysAddr minus that base. Gaps between segments are filled with zeros. For example, the S world ELF:

arm-none-eabi-readelf -l P1_TZEN_Secure.elf (program headers and section-to-segment mapping shown)

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD           0x010000 0x0c000000 0x0c000000 0x03270 0x03270 RWE 0x10000
  LOAD           0x020000 0x30000000 0x0c003270 0x00010 0x00010 RW  0x10000
  LOAD           0x02e000 0x0c03e000 0x0c03e000 0x00020 0x00020 R E 0x10000
  LOAD           0x030010 0x30000010 0x30000010 0x00000 0x00658 RW  0x10000

 Section to Segment mapping:
  Segment Sections...
   00     .isr_vector .text .rodata .init_array .fini_array 
   01     .data 
   02     .gnu.sgstubs 
   03     .bss ._user_heap_stack

The .bin file starts with the contents of segment 00 (0x0C00.0000).
It is directly followed by the contents of segment 01, which starts at file offset 0x0000.3270 (0x0C00.3270 - 0x0C00.0000).
The space between segments 01 and 02 is filled with zeros up to file offset 0x0003.E000, where segment 02 starts.
The contents of segment 02 (.gnu.sgstubs) follow.
Segment 03 has a FileSiz of 0 (.bss and ._user_heap_stack only occupy memory at run time), so nothing from it is written into the .bin.
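The offsets above are plain arithmetic on the program headers. The following sketch reproduces it from the PT_LOAD values printed by readelf: each segment with file data lands at PhysAddr minus the lowest PhysAddr, and the image ends at the highest PhysAddr + FileSiz. It is a model of the layout rule described above, not objcopy's implementation, and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* PT_LOAD entries from the readelf output (PhysAddr = LMA). */
typedef struct {
    uint32_t paddr;
    uint32_t filesz;
} load_segment;

static const load_segment segs[] = {
    { 0x0C000000u, 0x03270u },   /* 00: .isr_vector .text .rodata ... */
    { 0x0C003270u, 0x00010u },   /* 01: .data (stored at its LMA in flash) */
    { 0x0C03E000u, 0x00020u },   /* 02: .gnu.sgstubs */
    { 0x30000010u, 0x00000u },   /* 03: .bss ._user_heap_stack, no file data */
};
#define NSEGS (sizeof segs / sizeof segs[0])

/* Lowest load address among segments that contribute bytes to the image. */
static uint32_t image_base(void)
{
    uint32_t base = UINT32_MAX;
    for (size_t i = 0; i < NSEGS; i++)
        if (segs[i].filesz > 0 && segs[i].paddr < base)
            base = segs[i].paddr;
    return base;
}

/* Offset of segment i's first byte inside the .bin file. */
static uint32_t bin_offset(size_t i)
{
    return segs[i].paddr - image_base();
}

/* Total .bin size: from the base up to the highest end of file data. */
static uint32_t bin_size(void)
{
    uint32_t top = 0;
    for (size_t i = 0; i < NSEGS; i++)
        if (segs[i].filesz > 0 && segs[i].paddr + segs[i].filesz > top)
            top = segs[i].paddr + segs[i].filesz;
    return top - image_base();
}
```

This yields offset 0x3270 for segment 01, offset 0x3E000 for segment 02 and a total image size of 0x3E020 bytes. It also shows why the LMA of .data matters: if segment 01 had file data at a RAM address, the image would have to span from 0x0C00 0000 up to 0x3000 0000.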

The final binary blob is written into the STM32L5 flash at address 0x0C00.0000. Because the device boots from 0x0C00.0000, every byte of the binary ends up at the address it was linked for, and the firmware runs successfully.