Chapter 6 AAPCS Arm ABI and Runtime view

A procedure call standard, or calling convention, defines how different compilation units can work together, even when they were compiled by different compilers. It defines how parameters are passed from one function to another, which registers and other program state must be preserved by caller and callee, and what may be altered by the callee. The procedure call standard is one of a set of standards that together define the application binary interface (ABI) a compilation unit has to respect.

When we develop software we often encounter programming interfaces. For example, the Twitter API (application programming interface) defines how we have to “talk” to Twitter to retrieve meaningful results. The same applies to compiled files. Imagine your executable targets Linux as operating system and intends to establish a network connection. In Linux there is the connect() syscall, which has an associated system call number: 283. In chapter 3.6 we learned how the SysCall exception is triggered by the svc instruction in assembler. Maybe you now assume that you can call svc #283 in your code (i.e. the compiler could compile your C code to it) to ask Linux for a connect(). In fact, that worked until the Old Application Binary Interface (OABI), the Arm ABI for Linux, was replaced by the new Arm EABI for Linux.

The new Arm EABI for Linux expects the syscall number in R7 and 0 as the argument to SVC (SWI is the older mnemonic for the same instruction). A summary of the syscall interface for Linux is given in the man page for syscall(2):

The first table shows how a syscall is called:

       Arch/ABI    Instruction           System  Ret  Ret  Error    Notes
                                         call #  val  val2
       ───────────────────────────────────────────────────────────────────
       arm/OABI    swi NR                -       a1   -    -        2
       arm/EABI    swi 0x0               r7      r0   r1   -

And the second how arguments are passed to the syscalls:

       Arch/ABI      arg1  arg2  arg3  arg4  arg5  arg6  arg7  Notes
       ──────────────────────────────────────────────────────────────
       alpha         a0    a1    a2    a3    a4    a5    -
       arc           r0    r1    r2    r3    r4    r5    -
       arm/OABI      a1    a2    a3    a4    v1    v2    v3
       arm/EABI      r0    r1    r2    r3    r4    r5    r6

For bare-metal projects not running on Linux, the EABI for Linux is not relevant. Instead, the ABI for the Arm Architecture, specified by Arm, is used. As already mentioned, the ABI is a set of standards. One element, the procedure call standard, defines an interface for function calls between compilation units. Another important element of the ABI set of standards is the ELF extensions for Arm binaries, parts of which we already saw in chapter 2.

More on ELF:

  • Chapter 2.4: Compiler, Assembler and Linker: Terms and concepts
  • Chapter 2.5: ELF

For reverse engineering the procedure call standard is the most relevant to understand, so the Procedure Call Standard for Arm Architecture is the focus of this chapter.


Over the years Arm released multiple versions of their procedure call standard. In the following chapter we will rely on the Procedure Call Standard for Arm Architecture (AAPCS), which is the most recent release. You might encounter other terms:

  • AAPCS: Procedure Call Standard for Arm Architecture
  • ATPCS: Arm-Thumb Procedure Call Standard (precursor of AAPCS).
  • APCS: Arm Procedure Call Standard (obsolete).
  • TPCS: Thumb Procedure Call Standard (obsolete).

When compiling with gcc, for example, you can encounter these in the -mabi= option, where you can specify atpcs or apcs-gnu to compile for these target ABIs.

6.1 Data Types

The AAPCS defines composite types and fundamental types. A fundamental type is any one of the C types listed in table 6.1 or 6.2. There are also fundamental types for floats; if you want a full list, check the AAPCS. A composite type consists of one or more fundamental or other composite data types. For example, a struct in C is handled as a composite type. Each member of the struct is either another composite data type (another struct) or a fundamental data type. Since the definition is recursive (struct in struct, i.e. composite data type in composite data type), in the end each composite data type consists of a collection of fundamental data types.

struct myStruct7Word{
        int a;
        int b;
        int c;
        int d;
        int e;
        int f;
        char g;
};

struct myStruct7Word is a composite type, which consists of 7 fundamental types: int a to int f and char g.
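
The name myStruct7Word hints at the size of the struct in memory. The following sketch checks the layout at compile time. It assumes that int32_t stands in for the word-sized int of the AAPCS, so that the checks also hold on a host compiler; the helper words_needed() is a hypothetical name introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same members as myStruct7Word; int32_t stands in for the AAPCS int. */
struct myStruct7Word {
        int32_t a;
        int32_t b;
        int32_t c;
        int32_t d;
        int32_t e;
        int32_t f;
        char g;
};

/* The six int members occupy bytes 0..23, so g sits at byte 24. */
static_assert(offsetof(struct myStruct7Word, g) == 24, "g follows the six ints");

/* The struct inherits the 4 byte alignment of its int members ... */
static_assert(_Alignof(struct myStruct7Word) == 4, "word aligned");

/* ... so three padding bytes follow g and the size is 7 words. */
static_assert(sizeof(struct myStruct7Word) == 28, "7 words including padding");

/* Hypothetical helper: number of 32 bit words needed to hold a value. */
size_t words_needed(size_t size_in_bytes) {
        return (size_in_bytes + 3) / 4;
}
```

Although g is only one byte, the struct occupies seven full words once the padding is included.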

Tables 6.1 and 6.2 show the mapping from C types to machine types, their size in memory and their alignment boundary.

Terms explained in this chapter:

  • composite type
  • fundamental type
  • machine type

Table 6.1: Integral types and their mapping to C types

    Machine Type                         Size in Bytes  Alignment Boundary  C types
    ─────────────────────────────────────────────────────────────────────────────────────────────────────────
    (signed and unsigned) byte           1              1                   (signed and unsigned) char
    (signed and unsigned) half-word      2              2                   (signed and unsigned) short
    (signed and unsigned) word           4              4                   (signed and unsigned) int and long int
    (signed and unsigned) double-word    8              8                   (signed and unsigned) long long int

Table 6.2: Pointer types and their mapping to C types

    Machine Type                         Size in Bytes  Alignment Boundary  C types
    ─────────────────────────────────────────────────────────────────────────────────────────────────────────
    pointer (data and code)              4              4                   T*, T (*F)(), T&

Composite types can be split into their fundamental types when they are passed by value into a subroutine. We are going to examine that in detail in chapter 6.2. Fundamental types are handled as a single entity when a subroutine is called and are not split further.

6.2 Subroutine Call

Terms explained in this chapter:

  • parameter
  • argument
  • caller-saved register
  • callee-saved register
  • stack frame
  • stack layout

Let me first introduce the difference between parameters and arguments. When I speak about an argument, I mean the actual value passed into a function. The parameter, on the other hand, is the corresponding variable that is part of the declaration of the function.


More on Arm Registers:

  • Chapter 6.2: AAPCS: Subroutine Call
  • Chapter 4.4.1: Banked Registers
  • Chapter 3.7: Arm Architecture: General-Purpose registers, special-purpose registers

Before we dive into the subroutine call, we need to categorize registers into two broad categories: caller-saved and callee-saved registers. AAPCS does not directly mention these categories, but rather divides registers into preserved and not preserved ones. For reverse engineering firmware, it is important to know the purpose of each register within a function and for a function call:

  • The general-purpose registers R0-R3 are used to pass arguments to a function and to return values. The callee does not need to preserve them: they are caller-saved.
  • R4-R8, R10 and R11 are used to hold local variables within a function. A caller can expect them to be unchanged when the called function returns: they are callee-saved.
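
The two categories can be written down as a small lookup, which is handy when annotating registers in a disassembly. This is only a sketch: the names reg_class and classify_register() are hypothetical, R12 is counted as caller-saved because it is the scratch register discussed in chapter 6.4, and R9, SP, LR and PC are grouped as "special" because the two bullet points above do not cover them.

```c
#include <assert.h>

typedef enum {
        REG_CALLER_SAVED, /* the callee may overwrite it */
        REG_CALLEE_SAVED, /* the callee must restore it before returning */
        REG_SPECIAL       /* R9 (platform specific), SP, LR, PC */
} reg_class;

/* Classify core register number 0..15 (0 = R0, 15 = PC). */
reg_class classify_register(int reg) {
        assert(reg >= 0 && reg <= 15);
        if (reg <= 3 || reg == 12)
                return REG_CALLER_SAVED;  /* R0-R3 and the scratch register R12 */
        if (reg <= 8 || reg == 10 || reg == 11)
                return REG_CALLEE_SAVED;  /* R4-R8, R10, R11 */
        return REG_SPECIAL;               /* R9, R13 (SP), R14 (LR), R15 (PC) */
}
```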

To pass a fundamental data type argument to a function, the compiler first maps the parameter to a machine type (see tables 6.1 and 6.2). Then the argument is bound to a register, which is used to pass the argument to the function. When all caller-saved registers (R0 to R3) are bound to arguments, the stack is used to pass the remaining arguments:

  1. Parameters are passed from left to right as declared.
  2. Arguments are passed in registers and on the stack.
  3. Composite types can be split between registers and stack. Their contained fundamental types are not split (see 6).
  4. The first 4 word-sized arguments are passed in R0-R3. All following arguments are passed on the stack.
  5. If one argument is bigger than 32 bit (e.g. a long long int), it is split across consecutive registers, e.g. R0 and R1, up to R3. A 64 bit argument always starts in an even register, i.e. it occupies R0/R1 or R2/R3.
  6. Fundamental types can’t be split between registers and stack. So when R0-R2 are already reserved for the first two parameters of a function, a third long long int parameter with 64 bit won’t be split but rather passed entirely on the stack.

Composite types are split into their fundamental data types, which are then handled the same way as described above.
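
The rules above can be sketched as a small allocation model. It is a simplification and not the full AAPCS algorithm: it only knows integer-like arguments described by their size and alignment, it treats a composite type as a block of words, and it ignores floating-point registers. All names (arg_desc, arg_loc, assign_arguments) are hypothetical and introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One argument as the model sees it: size and alignment in bytes. */
typedef struct {
        unsigned size;
        unsigned align; /* 4 for int and word-aligned structs, 8 for long long int */
} arg_desc;

/* Where the model places an argument. */
typedef struct {
        int first_reg;        /* first register used (0 = R0), -1 if none */
        unsigned reg_words;   /* number of consecutive registers used */
        int stack_offset;     /* byte offset from SP at the call, -1 if none */
        unsigned stack_words; /* number of words passed on the stack */
} arg_loc;

void assign_arguments(const arg_desc *args, size_t count, arg_loc *out) {
        unsigned next_reg = 0;   /* next free register; 4 means R0-R3 are used up */
        unsigned next_stack = 0; /* next free stack offset in bytes */

        assert(args != NULL && out != NULL);
        for (size_t i = 0; i < count; i++) {
                unsigned words = (args[i].size + 3) / 4;
                arg_loc loc = { -1, 0, -1, 0 };

                /* 8 byte aligned arguments start in an even register. */
                if (args[i].align == 8)
                        next_reg = (next_reg + 1) & ~1u;

                if (next_reg + words <= 4) {
                        /* Fits completely into the remaining registers. */
                        loc.first_reg = (int)next_reg;
                        loc.reg_words = words;
                        next_reg += words;
                } else if (next_reg < 4 && next_stack == 0) {
                        /* Split: fill the remaining registers, the rest goes on the stack. */
                        loc.first_reg = (int)next_reg;
                        loc.reg_words = 4 - next_reg;
                        loc.stack_offset = 0;
                        loc.stack_words = words - loc.reg_words;
                        next_stack = 4 * loc.stack_words;
                        next_reg = 4;
                } else {
                        /* Entirely on the stack; no register is used afterwards. */
                        next_reg = 4;
                        if (args[i].align == 8)
                                next_stack = (next_stack + 7) & ~7u;
                        loc.stack_offset = (int)next_stack;
                        loc.stack_words = words;
                        next_stack += 4 * words;
                }
                out[i] = loc;
        }
}
```

For the signature (int, int, int, long long int) the model puts the three ints into R0-R2 and the long long int at SP+0, because rounding up to an even register leaves no register pair free. For a single struct of 28 bytes it uses R0-R3 and places the remaining three words at SP+0. Both cases are examined on real compiler output in chapter 6.2.1.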

6.2.1 Passing arguments

Let’s quickly look at an example for rule number 6, to show how fundamental types are passed either entirely in registers or entirely on the stack:

→ chapter 7, example 1

arm-none-eabi-gcc -c -mcpu=cortex-m33 -mthumb -Os -nostdlib main.c
extern void callee(int a, int b, int c, long long int d);

void caller(){
        callee(0,1,2,3);
}

We see three word-sized arguments (int) and one 8 byte double-word (long long int). What do we expect if we follow the rules above? The ints should be passed in R0 to R2. R3 would still be free, but since long long int is a fundamental data type it can’t be split further, so d should be passed on the stack.

arm-none-eabi-objdump --disassemble=caller main.o
main8.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <caller>:
   0:   b507            push    {r0, r1, r2, lr}
   2:   2300            movs    r3, #0
   4:   2203            movs    r2, #3
   6:   2101            movs    r1, #1
   8:   e9cd 2300       strd    r2, r3, [sp]
   c:   2000            movs    r0, #0
   e:   2202            movs    r2, #2
  10:   f7ff fffe       bl      0 <callee>
  14:   b003            add     sp, #12
  16:   f85d fb04       ldr.w   pc, [sp], #4

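We see how the long long int is first loaded into R2 and R3 and then stored on the stack: the first 32 bits are at SP+0 to SP+4, the next 32 bits at SP+4 to SP+8. The stack space for it was reserved by the initial push {r0, r1, r2, lr}, which saves LR and, by also pushing R0-R2, reserves 12 bytes; SP is not changed afterwards. Then R0, R1 and R2 are prepared with the other arguments and callee() is called.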


Let’s next look at an example for 3), where

  • a composite type is split into its fundamental types and passed in registers and
  • a composite type is split between registers and stack

→ chapter 7, example 2

arm-none-eabi-gcc -c -mcpu=cortex-m33 -mthumb -Os -nostdlib main.c
struct myStruct1Word {
        int a;
};

struct myStruct7Word{
        int a;
        int b;
        int c;
        int d;
        int e;
        int f;
        char g;
};

extern void callee1(struct myStruct1Word);
extern void callee7(struct myStruct7Word);

void caller(){
        struct myStruct1Word s;
        s.a = 1111;
        callee1(s);
        struct myStruct7Word t;
        t.a = 2222;
        t.g = 0xFF;
        callee7(t);
}

There are two structs defined: myStruct1Word and myStruct7Word. Note that myStruct7Word actually consists of 6 ints, each mapped to a machine type of 4 bytes, and a char, which is 1 byte. Both structs are local variables, which are located on the stack. Also remember: the structs are passed by value. The disassembly of caller():

arm-none-eabi-objdump --disassemble=caller main.o
main6.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <caller>:
   0:   b500            push    {lr}
   2:   f240 4057       movw    r0, #1111       ; 0x457
   6:   b08d            sub     sp, #52 ; 0x34
   8:   f7ff fffe       bl      0 <callee1>
   c:   f640 03ae       movw    r3, #2222       ; 0x8ae
  10:   9305            str     r3, [sp, #20]
  12:   23ff            movs    r3, #255        ; 0xff
  14:   f88d 302c       strb.w  r3, [sp, #44]   ; 0x2c
  18:   ab0c            add     r3, sp, #48     ; 0x30
  1a:   e913 0007       ldmdb   r3, {r0, r1, r2}
  1e:   e88d 0007       stmia.w sp, {r0, r1, r2}
  22:   ab05            add     r3, sp, #20
  24:   cb0f            ldmia   r3, {r0, r1, r2, r3}
  26:   f7ff fffe       bl      0 <callee7>
  2a:   b00d            add     sp, #52 ; 0x34
  2c:   f85d fb04       ldr.w   pc, [sp], #4

How does gcc set up the stack? First, R0 is loaded with s.a (offset 0x2) and stack space is reserved (offset 0x6). Note: despite s being a stack variable, s.a is never stored in the stack frame, since it is not used again in caller(). That was optimized away! The composite data type myStruct1Word is split into its fundamental type int and passed in register R0. Then callee1() is called.

Afterwards the preparations to call callee7() begin. For t, the following happens: first, the values 2222 and 0xFF are written into memory (offsets 0xc to 0x14). Then the values t.e, t.f and t.g are copied to the top of the stack at SP, to be passed as arguments to callee7():

Figure 6.1: Stack Layout during execution

After that the remaining 4 arguments (t.a to t.d) are loaded into R0-R3 and callee7() is called.
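
The copy sequence of caller() can be replayed on a host machine. The sketch below models the 52 byte stack frame as an array of words and performs the same three steps as the disassembly. It assumes a little-endian machine and uses int32_t in place of int; the names call_state and prepare_callee7() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct myStruct7Word {
        int32_t a, b, c, d, e, f;
        char g;
};
static_assert(sizeof(struct myStruct7Word) == 28, "7 words including padding");

enum {
        FRAME_WORDS = 13, /* sub sp, #52 */
        T_WORD = 5        /* t lives at SP+20 */
};

/* Register and stack contents right before "bl callee7". */
typedef struct {
        uint32_t r[4];               /* R0-R3 */
        uint32_t stack[FRAME_WORDS]; /* stack[0] is the word at SP+0 */
} call_state;

call_state prepare_callee7(const struct myStruct7Word *t) {
        call_state cs;
        memset(&cs, 0, sizeof cs);
        /* The local variable t occupies SP+20 to SP+47. */
        memcpy(&cs.stack[T_WORD], t, sizeof *t);
        /* ldmdb/stmia: copy t.e, t.f, t.g (SP+36 to SP+47) to SP+0. */
        memcpy(&cs.stack[0], &cs.stack[T_WORD + 4], 12);
        /* ldmia: load t.a to t.d (SP+20 to SP+35) into R0-R3. */
        memcpy(cs.r, &cs.stack[T_WORD], 16);
        return cs;
}
```

With t.a = 2222 and t.g = 0xFF, R0 holds 2222 and the lowest byte of the word at SP+8 holds 0xFF, matching the layout in figure 6.1.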


A more generalized stack layout would be:

Figure 6.2: Generalized stack layout

The dark grey parts of figure 6.2 depict memory regions the currently running function owns. The arguments passed to the callee are at the bottom and the dark grey parts in the middle are used for local variables.

The light grey parts are:

  • previous function state storage: Values the callee preserved for the caller. The stored LR and the callee-saved registers at the bottom of stack.
  • the prepared arguments for the next function call, i.e. the arguments on top of the stack.

6.2.2 Return function values

How values are returned from functions is also part of the AAPCS. The rules are very similar to passing arguments, with some important details differing:

  1. For fundamental types and composite types of sizes smaller than or equal to 32 bit, R0 is used.
  2. For fundamental types only: if the return value is bigger than 32 bit (e.g. a long long int), the return value is split across consecutive registers, e.g. R0 and R1, up to R3 if necessary (128 bit return value).
  3. For composite data types bigger than 32 bit only: the caller places a reference to memory it reserved into R0, which is then used by the callee to return the values.

Rules 1 and 2 are straightforward and mirror what we saw in the previous chapter, when we examined how arguments are passed into a function. When a function returns a composite type, things get interesting, so let’s examine an example:

→ chapter 7, example 3

arm-none-eabi-gcc -c -marm -Os -nostdlib main.c
struct myStruct1Word {
        int a;
};

struct myStruct2Word {
        int a;
        int b;
};
struct myStruct7Word{
        int a;
        int b;
        int c;
        int d;
        int e;
        int f;
        char g;
};

extern struct myStruct1Word callee1();
extern struct myStruct2Word callee2();
extern struct myStruct7Word callee7();

int caller(){
        struct myStruct1Word s;
        struct myStruct2Word t;
        struct myStruct7Word u;

        s = callee1();
        t = callee2();
        u = callee7();
        return s.a + t.a + u.a;
}
arm-none-eabi-objdump --disassemble=caller main.o
main9.o:     file format elf32-littlearm
                                                           
                                                           
Disassembly of section .text:
                                                           
00000000 <caller>:    
   0:   e92d4010        push    {r4, lr}
   4:   e24dd028        sub     sp, sp, #40     ; 0x28
   8:   ebfffffe        bl      0 <callee1>
   c:   e1a04000        mov     r4, r0                                                                                 
  10:   e28d0004        add     r0, sp, #4
  14:   ebfffffe        bl      0 <callee2>
  18:   e28d000c        add     r0, sp, #12
  1c:   ebfffffe        bl      0 <callee7>
  20:   e59d0004        ldr     r0, [sp, #4]
  24:   e0844000        add     r4, r4, r0
  28:   e59d000c        ldr     r0, [sp, #12]
  2c:   e0840000        add     r0, r4, r0
  30:   e28dd028        add     sp, sp, #40     ; 0x28
  34:   e8bd4010        pop     {r4, lr}
  38:   e12fff1e        bx      lr

The first thing again is to reserve some space on the stack (offset 0x4).

callee1() returns only a one-word composite type (myStruct1Word), so according to the rules described above only R0 is used to return s, which is exactly what happens at offset 0xc.

The return values from callee7() and callee2() are bigger than 32 bit, so we expect a reference passed to them in R0:

The address SP+4 is put into R0 and callee2() is called. The value for t.a is loaded from SP+4 at offset 0x20. The same process, but with the address SP+12, is done for callee7(). caller() reserved space for myStruct2Word and myStruct7Word on its stack and passed a reference to these locations in R0 to callee2() and callee7(), respectively. Towards the end, the return value of caller() is calculated and placed into R0 (offset 0x2c). Finally, caller() returns.
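
In C terms, returning a composite type bigger than 32 bit behaves as if the function took a hidden pointer to the result as its first argument. The following sketch makes this explicit. callee2_lowered() is a hypothetical, hand-written equivalent of what the compiler generates, and the values 10 and 20 are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

struct myStruct2Word {
        int32_t a;
        int32_t b;
};

/* Bigger than one word, so the result is returned through memory. */
static_assert(sizeof(struct myStruct2Word) > 4, "returned via memory");

/* What the programmer writes. */
struct myStruct2Word callee2(void) {
        struct myStruct2Word result = { 10, 20 };
        return result;
}

/* Hand-written equivalent: the caller passes the address of the
 * result buffer as a hidden first argument (in R0 on Arm). */
void callee2_lowered(struct myStruct2Word *result) {
        result->a = 10;
        result->b = 20;
}

int caller(void) {
        struct myStruct2Word t; /* space reserved in the stack frame of caller() */
        callee2_lowered(&t);    /* corresponds to: add r0, sp, #4 ; bl callee2 */
        return t.a;             /* corresponds to: ldr r0, [sp, #4] */
}
```

Both variants produce the same result; the difference is only visible at the ABI level, where R0 carries an address instead of a value.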

More on Secure function calls:

  • Chapter 5.4: CMSIS: Non-Secure Callable segment
  • Chapter 6.2: AAPCS: Subroutine Call
  • Chapter 4.4.1: Banked Registers
  • Chapter 7.3: Details: Secure function call
  • Chapter 4.6.1: Overview: Secure function call

More on Nonsecure function calls:

  • Chapter 6.2: AAPCS: Subroutine Call
  • Chapter 4.4.1: Banked Registers
  • Chapter 4.6.2: Overview: Non-Secure function call
  • Chapter 7.4: Details: Non-secure function call

6.3 Exception Entry and Return

Terms explained in this chapter:

  • exception entry
  • exception phases
  • state context

Figure 6.3: Exception Phases

An exception is an interruption of the normal execution flow. This means that the code currently running is stopped unexpectedly and the core switches to the exception handler. This can happen right in the middle of any function. In the previous chapters we saw how the compiler and linker prepare function calls in a structured and well-defined way to make sure the state context of one function is saved before transferring execution to another function. Since an exception cannot be prepared for in this way, the core itself takes care of a lot of state context saving, as we are going to see in this chapter.

More on Arm Exception System:

  • Chapter 3.6.2: Arm Architecture: NVIC
  • Chapter 3.6.3: Arm Architecture: Masking
  • Chapter 6.3: Exception Entry and Return
  • Chapter 10.2.5.2: STM32L5: TrustZone Illegal Access Controller (TZIC)
  • Chapter 3.4: Arm Architecture: Execution Modes and Privilege Levels
  • Chapter 3.6: Arm Architecture: Exceptions, Interrupts, Faults
  • Chapter 7.5: Secure and non-secure exceptions

On exception entry the core saves state context on the stack. The state context is a set of registers:

  • RETPSR (Combined Exception Return Program Status Registers)
  • ReturnAddress
  • LR
  • R12
  • R0-R3

The following steps are done by the core on exception entry:

  1. Save the state context
  2. Set LR to EXC_RETURN
  3. Clear some registers
  4. Select the Main Stack Pointer
  5. Set PC to the exception handler

If the floating-point extension is available in the core and the floating-point context is active (the CONTROL.FPCA bit is set), the floating-point context is stored on the stack first, followed by the state context mentioned above. The stack on exception entry then looks like:

Figure 6.4: Exception Entry Stack
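
For reverse engineering it helps to have the stacked state context as a C struct, for example to find the interrupted code from a stack dump. The sketch below covers only the basic frame of eight words, assuming that no floating-point context and no additional state context were stacked. The names exception_frame and interrupted_pc() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Basic state context as it lies in memory after exception entry,
 * lowest address (the new SP) first. */
typedef struct {
        uint32_t r0;
        uint32_t r1;
        uint32_t r2;
        uint32_t r3;
        uint32_t r12;
        uint32_t lr;             /* LR of the interrupted code */
        uint32_t return_address; /* where execution resumes after the handler */
        uint32_t retpsr;
} exception_frame;

static_assert(sizeof(exception_frame) == 32, "eight stacked words");
static_assert(offsetof(exception_frame, return_address) == 24,
              "ReturnAddress is found at SP+24");

/* Given the stack pointer value at handler entry, return the address
 * at which the interrupted code will continue. */
uint32_t interrupted_pc(const exception_frame *frame) {
        assert(frame != NULL);
        return frame->return_address;
}
```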


The exception return phase is started by the core when the exception handler sets PC to EXC_RETURN. During the return phase the saved contexts are restored and execution resumes where it was interrupted before the exception, which is triggered by setting PC to the stored ReturnAddress from the stack.

6.4 Excursus: IP register for Interworking

More on Interworking:

  • Chapter 2.12: Excursus: Arm-Thumb interworking and .glue_7 and .glue_7t
  • Chapter 6.4: AAPCS: IP register for Interworking

In previous chapters we often saw veneers, which are inserted by the linker to make long branches. To implement these, the linker has to insert stubs, which load a 32 bit value from memory into a register and then jump to that location by referencing that register. In this way the full 32 bit address space can be reached, instead of the limited range that the Arm branch instructions with an immediate operand provide.

Since these veneers change the content of a register, it is necessary to have a register which neither caller nor callee relies on. This is R12, also called IP, the intra-procedure-call scratch register. AAPCS defines R12 as the register which can be used by the linker to implement these long branches.

In chapter 2.9 we saw how the linker resolves a symbol with relocation type R_ARM_THM_CALL, which defines a relocation of a BL #immediate Thumb instruction. There is obviously a lot of complexity hidden in the linker and each relocation type differs a little from the others, for example in the calculation needed to determine the destination address.

For the relocation type R_ARM_THM_CALL, the binutils linker defines the following veneer, which gets inserted when a R_ARM_THM_CALL symbol is relocated:

/* Thumb -> Thumb long branch stub. Used on M-profile architectures.  */
static const insn_sequence elf32_arm_stub_long_branch_thumb_only[] =
{
  THUMB16_INSN (0xb401),         /* push {r0} */
  THUMB16_INSN (0x4802),         /* ldr  r0, [pc, #8] */
  THUMB16_INSN (0x4684),         /* mov  ip, r0 */
  THUMB16_INSN (0xbc01),         /* pop  {r0} */
  THUMB16_INSN (0x4760),         /* bx   ip */
  THUMB16_INSN (0xbf00),         /* nop */
  DATA_WORD (0, R_ARM_ABS32, 0),     /* dcd  R_ARM_ABS32(X) */
};

PC+8 is the location in memory to which the linker writes the destination address. This value is loaded into R0 and moved into IP, then R0 is restored and the stub branches to IP.

Why that detour using R0? The 16-bit LDR instructions in Thumb mode have only 3 bits to encode the destination register, so you can only use R0 - R7. Although the Armv8-M Main Extension has an encoding for LDR where the destination register is encoded in 4 bits, this seems not to be used in the stubs.
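
The register fields can be checked by decoding the halfwords of the stub by hand. The following sketch extracts the register numbers from the three relevant 16-bit Thumb encodings and computes the address the LDR reads from. The function names are hypothetical, and the decoders only handle the exact instruction forms used in the stub.

```c
#include <assert.h>
#include <stdint.h>

/* LDR Rt, [PC, #imm8*4] is encoded as 01001 Rt(3 bits) imm8. */
unsigned ldr_literal_rt(uint16_t insn) {
        assert((insn & 0xF800u) == 0x4800u);
        return (insn >> 8) & 0x7u;
}

/* Address read by the literal load: the PC value is the instruction
 * address plus 4, rounded down to a multiple of 4. */
uint32_t ldr_literal_address(uint16_t insn, uint32_t insn_address) {
        assert((insn & 0xF800u) == 0x4800u);
        return ((insn_address + 4u) & ~3u) + (insn & 0xFFu) * 4u;
}

/* MOV Rd, Rm is encoded as 01000110 D Rm(4 bits) Rd(3 bits);
 * the destination register number is D:Rd, i.e. 4 bits. */
unsigned mov_rd(uint16_t insn) {
        assert((insn & 0xFF00u) == 0x4600u);
        return ((insn >> 4) & 0x8u) | (insn & 0x7u);
}

/* BX Rm is encoded as 010001110 Rm(4 bits) 000. */
unsigned bx_rm(uint16_t insn) {
        assert((insn & 0xFF87u) == 0x4700u);
        return (insn >> 3) & 0xFu;
}
```

Applied to the stub, ldr_literal_rt(0x4802) yields 0 (R0), while mov_rd(0x4684) and bx_rm(0x4760) both yield 12 (IP). The LDR sits at offset 2 of the stub, so ldr_literal_address(0x4802, 2) yields 12, which is exactly the offset of the DATA_WORD after the six halfword instructions.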