DPU ABI
Knowledge requirements
This documentation assumes that the reader has a general understanding of the UPMEM DPU and its architecture, and is familiar with the C programming language.
Procedure Call Standard
Registers
Each thread of the DPU has 24 32-bit registers, labeled r0-r23, and 8 read-only registers:
Name  Permission  Information
r0    R/W         Argument 1 register (caller saved) / Return register for 32-bit value / MSB of 64-bit return value
r1    R/W         Argument 2 register (caller saved) / LSB of 64-bit return value
r2    R/W         Argument 3 register (caller saved)
r3    R/W         Argument 4 register (caller saved)
r4    R/W         Argument 5 register (caller saved)
r5    R/W         Argument 6 register (caller saved)
r6    R/W         Argument 7 register (caller saved)
r7    R/W         Argument 8 register (caller saved)
r8    R/W         Scratch register (caller saved)
r9    R/W         Scratch register (caller saved)
r10   R/W         Scratch register (caller saved)
r11   R/W         Scratch register (caller saved)
r12   R/W         Scratch register (caller saved)
r13   R/W         Scratch register (caller saved)
r14   R/W         Callee saved register
r15   R/W         Callee saved register
r16   R/W         Callee saved register
r17   R/W         Callee saved register
r18   R/W         Callee saved register
r19   R/W         Callee saved register
r20   R/W         Callee saved register
r21   R/W         Callee saved register
r22   R/W         Stack pointer
r23   R/W         Return address
zero  Read-only   = 0
one   Read-only   = 1
lneg  Read-only   = 0xffffffff (-1)
mneg  Read-only   = 0x80000000
id    Read-only   = thread_id
id2   Read-only   = 2 x thread_id
id4   Read-only   = 4 x thread_id
id8   Read-only   = 8 x thread_id
Each even/odd pair of consecutive registers can be used as a 64-bit register, which gives 12 64-bit registers: d0 (r0 & r1), d2 (r2 & r3), …, d22 (r22 & r23).
The MSBs are in the even register (r0, r2, …) and the LSBs are in the odd register (r1, r3, …). Note that d1, d3, …, d21 do not exist.
Read-only registers cannot be used as 64-bit registers.
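The pairing rule can be illustrated with a small host-side model. This is only a sketch: the helper names below are hypothetical and are not part of any UPMEM API.

```c
#include <assert.h>
#include <stdint.h>

/* Combine an even/odd register pair into the 64-bit value it represents.
 * The even register (rN) holds the most significant 32 bits and the odd
 * register (rN+1) holds the least significant 32 bits. */
uint64_t pair_to_u64(uint32_t even_reg, uint32_t odd_reg) {
    return ((uint64_t)even_reg << 32) | odd_reg;
}

/* Value that the even register (MSBs) holds for a given 64-bit value. */
uint32_t u64_even_half(uint64_t value) {
    return (uint32_t)(value >> 32);
}

/* Value that the odd register (LSBs) holds for a given 64-bit value. */
uint32_t u64_odd_half(uint64_t value) {
    return (uint32_t)value;
}

/* Only even indices name a 64-bit register: d0, d2, ..., d22. */
int is_valid_d_index(unsigned n) {
    return n <= 22u && (n % 2u) == 0u;
}
```

For example, a 64-bit return value of 0x0000000100000002 would have 0x00000001 in r0 and 0x00000002 in r1.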
Data types
Machine Type                                C Type                              Byte size  Byte alignment  Preferred byte alignment  Note
Unsigned byte                               unsigned char                       1          1               4                         Stored in a 32-bit register
Signed byte                                 char                                1          1               4                         Stored in a 32-bit register
Unsigned half-word                          unsigned short                      2          2               4                         Stored in a 32-bit register
Signed half-word                            short                               2          2               4                         Stored in a 32-bit register
Unsigned word                               unsigned int                        4          4               4
Signed word                                 int                                 4          4               4
Unsigned double-word                        unsigned long / unsigned long long  8          8               8
Signed double-word                          long / long long                    8          8               8
Single precision floating point (IEEE 754)  float                               4          4               4                         No native operation, only software emulation
Double precision floating point (IEEE 754)  double                              8          8               8                         No native operation, only software emulation
Data pointer                                T *                                 4          4               4
Code pointer                                T (*F) ()                           4          4               4
Endianness
The DPU is a little-endian architecture.
However, the DPU can also perform loads and stores from the WRAM in big-endian order; this is not the default behavior of the load/store operations.
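To make the difference between the two byte orders concrete, the following portable C sketch reads the same four bytes both ways. The function names are hypothetical; they only model the byte order and do not correspond to DPU instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Little-endian (the DPU default): the byte at the lowest address
 * is the least significant byte of the word. */
uint32_t load_le32(const uint8_t *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Big-endian: the byte at the lowest address is the most significant
 * byte of the word. */
uint32_t load_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24)
         | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)
         | (uint32_t)p[3];
}
```

With the bytes 0x78, 0x56, 0x34, 0x12 at increasing addresses, the little-endian read yields 0x12345678 and the big-endian read yields 0x78563412.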
Composite Types
Aggregates (C “struct”)
An aggregate is aligned on the most-aligned component of the aggregate. The size of an aggregate is the smallest multiple of its alignment that is sufficient to hold all of its members.
These rules can be overridden by passing struct attributes to the compiler, such as:
- __attribute__((packed))
- __attribute__((aligned(x)))
- …
Unions
A union is aligned on the most-aligned component of the union. The size of a union is the smallest multiple of its alignment that is sufficient to hold its largest member.
Arrays
An array is aligned on the alignment of its base type. Its size is the size of its base type multiplied by the number of elements in the array.
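The size and alignment rules for aggregates, unions and arrays can be sketched as a small calculator. This is an illustration only: the type and function names are hypothetical, and the sketch assumes that aggregate members are placed in declaration order, each at the next offset that satisfies its own alignment, with no packed or aligned attributes applied.

```c
#include <assert.h>
#include <stddef.h>

/* Size and alignment of a type, in bytes. */
typedef struct {
    size_t size;
    size_t align;
} layout;

/* Round n up to the next multiple of align. */
size_t round_up(size_t n, size_t align) {
    assert(align != 0);
    return (n + align - 1) / align * align;
}

/* Aggregate: aligned on its most-aligned member; size is the smallest
 * multiple of that alignment that holds all members (members assumed to
 * be laid out in order, each padded to its own alignment). */
layout struct_layout(const layout *members, size_t count) {
    layout out = {0, 1};
    for (size_t i = 0; i < count; i++) {
        out.size = round_up(out.size, members[i].align) + members[i].size;
        if (members[i].align > out.align) {
            out.align = members[i].align;
        }
    }
    out.size = round_up(out.size, out.align);
    return out;
}

/* Union: aligned on its most-aligned member; size is the smallest
 * multiple of that alignment that holds its largest member. */
layout union_layout(const layout *members, size_t count) {
    layout out = {0, 1};
    for (size_t i = 0; i < count; i++) {
        if (members[i].size > out.size) {
            out.size = members[i].size;
        }
        if (members[i].align > out.align) {
            out.align = members[i].align;
        }
    }
    out.size = round_up(out.size, out.align);
    return out;
}

/* Array: alignment of the base type, size = base size * element count. */
layout array_layout(layout element, size_t count) {
    layout out = {element.size * count, element.align};
    return out;
}
```

Under these assumptions, an aggregate made of a char, a long long and a char occupies 24 bytes with 8-byte alignment, while a union of the same members occupies 8 bytes.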
Calling convention
Argument passing
Byte and half-word arguments are promoted to word arguments.
Word arguments are assigned into one of the 8 32-bit registers used for argument passing (r0 to r7 included).
Double-word arguments are assigned into one of the 4 64-bit registers used for argument passing (d0 to d6 included).
Composite-type arguments are always passed by reference.
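As an illustration, the comments in the following C sketch show where the arguments of a few simple functions would be placed under these rules. The functions are hypothetical examples. The register assignments assume that arguments are allocated to registers in declaration order, which the register table suggests (r0 is the "Argument 1 register") but which is an assumption for double-word arguments; functions mixing word and double-word arguments are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Three word-sized arguments: a in r0, b in r1, c in r2.
 * The char argument b is promoted to a word before being passed. */
int32_t sum3(int32_t a, char b, int32_t c) {
    return a + b + c;
}

/* Two double-word arguments: x in d0 (r0 & r1), y in d2 (r2 & r3),
 * assuming in-order allocation of the 64-bit argument registers. */
int64_t add64(int64_t x, int64_t y) {
    return x + y;
}

struct vec3 {
    int32_t x;
    int32_t y;
    int32_t z;
};

/* A composite argument is passed by reference. The pointer is written
 * out explicitly here; it is a 4-byte data pointer, so it travels as a
 * word argument (r0 under the in-order assumption). */
int32_t vec3_sum(const struct vec3 *v) {
    return v->x + v->y + v->z;
}
```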
Return value
Byte and half-word return values are promoted to word return values.
Word return values are assigned into r0.
Double-word return values are assigned into d0 (r0&r1).
Composite-type return values are always transformed into an argument passed by reference.
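The transformation of a composite return value can be pictured by writing the lowered form by hand. This is a sketch with hypothetical names. The position of the result pointer among the arguments is an assumption made for the example; the text above does not specify it.

```c
#include <assert.h>
#include <stdint.h>

struct pair {
    int32_t first;
    int32_t second;
};

/* Source-level form: the function returns a composite by value. */
struct pair make_pair(int32_t a, int32_t b) {
    struct pair p = {a, b};
    return p;
}

/* Hand-written equivalent of the lowered form: the caller provides the
 * storage and passes its address; the callee fills it in.
 * ASSUMPTION: the result pointer is shown as the first argument. */
void make_pair_lowered(struct pair *result, int32_t a, int32_t b) {
    result->first = a;
    result->second = b;
}
```

Both forms produce the same object; the difference is only in who owns the storage for the result during the call.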
Variable Argument
Byte and half-word arguments are promoted to word arguments.
Word, double-word and composite-type arguments are assigned to the stack, respecting their sizes and alignments.
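From the C side, variable arguments are read with the standard `<stdarg.h>` macros. The sketch below uses a hypothetical function to show the promotion rule in practice: byte and half-word values are read back as `int`, while a double-word value is read back with its own 8-byte type.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>

/* Sum `count` word-sized variable arguments, followed by one final
 * double-word argument. Byte and half-word arguments have been promoted
 * to words by the caller, so they are all read as int. */
int64_t sum_words_then_dword(int count, ...) {
    va_list ap;
    int64_t total = 0;

    va_start(ap, count);
    for (int i = 0; i < count; i++) {
        total += va_arg(ap, int);
    }
    total += va_arg(ap, int64_t); /* double-word argument */
    va_end(ap);

    return total;
}
```

A call such as `sum_words_then_dword(3, (char)1, (short)2, 3, (int64_t)10)` passes one byte, one half-word and one word, all of which arrive as words.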
Stack management
Each thread has its own stack, which is a contiguous area of memory used for storage of local variables and for passing additional arguments to subroutines when there are insufficient argument registers available.
The stack pointer is held in the register r22.
The allocation of memory on the stack is done by adding a positive value to the current stack pointer (full-ascending implementation). The stack pointer is always 8-byte aligned.
The size of the stack is statically defined. Writing into the stack of another thread results in undefined behavior; this can happen when a thread uses more stack memory than was initially defined for it.
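The allocation rule can be modelled on a host as follows. This is only a sketch with hypothetical names: it assumes that the reserved area starts at the current stack pointer value, and it adds an explicit limit check that illustrates the overflow hazard described above (the DPU itself is not stated to perform such a check).

```c
#include <assert.h>
#include <stdint.h>

#define DPU_STACK_ALIGN 8u

/* Reserve `size` bytes on an ascending stack.
 * `sp` models r22; `limit` is the first address past this thread's
 * stack, with limit >= *sp (a hypothetical bound used only here).
 * Returns the base address of the reserved area, or UINT32_MAX if the
 * request does not fit; in that case *sp is left unchanged. */
uint32_t stack_alloc(uint32_t *sp, uint32_t limit, uint32_t size) {
    assert(*sp % DPU_STACK_ALIGN == 0);
    assert(limit >= *sp);

    if (size > UINT32_MAX - (DPU_STACK_ALIGN - 1u)) {
        return UINT32_MAX;
    }
    /* Keep the stack pointer 8-byte aligned by rounding the size up. */
    uint32_t rounded = (size + DPU_STACK_ALIGN - 1u) & ~(DPU_STACK_ALIGN - 1u);
    if (rounded > limit - *sp) {
        return UINT32_MAX;
    }

    uint32_t base = *sp;
    *sp += rounded; /* allocation adds a positive value to the pointer */
    return base;
}
```

Reserving 12 bytes with the stack pointer at 0x100 moves it to 0x110, because the size is rounded up to 16 to preserve the 8-byte alignment.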
Stack organization
        0     |
              |
             ...
              |
              | ^
              | | Local variables of the current function
              | v
              | ^
              | | Arguments passed through the stack for the next functions
              | v
r22 - 8 ->    | ^
              | | stack pointer (r22) of the previous function
              | v
r22 - 4 ->    | ^
              | | return address (r23) of the current function
              | v
r22 ->        | ^
              | | Local variables of the next functions
              | v
              |
              v
Function Prologue & Epilogue
If the function is not a leaf function, the return address is saved on the stack, because it is overwritten when a sub-function is called.
The function should return with the same stack pointer it had when being called. For a non-leaf function, this means that the stack pointer should also be saved into the stack.
The easiest way of doing it is to save the return address (r23) and the stack pointer (r22) at the same time using the instruction sd r22, <somewhere_in_the_stack>, d22.
Then the stack pointer can be modified and will be restored in the epilogue with the return address using the instruction ld d22, r22, <where_d22_was_stored_in_the_prologue>.
For a leaf function that does not modify the stack pointer, neither of these operations is necessary.
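The save/restore sequence can be followed step by step with a host-side model. This sketch does not emit DPU instructions: the struct and function names are hypothetical, and the saved d22 value is kept in a plain 64-bit variable rather than at a specific stack offset, so the model makes no claim about where the two halves land in memory.

```c
#include <assert.h>
#include <stdint.h>

/* The two registers involved in the prologue and epilogue. */
typedef struct {
    uint32_t r22; /* stack pointer  */
    uint32_t r23; /* return address */
} dpu_regs;

/* Prologue model: capture d22 = (r22 & r23) in one 64-bit save, as the
 * `sd` instruction would, then grow the stack by `frame_size` bytes. */
uint64_t prologue(dpu_regs *regs, uint32_t frame_size) {
    assert(frame_size % 8u == 0); /* keep the stack pointer 8-byte aligned */
    uint64_t saved_d22 = ((uint64_t)regs->r22 << 32) | regs->r23;
    regs->r22 += frame_size;
    return saved_d22;
}

/* Calling a sub-function overwrites the return address register. */
void call_clobbers_return_address(dpu_regs *regs, uint32_t new_return_address) {
    regs->r23 = new_return_address;
}

/* Epilogue model: one 64-bit restore, as the `ld` instruction would,
 * brings back both the stack pointer and the return address. */
void epilogue(dpu_regs *regs, uint64_t saved_d22) {
    regs->r22 = (uint32_t)(saved_d22 >> 32);
    regs->r23 = (uint32_t)saved_d22;
}
```

After the epilogue, both registers hold the values they had on entry, regardless of what the called sub-functions did to r23.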
Other ABI details
Binary file format
DPU Binary files follow the standard 32-bit ELF format.
The EI_OSABI identification index is set to ELFOSABI_NONE (= 0).
The e_machine header field is set to EM_DPU (= 0xf5).
The e_flags header field is used to check the ABI version.
Bit 23 is set to indicate that bits 24 to 31 encode the ABI version.
Files need to have the same ABI version to be linked together.
The current ABI version is 2.
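A tool that inspects DPU binaries could decode the version from e_flags along these lines. The macro and function names are hypothetical. Treating a file without bit 23 set as "version unknown", and refusing to link it, is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define EM_DPU                  0xf5u       /* e_machine value */
#define DPU_EFLAGS_ABI_VALID    (1u << 23)  /* bits 24..31 hold the ABI version */
#define DPU_EFLAGS_ABI_SHIFT    24
#define DPU_ABI_VERSION_CURRENT 2

/* Return the ABI version encoded in e_flags, or -1 when bit 23 is not
 * set (ASSUMPTION: such a file carries no usable version). */
int dpu_abi_version(uint32_t e_flags) {
    if ((e_flags & DPU_EFLAGS_ABI_VALID) == 0) {
        return -1;
    }
    return (int)((e_flags >> DPU_EFLAGS_ABI_SHIFT) & 0xffu);
}

/* Two files can be linked together only if their ABI versions match. */
int dpu_can_link(uint32_t e_flags_a, uint32_t e_flags_b) {
    int version_a = dpu_abi_version(e_flags_a);
    int version_b = dpu_abi_version(e_flags_b);
    return version_a >= 0 && version_a == version_b;
}
```

For instance, an e_flags value of 0x02800000 has bit 23 set and encodes ABI version 2.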
The following relocations can be used in a DPU binary:
General relocations
Name        Value  Use
R_DPU_NONE  0      No relocation
Data relocations
Name      Value  Use
R_DPU_32  1      32-bit data
R_DPU_8   2      8-bit data
R_DPU_16  3      16-bit data
R_DPU_64  4      64-bit data
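In C, the general and data relocation types could be declared as follows. The enumerator values come from the tables above; the enum name and the helper function are hypothetical.

```c
#include <assert.h>

/* General and data relocation types with their numeric values. */
enum dpu_data_reloc {
    R_DPU_NONE = 0, /* no relocation */
    R_DPU_32   = 1, /* 32-bit data   */
    R_DPU_8    = 2, /* 8-bit data    */
    R_DPU_16   = 3, /* 16-bit data   */
    R_DPU_64   = 4  /* 64-bit data   */
};

/* Number of bytes patched by a data relocation, or 0 for R_DPU_NONE
 * and for values that are not data relocations. */
unsigned dpu_data_reloc_bytes(unsigned type) {
    switch (type) {
    case R_DPU_8:  return 1;
    case R_DPU_16: return 2;
    case R_DPU_32: return 4;
    case R_DPU_64: return 8;
    default:       return 0;
    }
}
```

Note that the numeric values do not follow the data width: R_DPU_32 comes first with value 1.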
Instruction relocations
Name                 Value  Use
R_DPU_PC             128    16-bit PC value as an instruction offset
R_DPU_IMM5           129    5-bit Shift/rotation immediate
R_DPU_IMM8_DMA       130    8-bit DMA transfer length
R_DPU_IMM24_PC       131    8-bit immediate when the instruction already has a PC argument
R_DPU_IMM27_PC       132    11-bit immediate when the OPCX instruction already has a PC argument and no destination
R_DPU_IMM28_PC_OPC8  133    12-bit immediate when the OPC8 instruction already has a PC argument and no destination
R_DPU_IMM8_STR       134    8-bit immediate to store in Store Immediate Byte instructions
R_DPU_IMM12_STR      135    12-bit address offset in Store Immediate instructions
R_DPU_IMM16_STR      136    16-bit immediate to store in Store Immediate Half/Word/Double instructions
R_DPU_IMM16_ATM      137    16-bit lock offset in Acquire/Release instructions
R_DPU_IMM24          138    24-bit immediate
R_DPU_IMM24_RB       139    24-bit address offset in Store Register instructions
R_DPU_IMM27          140    27-bit immediate for OPCX instructions with a Set Boolean condition
R_DPU_IMM28          141    28-bit immediate for OPC8 instructions with a Set Boolean condition
R_DPU_IMM32          142    32-bit immediate
R_DPU_IMM32_ZERO_RB  143    32-bit immediate when the instruction has no destination
R_DPU_IMM17_24       144    17-bit immediate sign-extended to 24-bit used in Safe Add/Sub instructions
R_DPU_IMM32_DUS_RB   145    32-bit immediate in 64-bit extended instructions (except Constant And extended instructions which use R_DPU_IMM32)
Instruction size
The IRAM can store up to 4096 instructions on the v1A DPU, or 3968 on the v1B DPU. In the IRAM, each instruction takes 6 bytes, but in the binary format each instruction is viewed as an 8-byte word.
When copying instructions from the MRAM to the IRAM, the DMA instruction expects a buffer where each instruction is encoded on 8 bytes.
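The two instruction sizes lead to different byte counts for the same program, which matters when sizing buffers. The following sketch works through the arithmetic; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DPU_IRAM_INSN_BYTES   6u    /* storage per instruction in the IRAM */
#define DPU_BINARY_INSN_BYTES 8u    /* size per instruction in the binary format */
#define DPU_V1A_IRAM_INSNS    4096u /* IRAM capacity on the v1A DPU */
#define DPU_V1B_IRAM_INSNS    3968u /* IRAM capacity on the v1B DPU */

/* Bytes occupied in the IRAM by `count` instructions. */
uint32_t dpu_iram_bytes(uint32_t count) {
    return count * DPU_IRAM_INSN_BYTES;
}

/* Bytes occupied by `count` instructions in the binary format, which is
 * also the buffer size expected when copying them from MRAM to IRAM. */
uint32_t dpu_binary_bytes(uint32_t count) {
    return count * DPU_BINARY_INSN_BYTES;
}
```

A full v1A program of 4096 instructions therefore occupies 24576 bytes in the IRAM but needs a 32768-byte buffer in the 8-byte encoding.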
Virtual address offset
The different memories have different virtual address offsets added to their symbol values:
Memory  Virtual address offset
WRAM    0x00000000
IRAM    0x80000000
MRAM    0x08000000
Atomic  0xF0000000
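A tool that computes symbol values could apply these offsets as sketched below. The enum and function names are hypothetical, and the sketch makes no assumption about the unit of the address within each memory.

```c
#include <assert.h>
#include <stdint.h>

/* The four memories that carry a virtual address offset. */
enum dpu_memory {
    DPU_MEM_WRAM,
    DPU_MEM_IRAM,
    DPU_MEM_MRAM,
    DPU_MEM_ATOMIC
};

/* Virtual address offset added to the symbols of a given memory. */
uint32_t dpu_virtual_offset(enum dpu_memory memory) {
    switch (memory) {
    case DPU_MEM_WRAM:   return 0x00000000u;
    case DPU_MEM_IRAM:   return 0x80000000u;
    case DPU_MEM_MRAM:   return 0x08000000u;
    case DPU_MEM_ATOMIC: return 0xF0000000u;
    }
    return 0u; /* not reached for valid enum values */
}

/* Symbol value for an address inside the given memory. */
uint32_t dpu_symbol_value(enum dpu_memory memory, uint32_t address) {
    return dpu_virtual_offset(memory) + address;
}
```

For example, a symbol located at address 0x100 of the MRAM has the value 0x08000100.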