Integrating assembly code with C programs
Integrating assembly code into an application is usually motivated by performance. It can be done at different levels of the program:
Taking advantage of DPU-specific instructions to optimize particular operations within the C code. This is achieved with built-ins.
Inlining assembly code, to insert a sequence of assembly instructions into the C flow. This is done with inline assembly.
Creating a dedicated assembly module to optimize a feature of the program. In this case, a separate assembly module must be written and linked with the program.
Built-in instructions
Every single DPU instruction is associated with a C function, defined in built_ins.h. Function names follow a strict format:
The name starts with __builtin_
It is followed by the assembly instruction name
And completed by an “argument” profile
The argument profile summarizes the instruction parameters, r standing for a register, s for a safe register, i for an immediate, k for a constant register, z for the constant register zero, e for an endianness, c for a condition and f for the false condition.
For example, the instruction add rc ra value is represented by the built-in function __builtin_add_rri.
Note that the built-in function name for a given instruction can be found in Assembler syntax, or with dpuasmdoc using the -details option.
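The naming rule above can be sketched with a small host-side helper. This is only an illustration: builtin_name is a hypothetical function that does not exist in built_ins.h, and it does not check that the resulting name matches a real instruction.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of built_ins.h): composes a built-in
 * function name as "__builtin_" + instruction name + "_" + profile.
 * Returns 0 on success, -1 if the destination buffer is too small.
 */
static int builtin_name(char *dst, size_t dst_size,
                        const char *mnemonic, const char *profile)
{
    int n = snprintf(dst, dst_size, "__builtin_%s_%s", mnemonic, profile);
    return (n >= 0 && (size_t)n < dst_size) ? 0 : -1;
}
```

With the mnemonic "add" and the profile "rri" (two registers followed by an immediate), the helper produces the name quoted above, __builtin_add_rri.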
Parameters to the built-in functions are either:
Variables, when the instruction parameter is a register
Strings representing values, when the instruction parameter is an immediate (e.g., "0x12" for an operand equal to 18)
Example
Let’s illustrate this type of usage with a DPU-specific instruction:
cmpb4 rc ra rb
x = For i=[0..3] if byte ra[i] == byte rb[i] then byte i is set to 1, 0 otherwise
rc = x
ZF <- x
#include <built_ins.h>

int main() {
    unsigned int a = 0x12345678;
    unsigned int b = 0xff34ff78;
    unsigned int res;

    __builtin_cmpb4_rrr(res, a, b);
    return res;
}
When running this program with dpu-lldb:
file builtin_cmpb4
breakpoint set --source-pattern-regexp "return"
process launch
frame variable/x res
exit
One can observe that the value returned by main is 0x00010001, the byte mask marking which bytes of a and b are equal:
(unsigned int) res = 0x00010001
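To cross-check this result without a DPU, the cmpb4 semantics described above can be modeled in portable C. The function below is a host-side sketch written from that description; cmpb4_model is a hypothetical name, and the model ignores the ZF flag update.

```c
#include <stdint.h>

/*
 * Host-side reference model of cmpb4 (a sketch, not the DPU implementation):
 * for each of the 4 bytes, result byte i is 1 if byte i of ra equals
 * byte i of rb, and 0 otherwise.
 */
static uint32_t cmpb4_model(uint32_t ra, uint32_t rb)
{
    uint32_t x = 0;
    for (unsigned int i = 0; i < 4; i++) {
        uint32_t byte_a = (ra >> (8 * i)) & 0xffu;
        uint32_t byte_b = (rb >> (8 * i)) & 0xffu;
        if (byte_a == byte_b) {
            x |= (uint32_t)1 << (8 * i);
        }
    }
    return x;
}
```

For a = 0x12345678 and b = 0xff34ff78, only bytes 0 (0x78) and 2 (0x34) match, so the model returns 0x00010001, the same value as reported by dpu-lldb.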
Compiler inline assembly
dpu-upmem-dpurte-clang supports inline assembly directives (__asm__) in the same way as clang (described in this document).
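As a minimal sketch, the function below wraps a single add instruction in an __asm__ statement using the standard clang operand syntax. Several points are assumptions rather than facts from this documentation: USE_DPU_INLINE_ASM is a hypothetical macro introduced only for this example, add_words is a hypothetical name, and the "r" register constraints and the "add rc, ra, rb" operand order are assumed to be accepted by the DPU back end. A plain C fallback keeps the sketch buildable on a host.

```c
#include <stdint.h>

/*
 * USE_DPU_INLINE_ASM is a hypothetical macro for this sketch: define it
 * (for example with -DUSE_DPU_INLINE_ASM) only when compiling for the DPU.
 */
static uint32_t add_words(uint32_t a, uint32_t b)
{
    uint32_t res;
#ifdef USE_DPU_INLINE_ASM
    /* Assumption: "add rc, ra, rb" with general-purpose register ("r")
     * constraints, as in standard clang inline assembly. */
    __asm__ volatile("add %0, %1, %2" : "=r"(res) : "r"(a), "r"(b));
#else
    /* Portable fallback so the sketch also builds and runs on a host. */
    res = a + b;
#endif
    return res;
}
```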
Specific modules in assembler
Assembly code longer than a few lines is much clearer when isolated in a dedicated module that defines functions which can be invoked from the C part of the program.
In this case, the code must comply with the DPU ABI.
The invoked function must be declared as global, using the .global (or equivalent .globl) assembly directive.
In the C part of the program, the function must be declared as extern and can then be invoked like any standard C function.
Example
In this example, an assembly function ror_buffer rotates every 32-bit word of a buffer by 8 bit positions to the right and stores the result into a target buffer:
// Assembly code integrated to C program.
// Illustrates the main constraints with such an approach.
// Rotates all the longwords of a buffer to one
// byte on the right.
// Uses the "ror" DPU instruction to do this.
//
// r0 contains the buffer address in WRAM
// r1 gives the buffer's length, in longwords.
// r2 contains the destination buffer's address in WRAM
//
// The function returns the last rotated word's value.
//
.globl ror_buffer
ror_buffer:
#define inbuf r0
#define nr_lw r1
#define outbuf r2
#define current_lw r3
    // for (nr_lw = nr_lw - 1; nr_lw >= 0; nr_lw--) ...
    add nr_lw, nr_lw, -1
for_every_lw:
    lw current_lw, inbuf, 0
    ror current_lw, current_lw, 8
    sw outbuf, 0, current_lw
    // Move the two pointers to the next position
    add inbuf, inbuf, 4
    add outbuf, outbuf, 4
    // Loopback until the counter is less than 0
    add nr_lw, nr_lw, -1, pl, for_every_lw
    // result is in r0
    move r0, current_lw
    jump r23
The main program, in C, creates a buffer and invokes this rotation routine:
// Integration of assembly code with C, using syscalls.
#include <stdint.h>
/*
 * An external assembly module provides a
 * function that rotates every 32-bit word in a buffer
 * by 8 positions to the right.
 *
 * @param inbuf address of the input buffer
 * @param len the buffer's length, in longwords
 * @param outbuf address of the output buffer
 * @return The last rotated word's value
 */
extern uint32_t ror_buffer(const uint32_t *inbuf, unsigned int len,
                           uint32_t *outbuf);

uint32_t inbuf[32], outbuf[32];

int main() {
    int i;

    for (i = 0; i < 32; i++) {
        inbuf[i] = i;
    }
    uint32_t result = ror_buffer(inbuf, 32, outbuf);
    return result;
}
Compile and link the files together to produce the executable:
dpu-upmem-dpurte-clang cFunc.c assemblyFunc.S -o ror_example
We can use dpu-lldb to validate the execution:
file ror_example
breakpoint set --source-pattern-regexp "return result;"
process launch
frame variable/x result
frame variable/x outbuf
exit
As expected, the output buffer contains the values 0, 1, 2, 3… each rotated by 8 bit positions to the right:
(uint32_t [32]) outbuf = {
[0] = 0x00000000
[1] = 0x01000000
[2] = 0x02000000
[3] = 0x03000000
[4] = 0x04000000
[5] = 0x05000000
[6] = 0x06000000
[7] = 0x07000000
[8] = 0x08000000
[9] = 0x09000000
[10] = 0x0a000000
[11] = 0x0b000000
[12] = 0x0c000000
[13] = 0x0d000000
[14] = 0x0e000000
[15] = 0x0f000000
[16] = 0x10000000
[17] = 0x11000000
[18] = 0x12000000
[19] = 0x13000000
[20] = 0x14000000
[21] = 0x15000000
[22] = 0x16000000
[23] = 0x17000000
[24] = 0x18000000
[25] = 0x19000000
[26] = 0x1a000000
[27] = 0x1b000000
[28] = 0x1c000000
[29] = 0x1d000000
[30] = 0x1e000000
[31] = 0x1f000000
}
And the returned value is the last output:
(uint32_t) result = 0x1f000000
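These values can also be cross-checked on a host with a portable C model of the routine. The sketch below is not the DPU code: ror32 and ror_buffer_model are hypothetical names. Like the assembly loop, which always executes its body at least once, the model assumes a length of at least one word.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits (valid for 0 < n < 32). */
static uint32_t ror32(uint32_t w, unsigned int n)
{
    return (w >> n) | (w << (32u - n));
}

/*
 * Host-side model of ror_buffer (hypothetical name, for illustration only).
 * Rotates each word of inbuf right by 8 bits into outbuf and returns the
 * last rotated word. Assumes len >= 1, like the assembly version.
 */
static uint32_t ror_buffer_model(const uint32_t *inbuf, unsigned int len,
                                 uint32_t *outbuf)
{
    uint32_t current = 0;
    for (unsigned int i = 0; i < len; i++) {
        current = ror32(inbuf[i], 8);
        outbuf[i] = current;
    }
    return current;
}
```

Filling the input buffer with 0 to 31, as in the C program above, gives outbuf[i] equal to i shifted left by 24 bits and a return value of 0x1f000000, matching the debugger output.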