Integrating assembly code with C programs

Assembly code is usually integrated into an application for performance reasons. This can be done at different levels of the program:

  • Taking advantage of specific DPU instructions to optimize individual operations within the C code. This is achieved with built-in functions.

  • Inlining assembly code, to integrate a sequence of assembly instructions into the flow of the C code. This is achieved with inline assembly.

  • Creating a dedicated assembly module to optimize a feature of the program. In this case, a specific assembly module must be written and linked with the program.

Built-in instructions

Every single DPU instruction is associated with a C function, defined in built_ins.h. Function names follow a strict format:

  • The name starts with __builtin_

  • It is followed by the assembly instruction name

  • It ends with an “argument profile”

The argument profile summarizes the instruction parameters: r stands for a register, s for a safe register, i for an immediate, k for a constant register, z for the constant register zero, e for an endianness, c for a condition and f for the false condition. For example, the instruction add rc ra value is represented by the built-in function __builtin_add_rri.

The built-in function name for a given instruction can be found in Assembler syntax, or obtained from dpuasmdoc with the -details option.
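The naming rule can be sketched in plain C. The two helpers below are hypothetical illustrations written for this page, not part of built_ins.h: one maps a profile letter to the operand kind listed above, the other assembles a name from an instruction and a profile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Maps one profile letter to the operand kind it stands for.
 * Returns NULL for a letter that is not part of the list. */
static const char *profile_letter_kind(char letter) {
  switch (letter) {
  case 'r': return "register";
  case 's': return "safe register";
  case 'i': return "immediate";
  case 'k': return "constant register";
  case 'z': return "constant register zero";
  case 'e': return "endianness";
  case 'c': return "condition";
  case 'f': return "false condition";
  default:  return NULL;
  }
}

/* Writes "__builtin_<instruction>_<profile>" into out.
 * Returns the value reported by snprintf (length of the full name). */
static int builtin_name(char *out, size_t size, const char *instruction,
                        const char *profile) {
  assert(out != NULL && size > 0);
  return snprintf(out, size, "__builtin_%s_%s", instruction, profile);
}
```

For instance, builtin_name(buf, sizeof buf, "add", "rri") fills buf with __builtin_add_rri, the name quoted above for add rc ra value.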

Parameters to the built-in functions are either:

  • Variables, when the instruction parameter is a register

  • Strings representing values when the instruction parameter is an immediate (e.g., "0x12" for an operand equal to 18)
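As a minimal sketch of these two parameter kinds, the function below adds the immediate 0x12 to a variable. The call shape (destination first, immediate passed as a string) is inferred from the naming rule and from the cmpb4 example in this section, so check it against built_ins.h before relying on it. USE_DPU_BUILTINS is a hypothetical macro introduced here so that the same file also builds on a host, where a plain C fallback is used.

```c
#include <stdint.h>

#ifdef USE_DPU_BUILTINS /* hypothetical macro: define it when building for the DPU */
#include <built_ins.h>
#endif

/* Returns a + 0x12 (that is, a + 18). */
static uint32_t add_0x12(uint32_t a) {
  uint32_t res;
#ifdef USE_DPU_BUILTINS
  /* Register operands are variables; the immediate is the string "0x12".
   * Argument order is assumed to follow "add rc ra value". */
  __builtin_add_rri(res, a, "0x12");
#else
  /* Host fallback performing the same arithmetic in plain C. */
  res = a + 0x12u;
#endif
  return res;
}
```

Keeping a plain C fallback next to the built-in call makes it possible to compare both paths when validating the optimized version.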

Example

Let’s illustrate this with a DPU-specific instruction:

cmpb4 rc ra rb
	x = For i=[0..3] if byte ra[i] == byte rb[i] then byte i is set to 1, 0 otherwise
	rc = x
	ZF <- x

The following program invokes the corresponding built-in function:

#include <built_ins.h>

int main() {
  unsigned int a = 0x12345678;
  unsigned int b = 0xff34ff78;
  unsigned int res;
  __builtin_cmpb4_rrr(res, a, b);
  return res;
}

When running this program with dpu-lldb:

file builtin_cmpb4
breakpoint set --source-pattern-regexp "return"
process launch
frame variable/x res
exit

The value of res, which main returns, is 0x00010001. Bytes 0 and 2 of a and b are equal (0x78 and 0x34), so the corresponding bytes of the mask are set to 1, while bytes 1 and 3 differ and are set to 0:

(unsigned int) res = 0x00010001
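The result can be cross-checked on a host with a portable C model of the instruction. This is a sketch written from the description of cmpb4 given above; cmpb4_model is a hypothetical name, and the model ignores the ZF flag.

```c
#include <stdint.h>

/* Host-side model of cmpb4: byte i of the result is 1 when byte i of a
 * equals byte i of b, and 0 otherwise. */
static uint32_t cmpb4_model(uint32_t a, uint32_t b) {
  uint32_t x = 0;
  for (int i = 0; i < 4; i++) {
    uint32_t byte_a = (a >> (8 * i)) & 0xffu;
    uint32_t byte_b = (b >> (8 * i)) & 0xffu;
    if (byte_a == byte_b) {
      x |= (uint32_t)1 << (8 * i); /* set byte i of the mask to 1 */
    }
  }
  return x;
}
```

Calling cmpb4_model(0x12345678, 0xff34ff78) returns 0x00010001, the value displayed by dpu-lldb.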

Compiler inline assembly

dpu-upmem-dpurte-clang supports inline assembly directives (__asm__) in the same way as clang (see the clang documentation).
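A minimal sketch of such a directive, using the ror instruction that appears later in this section, could look as follows. The "r" register constraints follow the usual clang extended-asm conventions and are assumed to apply to the DPU target. USE_DPU_ASM is a hypothetical macro introduced here so that the function also compiles on a host, where a plain C rotation is used.

```c
#include <stdint.h>

/* Rotates a 32-bit word by 8 bit positions to the right. */
static uint32_t ror8(uint32_t in) {
  uint32_t out;
#ifdef USE_DPU_ASM /* hypothetical macro: define it when building for the DPU */
  /* %0 is bound to the output register, %1 to the input register. */
  __asm__("ror %0, %1, 8" : "=r"(out) : "r"(in));
#else
  /* Host fallback: the same rotation written in plain C. */
  out = (in >> 8) | (in << 24);
#endif
  return out;
}
```

With the fallback path, ror8(0x12345678) returns 0x78123456.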

Specific modules in assembler

Assembly code longer than a few lines is much clearer when isolated in a dedicated module that defines functions callable from the C part of the program.

In this case, the code must comply with the DPU ABI.

The invoked function must be declared as global, using the .global assembly directive (or its synonym .globl, as in the example below).

On the C side, the function must be declared as extern and can then be invoked like any standard C function.

Example

In this example, an assembly function ror_buffer rotates every 32-bit word of a buffer by 8 bit positions to the right and stores the result into a target buffer:

// Assembly code integrated to C program.
// Illustrates the main constraints with such an approach.

// Rotates every longword of a buffer by one
// byte to the right.
// Uses the "ror" DPU instruction to do this.
//
// r0 contains the buffer address in WRAM
// r1 gives the buffer's length, in longwords.
// r2 contains the destination buffer's address in WRAM
//
// The function returns the last rotated word's value.
//
.globl ror_buffer
ror_buffer:
#define inbuf r0
#define nr_lw r1
#define outbuf r2
#define current_lw r3

  // for (nr_lw = nr_lw - 1; nr_lw >= 0; nr_lw--) ...
  add nr_lw, nr_lw, -1
for_every_lw:
  lw current_lw , inbuf, 0
  ror current_lw, current_lw, 8
  sw outbuf, 0 , current_lw
  // Move the two pointers to the next position
  add inbuf, inbuf, 4
  add outbuf, outbuf, 4
  // Loop back until the counter becomes negative
  add nr_lw, nr_lw, -1, pl, for_every_lw

  // result is in r0
  move r0, current_lw
  jump r23

The main program, in C, creates a buffer and invokes this rotation routine:

// Integration of an assembly module with C.

#include <stdint.h>

/*
 * An external assembly module provides a
 * function that rotates every 32-bit word in a buffer
 * by 8 bit positions to the right.
 *
 * @param inbuf  address of the input buffer
 * @param len    the buffer's length, in longwords
 * @param outbuf address of the output buffer
 * @return The last rotated word's value
 */
extern uint32_t ror_buffer(const uint32_t *inbuf, unsigned int len,
                           uint32_t *outbuf);

uint32_t inbuf[32], outbuf[32];

int main() {
  int i;
  for (i = 0; i < 32; i++) {
    inbuf[i] = i;
  }
  uint32_t result = ror_buffer(inbuf, 32, outbuf);
  return result;
}

Compile and link the files together to produce the executable:

dpu-upmem-dpurte-clang cFunc.c assemblyFunc.S -o ror_example

We can use dpu-lldb to validate the execution:

file ror_example
breakpoint set --source-pattern-regexp "return result;"
process launch
frame variable/x result
frame variable/x outbuf
exit

As expected, the output buffer contains the values 0, 1, 2, 3… each rotated by 8 bit positions to the right:

(uint32_t [32]) outbuf = {
  [0] = 0x00000000
  [1] = 0x01000000
  [2] = 0x02000000
  [3] = 0x03000000
  [4] = 0x04000000
  [5] = 0x05000000
  [6] = 0x06000000
  [7] = 0x07000000
  [8] = 0x08000000
  [9] = 0x09000000
  [10] = 0x0a000000
  [11] = 0x0b000000
  [12] = 0x0c000000
  [13] = 0x0d000000
  [14] = 0x0e000000
  [15] = 0x0f000000
  [16] = 0x10000000
  [17] = 0x11000000
  [18] = 0x12000000
  [19] = 0x13000000
  [20] = 0x14000000
  [21] = 0x15000000
  [22] = 0x16000000
  [23] = 0x17000000
  [24] = 0x18000000
  [25] = 0x19000000
  [26] = 0x1a000000
  [27] = 0x1b000000
  [28] = 0x1c000000
  [29] = 0x1d000000
  [30] = 0x1e000000
  [31] = 0x1f000000
}

The returned value is the last rotated word:

(uint32_t) result = 0x1f000000
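These values can be cross-checked against a portable C model of the routine. The sketch below is an assumption-laden reading of the assembly listing, not part of the SDK: ror_buffer_model is a hypothetical name, and the model requires len to be at least 1 because the assembly loop body appears to run once before its counter is tested.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of ror_buffer: rotates each 32-bit word of inbuf by
 * 8 bit positions to the right and stores it into outbuf.
 * Returns the last rotated word. */
static uint32_t ror_buffer_model(const uint32_t *inbuf, unsigned int len,
                                 uint32_t *outbuf) {
  assert(len >= 1);
  uint32_t current_lw = 0;
  for (unsigned int i = 0; i < len; i++) {
    current_lw = (inbuf[i] >> 8) | (inbuf[i] << 24);
    outbuf[i] = current_lw;
  }
  return current_lw;
}
```

Filling a 32-word input buffer with 0 to 31, as main does, and calling ror_buffer_model(inbuf, 32, outbuf) yields 0x1f000000, matching the value returned by the assembly routine.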