Tasklet management and synchronization

The Runtime Library provides an abstraction over the primitive threads, called tasklets. A thread is the hardware implementation of an execution thread, natively implemented by DPUs. A Tasklet is a software abstraction, providing the underlying associated hardware thread with some system abilities (like a stack).

Based on tasklet capabilities, the runtime library offers various synchronization primitives: mutexes, semaphores, barriers, and handshakes.

Tasklets

Tasklets are reducing the scope of execution of threads with the following constraints:

Boot: every tasklet is started at boot time to execute the main function

Synchronization: it cannot be stopped, but may suspend its execution, via synchronization primitives, such as mutual exclusions

Memory: it owns a specific space in the WRAM to store its execution stack

Notice that the tasklets share the same memory space. This means, in particular, that any global variable declared in the source code is accessible by any tasklet, whatever its scope is (static or not). In other words, the developer should use global variables with care, relying on synchronization primitives whenever potential concurrence occurs.

Every tasklet has a system name, which can be fetched by the function me(), defined in defs.h, returning a unique system name (sysname_t).

Let’s consider the following code (in tasklet_stack_check.c):

#include <stdio.h>
#include <defs.h>

int main() {
    printf("tasklet %u: stack = %u\n", me(), check_stack());
    return 0;
}

The check_stack function returns the remaining available size in the stack.

To define the number of tasklets of a program and the size of each tasklets stack, the user needs to compile the program with specific options:

NR_TASKLETS is used to define the number of tasklets.

STACK_SIZE_DEFAULT is used to define the size of the stack for all the tasklets which stack is not specified.

STACK_SIZE_TASKLET_<X> is used to define the size of the stack for a specific tasklet.

dpu-upmem-dpurte-clang -DNR_TASKLETS=3 -DSTACK_SIZE_DEFAULT=256 -DSTACK_SIZE_TASKLET_1=2048 -O2 -o tasklet_stack_check tasklet_stack_check.c

Note: Make sure to put those options before the -o option, otherwise they won’t be taken into account by the clang driver.

Now let’s execute this program with dpu-lldb:

file tasklet_stack_check
process launch
quit

We can now check that the right number of tasklets have been executed with their corresponding size of stack:

tasklet 0: stack = 232
tasklet 1: stack = 2024
tasklet 2: stack = 232

Note: The available stack size is less than the allocated value because of the memory space already taken by the main function.

Mutual exclusions

Mutual exclusions (mutexes) are the simplest and fastest way to define critical sections between the tasklets.

The Runtime Library defines a collection of functions in mutex.h to lock and unlock mutexes, first defined by the runtime configuration:

mutex_lock and mutex_unlock request for a lock and an unlock of mutual exclusion, respectively.

mutex_trylock which has the same behavior of mutex_lock except returns immediately if the lock is already taken

The declaration of mutexes is done with the MUTEX_INIT(name) macro, also available in mutex.h. name must be a C standard identifier.

Let’s illustrate this with an example: given two tasklets running concurrently, the first tasklet entering a critical section records its system name into a global variable protected by a mutex.

The C code is the following (in mutex_example.c):

#include "defs.h"
#include "mutex.h"

#define UNDEFINED_VAL (-1)
int shared_variable = UNDEFINED_VAL;

MUTEX_INIT(my_mutex);

int main() {
  mutex_lock(my_mutex);

  /* Tasklet 0 would set shared_variable to 1, tasklet 1 to 2, and so on... */
  if (shared_variable == UNDEFINED_VAL)
    shared_variable = 1 << me();

  mutex_unlock(my_mutex);

  return shared_variable;
}

To create the program:

dpu-upmem-dpurte-clang -DNR_TASKLETS=2 -o mutex_example mutex_example.c

Now let’s execute this program with dpu-lldb:

file mutex_example
process launch
exit

As a response you will get a message like:

exited with status = 1 (0x00000001)

From this information, we can see that the first tasklet that entered the critical section was tasklet number 0.

The v1A DPU provides at most 56 hardware mutexes, while the v1B DPU provides at most 64. But the Runtime Library also provides a software abstraction for mutual exclusions in vmutex.h. It enables to create a larger number of virtual mutexes. This can be used, for instance, to protect the access to a large array shared by a number of tasklets. This is illustrated in the following example where the tasklets are creating the histogram of an input array (vmutex_example.c):

#include <defs.h>
#include <mram.h>
#include <vmutex.h>

#define BUFFER_SIZE (1024 * 1024)
#define NR_ELEMENTS_HIST (1 << 8)
#define NR_ELEMENTS_PER_TASKLET (BUFFER_SIZE / NR_TASKLETS)

__mram_noinit uint8_t input_table[BUFFER_SIZE];
__mram uint64_t histogram[NR_ELEMENTS_HIST];

/**
 * Create a virtual mutex to protect each element of the histogram.
 * Only one hardware mutex is used.
 **/
VMUTEX_INIT(my_vmutex, NR_ELEMENTS_HIST, 1);

int main() {

  for (unsigned i = me() * NR_ELEMENTS_PER_TASKLET;
       i < (me() + 1) * NR_ELEMENTS_PER_TASKLET; ++i) {
    uint8_t elem = input_table[i];
    /**
     * Lock the virtual mutex with id 'elem'
     **/
    vmutex_lock(&my_vmutex, elem);
    histogram[elem]++;
    vmutex_unlock(&my_vmutex, elem);
  }
}

Virtual mutexes are declared using the VMUTEX_INIT(name, nb_vmutexes, nb_mutexes) macro. It creates nb_vmutexes virtual mutexes, using nb_mutexes hardware mutexes. Internally, each virtual mutex is a state bit in WRAM. The hardware mutexes are used to protect the access to the state bits. Note that the cost of acquiring or releasing a virtual mutex is considerably higher than for a harware mutex (around 10 instructions compared to one). However, it avoids any false conflicts in the critical section of the histogram example. In the same example, using one harware mutex to protect the histogram array is around 2 times slower, due to false conflicts in the critical section. For virtual mutexes, it is required that the number of hardware mutexes is a power of 2, and the number of virtual mutexes a multiple of 8.

Another way to ensure mutual exclusion in the histogram example is to use a pool of hardware mutexes. For instance, with a pool of 8 hardware mutexes, each element in the histogram is protected using the hardware mutex corresponding to its position modulo 8 (i.e., hardware mutex 0 protects the elements 0, 8, 16 … and hardware mutex 1 protects the elements 1, 9, 17 etc.). The Runtime Library also provides this implementation in mutex_pool.h as shown in the following code (mutex_pool_example.c):

#include <defs.h>
#include <mram.h>
#include <mutex_pool.h>

#define BUFFER_SIZE (1024 * 1024)
#define NR_ELEMENTS_HIST (1 << 8)
#define NR_ELEMENTS_PER_TASKLET (BUFFER_SIZE / NR_TASKLETS)

__mram_noinit uint8_t input_table[BUFFER_SIZE];
__mram uint64_t histogram[NR_ELEMENTS_HIST];

/**
 * Create a mutex pool of size 8 to protect access to the histogram.
 **/
MUTEX_POOL_INIT(my_mutex_pool, 8);

int main() {

  for (unsigned i = me() * NR_ELEMENTS_PER_TASKLET;
       i < (me() + 1) * NR_ELEMENTS_PER_TASKLET; ++i) {
    uint8_t elem = input_table[i];
    /**
     * Lock the element with id 'elem'.
     * The call to mutex_pool_lock is internally locking the
     * hardware mutex of id 'elem & 7'.
     **/
    mutex_pool_lock(&my_mutex_pool, elem);
    histogram[elem]++;
    mutex_pool_unlock(&my_mutex_pool, elem);
  }
}

Compared to using virtual mutexes, this solution requires more hardware mutexes and is also not free of false conflicts in the critical section. However, locking/unlocking is faster. Which solution will perform best depends on the length of the critical section and the probability of conflicts.

Semaphores

The Runtime Library also offers counting semaphore primitives for more sophisticated synchronizations. Please refer to Wikipedia for a “standard” definition of counting semaphores.

These are less efficient than mutexes and must be used if and only if the program needs a counter for the synchronization.

The Runtime Library defines a collection of functions in sem.h to take and give semaphores, first defined in the runtime configuration:

sem_take and sem_give decrement and increment the semaphore counter, respectively. If the counter is zero or less, sem_take suspends the invoking tasklet execution until another tasklet invokes sem_give

Semaphores are defined with the macro SEMAPHORE_INIT(name, counter), available in sem.h. name must be a C standard identifier. counter is the initial counter for the semaphore, and must be an 8-bit value.

A typical illustration of semaphore usage is the classical rendez-vous: uses a semaphore initialized to 0 to ensure that some consumer tasklets do not start before a producer tasklet performing some preliminary work.

In the following example, the producer sets the global variable result to !10 (equal to 0x375f00). The consumers wait for this result to be computed, then return this result, along with their respective tasklet name (in rendezvous_example.c):

#include <defs.h>
#include <sem.h>

SEMAPHORE_INIT(my_semaphore, 0);

int factorial_result = 0;

int factorial10() {
  int i = 10, f = 1;
  for (; i > 0; i--) {
    f = i * f;
  }
  return f;
}

int producer() {
  factorial_result = factorial10();
  sem_give(&my_semaphore);
  sem_give(&my_semaphore);

  int result_producer = (me() << 24) | factorial_result;
  return result_producer;
}

int consumer() {
  sem_take(&my_semaphore);

  int result_consumer = (me() << 24) | factorial_result;
  return result_consumer;
}

int (*tasks[])(void) = {producer, consumer, consumer};
int main() { return tasks[me()](); }

To create the program:

dpu-upmem-dpurte-clang -DNR_TASKLETS=3 -o rendezvous_example rendezvous_example.c

Now let’s execute this program with dpu-lldb:

file rendezvous_example
breakpoint set --source-pattern-regexp "return result"
target stop-hook add --one-liner "target variable/x factorial_result"
process launch
frame variable/x result_consumer
process continue
frame variable/x result_consumer
process continue
frame variable/x result_producer
process continue
exit

We set a breakpoint in the return line of the main and consumer functions. Each tasklet will stop on the return line, allowing us to have a look at the returned result and the factorial_result variable. To do that we have added a stop-hook that will print the content of the factorial_result variable each time dpu-lldb stops.

(int) factorial_result = 0x00375f00

And we used the command frame variable to print the returned result. Notice that frame variable is used to display local variables, and target variable is used to display global variables. Also /x is added to the command to print the value in hexadecimal.

By checking the result of each tasklet, users will see that the functions return !10 appended with the tasklet names((me() << 24) | result)

(int) result_consumer = 0x01375f00

(int) result_consumer = 0x02375f00

(int) result_producer = 0x00375f00

Barriers

Barriers are special synchronization primitives ensuring that a specific number of tasklets suspend on the same synchronization point before restarting.

When a tasklet calls barrier_wait, its execution is suspended until exactly N tasklets have called barrier_wait, where N is the specific number of tasklets for the specified barrier.

Barriers are defined with the macro BARRIER_INIT(name, counter), available in barrier.h. name must be a C standard identifier. counter is the initial counter for the barrier, and must be an 8-bit value.

The runtime service API defines the following functions in barrier.h to access barriers:

barrier_wait allows a tasklet to wait on a barrier

A typical barrier use case is the initialization of a program implying:

An initialization sequence

Some tasklets that can’t start until the initialization is complete

For example, let’s consider that the tasklet number 0 performs such an initialization sequence and the tasklets 1, 2, and 3 wait for this sequence to complete before running.

Based on this configuration, the following program computes the checksum of portions of an array of bytes (coefficients):

Every tasklet computes the sum of 32 consecutive bytes in coefficients, starting at an index equal to 32 times the tasklet’s name

The table is initialized by the first tasklet, so that coefficients[i] = i

The expected results are:

Tasklet

Expected result

0

0+1+2+3+4+…+31 = 0x1f0

1

32+33+34+…+63 = 0x5f0

2

64+65+66+…+91 = 0x9f0

3

92+93+94+..+127 = 0xdf0

The C code is the following (in barrier_example.c):

#include <barrier.h>
#include <defs.h>
#include <stdint.h>

BARRIER_INIT(my_barrier, NR_TASKLETS);

uint8_t coefficients[128];

/* Computes the sum of coefficients within a tasklet specific range. */
int compute_checksum() {
  int i, checksum = 0;
  for (i = 0; i < 32; i++)
    checksum += coefficients[(me() << 5) + i];
  return checksum;
}

/* Initializes the coefficient table. */
void setup_coefficients() {
  int i;
  for (i = 0; i < 128; i++) {
    coefficients[i] = i;
  }
}

/* The main thread initializes the table and joins the barrier to wake up the
 * other tasklets. */
int master() {
  setup_coefficients();
  barrier_wait(&my_barrier);
  int result = compute_checksum();
  return result;
}

/* Tasklets wait for the initialization to complete, then compute their
 * checksum. */
int slave() {
  barrier_wait(&my_barrier);
  int result = compute_checksum();
  return result;
}

int main() {
  return me() == 0 ? master(): slave();
}

To create the program:

dpu-upmem-dpurte-clang -DNR_TASKLETS=4 -o barrier_example barrier_example.c

Now let’s execute this program with dpu-lldb:

file barrier_example
breakpoint set --source-pattern-regexp "return result"
target stop-hook add --one-liner "frame variable/x result"
process launch
process continue
process continue
process continue
process continue
exit

As expected all the tasklets operated on a valid table of coefficients since they were properly synchronized on a common barrier.

Handshakes

Handshakes ease synchronization between two tasklets, usually to start one of them when the other has finished a specific task.

When a tasklet calls handshake_wait_for(notifier), its execution is suspended until the notifier tasklet calls handshake_notify(). If the notifier calls handshake_notify() before the other tasklet starts to wait, the notifier execution is suspended until a tasklet calls handshake_wait_for(notifier). If a tasklet calls handshake_wait_for(notifier) with an already waiting tasklet on this notifier, the second tasklet does not wait and an error is returned.

Handshakes do not need additional configuration to be enabled.

The runtime service API defines the following functions in handshake.h to access handshakes:

handshake_wait_for(sysname_t notifier) allows a tasklet to wait on another one and returns 0 in case of success

handshake_notify(void) allows a tasklet to notify a waiting tasklet

Let’s see a simple example:

We have three tasks: task0, task1, task2. The result of task2 should be returned to the host. task2 depends on the results of task0 and task1, which are independent. task0 and task2 will be executed by tasklet#0, while task1 will be executed by tasklet#1. A handshake will be used to synchronize task2 (tasklet#0) and task1 (tasklet#1).

The C code is the following (in handshake_example.c):

#include <handshake.h>
#include <stdint.h>
#include <defs.h>

#define TASKLET_1 1

uint32_t message;

static uint32_t complex_operation_0(uint32_t x) { return x + 1; }

static uint32_t complex_operation_1(uint32_t x) { return x * 3; }

static uint32_t complex_operation_2(uint32_t x, uint32_t y) { return y - x; }

uint32_t task2(uint32_t result0, uint32_t result1) {
  return complex_operation_2(result0, result1);
}

int task1() {
  uint32_t result;
  uint32_t param = 0x19;

  result = complex_operation_1(param);

  message = result;
  handshake_notify();

  return 0;
}

int task0() {
  uint32_t result;
  uint32_t param = 0x42;

  result = complex_operation_0(param);

  handshake_wait_for(TASKLET_1);
  result = task2(result, message);

  return result;
}

int (*tasks[])(void) = {task0, task1};
int main() {
  return tasks[me()]();
}

To create the program:

dpu-upmem-dpurte-clang -DNR_TASKLETS=2 -o handshake_example handshake_example.c

Now let’s execute this program with dpu-lldb:

file handshake_example
process launch
exit

Now let’s verify that task0 compute the expected result ((0x19 * 3) - (0x42 + 1) = 0x8) by looking at the exit status. It should print:

exited with status = 8 (0x00000008)