Memory management

The UPMEM DPU Runtime Library defines a specific usage of memories:

The WRAM is the execution memory for programs. This is where the stacks, global variables, etc. are placed, according to a pre-defined scheme.

Within the WRAM, the runtime configuration defines specific memory areas to implement a heap and shared memories.

The MRAM is seen as an “external peripheral”, whose access is simplified by Runtime Library functions.

From the programming perspective, this means that the Runtime Library defines primitives for:

WRAM management: dynamically allocating buffers in the WRAM

MRAM accesses: managing the transactions between the MRAM and the WRAM

WRAM Management

Tasklets can obtain buffers in memory for their own use, or access pre-reserved shared memories for collaborative work.

Heap allocation

Because general-purpose dynamic memory allocators are complex and have a large memory footprint, the Runtime Library instead implements simple mechanisms to allocate memory within the WRAM:

an incremental allocator

a fixed-size block allocator

a buddy allocator

Incremental Allocator

The Runtime Library organizes the WRAM so that the memory ends with a “free area”, left to programs. The size of this area depends on the amount of memory used by the program, in particular the total amount of stack needed to execute the tasklets.

Within this free area, a tasklet can dynamically request a buffer, which is reserved exclusively for its own use.

There is no “free” method: once allocated, a buffer remains the property of its owning tasklet until the program ends.

However, a reset function, mem_reset(), cleans up the heap when strictly necessary. In particular, if an application boots a DPU multiple times, the program should call it at startup to ensure that it restarts from a “fresh heap”.

To request a new buffer, a tasklet invokes mem_alloc(size_t size) (defined in alloc.h), returning a pointer to the newly allocated buffer. If the heap is full, the function puts the DPU in error.

The returned buffer address is aligned on 64 bits; the allocation procedure is multi-tasklet safe. The provided buffer is directly usable for transfers between the WRAM and the MRAM.
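The behaviour described above (8-byte-aligned buffers, no individual free, a global reset) can be modelled on a host machine with a few lines of portable C. This is only a sketch: the names model_heap_t, model_mem_alloc and model_mem_reset are hypothetical, the heap size is arbitrary, and the model returns NULL when the heap is full, whereas the text says the real mem_alloc puts the DPU in error. It does not reproduce the multi-tasklet safety of the real allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side model of an incremental ("bump") allocator.
 * Hypothetical names; this is not the UPMEM implementation. */
#define MODEL_HEAP_SIZE 1024 /* arbitrary size, multiple of 8 */

typedef struct {
    _Alignas(8) uint8_t heap[MODEL_HEAP_SIZE]; /* stands in for the WRAM free area */
    size_t next;                               /* offset of the first unused byte */
} model_heap_t;

/* Forget every allocation at once, so that the heap is "fresh" again. */
void model_mem_reset(model_heap_t *h) {
    assert(h != NULL);
    h->next = 0;
}

/* Return an 8-byte-aligned buffer, or NULL when the heap is exhausted. */
void *model_mem_alloc(model_heap_t *h, size_t size) {
    assert(h != NULL);
    size_t available = MODEL_HEAP_SIZE - h->next;
    if (size > available)
        return NULL;
    /* Round the size up so that `next` always stays a multiple of 8. */
    size_t rounded = (size + 7u) & ~(size_t)7u;
    void *buffer = &h->heap[h->next];
    h->next += rounded;
    return buffer;
}
```

Because the allocator only ever moves one offset forward, an allocation costs a handful of instructions, which is the reason there is no per-buffer free.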

Fixed-Size Block Allocator

A fixed-size block allocator allows the user to allocate and free blocks of fixed size. The size and number of the blocks are defined when allocating and initializing the allocator. All the needed functions are defined in fsb_allocator.h, but are also included in alloc.h.

To instantiate a new fixed-size block allocator, the user invokes fsb_alloc(unsigned int block_size, unsigned int nb_of_blocks), returning the newly created and initialized allocator. If the heap is full, the function puts the DPU in error.

To allocate a block, a tasklet invokes fsb_get(fsb_allocator_t allocator), returning a pointer to the allocated block (or NULL if no block is available). The returned block address is aligned on 64 bits; the allocation procedure is multi-tasklet safe. After allocating a block, the memory content is undefined.

To free an allocated block, a tasklet invokes fsb_free(fsb_allocator_t allocator, void* ptr). The procedure is multi-tasklet safe. Beware that nothing prevents an invalid pointer from being freed, or a block obtained from one allocator from being given back to another allocator. After freeing a block, the memory content is undefined.
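To make the get/free semantics more concrete, here is a host-side sketch of one common way to build a fixed-size block pool: free blocks are chained through their own first bytes. Everything in it (model_fsb_t, model_fsb_init, model_fsb_get, model_fsb_free, the block size and count) is hypothetical and chosen for illustration; the source does not describe how fsb_allocator.h is implemented, and this model has no multi-tasklet protection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_BLOCK_SIZE 16 /* multiple of 8, large enough to hold a pointer */
#define MODEL_NB_BLOCKS 4

_Static_assert(MODEL_BLOCK_SIZE >= sizeof(void *), "a free block must hold a link");

typedef struct {
    _Alignas(8) uint8_t storage[MODEL_NB_BLOCKS][MODEL_BLOCK_SIZE];
    void *free_list; /* first free block, or NULL when the pool is empty */
} model_fsb_t;

/* The link to the next free block lives inside the free block itself. */
void model_set_next(void *block, void *next) { memcpy(block, &next, sizeof next); }

void *model_get_next(const void *block) {
    void *next;
    memcpy(&next, block, sizeof next);
    return next;
}

/* Chain every block onto the free list. */
void model_fsb_init(model_fsb_t *a) {
    a->free_list = NULL;
    for (int i = MODEL_NB_BLOCKS - 1; i >= 0; i--) {
        model_set_next(a->storage[i], a->free_list);
        a->free_list = a->storage[i];
    }
}

/* Pop the first free block; NULL means no block is available. */
void *model_fsb_get(model_fsb_t *a) {
    void *block = a->free_list;
    if (block != NULL)
        a->free_list = model_get_next(block);
    return block;
}

/* Push a block back. Like the real allocator, nothing validates `block`. */
void model_fsb_free(model_fsb_t *a, void *block) {
    assert(block != NULL);
    model_set_next(block, a->free_list);
    a->free_list = block;
}
```

In this model, freeing a block overwrites its first bytes with the link, which is one concrete reason why the content of a block is undefined after a free and after the next get.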

The following example illustrates how the fixed-size block allocator can be used with a simplistic list implementation:

the allocator is defined

the list is populated with some data

some data is filtered out of the list

the sum of the remaining data is calculated

the list is cleaned

Next is the code achieving this task (in fsb_example.c):

#include <alloc.h>
#include <stddef.h>

#define BLOCK_SIZE (sizeof(list_t))
#define NB_OF_BLOCKS (1000)

typedef struct _list_t list_t;

struct _list_t {
  int data;
  list_t *next;
};

fsb_allocator_t allocator;

static void initialize_allocator() {
  allocator = fsb_alloc(BLOCK_SIZE, NB_OF_BLOCKS);
}

static list_t *add_head_data(list_t *list, int data) {
  list_t *new_data = fsb_get(allocator);

  if (new_data == NULL)
    return NULL;

  new_data->data = data;
  new_data->next = list;

  return new_data;
}

static list_t *populate_list() {
  list_t *list = NULL;

  list = add_head_data(list, 42);
  list = add_head_data(list, 1);
  list = add_head_data(list, -2);
  list = add_head_data(list, 13);
  list = add_head_data(list, 22);
  list = add_head_data(list, 10000);
  list = add_head_data(list, 0);
  list = add_head_data(list, 91);
  list = add_head_data(list, -45);
  list = add_head_data(list, 9);
  list = add_head_data(list, 0);

  return list;
}

static void clean_list(list_t *list) {
  list_t *current = list;

  while (current != NULL) {
    list_t *tmp = current;
    current = current->next;
    fsb_free(allocator, tmp);
  }
}

int main() {
  list_t *list;
  int result = 0;

  /* Allocator initialization */
  initialize_allocator();

  /* List initialization with some data */
  list = populate_list();

  list_t *current = list;
  list_t *previous = NULL;

  /* Filtering out data that is even */
  while (current != NULL) {
    if ((current->data % 2) == 0) {
      list_t *tmp = current;
      current = current->next;
      fsb_free(allocator, tmp);

      if (previous != NULL) {
        previous->next = current;
      } else {
        list = current;
      }
    } else {
      previous = current;
      current = current->next;
    }
  }

  current = list;

  /* Computing the sum of the data in the list */
  while (current != NULL) {
    result += current->data;
    current = current->next;
  }

  /* Cleaning the list */
  clean_list(list);

  return result;
}

The code is built to be executed by a single tasklet:

dpu-upmem-dpurte-clang fsb_example.c -o fsb_example

To validate that everything works, let’s debug the program with dpu-lldb:

file fsb_example
process launch
exit

The exit status of the process should be the sum of the remaining (odd) data, 69 (0x45):

exited with status = 69 (0x00000045)

Buddy Allocator

A buddy allocator uses a pre-allocated area in the heap to perform dynamic allocation and freeing of buffers, offering functions similar to standard malloc and free.

All the needed functions are defined in buddy_alloc.h, but are also included in alloc.h.

Any program that needs to use the buddy allocator must first allocate and initialize the buddy area in the heap by invoking buddy_init. Then the program can:

Allocate buffers, using buddy_alloc, with the following restrictions:

Allocated buffer size should not exceed 4096 bytes

Minimum size of allocated buffers is 32 bytes

Allocated buffers are automatically aligned on DMA-transfer size, so that they can be used to transfer data from/to MRAM

Free previously allocated buffers, using buddy_free
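The size restrictions above can be illustrated with a small host-side helper. It assumes the classic buddy scheme, in which every request is rounded up to a power-of-two block; the source only states the 32-byte minimum and the 4096-byte maximum, so the power-of-two rounding and the name model_buddy_block_size are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_BUDDY_MIN 32u   /* minimum buffer size stated in the text */
#define MODEL_BUDDY_MAX 4096u /* maximum buffer size stated in the text */

/* Size of the block a power-of-two buddy scheme would reserve for a
 * request, or 0 when the request exceeds the maximum. Hypothetical helper. */
size_t model_buddy_block_size(size_t requested) {
    if (requested > MODEL_BUDDY_MAX)
        return 0;
    size_t block = MODEL_BUDDY_MIN;
    while (block < requested)
        block *= 2; /* 32, 64, 128, ... */
    assert(block <= MODEL_BUDDY_MAX);
    return block;
}
```

Under these assumptions a 1-byte request still consumes a 32-byte block and a 33-byte request consumes 64 bytes, so many small allocations use up the buddy area faster than their nominal sizes suggest.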

The following example (in buddy_example.c) uses the buddy allocator to store temporary strings. The input is a list of cities and states represented in comma-separated values (CSV). The program builds a comma-separated string of the states found in this initial list.

#include <alloc.h>
#include <stdlib.h>
#include <string.h>

// List of CSV data, processed by the program
// A list of cities, along with their states.
static const char *city_list[] = {"New York,New York", "Los Angeles,California",
                                  "Chicago,Illinois",  "Houston,Texas",
                                  "Phoenix,Arizona",   NULL};

static char *get_state_name_from_csv(const char *csv) {
  char *state_name = strchr(csv, ',') + 1;
  int result_len = strlen(state_name) + 2;
  char *result = buddy_alloc(result_len);
  (void)strcpy(result, state_name);
  result[result_len - 2] = ',';
  result[result_len - 1] = '\0';
  return result;
}

static char *add_state_to_all_states(char *all_states, char *new_state) {
  int all_states_len = strlen(all_states);
  /* +1 for the terminating '\0' written by the second strcpy */
  int new_len = strlen(new_state) + all_states_len + 1;
  char *new_all_states = buddy_alloc(new_len);
  strcpy(new_all_states, all_states);
  strcpy(&new_all_states[all_states_len], new_state);

  buddy_free(all_states);
  buddy_free(new_state);
  return new_all_states;
}

int main() {
  int i;
  char *all_states;
  buddy_init(4096);

  all_states = buddy_alloc(1);
  *all_states = '\0';

  for (i = 0; city_list[i] != NULL; i++) {
    char *state_name = get_state_name_from_csv(city_list[i]);
    all_states = add_state_to_all_states(all_states, state_name);
  }

  return 0;
}

The code is built to be executed by a single tasklet:

dpu-upmem-dpurte-clang buddy_example.c -o buddy_example

To validate that everything works, let’s debug the program with dpu-lldb:

file buddy_example
breakpoint set --source-pattern-regexp "return 0;"
process launch
frame variable all_states
exit

The debugger should print:

"New York,California,Illinois,Texas,Arizona,"

MRAM Management

The MRAM management routines define a collection of functions that simplify transactions between the MRAM and the WRAM, taking into account the alignment and size constraints defined by the UPMEM DPU. They also define some useful functions that simplify the programming model involving such transactions.

These functions are grouped according to the level of abstraction they offer:

Low level accesses to the MRAM resources, via mram functions

Mapping of the MRAM onto the WRAM, using sequential readers (seqread)

MRAM variables

MRAM variables can be declared directly in the DPU program source code. Including mram.h gives access to three variable attributes:

__mram, which places the associated variable in MRAM.

__mram_noinit, which does the same as __mram but associates no initial value with the variable. This helps reduce the size of the DPU binary and the program loading time, notably when declaring big MRAM arrays.

__mram_ptr, which makes it possible to use a pointer to an MRAM variable or to declare an extern MRAM variable.

The DPU MRAM Heap Pointer

A special MRAM variable is defined in mram.h: DPU_MRAM_HEAP_POINTER. It defines the end of the memory range used by the MRAM variables. The range from DPU_MRAM_HEAP_POINTER to the end of the MRAM can be used freely by a DPU program, for example to handle dynamically-sized MRAM arrays.

Software cache

An MRAM variable can be accessed like any WRAM variable. When doing so, a pre-defined cache in WRAM is used to handle the MRAM transactions.

This model is very convenient for developers who want to focus first on the algorithmic part of their implementation and then address the memory transactions. However, the cost of such a cache can be significant: each access to an MRAM variable implies an MRAM transfer, which is much slower than a WRAM access. Using direct MRAM access can provide better results.

Implicit write access to MRAM variables is not multi-tasklet safe for data types smaller than 8 bytes (e.g., char, int). Indeed, such an access is decomposed into three operations: 1) read 8 bytes into the WRAM cache, 2) modify x bytes (x < 8), and 3) write 8 bytes back. When two tasklets try to write values within the same 8-byte location (such as two consecutive integers), a race condition may happen.
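The three-step decomposition can be replayed deterministically on a host to show how an update gets lost. The sketch below does not use the Runtime Library: model_word_t, model_read8, model_modify_write8 and model_racy_interleaving are hypothetical names, and the two "tasklets" are simply two private copies handled in the problematic order (both read before either writes back).

```c
#include <assert.h>
#include <stdint.h>

/* One 8-byte MRAM location holding two consecutive 32-bit integers. */
typedef struct {
    uint32_t v[2];
} model_word_t;

/* Step 1: read the whole 8-byte word into a private copy (the WRAM cache). */
model_word_t model_read8(const model_word_t *mram) { return *mram; }

/* Steps 2 and 3: modify one integer in the copy, then write 8 bytes back. */
void model_modify_write8(model_word_t *mram, model_word_t copy, int index,
                         uint32_t value) {
    assert(index == 0 || index == 1);
    copy.v[index] = value;
    *mram = copy;
}

/* Replays the bad interleaving: both tasklets read before either writes. */
model_word_t model_racy_interleaving(void) {
    model_word_t mram = {{0, 0}};
    model_word_t copy_a = model_read8(&mram);   /* tasklet A, step 1 */
    model_word_t copy_b = model_read8(&mram);   /* tasklet B, step 1 */
    model_modify_write8(&mram, copy_a, 0, 111); /* A stores v[0] = 111 */
    model_modify_write8(&mram, copy_b, 1, 222); /* B stores v[1] = 222, and
                                                   its stale v[0] = 0 too */
    return mram;
}
```

After this interleaving the word holds {0, 222}: tasklet A's write to v[0] has been overwritten by the stale copy that tasklet B wrote back, even though the two tasklets touched different integers.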

Example

These attributes can be used like so:

#include <alloc.h>
#include <mram.h>
#include <stdint.h>

/* Buffer in MRAM. */
uint32_t __mram_noinit mram_array[4];

uint32_t input[4] = {0, 2, 4, 6};
uint32_t output[4];

int main() {
  for (int i = 0; i < 4; ++i) {
    mram_array[i] = input[i];
  }

  for (int i = 0; i < 4; ++i) {
    output[i] = mram_array[i];
  }

  return 0;
}

The code is built to be executed:

dpu-upmem-dpurte-clang mram_variable_example.c -o mram_variable_example

To validate that everything works, let’s check the result in dpu-lldb:

file mram_variable_example
breakpoint set --source-pattern-regexp "return 0;"
process launch
target variable/x mram_array
target variable/x output
exit

Both variables should contain the data copied from the WRAM input array. For output, the debugger prints:

(uint32_t [4]) output = ([0] = 0x00000000, [1] = 0x00000002, [2] = 0x00000004, [3] = 0x00000006)

Direct access to the MRAM

The first collection of Runtime Library functions, defined in mram.h, performs transactions between the MRAM and the WRAM. The source and destination buffers must comply with the strict rules defined by DPUs:

The source or target address in WRAM must be aligned on 8 bytes.

The source or target address in MRAM must be aligned on 8 bytes. Developers must carefully respect this rule, since the Runtime Library does not perform any check on this point.

The size of the transfer must be a multiple of 8, at least equal to 8 and not greater than 2048.
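The three rules above can be written down as a predicate, which is convenient for debug assertions in code that computes addresses or sizes at run time. This is a plain C sketch with hypothetical names (model_transfer_is_valid, model_round_up_to_dma); it is not part of the Runtime Library, and addresses are passed as integers so that it can also be compiled on a host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_DMA_ALIGN 8u      /* alignment and size granularity, in bytes */
#define MODEL_DMA_MAX_SIZE 2048u /* largest single transfer, in bytes */

/* True when a transfer respects the alignment and size rules listed above. */
bool model_transfer_is_valid(uintptr_t wram_addr, uintptr_t mram_addr,
                             size_t nb_of_bytes) {
    return wram_addr % MODEL_DMA_ALIGN == 0 &&
           mram_addr % MODEL_DMA_ALIGN == 0 &&
           nb_of_bytes % MODEL_DMA_ALIGN == 0 &&
           nb_of_bytes >= MODEL_DMA_ALIGN &&
           nb_of_bytes <= MODEL_DMA_MAX_SIZE;
}

/* Round a byte count up to the next multiple of 8, for example to size a
 * WRAM buffer that must receive a payload whose length is not a multiple of 8. */
size_t model_round_up_to_dma(size_t nb_of_bytes) {
    assert(nb_of_bytes <= SIZE_MAX - (MODEL_DMA_ALIGN - 1));
    return (nb_of_bytes + (MODEL_DMA_ALIGN - 1)) & ~(size_t)(MODEL_DMA_ALIGN - 1);
}
```

For instance, a 13-byte record has to be moved as a 16-byte transfer, and a 4-byte or 2056-byte transfer is rejected by the predicate.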

The Runtime Library defines two “low level” functions to perform a copy:

From the WRAM to the MRAM (mram_write(const void *from, __mram_ptr void *to, unsigned int nb_of_bytes))

From the MRAM to the WRAM (mram_read(const __mram_ptr void *from, void *to, unsigned int nb_of_bytes))
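Since a single transfer cannot exceed 2048 bytes, copying a larger region means issuing several transfers. The host-side sketch below shows the chunking loop only: model_single_read is a hypothetical stand-in for one mram_read call (a memcpy plays that role here), and model_chunked_read is a hypothetical helper, not a Runtime Library function. It assumes the total size is a multiple of 8 and that both buffers are suitably aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MAX_TRANSFER 2048u /* largest single transfer, in bytes */

/* Stand-in for one low-level read; enforces the size rules of a transfer. */
void model_single_read(const uint8_t *from, uint8_t *to, size_t nb_of_bytes) {
    assert(nb_of_bytes >= 8 && nb_of_bytes <= MODEL_MAX_TRANSFER);
    assert(nb_of_bytes % 8 == 0);
    memcpy(to, from, nb_of_bytes);
}

/* Copy `total` bytes as a sequence of transfers of at most 2048 bytes.
 * Returns the number of transfers that were issued. */
unsigned model_chunked_read(const uint8_t *from, uint8_t *to, size_t total) {
    assert(total % 8 == 0);
    unsigned transfers = 0;
    size_t done = 0;
    while (done < total) {
        size_t chunk = total - done;
        if (chunk > MODEL_MAX_TRANSFER)
            chunk = MODEL_MAX_TRANSFER;
        model_single_read(from + done, to + done, chunk);
        done += chunk;
        transfers++;
    }
    return transfers;
}
```

With a 5000-byte region, the loop issues three transfers of 2048, 2048 and 904 bytes; every chunk stays a multiple of 8 because both the total and the maximum are.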

Notes

When possible, the compiler checks these constraints and triggers an error if they are not respected. This is not always possible. For example, with no optimization (i.e., -O0), no check is made. The same applies when the size argument is not a compile-time constant.

Example

The next example illustrates the WRAM/MRAM transactions with a simple copy:

A buffer (input) in WRAM is populated with a known pattern (each byte holds its own index)

The buffer is copied in an MRAM buffer, defined as an MRAM variable

The program reads back the MRAM at this location into a new buffer (output)

Notice that the WRAM buffers (input and output) are __dma_aligned. It is mandatory to use __dma_aligned for WRAM buffers used in direct MRAM access.

Next is the code achieving this task (in mram_example.c):

#include <mram.h>
#include <stdint.h>

#define BUFFER_SIZE 256

/* Buffer in MRAM. */
uint8_t __mram_noinit mram_array[BUFFER_SIZE];

int main() {
  /* A 256-byte buffer in WRAM, containing the initial data. */
  __dma_aligned uint8_t input[BUFFER_SIZE];
  /* The other buffer in WRAM, where data are copied back. */
  __dma_aligned uint8_t output[BUFFER_SIZE];

  /* Populate the initial buffer. */
  for (int i = 0; i < BUFFER_SIZE; i++)
    input[i] = i;
  mram_write(input, mram_array, sizeof(input));

  /* Copy back the data. */
  mram_read(mram_array, output, sizeof(output));
  for (int i = 0; i < BUFFER_SIZE; i++)
    if (i != output[i])
      return 1;

  return 0;
}

The code is built to be executed by a single tasklet:

dpu-upmem-dpurte-clang mram_example.c -o mram_example

To validate that everything works, let’s debug the program with dpu-lldb:

file mram_example
breakpoint set --source-pattern-regexp "return 0"
process launch
parray/x 20 &input[0]
parray/x 20 &output[0]
memory read `&mram_array` -c 32
exit

We can see that everything is at the expected values:

  (uint8_t) [0] = 0x00
  (uint8_t) [1] = 0x01
  (uint8_t) [2] = 0x02
  (uint8_t) [3] = 0x03
  (uint8_t) [4] = 0x04
  (uint8_t) [5] = 0x05
  (uint8_t) [6] = 0x06
  (uint8_t) [7] = 0x07
  (uint8_t) [8] = 0x08
  (uint8_t) [9] = 0x09
  (uint8_t) [10] = 0x0a
  (uint8_t) [11] = 0x0b
  (uint8_t) [12] = 0x0c
  (uint8_t) [13] = 0x0d
  (uint8_t) [14] = 0x0e
  (uint8_t) [15] = 0x0f
  (uint8_t) [16] = 0x10
  (uint8_t) [17] = 0x11
  (uint8_t) [18] = 0x12
  (uint8_t) [19] = 0x13

0x08000000: 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f  ................
0x08000010: 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f  ................

Notice that for lldb the MRAM starts at the address 0x08000000.

Sequential readers

The second collection of Runtime Library functions, defined in seqread.h, simplifies sequential reads of the MRAM. This abstraction uses a cache in WRAM to store temporary data from the MRAM, and a reader that tracks where the next element should be read in WRAM and MRAM. Moreover, the implementation has been optimized and provides better performance than checking the cache boundaries in standard C.

A sequential reader is managed by five functions:

seqread_alloc() to allocate the cache. The cache size is determined by the macro SEQREAD_CACHE_SIZE, which defaults to 256 but can be set to 32, 64, 128, 256, 512 or 1024.

seqread_init(seqreader_buffer_t *cache, __mram_ptr void *mram_addr, seqreader_t *reader) to initialize the reader, using the specified cache and starting at the specified MRAM address. It returns a pointer to the first value corresponding to the MRAM address.

seqread_get(void *ptr, uint32_t inc, seqreader_t *reader) to advance by the specified size and get a pointer to the next value for the specified reader.

seqread_seek(__mram_ptr void *mram_addr, seqreader_t *reader) to jump to an MRAM address for the specified reader. It returns a pointer to the first value corresponding to the MRAM address.

seqread_tell(void *ptr, seqreader_t *reader) to get the current MRAM address corresponding to the specified pointer (which should be a pointer in the cache specified during the initialization of the reader).

Notice that the sequential reader implementation is not thread-safe.
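The idea behind a sequential reader (a small cached window that slides over a larger memory) can be sketched on a host as follows. This is a simplified model under stated assumptions, not the seqread.h implementation: all names (model_reader_t, model_refill, model_seqread_init, model_seqread_get) are hypothetical, a byte array stands in for the MRAM, memcpy stands in for the MRAM transfer, and the window is two 64-byte pages so that a value starting in the first page is always fully cached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE 64u /* window granularity, a multiple of 8 */

typedef struct {
    const uint8_t *mram;           /* backing store standing in for the MRAM */
    size_t mram_size;              /* size of the backing store, in bytes */
    size_t window_start;           /* offset in `mram` of cache[0] */
    uint8_t cache[2 * MODEL_PAGE]; /* two pages: current one plus look-ahead */
} model_reader_t;

/* Load the window starting at `start`; bytes past the end read as 0. */
void model_refill(model_reader_t *r, size_t start) {
    size_t avail = start < r->mram_size ? r->mram_size - start : 0;
    size_t n = avail < sizeof r->cache ? avail : sizeof r->cache;
    r->window_start = start;
    memset(r->cache, 0, sizeof r->cache);
    if (n > 0)
        memcpy(r->cache, r->mram + start, n);
}

/* Position the reader at `offset` and return a pointer to that byte. */
const uint8_t *model_seqread_init(model_reader_t *r, const uint8_t *mram,
                                  size_t mram_size, size_t offset) {
    r->mram = mram;
    r->mram_size = mram_size;
    model_refill(r, offset - offset % MODEL_PAGE);
    return r->cache + offset % MODEL_PAGE;
}

/* Advance by `inc` bytes; slide the window when leaving the first page. */
const uint8_t *model_seqread_get(model_reader_t *r, const uint8_t *ptr,
                                 size_t inc) {
    assert(ptr >= r->cache && ptr < r->cache + MODEL_PAGE);
    assert(inc <= MODEL_PAGE);
    size_t pos = (size_t)(ptr - r->cache) + inc;
    if (pos >= MODEL_PAGE) {
        size_t absolute = r->window_start + pos;
        model_refill(r, absolute - absolute % MODEL_PAGE);
        pos = absolute % MODEL_PAGE;
    }
    return r->cache + pos;
}

/* Usage: sum `size` bytes through the reader, one byte at a time. */
uint32_t model_checksum(const uint8_t *mram, size_t size) {
    model_reader_t r;
    const uint8_t *p = model_seqread_init(&r, mram, size, 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < size; i++) {
        sum += *p;
        p = model_seqread_get(&r, p, 1);
    }
    return sum;
}
```

The caller only ever dereferences the returned pointer and asks for the next one; the boundary check and the refill are hidden inside the get function. The reader state is private to its owner in this model as well, which matches the warning above about sharing a reader between tasklets.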

Example

The following example illustrates a typical use of a sequential reader on a simple case. The goal is to compute the sum of some data placed in MRAM and to return it as the result of the program. Most of the work is reading the data from the MRAM: the processing of the data is trivial and the only output is the result value, so the sequential reader can be really effective here. The MRAM structure is the following:

The first 4 bytes in MRAM represent the buffer size N

The subsequent N bytes in MRAM contain the data for which the application requests a checksum computation

Next is the code placed in the DPU (in seqreader_example.c):

#include <mram.h>
#include <seqread.h>
#include <stdint.h>

#define DLEN (1 << 16)

/* 4-byte size header followed by up to DLEN bytes of data. */
__mram_noinit uint8_t buffer[sizeof(uint32_t) + DLEN];

int main() {
  unsigned int bytes_read = 0, buffer_size, checksum = 0;

  /* Cache for the sequential reader */
  seqreader_buffer_t local_cache = seqread_alloc();
  /* The reader */
  seqreader_t sr;
  /* The pointer where we will access the cached data */
  uint8_t *current_char = seqread_init(local_cache, buffer, &sr);

  buffer_size = *(uint32_t *)current_char;
  current_char = seqread_get(current_char, sizeof(uint32_t), &sr);

  while (bytes_read != buffer_size) {
    checksum += *current_char;
    bytes_read++;
    current_char = seqread_get(current_char, sizeof(*current_char), &sr);
  }

  return checksum;
}

The code is built to be executed:

dpu-upmem-dpurte-clang seqreader_example.c -o seqreader_example

To validate that everything works, let’s check the result in dpu-lldb, using a sample MRAM (sample.bin contains the MRAM image of 64KB of counting bytes):

file seqreader_example
breakpoint set --source-pattern-regexp "return"
process launch --stop-at-entry
memory write -i sample.bin `&buffer[0]`
process continue
frame variable/x checksum
exit

The result of the print should be the checksum of 64KB of counting bytes:

(unsigned int) checksum = 0x007f8000

Below is the code to generate sample.bin (in sampleGenerator.c):

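#include <stdint.h>
#include <stdio.h>

#define DLEN (1 << 16)

int main() {
  FILE *f = fopen("sample.bin", "wb");
  uint32_t size = DLEN; /* 4-byte size header read by the DPU program */
  int i, checksum = 0;
  if (f == NULL) {
    perror("sample.bin");
    return 1;
  }
  fwrite(&size, sizeof(size), 1, f);
  /* DLEN counting bytes: 0x00, 0x01, ..., 0xff, 0x00, ... */
  for (i = 0; i < DLEN; i++) {
    unsigned char ii = (unsigned char)i;
    fwrite(&ii, 1, 1, f);
    checksum += ii;
  }
  fclose(f);
  printf("checksum = %d = 0x%x\n", checksum, checksum);
  return 0;
}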