Memory management
The UPMEM DPU Runtime Library defines a specific usage of memories:
The WRAM is the execution memory for programs. This is where the stacks, global variables, etc. are placed, according to a pre-defined scheme.
Within the WRAM, the runtime configuration defines specific memory areas to implement a heap and shared memories.
The MRAM is seen as an “external peripheral”, whose access is simplified by Runtime Library functions.
From the programming perspective, this means that the Runtime Library defines primitives for:
the WRAM management: to dynamically manage the WRAM
the MRAM accesses: to manage the transactions between the MRAM and the WRAM
WRAM Management
Tasklets can obtain buffers in memory for their own use, or access pre-reserved shared memories for collaborative work.
Heap allocation
Due to the complexity and memory footprint required by dynamic memory allocators, the Runtime Library implements simple mechanisms to allocate memory within the WRAM:
an incremental allocator
a fixed-size block allocator
a buddy allocator
Incremental Allocator
The Runtime Library organizes the WRAM in such a way that the memory ends with a “free area”, left to programs. The area size depends on the amount of memory used by the program, in particular the total amount of stack needed to execute the tasklets.
Within this free area, a tasklet can dynamically request a buffer, which is exclusively reserved for its own usage.
There is no “free” method: once allocated, a buffer remains the property of its owning tasklet until the program ends.
However, a reset function, mem_reset(), cleans up the heap when strictly necessary. In particular, if a DPU is booted multiple times by an application, it shall be used at the beginning of the program to ensure that it restarts from a “fresh heap”.
To request a new buffer, a tasklet invokes mem_alloc(size_t size) (defined in alloc.h), returning a pointer to the newly allocated buffer.
If the heap is full, the function puts the DPU in error.
The returned buffer address is aligned on 64 bits; the allocation procedure is multi-tasklet safe. The provided buffer is directly usable for transfers between the WRAM and the MRAM.
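The behaviour described above (buffers carved from a free area, 64-bit alignment, no free, a global reset) can be pictured with a small host-side model in standard C. This is only a sketch, not the Runtime Library implementation: HEAP_SIZE, round_up_8, bump_alloc and bump_reset are hypothetical names, the model is not multi-tasklet safe, and it returns NULL where the real mem_alloc puts the DPU in error.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model of an incremental (bump) allocator.
 * None of these names belong to the Runtime Library. */
#define HEAP_SIZE 1024u

/* Backing storage declared as 64-bit words so that it is 8-byte aligned. */
static uint64_t heap_words[HEAP_SIZE / 8];
static size_t heap_top = 0; /* offset of the first free byte */

/* Round a size up to the next multiple of 8 to keep buffers 64-bit aligned. */
size_t round_up_8(size_t size) { return (size + 7u) & ~(size_t)7u; }

/* Return an 8-byte-aligned buffer, or NULL when the free area is exhausted.
 * (The real mem_alloc puts the DPU in error instead of returning NULL.) */
void *bump_alloc(size_t size) {
    if (size > HEAP_SIZE)
        return NULL;
    size_t rounded = round_up_8(size);
    if (rounded > HEAP_SIZE - heap_top)
        return NULL;
    void *buffer = (uint8_t *)heap_words + heap_top;
    heap_top += rounded;
    return buffer;
}

/* Model of a heap reset: every previously returned buffer becomes invalid. */
void bump_reset(void) { heap_top = 0; }
```

In this model a 3-byte request still consumes 8 bytes, which is why two consecutive buffers always start 8-byte aligned.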
Fixed-Size Block Allocator
A fixed-size block allocator allows the user to allocate and free blocks of fixed size. The size and number of the blocks are defined when allocating and initializing the allocator. All the needed functions are defined in fsb_allocator.h, but are also included in alloc.h.
To instantiate a new fixed-size block allocator, the user invokes fsb_alloc(unsigned int block_size, unsigned int nb_of_blocks), returning the newly
created and initialized allocator. If the heap is full, the function puts the DPU in error.
To allocate a block, a tasklet invokes fsb_get(fsb_allocator_t allocator), returning a pointer to the allocated block (or NULL if no block
is available). The returned block address is aligned on 64 bits; the allocation procedure is multi-tasklet safe. After allocating a block,
the memory content is undefined.
To free an allocated block, a tasklet invokes fsb_free(fsb_allocator_t allocator, void* ptr). The procedure is multi-tasklet safe.
Beware that nothing prevents an invalid pointer from being freed, or a block obtained from one allocator from being given back
to another allocator. After freeing a block, the memory content is undefined.
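To see why a freed block has undefined content and why a wrong pointer goes undetected, it can help to look at one common way of building such an allocator: a free list threaded through the free blocks themselves. The following host-side sketch assumes that technique; it is not taken from the Runtime Library, and pool_t, pool_init, pool_get and pool_free are hypothetical names. It is also not multi-tasklet safe.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model of a fixed-size block pool. */
#define POOL_BLOCK_SIZE 16u
#define POOL_NB_BLOCKS 4u

typedef union pool_block {
    union pool_block *next;           /* meaningful only while the block is free */
    uint64_t force_alignment;         /* keeps every block 8-byte aligned */
    uint8_t payload[POOL_BLOCK_SIZE]; /* user data once the block is allocated */
} pool_block_t;

typedef struct {
    pool_block_t blocks[POOL_NB_BLOCKS];
    pool_block_t *free_list; /* head of the list of free blocks */
} pool_t;

/* Link every block into the free list. */
void pool_init(pool_t *pool) {
    pool->free_list = NULL;
    for (size_t i = 0; i < POOL_NB_BLOCKS; i++) {
        pool->blocks[i].next = pool->free_list;
        pool->free_list = &pool->blocks[i];
    }
}

/* Pop a block from the free list, or return NULL if none is left. */
void *pool_get(pool_t *pool) {
    pool_block_t *block = pool->free_list;
    if (block != NULL)
        pool->free_list = block->next;
    return block;
}

/* Push a block back. The link overwrites the first bytes of the payload,
 * and nothing checks that ptr really belongs to this pool. */
void pool_free(pool_t *pool, void *ptr) {
    pool_block_t *block = ptr;
    block->next = pool->free_list;
    pool->free_list = block;
}
```

Because freeing only rewrites a link, handing a block to the wrong pool silently corrupts that pool's list in this model, which matches the warning above.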
The following example illustrates how the fixed-size block allocator can be used with a simplistic list implementation:
the allocator is defined
the list is populated with some data
some data is filtered out of the list
the sum of the remaining data is calculated
the list is cleaned
Next is the code achieving this task (in fsb_example.c):
#include <alloc.h>
#include <stddef.h>

#define BLOCK_SIZE (sizeof(list_t))
#define NB_OF_BLOCKS (1000)

typedef struct _list_t list_t;
struct _list_t {
    int data;
    list_t *next;
};

fsb_allocator_t allocator;

static void initialize_allocator() {
    allocator = fsb_alloc(BLOCK_SIZE, NB_OF_BLOCKS);
}

static list_t *add_head_data(list_t *list, int data) {
    list_t *new_data = fsb_get(allocator);

    if (new_data == NULL)
        return NULL;

    new_data->data = data;
    new_data->next = list;
    return new_data;
}

static list_t *populate_list() {
    list_t *list = NULL;

    list = add_head_data(list, 42);
    list = add_head_data(list, 1);
    list = add_head_data(list, -2);
    list = add_head_data(list, 13);
    list = add_head_data(list, 22);
    list = add_head_data(list, 10000);
    list = add_head_data(list, 0);
    list = add_head_data(list, 91);
    list = add_head_data(list, -45);
    list = add_head_data(list, 9);
    list = add_head_data(list, 0);

    return list;
}

static void clean_list(list_t *list) {
    list_t *current = list;

    while (current != NULL) {
        list_t *tmp = current;
        current = current->next;
        fsb_free(allocator, tmp);
    }
}

int main() {
    list_t *list;
    int result = 0;

    /* Allocator initialization */
    initialize_allocator();

    /* List initialization with some data */
    list = populate_list();

    list_t *current = list;
    list_t *previous = NULL;

    /* Filtering out data that is even */
    while (current != NULL) {
        if ((current->data % 2) == 0) {
            list_t *tmp = current;
            current = current->next;
            fsb_free(allocator, tmp);
            if (previous != NULL) {
                previous->next = current;
            } else {
                list = current;
            }
        } else {
            previous = current;
            current = current->next;
        }
    }

    current = list;

    /* Computing the sum of the data in the list */
    while (current != NULL) {
        result += current->data;
        current = current->next;
    }

    /* Cleaning the list */
    clean_list(list);

    return result;
}
The code is built to be executed by a single tasklet:
dpu-upmem-dpurte-clang fsb_example.c -o fsb_example
To validate that everything works, let’s debug the program with dpu-lldb:
file fsb_example
process launch
exit
The exit status of the process should print the sum of the data (0x45):
exited with status = 69 (0x00000045)
Buddy Allocator
A buddy allocator uses a pre-allocated area in the heap to perform dynamic allocation and freeing of buffers, offering functions similar to standard malloc and free.
All the needed functions are defined in buddy_alloc.h, but are also included in alloc.h.
Any program that needs to use the buddy allocator must first allocate and initialize the buddy area in the heap by invoking buddy_init. Then the program can:
Allocate buffers, using buddy_alloc, with the following restrictions:
Allocated buffer size should not exceed 4096 bytes
Minimum size of allocated buffers is 32 bytes
Allocated buffers are automatically aligned on DMA-transfer size, so that they can be used to transfer data from/to MRAM
Free previously allocated buffers, using buddy_free
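The size restrictions can be made concrete with a small helper. The sketch below assumes the classic buddy scheme, in which block sizes are powers of two; the text above only states the 32-byte minimum and the 4096-byte maximum, so the power-of-two rounding is an assumption, and buddy_block_size is a hypothetical helper rather than a Runtime Library function.

```c
#include <stddef.h>

#define BUDDY_MIN_BLOCK 32u   /* minimum allocated size stated above */
#define BUDDY_MAX_BLOCK 4096u /* maximum allocated size stated above */

/* Hypothetical helper: size of the block a classic buddy scheme would
 * reserve for a request, or 0 if the request exceeds the maximum.
 * Assumption: block sizes are powers of two between 32 and 4096 bytes. */
size_t buddy_block_size(size_t requested) {
    if (requested > BUDDY_MAX_BLOCK)
        return 0;
    size_t block = BUDDY_MIN_BLOCK;
    while (block < requested)
        block *= 2;
    return block;
}
```

Under that assumption, a 1-byte request still occupies a 32-byte block, and a 33-byte request occupies 64 bytes, so many small allocations consume the buddy area faster than their nominal sizes suggest.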
The following example uses the buddy allocator to store temporary strings. The input is a list of cities and states represented in comma-separated values (CSV). It logs the states found in this initial list.
#include <alloc.h>
#include <stdlib.h>
#include <string.h>

// List of CSV data, processed by the program
// A list of cities, along with their states.
static const char *city_list[] = {"New York,New York", "Los Angeles,California",
                                  "Chicago,Illinois", "Houston,Texas",
                                  "Phoenix,Arizona", NULL};

// Returns a newly allocated string holding the state name followed by a comma.
static char *get_state_name_from_csv(const char *csv) {
    char *state_name = strchr(csv, ',') + 1;
    // Room for the state name, the trailing comma and the terminating '\0'
    int result_len = strlen(state_name) + 2;
    char *result = buddy_alloc(result_len);

    (void)strcpy(result, state_name);
    result[result_len - 2] = ',';
    result[result_len - 1] = '\0';
    return result;
}

// Returns a newly allocated concatenation of both strings and frees the inputs.
static char *add_state_to_all_states(char *all_states, char *new_state) {
    int all_states_len = strlen(all_states);
    // One extra byte for the terminating '\0'
    int new_len = strlen(new_state) + all_states_len + 1;
    char *new_all_states = buddy_alloc(new_len);

    strcpy(new_all_states, all_states);
    strcpy(&new_all_states[all_states_len], new_state);
    buddy_free(all_states);
    buddy_free(new_state);
    return new_all_states;
}

int main() {
    int i;
    char *all_states;

    buddy_init(4096);
    all_states = buddy_alloc(1);
    *all_states = '\0';

    for (i = 0; city_list[i] != NULL; i++) {
        char *state_name = get_state_name_from_csv(city_list[i]);
        all_states = add_state_to_all_states(all_states, state_name);
    }

    return 0;
}
The code is built to be executed by a single tasklet:
dpu-upmem-dpurte-clang buddy_example.c -o buddy_example
To validate that everything works, let’s debug the program with dpu-lldb:
file buddy_example
breakpoint set --source-pattern-regexp "return 0;"
process launch
frame variable all_states
exit
The debugger should print:
"New York,California,Illinois,Texas,Arizona,"
MRAM Management
The MRAM management routines define a collection of functions that simplify transactions between the MRAM and the WRAM, taking into account the alignment and size constraints defined by the UPMEM DPU. They also define some useful functions that simplify the programming model involving such transactions.
These functions are grouped according to the level of abstraction they offer:
MRAM variables
MRAM variables can be declared directly in the DPU program source code. Including mram.h gives access to three variable attributes:
__mram, which places the associated variable in MRAM.
__mram_noinit, which does the same as __mram, except that no initial value is associated with the variable. This helps reduce the size of the DPU binary and the program loading time, notably when declaring big MRAM arrays.
__mram_ptr, which makes it possible to use a pointer to an MRAM variable or to declare an extern MRAM variable.
The DPU MRAM Heap Pointer
A special MRAM variable is defined in mram.h: DPU_MRAM_HEAP_POINTER.
It defines the end of the memory range used by the MRAM variables.
The range from DPU_MRAM_HEAP_POINTER to the end of the MRAM can be used freely by a DPU program,
for example to handle dynamically-sized MRAM arrays.
Software cache
An MRAM variable can be accessed like any WRAM variable. When doing so, a pre-defined cache in WRAM is used to handle the MRAM transactions.
This model is very convenient for developers who want to focus first on the algorithmic part of their implementation and then address the memory transactions. However, the cost of such a cache can be significant. Indeed, each access to an MRAM variable will imply an MRAM transfer, which is much slower than a WRAM access. Using direct MRAM access can provide better results.
Implicit write access to MRAM variables is not multi-tasklet safe for data types smaller than 8 bytes (e.g., char, int). Indeed, such an access is decomposed into three operations: 1) read 8 bytes into the WRAM cache, 2) modify x bytes (x < 8), and 3) write 8 bytes back. When two tasklets try to write values within the same 8-byte location (such as two consecutive integers), a race condition may occur.
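The three-step decomposition can be replayed on the host to show how an update gets lost. The sketch below is a deterministic model in standard C, not DPU code: mram_word_t and the rmw_* functions are hypothetical names, and each tasklet is assumed to work on its own cached copy of the 8-byte location.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model: one 8-byte MRAM location holding two consecutive
 * 32-bit integers (slot 0 and slot 1). */
typedef struct {
    uint8_t bytes[8];
} mram_word_t;

/* Step 1: read the whole 8-byte location into a private cached copy. */
mram_word_t rmw_read(const mram_word_t *mram) { return *mram; }

/* Step 2: modify only 4 of the 8 cached bytes. */
void rmw_modify(mram_word_t *cached, unsigned slot, uint32_t value) {
    memcpy(&cached->bytes[slot * 4], &value, sizeof(value));
}

/* Step 3: write the whole 8-byte location back. */
void rmw_write(mram_word_t *mram, const mram_word_t *cached) { *mram = *cached; }

/* Read one of the two integers stored in the location. */
uint32_t slot_value(const mram_word_t *mram, unsigned slot) {
    uint32_t value;
    memcpy(&value, &mram->bytes[slot * 4], sizeof(value));
    return value;
}

/* One possible interleaving: both tasklets read before either writes, so
 * the second write-back restores stale bytes over the first update. */
void racy_interleaving(mram_word_t *mram) {
    mram_word_t cache_a = rmw_read(mram);  /* tasklet A, step 1 */
    mram_word_t cache_b = rmw_read(mram);  /* tasklet B, step 1 */
    rmw_modify(&cache_a, 0, 0x11111111u);  /* tasklet A, step 2 */
    rmw_modify(&cache_b, 1, 0x22222222u);  /* tasklet B, step 2 */
    rmw_write(mram, &cache_a);             /* tasklet A, step 3 */
    rmw_write(mram, &cache_b);             /* tasklet B, step 3 */
}
```

Starting from a zeroed location, this interleaving leaves slot 1 updated while slot 0 is back to zero: tasklet A's write has been lost even though the two tasklets never touched the same integer.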
Example
These attributes can be used like so:
#include <alloc.h>
#include <mram.h>
#include <stdint.h>

/* Buffer in MRAM. */
uint32_t __mram_noinit mram_array[4];

uint32_t input[4] = {0, 2, 4, 6};
uint32_t output[4];

int main() {
    for (int i = 0; i < 4; ++i) {
        mram_array[i] = input[i];
    }

    for (int i = 0; i < 4; ++i) {
        output[i] = mram_array[i];
    }

    return 0;
}
The code is built to be executed:
dpu-upmem-dpurte-clang mram_variable_example.c -o mram_variable_example
To validate that everything works, let’s check the result in dpu-lldb:
file mram_variable_example
breakpoint set --source-pattern-regexp "return 0;"
process launch
target variable/x mram_array
target variable/x output
exit
The result of the print should be the data stored from the WRAM input array:
(uint32_t [4]) output = ([0] = 0x00000000, [1] = 0x00000002, [2] = 0x00000004, [3] = 0x00000006)
Direct access to the MRAM
The first collection of Runtime Library functions, defined in mram.h, performs transactions between the MRAM and the WRAM. The source and destination buffers must comply with the strict rules defined by DPUs:
The source or target address in WRAM must be aligned on 8 bytes.
The source or target address in MRAM must be aligned on 8 bytes. Developers must carefully respect this rule, since the Runtime Library does not perform any check regarding this point.
The size of the transfer must be a multiple of 8, at least equal to 8 and not greater than 2048.
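The three rules can be written down as a single predicate. The sketch below is a host-side helper in standard C that takes addresses as plain integers; transfer_is_valid is a hypothetical name, not a Runtime Library function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true when a WRAM/MRAM transfer respects the rules
 * listed above (both addresses 8-byte aligned, size a multiple of 8,
 * between 8 and 2048 bytes inclusive). */
bool transfer_is_valid(uintptr_t wram_addr, uintptr_t mram_addr, size_t nb_of_bytes) {
    if (wram_addr % 8 != 0 || mram_addr % 8 != 0)
        return false;
    if (nb_of_bytes % 8 != 0)
        return false;
    return nb_of_bytes >= 8 && nb_of_bytes <= 2048;
}
```

A check of this kind can be handy in debug builds or host-side unit tests, since a misaligned MRAM address is not detected at run time.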
The Runtime Library defines two “low level” functions to perform a copy:
From the WRAM to the MRAM (mram_write(const void *from, __mram_ptr void *to, unsigned int nb_of_bytes))
From the MRAM to the WRAM (mram_read(const __mram_ptr void *from, void *to, unsigned int nb_of_bytes))
Notes
When possible, the compiler will try and check the different constraints, triggering an error if they are not respected.
This is not always possible. For example, with no optimization (i.e., -O0), no check is made. It is also the case when
the size argument is not a compile-time constant.
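Because a single transfer cannot exceed 2048 bytes, a larger buffer has to be moved in several calls. The following host-side sketch shows one way to split the work; single_transfer stands in for one low-level transfer (on a DPU this would be mram_read or mram_write) and is implemented with memcpy here, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_TRANSFER_SIZE 2048u /* upper bound of one transfer, as stated above */

/* Stand-in for one low-level transfer; memcpy replaces the DPU primitive. */
static void single_transfer(const uint8_t *from, uint8_t *to, size_t nb_of_bytes) {
    memcpy(to, from, nb_of_bytes);
}

/* Copy total bytes using transfers of at most MAX_TRANSFER_SIZE bytes.
 * Returns the number of transfers issued, or 0 if total is not a non-zero
 * multiple of 8 (in which case nothing is copied). */
size_t chunked_transfer(const uint8_t *from, uint8_t *to, size_t total) {
    if (total == 0 || total % 8 != 0)
        return 0;
    size_t transfers = 0;
    for (size_t offset = 0; offset < total; offset += MAX_TRANSFER_SIZE) {
        size_t remaining = total - offset;
        size_t chunk = remaining < MAX_TRANSFER_SIZE ? remaining : MAX_TRANSFER_SIZE;
        single_transfer(from + offset, to + offset, chunk);
        transfers++;
    }
    return transfers;
}
```

Since both the total size and the chunk limit are multiples of 8, every chunk, including the last one, is itself a multiple of 8 and at least 8 bytes long. In the DPU case the loop bound is a run-time value, so the compile-time checks mentioned above would not apply to it.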
Example
The next example illustrates the WRAM/MRAM transactions with a simple copy:
A buffer (input) in WRAM is populated with well-known data (a byte count)
The buffer is copied into an MRAM buffer, defined as an MRAM variable
The program reads back the MRAM at this location into a new buffer (output)
Notice that the WRAM buffers (input and output) are __dma_aligned.
It is mandatory to use __dma_aligned for WRAM buffers used in direct MRAM access.
Next is the code achieving this task (in mram_example.c):
#include <mram.h>
#include <stdint.h>

#define BUFFER_SIZE 256

/* Buffer in MRAM. */
uint8_t __mram_noinit mram_array[BUFFER_SIZE];

int main() {
    /* A 256-byte buffer in WRAM, containing the initial data. */
    __dma_aligned uint8_t input[BUFFER_SIZE];
    /* The other buffer in WRAM, where data are copied back. */
    __dma_aligned uint8_t output[BUFFER_SIZE];

    /* Populate the initial buffer. */
    for (int i = 0; i < BUFFER_SIZE; i++)
        input[i] = i;
    mram_write(input, mram_array, sizeof(input));

    /* Copy back the data. */
    mram_read(mram_array, output, sizeof(output));
    for (int i = 0; i < BUFFER_SIZE; i++)
        if (i != output[i])
            return 1;

    return 0;
}
The code is built to be executed by a single tasklet:
dpu-upmem-dpurte-clang mram_example.c -o mram_example
To validate that everything works, let’s debug the program with dpu-lldb:
file mram_example
breakpoint set --source-pattern-regexp "return 0"
process launch
parray/x 20 &input[0]
parray/x 20 &output[0]
memory read `&mram_array` -c 32
exit
We can see that everything is at the expected values:
(uint8_t) [0] = 0x00
(uint8_t) [1] = 0x01
(uint8_t) [2] = 0x02
(uint8_t) [3] = 0x03
(uint8_t) [4] = 0x04
(uint8_t) [5] = 0x05
(uint8_t) [6] = 0x06
(uint8_t) [7] = 0x07
(uint8_t) [8] = 0x08
(uint8_t) [9] = 0x09
(uint8_t) [10] = 0x0a
(uint8_t) [11] = 0x0b
(uint8_t) [12] = 0x0c
(uint8_t) [13] = 0x0d
(uint8_t) [14] = 0x0e
(uint8_t) [15] = 0x0f
(uint8_t) [16] = 0x10
(uint8_t) [17] = 0x11
(uint8_t) [18] = 0x12
(uint8_t) [19] = 0x13
0x08000000: 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f ................
0x08000010: 10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f ................
Notice that for lldb the MRAM starts at the address 0x08000000.
Sequential readers
The third collection of Runtime Library functions, defined in seqread.h, simplifies sequential reads of the MRAM. This abstraction uses a cache in WRAM to store temporary data from the MRAM, and a reader to track where the next element should be read in WRAM and MRAM. Moreover, the implementation has been optimized and provides better performance than a standard C check of the cache boundaries.
A sequential reader is managed by five functions:
seqread_alloc() to allocate the cache. The cache size is determined by the macro SEQREAD_CACHE_SIZE, which is defined as 256 by default, but can be set to 32, 64, 128, 256, 512 or 1024.
seqread_init(seqreader_buffer_t *cache, __mram_ptr void *mram_addr, seqreader_t *reader) to initialize the reader, using the specified cache, starting at the specified MRAM address, and returning a pointer to the first value corresponding to the MRAM address.
seqread_get(void *ptr, uint32_t inc, seqreader_t *reader) to get the next value of the specified size for the specified reader.
seqread_seek(__mram_ptr void *mram_addr, seqreader_t *reader) to jump to an MRAM address for the specified reader, returning a pointer to the first value corresponding to the MRAM address.
seqread_tell(void *ptr, seqreader_t *reader) to get the current MRAM address corresponding to the specified pointer (which should be a pointer in the cache specified during the initialization of the reader).
Notice that the sequential reader implementation is not thread-safe.
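The interplay between the cache, the reader and the returned pointer can be illustrated with a simplified host-side model. This is a sketch under stated assumptions, not the seqread.h implementation: model_reader_t and the model_* functions are hypothetical names, the MRAM is simulated by a byte array, and an element is assumed never to straddle two cache windows.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SIZE 256u /* mirrors the default SEQREAD_CACHE_SIZE */

/* Hypothetical host-side model of a sequential reader. */
typedef struct {
    const uint8_t *mram;       /* simulated MRAM contents */
    size_t mram_size;          /* number of valid bytes in the simulated MRAM */
    size_t window_start;       /* MRAM offset currently held at cache[0] */
    uint8_t cache[CACHE_SIZE]; /* WRAM-side cache window */
} model_reader_t;

/* Load the cache window that contains mram_offset. */
static void model_fill(model_reader_t *reader, size_t mram_offset) {
    reader->window_start = mram_offset - (mram_offset % CACHE_SIZE);
    if (reader->window_start < reader->mram_size) {
        size_t available = reader->mram_size - reader->window_start;
        size_t length = available < CACHE_SIZE ? available : CACHE_SIZE;
        memcpy(reader->cache, reader->mram + reader->window_start, length);
    }
}

/* Position the reader and return a pointer, inside the cache, to the byte
 * stored at mram_offset. */
uint8_t *model_init(model_reader_t *reader, const uint8_t *mram, size_t mram_size,
                    size_t mram_offset) {
    reader->mram = mram;
    reader->mram_size = mram_size;
    model_fill(reader, mram_offset);
    return &reader->cache[mram_offset % CACHE_SIZE];
}

/* Advance by inc bytes, reloading the window when the cache is exhausted. */
uint8_t *model_get(model_reader_t *reader, uint8_t *ptr, size_t inc) {
    size_t next = reader->window_start + (size_t)(ptr - reader->cache) + inc;
    if (next - reader->window_start >= CACHE_SIZE)
        model_fill(reader, next);
    return &reader->cache[next % CACHE_SIZE];
}

/* MRAM offset corresponding to a pointer inside the cache. */
size_t model_tell(const model_reader_t *reader, const uint8_t *ptr) {
    return reader->window_start + (size_t)(ptr - reader->cache);
}
```

The caller only ever dereferences pointers into the cache; the model reloads the window transparently, which is the service the real sequential reader provides with an optimized boundary check.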
Example
The following example illustrates a typical use of a sequential reader on a simple case. The goal is to compute the sum of some data placed in MRAM and to store the result at the start of the MRAM. The main task here is to read the data from the MRAM, the processing of the data being trivial and the write-back consisting only of the result value. Thus the sequential reader can be really effective here. The MRAM structure is the following:
The first 4 bytes in MRAM represent the buffer size N
The subsequent N bytes in MRAM contain the data for which the application requests a checksum computation
Next is the code placed in the DPU (in seqreader_example.c):
#include <mram.h>
#include <seqread.h>
#include <stdint.h>

#define DLEN (1 << 16)

__mram_noinit uint8_t buffer[DLEN];

int main() {
    unsigned int bytes_read = 0, buffer_size, checksum = 0;

    /* Cache for the sequential reader */
    seqreader_buffer_t local_cache = seqread_alloc();
    /* The reader */
    seqreader_t sr;

    /* The pointer where we will access the cached data */
    uint8_t *current_char = seqread_init(local_cache, buffer, &sr);

    buffer_size = *(uint32_t *)current_char;
    current_char = seqread_get(current_char, sizeof(uint32_t), &sr);

    while (bytes_read != buffer_size) {
        checksum += *current_char;
        bytes_read++;
        current_char = seqread_get(current_char, sizeof(*current_char), &sr);
    }

    return checksum;
}
The code is built to be executed:
dpu-upmem-dpurte-clang seqreader_example.c -o seqreader_example
To validate that everything works, let’s check the result in dpu-lldb, using a sample MRAM (sample.bin contains the MRAM image of 64KB of counting bytes):
file seqreader_example
breakpoint set --source-pattern-regexp "return"
process launch --stop-at-entry
memory write -i sample.bin `&buffer[0]`
process continue
frame variable/x checksum
exit
The result of the print should be the checksum of 64KB of counting bytes:
(unsigned int) checksum = 0x007f8000
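The expected value can be cross-checked by hand: 64KB of counting bytes is 256 repetitions of the sequence 0..255, and each repetition sums to 255 * 256 / 2 = 32640, so the total is 256 * 32640 = 8355840 = 0x7f8000. The short host-side sketch below computes the same value both ways; the function names are hypothetical.

```c
#include <stdint.h>

#define DLEN (1u << 16)

/* Closed form: DLEN / 256 full cycles of 0..255, each summing to 255 * 256 / 2. */
uint32_t counting_checksum_closed_form(void) {
    return (DLEN / 256u) * (255u * 256u / 2u);
}

/* Brute force: add the low byte of every index, as the DPU program does
 * on the counting bytes of the sample image. */
uint32_t counting_checksum_loop(void) {
    uint32_t checksum = 0;
    for (uint32_t i = 0; i < DLEN; i++)
        checksum += (uint8_t)i;
    return checksum;
}
```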
Below is the code to generate sample.bin (in sampleGenerator.c):
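#include <stdio.h>

#define DLEN (1 << 16)

int main() {
    FILE *f = fopen("sample.bin", "wb");
    int i, checksum = 0;

    /* Stop early if the output file cannot be created */
    if (f == NULL) {
        perror("sample.bin");
        return 1;
    }

    /* 4-byte header: the number of data bytes that follow */
    i = DLEN;
    fwrite(&i, 4, 1, f);

    /* DLEN counting bytes: 0, 1, ..., 255, 0, 1, ... */
    for (i = 0; i < DLEN; i++) {
        unsigned char ii = (unsigned char)i;
        fwrite(&ii, 1, 1, f);
        checksum += ii;
    }
    fclose(f);

    printf("checksum = %d = 0x%x\n", checksum, checksum);
    return 0;
}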