Unaligned MRAM Accesses

As explained in the section Direct access to the MRAM, MRAM accesses are constrained by strict rules:

The source or target address in WRAM must be aligned on 8 bytes.

The source or target address in MRAM must be aligned on 8 bytes.

The size of the transfer must be a multiple of 8, at least equal to 8 and not greater than 2048.
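The three rules above can be checked mechanically. The helper below is a minimal host-side sketch in plain C; is_valid_aligned_transfer, DMA_ALIGN and DMA_MAX_SIZE are hypothetical names chosen for illustration and are not part of the Runtime Library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Limits taken from the rules listed above. */
#define DMA_ALIGN 8u
#define DMA_MAX_SIZE 2048u

/* Hypothetical helper (not a Runtime Library function): returns true when
 * a transfer satisfies all three constraints of an aligned MRAM access. */
static bool is_valid_aligned_transfer(uintptr_t wram_addr, uintptr_t mram_addr,
                                      size_t nb_of_bytes) {
  bool wram_ok = (wram_addr % DMA_ALIGN) == 0;
  bool mram_ok = (mram_addr % DMA_ALIGN) == 0;
  bool size_ok = nb_of_bytes >= DMA_ALIGN && nb_of_bytes <= DMA_MAX_SIZE &&
                 (nb_of_bytes % DMA_ALIGN) == 0;
  return wram_ok && mram_ok && size_ok;
}
```

With this check, a 12-byte transfer (three 4-byte integers) is rejected even when both addresses are aligned, because 12 is not a multiple of 8.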

However, some situations require accesses where the MRAM address and/or the transfer size is not aligned on 8 bytes. An example is when the application needs to read or write an odd number of 4-byte integers in MRAM.

The Runtime Library thus defines two functions to perform unaligned copies:

From the MRAM to the WRAM (void* mram_read_unaligned(const __mram_ptr void *from, void *buffer, unsigned int nb_of_bytes))

From the WRAM to the MRAM (void mram_write_unaligned(const void *from, __mram_ptr void *to, unsigned int nb_of_bytes))

These functions are extensions to the “low-level” functions mram_read and mram_write, with additional support for arbitrary MRAM addresses and arbitrary transfer sizes not greater than 2048 bytes.

The mram_write_unaligned function supports unaligned MRAM destination addresses, but still requires that the source WRAM address has the same alignment as the destination address (i.e., the two addresses are equal modulo 8). If this condition is not respected, a fault is generated (see Fault).

Important note: this support comes with a performance cost, especially for the mram_write_unaligned function. Hence, whenever possible, it is preferable to design the application around aligned MRAM accesses.

Below is an example usage of these functions. The DPU program receives an array of daily data for a period of 64 years. Each daily record is a vector of five 4-byte integers (for instance, the sales counts for different products). The program’s task is to report, for each year, the maximum of each sales count over the first 35 days of the year. Each tasklet processes the data for one year at a time, and therefore has to make unaligned reads and writes from and to the MRAM (mram_unaligned_copy_example.c).

#include <defs.h>
#include <mram_unaligned.h>

#define NB_YEARS 64
#define DIMENSION 5
#define NB_ELEM_PER_YEAR (365 * DIMENSION)
#define NB_DAYS 35

// inputs
__mram uint32_t daily_data[NB_YEARS * NB_ELEM_PER_YEAR];

// outputs
__mram uint32_t max_per_year[NB_YEARS * DIMENSION];

#define NR_ELEMENTS_PER_TASKLET (NB_YEARS / NR_TASKLETS)

int main() {

  // WRAM cache
  __dma_aligned uint32_t cache[NB_DAYS * DIMENSION + 4];
  __dma_aligned uint32_t max[DIMENSION + 1];

  for (unsigned i = me() * NR_ELEMENTS_PER_TASKLET;
       i < (me() + 1) * NR_ELEMENTS_PER_TASKLET; ++i) {

    /**
     * unaligned mram read of the first 35 days of the year
     **/
    uint32_t *data =
        mram_read_unaligned(&daily_data[i * NB_ELEM_PER_YEAR], cache,
                            NB_DAYS * DIMENSION * sizeof(uint32_t));

    /**
     * Compute the max for each dimension
     *
     * A WRAM buffer offset is used to ensure that the data is stored
     * in WRAM at an address with the same alignment as the MRAM address to
     * which it is going to be copied (pre-condition of the
     * mram_write_unaligned function).
     **/
    uint8_t wram_offset = (i * DIMENSION) & 1;

    for (uint32_t k = wram_offset; k < DIMENSION + wram_offset; ++k)
      max[k] = 0;

    for (uint32_t j = 0; j < NB_DAYS; ++j) {
      for (uint32_t k = 0; k < DIMENSION; ++k) {
        if (max[k + wram_offset] < data[j * DIMENSION + k])
          max[k + wram_offset] = data[j * DIMENSION + k];
      }
    }

    /**
     * unaligned mram write of the max vector for this year
     **/
    mram_write_unaligned(&max[wram_offset], &max_per_year[i * DIMENSION],
                         DIMENSION * sizeof(uint32_t));
  }
}

For both mram_read_unaligned and mram_write_unaligned, when the address and size passed are aligned on 8 bytes, the mram_read or mram_write function is called instead, so there is little overhead.
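To see which accesses of the example above are actually misaligned, the byte offsets can be worked out by hand. The following host-side sketch does that arithmetic; the helper names are hypothetical, and it assumes that daily_data and max_per_year themselves start on an 8-byte boundary in MRAM.

```c
#include <stdint.h>

#define DIMENSION 5
#define NB_ELEM_PER_YEAR (365 * DIMENSION)
#define NB_DAYS 35

/* Byte offset, within daily_data, of the first element of a given year. */
static uint32_t read_offset_bytes(uint32_t year) {
  return year * NB_ELEM_PER_YEAR * (uint32_t)sizeof(uint32_t);
}

/* Byte offset, within max_per_year, of the result vector of a given year. */
static uint32_t write_offset_bytes(uint32_t year) {
  return year * DIMENSION * (uint32_t)sizeof(uint32_t);
}

/* Distance in bytes from the previous 8-byte boundary. */
static uint32_t misalignment(uint32_t byte_offset) { return byte_offset & 7u; }
```

One year of data is 7300 bytes and one result vector is 20 bytes, and both leave a remainder of 4 modulo 8. Under the assumption above, even-numbered years therefore start on an 8-byte boundary and odd-numbered years start 4 bytes past one, which matches wram_offset = (i * DIMENSION) & 1 in the example. The transfer sizes (700 and 20 bytes) are not multiples of 8, so in this particular example no iteration meets the address-and-size condition for the aligned shortcut.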

For unaligned reads, the mram_read_unaligned function performs an mram_read call while extending the address and size so that more data than needed is loaded into the WRAM. For example, if the MRAM address is 0x08000004 and the size is 4 bytes, the mram_read is done at address 0x08000000 with size 8. The function thus requires an input WRAM buffer that is at least nb_of_bytes + 16 bytes in size. Passing a smaller buffer can lead to undefined behavior. The return value is a WRAM pointer to where the value at MRAM address from is stored in WRAM (this may not be the address of the buffer passed to the function). In the previous example, the returned value would be (char*)buffer + 4.
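The address and size extension described above can be modelled on the host with a plain byte array standing in for MRAM. This is a sketch of the arithmetic only, not the library's implementation; model_aligned_read and model_read_unaligned are hypothetical names, and MRAM addresses are represented as offsets into the array.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for mram_read: offset and size are expected to be multiples of 8. */
static void model_aligned_read(const uint8_t *mram, uint32_t from,
                               uint8_t *buffer, uint32_t size) {
  memcpy(buffer, mram + from, size);
}

/* Hypothetical model of an unaligned read. `buffer` must be large enough to
 * hold the extended transfer (the text above asks for nb_of_bytes + 16). */
static void *model_read_unaligned(const uint8_t *mram, uint32_t from,
                                  uint8_t *buffer, uint32_t nb_of_bytes) {
  uint32_t shift = from & 7u;           /* extra bytes loaded before `from` */
  uint32_t aligned_from = from - shift; /* address extended down to 8 bytes */
  uint32_t aligned_size =
      (shift + nb_of_bytes + 7u) & ~7u; /* size extended up to 8 bytes */
  model_aligned_read(mram, aligned_from, buffer, aligned_size);
  return buffer + shift; /* where the byte at `from` landed in WRAM */
}
```

For an offset ending in 4 and a size of 4, the model reads 8 bytes from the preceding boundary and returns buffer + 4, as in the example given in the text.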

For an unaligned write, the mram_write_unaligned function performs an aligned mram_write of the part of the data that satisfies the alignment constraint. For the rest of the data (prolog/epilog), it first needs to read the 8 bytes into WRAM, change the few bytes that need to be changed, and write the 8 bytes back. This operation needs to be atomic (i.e., the 8 bytes cannot be changed by another tasklet between the read and the write). The mram_write_unaligned function also requires the source WRAM buffer to have the same alignment as the destination MRAM address (i.e., ((uintptr_t)from & 7) == ((uintptr_t)to & 7)). This is the reason for using the wram_offset variable in the above example.
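The prolog/body/epilog split described above can be sketched as a host-side model, again with a byte array standing in for MRAM and offsets standing in for MRAM addresses. The names model_patch_block and model_write_unaligned are hypothetical, the locking that makes the read-modify-write atomic is omitted, and the model returns -1 where the text says the real function generates a fault.

```c
#include <stdint.h>
#include <string.h>

/* Read-modify-write of `len` bytes starting at byte `first` of one 8-byte
 * block. On the DPU this sequence must be atomic; no lock is modelled here. */
static void model_patch_block(uint8_t *mram, uint32_t block, uint32_t first,
                              const uint8_t *src, uint32_t len) {
  uint8_t tmp[8];
  memcpy(tmp, mram + block, 8);  /* read the whole block */
  memcpy(tmp + first, src, len); /* change only the requested bytes */
  memcpy(mram + block, tmp, 8);  /* write the whole block back */
}

/* Hypothetical model of an unaligned write. Returns 0 on success and -1 when
 * source and destination are not equal modulo 8. */
static int model_write_unaligned(const uint8_t *from, uint8_t *mram,
                                 uint32_t to, uint32_t nb_of_bytes) {
  if (((uintptr_t)from & 7u) != (to & 7u))
    return -1;

  uint32_t end = to + nb_of_bytes;

  /* Prolog: partial block before the first 8-byte boundary. */
  if ((to & 7u) != 0 && nb_of_bytes > 0) {
    uint32_t first = to & 7u;
    uint32_t len = 8u - first;
    if (len > nb_of_bytes)
      len = nb_of_bytes;
    model_patch_block(mram, to - first, first, from, len);
    from += len;
    to += len;
  }

  /* Body: whole 8-byte blocks, copied with a plain aligned write. */
  uint32_t body = (end - to) & ~7u;
  memcpy(mram + to, from, body);
  from += body;
  to += body;

  /* Epilog: partial block after the last 8-byte boundary. */
  if (to < end)
    model_patch_block(mram, to, 0, from, end - to);
  return 0;
}
```

Because source and destination share the same offset within an 8-byte block, the body copy starts on an 8-byte boundary on both sides once the prolog has been handled, which is what the alignment pre-condition buys.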

Changing a single integer/byte value in MRAM

The Runtime Library also provides specific macros to write or update a single byte or 4-byte integer in MRAM:

mram_write_byte_atomic(dest, val)

mram_update_byte_atomic(dest, update_func, args)

mram_write_int_atomic(dest, val)

mram_update_int_atomic(dest, update_func, args)

Here, the parameter dest is an address in MRAM, val is an 8-bit or 32-bit value to be written, and update_func is a function that provides a new value based on the current value and some arguments (args). For example, the following DPU program replaces each 4-byte value of an input table with its square, using the macro mram_update_int_atomic (mram_update_int_example.c):

#include <defs.h>
#include <mram_unaligned.h>

#define BUFFER_SIZE (1024 * 1024)
#define NR_ELEMENTS_PER_TASKLET (BUFFER_SIZE / NR_TASKLETS)

/**
 * callback function for mram_update_int_atomic
 * Update the input value with its square
 **/
static void square(int *i, void *args) { (*i) *= (*i); }

__mram_noinit uint32_t input_table[BUFFER_SIZE];

int main() {

  for (unsigned i = me() * NR_ELEMENTS_PER_TASKLET;
       i < (me() + 1) * NR_ELEMENTS_PER_TASKLET; ++i) {
    /**
     * atomic update of the table value
     **/
    mram_update_int_atomic(&input_table[i], square, 0);

    /**
     * WARNING: an implicit MRAM access like the following
     * would lead to undefined behavior:
     *
     * input_table[i] *= input_table[i]; // WRONG
     *
     * This is due to the fact that the implicit access to MRAM
     * is not multi-tasklet safe for sizes smaller than 8 bytes.
     **/
  }
}

Note that, because the input values are 4-byte integers, an implicit MRAM access is not multi-tasklet safe (see Software cache).
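The callback-based update can be pictured as a read-modify-write of the 8-byte block that contains the integer. The host-side model below is a sketch under that assumption and not the macro's actual expansion; model_update_int and update_func_t are hypothetical names, MRAM is a byte array addressed by offset, dest is assumed to be 4-byte aligned, and the locking that provides atomicity on the DPU is only marked by comments.

```c
#include <stdint.h>
#include <string.h>

/* Same callback shape as in the example above. */
typedef void (*update_func_t)(int *value, void *args);

static void square(int *i, void *args) {
  (void)args; /* unused */
  (*i) *= (*i);
}

/* Hypothetical model of a 4-byte update inside an 8-byte block. */
static void model_update_int(uint8_t *mram, uint32_t dest,
                             update_func_t update, void *args) {
  uint8_t block[8];
  uint32_t base = dest & ~7u;  /* start of the enclosing 8-byte block */
  uint32_t inner = dest & 7u;  /* 0 or 4 for a 4-byte aligned dest */
  int value;

  /* (exclusive access to the block would be acquired here) */
  memcpy(block, mram + base, sizeof block);
  memcpy(&value, block + inner, sizeof value);
  update(&value, args);
  memcpy(block + inner, &value, sizeof value);
  memcpy(mram + base, block, sizeof block);
  /* (exclusive access would be released here) */
}
```

The neighbouring integer in the same block is read and written back unchanged, which is why the whole sequence has to be protected against other tasklets.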

This macro can also be used in the histogram example introduced in section Mutual exclusions, where the histogram is now declared using 4-byte integers instead of 8-byte integers (mram_update_int_histogram_example.c):

#include <defs.h>
#include <mram_unaligned.h>

#define BUFFER_SIZE (1024 * 1024)
#define NR_ELEMENTS_HIST (1 << 8)
#define NR_ELEMENTS_PER_TASKLET (BUFFER_SIZE / NR_TASKLETS)

/**
 * callback function for mram_update_int_atomic
 * Add +1 to the mram value
 **/
static void adder(int *i, void *args) { *i += 1; }

__mram_noinit uint8_t input_table[BUFFER_SIZE];
__mram uint32_t histogram[NR_ELEMENTS_HIST];

int main() {

  for (unsigned i = me() * NR_ELEMENTS_PER_TASKLET;
       i < (me() + 1) * NR_ELEMENTS_PER_TASKLET; ++i) {
    uint8_t elem = input_table[i];
    /**
     * atomic update of the histogram value
     **/
    mram_update_int_atomic(&histogram[elem], adder, 0);
  }
}

Note that this code uses neither mutexes nor virtual mutexes to protect the access to the histogram, since the integer update is atomic with respect to the other tasklets’ execution. The atomicity is achieved using virtual mutexes under the hood, so that no two tasklets can write within the same 8-byte memory location concurrently. Using the mram_update_int_atomic macro is faster than an mram_read_unaligned followed by an mram_write_unaligned, as in the second case an additional MRAM read is done by the mram_write_unaligned call. For the same reason, it is faster than an implicit MRAM access (e.g., histogram[elem]++) protected by virtual mutexes (the C increment statement will generate two reads of the 8-byte memory location where histogram[elem] is stored). Also, note that using an implicit MRAM access without synchronization would lead to undefined behavior, since the implicit MRAM access on 4 bytes is not multi-tasklet safe.
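The reason two tasklets must not write within the same 8-byte location concurrently can be illustrated with a deterministic host-side simulation. It assumes, as a simplification, that updating a 4-byte counter means reading and rewriting the whole 8-byte block that holds it. All names are hypothetical, and the two "tasklets" are simply interleaved by hand.

```c
#include <stdint.h>
#include <string.h>

/* Model: MRAM is accessed in whole 8-byte blocks, i.e. two counters at once. */
typedef struct {
  uint32_t word[2];
} block8_t;

static block8_t load_block(const uint32_t *mram, uint32_t index) {
  block8_t b;
  memcpy(b.word, &mram[index & ~1u], sizeof b.word);
  return b;
}

static void store_block(uint32_t *mram, uint32_t index, block8_t b) {
  memcpy(&mram[index & ~1u], b.word, sizeof b.word);
}

/* Tasklet A increments counter 0 and tasklet B increments counter 1,
 * with both reads happening before either write. */
static void increment_unsynchronized(uint32_t *mram) {
  block8_t a = load_block(mram, 0); /* A reads the block */
  block8_t b = load_block(mram, 1); /* B reads the same block */
  a.word[0] += 1;
  store_block(mram, 0, a); /* A writes back: counter 0 is updated */
  b.word[1] += 1;
  store_block(mram, 1, b); /* B writes back a stale copy of counter 0 */
}

/* Same two increments, each read-modify-write completed before the next. */
static void increment_serialized(uint32_t *mram) {
  block8_t a = load_block(mram, 0);
  a.word[0] += 1;
  store_block(mram, 0, a);
  block8_t b = load_block(mram, 1);
  b.word[1] += 1;
  store_block(mram, 1, b);
}
```

Starting from two zeroed counters, the unsynchronized interleaving leaves counter 0 at 0 because tasklet A's increment is overwritten, while the serialized version leaves both counters at 1. Serializing the read-modify-write per 8-byte location prevents this kind of lost update.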