Controlling the execution of DPUs from host applications
The DPU host API facilitates interactions between host applications and DPUs by offering functions to:
Dynamically obtain the DPUs an application needs
Load the DPUs with a program
Start DPUs and get their execution status
Obtaining DPUs
From a physical standpoint, DPUs are grouped by ranks. Within a rank, each operation can address one or several DPUs at a time. The size of a rank varies depending on the underlying implementation:
UPMEM DIMMs provide ranks of 64 DPUs
F1 on AWS provides ranks of 32 DPUs
The UPMEM simulator provides ranks of 1 DPU only
Etc.
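The rank sizes above give a lower bound on how many ranks a request for a given number of DPUs must span. A minimal sketch of that arithmetic in plain C, with no SDK dependency; `min_ranks_for` is a hypothetical helper, not part of the host API, and the number of ranks actually present in an allocated set should be read back with dpu_get_nr_ranks rather than computed:

```c
#include <stdint.h>

/* Hypothetical helper (not part of the DPU host API): smallest number of
 * ranks able to hold nr_dpus DPUs when each rank provides rank_size DPUs.
 * Returns 0 when rank_size is 0. */
uint32_t min_ranks_for(uint32_t nr_dpus, uint32_t rank_size)
{
    if (rank_size == 0)
        return 0;
    /* Ceiling division written so that it cannot overflow. */
    return nr_dpus / rank_size + (nr_dpus % rank_size != 0 ? 1u : 0u);
}
```

With the rank sizes listed above, `min_ranks_for(100, 64)` evaluates to 2, `min_ranks_for(100, 32)` to 4 and `min_ranks_for(100, 1)` to 100.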
However, an application often needs to apply the same action to all DPUs of all ranks, and in some cases doing so costs no more than applying the action rank by rank.
As a consequence, the host API works on sets of DPUs, which may contain multiple DPU ranks. The provided C macros DPU_RANK_FOREACH and DPU_FOREACH iterate over the ranks and the DPUs of a set, respectively. Here are some of the available C functions to manage a DPU set:
dpu_alloc: returns a set containing exactly the specified number of DPUs, or an error if that number of DPUs cannot be allocated. If DPU_ALLOCATE_ALL is used in place of a number, dpu_alloc allocates all available DPUs.
dpu_free: frees a given set of DPUs. Only sets allocated with dpu_alloc can be freed.
dpu_get_nr_ranks: returns the number of ranks in a DPU set
dpu_get_nr_dpus: returns the number of DPUs in a DPU set
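The functions and macros above can be combined as in the following sketch. It assumes the UPMEM SDK is installed and that the prototypes match the SDK's dpu.h header (in particular, that the counts are returned through output pointers and that DPU_ASSERT aborts on error); check these against your installed version before relying on them.

```c
#include <dpu.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    struct dpu_set_t set, rank, dpu;
    uint32_t nr_ranks = 0, nr_dpus = 0;
    uint32_t ranks_seen = 0, dpus_seen = 0;

    /* Request every available DPU; a NULL profile is equivalent to an
     * empty profile. */
    DPU_ASSERT(dpu_alloc(DPU_ALLOCATE_ALL, NULL, &set));

    /* Query the size of the set. */
    DPU_ASSERT(dpu_get_nr_ranks(set, &nr_ranks));
    DPU_ASSERT(dpu_get_nr_dpus(set, &nr_dpus));
    printf("set holds %" PRIu32 " DPUs in %" PRIu32 " ranks\n", nr_dpus, nr_ranks);

    /* Walk the set at both granularities and count what is visited. */
    DPU_RANK_FOREACH(set, rank) {
        ranks_seen++;
    }
    DPU_FOREACH(set, dpu) {
        dpus_seen++;
    }
    printf("visited %" PRIu32 " ranks and %" PRIu32 " DPUs\n", ranks_seen, dpus_seen);

    /* Release the set obtained from dpu_alloc. */
    DPU_ASSERT(dpu_free(set));
    return 0;
}
```

Running this requires the SDK's host library and at least one available DPU (hardware or simulator), so the printed counts depend on the machine.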
The allocation functions take a string, called the profile, that describes the target.
This string is a comma-separated list of key-value pairs:
"key1=value1,key2=value2,key3=value3,..."
Here is a non-exhaustive list of keys with their associated values:
backend: simulator or hw
cycleAccurate (only for FPGA): true or false
A NULL profile is equivalent to an empty profile.
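Since the profile is an ordinary string, it can be assembled at run time. The following is a minimal sketch in plain C; `build_profile` is a hypothetical helper, not part of the host API, and it only formats the "key=value" list without checking that the keys or values are ones the SDK accepts.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of the DPU host API): formats key/value
 * pairs as "key1=value1,key2=value2,..." into out.
 * Returns 0 on success, -1 if out is missing or too small; on failure out
 * may hold a truncated string and should not be used. */
int build_profile(char *out, size_t out_size,
                  const char *const keys[], const char *const values[],
                  size_t nr_pairs)
{
    size_t used = 0;

    if (out == NULL || out_size == 0)
        return -1;
    out[0] = '\0'; /* zero pairs yield the empty profile */

    for (size_t i = 0; i < nr_pairs; i++) {
        /* Prefix every pair but the first with a comma. */
        int n = snprintf(out + used, out_size - used, "%s%s=%s",
                         i == 0 ? "" : ",", keys[i], values[i]);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1; /* encoding error or output truncated */
        used += (size_t)n;
    }
    return 0;
}
```

For example, the single pair with key `backend` and value `simulator` produces the string "backend=simulator", which can then be passed as the profile argument of the allocation functions.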
In C++, Java and Python, the allocate method is used to obtain a set of DPUs or ranks, represented as a DpuSet or DpuSystem object. More information can be found in the host API documentation for each language: C++ Host API, Java Library and Python Library.
Loading programs
Programs are loaded with dpu_load, which programs all the DPUs in a set.
The function takes a binary file path as input and loads the enclosed program onto the specified DPUs.
The program information can be stored through a given pointer, or is ignored if the pointer is NULL.
The program is persistent in the DPU memory, meaning that a DPU can be booted as many times as the application wants and will always execute the same code.
Note: as explained in Coding tips and recommended practices, global constants persist across boots, even if static.
Applications may, however, reload DPUs with new programs, by invoking dpu_load at any moment.
Note that the C host API also provides two functions similar to dpu_load that load a program from memory:
dpu_load_from_incbin loads a program stored in memory using the DPU_INCBIN macro.
dpu_load_from_memory loads a program stored in memory.
In C++, Java and Python the load method is available for this operation. Please check the Host API documentation of the corresponding language for more details.
Executing programs
Programs are executed by “booting” DPUs: a call to dpu_launch boots all the DPUs of a given set.
Some resources, but not all of them, are reset before booting. More details about what is reset can be found in Coding tips and recommended practices.
Applications can execute DPUs synchronously or asynchronously:
DPU_SYNCHRONOUS suspends the application until the requested DPUs complete their execution (or encounter an error)
DPU_ASYNCHRONOUS immediately returns control to the application, which is then in charge of checking the DPUs’ status via dpu_status or dpu_sync
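The two launch modes can be sketched end to end as follows. This assumes the UPMEM SDK is installed, that the prototypes match the SDK's dpu.h header (notably that dpu_status reports completion and fault through two bool output pointers), and that a DPU program has already been built; the `DPU_BINARY` path is a hypothetical placeholder.

```c
#include <dpu.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumption: path to a DPU program built separately (placeholder name). */
#ifndef DPU_BINARY
#define DPU_BINARY "./dpu_program"
#endif

int main(void)
{
    struct dpu_set_t set;
    bool done = false, fault = false;

    DPU_ASSERT(dpu_alloc(1, NULL, &set));
    /* NULL as last argument: the program information is ignored. */
    DPU_ASSERT(dpu_load(set, DPU_BINARY, NULL));

    /* Synchronous boot: the call returns once the DPUs have completed
     * their execution or an error was encountered. */
    DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));

    /* Asynchronous boot of the same, still loaded program: the call
     * returns immediately and the host can do other work. */
    DPU_ASSERT(dpu_launch(set, DPU_ASYNCHRONOUS));

    /* Non-blocking check: the DPUs may or may not have finished yet. */
    DPU_ASSERT(dpu_status(set, &done, &fault));
    printf("after asynchronous launch: done=%d fault=%d\n", done, fault);

    /* Blocking wait until the asynchronous execution is over. */
    DPU_ASSERT(dpu_sync(set));

    DPU_ASSERT(dpu_free(set));
    return 0;
}
```

The second launch reuses the program loaded once, which relies on the persistence described under Loading programs. The values printed after the asynchronous launch depend on timing and on the DPU program itself.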
In C++, Java and Python the exec method is used to boot the DPUs. Please check the Host API documentation of the corresponding language for more details.