Communication with host applications
------------------------------------

.. _dpu-memory-interface-label:

Memory Interface
~~~~~~~~~~~~~~~~

The C host API provides functions to transfer data between the host memory and any of the DPU memories (**IRAM**, **WRAM** or **MRAM**):

  * ``dpu_copy_from(struct dpu_set_t set, const char *symbol_name, uint32_t symbol_offset, void *dst, size_t length)`` to copy a buffer from a single DPU
  * ``dpu_broadcast_to(struct dpu_set_t set, const char *symbol_name, uint32_t symbol_offset, const void *src, size_t length, dpu_xfer_flags_t flags)`` to broadcast a buffer to a set of DPUs
  * ``dpu_push_xfer(struct dpu_set_t set, dpu_xfer_t xfer, const char *symbol_name, uint32_t symbol_offset, size_t length, dpu_xfer_flags_t flags)`` to push different buffers to a set of DPUs in one transfer.

These functions are subject to alignment constraints that depend on the target DPU memory:

  * **IRAM** address and length must be aligned on 8 bytes
  * **WRAM** address and length must be aligned on 4 bytes
  * **MRAM** address and length must be aligned on 8 bytes

The functions will return an error if these constraints are not respected.
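
As an illustration of the constraints above, here is a minimal host-side sketch that checks a transfer before it is issued. The ``dpu_mem_kind`` enum and the three helper functions are hypothetical names introduced for this example; they are not part of the host API. The sketch only checks the symbol offset and the length, assuming the symbol itself is suitably aligned in DPU memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical enum for this sketch; it is not defined by the host API. */
enum dpu_mem_kind { MEM_IRAM, MEM_WRAM, MEM_MRAM };

/* Required alignment in bytes, as listed above: 4 for WRAM, 8 for IRAM and MRAM. */
static size_t required_alignment(enum dpu_mem_kind kind)
{
    return (kind == MEM_WRAM) ? 4 : 8;
}

/* True when both the offset and the length satisfy the alignment rule. */
static bool transfer_is_aligned(enum dpu_mem_kind kind, uint32_t offset, size_t length)
{
    size_t align = required_alignment(kind);
    return (offset % align == 0) && (length % align == 0);
}

/* Round a length up to the next multiple of the required alignment. */
static size_t round_up_length(enum dpu_mem_kind kind, size_t length)
{
    size_t align = required_alignment(kind);
    assert(length <= SIZE_MAX - (align - 1)); /* guard against overflow */
    return (length + align - 1) / align * align;
}
```

A caller could, for example, pad a 13-byte buffer to ``round_up_length(MEM_MRAM, 13)`` bytes before transferring it, instead of relying on the error returned by the transfer function.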

The ``symbol_name`` argument is the name of a variable in the DPU code.
It can be either an **MRAM** variable (with the ``__mram`` or ``__mram_noinit`` attribute) or a **WRAM** variable (with the ``__host`` attribute).
Other variables are not visible to the host application. (**Note:** Before using WRAM transfers, read the :ref:`data-sharing-label` section.)

**Note:**
The special **MRAM** variable ``DPU_MRAM_HEAP_POINTER`` (see :ref:`dpu-mram-heap-pointer-explanation-label`) can be accessed by specifying ``DPU_MRAM_HEAP_POINTER_NAME`` (defined in ``dpu_types.h``) as the ``symbol_name``.

When the DPU set contains multiple DPUs:

  * ``dpu_broadcast_to`` will copy the same buffer to all DPUs in the set
  * ``dpu_copy_from`` will return ``DPU_ERR_INVALID_DPU_SET``
  * ``dpu_push_xfer``: see Section :ref:`dpu-rank-transfer-interface-label`

As an illustration, let's implement a trivial checksum function in the DPU. The host application fills in the **MRAM**
with a buffer of arbitrary size:

  * The first 4 bytes in **MRAM** represent the buffer size ``N``
  * The subsequent ``N`` bytes in **MRAM** contain the data for which the application requests a checksum computation
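
The layout above can be sketched on the host as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the image is padded to a multiple of 8 bytes to satisfy the **MRAM** alignment rule, ``N`` is stored in host byte order (assumed here to match the DPU), and the checksum is assumed to be a plain sum of the bytes, which may differ from the example program.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Image size: 4-byte header plus n data bytes, padded to a multiple of 8. */
static size_t image_size(uint32_t n)
{
    size_t raw = sizeof(uint32_t) + (size_t)n;
    return (raw + 7) & ~(size_t)7;
}

/* Build the image: N in the first 4 bytes, then the N data bytes.
 * The padding is zero-filled. The caller frees the result.
 * Returns NULL if the allocation fails. */
static uint8_t *build_image(const uint8_t *data, uint32_t n)
{
    assert(data != NULL || n == 0);
    uint8_t *image = calloc(image_size(n), 1);
    if (image == NULL)
        return NULL;
    memcpy(image, &n, sizeof(n)); /* assumption: host and DPU share byte order */
    if (n > 0)
        memcpy(image + sizeof(n), data, n);
    return image;
}

/* Reference checksum (assumption: sum of the N data bytes, modulo 2^32). */
static uint32_t reference_checksum(const uint8_t *image)
{
    uint32_t n;
    uint32_t sum = 0;
    memcpy(&n, image, sizeof(n));
    for (uint32_t i = 0; i < n; i++)
        sum += image[sizeof(n) + i];
    return sum;
}
```

Computing a reference value on the host in this way gives something to compare against the result that the DPU writes back.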

On the DPU side, the program uses a single tasklet to fetch ``N`` and compute the checksum of the supplied buffer. When
done, the result is posted back into the first four bytes of the **MRAM**.

The following is a very simple implementation of the DPU-side code, using a mix of **MRAM** variables and low-level **MRAM**/**WRAM** access functions (in ``trivial_checksum_example.c``):

.. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/trivial_checksum_example.c
    :language: c

The code is built to be executed by a single tasklet:

.. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/trivial_checksum_example.compile_dpu

This code can be tested with ``dpu-lldb`` by loading a predefined **MRAM** image.

Such an image is a binary file prepared by the developer. For example, to load an MRAM image called ``sample.bin`` and run the checksum computation on it:

.. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/trivial_checksum_example.lldb_script

As usual, printing the ``checksum`` variable makes it possible to verify that the returned value is correct:

.. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/trivial_checksum_example.output_reference

A host application can trigger the checksum computation by filling the **MRAM** with the data, as illustrated below:

.. tabs::
    .. group-tab:: C

       .. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/trivial_checksum_example_host.c
           :language: c

    .. group-tab:: C++

       .. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/trivial_checksum_example_host.cpp
           :language: c++

    .. group-tab:: Java

       .. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/TrivialChecksumExample.java
           :language: java

    .. group-tab:: Python

       .. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/trivial_checksum_example_host.py
           :language: python

**Note:** In C++, Java and Python, a ``copy`` method is used for the data transfers between the host and the DPU, instead of the ``dpu_copy_from`` and ``dpu_broadcast_to`` functions used in C.

Compile the program, for example:

.. tabs::
    .. group-tab:: C

       .. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/trivial_checksum_example.compile_host
           :language: bash

    .. group-tab:: C++

       .. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/trivial_checksum_example.compile_host_cpp
           :language: bash

    .. group-tab:: Java

       .. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/trivial_checksum_example.compile_host_java
           :language: bash

    .. group-tab:: Python

        N/A


The result printed by this program should be the checksum of 64 KB of counting bytes:

.. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/trivial_checksum_example_host.output_reference

.. _dpu-rank-transfer-interface-label:

Rank Transfer Interface
~~~~~~~~~~~~~~~~~~~~~~~

The previous functions are not fine-grained enough to transfer different data to or from each DPU while
keeping the performance of a whole-rank transfer. For that purpose, use the following C functions:

  * ``dpu_prepare_xfer`` assigns a buffer to a set of DPUs, to be used as input or output when ``dpu_push_xfer`` is called
  * ``dpu_push_xfer`` executes the current transfer with the given direction, DPU symbol name, and DPU symbol length, using the buffers defined with ``dpu_prepare_xfer``.
    No transfer is done for a DPU with no defined buffer.
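
To make the prepare/push semantics described above concrete, here is a small host-only model. It does not use the host API at all: ``mock_set``, ``mock_prepare`` and ``mock_push`` are hypothetical names that only imitate the behaviour stated in the list, namely that each DPU gets its own buffer and that a DPU with no buffer is skipped. Transfer flags are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NR_MOCK_DPUS 4
#define MOCK_MEM_SIZE 16

/* Hypothetical host-only model of a DPU set; not a host API type. */
struct mock_set {
    uint8_t memory[NR_MOCK_DPUS][MOCK_MEM_SIZE]; /* stands in for one symbol per DPU */
    void *prepared[NR_MOCK_DPUS];                /* buffer attached to each DPU, or NULL */
};

enum mock_direction { MOCK_TO_DPU, MOCK_FROM_DPU };

/* Attach a host buffer to one DPU of the set (models the "prepare" step). */
static void mock_prepare(struct mock_set *set, size_t dpu, void *buffer)
{
    assert(dpu < NR_MOCK_DPUS);
    set->prepared[dpu] = buffer;
}

/* Run the transfer for every DPU that has a buffer (models the "push" step).
 * Returns the number of DPUs that took part in the transfer. */
static size_t mock_push(struct mock_set *set, enum mock_direction dir,
                        size_t offset, size_t length)
{
    assert(offset <= MOCK_MEM_SIZE && length <= MOCK_MEM_SIZE - offset);
    size_t moved = 0;
    for (size_t d = 0; d < NR_MOCK_DPUS; d++) {
        if (set->prepared[d] == NULL)
            continue; /* no buffer defined: no transfer for this DPU */
        if (dir == MOCK_TO_DPU)
            memcpy(&set->memory[d][offset], set->prepared[d], length);
        else
            memcpy(set->prepared[d], &set->memory[d][offset], length);
        moved++;
    }
    return moved;
}
```

In this model, preparing buffers for two of the four DPUs and then pushing moves data for exactly those two, which is the behaviour the real functions are described as having.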

Here is an example doing the same computation as before, but using multiple DPUs:

.. tabs::
    .. group-tab:: C

       .. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/trivial_checksum_example_host_multirank.c
           :language: c

    .. group-tab:: C++

       .. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/trivial_checksum_example_host_multirank.cpp
           :language: c++

    .. group-tab:: Java

       .. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/TrivialChecksumExampleMultiRank.java
           :language: java

    .. group-tab:: Python

       .. literalinclude:: ../../../endtests/documentation/trivial_checksum_example/trivial_checksum_example_host_multirank.py
           :language: python

**Note:** In C++, Java and Python, the same ``copy`` method is used for the data transfers between the host and a rank of DPUs. However, this method is an overload of the ``copy`` method used in the single-DPU example: it takes a two-dimensional vector as input, and the first dimension of the vector corresponds to each DPU in the rank.