2025.1.0

UPMEM development tools documentation

Coding tips and recommended practices

Programming DPUs

Persistent and non-persistent objects amongst boots

Consider the following program, which computes 5! (in factorial.c):

#include <mram.h>
#include <stdint.h>

__host __dma_aligned int64_t factorial = 1;

int main() {
  for (int64_t i = 5; i > 0; i--)
    factorial *= i;

  return 0;
}

And its associated host program, given in C, C++, Java and Python:

C (factorial_host.c):

#include <dpu.h>
#include <stdio.h>
#include <stdint.h>

#ifndef DPU_EXE
#define DPU_EXE "factorial"
#endif

int main() {
  struct dpu_set_t set, dpu;
  int64_t factorial;

  DPU_ASSERT(dpu_alloc(1, NULL, &set));
  DPU_ASSERT(dpu_load(set, DPU_EXE, NULL));
  DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));
  DPU_FOREACH(set, dpu) {
    DPU_ASSERT(dpu_copy_from(dpu, "factorial", 0, &factorial, sizeof(factorial)));
  }

  printf("first iteration: factorial = %lli\n", (long long)factorial);

  DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));
  DPU_FOREACH(set, dpu) {
    DPU_ASSERT(dpu_copy_from(dpu, "factorial", 0, &factorial, sizeof(factorial)));
  }

  printf("second iteration: factorial = %lli\n", (long long)factorial);

  DPU_ASSERT(dpu_free(set));

  return 0;
}

C++ (factorial_host.cpp):

#include <dpu>
#include <cstdint>
#include <iostream>
#include <vector>

#ifndef DPU_BINARY
#define DPU_BINARY "factorial"
#endif

using namespace dpu;

int main() {

  try {
    auto dpu = DpuSet::allocate(1);
    dpu.load(DPU_BINARY);
    dpu.exec();

    std::vector<std::vector<uint64_t>> factorial(1);
    factorial.front().resize(1);
    dpu.copy(factorial, "factorial");
    std::cout << "first iteration: factorial = " << 
      (long long)(factorial.front().front()) << std::endl;

    dpu.exec();
    dpu.copy(factorial, "factorial");
    std::cout << "second iteration: factorial = " << 
      (long long)(factorial.front().front()) << std::endl;

  }
  catch (const DpuError & e) {
    std::cerr << e.what() << std::endl;
  }
  return 0;
}

Java (FactorialHost.java):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import com.upmem.dpu.Dpu;
import com.upmem.dpu.DpuException;
import com.upmem.dpu.DpuSystem;

public class FactorialHost {

  public static void main(String[] args) throws DpuException {
    try(DpuSystem dpu = DpuSystem.allocate(1, "")) {

      dpu.load("factorial");
      dpu.exec();

      byte[][] factorial = new byte[1][8];
      dpu.copy(factorial, "factorial");
      ByteBuffer wrapped = ByteBuffer.wrap(factorial[0]);
      wrapped.order(ByteOrder.LITTLE_ENDIAN);

      System.out.println("first iteration: factorial = " + wrapped.getLong(0));

      dpu.exec();
      dpu.copy(factorial, "factorial");

      System.out.println("second iteration: factorial = " + wrapped.getLong(0));
    }
  }
}

Python (factorial_host.py):

#!/usr/bin/env python3

from dpu import DpuSet

with DpuSet(1, binary = "factorial") as dpus:

    dpus.exec()
    factorial = [bytearray(8) for _ in dpus]
    dpus.copy(factorial, 'factorial')
    print("first iteration: factorial = ", int.from_bytes(factorial[0], 'little'))

    dpus.exec()
    dpus.copy(factorial, 'factorial')
    print("second iteration: factorial = ", int.from_bytes(factorial[0], 'little'))


Build and execute it:

C:

dpu-upmem-dpurte-clang -o factorial factorial.c
gcc --std=c99 -o factorial_host factorial_host.c `dpu-pkg-config --cflags --libs dpu`
./factorial_host

C++:

dpu-upmem-dpurte-clang -o factorial factorial.c
g++ --std=c++11 -o factorial_host_cpp factorial_host.cpp `dpu-pkg-config --cflags --libs dpu` -g
./factorial_host_cpp

Java:

dpu-upmem-dpurte-clang -o factorial factorial.c
javac -cp $(dpu-pkg-config --variable=java dpu) FactorialHost.java
java -cp .:$(dpu-pkg-config --variable=java dpu) FactorialHost

Python:

python3 factorial_host.py

The first time the program runs, the returned value is 120, as expected. But after rebooting the DPU and checking the result again, the returned value is not 5! but 14400 (0x3840 in hexadecimal, equal to 5! × 5!).

The result after the second boot is 120 × 120 instead of 120 because the factorial variable is not re-initialized at the second boot. In other words, the declaration:

__host __dma_aligned int64_t factorial = 1;

means that the initial value of this variable is 1 only during the first boot.

More generally, the Runtime Library does not reset system resources when rebooting. In particular, mutexes are not released, and semaphore and barrier counters are not reset to their initial values.
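
One way to make a program robust to repeated boots is to assign the starting value explicitly at the top of main() instead of relying on the initializer. The sketch below models the behaviour in plain C so that it runs without the UPMEM toolchain: each boot is represented by a function call, which is an assumption of this example. The names boot_without_reset and boot_with_reset are hypothetical.

```c
#include <stdint.h>

/* Set once, when the program image is loaded; not set again at each boot. */
static int64_t factorial = 1;

/* Models one boot of the original program: the value left behind by the
 * previous boot is reused as the starting point. */
int64_t boot_without_reset(void) {
  for (int64_t i = 5; i > 0; i--)
    factorial *= i;
  return factorial;
}

/* Models one boot of a corrected program: the variable is assigned
 * explicitly before use, so every boot starts from 1. */
int64_t boot_with_reset(void) {
  factorial = 1;
  for (int64_t i = 5; i > 0; i--)
    factorial *= i;
  return factorial;
}
```

Calling boot_without_reset() twice reproduces the 120 then 14400 sequence described above, while boot_with_reset() yields 120 on every call. In a DPU program the equivalent change would be the single assignment factorial = 1; at the start of main().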

Multiplications and divisions of shorts and integers are expensive

Multiplications of 32-bit words rely on the UPMEM DPU instruction mul_step, which implies an overhead of up to 42 clock cycles per multiplication. The same applies to 32-bit division and remainder.

As a consequence, avoid using these operations when not needed.
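
When one operand is a power of two, these operations can be expressed with shifts and masks, a general technique that is not specific to UPMEM. The sketch below assumes unsigned 32-bit operands and a shift amount k below 32; the helper names are hypothetical. A compiler may already perform this rewrite when the power of two is a compile-time constant, so the helpers are mostly relevant when k is only known at run time.

```c
#include <stdint.h>

/* x * 2^k, for k < 32 (the result wraps modulo 2^32 like a multiplication). */
uint32_t mul_pow2(uint32_t x, unsigned k) {
  return x << k;
}

/* x / 2^k, for k < 32 (unsigned operands only). */
uint32_t div_pow2(uint32_t x, unsigned k) {
  return x >> k;
}

/* x % 2^k, for k < 32: keep the k low-order bits. */
uint32_t mod_pow2(uint32_t x, unsigned k) {
  return x & ((UINT32_C(1) << k) - 1u);
}
```

For example, mul_pow2(5, 3) computes 5 × 8, div_pow2(100, 2) computes 100 / 4, and mod_pow2(100, 3) computes 100 % 8.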

64-bit variables are expensive

The DPU is a native 32-bit machine. 64-bit instructions are emulated by the toolchain and are usually more expensive than 32-bit ones. Typically, an addition is emulated by two or three instructions, so it is two to three times more expensive.

As a consequence, 64-bit code is slower and requires more program memory than 32-bit code.
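
To see where the extra instructions come from, the sketch below writes a 64-bit addition out by hand as operations on two 32-bit halves. It illustrates the general carry-propagation technique and is not the instruction sequence the toolchain actually emits; the u64_parts type and the add64 function are hypothetical names.

```c
#include <stdint.h>

/* A 64-bit value split into its two 32-bit halves. */
typedef struct {
  uint32_t lo;
  uint32_t hi;
} u64_parts;

/* 64-bit addition built from 32-bit operations only. */
u64_parts add64(u64_parts a, u64_parts b) {
  u64_parts r;
  r.lo = a.lo + b.lo;            /* low halves, wrapping modulo 2^32 */
  uint32_t carry = r.lo < a.lo;  /* a wrap-around means a carry out */
  r.hi = a.hi + b.hi + carry;    /* high halves plus the carry */
  return r;
}
```

One 64-bit addition turns into a low add, a carry computation and a high add, which matches the two to three instructions mentioned above. When the value range allows it, declaring counters and indices as 32-bit types avoids this cost entirely.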

Multi-threaded programs are more efficient than single-threaded

The DPU pipeline reaches its nominal performance (about 1 instruction per cycle) when there are more active threads than the depth of the pipeline.

It is recommended to implement algorithms with 16 active tasklets to absorb the latency of memory accesses.
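
A common way to keep 16 tasklets busy is to give each one a strided share of the input. The sketch below models this in plain C so that it runs without the UPMEM toolchain: the tasklet identifier is passed as a parameter, whereas in a DPU program each tasklet would obtain its own identifier from the runtime. NR_TASKLETS is defined here for illustration, and partial_sum and total_sum are hypothetical names.

```c
#include <stdint.h>

#ifndef NR_TASKLETS
#define NR_TASKLETS 16 /* assumption: number of active tasklets */
#endif

/* Work done by one tasklet: it handles elements id, id + 16, id + 32, ...
 * The stride needs no multiplication or division. */
uint32_t partial_sum(const uint32_t *data, uint32_t n, uint32_t tasklet_id) {
  uint32_t sum = 0;
  for (uint32_t i = tasklet_id; i < n; i += NR_TASKLETS)
    sum += data[i];
  return sum;
}

/* Sequential stand-in for the 16 tasklets running concurrently, followed by
 * the final reduction of their partial results. */
uint32_t total_sum(const uint32_t *data, uint32_t n) {
  uint32_t total = 0;
  for (uint32_t id = 0; id < NR_TASKLETS; id++)
    total += partial_sum(data, n, id);
  return total;
}
```

Each tasklet touches a disjoint set of elements, so the partial sums need no locking; only the final reduction has to wait for all tasklets to finish.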

Floating-point support

Although the compiler accepts floating-point types natively, floating-point operations are emulated in software. As a consequence, they are very slow and should be avoided.
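
Fixed-point arithmetic is one common substitute when fractional values are needed. The sketch below uses a Q16.16 format (16 integer bits, 16 fractional bits); that format is an assumption chosen for illustration, and the q16_16 type and helper names are hypothetical. Addition is a plain 32-bit integer add, while multiplication needs a 64-bit intermediate product and is therefore not free either.

```c
#include <stdint.h>

/* Q16.16 fixed-point value: the real number v is stored as v * 65536. */
typedef int32_t q16_16;

#define Q16_FRACTION_BITS 16

/* Sum of two Q16.16 values: an ordinary integer addition. */
q16_16 q16_add(q16_16 a, q16_16 b) {
  return a + b;
}

/* Product of two Q16.16 values. The 64-bit intermediate keeps the full
 * product before it is rescaled. Right-shifting a negative value is
 * implementation-defined in C, so this sketch is intended for
 * non-negative operands. */
q16_16 q16_mul(q16_16 a, q16_16 b) {
  int64_t product = (int64_t)a * (int64_t)b;
  return (q16_16)(product >> Q16_FRACTION_BITS);
}
```

For example, 1.5 is stored as 0x00018000 and 2.0 as 0x00020000; q16_mul of the two gives 0x00030000, which represents 3.0.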

Host applications

Data sharing

Communication with the DPU WRAM is slower than copies to/from MRAM. Moreover, WRAM is much smaller than MRAM. As a consequence, the DPU WRAM should be used to share small amounts of data (tens to a hundred bytes). To share large buffers, use copies to/from MRAM.

Memory locality

Each DPU can only access data in its own MRAM. It is recommended to organize the data flow so that DPU execution is as independent as possible of external data.


© Copyright 2015-2024, UPMEM SAS - All rights reserved.