Using the dputrace tool

dputrace provides an exact representation of the low-level execution of a DPU program. Each executed assembly instruction is printed, with detailed information on the different registers modified by the instruction. This tool is mainly a low-level/assembly debugging helper.

How to generate trace files

Trace files can only be generated by a DPU running on the functional simulator backend. When the environment variable UPMEM_TRACE_DIR is set, the simulator writes a binary trace file into the directory it specifies. The file is named trace-XXXX-YY, where XXXX is the DPU rank ID and YY is the DPU ID within the rank. These trace files are directly usable by the dputrace tool.

Example

We are going to consider a rather simple DPU program:

__attribute__((noinline))
int foo(int x) {
    return x * 5;
}

int main() {
    return foo(4) + 2;
}

Let’s build the program:

dpu-upmem-dpurte-clang -DNR_TASKLETS=16 dputrace_example.c -o dputrace_example -O2

We now need to execute this program. Here, we use dpu-lldb as a runner, but any host application based on the Host API would work. From a terminal, we first set UPMEM_TRACE_DIR and make sure that the program runs on the functional simulator:

export UPMEM_TRACE_DIR=.
export UPMEM_PROFILE_BASE=backend=simulator

We can now launch dpu-lldb and enter the following commands:

file dputrace_example
process launch
exit

A new file, trace-0000-00, is available in the UPMEM_TRACE_DIR directory (in this example, the current directory). Now let’s execute dputrace, focusing on thread 1 (hence the -t 1 option):

dputrace -i trace-0000-00 --print-trace-count -t 1 -no-color -no-tree

[#0x00000008][01@0x80000000]   jnz id (0x00000001), 0x80000030
[#0x0000000a][01@0x80000030]   jeq id (0x00000001), 15 (0xf), 0x80000040
[#0x0000000c][01@0x80000038]   boot id (0x00000001), 1 (0x1)
[#0x0000000e][01@0x80000040]   ld d22 (0x000004d800000400), id8 (0x00000008), 24 (0x18)                   R: 0x00000020 (__sys_thread_stack_table_ptr + 8)
[#0x00000011][01@0x80000048]   call r23 (0x0000000a), 0x80000068                                          => main
[#0x00000014][01@0x80000068]   sd r22 (0x000004d8), 0 (0x0), d22 (0x000004d80000000a)                     W: 0x000004d8
[#0x00000017][01@0x80000070]   add r22 (0x000004e0), r22 (0x000004d8), 8 (0x8)
[#0x0000001b][01@0x80000078]   move r0 (0x00000004), 4 (0x4)
[#0x0000001f][01@0x80000080]   call r23 (0x00000011), 0x80000058                                          => foo
[#0x00000023][01@0x80000058]   lsl_add r0 (0x00000014), r0 (0x00000004), r0 (0x00000004), 2 (0x2)
[#0x00000028][01@0x80000060]   jump r23 (0x00000011)                                                      <= foo
[#0x0000002d][01@0x80000088]   add r0 (0x00000016), r0 (0x00000014), 2 (0x2)
[#0x00000031][01@0x80000090]   ld d22 (0x000004d80000000a), r22 (0x000004e0), -8                          R: 0x000004d8
[#0x00000036][01@0x80000098]   jump r23 (0x0000000a)                                                      <= main
[#0x0000003b][01@0x80000050]   stop true, 0x80000050

Each trace follows this schema:

Trace counter | Thread @ PC | Flags | Disassembled instruction | Function information

The trace counter should not be interpreted as a cycle counter, as the functional simulator is not cycle-accurate. It mainly serves as a marker, making it possible to locate the same point from one execution of the program to another.

Currently, the only flag is the “replay” flag, represented by an *. It indicates that the instruction was executed twice because of hardware constraints (see Efficient scheduling for more details).

The disassembled instruction embeds the value of the input registers before executing the instruction, and the value of the output register, if any, after executing the instruction.

When the symbols are available, dputrace will try to give some context from the function calls, and provide some information when entering and exiting a function.

Host traces

When running dputrace without specifying a thread, some other traces are displayed, describing the host actions. Here is a brief description of the different possible traces:

Loading program:

[   LOAD PROGRAM    ]                   main.dpu

Signals that a program is going to be loaded in the DPU, giving the path to the executable.

Writing IRAM instruction:

[   WRITE IRAM      ][CI@0x80000000]    jnz id, 0x80000020

Signals that an instruction has been written by the host, giving the instruction address and its disassembled form.

Watch mode

In the previous example, the program had finished executing before dputrace was started. This workflow is not always flexible enough, for example when running a very long application, or when using lldb for an interactive execution. In these cases, one can use the -w option to enable the watch mode. In that mode, dputrace does not exit at the end of the file. Instead, it waits for more input and processes it on the fly.

Advanced options

Some parts of the DPU execution are hidden from the dputrace output. They correspond to programs executed by the Host API, which are usually of no interest to an end user. One can manage this setting with the -enable-event and -disable-event options. A list of the different possible values can be printed with the -list-events option.