Profiling DPU binary
dpu-statistics mode
How to use it
The UPMEM dpu-profiling tool profiles a DPU program and reports how much time is spent in each of its functions.
To use this feature, the DPU program must be compiled with the -pg compilation option.
The program is then executed through dpu-profiling.
The program can be either a host application that offloads the DPU program, or the DPU program itself run through dpu-lldb:
dpu-profiling dpu-statistics -- <host_program>
dpu-profiling dpu-statistics -- dpu-lldb --batch --one-line run -- <dpu_program>
Example
Considering the following DPU program (in dpu_statistics_profiling.c):
#define BASE 10000

int foo() {
    int res = 0;
    for (int i = 0; i < 15; i++) {
        res++;
    }
    return res;
}

int bar() {
    int res = 0;
    for (int i = 0; i < 85; i++) {
        res++;
    }
    return res;
}

int main() {
    int res = 0;
    for (int i = 0; i < BASE; i++) {
        res += foo() + bar();
    }
    return res;
}
Compile it with the -pg option:
dpu-upmem-dpurte-clang -pg dpu_statistics_profiling.c -o dpu_statistics_profiling
The program can be executed with dpu-lldb:
dpu-profiling dpu-statistics -- dpu-lldb --batch --one-line run -- dpu_statistics_profiling
And it will produce the following output:
**************
*** Global ***
**************
main (#occ = 1) = 0.000517
__bootstrap (#occ = 1) = 0.000517
foo (#occ = 29034) = 15.000620
bar (#occ = 164516) = 84.998347
Total samples = 193552
dpu-sections mode
How to use it
The UPMEM dpu-profiling tool can also report how much time each thread spends in a specified code section, relative to the total duration of the program.
This is especially useful when the DPU program is composed of a small number of functions, or when link-time optimization (LTO) is used; in both cases function-based profiling provides less information.
To use this feature, the DPU program must be compiled with the -pg compilation option.
The program is then executed through dpu-profiling.
The program can be either a host application that offloads the DPU program, or the DPU program itself run through dpu-lldb:
dpu-profiling dpu-sections -- <host_program>
dpu-profiling dpu-sections -- dpu-lldb --batch --one-line run -- <dpu_program>
Please note that the dpu-sections mode is only accurate when using UPMEM DIMMs.
Example
Considering the following DPU program (in dpu_sections_profiling.c):
#include <profiling.h>

#define BASE 10000

PROFILING_INIT(foo);
PROFILING_INIT(bar);

int main() {
    int res = 0;
    for (int i = 0; i < BASE; i++) {
        profiling_start(&foo);
        for (int i = 0; i < 15; i++) {
            res++;
        }
        profiling_stop(&foo);
        profiling_start(&bar);
        for (int i = 0; i < 85; i++) {
            res++;
        }
        profiling_stop(&bar);
    }
    return res;
}
Compile it with the -pg option:
dpu-upmem-dpurte-clang -pg dpu_sections_profiling.c -o dpu_sections_profiling
The program can be executed with dpu-lldb:
dpu-profiling dpu-sections -- dpu-lldb --batch --one-line run -- dpu_sections_profiling
And it will produce the following output:
****************
*** Sections ***
****************
bar:
thread#00 = 81.319500%
foo:
thread#00 = 15.338500%
How it works / Advanced usage
When the clang -pg option is used, the compiler adds a call to the mcount function at each function entry.
Three profiling modes are currently available:
nop:
The nop mode patches the calls to the mcount function with a nop instruction.
statistics:
In the statistics mode, the mcount function calls are patched with a sh instruction that writes the thread PC to a dedicated WRAM area. The host then reads this location at regular intervals to count how many times a thread was observed in a given function.
sections:
For a given DPU code section (delimited by the profiling_start and profiling_stop functions), the number of cycles each thread spends in the section is counted using the performance counter. The total duration of the DPU program is also recorded, so that the share of time spent in the section can be computed.
The mode can be selected using the enableProfiling property in the DPU profile (examples: "enableProfiling=statistics" or "enableProfiling=nop").
The UPMEM dpu-profiling tool uses either the statistics or the sections mode, depending on the specified parameter, and sets up the DPU profile automatically.
Notes
If a DPU function should not be profiled, add __attribute__((no_instrument_function)) to the function definition.
Only one DPU can be profiled in a rank at once, but a specific DPU can be chosen with the following properties in the DPU profile: profilingDpuId and profilingSliceId (the default value is 0 for both properties).