DPU Handbook

1) Introduction

The DPU is a multithreaded 32-bit processor that has several hardware threads available, depending on the version of the DPU:

On a v1A DPU, there are 24 threads, indexed from 0 through 23.

On a v1B DPU, there are 16 threads, indexed from 0 through 15.

A thread can be running or stopped. The state of thread i is reflected in bit i of the 24 lsb of a 64-bit register named RUN (the 40 msb are used for other purposes, described later):

RUN[ i ] = 0 -> the thread i is stopped (not executing),
RUN[ i ] = 1 -> the thread i is running (executing).

The full performance of the DPU is achieved when enough hardware threads are running to keep the DPU pipeline filled (more than 10 threads). Note that ‘overfilling’ the pipeline is recommended to compensate for the fact that threads issuing DMA instructions are temporarily removed from the pipeline.

2) DPU state

2.1) Threads 32-bit registers

A thread has access to 32 x 32-bit registers:

24 general purpose 32-bit registers, private to the thread: R0 - R23
4 fixed 32-bit registers, common to all threads:
- ZERO : fixed to the value 0,
- ONE : fixed to the value 1,
- LNEG : fixed to the value 0xFFFFFFFF (Least NEGative),
- MNEG : fixed to the value 0x80000000 (Most NEGative).
4 fixed 32-bit registers, private to the thread:
- ID : fixed to the thread index.
- ID2 : fixed to the thread index x 2.
- ID4 : fixed to the thread index x 4.
- ID8 : fixed to the thread index x 8.

2.1.1) R0 - R23 seen as Stack Registers

The 24 general-purpose 32-bit registers can also be seen as 24 x 32-bit stack registers: S0 - S23.

Some instructions support the specification of an Sn register instead of an Rn register. While the register value is unchanged, the way this value is used is changed, as described in the stack exceptions chapter.

2.1.2) R0 - R23 register pair seen as 64-bit registers

The 24 general-purpose 32-bit registers can also be seen as 12 x 64-bit registers:

D0 = { R0 , R1 }
D2 = { R2 , R3 }
D4 = { R4 , R5 }
D6 = { R6 , R7 }
D8 = { R8 , R9 }
D10 = { R10 , R11 }
D12 = { R12 , R13 }
D14 = { R14 , R15 }
D16 = { R16 , R17 }
D18 = { R18 , R19 }
D20 = { R20 , R21 }
D22 = { R22 , R23 }

The Dn 32-msb are held by the even Rn register, the 32-lsb being held by the odd Rn+1 register.
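
The pairing rule can be illustrated with a minimal Python sketch. The function names and the list-based register file are assumptions made for this example only; they are not part of any DPU tool.

```python
MASK32 = 0xFFFFFFFF

def _check_pair_index(n):
    # Dn exists only for even n in the range 0..22.
    if n % 2 != 0 or not 0 <= n <= 22:
        raise ValueError("Dn exists only for even n in 0..22")

def read_d(regs, n):
    """Return the 64-bit value of Dn from a list of 24 32-bit registers."""
    _check_pair_index(n)
    # The even register holds the 32 msb, the odd register the 32 lsb.
    return ((regs[n] & MASK32) << 32) | (regs[n + 1] & MASK32)

def write_d(regs, n, value):
    """Write a 64-bit value into the register pair that forms Dn."""
    _check_pair_index(n)
    regs[n] = (value >> 32) & MASK32
    regs[n + 1] = value & MASK32
```

For instance, writing 0x1122334455667788 to D2 places 0x11223344 in R2 and 0x55667788 in R3.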

2.3) Threads PC register

A thread comprises a PC register, whose width is implementation-dependent:

the PC width is in the range 12-16 bits,
the first DPU implementation has a 12-bit PC.

Note: the PC contains an instruction address, not a byte address.

2.4) Naming Conventions

To specify which operands are allowed for each instruction, the following naming conventions are used:

#32 : a 32-bit immediate value.
#28 : a 28-bit immediate value, sign extended to 32-bit.
#27 : a 27-bit immediate value, sign extended to 32-bit.
#24 : a 24-bit immediate value, sign extended to 32-bit.
#16 : a 16-bit immediate value, that is, according to the instruction considered, either:
- not extended,
- sign extended to 32-bit,
- sign extended to 64-bit.
#8 : an 8-bit immediate value.
#WRAM : an immediate signed value whose width is p + 1 when the WRAM size is 2 ^ p.
disp24 : a 24-bit immediate value.
disp12 : a 12-bit immediate value, sign extended to 24-bit.
#28-PC : an immediate value whose width is 28 minus the width of PC, sign-extended to 32-bit.
#27-PC : an immediate value whose width is 27 minus the width of PC, sign-extended to 32-bit.
#24-PC : an immediate value whose width is 24 minus the width of PC, sign-extended to 32-bit.
#PC : an immediate value whose width is the width of PC.
#6 : a 6-bit immediate value.
#5 : a 5-bit immediate value.
Rm, Rn, Rp : one of the registers R0-R23.
Rnx : one of the registers R0-R23 or one of the fiXed registers: ZERO, ONE, LNEG, MNEG, ID, ID2, ID4 or ID8.
Rmz : one of the registers R0-R23 or the ZERO register.
Dm, Dp : one of the registers D0-D22.
Dmz : one of the registers D0-D22 or the ZERO register.
Xm : one of the registers R0-R23 or D0-D22.
Xmz : one of the registers R0-R23, D0-D22 or the ZERO register.

2.5) Threads ZF and CF flags

To help the execution of 64-bit arithmetic, each thread has 2 x 1-bit flags:

ZF : Zero Flag,
CF : Carry Flag.

2.6) The TIME register

This 36-bit register is common to all the threads. According to its configuration, TIME either:

stays unchanged,
increments at every cycle,
increments at every executed instruction.

The TIME_CFG (TIME ConFiGure) instruction allows:

the optional setting of the TIME configuration
the optional clearing of TIME[35:0]

The 32-msb of TIME can be obtained through the TIME and TIME_CFG instructions.

2.6) The IRAM

A DPU comprises an Instruction memory named IRAM holding 2 ^ p 48-bit wide instructions, where p is the PC width.

PC_width and instruction encoding

In many instructions, the width of the immediate value that can be encoded varies inversely with the PC width.
The current implementation has a 12-bit PC but supports (through the configuration by the HCPU of the PC_MODE control register) the execution of binaries generated for DPUs with a larger PC width, as long as these binaries fit into the IRAM.

The IRAM can be accessed:

by the HCPU through the control interface,
- The HCPU can read/write the IRAM even when threads are running.
by the DPU through the execution of ldmai instructions, which write MRAM data into the IRAM,
- the DPU reads the IRAM only through the fetching of instructions.

2.7) The WRAM

The WRAM is a 64 KB memory that is accessible:

by the HCPU through the control interface,
by the DPU through:
- 8-bit, 16-bit, 32-bit and 64-bit load/store instructions,
- ldma/sdma instructions.

Note 1: The WRAM has a 24-bit wide address space, where currently only the range 0x000000 - 0x00FFFF is used.

Note 2: On the v1B DPU, only 63488 bytes are usable.

2.7.1) Load/Store Memory Exception

Load/Store generates a memory exception when:

the address is not aligned with respect to the access size,
the address is outside the range 0 to (64 KB – 1),
the address is a stack address and crosses its associated bound.

Note: exception handling is performed by the HCPU.

2.7.2) Stack Overflow Exception

Since up to 24 threads are running, up to 24 different stacks are present, thus the DPU comprises a hardware mechanism to detect stack overflow early on.

2.7.2.1) Stack Overflow Exception Caused By Load/Store

Load/Store allows the specification of an Sn register instead of an Rn register as the base of the effective address calculation. While Sn and Rn contents are identical, specifying an Sn register changes the way this content is used.

Considering a WRAM of size 2 ^ p, then:

Sn [31 : p ] contains the stack bound address (or its MSB),
Sn [ p -1:0 ] contains the current stack address.

The stack bound address encoding is adapted to the WRAM size as follows:

64 KB : stack bound [15:0] is Rn[31:16]
128 KB : stack bound [16:0] is { Rn[31:17], 00 }
256 KB : stack bound [17:0] is { Rn[31:18], 0000 }
512 KB : stack bound [18:0] is { Rn[31:19], 000000 }
1 MB : stack bound [19:0] is { Rn[31:20], 00000000 }

The STACK_UP control register (configurable by the HCPU) specifies the progression direction for all the stacks.

STACK_UP set … upward progressing stacks: an Sn-based load/store at an address greater than or equal to the stack bound address generates a memory exception.
STACK_UP cleared … downward progressing stacks: an Sn-based load/store at an address strictly smaller than the stack bound address generates a memory exception.
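
The bound decoding and the direction-dependent check can be sketched in Python as follows. This is an illustrative model only: the function names are hypothetical, the effective address is assumed to be computed beforehand, and p is limited to the WRAM sizes listed in the table (16 through 20).

```python
def stack_bound(sn, p):
    """Decode the stack bound address from Sn for a WRAM of 2**p bytes."""
    if not 16 <= p <= 20:
        raise ValueError("only the WRAM sizes 64 KB to 1 MB are modelled")
    msb = (sn & 0xFFFFFFFF) >> p        # Sn[31:p]
    return msb << (2 * (p - 16))        # append 0, 2, 4, 6 or 8 zero bits

def stack_pointer(sn, p):
    """Return the current stack address, Sn[p-1:0]."""
    return sn & ((1 << p) - 1)

def stack_access_faults(address, sn, p, stack_up):
    """True when an Sn-based load/store at 'address' crosses the bound."""
    bound = stack_bound(sn, p)
    if stack_up:
        return address >= bound         # upward progressing stack
    return address < bound              # downward progressing stack
```

With a 64 KB WRAM and Sn = 0x10008000, the bound is 0x1000 and the current stack address is 0x8000; on a downward progressing stack an access at 0x0FFF is rejected while an access at 0x1000 is accepted.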

2.7.2.2) Stack Overflow Exception Caused by Addition/Subtraction

An addition/subtraction to a stack pointer must keep the MSB of this stack pointer unchanged, as these MSB specify the stack bound. Thus add/addc/sub/subc/rsub/rsubc instructions with an Sn register specified as first source operand will generate an exception if the result [31: p ] differs from Sn [31: p ].

Note: when an addition/subtraction has an Sn register as the first source operand, the assembler allows an Sm register to be used as the destination register, for naming consistency.
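
A minimal Python sketch of this check, assuming a plain 32-bit addition (carry-in and reverse subtraction are not modelled; the function name is hypothetical):

```python
def stack_add_faults(sn, addend, p):
    """True when Sn + addend changes bits [31:p], which encode the stack bound."""
    sn &= 0xFFFFFFFF
    result = (sn + addend) & 0xFFFFFFFF
    return (result >> p) != (sn >> p)
```

With p = 16 and Sn = 0x10008000, adding 8 is accepted, while adding 0x8000 carries into bit 16 and would therefore raise the exception.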

2.8) The MRAM

The MRAM is a 64 MB memory accessible:

by the HCPU through the DDR4 legacy interface,
by the DPU through ldma instructions,
by the DPU through sdma instructions.

2.9) The ATOMIC memory

This 256-bit memory is used for thread synchronization.

A bit of the ATOMIC memory can be set by a thread through the ACQUIRE instruction,
- the thread conditionally jumping according to the bit initial value.
A bit of the ATOMIC memory can be cleared by a thread through the RELEASE instruction,
- the thread conditionally jumping according to the bit initial value.
A bit of the ATOMIC memory can be set or cleared by the HCPU through the control interface,
- the HCPU obtaining in return the bit initial value.

2.10) The RUN memory

The RUN memory is a 64-bit memory used to manage threads and HCPU synchronization:

The bits [0] through [23] reflect the status of the 24 threads:
- RUN [ i ] set means the thread i is running,
- RUN [ i ] cleared means the thread i is stopped (not running).
The bits [24] through [63] are used for DPU / HCPU synchronization.
- the DPU can set/clr these bits through the CLR_RUN and BOOT instructions,
  - the thread conditionally jumping according to the bit initial value.
- the HCPU can set/clr these bits through the control interface,
  - the HCPU obtaining in return the bit initial value.

3) Result Destination

3.1) ZERO as destination register

When the specified destination register is the ZERO register, then the instruction 32-bit or 64-bit result is discarded, the remaining functionality of the instruction being performed as usual.

3.2) The ‘.u’ and ‘.s’ instruction modifiers

Instructions generating 32-bit results can be modified:

by adding to the mnemonic the postfix ‘.u’: the instruction now generates a 64-bit result made by the zero-extension of the initial 32-bit result,
- now the destination register must be a Dm 64-bit register.
- a 32-bit result that is made by the sign extension of a smaller result cannot be zero-extended to 64-bit.
  - For example LBS.u is illegal.
by adding to the mnemonic the postfix ‘.s’: the instruction now generates a 64-bit result made by the sign-extension of the initial 32-bit result,
- now the destination register must be a Dm 64-bit register.
- a 32-bit result that is made by the zero extension of a smaller result cannot be sign-extended to 64-bit.
  - For example LBU.s is illegal.

To cope with the multiple possible combinations, the instruction description uses:

Xm to refer to Rm or Dm, depending on whether or not the instruction is used with the ‘.u’ or ‘.s’ modifier.
Xmz to refer to Xm or the ZERO register.
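
The effect of the two modifiers on a 32-bit result can be modelled in Python as below. The function names are illustrative assumptions.

```python
def extend_u(result32):
    """Model of '.u': zero-extend a 32-bit result to 64 bits."""
    return result32 & 0xFFFFFFFF

def extend_s(result32):
    """Model of '.s': sign-extend a 32-bit result to 64 bits."""
    r = result32 & 0xFFFFFFFF
    if r & 0x80000000:
        return r | 0xFFFFFFFF00000000   # replicate bit 31 into the 32 msb
    return r
```

For the 32-bit result 0x80000000, the ‘.u’ form yields 0x0000000080000000 and the ‘.s’ form yields 0xFFFFFFFF80000000.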

4) Jump & Boolean Conditions

4.1) Introduction

Most DPU instructions support conditions based on their result or the properties of one of their source operands:

an instruction can include a condition, such that, after having performed its native functionality:
- the instruction execution continues at a specified address if the condition is true,
- the instruction execution continues sequentially otherwise.
an instruction can include a condition, such that, after having performed its native functionality and generated a native result:
- the instruction, instead of writing its native result, replaces this result with the Boolean value that corresponds to the trueness of the condition,
- the instruction execution continuing sequentially.

The allowed conditions are specific to each instruction. They are specified as follows:

a conditional jump is specified by placing a condition identifier and an IRAM address after the original operands.
- in the instruction description the term Jcc means a Jump condition.
a Boolean Replacement is specified by placing only a condition identifier after the original operands.
- in the instruction description the term Bcc means a Boolean Replacement condition.

Examples:

add R2, R3, R4                  //      R2 =   R3 + R4 ;
add R2, R3, R4, z, null_result  // if ((R2 =   R3 + R4 ) == 0) GOTO null_result;
add R2, R3, R4, z               //      R2 = ( R3 + R4 ) == 0;

Note: the add instruction allows the same condition z as the Jump condition and as Boolean Replacement condition.

4.2) Condition Identifier

4.2.1) Common Conditions

t : true
z : true when the native result is null (Zero)
nz : true when the native result is not null (Not Zero)
sz : true when the first Source operand is null (Zero)
snz : true when the first Source operand is not null (Not Zero)
pl : true when the native result is positive (PLus)
mi : true when the native result is negative (MInus)
spl : true when the first Source operand is positive (PLus)
smi : true when the first Source operand is negative (MInus)

4.2.2) Specific Conditions Common To Addition and Subtraction

Notations

op1 : means the first operand
op2 : means the second operand

carry numbering

carry p is the carry generated by:

for         addition    :  op1[ *p* : 0 ] +  op2[ *p* : 0 ]
for         subtraction :  op1[ *p* : 0 ] + ~op2[ *p* : 0 ] + 1
for reverse subtraction : ~op1[ *p* : 0 ] +  op2[ *p* : 0 ] + 1

The v/nv/c/nc Conditions

v : true when an oVerflow has been generated by an addition or subtraction
nv : true when No oVerflow has been generated by an addition or subtraction
c : true when:
- a Carry 31 is generated by an addition,
- no carry 31 is generated by a subtraction.
nc (No Carry) condition is the opposite of c.
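
The carry numbering and the c condition can be checked with a small Python model. The function names and the 'kind' parameter are assumptions of this sketch; reverse subtraction is treated as a subtraction for the c condition.

```python
def carry(p, op1, op2, kind="add"):
    """Carry p: the carry out of the (p + 1)-bit sum defined above."""
    mask = (1 << (p + 1)) - 1
    a, b = op1 & mask, op2 & mask
    if kind == "add":
        total = a + b
    elif kind == "sub":
        total = a + (~b & mask) + 1
    elif kind == "rsub":
        total = (~a & mask) + b + 1
    else:
        raise ValueError("kind must be 'add', 'sub' or 'rsub'")
    return (total >> (p + 1)) & 1

def cond_c(op1, op2, kind="add"):
    """c: carry 31 for an addition, no carry 31 for a subtraction."""
    c31 = carry(31, op1, op2, kind)
    return c31 == 1 if kind == "add" else c31 == 0
```

For example, 5 - 3 generates a carry 31 (so c is false), whereas 3 - 5 generates none (so c is true).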

4.2.3) Addition Specific Conditions

nc4  : true when no carry 4  is generated
nc5  : true when no carry 5  is generated
nc6  : true when no carry 6  is generated
nc7  : true when no carry 7  is generated
nc8  : true when no carry 8  is generated
nc9  : true when no carry 9  is generated
nc10 : true when no carry 10 is generated
nc11 : true when no carry 11 is generated
nc12 : true when no carry 12 is generated
nc13 : true when no carry 13 is generated

Why So Many No Carry Conditions ?

Because considering:

a memory buffer of size 2 ^ s, aligned onto its own size,
a pointer initially pointing inside this memory buffer, this pointer being used to read/write data from/to this buffer,
the addition or subtraction performed onto this pointer after each access to this buffer,
THEN:
- nc s true means the new pointer value is still inside this buffer,
- nc s false means the new pointer value is now outside this buffer.

Note: these conditions work whether the added value is positive or negative, as long as its absolute value is strictly smaller than the buffer size.

4.2.4) Comparison specific Conditions

These conditions are available only for instructions based on subtraction, except the lsl_sub instruction (that performs a shift then a subtraction):

ltu : op1  < op2  // unsigned comparison
geu : op1 >= op2  // unsigned comparison
leu : op1 <= op2  // unsigned comparison
gtu : op1  > op2  // unsigned comparison
lts : op1  < op2  //   signed comparison
ges : op1 >= op2  //   signed comparison
les : op1 <= op2  //   signed comparison
gts : op1  > op2  //   signed comparison

4.2.5) Extended Z Conditions

When an instruction supports the z/nz conditions, then sequentially:

it generates internally a z property based on the instruction native result,
it generates internally an extended z property: z && ZF.

Extended conditions use the extended z where the non-extended conditions use z:

 xz  : true when the extended z is true
nxz  : true when the extended z is false
xleu : op1 <= op2  // unsigned comparison using extended z instead of z
xgtu : op1 >  op2  // unsigned comparison using extended z instead of z
xles : op1 <= op2  //   signed comparison using extended z instead of z
xgts : op1 >  op2  //   signed comparison using extended z instead of z

Extended conditions ease the construction of conditions on 64-bit results.
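
As an illustration, a 64-bit equality test can be modelled in Python as a sub on the low words followed by a subc on the high words tested with xz. This is a sketch under the definitions of this handbook (result = op1 + ~op2 + carry-in, ZF updated with z); the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def sub32(op1, op2, carry_in=1):
    """op1 + ~op2 + carry_in on 32 bits; returns (result, carry_out, z)."""
    total = (op1 & MASK32) + (~op2 & MASK32) + carry_in
    result = total & MASK32
    return result, total >> 32, result == 0

def equal64(a, b):
    """64-bit equality built from two 32-bit subtractions."""
    # Low words: sub updates ZF with its z property and CF with carry 31.
    _, cf, zf = sub32(a, b)
    # High words: subc consumes CF; the xz condition is z && ZF.
    _, _, z = sub32(a >> 32, b >> 32, cf)
    return z and zf
```

The case a = 0x100000000, b = 1 shows why z alone is not sufficient: the high-word result is null after the borrow, but ZF from the low words is clear, so xz is false.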

ZF update

The ZF flag is left unchanged by the following instructions:

ldma, ldmai, sdma                             (DMA   )
sb, sh, sw, sd, sb_id, sh_id, sw_id, sd_id    (Stores)
lbu, lbs, lhu, lhs, lw, ld                    (Loads )
acquire, release, stop, clr_run, boot, resume (ATOMIC)
nop, bkp, call

The other instructions update the ZF flag with the z property (BEWARE: not with the extended z property).

4.2.6) Shift specific Conditions

 se   : true when op1[0] == 0  // Source Even
 so   : true when op1[0] == 1  // Source Odd
nsh32 : true when op2[5] == 0  // Not Shift 32
 sh32 : true when op2[5] == 1  //     Shift 32

The sh32/nsh32 conditions can be used to speedup 64-bit shift operations where the shift amount is in the 6-lsb of a register, as they enable a quick differentiation of the following cases:

shift by 32 bits or more,
shift by strictly less than 32 bits.

The se/so conditions can be used to speedup 64-bit right shift by 1-bit, as they enable a quick differentiation of the following cases:

a 1 will be shifted out by a 1-bit right shift,
a 0 will be shifted out by a 1-bit right shift.
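
The sh32/nsh32 case split can be sketched in Python for a 64-bit logical shift left built from 32-bit steps. The function name and the (hi, lo) tuple convention are assumptions of this example.

```python
MASK32 = 0xFFFFFFFF

def lsl64(hi, lo, amount):
    """Shift the 64-bit value {hi, lo} left by amount[5:0]; returns (hi, lo)."""
    hi &= MASK32
    lo &= MASK32
    n = amount & 0x1F                 # the 5 lsb used by a 32-bit shift
    if amount & 0x20:                 # sh32: shift by 32 bits or more
        return (lo << n) & MASK32, 0
    # nsh32: shift by strictly less than 32 bits.
    carried = lo >> (32 - n)          # bits moving from lo into hi (0 when n == 0)
    return ((hi << n) & MASK32) | carried, (lo << n) & MASK32
```

Testing bit 5 of the shift amount once selects between the two code paths, so no comparison against 32 is needed.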

4.2.7) Bit Count Specific Conditions

max (MAXimal result) true when the result is equal to:
- 32 for the CAO instruction ( Count All Ones )
- 32 for the CLO instruction ( Count Leading ones )
- 32 for the CLZ instruction ( Count Leading Zero )
- 31 for the CLS instruction ( Count Leading Sign )
nmax (Not MAXimal result) is the opposite of max

These conditions speed up bit counting operations that span multiple 32-bit words.
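
For example, counting the leading zeros of a value stored in several 32-bit words only needs to continue to the next word while the max condition holds. The Python sketch below models this; the function names and the most-significant-word-first ordering are assumptions.

```python
def clz32(word):
    """Count the leading zeros of a 32-bit word (32 for a null word)."""
    return 32 - (word & 0xFFFFFFFF).bit_length()

def clz_multiword(words):
    """Count leading zeros over 32-bit words given most significant first."""
    total = 0
    for word in words:
        count = clz32(word)
        total += count
        if count != 32:   # nmax: a set bit was found, stop here
            break
    return total
```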

4.2.8) 8-bit Multiply Specific Conditions

small : (op1[15:8] == 0) && (op2[15:8] == 0)
large : opposite of small

These conditions detect the case where a 16 x 16 multiply can be reduced to a single 8 x 8 multiply.
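
A Python model of the small condition and of the shortcut it enables is given below. The decomposition of the large case into four 8 x 8 partial products is an assumption of this sketch, as are the function names.

```python
def cond_small(op1, op2):
    """small: bits [15:8] of both operands are null."""
    return ((op1 >> 8) & 0xFF) == 0 and ((op2 >> 8) & 0xFF) == 0

def mul16(op1, op2):
    """16 x 16 multiply built from 8 x 8 multiplies."""
    a, b = op1 & 0xFFFF, op2 & 0xFFFF
    if cond_small(a, b):
        return (a & 0xFF) * (b & 0xFF)          # a single 8 x 8 multiply
    # large: combine four 8 x 8 partial products.
    al, ah = a & 0xFF, a >> 8
    bl, bh = b & 0xFF, b >> 8
    return al * bl + ((al * bh + ah * bl) << 8) + ((ah * bh) << 16)
```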

5) LDMA / LDMAI / SDMA (DMA)

5.1) Generalities

A thread requests a DMA transfer through a DMA instruction.

DMA transfers are 64-bit aligned,
transfers sizes are n x 64-bit, where n ranges from 1 to 256,
When executing a DMA instruction, the thread is suspended for the duration of the transfer,
- the thread is temporarily absent from the pipeline,
  - it is useful to have more than 11 threads running, to compensate for the ones that temporarily leave the pipeline while they wait for the completion of a DMA instruction.
- the thread RUN bit remaining set during this suspension.

The DMA is capable of:

Moving MRAM data to IRAM
- Only the 48-lsb of the 64-bit words are written into the IRAM, the 16-msb being discarded.
Moving MRAM data to WRAM
Moving WRAM data to MRAM

Note: MRAM, WRAM, and IRAM have separate address spaces.

Note: the HCPU is not capable of performing DMA operations.

5.2) Behaviours

ldma  #8, Rnx, Rp  // Load  WRAM (Rnx address) with MRAM (Rp address)
ldmai #8, Rnx, Rp  // Load  IRAM (Rnx address) with MRAM (Rp address)
sdma  #8, Rnx, Rp  // Store WRAM (Rnx address) into MRAM (Rp address)

Transfer size is: 1 + ((Rnx[30:24] + #8) & 0xFF), allowing transfers of 1 to 256 64-bit words.

Source and destination addresses are specified as follows:

ldma, ldmai, sdma: the 32-bit MRAM byte address is {Rp [31 :3], 0b000}
ldma, sdma       : the 24-bit WRAM byte address is {Rnx[23 :3], 0b000}
ldmai            : the IRAM instruction address is  Rnx[p+2:3] where p is the PC width
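
The size and address fields above can be decoded with the following Python sketch. The function names are hypothetical; the bit ranges are taken literally from the formulas of this section.

```python
def dma_transfer_words(rnx, imm8):
    """Number of 64-bit words transferred: 1 + ((Rnx[30:24] + #8) & 0xFF)."""
    return 1 + ((((rnx >> 24) & 0x7F) + imm8) & 0xFF)

def mram_address(rp):
    """32-bit MRAM byte address: {Rp[31:3], 0b000}."""
    return rp & 0xFFFFFFF8

def wram_address(rnx):
    """24-bit WRAM byte address: {Rnx[23:3], 0b000}."""
    return rnx & 0x00FFFFF8

def iram_address(rnx, pc_width):
    """IRAM instruction address: Rnx[p+2:3], where p is the PC width."""
    return (rnx >> 3) & ((1 << pc_width) - 1)
```

With Rnx[30:24] null, an immediate of 0 requests one 64-bit word and an immediate of 255 requests 256 words.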

For all DMA instructions, if the MRAM byte address lies beyond the implemented MRAM then the instruction fails and generates a memory exception. MRAM size in the first DPU implementation is 64 MB.

For ldma and sdma, if the WRAM byte address lies beyond the implemented WRAM then the instruction fails and generates a memory exception. WRAM size is 64 KB in v1A, and 63488 B in v1B.

For ldmai, if the IRAM instruction address lies beyond the implemented IRAM then the instruction fails and generates a memory exception. IRAM size is 4K instructions in v1A, and 3968 instructions in v1B.

Additional characteristics:

DMA instructions support neither jump conditions nor Boolean Replacement.
DMA instructions affect no registers.

6) Loads / Stores

6.1) Common Properties

The 24-bit effective address is given by the sum of a 24-bit displacement and the 24-lsb of the base Rnx register,
- Rnx[31:24] are ignored,
- for most instructions the displacement is a 24-bit immediate value,
- for some stores the 24-bit displacement is the sign extension of a 12-bit immediate value.
The access effective address must be aligned according to the access width,
ZF and CF flags are left unchanged,
no condition is supported.
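
The address calculation can be modelled in Python as below. The function names are illustrative, and the wrap-around of the sum to 24 bits is an assumption drawn from the 24-bit width of the effective address.

```python
def sign_extend(value, bits):
    """Interpret the 'bits' lsb of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def effective_address(rnx, disp24):
    """Sum of the 24 lsb of Rnx and the displacement, kept on 24 bits."""
    return ((rnx & 0xFFFFFF) + disp24) & 0xFFFFFF

def effective_address_disp12(rnx, disp12):
    """Same calculation for stores that encode a 12-bit displacement."""
    return effective_address(rnx, sign_extend(disp12, 12))

def is_aligned(address, size_bytes):
    """True when the address is a multiple of the access size."""
    return address % size_bytes == 0
```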

6.2) Loads

lbu    Xm, Rnx, disp24  // Xm is loaded with the Unsigned Byte @ Rnx + disp24   .s and .sb modifiers illegal
lbs    Xm, Rnx, disp24  // Xm is loaded with the   signed Byte @ Rnx + disp24   .u and .ub modifiers illegal
lhu    Xm, Rnx, disp24  // Xm is loaded with the Unsigned Half @ Rnx + disp24   .s and .sb modifiers illegal
lhs    Xm, Rnx, disp24  // Xm is loaded with the   signed Half @ Rnx + disp24   .u and .ub modifiers illegal
lw     Xm, Rnx, disp24  // Xm is loaded with the          Word @ Rnx + disp24
ld     Dm, Rnx, disp24  // Dm is loaded with the Double   word @ Rnx + disp24

6.3) Stores Register

sb     Rnx, disp24, Rp  // Rp[ 7:0] is stored @ Rnx + disp24
sh     Rnx, disp24, Rp  // Rp[15:0] is stored @ Rnx + disp24
sw     Rnx, disp24, Rp  // Rp       is stored @ Rnx + disp24
sd     Rnx, disp24, Dp  // Dp       is stored @ Rnx + disp24

6.4) Stores Immediate Value

sb     Rnx, disp12, #8   // store                     #8    @ Rnx + sign_extend24( disp12 )
sh     Rnx, disp12, #16  // store                     #16   @ Rnx + sign_extend24( disp12 )
sw     Rnx, disp12, #16  // store      sign_extend32( #16 ) @ Rnx + sign_extend24( disp12 )
sd     Rnx, disp12, #16  // store      sign_extend64( #16 ) @ Rnx + sign_extend24( disp12 )

6.5) Stores ID ORed With Immediate Value

sb_id  Rnx, disp12, #8   // store ID |                #8    @ Rnx + sign_extend24( disp12 )
sh_id  Rnx, disp12, #16  // store ID |                #16   @ Rnx + sign_extend24( disp12 )
sw_id  Rnx, disp12, #16  // store ID | sign_extend32( #16 ) @ Rnx + sign_extend24( disp12 )
sd_id  Rnx, disp12, #16  // store ID | sign_extend64( #16 ) @ Rnx + sign_extend24( disp12 )

6.6) Endianness Modifiers

By default, load/store instructions use the little-endian memory organization. Load/store instructions operating on 16-bit, 32-bit, or 64-bit data may have the ‘.b’ modifier added to their mnemonic, forcing them to use the big-endian memory organization.

For lhu, lhs and lw, use the .ub/.sb modifier to combine the .u/.s and .b modifiers.

7) Additions and Subtractions

Addition, result = op1 + op2

add    Xmz,  Sn ,  Rp
add    Xmz,  Rnx,  Rp
add    Xmz,  Rnx,  Rp   , Bcc
add    Xmz,  Rnx,  Rp   , Jcc, IRAM_address
-----------------------------------------------
add    Xmz,  Sn ,  #WRAM
add    ZERO, Rn ,  #32
add    Rm,   Rnx,  #32
add    Dm,   Rn ,  #32
add    ZERO, Rnx,  #27
add    Xm,   Rnx,  #24
-----------------------------------------------
add    Xm,   Rnx,  #24  , Bcc
add    ZERO, Rnx,  #27PC, Jcc, IRAM_address
add    Xm,   Rnx,  #24PC, Jcc, IRAM_address

Addition with Carry, result = op1 + op2 + CF

addc   Xmz,  Sn ,  Rp
addc   Xmz,  Rnx,  Rp
addc   Xmz,  Rnx,  Rp   , Bcc
addc   Xmz,  Rnx,  Rp   , Jcc, IRAM_address
-----------------------------------------------
addc   Xmz,  Sn ,  #WRAM
addc   ZERO, Rn ,  #32
addc   Rm,   Rnx,  #32
addc   ZERO, Rnx,  #27
addc   Xm,   Rnx,  #24
-----------------------------------------------
addc   Xm,   Rnx,  #24  , Bcc
addc   ZERO, Rnx,  #27PC, Jcc, IRAM_address
addc   Xm,   Rnx,  #24PC, Jcc, IRAM_address

Reverse subtraction, result = ~op1 + op2 + 1

rsub   Xmz,  Sn ,  Rp
rsub   Xmz,  Rnx,  Rp
rsub   Xmz,  Rnx,  Rp   , Bcc
rsub   Xmz,  Rnx,  Rp   , Jcc, IRAM_address
-----------------------------------------------
rsub   Xmz,  Sn ,  #WRAM
rsub   ZERO, Rn ,  #32
rsub   Rm,   Rnx,  #32
rsub   ZERO, Rnx,  #27
rsub   Xm,   Rnx,  #24
-----------------------------------------------
rsub   Xm,   Rnx,  #24  , Bcc
rsub   ZERO, Rnx,  #27PC, Jcc, IRAM_address
rsub   Xm,   Rnx,  #24PC, Jcc, IRAM_address

Reverse subtraction with Carry, result = ~op1 + op2 + CF

rsubc  Xmz,  Sn ,  Rp
rsubc  Xmz,  Rnx,  Rp
rsubc  Xmz,  Rnx,  Rp   , Bcc
rsubc  Xmz,  Rnx,  Rp   , Jcc, IRAM_address
-----------------------------------------------
rsubc  Xmz,  Sn ,  #WRAM
rsubc  ZERO, Rn ,  #32
rsubc  Rm,   Rnx,  #32
rsubc  ZERO, Rnx,  #27
rsubc  Xm,   Rnx,  #24
-----------------------------------------------
rsubc  Xm,   Rnx,  #24  , Bcc
rsubc  ZERO, Rnx,  #27PC, Jcc, IRAM_address
rsubc  Xm,   Rnx,  #24PC, Jcc, IRAM_address

Subtraction, result = op1 + ~op2 + 1

sub    Xmz,  Sn ,  Rp
sub    Xmz,  Rnx,  Rp
sub    Xmz,  Rnx,  Rp   , Bcc
sub    Xmz,  Rnx,  Rp   , Jcc, IRAM_address
-------------------------------------------------------------------
sub    Xmz,  Sn ,  #WRAM
sub    Dmz , Rn ,  #32    // replaced with add instructions
sub    Rm  , Rnx,  #32    // ...
sub    ZERO, Rnx,  #27
sub    Xm,   Rnx,  #24
-------------------------------------------------------------------
sub    Xm,   Rnx,  #24  , Bcc
sub    ZERO, Rnx,  #27PC, Jcc, IRAM_address
sub    Xm,   Rnx,  #24PC, Jcc, IRAM_address

Subtraction with carry, result = op1 + ~op2 + CF

subc   Xmz,  Sn ,  Rp
subc   Xmz,  Rnx,  Rp
subc   Xmz,  Rnx,  Rp   , Bcc
subc   Xmz,  Rnx,  Rp   , Jcc, IRAM_address
-------------------------------------------------------------------
subc   Xmz,  Sn ,  #WRAM
subc   ZERO, Rn ,  #32    // replaced with addc instructions
subc   Rm  , Rnx,  #32    // ...
subc   ZERO, Rnx,  #27
subc   Xm,   Rnx,  #24
-------------------------------------------------------------------
subc   Xm,   Rnx,  #24  , Bcc
subc   ZERO, Rnx,  #27PC, Jcc, IRAM_address
subc   Xm,   Rnx,  #24PC, Jcc, IRAM_address

7.1) CF update

As shown in the descriptions above, add/addc/sub/subc/rsub/rsubc use a 32-bit adder. These instructions update CF with the native carry 31 of this 32-bit adder. Another way of expressing the new CF value is:

when executing an ADD or ADDC,              CF is set to the c   condition,
when executing a SUB, SUBC, RSUB, or RSUBC, CF is set to the geu condition.
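
The following Python sketch models the adder and shows how CF chains a 64-bit addition from add and addc. The function names and the (hi, lo) tuple convention are assumptions of this example.

```python
MASK32 = 0xFFFFFFFF

def add32(op1, op2, carry_in=0):
    """op1 + op2 + carry_in on 32 bits; returns (result, carry 31)."""
    total = (op1 & MASK32) + (op2 & MASK32) + carry_in
    return total & MASK32, total >> 32

def sub32(op1, op2, carry_in=1):
    """op1 + ~op2 + carry_in on 32 bits; returns (result, carry 31)."""
    total = (op1 & MASK32) + (~op2 & MASK32) + carry_in
    return total & MASK32, total >> 32

def add64(a_hi, a_lo, b_hi, b_lo):
    """64-bit addition: add on the low words, addc on the high words."""
    lo, cf = add32(a_lo, b_lo)       # add:  CF receives carry 31
    hi, _ = add32(a_hi, b_hi, cf)    # addc: CF is consumed as carry-in
    return hi, lo
```

In this model the carry returned by sub32 is 1 exactly when op1 >= op2 (unsigned), which matches the geu condition.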

7.2) Why sub #32 is replaced with an add?

There is no encoding for the sub instruction with #32 because the two instructions:

sub Rm, Rn,  #32
add Rm, Rn, ~#32 + 1

would be equivalent in terms of the 32-bit result generated. Concerning the CF flag update:

if #32 <> 0 then computing ~#32 + 1 generates no carry, thus sub #32 is entirely equivalent to add -#32,
if #32 == 0 then the two forms would update CF differently, so the sub Rm, Rn, #0 instruction is encoded.
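
The equivalence, and its exception for a null immediate, can be checked with a short Python model of the 32-bit adder. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def sub_imm(rn, imm):
    """sub Rm, Rn, #imm: Rn + ~imm + 1; returns (result, carry 31)."""
    total = (rn & MASK32) + (~imm & MASK32) + 1
    return total & MASK32, total >> 32

def add_imm(rn, imm):
    """add Rm, Rn, #imm: Rn + imm; returns (result, carry 31)."""
    total = (rn & MASK32) + (imm & MASK32)
    return total & MASK32, total >> 32

def negated(imm):
    """The immediate ~imm + 1, truncated to 32 bits."""
    return (~imm + 1) & MASK32
```

For any non-null immediate, sub_imm(rn, imm) and add_imm(rn, negated(imm)) return the same result and the same carry. For a null immediate the results match but the carries differ: the subtraction always generates a carry 31, the addition of 0 never does.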

7.3) Why subc #32 is replaced with an addc?

There is no encoding for subc #32 as it is entirely equivalent to addc Rm, Rn, ~#32.

7.4) Supported Conditions

add and addc

Jcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi, v, nv, c, nc, nc4, nc5, nc6, nc7, nc8, nc9, nc10, nc11, nc12, nc13.
Bcc:    z, nz, xz, nxz.

sub and subc

Jcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi, v, nv, ltu, geu, lts, ges, les, gts, leu, gtu, xles, xgts, xleu, xgtu.
Bcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi, v, nv, ltu, geu, lts, ges, les, gts, leu, gtu, xles, xgts, xleu, xgtu.

rsub and rsubc

Jcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi, v, nv, ltu, geu, lts, ges, les, gts, leu, gtu, xles, xgts, xleu, xgtu.
Bcc:    z, nz, xz, nxz.

8) Logical instructions

AND, result = op1 & op2

AND   Xmz , Rnx, Rp
AND   Xmz , Rnx, Rp   , Bcc
AND   Xmz , Rnx, Rp   , Jcc, IRAM_address
-----------------------------------------------
AND   Rmz , Rn , #32
AND   Dm  , Rnx, #32
AND   ZERO, Rnx, #28
AND   Xm  , Rnx, #24
-----------------------------------------------
AND   Xm  , Rnx, #24  , Bcc
AND   ZERO, Rnx, #28PC, Jcc, IRAM_address
AND   Xm  , Rnx, #24PC, Jcc, IRAM_address

NAND, result = ~(op1 & op2)

NAND  Xmz , Rnx, Rp
NAND  Xmz , Rnx, Rp   , Bcc
NAND  Xmz , Rnx, Rp   , Jcc, IRAM_address
-----------------------------------------------
NAND  ZERO, Rnx, #28
NAND  Xm  , Rnx, #24
-----------------------------------------------
NAND  Xm  , Rnx, #24  , Bcc
NAND  ZERO, Rnx, #28PC, Jcc, IRAM_address
NAND  Xm  , Rnx, #24PC, Jcc, IRAM_address

ANDN, result = (~op1) & op2

ANDN  Xmz , Rnx, Rp
ANDN  Xmz , Rnx, Rp   , Bcc
ANDN  Xmz , Rnx, Rp   , Jcc, IRAM_address
-----------------------------------------------
ANDN  ZERO, Rnx, #28
ANDN  Xm  , Rnx, #24
-----------------------------------------------
ANDN  Xm  , Rnx, #24  , Bcc
ANDN  ZERO, Rnx, #28PC, Jcc, IRAM_address
ANDN  Xm  , Rnx, #24PC, Jcc, IRAM_address

OR, result = op1 | op2

OR    Xmz , Rnx, Rp
OR    Xmz , Rnx, Rp   , Bcc
OR    Xmz , Rnx, Rp   , Jcc, IRAM_address
-----------------------------------------------
OR    Dmz , Rn , #32
OR    Rm  , Rnx, #32
OR    ZERO, Rnx, #28
OR    Xm  , Rnx, #24
-----------------------------------------------
OR    Xm  , Rnx, #24  , Bcc
OR    ZERO, Rnx, #28PC, Jcc, IRAM_address
OR    Xm  , Rnx, #24PC, Jcc, IRAM_address

NOR, result = ~(op1 | op2)

NOR   Xmz , Rnx, Rp
NOR   Xmz , Rnx, Rp   , Bcc
NOR   Xmz , Rnx, Rp   , Jcc, IRAM_address
-----------------------------------------------
NOR   ZERO, Rnx, #28
NOR   Xm  , Rnx, #24
-----------------------------------------------
NOR   Xm  , Rnx, #24  , Bcc
NOR   ZERO, Rnx, #28PC, Jcc, IRAM_address
NOR   Xm  , Rnx, #24PC, Jcc, IRAM_address

ORN, result = (~op1) | op2

ORN   Xmz , Rnx, Rp
ORN   Xmz , Rnx, Rp   , Bcc
ORN   Xmz , Rnx, Rp   , Jcc, IRAM_address
-----------------------------------------------
ORN   ZERO, Rnx, #28
ORN   Xm  , Rnx, #24
-----------------------------------------------
ORN   Xm  , Rnx, #24  , Bcc
ORN   ZERO, Rnx, #28PC, Jcc, IRAM_address
ORN   Xm  , Rnx, #24PC, Jcc, IRAM_address

XOR, result = op1 ^ op2

XOR   Xmz , Rnx, Rp
XOR   Xmz , Rnx, Rp   , Bcc
XOR   Xmz , Rnx, Rp   , Jcc, IRAM_address
---------------------------------------------
XOR   ZERO, Rn , #32
XOR   Rm  , Rnx, #32
XOR   ZERO, Rnx, #28
XOR   Xm  , Rnx, #24
-----------------------------------------------
XOR   Xm  , Rnx, #24  , Bcc
XOR   ZERO, Rnx, #28PC, Jcc, IRAM_address
XOR   Xm  , Rnx, #24PC, Jcc, IRAM_address

NXOR, result = ~(op1 ^ op2)

NXOR  Xmz , Rnx, Rp
NXOR  Xmz , Rnx, Rp   , Bcc
NXOR  Xmz , Rnx, Rp   , Jcc, IRAM_address
-------------------------------------------------------------------
NXOR  ZERO, Rn , #32  // replaced with XOR instructions
NXOR  Rm  , Rnx, #32  // ...
NXOR  ZERO, Rnx, #28
NXOR  Xm  , Rnx, #24
-------------------------------------------------------------------
NXOR  Xm  , Rnx, #24  , Bcc
NXOR  ZERO, Rnx, #28PC, Jcc, IRAM_address
NXOR  Xm  , Rnx, #24PC, Jcc, IRAM_address

Supported Conditions

Jcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi
Bcc:    z, nz, xz, nxz

Note: Logical instructions update ZF but leave CF unchanged.

9) EXTUB / EXTSB / EXTUH / EXTSH (Zero/Sign Extensions)

The following instructions don’t support the .s modifier:

Extub  Xmz, Rn                     // 8-bit (Byte) to 32-bit zero (Unsigned) extension
Extub  Xmz, Rn, Bcc                // ...
Extub  Xmz, Rn, Jcc, IRAM_address  // ...
---------------------------------------------------------------------------------------
Extuh  Xmz, Rn                     // 16-bit (Half) to 32-bit zero (Unsigned) extension
Extuh  Xmz, Rn, Bcc                // ...
Extuh  Xmz, Rn, Jcc, IRAM_address  // ...

The following instructions don’t support the .u modifier:

Extsb  Xmz, Rn                     // 8-bit (Byte) to 32-bit Signed extension
Extsb  Xmz, Rn, Bcc                // ...
Extsb  Xmz, Rn, Jcc, IRAM_address  // ...
---------------------------------------------------------------------------------------
Extsh  Xmz, Rn                     // 16-bit (Half) to 32-bit Signed extension
Extsh  Xmz, Rn, Bcc                // ...
Extsh  Xmz, Rn, Jcc, IRAM_address  // ...

Supported Conditions

Jcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi
Bcc:    z, nz, xz, nxz

10) HASH

These instructions don’t support the .s modifier.

hash Xmz, Rnx, Rp
hash Xmz, Rnx, Rp   , Bcc
hash Xmz, Rnx, Rp   , Jcc, IRAM_address
-----------------------------------------------
hash Xmz, Rnx, #24
hash Xmz, Rnx, #24  , Bcc
hash Xmz, Rnx, #24PC, Jcc, IRAM_address

10.1) Hash operation

The instruction result is given by the following table:

op2[18:17]	op2[16]	Result
00	0	Op1[6:0] ^ Op1[13: 7]
00	1	Op1[6:0] ^ Op1[13: 7] ^ Op1[20:14]
01	0	Op1[7:0] ^ Op1[15: 8]
01	1	Op1[7:0] ^ Op1[15: 8] ^ Op1[23:16]
10	0	Op1[8:0] ^ Op1[17: 9]
10	1	Op1[8:0] ^ Op1[17: 9] ^ Op1[26:18]
11	0	Op1[9:0] ^ Op1[19:10]
11	1	Op1[9:0] ^ Op1[19:10] ^ Op1[29:20]
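
The table can be read as "fold op1 into chunks of 7 to 10 bits and XOR two or three of them". The following Python sketch expresses that reading; the function name dpu_hash is hypothetical and the code is only a model of the table above.

```python
def dpu_hash(op1, op2):
    """Model of the hash result table: XOR-fold op1 as selected by op2[18:16]."""
    width = 7 + ((op2 >> 17) & 0x3)        # op2[18:17]: 00 -> 7, 01 -> 8, 10 -> 9, 11 -> 10 bits
    chunks = 3 if (op2 >> 16) & 1 else 2   # op2[16]: XOR two or three chunks
    mask = (1 << width) - 1
    result = 0
    for k in range(chunks):
        # chunk k is Op1[(k + 1) * width - 1 : k * width]
        result ^= (op1 >> (k * width)) & mask
    return result
```

For instance, with op2[18:16] = 010 the result is Op1[7:0] ^ Op1[15:8], so dpu_hash(0x12345678, 0x20000) gives 0x78 ^ 0x56.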

Supported Conditions

Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi
Bcc:    z, nz, xz, nxz

11) SATS (SATuration, Signed)

sats  Xmz, Rnx
sats  Xmz, Rnx, Bcc
sats  Xmz, Rnx, Jcc, IRAM_address

result = (Rx[31] == 1) ? 0x7FFFFFFF : 0x80000000

Supported Conditions

Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi
Bcc:    z, nz, xz, nxz

12) Shift / Rotate

The shift value is the 5-lsb of the second operand, thus the shift/rotate amount ranges from 0 through 31: it can be the 5-lsb of an Rp register or a 5-bit immediate value.

The following table describes the Shift/Rotate instructions:

Mnemonic	Description	Initial	Shift	Result
ROL	ROtate Left	12345678	4	23456781
ROR	ROtate Right	12345678	4	81234567
LSL	Logical Shift Left	12345678	4	23456780
LSL1	Logical Shift Left with 1 insertion	12345678	4	2345678F
LSR	Logical Shift Right	12345678	4	01234567
LSR1	Logical Shift Right with 1 insertion	12345678	4	F1234567
ASR	Arithmetic Shift Right	12345678	4	01234567
ASR	Arithmetic Shift Right	89ABCDEF	4	F89ABCDE
LSLX	LSL eXtended. The result is the part that would be shifted out by an LSL, its MSB being 0-filled.	12345678	0	00000000
		12345678	4	00000001
		12345678	28	01234567
LSL1X	LSL1 eXtended. The result is the part that would be shifted out by a LSL1, its MSB being 1-filled.	12345678	0	FFFFFFFF
		12345678	4	FFFFFFF1
		12345678	28	F1234567
LSRX	LSR eXtended. The result is the part that would be shifted out by an LSR, its LSB being 0-filled.	12345678	0	00000000
		12345678	4	80000000
		12345678	28	23456780
LSR1X	LSR1 eXtended. The result is the part that would be shifted out by a LSR1, its LSB being 1-filled.	12345678	0	FFFFFFFF
		12345678	4	8FFFFFFF
		12345678	28	2345678F
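
The rows of this table can be reproduced with a small software model. The Python sketch below is only an illustration: the function names are hypothetical (they mirror the mnemonics), operands are assumed to be unsigned 32-bit values, and the shift amount is reduced to its 5 least significant bits as described above.

```python
MASK32 = 0xFFFFFFFF

def rol(x, n):
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32

def ror(x, n):
    n &= 31
    return ((x >> n) | (x << (32 - n))) & MASK32

def lsl(x, n):
    return (x << (n & 31)) & MASK32

def lsl1(x, n):
    n &= 31
    return lsl(x, n) | ((1 << n) - 1)               # vacated low bits are 1-filled

def lsr(x, n):
    return x >> (n & 31)

def lsr1(x, n):
    n &= 31
    return lsr(x, n) | (MASK32 & ~(MASK32 >> n))    # vacated high bits are 1-filled

def asr(x, n):
    n &= 31
    signed = x - (1 << 32) if x & 0x80000000 else x
    return (signed >> n) & MASK32

def lslx(x, n):
    return x >> (32 - (n & 31))                     # bits an LSL would shift out

def lsl1x(x, n):
    n &= 31
    return lslx(x, n) | ((MASK32 << n) & MASK32)    # upper part 1-filled

def lsrx(x, n):
    return (x << (32 - (n & 31))) & MASK32          # bits an LSR would shift out

def lsr1x(x, n):
    n &= 31
    return lsrx(x, n) | (MASK32 >> n)               # lower part 1-filled
```

With a shift of 0 nothing is shifted out, which is why lslx and lsrx return 0 and lsl1x and lsr1x return 0xFFFFFFFF in that case.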

All shift/rotate instructions accept the same operand combinations, shown here for LSL:

LSL  Xmz, Rnx, Rp
LSL  Xmz, Rnx, Rp, Bcc
LSL  Xmz, Rnx, Rp, Jcc, IRAM_address
-------------------------------------
LSL  Xmz, Rnx, #5
LSL  Xmz, Rnx, #5, Bcc
LSL  Xmz, Rnx, #5, Jcc, IRAM_address

Supported Conditions

Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi, nsh32, sh32, se, so
Bcc:    z, nz, xz, nxz

13) Shift/Rotate & add/sub

rol_add  Xmz, Rnx, Rp, #5                     // rotate left then addition
rol_add  Xmz, Rnx, Rp, #5, Bcc                // ...
rol_add  Xmz, Rnx, Rp, #5, Jcc, IRAM_address  // ...
-----------------------------------------------------------------------------
lsr_add  Xmz, Rnx, Rp, #5                     // shift right then addition
lsr_add  Xmz, Rnx, Rp, #5, Bcc                // ...
lsr_add  Xmz, Rnx, Rp, #5, Jcc, IRAM_address  // ...
-----------------------------------------------------------------------------
lsl_add  Xmz, Rnx, Rp, #5                     // shift left then addition
lsl_add  Xmz, Rnx, Rp, #5, Bcc                // ...
lsl_add  Xmz, Rnx, Rp, #5, Jcc, IRAM_address  // ...
-----------------------------------------------------------------------------
lsl_sub  Xmz, Rnx, Rp, #5                     // shift left then subtraction
lsl_sub  Xmz, Rnx, Rp, #5, Bcc                // ...
lsl_sub  Xmz, Rnx, Rp, #5, Jcc, IRAM_address  // ...

For all these instructions, the content of Rnx is first shifted or rotated by the #5 immediate value. This intermediary result is then added to, or subtracted from, the Rp value to give the final result of the instruction.

Supported Conditions

Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi
Bcc:    z, nz, xz, nxz

NOTE: the z, nz, xz, nxz, pl and mi CONDITIONS ARE EVALUATED AGAINST THE INTERMEDIARY RESULT, NOT AGAINST THE FINAL RESULT
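
The difference between the intermediary and the final result is easy to overlook, so here is a small Python model of two of these instructions. It is a sketch under assumptions: the function names are hypothetical, values are treated as unsigned 32-bit integers, and lsl_sub is left out because the operand order of its subtraction is not modelled here.

```python
MASK32 = 0xFFFFFFFF

def lsl_add(rnx, rp, sh):
    """Return (final, intermediary) for a shift-left-then-add."""
    intermediary = (rnx << (sh & 31)) & MASK32
    final = (intermediary + rp) & MASK32
    return final, intermediary

def rol_add(rnx, rp, sh):
    """Return (final, intermediary) for a rotate-left-then-add."""
    sh &= 31
    intermediary = ((rnx << sh) | (rnx >> (32 - sh))) & MASK32
    final = (intermediary + rp) & MASK32
    return final, intermediary

def z_condition(intermediary):
    """The z condition looks at the intermediary result only."""
    return intermediary == 0
```

A typical use is address computation, for example a base address in Rp plus an index in Rnx scaled by 4 (shift of 2). Note that with Rnx = 0 the z condition is true even though the final result equals Rp and may be non-zero.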

14) CLZ / CLO / CLS / CAO (bit count)

These instructions don’t support the .s modifier.

CLZ  Xmz, Rnx                     // Count Leading Zeros
CLZ  Xmz, Rnx, Bcc                // ...
CLZ  Xmz, Rnx, Jcc, IRAM_address  // ...
-------------------------------------------------------------------------------------------------
CLO  Xmz, Rnx                     // Count Leading Ones
CLO  Xmz, Rnx, Bcc                // ...
CLO  Xmz, Rnx, Jcc, IRAM_address  // ...
-------------------------------------------------------------------------------------------------
CLS  Xmz, Rnx                     // Count Leading Sign: Indicates by how many bits the
CLS  Xmz, Rnx, Bcc                // source operand can be left-shifted without having
CLS  Xmz, Rnx, Jcc, IRAM_address  // its sign changed, the result being in the range 0-31.
-------------------------------------------------------------------------------------------------
CAO  Xmz, Rnx                     // Count All Ones: counts the number
CAO  Xmz, Rnx, Bcc                // of ones in the source operand
CAO  Xmz, Rnx, Jcc, IRAM_address  // ...

Supported Conditions

Jcc: t, z, nz, xz, nxz, max, nmax, sz, nsz, spl, smi
Bcc:    z, nz, xz, nxz

For CLS, the max (MAXimum) condition is true when the result is 31. For CLZ, CLO and CAO, the max condition is true when the result is 32. The nmax condition is always the opposite of the max condition.
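
The four counts and their maximum values can be checked with the following Python sketch. The function names are hypothetical and the operand is assumed to be an unsigned 32-bit value; CLS is modelled as the number of leading bits equal to the sign bit, minus the sign bit itself, which matches the 0-31 range stated above.

```python
def clz(x):
    """Count leading zeros of a 32-bit value (32 when x is 0)."""
    return 32 - x.bit_length()

def clo(x):
    """Count leading ones (32 when x is 0xFFFFFFFF)."""
    return clz(~x & 0xFFFFFFFF)

def cls(x):
    """Number of left shifts that keep the sign bit unchanged (0..31)."""
    same_as_sign = clo(x) if x & 0x80000000 else clz(x)
    return same_as_sign - 1

def cao(x):
    """Count all ones (population count)."""
    return bin(x).count("1")
```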

15) MUL_STEP / DIV_STEP / MOVD / SWAPD

15.1) mul_step

mul_step  Dmz, Rnx, Dp, #5
mul_step  Dmz, Rnx, Dp, #5, Bcc
mul_step  Dmz, Rnx, Dp, #5, Jcc, IRAM_address

Action performed

if (Dp[32] & 1) Dm[31: 0] = Dp[31: 0] + (Rnx << #5)  // if the destination is the ZERO register,
                Dm[63:32] = Dp[63:32]        >> 1    // ... then no register is affected

Supported Conditions

Jcc: t, z, nz, sz, nsz, spl, smi
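
To show how mul_step instructions can be chained into a multiplication, here is a Python model. It is a sketch resting on several assumptions that the text above does not state: the shift of Dp[63:32] is taken to be unconditional, Dm[31:0] is taken to equal Dp[31:0] when Dp[32] is 0, the multiplier is placed in the upper half of the D register, and the #5 immediate is set to the step number. The names mul_step and multiply are hypothetical helpers.

```python
MASK32 = 0xFFFFFFFF

def mul_step(dp, rnx, sh):
    """One step on a 64-bit value dp; returns the 64-bit value written to Dm."""
    lo = dp & MASK32            # Dp[31:0]: running sum
    hi = (dp >> 32) & MASK32    # Dp[63:32]: multiplier bits not yet consumed
    if hi & 1:                  # Dp[32]
        lo = (lo + (rnx << (sh & 31))) & MASK32
    hi >>= 1                    # assumption: performed whatever the value of Dp[32]
    return (hi << 32) | lo

def multiply(a, b, steps=32):
    """Low 32 bits of a * b, using one mul_step per multiplier bit."""
    d = (b & MASK32) << 32      # assumption: multiplier in the upper half, sum = 0
    for k in range(steps):
        d = mul_step(d, a & MASK32, k)
    return d & MASK32
```

When the multiplier is known to be small, fewer steps are enough: multiply(6, 7, steps=3) already consumes the three significant bits of 7.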

15.2) div_step

div_step  Dmz, Rnx, Dp, #5
div_step  Dmz, Rnx, Dp, #5, Bcc
div_step  Dmz, Rnx, Dp, #5, Jcc, IRAM_address

Action performed

if  (Dp[31: 0] >=              (Rnx << #5)){  // the comparison is unsigned
     Dm[31: 0]  =  Dp[31: 0] - (Rnx << #5);   // if the destination is the ZERO register
     Dm[63:32]  = (Dp[63:32] << 1)  | 1   ;   // ... then no register is affected
}                                             // ...
else Dm[63:32]  = (Dp[63:32] << 1)        ;   // ...

Supported Conditions

Jcc: t, sz, nsz, spl, smi
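
Chaining div_step instructions with a decreasing shift gives a classic restoring division, with the remainder in Dm[31:0] and the quotient in Dm[63:32]. The Python sketch below illustrates this; div_step and divide are hypothetical helper names, and the model assumes that (Rnx << #5) is truncated to 32 bits, which is why the example restricts operands so that the shifted divisor never overflows.

```python
MASK32 = 0xFFFFFFFF

def div_step(dp, rnx, sh):
    """One step on a 64-bit value dp; returns the 64-bit value written to Dm."""
    lo = dp & MASK32                        # Dp[31:0]: running remainder
    hi = (dp >> 32) & MASK32                # Dp[63:32]: quotient bits so far
    shifted = (rnx << (sh & 31)) & MASK32   # assumption: truncated to 32 bits
    hi = (hi << 1) & MASK32
    if lo >= shifted:                       # unsigned comparison
        lo -= shifted
        hi |= 1
    return (hi << 32) | lo

def divide(n, d, bits=8):
    """Return (quotient, remainder) for n < 2**bits, 0 < d < 2**bits, bits <= 16."""
    acc = n                                 # remainder = n, quotient = 0
    for sh in range(bits - 1, -1, -1):      # shift = bits-1 down to 0
        acc = div_step(acc, d, sh)
    return acc >> 32, acc & MASK32
```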

15.3) movd

movd  Dmz, Dp
movd  Dmz, Dp, Bcc
movd  Dmz, Dp, Jcc, IRAM_address

result = Dp

Supported Conditions

Jcc:  t, sz, nsz, spl, smi

15.4) swapd

swapd  Dmz, Dp
swapd  Dmz, Dp, Bcc
swapd  Dmz, Dp, Jcc, IRAM_address

result = { Dp[31:0], Dp[63:32] }

Supported Conditions

Jcc:  t, sz, nsz, spl, smi

16) 8 x 8 Multiplications

The result of an 8 x 8 multiplication is initially 16-bit, then this 16-bit result is:

zero-extended to 32-bit for unsigned x unsigned multiplication,
sign-extended to 32-bit otherwise.

mnemonic	result[15:0]	multiply variant	Comment
mul_ul_ul	op1[ 7:0] x op2[ 7:0]	unsigned x unsigned (zero-extended to 32-bit)	.s forbidden
mul_ul_uh	op1[ 7:0] x op2[15:8]
mul_uh_ul	op1[15:8] x op2[ 7:0]
mul_uh_uh	op1[15:8] x op2[15:8]
mul_sl_ul	op1[ 7:0] x op2[ 7:0]	signed x unsigned (sign-extended to 32-bit)	.u forbidden
mul_sl_uh	op1[ 7:0] x op2[15:8]
mul_sh_ul	op1[15:8] x op2[ 7:0]
mul_sh_uh	op1[15:8] x op2[15:8]
mul_sl_sl	op1[ 7:0] x op2[ 7:0]	signed x signed (sign-extended to 32-bit)
mul_sl_sh	op1[ 7:0] x op2[15:8]
mul_sh_sl	op1[15:8] x op2[ 7:0]
mul_sh_sh	op1[15:8] x op2[15:8]
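
The naming scheme of these mnemonics (s/u for signed/unsigned, l/h for low/high byte, once per operand) can be decoded mechanically. The Python sketch below does so; mul8 and _byte are hypothetical helpers, and the model relies on the fact that every 8 x 8 product fits in its 16-bit form, so extending the 16-bit result to 32 bits is the same as truncating the exact product to 32 bits.

```python
def _byte(value, high, signed):
    """Select bits [15:8] or [7:0] and interpret them as signed or unsigned."""
    b = (value >> 8) & 0xFF if high else value & 0xFF
    return b - 0x100 if signed and (b & 0x80) else b

def mul8(mnemonic, op1, op2):
    """Model an 8 x 8 multiplication named like 'mul_sh_ul'."""
    _, a, b = mnemonic.split("_")                # e.g. ['mul', 'sh', 'ul']
    x = _byte(op1, a[1] == "h", a[0] == "s")     # first operand selector
    y = _byte(op2, b[1] == "h", b[0] == "s")     # second operand selector
    return (x * y) & 0xFFFFFFFF                  # 32-bit register image of the result
```

For example, the low byte 0xFF is 255 for mul_ul_ul but -1 for mul_sl_ul, so multiplying it by 2 gives 0x000001FE in the first case and 0xFFFFFFFE in the second.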

Syntax

mul_ul_ul     Xmz, Rnx, Rp                     // similar syntax for the other
mul_ul_ul     Xmz, Rnx, Rp, Bcc                // 8 x 8 multiplication instructions
mul_ul_ul     Xmz, Rnx, Rp, Jcc, IRAM_address  // ...

Supported Conditions

Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi, ms8, nms8, mu8, nmu8
Bcc:    z, nz, xz, nxz

17) CMPB4

cmpb4  Xmz, Rnx, Rp

Functionality

result[31:24] = (Rx[31:24] == Rp[31:24]) ? 0x01 : 0x00;
result[23:16] = (Rx[23:16] == Rp[23:16]) ? 0x01 : 0x00;
result[15: 8] = (Rx[15: 8] == Rp[15: 8]) ? 0x01 : 0x00;
result[ 7: 0] = (Rx[ 7: 0] == Rp[ 7: 0]) ? 0x01 : 0x00;
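
The four byte comparisons can be written as a loop. The Python sketch below is a direct transcription of the lines above; the function name cmpb4 is reused for the model and operands are assumed to be unsigned 32-bit values.

```python
def cmpb4(rx, rp):
    """Byte-wise equality: each result byte is 0x01 where the bytes match, else 0x00."""
    result = 0
    for shift in (0, 8, 16, 24):
        if ((rx >> shift) & 0xFF) == ((rp >> shift) & 0xFF):
            result |= 0x01 << shift
    return result
```

Comparing a word against 0 this way flags its zero bytes, for instance when looking for a string terminator: cmpb4(0x41420043, 0) sets only the byte at bits 15:8.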

Supported Conditions

Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi
Bcc:    z, nz, xz, nxz

18) CALL

call  Xmz, Rnx, Rp
call  Xmz, Rnx, #PC  // #PC is an immediate whose width is the one of the PC

Functionality

result = current PC + 1
The thread jumps to the IRAM address given by Rnx + Rp or by Rnx + #PC.

Note: there is no RETURN instruction: a “CALL ZERO, Rnx” instruction is used instead, where Rnx is the register where the return address has been previously saved.

19) ACQUIRE / RELEASE

acquire  Rnx, #16
acquire  Rnx, #16, Jcc, IRAM_address
release  Rnx, #16
release  Rnx, #16, Jcc, IRAM_address

Functionality

For both instructions, an 8-bit index i is calculated as follows:

tmp[15:0] = Rnx + #16
i = tmp[15:8] ^ tmp[7:0]

Then:

for ACQUIRE: ATOMIC[ i ] = 1,
for RELEASE: ATOMIC[ i ] = 0.

In both cases, the z/nz conditions are evaluated using the initial value of the ATOMIC[ i ] bit.

Supported Jcc Conditions for ACQUIRE: t, z, nz

Supported Jcc Conditions for RELEASE: nz

Note: when ACQUIRE/RELEASE is used correctly, the nz condition is always true for RELEASE.
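
The index computation and the z/nz behaviour can be modelled as follows. This Python sketch is an illustration only: the class AtomicBits and its methods are hypothetical, and reading "z is true on ACQUIRE" as "the lock was free and is now held by the caller" is an interpretation, not something the text above states.

```python
class AtomicBits:
    def __init__(self):
        self.bits = [0] * 256                # ATOMIC[ i ] for an 8-bit index i

    @staticmethod
    def index(rnx, imm16=0):
        tmp = (rnx + imm16) & 0xFFFF         # tmp[15:0] = Rnx + #16
        return (tmp >> 8) ^ (tmp & 0xFF)     # i = tmp[15:8] ^ tmp[7:0]

    def acquire(self, rnx, imm16=0):
        """Set ATOMIC[ i ]; return the z condition (bit was initially 0)."""
        i = self.index(rnx, imm16)
        initial = self.bits[i]
        self.bits[i] = 1
        return initial == 0

    def release(self, rnx, imm16=0):
        """Clear ATOMIC[ i ]; return the nz condition (bit was initially 1)."""
        i = self.index(rnx, imm16)
        initial = self.bits[i]
        self.bits[i] = 0
        return initial != 0
```

Because the index is an XOR of two bytes, different Rnx + #16 values can map to the same bit: 0x1234 and 0x3412 both give index 0x26, so they would share one ATOMIC bit in this model.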

20) STOP

stop
stop  t, IRAM_address  // only the t (True) condition is supported

Functionality

The RUN bit corresponding to the thread executing the STOP instruction is cleared. If a t condition is present, the thread PC is set to the specified jump address. Whether or not the t condition is present, the thread is no longer running.

21) BOOT / RESUME / CLR_RUN

boot     Rnx, #6
boot     Rnx, #6, Jcc, IRAM_address
resume   Rnx, #6
resume   Rnx, #6, Jcc, IRAM_address
clr_run  Rnx, #6
clr_run  Rnx, #6, Jcc, IRAM_address

Functionality

All three instructions first generate a 6-bit unsigned index i:

tmp[13:0] = Rnx[13:0] + #6
i = tmp[13:8] ^ tmp[5:0]

21.1) CLR_RUN

clr_run simply clears the bit RUN[ i ]; the CLR_RUN instruction is then complete.

21.2) BOOT / RESUME

If RUN[ i ] is initially set, the BOOT/RESUME instruction ends here.

Otherwise:

RUN[ i ] is set
if i < 24 then the execution of the thread i is resumed:
- for BOOT instructions: at the IRAM address 0
- for RESUME instructions: at the current value of PC[ i ] (the PC of the thread i).

21.3) Supported conditions

The CLR_RUN, BOOT, and RESUME instructions support the same set of conditions:

Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi

Note: the z, nz, xz, nxz conditions use the nullity/non-nullity of the initial value of the bit RUN[ i ].
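
The behaviour of CLR_RUN, BOOT and RESUME on the RUN bits can be summarised with a small Python model. It is a sketch with hypothetical names (thread_index, RunModel); RUN is held in a plain integer and each method returns the initial value of RUN[ i ], which is the value the conditions above are evaluated against.

```python
NR_THREADS = 24  # thread count used by the "i < 24" test in 21.2

def thread_index(rnx, imm6=0):
    """i = tmp[13:8] ^ tmp[5:0], with tmp[13:0] = Rnx[13:0] + #6."""
    tmp = ((rnx & 0x3FFF) + imm6) & 0x3FFF
    return ((tmp >> 8) & 0x3F) ^ (tmp & 0x3F)

class RunModel:
    def __init__(self):
        self.run = 0                      # RUN register, one bit per index i
        self.pc = [0] * NR_THREADS        # PC[ i ] of each thread

    def clr_run(self, i):
        initial = (self.run >> i) & 1
        self.run &= ~(1 << i)
        return initial

    def _start(self, i, restart):
        initial = (self.run >> i) & 1
        if not initial:                   # nothing more to do if already set
            self.run |= 1 << i
            if i < NR_THREADS and restart:
                self.pc[i] = 0            # BOOT: thread i restarts at IRAM address 0
        return initial

    def boot(self, i):
        return self._start(i, restart=True)

    def resume(self, i):
        return self._start(i, restart=False)   # RESUME: PC[ i ] is left as it is
```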

22) TIME / TIME_CFG

time      Xmz
time      Xmz,      t, IRAM_address  // only the t (True) condition is allowed
time_cfg  Xmz, Rnx
time_cfg  Xmz, Rnx, t, IRAM_address  // only the t (True) condition is allowed

For both instructions: result = TIME[35:4] (the 32-msb of TIME)

22.1) TIME Increment Configuration

This part concerns only the time_cfg instruction.

Setting Rnx[0] clears the TIME[35:0] register. The field Rnx[2:1] selects one of the following increment configurations:

keep the current increment configuration
set the configuration such that TIME[35:0] is     incremented every DPU cycle
set the configuration such that TIME[35:0] is     incremented every executed instruction
set the configuration such that TIME[35:0] is not incremented

23) NOP / BKP

NOP  // Does nothing
BKP  // Does nothing besides causing a BKP exception