DPU Handbook
1) Introduction
The DPU is a multithreaded 32-bit processor that has several hardware threads available, depending on the version of the DPU:
On a v1A DPU, there are 24 threads, indexed from 0 through 23.
On a v1B DPU, there are 16 threads, indexed from 0 through 15.
A thread can be running or stopped. The state of the thread i is reflected in the 24-lsb of a 64-bit register named RUN (the 40-msb being used for other purposes, described later):
RUN[ i ] = 0 -> the thread i is stopped (not executing),
RUN[ i ] = 1 -> the thread i is running (executing).
The full performance of the DPU is achieved when enough hardware threads are running to keep the DPU pipeline filled (more than 10 threads). Note that ‘overfilling’ the pipeline is recommended, to compensate for the fact that threads issuing DMA instructions are temporarily removed from the pipeline.
2) DPU state
2.1) Threads 32-bit registers
A thread has access to 32 x 32-bit registers:
24 general purpose 32-bit registers, private to the thread: R0 - R23
4 fixed 32-bit registers, common to all threads:
ZERO : fixed to the value 0,
ONE : fixed to the value 1,
LNEG : fixed to the value 0xFFFFFFFF (Least NEGative),
MNEG : fixed to the value 0x80000000 (Most NEGative).
4 fixed 32-bit registers, private to the thread:
ID : fixed to the thread index.
ID2 : fixed to the thread index x 2.
ID4 : fixed to the thread index x 4.
ID8 : fixed to the thread index x 8.
2.1.1) R0 - R23 seen as Stack Registers
The 24 general-purpose 32-bit registers can also be viewed as 24 x 32-bit stack registers: S0 - S23.
Some instructions support the specification of an Sn register instead of an Rn register. While the register value is unchanged, the way this value is used changes, as described in the stack exceptions chapter.
2.1.2) R0 - R23 register pair seen as 64-bit registers
The 24 general-purpose 32-bit registers can also be viewed as 12 x 64-bit registers:
D0 = { R0 , R1 }
D2 = { R2 , R3 }
D4 = { R4 , R5 }
D6 = { R6 , R7 }
D8 = { R8 , R9 }
D10 = { R10 , R11 }
D12 = { R12 , R13 }
D14 = { R14 , R15 }
D16 = { R16 , R17 }
D18 = { R18 , R19 }
D20 = { R20 , R21 }
D22 = { R22 , R23 }
The Dn 32-msb are held by the even Rn register, the 32-lsb being held by the odd Rn+1 register.
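The pairing rule above can be illustrated with a short Python sketch. The function names are illustrative only and are not part of any DPU toolchain; the sketch simply composes and splits a 64-bit Dn value from its two 32-bit halves.

```python
MASK32 = 0xFFFFFFFF

def d_from_pair(r_even, r_odd):
    """Build the 64-bit Dn value: even register = 32 msb, odd register = 32 lsb."""
    return ((r_even & MASK32) << 32) | (r_odd & MASK32)

def pair_from_d(d):
    """Split a 64-bit Dn value into (R_even, R_odd)."""
    return (d >> 32) & MASK32, d & MASK32
```

For example, with R0 = 0x11223344 and R1 = 0x55667788, D0 reads as 0x1122334455667788.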
2.3) Threads PC register
A thread comprises a PC register, whose width is implementation-dependent:
the PC width is in the range 12-16 bits,
the first DPU implementation has a 12-bit PC.
Note: the PC contains an instruction address, not a byte address.
2.4) Naming Conventions
To specify which operands are allowed for each instruction, the following naming conventions are used:
#32 : a 32-bit immediate value.
#28 : a 28-bit immediate value, sign extended to 32-bit.
#27 : a 27-bit immediate value, sign extended to 32-bit.
#24 : a 24-bit immediate value, sign extended to 32-bit.
#16 : a 16-bit immediate value, that is, according to the instruction considered, either:
not extended,
sign extended to 32-bit,
sign extended to 64-bit.
#8 : an 8-bit immediate value.
#WRAM : an immediate signed value whose width is p + 1 when the WRAM size is 2 ^ p.
disp24 : a 24-bit immediate value.
disp12 : a 12-bit immediate value, sign extended to 24-bit.
#28-PC : an immediate value whose width is 28 minus the width of PC, sign-extended to 32-bit.
#27-PC : an immediate value whose width is 27 minus the width of PC, sign-extended to 32-bit.
#24-PC : an immediate value whose width is 24 minus the width of PC, sign-extended to 32-bit.
#PC : an immediate value whose width is the width of PC.
#6 : a 6-bit immediate value.
#5 : a 5-bit immediate value.
Rm, Rn, Rp : one of the registers R0-R23.
Rnx : one of the registers R0-R23 or one of the fiXed registers: ZERO, ONE, LNEG, MNEG, ID, ID2, ID4 or ID8.
Rmz : one of the registers R0-R23 or the ZERO register.
Dm, Dp : one of the registers D0-D22.
Dmz : one of the registers D0-D22 or the ZERO register.
Xm : one of the registers R0-R23 or D0-D22.
Xmz : one of the registers R0-R23, D0-D22 or the ZERO register.
2.5) Threads ZF and CF flags
To help the execution of 64-bit arithmetic, each thread has 2 x 1-bit flags:
ZF : Zero Flag,
CF : Carry Flag.
2.6) The TIME register
This 36-bit register is common to all the threads. According to its configuration, TIME either:
stays unchanged,
increments at every cycle,
increments at every executed instruction.
The TIME_CFG (TIME ConFiGure) instruction allows:
the optional setting of the TIME configuration
the optional clearing of TIME[35:0]
The 32-msb of TIME can be obtained through the TIME and TIME_CFG instructions.
2.6) The IRAM
A DPU comprises an Instruction memory named IRAM holding 2 ^ p 48-bit wide instructions, where p is the PC width.
PC_width and instruction encoding
In many instructions, the width of the immediate value that can be encoded varies inversely with the PC width.
The current implementation has a 12-bit PC but supports (through the configuration by the HCPU of the PC_MODE control register) the execution of binaries generated for DPUs with a larger PC width, as long as these binaries fit into the IRAM.
The IRAM can be accessed:
by the HCPU through the control interface,
The HCPU can read/write the IRAM even when threads are running.
by the DPU through the execution of ldmai instructions,
the DPU reads the IRAM only through the fetching of instructions.
2.7) The WRAM
The WRAM is a 64 KB memory that is accessible:
by the HCPU through the control interface,
by the DPU through:
8-bit, 16-bit, 32-bit and 64-bit load/store instructions,
ldma/sdma instructions.
Note 1: The WRAM has a 24-bit wide address space, where currently only the range 0x000000 - 0x00FFFF is used.
Note 2: On the v1B DPU, only 63488 bytes are usable.
2.7.1) Load/Store Memory Exception
Load/Store generates a memory exception when:
the address is not aligned with respect to the access size,
the address is outside the range 0 to (64 KB - 1),
the address is a stack address and crosses its associated bound.
Note: exception handling is performed by the HCPU.
2.7.2) Stack Overflow Exception
Since up to 24 threads can be running, up to 24 different stacks can be present; the DPU therefore comprises a hardware mechanism to detect stack overflow early on.
2.7.2.1) Stack Overflow Exception Caused By Load/Store
Load/Store allows the specification of an Sn register instead of an Rn register as the base of the effective address calculation. While Sn and Rn contents are identical, specifying an Sn register changes the way this content is used.
Considering a WRAM of size 2 ^ p, then:
Sn [31 : p ] contains the stack bound address (or its MSB),
Sn [ p -1:0 ] contains the current stack address.
The stack bound address encoding is adapted to the WRAM size as follows:
64 KB : stack bound [15:0] is Rn[31:16]
128 KB : stack bound [16:0] is { Rn[31:17], 00 }
256 KB : stack bound [17:0] is { Rn[31:18], 0000 }
512 KB : stack bound [18:0] is { Rn[31:19], 000000 }
1 MB : stack bound [19:0] is { Rn[31:20], 00000000 }
The STACK_UP control register (configurable by the HCPU) specifies the progression direction for all the stacks.
STACK_UP set … upward progressing stacks: an Sn-based load/store at an address greater than or equal to the stack bound address generates a memory exception.
STACK_UP cleared … downward progressing stacks: an Sn-based load/store at an address strictly smaller than the stack bound address generates a memory exception.
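The bound encoding and the STACK_UP rules above can be sketched in Python as follows. This is a minimal model under stated assumptions: the function names are hypothetical, `address` stands for the effective address of the Sn-based access, and only the WRAM sizes listed above (p from 16 to 20) are considered.

```python
def stack_bound(sn, p):
    """Decode the stack bound held in Sn[31:p] for a WRAM of 2**p bytes.

    Sn[31:p] provides the (32 - p) most significant bits of the p-bit bound;
    the remaining 2 * (p - 16) low bits of the bound are zero.
    """
    msb_bits = 32 - p
    low_zero_bits = p - msb_bits
    return ((sn & 0xFFFFFFFF) >> p) << low_zero_bits

def sn_access_faults(sn, address, p, stack_up):
    """Return True when an Sn-based load/store at `address` crosses the bound."""
    bound = stack_bound(sn, p)
    if stack_up:
        return address >= bound   # upward stacks: at or above the bound faults
    return address < bound        # downward stacks: strictly below the bound faults
```

With a 64 KB WRAM (p = 16), Sn = 0x10000F00 encodes a bound of 0x1000 and a current stack address of 0x0F00.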
2.7.2.2) Stack Overflow Exception Caused by Addition/Subtraction
An addition/subtraction applied to a stack pointer must leave the MSBs of this stack pointer unchanged, as these MSBs specify the stack bound. Thus add/addc/sub/subc/rsub/rsubc instructions with an Sn register specified as the first source operand generate an exception if result[31:p] differs from Sn[31:p].
Note: when an addition/subtraction has an Sn register as the first source operand, the assembler allows an Sm register to be used as the destination register, for naming consistency (this is purely cosmetic).
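The check on result[31:p] can be sketched in Python for the plain addition case. The function name is hypothetical, and `operand` may be given as a negative Python integer to model adding a negative value.

```python
def stack_add_faults(sn, operand, p):
    """Return True when Sn + operand changes bits [31:p], i.e. the encoded bound."""
    result = (sn + operand) & 0xFFFFFFFF
    return (result >> p) != ((sn & 0xFFFFFFFF) >> p)
```

For a 64 KB WRAM (p = 16), adding 8 to Sn = 0x1000FFF8 carries into bit 16 and is therefore reported, whereas adding 4 is not.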
2.8) The MRAM
The MRAM is a 64 MB memory accessible:
by the HCPU through the DDR4 legacy interface,
by the DPU through ldma instructions,
by the DPU through sdma instructions.
2.9) The ATOMIC memory
This 256-bit memory is used for thread synchronization.
A bit of the ATOMIC memory can be set by a thread through the ACQUIRE instruction,
the thread conditionally jumping according to the bit initial value.
A bit of the ATOMIC memory can be cleared by a thread through the RELEASE instruction,
the thread conditionally jumping according to the bit initial value.
A bit of the ATOMIC memory can be set or cleared by the HCPU through the control interface,
the HCPU obtaining in return the bit initial value.
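The set/clear semantics described above can be modelled in Python as shown below. The class and method names are illustrative; the model only returns the initial bit value and leaves the jump decision to the caller, since the jump polarity is defined by the instruction's condition.

```python
class AtomicMemory:
    """Behavioural model of the 256-bit ATOMIC memory."""

    BITS = 256

    def __init__(self):
        self._bits = 0

    def _check(self, index):
        if not 0 <= index < self.BITS:
            raise IndexError("ATOMIC bit index out of range")

    def acquire(self, index):
        """Set bit `index` and return its initial value (0 or 1)."""
        self._check(index)
        old = (self._bits >> index) & 1
        self._bits |= 1 << index
        return old

    def release(self, index):
        """Clear bit `index` and return its initial value (0 or 1)."""
        self._check(index)
        old = (self._bits >> index) & 1
        self._bits &= ~(1 << index)
        return old
```

One plausible use, stated here as an assumption, is a lock: a thread that sees an initial value of 1 on acquire knows the bit was already held and retries.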
2.10) The RUN memory
The RUN memory is a 64-bit memory used to manage threads and HCPU synchronization:
The bits [0] through [23] reflect the status of the 24 threads:
RUN [ i ] set means the thread i is running,
RUN [ i ] cleared means the thread i is stopped (not running).
The bits [24] through [63] are used for DPU / HCPU synchronization.
the DPU can set/clr these bits through the CLR_RUN and BOOT instructions,
the thread conditionally jumping according to the bit initial value.
the HCPU can set/clr these bits through the control interface,
the HCPU obtaining in return the bit initial value.
3) Result Destination
3.1) ZERO as destination register
When the specified destination register is the ZERO register, then the instruction 32-bit or 64-bit result is discarded, the remaining functionality of the instruction being performed as usual.
3.2) The ‘.u’ and ‘.s’ instruction modifiers
Instructions generating 32-bit results can be modified:
by adding to the mnemonic the postfix ‘.u’: the instruction now generates a 64-bit result made by the zero-extension of the initial 32-bit result,
now the destination register must be a Dm 64-bit register.
a 32-bit result that is made by the sign extension of a smaller result cannot be zero-extended to 64-bit.
For example LBS.u is illegal.
by adding to the mnemonic the postfix ‘.s’: the instruction now generates a 64-bit result made by the sign-extension of the initial 32-bit result,
now the destination register must be a Dm 64-bit register.
a 32-bit result that is made by the zero extension of a smaller result cannot be sign-extended to 64-bit.
For example LBU.s is illegal.
To cope with the multiple possible combinations, the instruction description uses:
Xm to refer to Rm or Dm, depending on whether or not the instruction is used with the ‘.u’ or ‘.s’ modifier.
Xmz to refer to Xm or the ZERO register.
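The effect of the two modifiers on a 32-bit native result can be sketched in Python. The function names are illustrative only.

```python
MASK32 = 0xFFFFFFFF

def extend_u(result32):
    """'.u' modifier: zero-extend the 32-bit result to 64 bits."""
    return result32 & MASK32

def extend_s(result32):
    """'.s' modifier: sign-extend the 32-bit result to 64 bits."""
    r = result32 & MASK32
    if r & 0x80000000:
        return r | 0xFFFFFFFF00000000
    return r
```

The two extensions differ only when bit 31 of the native result is set.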
4) Jump & Boolean Conditions
4.1) Introduction
Most DPU instructions support conditions based on their result or on the properties of one of their source operands:
an instruction can include a condition, such that, after having performed its native functionality:
the instruction execution continues at a specified address if the condition is true,
the instruction execution continues sequentially otherwise.
an instruction can include a condition, such that, after having performed its native functionality and generated a native result:
the instruction, instead of writing its native result, replaces this result with the Boolean value that corresponds to the trueness of the condition,
the instruction execution continuing sequentially.
The allowed conditions are specific to each instruction. They are specified as follows:
a conditional jump is specified by placing a condition identifier and an IRAM address after the original operands.
in the instruction description the term Jcc means a Jump condition.
a Boolean Replacement is specified by placing only a condition identifier after the original operands.
in the instruction description the term Bcc means a Boolean Replacement condition.
Examples:
add R2, R3, R4 // R2 = R3 + R4 ;
add R2, R3, R4, z, null_result // if ((R2 = R3 + R4 ) == 0) GOTO null_result;
add R2, R3, R4, z // R2 = ( R3 + R4 ) == 0;
Note: the add instruction allows the same condition z as the Jump condition and as Boolean Replacement condition.
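The three forms of the add example above can be modelled in Python as a single function. This is a behavioural sketch only: the `mode` parameter and the function name are hypothetical, and the PC is counted in instructions, as stated in the PC register section.

```python
MASK32 = 0xFFFFFFFF

def add_z(r3, r4, mode="plain", pc=0, target=0):
    """Model `add R2, R3, R4[, z[, target]]`; return (R2, next_pc)."""
    native = (r3 + r4) & MASK32
    z = native == 0
    if mode == "jump":
        # Conditional jump: the native result is written, then the jump is decided.
        return native, (target if z else pc + 1)
    if mode == "bool":
        # Boolean Replacement: the condition value replaces the native result.
        return int(z), pc + 1
    return native, pc + 1
```

Note how the jump form still writes the native sum into R2, whereas the Boolean Replacement form writes 0 or 1.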
4.2) Condition Identifier
4.2.1) Common Conditions
t : true
z : true when the native result is null (Zero)
nz : true when the native result is not null (Not Zero)
sz : true when the first Source operand is null (Zero)
snz : true when the first Source operand is not null (Not Zero)
pl : true when the native result is positive (PLus)
mi : true when the native result is negative (MInus)
spl : true when the first Source operand is positive (PLus)
smi : true when the first Source operand is negative (MInus)
4.2.2) Specific Conditions Common To Addition and Subtraction
Notations
op1 : means the first operand
op2 : means the second operand
carry numbering
carry p is the carry generated by:
for addition : op1[ *p* : 0 ] + op2[ *p* : 0 ]
for subtraction : op1[ *p* : 0 ] + ~op2[ *p* : 0 ] + 1
for reverse subtraction : ~op1[ *p* : 0 ] + op2[ *p* : 0 ] + 1
The v/nv/c/nc Conditions
v : true when an oVerflow has been generated by an addition or subtraction
nv : true when No oVerflow has been generated by an addition or subtraction
c : true when:
a Carry 31 is generated by an addition,
no carry 31 is generated by a subtraction.
nc (No Carry) condition is the opposite of c.
4.2.3) Addition Specific Conditions
nc4 : true when no carry 4 is generated
nc5 : true when no carry 5 is generated
nc6 : true when no carry 6 is generated
nc7 : true when no carry 7 is generated
nc8 : true when no carry 8 is generated
nc9 : true when no carry 9 is generated
nc10 : true when no carry 10 is generated
nc11 : true when no carry 11 is generated
nc12 : true when no carry 12 is generated
nc13 : true when no carry 13 is generated
Why So Many No Carry Conditions ?
Because considering:
a memory buffer of size 2 ^ s, aligned onto its own size,
a pointer initially pointing inside this memory buffer, this pointer being used to read/write data from/to this buffer,
the addition or subtraction performed onto this pointer after each access to this buffer,
THEN:
nc s true means the new pointer value is still inside this buffer,
nc s false means the new pointer value is now outside this buffer.
Note: these conditions work whether the added value is positive or negative, as long as its absolute value is strictly smaller than the buffer size.
4.2.4) Comparison specific Conditions
These conditions are available only for subtraction-based instructions, with the exception of the lsl_sub instruction (which performs a shift then a subtraction):
ltu : op1 < op2 // unsigned comparison
geu : op1 >= op2 // unsigned comparison
leu : op1 <= op2 // unsigned comparison
gtu : op1 > op2 // unsigned comparison
lts : op1 < op2 // signed comparison
ges : op1 >= op2 // signed comparison
les : op1 <= op2 // signed comparison
gts : op1 > op2 // signed comparison
4.2.5) Extended Z Conditions
When an instruction supports the z/nz conditions, then sequentially:
it generates internally a z property based on the instruction native result,
it generates internally an extended z property: z && ZF.
Extended conditions use the extended z where the corresponding non-extended conditions use z:
xz : true when the extended z is true
nxz : true when the extended z is false
xleu : op1 <= op2 // unsigned comparison using extended z instead of z
xgtu : op1 > op2 // unsigned comparison using extended z instead of z
xles : op1 <= op2 // signed comparison using extended z instead of z
xgts : op1 > op2 // signed comparison using extended z instead of z
Extended conditions ease the construction of conditions on 64-bit results.
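As an illustration of a condition on a 64-bit result, the Python sketch below models a 64-bit addition split into two 32-bit steps. It assumes the low words are added first by an instruction that updates ZF and CF, and that the high words are then added with carry by an instruction carrying the xz condition; the function name is hypothetical.

```python
MASK32 = 0xFFFFFFFF

def add64_is_zero(a, b):
    """Return the xz condition of the high-word step of a 64-bit addition."""
    # Step 1: low words. ZF takes the z property, CF takes carry 31.
    low = (a & MASK32) + (b & MASK32)
    zf = (low & MASK32) == 0
    cf = low >> 32
    # Step 2: high words plus carry. z is the property of this step alone.
    high = ((a >> 32) & MASK32) + ((b >> 32) & MASK32) + cf
    z = (high & MASK32) == 0
    # Extended z: true only when both 32-bit halves are zero.
    return z and zf
```

With the plain z condition, the second step would report a null result for 0 + 5, because the high word alone is zero; the extended z correctly reports a non-null 64-bit result.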
ZF update
The ZF flag is left unchanged by the following instructions:
ldma, ldmai, sdma (DMA )
sb, sh, sw, sd, sb_id, sh_id, sw_id, sd_id (Stores)
lbu, lbs, lhu, lhs, lw, ld (Loads )
acquire, release, stop, clr_run, boot, resume (ATOMIC)
nop, bkp, call
The other instructions update the ZF flag with the z property (BEWARE: not with the extended z property).
4.2.6) Shift specific Conditions
se : true when op1[0] == 0 // Source Even
so : true when op1[0] == 1 // Source Odd
nsh32 : true when op2[5] == 0 // Not Shift 32
sh32 : true when op2[5] == 1 // Shift 32
The sh32/nsh32 conditions can be used to speed up 64-bit shift operations where the shift amount is in the 6-lsb of a register, as they enable a quick differentiation of the following cases:
shift by 32 bits or more,
shift by strictly less than 32 bits.
The se/so conditions can be used to speed up a 64-bit right shift by 1 bit, as they enable a quick differentiation of the following cases:
a 1 will be shifted out by a 1-bit right shift,
a 0 will be shifted out by a 1-bit right shift.
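The use of sh32 for a 64-bit shift can be sketched in Python. The decomposition below is one possible approach and is not taken from the handbook: it models a 64-bit logical left shift built from 32-bit LSL and LSLX operations (LSLX being described in the Shift / Rotate chapter), with the sh32 test selecting between the two cases. Function names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def lsl(x, n):
    """32-bit logical shift left by the 5 lsb of n."""
    return (x << (n & 31)) & MASK32

def lslx(x, n):
    """Bits that an LSL by the 5 lsb of n would shift out, zero-filled above."""
    n &= 31
    return (x & MASK32) >> (32 - n) if n else 0

def lsl64(hi, lo, amount):
    """Shift the 64-bit value {hi, lo} left by the 6 lsb of `amount`."""
    if amount & 0x20:
        # sh32 true: shift by 32 or more, the low word moves into the high word.
        return lsl(lo, amount), 0
    # nsh32 true: shift by less than 32, bits leaving lo enter hi.
    return lsl(hi, amount) | lslx(lo, amount), lsl(lo, amount)
```

The function returns the new (hi, lo) pair; a shift amount of 0 leaves both words unchanged because LSLX then yields 0.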
4.2.7) Bit Count Specific Conditions
max (MAXimal result) true when the result is equal to:
32 for the CAO instruction ( Count All Ones )
32 for the CLO instruction ( Count Leading ones )
32 for the CLZ instruction ( Count Leading Zero )
31 for the CLS instruction ( Count Leading Sign )
nmax (Not MAXimal result) is the opposite of max
These conditions allow bit-counting operations that span multiple 32-bit words to be sped up.
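A multi-word count can be sketched in Python as follows, using a leading-zero count as the example. The loop continues to the next word only while the per-word count is maximal, which is the decision the max/nmax conditions make available directly on the counting instruction. The function names and the most-significant-word-first ordering are assumptions of this sketch.

```python
MASK32 = 0xFFFFFFFF

def clz32(x):
    """Count leading zeros of a 32-bit word (32 when the word is zero)."""
    return 32 - (x & MASK32).bit_length()

def clz_words(words):
    """Count leading zeros across 32-bit words, most significant word first."""
    total = 0
    for word in words:
        count = clz32(word)
        total += count
        if count != 32:
            # nmax: a set bit was found in this word, so the count is complete.
            break
    return total
```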
4.2.8) 8-bit Multiply Specific Conditions
small : (op1[15:8] == 0) && (op2[15:8] == 0)
large : opposite of small
These conditions detect the case where a 16 x 16 multiply can be reduced to a single 8 x 8 multiply.
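The reduction detected by the small condition can be checked with a short Python sketch. The function names are illustrative, and the general path built from four 8 x 8 partial products is only one possible way to express a 16 x 16 multiply.

```python
def is_small(op1, op2):
    """small condition: bits [15:8] of both operands are zero."""
    return ((op1 >> 8) & 0xFF) == 0 and ((op2 >> 8) & 0xFF) == 0

def mul16(op1, op2):
    """Unsigned 16 x 16 multiply built from 8 x 8 products."""
    a, b = op1 & 0xFFFF, op2 & 0xFFFF
    al, ah = a & 0xFF, a >> 8
    bl, bh = b & 0xFF, b >> 8
    if is_small(a, b):
        # Both high bytes are zero: a single 8 x 8 product is the full result.
        return al * bl
    # General case: four 8 x 8 partial products.
    return al * bl + ((al * bh + ah * bl) << 8) + ((ah * bh) << 16)
```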
5) LDMA / LDMAI / SDMA (DMA)
5.1) Generalities
A thread requests a DMA transfer through a DMA instruction.
DMA transfers are 64-bit aligned,
transfer sizes are n x 64-bit, where n ranges from 1 to 256,
When executing a DMA instruction, the thread is suspended for the duration of the transfer,
the thread is temporarily absent from the pipeline,
it is useful to have more than 11 threads running, to compensate for the ones that temporarily leave the pipeline while they wait for the completion of a DMA instruction.
the thread RUN bit remaining set during this suspension.
The DMA is capable of:
Moving MRAM data to IRAM
Only the 48-lsb of the 64-bit words are written into the IRAM, the 16-msb being discarded.
Moving MRAM data to WRAM
Moving WRAM data to MRAM
Note: MRAM, WRAM, and IRAM have separate address spaces.
Note: the HCPU is not capable of performing DMA operations.
5.2) Behaviours
ldma #8, Rnx, Rp // Load WRAM (Rnx address) with MRAM (Rp address)
ldmai #8, Rnx, Rp // Load IRAM (Rnx address) with MRAM (Rp address)
sdma #8, Rnx, Rp // Store WRAM (Rnx address) into MRAM (Rp address)
Transfer size is: 1 + ((Rnx[30:24] + #8) & 0xFF), allowing transfers of 1 to 256 words of 64 bits.
Source and destination addresses are specified as follows:
ldma, ldmai, sdma: the 32-bit MRAM byte address is {Rp [31 :3], 0b000}
ldma, sdma : the 24-bit WRAM byte address is {Rnx[23 :3], 0b000}
ldmai : the IRAM instruction address is Rnx[p+2:3] where p is the PC width
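The size and address rules above can be written out as a small Python model. The function names are hypothetical; the bit ranges are taken as written in this section.

```python
def dma_transfer_words(rnx, imm8):
    """Number of 64-bit words: 1 + ((Rnx[30:24] + #8) & 0xFF)."""
    return 1 + ((((rnx >> 24) & 0x7F) + (imm8 & 0xFF)) & 0xFF)

def mram_byte_address(rp):
    """32-bit MRAM byte address {Rp[31:3], 0b000}."""
    return rp & 0xFFFFFFF8

def wram_byte_address(rnx):
    """24-bit WRAM byte address {Rnx[23:3], 0b000}."""
    return rnx & 0x00FFFFF8

def iram_instruction_address(rnx, pc_width):
    """IRAM instruction address Rnx[p+2:3], where p is the PC width."""
    return (rnx >> 3) & ((1 << pc_width) - 1)
```

For instance, an immediate of 255 with Rnx[30:24] equal to 0 gives the maximum transfer of 256 words, that is 2048 bytes.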
For all DMA instructions, if the MRAM byte address lies beyond the implemented MRAM, the instruction fails and generates a memory exception. The MRAM size in the first DPU implementation is 64 MB.
For ldma and sdma, if the WRAM byte address lies beyond the implemented WRAM, the instruction fails and generates a memory exception. The WRAM size is 64 KB in v1A, and 63488 B in v1B.
For ldmai, if the IRAM instruction address lies beyond the implemented IRAM, the instruction fails and generates a memory exception. The IRAM size is 4K instructions in v1A, and 3968 instructions in v1B.
Additional characteristics:
DMA instructions support neither jump conditions nor Boolean Replacement.
DMA instructions affect no registers.
6) Loads / Stores
6.1) Common Properties
The 24-bit effective address is given by the sum of a 24-bit displacement and the 24-lsb of base Rnx register
Rnx[31:24] are ignored,
for most instructions the displacement is a 24-bit immediate value,
for some stores the 24-bit displacement is the sign extension of a 12-bit immediate value.
The access effective address must be aligned according to the access width,
ZF and CF flags are left unchanged,
no condition is supported.
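The effective address calculation above can be sketched in Python. The function names are illustrative, and the wrap-around of the sum to 24 bits is an assumption of this sketch rather than a statement from the handbook.

```python
MASK24 = 0xFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value

def effective_address(rnx, disp24):
    """Sum of the 24 lsb of Rnx and a 24-bit displacement; Rnx[31:24] ignored."""
    return ((rnx & MASK24) + disp24) & MASK24

def effective_address_disp12(rnx, disp12):
    """Same calculation with a 12-bit displacement sign-extended to 24 bits."""
    return effective_address(rnx, sign_extend(disp12, 12))

def is_aligned(address, access_bytes):
    """True when the address is a multiple of the access width in bytes."""
    return address % access_bytes == 0
```

A 12-bit displacement of 0xFFC is therefore treated as -4.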
6.2) Loads
lbu Xm, Rnx, disp24 // Xm is loaded with the Unsigned Byte @ Rnx + disp24 (.s and .sb modifiers illegal)
lbs Xm, Rnx, disp24 // Xm is loaded with the Signed Byte @ Rnx + disp24 (.u and .ub modifiers illegal)
lhu Xm, Rnx, disp24 // Xm is loaded with the Unsigned Half @ Rnx + disp24 (.s and .sb modifiers illegal)
lhs Xm, Rnx, disp24 // Xm is loaded with the Signed Half @ Rnx + disp24 (.u and .ub modifiers illegal)
lw Xm, Rnx, disp24 // Xm is loaded with the Word @ Rnx + disp24
ld Dm, Rnx, disp24 // Dm is loaded with the Double word @ Rnx + disp24
6.3) Stores Register
sb Rnx, disp24, Rp // Rp[ 7:0] is stored @ Rnx + disp24
sh Rnx, disp24, Rp // Rp[15:0] is stored @ Rnx + disp24
sw Rnx, disp24, Rp // Rp is stored @ Rnx + disp24
sd Rnx, disp24, Dp // Dp is stored @ Rnx + disp24
6.4) Stores Immediate Value
sb Rnx, disp12, #8 // store #8 @ Rnx + sign_extend24( disp12 )
sh Rnx, disp12, #16 // store #16 @ Rnx + sign_extend24( disp12 )
sw Rnx, disp12, #16 // store sign_extend32( #16 ) @ Rnx + sign_extend24( disp12 )
sd Rnx, disp12, #16 // store sign_extend64( #16 ) @ Rnx + sign_extend24( disp12 )
6.5) Stores ID ORed With Immediate Value
sb_id Rnx, disp12, #8 // store ID | #8 @ Rnx + sign_extend24( disp12 )
sh_id Rnx, disp12, #16 // store ID | #16 @ Rnx + sign_extend24( disp12 )
sw_id Rnx, disp12, #16 // store ID | sign_extend32( #16 ) @ Rnx + sign_extend24( disp12 )
sd_id Rnx, disp12, #16 // store ID | sign_extend64( #16 ) @ Rnx + sign_extend24( disp12 )
6.6) Endianness Modifiers
By default, load/store instructions use the little-endian memory organization. Load/store instructions operating on 16-bit, 32-bit, or 64-bit data may have the ‘.b’ modifier added to their mnemonic, forcing them to use the big-endian memory organization.
For lhu, lhs and lw, use the .ub/.sb modifier to combine the .u/.s and .b modifiers.
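The difference between the two memory organizations can be shown with a short Python sketch of a 32-bit store and load. The function names and the `big_endian` flag (standing for the ‘.b’ modifier) are illustrative.

```python
def store_word(value, big_endian=False):
    """Return the 4 bytes written to memory, lowest address first."""
    order = "big" if big_endian else "little"
    return (value & 0xFFFFFFFF).to_bytes(4, order)

def load_word(data, big_endian=False):
    """Rebuild the 32-bit value from 4 bytes read lowest address first."""
    order = "big" if big_endian else "little"
    return int.from_bytes(data[:4], order)
```

Storing 0x12345678 places the byte 0x78 at the lowest address by default, and the byte 0x12 at the lowest address with the big-endian organization.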
7) Additions and Subtractions
Addition, result = op1 + op2
add Xmz, Sn , Rp
add Xmz, Rnx, Rp
add Xmz, Rnx, Rp , Bcc
add Xmz, Rnx, Rp , Jcc, IRAM_address
-----------------------------------------------
add Xmz, Sn , #WRAM
add ZERO, Rn , #32
add Rm, Rnx, #32
add Dm, Rn , #32
add ZERO, Rnx, #27
add Xm, Rnx, #24
-----------------------------------------------
add Xm, Rnx, #24 , Bcc
add ZERO, Rnx, #27PC, Jcc, IRAM_address
add Xm, Rnx, #24PC, Jcc, IRAM_address
Addition with Carry, result = op1 + op2 + CF
addc Xmz, Sn , Rp
addc Xmz, Rnx, Rp
addc Xmz, Rnx, Rp , Bcc
addc Xmz, Rnx, Rp , Jcc, IRAM_address
-----------------------------------------------
addc Xmz, Sn , #WRAM
addc ZERO, Rn , #32
addc Rm, Rnx, #32
addc ZERO, Rnx, #27
addc Xm, Rnx, #24
-----------------------------------------------
addc Xm, Rnx, #24 , Bcc
addc ZERO, Rnx, #27PC, Jcc, IRAM_address
addc Xm, Rnx, #24PC, Jcc, IRAM_address
Reverse subtraction, result = ~op1 + op2 + 1
rsub Xmz, Sn , Rp
rsub Xmz, Rnx, Rp
rsub Xmz, Rnx, Rp , Bcc
rsub Xmz, Rnx, Rp , Jcc, IRAM_address
-----------------------------------------------
rsub Xmz, Sn , #WRAM
rsub ZERO, Rn , #32
rsub Rm, Rnx, #32
rsub ZERO, Rnx, #27
rsub Xm, Rnx, #24
-----------------------------------------------
rsub Xm, Rnx, #24 , Bcc
rsub ZERO, Rnx, #27PC, Jcc, IRAM_address
rsub Xm, Rnx, #24PC, Jcc, IRAM_address
Reverse subtraction with Carry, result = ~op1 + op2 + CF
rsubc Xmz, Sn , Rp
rsubc Xmz, Rnx, Rp
rsubc Xmz, Rnx, Rp , Bcc
rsubc Xmz, Rnx, Rp , Jcc, IRAM_address
-----------------------------------------------
rsubc Xmz, Sn , #WRAM
rsubc ZERO, Rn , #32
rsubc Rm, Rnx, #32
rsubc ZERO, Rnx, #27
rsubc Xm, Rnx, #24
-----------------------------------------------
rsubc Xm, Rnx, #24 , Bcc
rsubc ZERO, Rnx, #27PC, Jcc, IRAM_address
rsubc Xm, Rnx, #24PC, Jcc, IRAM_address
Subtraction, result = op1 + ~op2 + 1
sub Xmz, Sn , Rp
sub Xmz, Rnx, Rp
sub Xmz, Rnx, Rp , Bcc
sub Xmz, Rnx, Rp , Jcc, IRAM_address
-------------------------------------------------------------------
sub Xmz, Sn , #WRAM
sub Dmz , Rn , #32 // replaced with add instructions
sub Rm , Rnx, #32 // ...
sub ZERO, Rnx, #27
sub Xm, Rnx, #24
-------------------------------------------------------------------
sub Xm, Rnx, #24 , Bcc
sub ZERO, Rnx, #27PC, Jcc, IRAM_address
sub Xm, Rnx, #24PC, Jcc, IRAM_address
Subtraction with carry, result = op1 + ~op2 + CF
subc Xmz, Sn , Rp
subc Xmz, Rnx, Rp
subc Xmz, Rnx, Rp , Bcc
subc Xmz, Rnx, Rp , Jcc, IRAM_address
-------------------------------------------------------------------
subc Xmz, Sn , #WRAM
subc ZERO, Rn , #32 // replaced with addc instructions
subc Rm , Rnx, #32 // ...
subc ZERO, Rnx, #27
subc Xm, Rnx, #24
-------------------------------------------------------------------
subc Xm, Rnx, #24 , Bcc
subc ZERO, Rnx, #27PC, Jcc, IRAM_address
subc Xm, Rnx, #24PC, Jcc, IRAM_address
7.1) CF update
As shown in the descriptions above, add/addc/sub/subc/rsub/rsubc use a 32-bit adder. These instructions update CF with the native carry 31 of this 32-bit adder. Another way of expressing the new CF value is:
when executing an ADD or ADDC, CF is set to the c condition,
when executing a SUB, SUBC, RSUB, or RSUBC, CF is set to the geu condition.
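The CF rule can be checked for add and sub with a small Python model of the 32-bit adder. The function names are illustrative; the sketch only relies on the result formulas given at the top of this chapter.

```python
MASK32 = 0xFFFFFFFF

def carry31(a, b, carry_in):
    """Carry out of bit 31 of the 32-bit adder computing a + b + carry_in."""
    return ((a & MASK32) + (b & MASK32) + carry_in) >> 32

def add_cf(op1, op2):
    """CF after add: native carry 31 of op1 + op2."""
    return carry31(op1, op2, 0)

def sub_cf(op1, op2):
    """CF after sub: native carry 31 of op1 + ~op2 + 1."""
    return carry31(op1, ~op2 & MASK32, 1)
```

For sub, the carry is 1 exactly when op1 >= op2 as unsigned values, which is why CF matches the geu condition.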
7.2) Why sub #32 is replaced with an add?
There is no encoding for the sub instruction with #32 because the two instructions:
sub Rm, Rn, #32
add Rm, Rn, ~#32 + 1
would be equivalent in terms of the 32-bit result generated. Concerning the CF flag update:
if #32 != 0 then computing ~#32 + 1 generates no carry, thus sub #32 is entirely equivalent to add -#32,
if #32 == 0 then the sub Rm, Rn, #0 instruction is encoded.
7.3) Why subc #32 is replaced with an addc?
There is no encoding for subc #32 as it is entirely equivalent to addc Rm, Rn, ~#32.
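Both replacement rules (sub #32 by add, and subc #32 by addc) can be verified with a Python model of the 32-bit adder that returns the result together with the carry. The function names are illustrative and not part of any toolchain.

```python
MASK32 = 0xFFFFFFFF

def adder(a, b, carry_in):
    """32-bit adder: return (result, carry 31) of a + b + carry_in."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32

def sub_imm(rn, imm):
    """sub Rm, Rn, #imm: Rn + ~imm + 1."""
    return adder(rn, ~imm & MASK32, 1)

def add_imm(rn, imm):
    """add Rm, Rn, #imm: Rn + imm."""
    return adder(rn, imm, 0)

def subc_imm(rn, imm, cf):
    """subc Rm, Rn, #imm: Rn + ~imm + CF."""
    return adder(rn, ~imm & MASK32, cf)

def addc_imm(rn, imm, cf):
    """addc Rm, Rn, #imm: Rn + imm + CF."""
    return adder(rn, imm, cf)
```

With a null immediate, sub and add give the same 32-bit result but a different carry (1 for sub, 0 for add), which is the case that still needs its own sub encoding.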
7.4) Supported Conditions
add and addc
Jcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi, v, nv, c, nc, nc4, nc5, nc6, nc7, nc8, nc9, nc10, nc11, nc12, nc13.
Bcc: z, nz, xz, nxz.
sub and subc
Jcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi, v, nv, ltu, geu, lts, ges, les, gts, leu, gtu, xles, xgts, xleu, xgtu.
Bcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi, v, nv, ltu, geu, lts, ges, les, gts, leu, gtu, xles, xgts, xleu, xgtu.
rsub and rsubc
Jcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi, v, nv, ltu, geu, lts, ges, les, gts, leu, gtu, xles, xgts, xleu, xgtu.
Bcc: z, nz, xz, nxz.
8) Logical instructions
AND, result = op1 & op2
AND Xmz , Rnx, Rp
AND Xmz , Rnx, Rp , Bcc
AND Xmz , Rnx, Rp , Jcc, IRAM_address
-----------------------------------------------
AND Rmz , Rn , #32
AND Dm , Rnx, #32
AND ZERO, Rnx, #28
AND Xm , Rnx, #24
-----------------------------------------------
AND Xm , Rnx, #24 , Bcc
AND ZERO, Rnx, #28PC, Jcc, IRAM_address
AND Xm , Rnx, #24PC, Jcc, IRAM_address
NAND, result = ~(op1 & op2)
NAND Xmz , Rnx, Rp
NAND Xmz , Rnx, Rp , Bcc
NAND Xmz , Rnx, Rp , Jcc, IRAM_address
-----------------------------------------------
NAND ZERO, Rnx, #28
NAND Xm , Rnx, #24
-----------------------------------------------
NAND Xm , Rnx, #24 , Bcc
NAND ZERO, Rnx, #28PC, Jcc, IRAM_address
NAND Xm , Rnx, #24PC, Jcc, IRAM_address
ANDN, result = (~op1) & op2
ANDN Xmz , Rnx, Rp
ANDN Xmz , Rnx, Rp , Bcc
ANDN Xmz , Rnx, Rp , Jcc, IRAM_address
-----------------------------------------------
ANDN ZERO, Rnx, #28
ANDN Xm , Rnx, #24
-----------------------------------------------
ANDN Xm , Rnx, #24 , Bcc
ANDN ZERO, Rnx, #28PC, Jcc, IRAM_address
ANDN Xm , Rnx, #24PC, Jcc, IRAM_address
OR, result = op1 | op2
OR Xmz , Rnx, Rp
OR Xmz , Rnx, Rp , Bcc
OR Xmz , Rnx, Rp , Jcc, IRAM_address
-----------------------------------------------
OR Dmz , Rn , #32
OR Rm , Rnx, #32
OR ZERO, Rnx, #28
OR Xm , Rnx, #24
-----------------------------------------------
OR Xm , Rnx, #24 , Bcc
OR ZERO, Rnx, #28PC, Jcc, IRAM_address
OR Xm , Rnx, #24PC, Jcc, IRAM_address
NOR, result = ~(op1 | op2)
NOR Xmz , Rnx, Rp
NOR Xmz , Rnx, Rp , Bcc
NOR Xmz , Rnx, Rp , Jcc, IRAM_address
-----------------------------------------------
NOR ZERO, Rnx, #28
NOR Xm , Rnx, #24
-----------------------------------------------
NOR Xm , Rnx, #24 , Bcc
NOR ZERO, Rnx, #28PC, Jcc, IRAM_address
NOR Xm , Rnx, #24PC, Jcc, IRAM_address
ORN, result = (~op1) | op2
ORN Xmz , Rnx, Rp
ORN Xmz , Rnx, Rp , Bcc
ORN Xmz , Rnx, Rp , Jcc, IRAM_address
-----------------------------------------------
ORN ZERO, Rnx, #28
ORN Xm , Rnx, #24
-----------------------------------------------
ORN Xm , Rnx, #24 , Bcc
ORN ZERO, Rnx, #28PC, Jcc, IRAM_address
ORN Xm , Rnx, #24PC, Jcc, IRAM_address
XOR, result = op1 ^ op2
XOR Xmz , Rnx, Rp
XOR Xmz , Rnx, Rp , Bcc
XOR Xmz , Rnx, Rp , Jcc, IRAM_address
---------------------------------------------
XOR ZERO, Rn , #32
XOR Rm , Rnx, #32
XOR ZERO, Rnx, #28
XOR Xm , Rnx, #24
-----------------------------------------------
XOR Xm , Rnx, #24 , Bcc
XOR ZERO, Rnx, #28PC, Jcc, IRAM_address
XOR Xm , Rnx, #24PC, Jcc, IRAM_address
NXOR, result = ~(op1 ^ op2)
NXOR Xmz , Rnx, Rp
NXOR Xmz , Rnx, Rp , Bcc
NXOR Xmz , Rnx, Rp , Jcc, IRAM_address
-------------------------------------------------------------------
NXOR ZERO, Rn , #32 // replaced with XOR instructions
NXOR Rm , Rnx, #32 // ...
NXOR ZERO, Rnx, #28
NXOR Xm , Rnx, #24
-------------------------------------------------------------------
NXOR Xm , Rnx, #24 , Bcc
NXOR ZERO, Rnx, #28PC, Jcc, IRAM_address
NXOR Xm , Rnx, #24PC, Jcc, IRAM_address
Supported Conditions
Jcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi
Bcc: z, nz, xz, nxz
Note: logical instructions update ZF but leave CF unchanged.
9) EXTUB / EXTSB / EXTUH / EXTSH (Zero/Sign Extensions)
The following instructions don’t support the .s modifier:
Extub Xmz, Rn // 8-bit (Byte) to 32-bit zero (Unsigned) extension
Extub Xmz, Rn, Bcc // ...
Extub Xmz, Rn, Jcc, IRAM_address // ...
---------------------------------------------------------------------------------------
Extuh Xmz, Rn // 16-bit (Half) to 32-bit zero (Unsigned) extension
Extuh Xmz, Rn, Bcc // ...
Extuh Xmz, Rn, Jcc, IRAM_address // ...
The following instructions don’t support the .u modifier:
Extsb Xmz, Rn // 8-bit (Byte) to 32-bit Signed extension
Extsb Xmz, Rn, Bcc // ...
Extsb Xmz, Rn, Jcc, IRAM_address // ...
---------------------------------------------------------------------------------------
Extsh Xmz, Rn // 16-bit (Half) to 32-bit Signed extension
Extsh Xmz, Rn, Bcc // ...
Extsh Xmz, Rn, Jcc, IRAM_address // ...
Supported Conditions
Jcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi
Bcc: z, nz, xz, nxz
10) HASH
These instructions don’t support the .s modifier.
hash Xmz, Rnx, Rp
hash Xmz, Rnx, Rp , Bcc
hash Xmz, Rnx, Rp , Jcc, IRAM_address
-----------------------------------------------
hash Xmz, Rnx, #24
hash Xmz, Rnx, #24 , Bcc
hash Xmz, Rnx, #24PC, Jcc, IRAM_address
10.1) Hash operation
The instruction result is given by the following table:
| op2[18:17] | op2[16] | Result |
|---|---|---|
| 00 | 0 | Op1[6:0] ^ Op1[13:7] |
| 00 | 1 | Op1[6:0] ^ Op1[13:7] ^ Op1[20:14] |
| 01 | 0 | Op1[7:0] ^ Op1[15:8] |
| 01 | 1 | Op1[7:0] ^ Op1[15:8] ^ Op1[23:16] |
| 10 | 0 | Op1[8:0] ^ Op1[17:9] |
| 10 | 1 | Op1[8:0] ^ Op1[17:9] ^ Op1[26:18] |
| 11 | 0 | Op1[9:0] ^ Op1[19:10] |
| 11 | 1 | Op1[9:0] ^ Op1[19:10] ^ Op1[29:20] |
Supported Conditions
Jcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi
Bcc: z, nz, xz, nxz
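The hash operation table of section 10.1 can be expressed compactly in Python. The function name is illustrative: op2[18:17] selects a field width of 7, 8, 9 or 10 bits, and op2[16] selects whether two or three consecutive fields of op1 are XORed together.

```python
def hash_op(op1, op2):
    """Model of the hash result selected by op2[18:16]."""
    width = 7 + ((op2 >> 17) & 0x3)          # 7, 8, 9 or 10 bits per field
    fields = 3 if (op2 >> 16) & 1 else 2     # number of fields XORed together
    mask = (1 << width) - 1
    result = 0
    for i in range(fields):
        result ^= (op1 >> (i * width)) & mask
    return result
```

The result is therefore always narrower than 32 bits: at most 10 bits wide.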
11) SATS (SATuration, Signed)
sats Xmz, Rnx
sats Xmz, Rnx, Bcc
sats Xmz, Rnx, Jcc, IRAM_address
result = (Rnx[31] == 1) ? 0x7FFFFFFF : 0x80000000
Supported Conditions
Jcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi
Bcc: z, nz, xz, nxz
12) Shift / Rotate
The shift value is the 5-lsb of the second operand, thus the shift/rotate amount ranges from 0 through 31: it can be the 5-lsb of an Rp register or a 5-bit immediate value.
The following table describes the Shift/Rotate instructions:
| Instruction | Description | Example: initial | Example: shift | Example: result |
|---|---|---|---|---|
| ROL | ROtate Left | 12345678 | 4 | 23456781 |
| ROR | ROtate Right | 12345678 | 4 | 81234567 |
| LSL | Logical Shift Left | 12345678 | 4 | 23456780 |
| LSL1 | Logical Shift Left with 1 insertion | 12345678 | 4 | 2345678F |
| LSR | Logical Shift Right | 12345678 | 4 | 01234567 |
| LSR1 | Logical Shift Right with 1 insertion | 12345678 | 4 | F1234567 |
| ASR | Arithmetic Shift Right | 12345678 | 4 | 01234567 |
| | | 89ABCDEF | 4 | F89ABCDE |
| LSLX | LSL eXtended. The result is the part that would be shifted out by an LSL, its MSB being 0-filled. | 12345678 | 0 | 00000000 |
| | | 12345678 | 4 | 00000001 |
| | | 12345678 | 28 | 01234567 |
| LSL1X | LSL1 eXtended. The result is the part that would be shifted out by an LSL1, its MSB being 1-filled. | 12345678 | 0 | FFFFFFFF |
| | | 12345678 | 4 | FFFFFFF1 |
| | | 12345678 | 28 | F1234567 |
| LSRX | LSR eXtended. The result is the part that would be shifted out by an LSR, its LSB being 0-filled. | 12345678 | 0 | 00000000 |
| | | 12345678 | 4 | 80000000 |
| | | 12345678 | 28 | 23456780 |
| LSR1X | LSR1 eXtended. The result is the part that would be shifted out by an LSR1, its LSB being 1-filled. | 12345678 | 0 | FFFFFFFF |
| | | 12345678 | 4 | 8FFFFFFF |
| | | 12345678 | 28 | 2345678F |
All shift/rotate instructions accept the same operand combinations, shown here for LSL:
LSL Xmz, Rnx, Rp
LSL Xmz, Rnx, Rp, Bcc
LSL Xmz, Rnx, Rp, Jcc, IRAM_address
-------------------------------------
LSL Xmz, Rnx, #5
LSL Xmz, Rnx, #5, Bcc
LSL Xmz, Rnx, #5, Jcc, IRAM_address
Supported Conditions
Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi, nsh32, sh32, se, so
Bcc: z, nz, xz, nxz
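The shift/rotate table can be cross-checked with a small software model. The Python sketch below is such a model, written from the descriptions and examples in the table; the function name `shift_rotate` and the string-keyed dispatch are illustrative choices and do not reflect how the hardware decodes these instructions.

```python
MASK = 0xFFFFFFFF


def shift_rotate(op: str, value: int, amount: int) -> int:
    """Model one shift/rotate instruction on a 32-bit value."""
    v = value & MASK
    n = amount & 31                      # only the 5 lsb of the second operand count
    fill_hi = (MASK << (32 - n)) & MASK  # n one-bits at the top
    fill_lo = (1 << n) - 1               # n one-bits at the bottom
    ops = {
        "ROL":   lambda: (v << n) | (v >> (32 - n)),
        "ROR":   lambda: (v >> n) | (v << (32 - n)),
        "LSL":   lambda: v << n,
        "LSL1":  lambda: (v << n) | fill_lo,
        "LSR":   lambda: v >> n,
        "LSR1":  lambda: (v >> n) | fill_hi,
        "ASR":   lambda: (v >> n) | (fill_hi if v >> 31 else 0),
        "LSLX":  lambda: v >> (32 - n),                # bits an LSL would drop
        "LSL1X": lambda: (v >> (32 - n)) | (MASK << n),
        "LSRX":  lambda: v << (32 - n),                # bits an LSR would drop
        "LSR1X": lambda: (v << (32 - n)) | (MASK >> n),
    }
    return ops[op]() & MASK              # truncate to 32 bits


print(f"{shift_rotate('LSL1X', 0x12345678, 28):08X}")  # F1234567, as in the table
```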
13) Shift/Rotate & add/sub
rol_add Xmz, Rnx, Rp, #5 // rotate left then addition
rol_add Xmz, Rnx, Rp, #5, Bcc // ...
rol_add Xmz, Rnx, Rp, #5, Jcc, IRAM_address // ...
-----------------------------------------------------------------------------
lsr_add Xmz, Rnx, Rp, #5 // shift right then addition
lsr_add Xmz, Rnx, Rp, #5, Bcc // ...
lsr_add Xmz, Rnx, Rp, #5, Jcc, IRAM_address // ...
-----------------------------------------------------------------------------
lsl_add Xmz, Rnx, Rp, #5 // shift left then addition
lsl_add Xmz, Rnx, Rp, #5, Bcc // ...
lsl_add Xmz, Rnx, Rp, #5, Jcc, IRAM_address // ...
-----------------------------------------------------------------------------
lsl_sub Xmz, Rnx, Rp, #5 // shift left then subtraction
lsl_sub Xmz, Rnx, Rp, #5, Bcc // ...
lsl_sub Xmz, Rnx, Rp, #5, Jcc, IRAM_address // ...
For all these instructions, the content of Rnx is shifted or rotated by the #5 immediate value, giving an intermediary result. This intermediary result is then added to, or subtracted from, the Rp value to give the final result of the instruction.
Supported Conditions
Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi
Bcc: z, nz, xz, nxz
NOTE: the z, nz, xz, nxz, pl and mi CONDITIONS ARE EVALUATED AGAINST THE INTERMEDIARY RESULT, NOT AGAINST THE FINAL RESULT
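The consequence of the note above is that a condition such as z can be true even though the value written to the destination is non-zero. The Python sketch below illustrates this for lsl_add and lsl_sub. The function names, the returned flag dictionary, and the operand order of the subtraction (Rp minus the shifted Rnx) are assumptions made for illustration; only z and mi are modelled.

```python
MASK = 0xFFFFFFFF


def _flags(intermediate: int) -> dict:
    # Conditions are derived from the shifted value, not from the sum.
    return {"z": intermediate == 0, "mi": bool(intermediate >> 31)}


def lsl_add(rn: int, rp: int, shift: int):
    intermediate = (rn << (shift & 31)) & MASK
    return (intermediate + rp) & MASK, _flags(intermediate)


def lsl_sub(rn: int, rp: int, shift: int):
    intermediate = (rn << (shift & 31)) & MASK
    # ASSUMPTION: the shifted value is subtracted from Rp.
    return (rp - intermediate) & MASK, _flags(intermediate)


result, flags = lsl_add(0, 5, 3)
print(result, flags)   # result is 5, yet z is true because 0 << 3 == 0
```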
14) CLZ / CLO / CLS / CAO (bit count)
These instructions don’t support the .s modifier.
CLZ Xmz, Rnx // Count Leading Zero
CLZ Xmz, Rnx, Bcc // ...
CLZ Xmz, Rnx, Jcc, IRAM_address // ...
-------------------------------------------------------------------------------------------------
CLO Xmz, Rnx // Count Leading Ones
CLO Xmz, Rnx, Bcc // ...
CLO Xmz, Rnx, Jcc, IRAM_address // ...
-------------------------------------------------------------------------------------------------
CLS Xmz, Rnx // Count Leading Sign: Indicates by how many bits the
CLS Xmz, Rnx, Bcc // source operand can be left-shifted without having
CLS Xmz, Rnx, Jcc, IRAM_address // its sign changed, the result being in the range 0-31.
-------------------------------------------------------------------------------------------------
CAO Xmz, Rnx // Count All Ones: counts the number
CAO Xmz, Rnx, Bcc // of ones in the source operand
CAO Xmz, Rnx, Jcc, IRAM_address // ...
Supported Conditions
Jcc: t, z, nz, xz, nxz, max, nmax, sz, nsz, spl, smi
Bcc: z, nz, xz, nxz
For CLS, the max (MAXimum) condition is true when the result is 31. For CLZ, CLO and CAO, the max condition is true when the result is 32. The nmax condition is always the opposite of the max condition.
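The four bit-count results and the max condition can be modelled in a few lines. The Python sketch below is one such model; the lower-case function names are illustrative, and the CLS formula (leading sign-bit run minus one) is derived from the description "how many bits the source operand can be left-shifted without having its sign changed".

```python
MASK = 0xFFFFFFFF


def clz(x: int) -> int:
    return 32 - (x & MASK).bit_length()


def clo(x: int) -> int:
    return clz(~x & MASK)


def cls(x: int) -> int:
    x &= MASK
    # Length of the run of bits equal to the sign bit, minus the sign bit itself.
    return (clo(x) if x >> 31 else clz(x)) - 1


def cao(x: int) -> int:
    return bin(x & MASK).count("1")


def max_condition(op: str, result: int) -> bool:
    return result == (31 if op == "CLS" else 32)


print(clz(1), clo(0xF0000000), cls(0x40000000), cao(0x12345678))
```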
15) MUL_STEP / DIV_STEP / MOVD / SWAPD
15.1) mul_step
mul_step Dmz, Rnx, Dp, #5
mul_step Dmz, Rnx, Dp, #5, Bcc
mul_step Dmz, Rnx, Dp, #5, Jcc, IRAM_address
Action performed
if (Dp[32] & 1) Dm[31: 0] = Dp[31: 0] + (Rnx << #5) // if the destination is the ZERO register,
Dm[63:32] = Dp[63:32] >> 1 // ... then no register is affected
Supported Conditions
Jcc: t, z, nz, sz, nsz, spl, smi
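A 32 x 32 multiplication can be built from 32 mul_step instructions, with the multiplier in the upper half of the 64-bit register and the accumulator in the lower half. The Python sketch below models this under several assumptions that the text does not state: the source and destination are the same D register, the lower half is left unchanged when the tested bit is 0, the sum is truncated to 32 bits, and the right shift is logical. The function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def mul_step(dp: int, rn: int, shift: int) -> int:
    """One mul_step on a 64-bit value dp; returns the new 64-bit value."""
    lo, hi = dp & MASK32, (dp >> 32) & MASK32
    if hi & 1:                               # Dp[32] is the lsb of the upper half
        lo = (lo + (rn << shift)) & MASK32
    return ((hi >> 1) << 32) | lo


def multiply(a: int, b: int) -> int:
    """32-bit product of a and b, using 32 steps with shifts 0 through 31."""
    d = (b & MASK32) << 32                   # multiplier in D[63:32], D[31:0] = 0
    for i in range(32):
        d = mul_step(d, a & MASK32, i)
    return d & MASK32


print(multiply(6, 7))
```

Each step consumes one bit of the multiplier, so fewer steps suffice when the multiplier is known to be narrow.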
15.2) div_step
div_step Dmz, Rnx, Dp, #5
div_step Dmz, Rnx, Dp, #5, Bcc
div_step Dmz, Rnx, Dp, #5, Jcc, IRAM_address
Action performed
if (Dp[31: 0] >= (Rnx << #5)){ // the comparison is unsigned
Dm[31: 0] = Dp[31: 0] - (Rnx << #5); // if the destination is the ZERO register
Dm[63:32] = (Dp[63:32] << 1) | 1 ; // ... then no register is affected
} // ...
else Dm[63:32] = (Dp[63:32] << 1) ; // ...
Supported Conditions
Jcc: t, sz, nsz, spl, smi
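The div_step action is one iteration of restoring division: the lower half holds the running remainder and the upper half collects quotient bits. The Python sketch below builds an unsigned division from it. It assumes that the source and destination are the same D register and that the lower half is unchanged when the comparison fails. The text does not specify what happens when Rnx << #5 shifts bits out of 32 bits, so the loop starts at the largest shift that keeps the divisor within 32 bits, which makes the result independent of that detail. The function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def div_step(dp: int, rn: int, shift: int) -> int:
    """One div_step on a 64-bit value dp; returns the new 64-bit value."""
    lo, hi = dp & MASK32, (dp >> 32) & MASK32
    shifted = (rn << shift) & MASK32
    if lo >= shifted:                        # unsigned comparison
        lo -= shifted
        hi = ((hi << 1) | 1) & MASK32
    else:
        hi = (hi << 1) & MASK32
    return (hi << 32) | lo


def divide(dividend: int, divisor: int):
    """Return (quotient, remainder) for 32-bit unsigned inputs, divisor > 0."""
    d = dividend & MASK32                    # D[63:32] = 0, D[31:0] = dividend
    top = 32 - divisor.bit_length()          # largest shift that loses no bit
    for shift in range(top, -1, -1):
        d = div_step(d, divisor, shift)
    return d >> 32, d & MASK32


print(divide(100, 7))
```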
15.3) movd
movd Dmz, Dp
movd Dmz, Dp, Bcc
movd Dmz, Dp, Jcc, IRAM_address
result = Dp
Supported Conditions
Jcc: t, sz, nsz, spl, smi
15.4) swapd
swapd Dmz, Dp
swapd Dmz, Dp, Bcc
swapd Dmz, Dp, Jcc, IRAM_address
result = { Dp[31:0], Dp[63:32] }
Supported Conditions
Jcc: t, sz, nsz, spl, smi
16) 8 x 8 Multiplications
The result of an 8 x 8 multiplication is initially 16-bit; this 16-bit result is then:
zero-extended to 32-bit for unsigned x unsigned multiplication,
sign-extended to 32-bit otherwise.
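| mnemonic | result[15:0] | multiply variant | Comment |
|---|---|---|---|
| mul_ul_ul | op1[ 7:0] x op2[ 7:0] | unsigned x unsigned (zero-extended to 32-bit) | .s forbidden |
| mul_ul_uh | op1[ 7:0] x op2[15:8] | unsigned x unsigned (zero-extended to 32-bit) | .s forbidden |
| mul_uh_ul | op1[15:8] x op2[ 7:0] | unsigned x unsigned (zero-extended to 32-bit) | .s forbidden |
| mul_uh_uh | op1[15:8] x op2[15:8] | unsigned x unsigned (zero-extended to 32-bit) | .s forbidden |
| mul_sl_ul | op1[ 7:0] x op2[ 7:0] | signed x unsigned (sign-extended to 32-bit) | .u forbidden |
| mul_sl_uh | op1[ 7:0] x op2[15:8] | signed x unsigned (sign-extended to 32-bit) | .u forbidden |
| mul_sh_ul | op1[15:8] x op2[ 7:0] | signed x unsigned (sign-extended to 32-bit) | .u forbidden |
| mul_sh_uh | op1[15:8] x op2[15:8] | signed x unsigned (sign-extended to 32-bit) | .u forbidden |
| mul_sl_sl | op1[ 7:0] x op2[ 7:0] | signed x signed (sign-extended to 32-bit) | |
| mul_sl_sh | op1[ 7:0] x op2[15:8] | signed x signed (sign-extended to 32-bit) | |
| mul_sh_sl | op1[15:8] x op2[ 7:0] | signed x signed (sign-extended to 32-bit) | |
| mul_sh_sh | op1[15:8] x op2[15:8] | signed x signed (sign-extended to 32-bit) | |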
syntax
mul_ul_ul Xmz, Rnx, Rp // similar syntax for the others
mul_ul_ul Xmz, Rnx, Rp, Bcc // 8 x 8 multiplications instructions
mul_ul_ul Xmz, Rnx, Rp, Jcc, IRAM_address // ...
Supported Conditions
Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi, ms8, nms8, mu8, nmu8
Bcc: z, nz, xz, nxz
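The mnemonic encodes, for each operand, whether the selected byte is signed (s) or unsigned (u) and whether it is the low (l) or high (h) byte of the lower half-word. The Python sketch below decodes the mnemonic that way and applies the extension rule stated at the start of this chapter. The function name `mul8` and the string-based decoding are illustrative assumptions.

```python
def mul8(mnemonic: str, op1: int, op2: int) -> int:
    """Model an 8 x 8 multiplication; returns the 32-bit result."""
    _, a, b = mnemonic.split("_")            # "mul_sl_uh" -> "sl", "uh"

    def pick(op: int, sel: str) -> int:
        byte = (op >> (8 if sel[1] == "h" else 0)) & 0xFF
        if sel[0] == "s" and byte & 0x80:    # reinterpret as a signed byte
            byte -= 0x100
        return byte

    product16 = (pick(op1, a) * pick(op2, b)) & 0xFFFF
    signed = a[0] == "s" or b[0] == "s"
    if signed and product16 & 0x8000:        # sign-extend 16 -> 32 bits
        return product16 | 0xFFFF0000
    return product16                         # zero-extended otherwise


print(hex(mul8("mul_ul_ul", 0xFF, 0xFF)))    # 255 * 255
print(hex(mul8("mul_sl_ul", 0xFF, 0xFF)))    # -1 * 255
```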
17) CMPB4
cmpb4 Xmz, Rnx, Rp
Functionality
result[31:24] = (Rx[31:24] == Rp[31:24]) ? 0x01 : 0x00;
result[23:16] = (Rx[23:16] == Rp[23:16]) ? 0x01 : 0x00;
result[15: 8] = (Rx[15: 8] == Rp[15: 8]) ? 0x01 : 0x00;
result[ 7: 0] = (Rx[ 7: 0] == Rp[ 7: 0]) ? 0x01 : 0x00;
Supported Conditions
Bcc: z, nz, xz, nxz
Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi
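The four per-byte comparisons can be expressed as a loop over byte positions. The Python sketch below is a model of the functionality listed above; the function name `cmpb4` simply mirrors the mnemonic and the operands are passed as plain integers.

```python
def cmpb4(rx: int, rp: int) -> int:
    """Compare two 32-bit values byte by byte; each matching byte yields 0x01."""
    result = 0
    for shift in (0, 8, 16, 24):
        if ((rx >> shift) & 0xFF) == ((rp >> shift) & 0xFF):
            result |= 0x01 << shift
    return result


# Bytes 3, 1 and 0 match; byte 2 (0x22 vs 0x00) does not.
print(f"{cmpb4(0x11223344, 0x11003344):08X}")
```

A zero result means that no byte matched, which is what the z condition tests.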
18) CALL
call Xmz, Rnx, Rp
call Xmz, Rnx, #PC // #PC is an immediate whose width is the one of the PC
Functionality
result = current PC + 1
The thread jumps to the IRAM address given by Rnx + Rp or by Rnx + #PC.
Note: there is no RETURN instruction: a “CALL ZERO, Rnx” instruction is used instead, where Rnx is the register where the return address has been previously saved.
19) ACQUIRE / RELEASE
acquire Rnx, #16
acquire Rnx, #16, Jcc, IRAM_address
release Rnx, #16
release Rnx, #16, Jcc, IRAM_address
Functionality
For both instructions, an 8-bit index i is calculated as follows:
tmp[15:0] = Rnx + #16
i = tmp[15:8] ^ tmp[7:0]
Then:
for ACQUIRE: ATOMIC[ i ] = 1,
for RELEASE: ATOMIC[ i ] = 0.
In both cases, the z/nz conditions are evaluated using the initial value of the ATOMIC[ i ] bit.
Supported Jcc Conditions for ACQUIRE: t, z, nz
Supported Jcc Conditions for RELEASE: nz
Note: when ACQUIRE/RELEASE is used correctly, the nz condition is always true for RELEASE.
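The index computation and the test-and-set behaviour can be modelled as follows. In this Python sketch the class `AtomicBits` and its method names are hypothetical, and reading a z outcome on ACQUIRE as "the bit was free and is now owned by the caller" is an interpretation suggested by the note above rather than a statement from the text.

```python
class AtomicBits:
    """Software model of the 256 ATOMIC bits (illustrative only)."""

    def __init__(self) -> None:
        self.bits = [0] * 256

    @staticmethod
    def index(rn: int, imm16: int) -> int:
        tmp = (rn + imm16) & 0xFFFF
        return (tmp >> 8) ^ (tmp & 0xFF)     # tmp[15:8] ^ tmp[7:0]

    def acquire(self, rn: int, imm16: int) -> int:
        i = self.index(rn, imm16)
        initial, self.bits[i] = self.bits[i], 1
        return initial                       # z/nz are evaluated on this value

    def release(self, rn: int, imm16: int) -> int:
        i = self.index(rn, imm16)
        initial, self.bits[i] = self.bits[i], 0
        return initial


atomic = AtomicBits()
print(atomic.acquire(0, 0x0102))   # 0: the bit was clear (z)
print(atomic.acquire(0, 0x0102))   # 1: the bit was already set (nz)
print(atomic.release(0, 0x0102))   # 1: nz, as expected for a correct release
```

Because the index is an XOR of two bytes, distinct values of Rnx + #16 can map to the same ATOMIC bit.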
20) STOP
stop
stop t, IRAM_address // only the t (True) condition is supported
Functionality
The RUN bit corresponding to the thread executing the STOP instruction is cleared, so the thread is no longer running, whether or not the t condition is present. If the t condition is present, the thread PC is also set to the specified jump address.
21) BOOT / RESUME / CLR_RUN
boot Rnx, #6
boot Rnx, #6, Jcc, IRAM_address
resume Rnx, #6
resume Rnx, #6, Jcc, IRAM_address
clr_run Rnx, #6
clr_run Rnx, #6, Jcc, IRAM_address
Functionality
All three instructions first generate a 6-bit unsigned index i:
tmp[13:0] = Rnx[13:0] + #6
i = tmp[13:8] ^ tmp[5:0]
21.1) CLR_RUN
clr_run simply clears the bit RUN[ i ]; the CLR_RUN instruction is then over.
21.2) BOOT / RESUME
If RUN[ i ] is initially set then the BOOT/RESUME instruction is over.
Otherwise:
RUN[ i ] is set
if i < 24 then the execution of the thread i is resumed:
for BOOT instructions: at the IRAM address 0
for RESUME instructions: at the current value of PC[ i ] (the PC of the thread i).
21.3) Supported conditions
The CLR_RUN, BOOT, and RESUME instructions support the same set of conditions:
Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi
Note: the z, nz, xz, nxz conditions use the nullity/non-nullity of the initial value of the bit RUN[ i ].
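The behaviour of the three instructions can be summarised with a small state model. In the Python sketch below, `thread_index`, the class `RunState`, and its methods are hypothetical names; the model keeps RUN as an integer bit field and one PC per thread, and each method returns the initial value of RUN[ i ], which is what the z/nz conditions test.

```python
def thread_index(rn: int, imm6: int) -> int:
    tmp = ((rn & 0x3FFF) + imm6) & 0x3FFF
    return ((tmp >> 8) & 0x3F) ^ (tmp & 0x3F)   # tmp[13:8] ^ tmp[5:0]


class RunState:
    """Illustrative model of the RUN bits and the per-thread PCs."""

    def __init__(self) -> None:
        self.run = 0
        self.pc = [0] * 24

    def _start(self, i: int, from_zero: bool) -> int:
        initial = (self.run >> i) & 1
        if not initial:
            self.run |= 1 << i
            if from_zero and i < 24:
                self.pc[i] = 0               # BOOT restarts at IRAM address 0
        return initial

    def boot(self, i: int) -> int:
        return self._start(i, from_zero=True)

    def resume(self, i: int) -> int:
        return self._start(i, from_zero=False)  # continues at PC[ i ]

    def clr_run(self, i: int) -> int:
        initial = (self.run >> i) & 1
        self.run &= ~(1 << i)
        return initial


state = RunState()
state.pc[3] = 0x40
print(state.resume(thread_index(0, 3)), hex(state.pc[3]))
```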
22) TIME / TIME_CFG
time Xmz
time Xmz, t, IRAM_address // only the t (True) condition is allowed
time_cfg Xmz, Rnx
time_cfg Xmz, Rnx, t, IRAM_address // only the t (True) condition is allowed
For both instructions: result = TIME[35:4] (the 32-msb of TIME)
22.1) TIME Increment Configuration
This part concerns only the time_cfg instruction
Setting Rnx[0] clears the TIME[35:0] register. The field Rnx[2:1] selects the increment configuration as follows:
00: keep the current increment configuration
01: set the configuration such that TIME[35:0] is incremented every DPU cycle
10: set the configuration such that TIME[35:0] is incremented every executed instruction
11: set the configuration such that TIME[35:0] is not incremented
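The Rnx encoding and the result of both instructions can be decoded as shown in the Python sketch below. The function names and the mode labels (`keep`, `every_cycle`, `every_instruction`, `stopped`) are illustrative names chosen here, not identifiers from the handbook.

```python
MODES = {
    0b00: "keep",               # keep the current increment configuration
    0b01: "every_cycle",        # increment every DPU cycle
    0b10: "every_instruction",  # increment every executed instruction
    0b11: "stopped",            # do not increment
}


def decode_time_cfg(rn: int):
    """Return (clear_time, mode) for the Rnx operand of time_cfg."""
    clear_time = bool(rn & 1)                # Rnx[0]
    mode = MODES[(rn >> 1) & 0b11]           # Rnx[2:1]
    return clear_time, mode


def time_result(time36: int) -> int:
    """Result of time / time_cfg: TIME[35:4], the 32 msb of the 36-bit counter."""
    return (time36 >> 4) & 0xFFFFFFFF


print(decode_time_cfg(0b011))                # clear TIME and count every cycle
print(hex(time_result(0x123456789)))
```

Since the 4 lsb of TIME are dropped, the returned value advances once every 16 increments.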
23) NOP / BKP
NOP // Does nothing
BKP // Does nothing besides causing a BKP exception