==============
 DPU Handbook
==============

1)  Introduction
================

The DPU is a multithreaded 32-bit processor whose number of hardware threads depends on the version of the DPU:

 * On a v1A DPU, there are 24 threads, indexed from 0 through 23.
 * On a v1B DPU, there are 16 threads, indexed from 0 through 15.

A thread can be running or stopped. The state of the thread *i* is reflected in the 24-lsb
of a 64-bit register named RUN (the 40-msb being used for other purposes, described later):

* RUN[ *i* ] = 0 -> the thread *i* is stopped (not executing),
* RUN[ *i* ] = 1 -> the thread *i* is running (executing).
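
As a minimal illustration of the bit layout above, the following Python sketch decodes a RUN value into the list of running thread indices. The function name ``running_threads`` and the sample value are hypothetical; only the mapping of bit *i* to thread *i* comes from the text.

```python
def running_threads(run_value, num_threads=24):
    """Return the indices of the threads whose RUN bit is set.

    Only the low `num_threads` bits are inspected; the upper bits of the
    64-bit RUN register serve other purposes and are ignored here.
    """
    return [i for i in range(num_threads) if (run_value >> i) & 1]


# Hypothetical RUN value: threads 0, 1 and 23 running, bit 24 set for another purpose.
sample = (1 << 24) | (1 << 23) | 0b11
print(running_threads(sample))
```

On a v1B DPU the same helper would be called with ``num_threads=16``.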

The DPU reaches full performance when enough hardware threads are
running to keep the pipeline filled (more than 10 threads).
'Overfilling' the pipeline is recommended, because threads issuing
DMA instructions are temporarily removed from the pipeline.

2)  DPU state
=============

2.1)  Threads 32-bit registers
------------------------------

A thread has access to 32 x 32-bit registers:

* 24 general purpose 32-bit registers, private to the thread: **R0 - R23**
* 4 fixed 32-bit registers, common to all threads:

  * **ZERO** : fixed to the value 0,
  * **ONE**  : fixed to the value 1,
  * **LNEG** : fixed to the value 0xFFFFFFFF (Least NEGative),
  * **MNEG** : fixed to the value 0x80000000 (Most NEGative).

* 4 fixed 32-bit registers, private to the thread:

  * **ID**   : fixed to the thread index.
  * **ID2**  : fixed to the thread index **x 2**.
  * **ID4**  : fixed to the thread index **x 4**.
  * **ID8**  : fixed to the thread index **x 8**.

2.1.1)  R0 - R23 seen as Stack Registers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The 24 general-purpose 32-bit registers can also be viewed
as 24 x 32-bit stack registers: S0 - S23.

Some instructions accept an **Sn** register in place of
an **Rn** register. The register value is unchanged, but the way this value
is used changes, as described in the **stack exceptions chapter**.

2.1.2)  R0 - R23 register pair seen as 64-bit registers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The 24 general-purpose 32-bit registers can also be viewed as 12 x 64-bit registers:

* **D0**  = { **R0**  , **R1**  }
* **D2**  = { **R2**  , **R3**  }
* **D4**  = { **R4**  , **R5**  }
* **D6**  = { **R6**  , **R7**  }
* **D8**  = { **R8**  , **R9**  }
* **D10** = { **R10** , **R11** }
* **D12** = { **R12** , **R13** }
* **D14** = { **R14** , **R15** }
* **D16** = { **R16** , **R17** }
* **D18** = { **R18** , **R19** }
* **D20** = { **R20** , **R21** }
* **D22** = { **R22** , **R23** }

The Dn 32-msb are held by the even Rn register, the 32-lsb being held by the odd Rn+1 register.
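
The pairing rule can be sketched in Python as follows. The register file is modelled as a plain list, and the helper names ``read_d`` and ``write_d`` are hypothetical; the sketch only illustrates that the even register holds the 32 msb and the odd register the 32 lsb.

```python
MASK32 = 0xFFFFFFFF


def _check_pair(n):
    # Dn exists only for even n in the range 0..22.
    if n % 2 or not 0 <= n <= 22:
        raise ValueError("Dn exists only for even n in 0..22")


def read_d(regs, n):
    """Compose the 64-bit value of Dn from Rn (32 msb) and Rn+1 (32 lsb)."""
    _check_pair(n)
    return ((regs[n] & MASK32) << 32) | (regs[n + 1] & MASK32)


def write_d(regs, n, value):
    """Split a 64-bit value across Rn (32 msb) and Rn+1 (32 lsb)."""
    _check_pair(n)
    regs[n] = (value >> 32) & MASK32
    regs[n + 1] = value & MASK32


regs = [0] * 24                      # models R0-R23
write_d(regs, 2, 0x1122334455667788)
print(hex(regs[2]), hex(regs[3]))
```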

2.3)  Threads PC register
-------------------------

A thread comprises a **PC** register, whose width is implementation-dependent:

* the PC width is in the range 12-16 bits,
* the first DPU implementation has a 12-bit PC.

**Note:** the PC contains an **instruction address**, not **a byte address**.

2.4)  Naming Conventions
-------------------------

To specify which operands are allowed for each instruction, the following naming conventions are used:

* **#32**                : a 32-bit immediate value.
* **#28**                : a 28-bit immediate value, sign extended to 32-bit.
* **#27**                : a 27-bit immediate value, sign extended to 32-bit.
* **#24**                : a 24-bit immediate value, sign extended to 32-bit.
* **#16**                : a 16-bit immediate value, that is, according to the instruction considered, either:

  * not extended,
  * sign extended to 32-bit,
  * sign extended to 64-bit.

* **#8**                 : an 8-bit immediate value.
* **#WRAM**              : an immediate signed value whose width is *p* + 1 when the WRAM size is 2 ^ *p*.
* **disp24**             : a 24-bit immediate value.
* **disp12**             : a 12-bit immediate value, sign extended to 24-bit.
* **#28-PC**             : an immediate value whose width is 28 minus the width of PC, sign-extended to 32-bit.
* **#27-PC**             : an immediate value whose width is 27 minus the width of PC, sign-extended to 32-bit.
* **#24-PC**             : an immediate value whose width is 24 minus the width of PC, sign-extended to 32-bit.
* **#PC**                : an immediate value whose width is the width of PC.
* **#6**                 : a 6-bit immediate value.
* **#5**                 : a 5-bit immediate value.
* **Rm**, **Rn**, **Rp** : one of the registers R0-R23.
* **Rnx**                : one of the registers R0-R23 or one of the fiXed registers: ZERO, ONE, LNEG, MNEG, ID, ID2, ID4 or ID8.
* **Rmz**                : one of the registers R0-R23 or the ZERO register.
* **Dm**, **Dp**         : one of the registers D0-D22.
* **Dmz**                : one of the registers D0-D22 or the ZERO register.
* **Xm**                 : one of the registers R0-R23 or D0-D22.
* **Xmz**                : one of the registers R0-R23, D0-D22 or the ZERO register.

2.5)  Threads ZF and CF flags
-----------------------------

To help the execution of 64-bit arithmetic, each thread has 2 x 1-bit flags:

* **ZF** : Zero Flag,
* **CF** : Carry Flag.

2.6)  The TIME register
-----------------------

This 36-bit register is common to all the threads. According to its configuration, TIME either:

* stays unchanged,
* increments at every cycle,
* increments at every executed instruction.

The TIME_CFG (TIME ConFiGure) instruction allows:

* the optional setting of the TIME configuration
* the optional clearing of TIME[35:0]

The 32-msb of TIME can be obtained through the TIME and TIME_CFG instructions.

2.6)  The IRAM
--------------

A DPU comprises an Instruction memory named **IRAM** holding
2 ^ *p* 48-bit wide instructions, where *p* is the **PC** width.

**PC_width and instruction encoding**

* In many instructions, the width of the immediate value
  that can be encoded varies inversely with the PC width.
* The current implementation has a 12-bit PC but supports (through the configuration
  by the **HCPU** of the **PC_MODE** control register) the execution of binaries
  generated for DPUs with a larger PC width, as long as these binaries fit into the **IRAM**.

The **IRAM** can be accessed:

* by the **HCPU** through the control interface,

  * The HCPU can read/write the **IRAM** even when threads are running.

* by the **DPU** through the execution of **ldmai** instructions,

  * the **DPU** reads the **IRAM** only through the fetching of instructions.

2.7)  The WRAM
--------------

The **WRAM** is a **64 KB** memory that is accessible:

* by the **HCPU** through the control interface,
* by the **DPU** through:

  * 8-bit, 16-bit, 32-bit and 64-bit load/store instructions,
  * **ldma**/**sdma** instructions.

**Note 1:** The **WRAM** has a 24-bit wide address space, where currently only the range 0x000000 - 0x00FFFF is used.

**Note 2:** On the v1B DPU, only 63488 bytes are usable.

2.7.1)  Load/Store Memory Exception
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Load/Store instructions generate a memory exception when:

* the address is not aligned with respect to the access size,
* the address is outside the range 0 to (64 KB - 1),
* the address is a stack address and crosses its associated bound.

**Note:** exception handling is performed by the **HCPU**.

2.7.2)  Stack Overflow Exception
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since up to 24 threads are running, up to 24 different stacks are present,
thus the DPU comprises a hardware mechanism to detect stack overflow early on.

2.7.2.1)  Stack Overflow Exception Caused By Load/Store
+++++++++++++++++++++++++++++++++++++++++++++++++++++++

Load/Store allows the specification of an **Sn** register instead of an **Rn** register as the base of the effective address calculation.
While **Sn** and **Rn** contents are identical, specifying an **Sn** register changes the way this content is used.

Considering a WRAM of size 2 ^ *p*, then:

* **Sn** [31 : *p*  ] contains the stack bound address (or its MSB),
* **Sn** [ *p* -1:0 ] contains the current stack address.

The stack bound address encoding is adapted to the WRAM size as follows:

* **64  KB** : stack bound [15:0] is   Rn[31:16]
* **128 KB** : stack bound [16:0] is { Rn[31:17], 00 }
* **256 KB** : stack bound [17:0] is { Rn[31:18], 0000 }
* **512 KB** : stack bound [18:0] is { Rn[31:19], 000000 }
* **1   MB** : stack bound [19:0] is { Rn[31:20], 00000000 }

The STACK_UP control register (configurable by the **HCPU**) specifies the progression direction for all the stacks.

* **STACK_UP set** ... **upward progressing stacks**: an Sn-based load/store at
  an address bigger or equal to the stack bound address generates a memory exception.

* **STACK_UP cleared** ... **downward progressing stacks**: an Sn-based load/store at
  an address strictly smaller than the stack bound address generates a memory exception.
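
The bound extraction and the STACK_UP rule can be sketched in Python as below. This is an illustrative model, not the hardware logic: the helper names are hypothetical, and the effective address is assumed to be computed elsewhere and passed in as ``address``.

```python
def stack_bound(sn, p=16):
    """Extract the stack bound from Sn for a WRAM of 2**p bytes (16 <= p <= 20).

    Sn[31:p] holds the bound (or its msb); it is padded on the right with
    2 * (p - 16) zero bits to form a p-bit address, as in the list above.
    """
    if not 16 <= p <= 20:
        raise ValueError("only WRAM sizes from 64 KB to 1 MB are modelled")
    return ((sn & 0xFFFFFFFF) >> p) << (2 * (p - 16))


def stack_access_faults(sn, address, stack_up, p=16):
    """Return True when an Sn-based load/store at `address` crosses the bound."""
    bound = stack_bound(sn, p)
    if stack_up:
        return address >= bound      # upward stacks: fault at or above the bound
    return address < bound           # downward stacks: fault strictly below it


# 64 KB WRAM, upward stack: bound 0x0800, current stack address 0x0400.
sn_up = 0x08000400
print(stack_access_faults(sn_up, 0x07FC, stack_up=True))
print(stack_access_faults(sn_up, 0x0800, stack_up=True))
```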

2.7.2.2)  Stack Overflow Exception Caused by Addition/Subtraction
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

An addition/subtraction to a stack pointer must keep the MSBs of this stack pointer unchanged,
as these MSBs specify the stack bound. Thus **add**/**addc**/**sub**/**subc**/**rsub**/**rsubc**
instructions with an **Sn** register specified as first source operand will generate
an exception if the result [31: *p* ] differs from **Sn** [31: *p* ].
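
A small Python model of this check, under the assumption that the hardware exception can be represented by a Python exception. The function name ``stack_pointer_add`` is hypothetical.

```python
def stack_pointer_add(sn, delta, p=16):
    """Model an addition to a stack pointer held in Sn.

    Returns the 32-bit result, or raises OverflowError (standing in for the
    DPU exception) when bits [31:p] of the result differ from Sn[31:p].
    """
    sn &= 0xFFFFFFFF
    result = (sn + delta) & 0xFFFFFFFF
    if (result >> p) != (sn >> p):
        raise OverflowError("stack bound field [31:p] would be modified")
    return result


# Bound field 0x0800, stack address near the top of the 64 KB range.
print(hex(stack_pointer_add(0x0800FFF8, 4)))
```

Adding 8 instead of 4 to the same value would carry into bit 16 and trigger the exception.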

**Note:** when an addition/subtraction has an **Sn** register as the first source operand, the assembler
allows an **Sm** register as the destination register, purely for naming consistency.

2.8)  The MRAM
--------------

The **MRAM** is a **64 MB** memory accessible:

* by the **HCPU** through the DDR4 legacy interface,
* by the **DPU**  through **ldma** instructions,
* by the **DPU**  through **sdma** instructions.

2.9)  The ATOMIC memory
-----------------------

This 256-bit memory is used for thread synchronization.

* A bit of the **ATOMIC** memory can be set by a thread through the **ACQUIRE** instruction,

  * the thread conditionally jumping according to the bit initial value.

* A bit of the **ATOMIC** memory can be cleared by a thread through the **RELEASE** instruction,

  * the thread conditionally jumping according to the bit initial value.

* A bit of the **ATOMIC** memory can be set or cleared by the **HCPU** through the control interface,

  * the **HCPU** obtaining in return the bit initial value.

2.10)  The RUN memory
---------------------

The **RUN** memory is a 64-bit memory used to manage threads and HCPU synchronization:

* The bits [0] through [23] reflect the status of the 24 threads:

  * **RUN** [ *i* ] set means the thread *i* is running,

  * **RUN** [ *i* ] cleared means the thread *i* is stopped (not running).

* The bits **[24]** through **[63]** are used for **DPU** / **HCPU** synchronization.

  * the **DPU** can set/clr these bits through the **CLR_RUN** and **BOOT** instructions,

    * the thread conditionally jumping according to the bit initial value.

  * the **HCPU** can set/clr these bits through the control interface,

    * the **HCPU** obtaining in return the bit initial value.

3)  Result Destination
======================

3.1)  **ZERO** as destination register
--------------------------------------

When the specified destination register is the **ZERO** register, then the instruction 32-bit or 64-bit result is discarded, the remaining functionality of the instruction being performed as usual.

3.2)  The '.u' and '.s' instruction modifiers
---------------------------------------------

Instructions generating 32-bit results can be modified:

* by adding to the mnemonic the postfix **'.u'**: the instruction now generates
  a 64-bit result made by the **zero-extension** of the initial 32-bit result,

  * now the destination register must be a Dm 64-bit register.
  * a 32-bit result that is made by the sign extension of a smaller result cannot be zero-extended to 64-bit.
  
    * For example LBS.u is illegal.

* by adding to the mnemonic the postfix **'.s'**: the instruction now generates
  a 64-bit result made by the **sign-extension** of the initial 32-bit result,

  * now the destination register must be a Dm 64-bit register.
  * a 32-bit result that is made by the zero extension of a smaller result cannot be sign-extended to 64-bit.  

    * For example LBU.s is illegal.


To cope with the multiple possible combinations, the instruction description uses:

* Xm to refer to Rm or Dm, depending on whether the instruction is used with the '.u' or '.s' modifier.
* Xmz to refer to Xm or the **ZERO** register.

4)  Jump & Boolean Conditions
=============================

4.1)  Introduction
------------------

Most DPU instructions support conditions based on their result or the properties of one of their source operands:

* an instruction can include a condition, such that, after having performed its native functionality:
 
  * the instruction execution continues at a specified address if the condition is true,
  * the instruction execution continues sequentially otherwise.

* an instruction can include a condition, such that, after having
  performed its native functionality and generated a native result:

  * the instruction, instead of writing its native result, replaces this result
    with the Boolean value that corresponds to the trueness of the condition,
  * the instruction execution continuing sequentially.

The allowed conditions are specific to each instruction. They are specified as follows:

* a conditional jump is specified by placing a condition identifier
  and an IRAM address after the original operands.

  * in the instruction description the term **Jcc** means a Jump condition.

* a Boolean Replacement is specified by placing only a condition identifier after the original operands.

  * in the instruction description the term **Bcc** means a Boolean Replacement condition.

**Examples:** ::

  add R2, R3, R4                  //      R2 =   R3 + R4 ;
  add R2, R3, R4, z, null_result  // if ((R2 =   R3 + R4 ) == 0) GOTO null_result;
  add R2, R3, R4, z               //      R2 = ( R3 + R4 ) == 0;

**Note:** the add instruction allows the same condition **z** as the Jump condition and as Boolean Replacement condition.
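
The three forms of the example can be modelled in Python as below. This is a sketch only: the name ``add_model`` is hypothetical, only the **z** condition is handled, and the Boolean value is assumed to be 1 for true and 0 for false, as the C-like comments in the example suggest.

```python
MASK32 = 0xFFFFFFFF


def add_model(op1, op2, cond=None, target=None):
    """Model `add Rm, Rn, Rp` with an optional `z` condition.

    Returns (value_written_to_Rm, next_pc). A next_pc of None means that
    execution continues sequentially.
    """
    native = (op1 + op2) & MASK32
    if cond is None:
        return native, None                      # plain form
    if cond != "z":
        raise ValueError("only the z condition is modelled")
    is_zero = native == 0
    if target is not None:                       # Jcc form: result kept, maybe jump
        return native, (target if is_zero else None)
    return int(is_zero), None                    # Bcc form: result replaced


print(add_model(1, 2))                  # plain addition
print(add_model(1, MASK32, "z", 0x40))  # conditional jump taken
print(add_model(1, MASK32, "z"))        # Boolean replacement
```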

4.2)  Condition Identifier
--------------------------

4.2.1)  Common Conditions
~~~~~~~~~~~~~~~~~~~~~~~~~

* **t**   : true
* **z**   : true when the native result is null (Zero)
* **nz**  : true when the native result is not null (Not Zero)
* **sz**  : true when the first Source operand is null (Zero)
* **snz** : true when the first Source operand is not null (Not Zero)
* **pl**  : true when the native result is positive (PLus)
* **mi**  : true when the native result is negative (MInus)
* **spl** : true when the first Source operand is positive (PLus)
* **smi** : true when the first Source operand is negative (MInus)

4.2.2)  Specific Conditions Common To Addition and Subtraction 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Notations**

* **op1** : means the first operand
* **op2** : means the second operand

**carry numbering**

carry *p* is the carry generated by: ::

  for         addition    :  op1[p:0] +  op2[p:0]
  for         subtraction :  op1[p:0] + ~op2[p:0] + 1
  for reverse subtraction : ~op1[p:0] +  op2[p:0] + 1

**The v/nv/c/nc Conditions**

* **v**  : true when an oVerflow has been generated by an addition or subtraction
* **nv** : true when No oVerflow has been generated by an addition or subtraction
* **c**  : true when:

  * a Carry 31 is generated by an addition,
  * no carry 31 is generated by a subtraction.

* **nc** (No Carry) condition is the opposite of **c**.

4.2.3)  Addition Specific Conditions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

  nc4  : true when no carry 4  is generated
  nc5  : true when no carry 5  is generated
  nc6  : true when no carry 6  is generated
  nc7  : true when no carry 7  is generated
  nc8  : true when no carry 8  is generated
  nc9  : true when no carry 9  is generated
  nc10 : true when no carry 10 is generated
  nc11 : true when no carry 11 is generated
  nc12 : true when no carry 12 is generated
  nc13 : true when no carry 13 is generated

**Why So Many No Carry Conditions?**

Because considering:

* a memory buffer of size 2 ^ *s*, aligned onto its own size,
* a pointer initially pointing inside this memory buffer, this pointer being used to read/write data from/to this buffer,
* the addition or subtraction performed onto this pointer after each access to this buffer,
* THEN:

  * nc *s* true  means the new pointer value is still inside this buffer,
  * nc *s* false means the new pointer value is now outside this buffer.

**Note:** these conditions work whether the added value is positive
or negative, as long as its absolute value is strictly smaller than the buffer size.

4.2.4)  Comparison specific Conditions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

These conditions are available only for subtraction-based instructions,
except the **lsl_sub** instruction (which performs a shift then a subtraction): ::

  ltu : op1  < op2  // unsigned comparison
  geu : op1 >= op2  // unsigned comparison
  leu : op1 <= op2  // unsigned comparison
  gtu : op1  > op2  // unsigned comparison
  lts : op1  < op2  //   signed comparison
  ges : op1 >= op2  //   signed comparison
  les : op1 <= op2  //   signed comparison
  gts : op1  > op2  //   signed comparison

4.2.5)  Extended Z Conditions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When an instruction supports the z/nz conditions, then sequentially:

* it generates internally a z property based on the instruction native result,
* it generates internally an **extended z property**: **z && ZF**.

Extended conditions use the **extended z** where the non-extended conditions use **z**: ::

   xz  : true when the extended z is true
  nxz  : true when the extended z is false
  xleu : op1 <= op2  // unsigned comparison using extended z instead of z
  xgtu : op1 >  op2  // unsigned comparison using extended z instead of z
  xles : op1 <= op2  //   signed comparison using extended z instead of z
  xgts : op1 >  op2  //   signed comparison using extended z instead of z

Extended conditions ease the construction of conditions on 64-bit results.

**ZF update**

The **ZF** flag is left unchanged by the following instructions: ::

  ldma, ldmai, sdma                             (DMA   )
  sb, sh, sw, sd, sb_id, sh_id, sw_id, sd_id    (Stores)
  lbu, lbs, lhu, lhs, lw, ld                    (Loads )
  acquire, release, stop, clr_run, boot, resume (ATOMIC)
  nop, bkp, call

**The other instructions update the ZF flag with the z property (BEWARE: not with the extended z property).**
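
To illustrate how the extended z property helps with 64-bit results, the Python sketch below models a 64-bit equality test built from a **sub** on the low words followed by a **subc** on the high words tested with **xz**. The instruction sequence and all names are illustrative assumptions; only the rules "extended z = z && ZF" and "ZF is updated with z" come from the text.

```python
MASK32 = 0xFFFFFFFF


class Flags:
    """Per-thread ZF and CF flags."""

    def __init__(self):
        self.zf = 0
        self.cf = 0


def sub_step(flags, op1, op2, use_carry):
    """Model sub (use_carry=False) or subc (use_carry=True), result discarded.

    Returns (z, xz). xz uses ZF as it was before this instruction; ZF is then
    updated with z (not xz) and CF with the carry 31 of the 32-bit adder.
    """
    carry_in = flags.cf if use_carry else 1
    total = (op1 & MASK32) + (~op2 & MASK32) + carry_in
    z = (total & MASK32) == 0
    xz = z and bool(flags.zf)
    flags.zf = int(z)
    flags.cf = total >> 32
    return z, xz


def equal64(a, b):
    """Return True when the 64-bit values a and b are equal."""
    flags = Flags()
    sub_step(flags, a & MASK32, b & MASK32, use_carry=False)   # low words
    _, xz = sub_step(flags, a >> 32, b >> 32, use_carry=True)  # high words
    return xz


print(equal64(0x100000000, 0x100000000))
print(equal64(0x200000000, 0x100000001))
```

In the second call the high-word subtraction alone yields zero because of the borrow, so a plain **z** test would wrongly report equality; **xz** does not, since ZF was cleared by the low-word subtraction.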

4.2.6)  Shift specific Conditions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

::

   se   : true when op1[0] == 0  // Source Even
   so   : true when op1[0] == 1  // Source Odd
  nsh32 : true when op2[5] == 0  // Not Shift 32
   sh32 : true when op2[5] == 1  //     Shift 32

The **sh32**/**nsh32** conditions can be used to speed up 64-bit shift operations where the shift amount is in the 6-lsb of a register, as they enable a quick differentiation of the following cases:

* shift by 32 bits or more,
* shift by strictly less than 32 bits.

The **se**/**so** conditions can be used to speed up a 64-bit right shift
by 1 bit, as they enable a quick differentiation of the following cases:

* a 1 will be shifted out by a 1-bit right shift,
* a 0 will be shifted out by a 1-bit right shift.
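
The case split enabled by **sh32**/**nsh32** can be sketched in Python for a 64-bit left shift built from 32-bit operations. This is a hypothetical software model (the name ``lsl64`` is not a DPU instruction), meant only to show why testing bit 5 of the shift amount separates the two cases cleanly.

```python
MASK32 = 0xFFFFFFFF


def lsl64(hi, lo, amount):
    """Shift the 64-bit value {hi, lo} left using only 32-bit operations.

    The shift amount is taken from the 6 lsb of `amount`.
    Returns the new (hi, lo) pair.
    """
    s = amount & 0x3F
    if s & 0x20:
        # sh32 case: shift by 32 or more, the low word moves into the high word.
        return (lo << (s - 32)) & MASK32, 0
    if s == 0:
        return hi, lo
    # nsh32 case: bits leaving the low word enter the high word.
    new_hi = ((hi << s) | (lo >> (32 - s))) & MASK32
    new_lo = (lo << s) & MASK32
    return new_hi, new_lo


hi, lo = lsl64(0x01234567, 0x89ABCDEF, 36)
print(hex(hi), hex(lo))
```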

4.2.7)  Bit Count Specific Conditions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* **max** (MAXimal result) true when the result is equal to:

  * 32 for the CAO instruction ( **Count All     Ones** )
  * 32 for the CLO instruction ( **Count Leading ones** )
  * 32 for the CLZ instruction ( **Count Leading Zero** )
  * 31 for the CLS instruction ( **Count Leading Sign** )

* **nmax** (Not MAXimal result) is the opposite of **max**

These conditions speed up bit counting operations that span multiple 32-bit words.
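
As a sketch of such a multi-word operation, the Python model below counts the leading zeros of a value stored as several 32-bit words. The helper names are hypothetical; the loop continues to the next word only while the per-word count is maximal, which is the role played by the **max**/**nmax** conditions.

```python
def clz32(word):
    """Count the leading zero bits of a 32-bit word (32 when the word is 0)."""
    return 32 - (word & 0xFFFFFFFF).bit_length()


def clz_multiword(words):
    """Count the leading zeros of a multi-word value, most significant word first."""
    total = 0
    for word in words:
        count = clz32(word)
        total += count
        if count != 32:       # nmax: a set bit was found, no need to go further
            break
    return total


print(clz_multiword([0, 0x00010000]))
```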

4.2.8)  8-bit Multiply Specific Conditions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* **small** : (op1[15:8] == 0) && (op2[15:8] == 0)
* **large** : opposite of small

These conditions detect the case where a 16 x 16 multiply can be reduced to a single 8 x 8 multiply.
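
The reduction can be illustrated with a Python model that builds a 16 x 16 product from 8 x 8 partial products. The sketch assumes unsigned operands and uses hypothetical helper names; it is not the DPU multiply sequence itself.

```python
def is_small(op1, op2):
    """The `small` condition: both operands have bits [15:8] equal to zero."""
    return ((op1 >> 8) & 0xFF) == 0 and ((op2 >> 8) & 0xFF) == 0


def mul16(op1, op2):
    """Multiply two unsigned 16-bit values using 8 x 8 multiplies only.

    Returns (product, number_of_8x8_multiplies_used).
    """
    a, b = op1 & 0xFFFF, op2 & 0xFFFF
    al, ah = a & 0xFF, a >> 8
    bl, bh = b & 0xFF, b >> 8
    if is_small(a, b):
        return al * bl, 1                 # a single 8 x 8 multiply is enough
    product = al * bl + ((al * bh + ah * bl) << 8) + ((ah * bh) << 16)
    return product, 4


print(mul16(200, 100))
print(mul16(0x1234, 0x00FF))
```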

5)  LDMA / LDMAI / SDMA (DMA)
=============================

5.1)  Generalities
------------------

A thread requests a DMA transfer through a DMA instruction.

* DMA transfers are 64-bit aligned,
* transfer sizes are *n* x 64-bit, where *n* ranges from 1 to 256,
* When executing a DMA instruction, the thread is suspended for the duration of the transfer,

  * the thread is temporarily absent from the pipeline,

    * it is useful to have more than 11 threads running, to compensate for the ones that
      temporarily leave the pipeline as they wait for the completion of a DMA instruction.

  * the thread **RUN** bit remaining set during this suspension.

The DMA is capable of:

* Moving **MRAM** data to **IRAM**

  * Only the 48-lsb of the 64-bit words are written into the **IRAM**, the 16-msb being discarded.

* Moving **MRAM** data to **WRAM**
* Moving **WRAM** data to **MRAM**

**Note:** **MRAM**, **WRAM**, and **IRAM** have separate address spaces.

**Note:** the **HCPU** is not capable of performing DMA operations.

5.2)  Behaviours
----------------

::

  ldma  #8, Rnx, Rp  // Load  WRAM (Rnx address) with MRAM (Rp address)
  ldmai #8, Rnx, Rp  // Load  IRAM (Rnx address) with MRAM (Rp address)
  sdma  #8, Rnx, Rp  // Store WRAM (Rnx address) into MRAM (Rp address)

Transfer size is: 1 + ((Rnx[30:24] + #8) & 0xFF), allowing transfers of 1 to 256 64-bit words.

Source and destination addresses are specified as follow: ::

  ldma, ldmai, sdma: the 32-bit MRAM byte address is {Rp [31 :3], 0b000}
  ldma, sdma       : the 24-bit WRAM byte address is {Rnx[23 :3], 0b000}
  ldmai            : the IRAM instruction address is  Rnx[p+2:3] where p is the PC width
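
The size and address rules above can be gathered into a small Python decoder. This is a sketch under stated assumptions: the name ``decode_dma`` is hypothetical, the #8 immediate is treated as an unsigned value in 0..255, and no bounds checking against the implemented memory sizes is performed.

```python
def decode_dma(rnx, rp, imm8, pc_width=12, to_iram=False):
    """Decode a DMA request.

    Returns (size_in_64bit_words, mram_byte_address, local_address), where
    local_address is a WRAM byte address for ldma/sdma and an IRAM
    instruction address for ldmai (to_iram=True).
    """
    words = 1 + ((((rnx >> 24) & 0x7F) + imm8) & 0xFF)    # Rnx[30:24] + #8
    mram_address = rp & 0xFFFFFFF8                        # {Rp[31:3], 0b000}
    if to_iram:
        local = (rnx >> 3) & ((1 << pc_width) - 1)        # Rnx[p+2:3]
    else:
        local = rnx & 0x00FFFFF8                          # {Rnx[23:3], 0b000}
    return words, mram_address, local


# ldma with Rnx[30:24] = 3 and #8 = 4: a transfer of 8 words (64 bytes).
print(decode_dma(0x03000100, 0x00001007, 4))
```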

For all DMA instructions, if the MRAM byte address is bigger than the implemented MRAM then the
instruction fails and generates a memory exception. MRAM size in first DPU implementation is **64 MB**.

For ldma and sdma, if the WRAM byte address is bigger than the implemented WRAM then the instruction
fails and generates a memory exception. WRAM size is **64 KB** in v1A, and **63488 B** in v1B.

For ldmai, if the IRAM instruction address is bigger than the implemented IRAM then the instruction
fails and generates a memory exception. IRAM size is **4K instructions** in v1A, and **3968 instructions** in v1B.

**Additional characteristics:**

* DMA instructions support neither jump conditions nor Boolean Replacement.
* DMA instructions affect no registers.

6)  Loads / Stores
==================

6.1)  Common Properties
-----------------------

* The 24-bit effective address is given by the sum of a
  24-bit displacement and the 24-lsb of the base Rnx register

  * Rnx[31:24] are ignored,
  * for most instructions the displacement is a 24-bit immediate value,
  * for some stores the 24-bit displacement is the sign extension of a 12-bit immediate value.

* The access effective address must be aligned according to the access width,
* ZF and CF flags are left unchanged,
* no condition is supported.
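
The address computation can be sketched in Python as follows. The helper names are hypothetical, the sum is assumed to wrap within 24 bits, and a Python ``ValueError`` stands in for the memory exception raised on a misaligned access.

```python
MASK24 = 0xFFFFFF


def sign_extend(value, bits, to_bits):
    """Sign-extend a `bits`-wide value to `to_bits` bits (returned unsigned)."""
    value &= (1 << bits) - 1
    if value >> (bits - 1):
        value -= 1 << bits
    return value & ((1 << to_bits) - 1)


def effective_address(rnx, disp24, access_bytes):
    """Return the 24-bit effective address of a load/store.

    Rnx[31:24] are ignored. Raises ValueError when the address is not
    aligned on the access width (1, 2, 4 or 8 bytes).
    """
    address = ((rnx & MASK24) + (disp24 & MASK24)) & MASK24
    if address % access_bytes:
        raise ValueError("misaligned access")
    return address


# Word store with a 12-bit displacement of -4 (0xFFC) from base 0x100.
disp = sign_extend(0xFFC, 12, 24)
print(hex(effective_address(0x100, disp, 4)))
```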

6.2)  Loads
-----------

::

  lbu    Xm, Rnx, disp24  // Xm is loaded with the Unsigned Byte @ Rnx + disp24   .s and .sb modifiers illegal
  lbs    Xm, Rnx, disp24  // Xm is loaded with the   signed Byte @ Rnx + disp24   .u and .ub modifiers illegal
  lhu    Xm, Rnx, disp24  // Xm is loaded with the Unsigned Half @ Rnx + disp24   .s and .sb modifiers illegal
  lhs    Xm, Rnx, disp24  // Xm is loaded with the   signed Half @ Rnx + disp24   .u and .ub modifiers illegal
  lw     Xm, Rnx, disp24  // Xm is loaded with the          Word @ Rnx + disp24
  ld     Dm, Rnx, disp24  // Dm is loaded with the Double   word @ Rnx + disp24

6.3)  Stores Register
---------------------

::

  sb     Rnx, disp24, Rp  // Rp[ 7:0] is stored @ Rnx + disp24
  sh     Rnx, disp24, Rp  // Rp[15:0] is stored @ Rnx + disp24
  sw     Rnx, disp24, Rp  // Rp       is stored @ Rnx + disp24
  sd     Rnx, disp24, Dp  // Dp       is stored @ Rnx + disp24

6.4)  Stores Immediate Value
----------------------------

::

  sb     Rnx, disp12, #8   // store                     #8    @ Rnx + sign_extend24( disp12 )
  sh     Rnx, disp12, #16  // store                     #16   @ Rnx + sign_extend24( disp12 )
  sw     Rnx, disp12, #16  // store      sign_extend32( #16 ) @ Rnx + sign_extend24( disp12 )
  sd     Rnx, disp12, #16  // store      sign_extend64( #16 ) @ Rnx + sign_extend24( disp12 )

6.5)  Stores ID ORed With Immediate Value
-----------------------------------------

::

  sb_id  Rnx, disp12, #8   // store ID |                #8    @ Rnx + sign_extend24( disp12 )
  sh_id  Rnx, disp12, #16  // store ID |                #16   @ Rnx + sign_extend24( disp12 )
  sw_id  Rnx, disp12, #16  // store ID | sign_extend32( #16 ) @ Rnx + sign_extend24( disp12 )
  sd_id  Rnx, disp12, #16  // store ID | sign_extend64( #16 ) @ Rnx + sign_extend24( disp12 )

6.6)  Endianness Modifiers
--------------------------

By default, the load/store instruction uses the little-endian memory organization. Load/store
instructions operating on 16-bit, 32-bit, or 64-bit data, may have the '.b' modifier added
to their mnemonic, forcing these instructions to use the big-endian memory organization.

For lhu, lhs and lw, use the .ub/.sb modifier to combine the .u/.s and .b modifiers.
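
The difference between the two memory organizations can be shown with a short Python sketch. The function ``store_bytes`` is a hypothetical model of the byte order produced by a store, with ``big_endian=True`` standing in for the '.b' modifier.

```python
def store_bytes(value, size, big_endian=False):
    """Return the bytes written to memory, lowest address first."""
    return value.to_bytes(size, "big" if big_endian else "little")


word = 0x11223344
print(store_bytes(word, 4).hex())                   # default organization
print(store_bytes(word, 4, big_endian=True).hex())  # with the '.b' modifier
```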

7)  Additions and Subtractions
==============================

**Addition, result = op1 + op2** ::

  add    Xmz,  Sn ,  Rp
  add    Xmz,  Rnx,  Rp
  add    Xmz,  Rnx,  Rp   , Bcc
  add    Xmz,  Rnx,  Rp   , Jcc, IRAM_address
  -----------------------------------------------
  add    Xmz,  Sn ,  #WRAM
  add    ZERO, Rn ,  #32
  add    Rm,   Rnx,  #32
  add    Dm,   Rn ,  #32
  add    ZERO, Rnx,  #27
  add    Xm,   Rnx,  #24
  -----------------------------------------------
  add    Xm,   Rnx,  #24  , Bcc
  add    ZERO, Rnx,  #27PC, Jcc, IRAM_address
  add    Xm,   Rnx,  #24PC, Jcc, IRAM_address

**Addition with Carry, result = op1 + op2 + CF** ::

  addc   Xmz,  Sn ,  Rp
  addc   Xmz,  Rnx,  Rp
  addc   Xmz,  Rnx,  Rp   , Bcc
  addc   Xmz,  Rnx,  Rp   , Jcc, IRAM_address
  -----------------------------------------------
  addc   Xmz,  Sn ,  #WRAM
  addc   ZERO, Rn ,  #32
  addc   Rm,   Rnx,  #32
  addc   ZERO, Rnx,  #27
  addc   Xm,   Rnx,  #24
  -----------------------------------------------
  addc   Xm,   Rnx,  #24  , Bcc
  addc   ZERO, Rnx,  #27PC, Jcc, IRAM_address
  addc   Xm,   Rnx,  #24PC, Jcc, IRAM_address

**Reverse subtraction, result = ~op1 + op2 + 1** ::

  rsub   Xmz,  Sn ,  Rp
  rsub   Xmz,  Rnx,  Rp
  rsub   Xmz,  Rnx,  Rp   , Bcc
  rsub   Xmz,  Rnx,  Rp   , Jcc, IRAM_address
  -----------------------------------------------
  rsub   Xmz,  Sn ,  #WRAM
  rsub   ZERO, Rn ,  #32
  rsub   Rm,   Rnx,  #32
  rsub   ZERO, Rnx,  #27
  rsub   Xm,   Rnx,  #24
  -----------------------------------------------
  rsub   Xm,   Rnx,  #24  , Bcc
  rsub   ZERO, Rnx,  #27PC, Jcc, IRAM_address
  rsub   Xm,   Rnx,  #24PC, Jcc, IRAM_address

**Reverse subtraction with Carry, result = ~op1 + op2 + CF** ::

  rsubc  Xmz,  Sn ,  Rp
  rsubc  Xmz,  Rnx,  Rp
  rsubc  Xmz,  Rnx,  Rp   , Bcc
  rsubc  Xmz,  Rnx,  Rp   , Jcc, IRAM_address
  -----------------------------------------------
  rsubc  Xmz,  Sn ,  #WRAM
  rsubc  ZERO, Rn ,  #32
  rsubc  Rm,   Rnx,  #32
  rsubc  ZERO, Rnx,  #27
  rsubc  Xm,   Rnx,  #24 
  -----------------------------------------------
  rsubc  Xm,   Rnx,  #24  , Bcc
  rsubc  ZERO, Rnx,  #27PC, Jcc, IRAM_address
  rsubc  Xm,   Rnx,  #24PC, Jcc, IRAM_address

**Subtraction, result = op1 + ~op2 + 1** ::

  sub    Xmz,  Sn ,  Rp
  sub    Xmz,  Rnx,  Rp
  sub    Xmz,  Rnx,  Rp   , Bcc
  sub    Xmz,  Rnx,  Rp   , Jcc, IRAM_address
  -------------------------------------------------------------------
  sub    Xmz,  Sn ,  #WRAM
  sub    Dmz , Rn ,  #32    // replaced with add instructions
  sub    Rm  , Rnx,  #32    // ...
  sub    ZERO, Rnx,  #27
  sub    Xm,   Rnx,  #24
  -------------------------------------------------------------------
  sub    Xm,   Rnx,  #24  , Bcc
  sub    ZERO, Rnx,  #27PC, Jcc, IRAM_address
  sub    Xm,   Rnx,  #24PC, Jcc, IRAM_address

**Subtraction with carry, result = op1 + ~op2 + CF** ::

  subc   Xmz,  Sn ,  Rp
  subc   Xmz,  Rnx,  Rp
  subc   Xmz,  Rnx,  Rp   , Bcc
  subc   Xmz,  Rnx,  Rp   , Jcc, IRAM_address
  -------------------------------------------------------------------
  subc   Xmz,  Sn ,  #WRAM
  subc   ZERO, Rn ,  #32    // replaced with addc instructions
  subc   Rm  , Rnx,  #32    // ...
  subc   ZERO, Rnx,  #27
  subc   Xm,   Rnx,  #24
  -------------------------------------------------------------------
  subc   Xm,   Rnx,  #24  , Bcc
  subc   ZERO, Rnx,  #27PC, Jcc, IRAM_address
  subc   Xm,   Rnx,  #24PC, Jcc, IRAM_address

7.1) CF update
--------------

As shown in the descriptions above, add/addc/sub/subc/rsub/rsubc use a **32-bit adder**.
These instructions update CF with the native carry 31 of this **32-bit adder**.
Another way of expressing the new CF value is: ::

  when executing a             ADD or  ADDC, CF is set to the c   condition,
  when executing a SUB, SUBC, RSUB, or RSUBC, CF is set to the geu condition.
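
The shared 32-bit adder and the resulting CF value can be modelled in Python as below. The function names mirror the mnemonics but are only an illustrative model; the 64-bit subtraction built from **sub** and **subc** is an assumed usage example.

```python
MASK32 = 0xFFFFFFFF


def adder32(a, b, carry_in):
    """32-bit adder: return (32-bit result, carry 31)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def add(op1, op2):
    return adder32(op1, op2, 0)          # op1 + op2


def addc(op1, op2, cf):
    return adder32(op1, op2, cf)         # op1 + op2 + CF


def sub(op1, op2):
    return adder32(op1, ~op2, 1)         # op1 + ~op2 + 1


def subc(op1, op2, cf):
    return adder32(op1, ~op2, cf)        # op1 + ~op2 + CF


def sub64(a, b):
    """64-bit subtraction: sub on the low words, then subc on the high words."""
    lo, cf = sub(a & MASK32, b & MASK32)
    hi, _ = subc(a >> 32, b >> 32, cf)
    return (hi << 32) | lo


print(sub(5, 3))    # CF set: 5 >= 3 (geu true)
print(sub(3, 5))    # CF cleared: 3 < 5 (geu false)
print(hex(sub64(0x100000000, 1)))
```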

7.2)  Why sub #32 is replaced with an add?
-------------------------------------------

There is no encoding for the sub instruction with #32 because the two instructions: ::

  sub Rm, Rn,  #32
  add Rm, Rn, ~#32 + 1

would be equivalent in terms of 32-bit result generated. Concerning the CF flag update setting: ::

  if #32 <> 0 then ~#32+1 generates no carry, thus sub #32 is entirely equivalent to add -#32,
  if #32 == 0 then the sub Rm, Rn, #0 instruction is encoded.

7.3)  Why subc #32 is replaced with an addc?
---------------------------------------------

There is no encoding for subc #32 as it is entirely equivalent to addc Rm, Rn, ~#32.
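
Both replacement rules can be checked with a short Python model of the 32-bit adder. The helper names are hypothetical; the sketch compares the (result, CF) pairs produced by each form and shows why #32 == 0 is the one case that needs its own **sub** encoding.

```python
MASK32 = 0xFFFFFFFF


def adder32(a, b, carry_in):
    """32-bit adder: return (32-bit result, carry 31)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def sub_imm(rn, imm):
    return adder32(rn, ~imm, 1)              # sub  Rm, Rn, #32


def add_neg_imm(rn, imm):
    return adder32(rn, ~imm + 1, 0)          # add  Rm, Rn, ~#32 + 1


def subc_imm(rn, imm, cf):
    return adder32(rn, ~imm, cf)             # subc Rm, Rn, #32


def addc_not_imm(rn, imm, cf):
    return adder32(rn, ~imm & MASK32, cf)    # addc Rm, Rn, ~#32


print(sub_imm(7, 3), add_neg_imm(7, 3))   # same result and same CF
print(sub_imm(7, 0), add_neg_imm(7, 0))   # same result, different CF
```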

7.4)  Supported Conditions
--------------------------

**add and addc** ::

  Jcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi, v, nv, c, nc, nc4, nc5, nc6, nc7, nc8, nc9, nc10, nc11, nc12, nc13.
  Bcc:    z, nz, xz, nxz.

**sub and subc** ::

  Jcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi, v, nv, ltu, geu, lts, ges, les, gts, leu, gtu, xles, xgts, xleu, xgtu.
  Bcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi, v, nv, ltu, geu, lts, ges, les, gts, leu, gtu, xles, xgts, xleu, xgtu.

**rsub and rsubc** ::

  Jcc: t, z, nz, xz, nxz, pl, mi, sz, snz, spl, smi, v, nv, ltu, geu, lts, ges, les, gts, leu, gtu, xles, xgts, xleu, xgtu.
  Bcc:    z, nz, xz, nxz.

8)  Logical instructions
========================

**AND, result = op1 & op2** ::

  AND   Xmz , Rnx, Rp
  AND   Xmz , Rnx, Rp   , Bcc
  AND   Xmz , Rnx, Rp   , Jcc, IRAM_address
  -----------------------------------------------
  AND   Rmz , Rn , #32
  AND   Dm  , Rnx, #32
  AND   ZERO, Rnx, #28
  AND   Xm  , Rnx, #24
  -----------------------------------------------
  AND   Xm  , Rnx, #24  , Bcc
  AND   ZERO, Rnx, #28PC, Jcc, IRAM_address
  AND   Xm  , Rnx, #24PC, Jcc, IRAM_address

**NAND, result = ~(op1 & op2)** ::

  NAND  Xmz , Rnx, Rp
  NAND  Xmz , Rnx, Rp   , Bcc
  NAND  Xmz , Rnx, Rp   , Jcc, IRAM_address
  -----------------------------------------------
  NAND  ZERO, Rnx, #28
  NAND  Xm  , Rnx, #24
  -----------------------------------------------
  NAND  Xm  , Rnx, #24  , Bcc
  NAND  ZERO, Rnx, #28PC, Jcc, IRAM_address
  NAND  Xm  , Rnx, #24PC, Jcc, IRAM_address

**ANDN, result = (~op1) & op2** ::

  ANDN  Xmz , Rnx, Rp
  ANDN  Xmz , Rnx, Rp   , Bcc
  ANDN  Xmz , Rnx, Rp   , Jcc, IRAM_address
  -----------------------------------------------
  ANDN  ZERO, Rnx, #28
  ANDN  Xm  , Rnx, #24
  -----------------------------------------------
  ANDN  Xm  , Rnx, #24  , Bcc
  ANDN  ZERO, Rnx, #28PC, Jcc, IRAM_address
  ANDN  Xm  , Rnx, #24PC, Jcc, IRAM_address

**OR, result = op1 | op2** ::

  OR    Xmz , Rnx, Rp
  OR    Xmz , Rnx, Rp   , Bcc
  OR    Xmz , Rnx, Rp   , Jcc, IRAM_address
  -----------------------------------------------
  OR    Dmz , Rn , #32
  OR    Rm  , Rnx, #32
  OR    ZERO, Rnx, #28
  OR    Xm  , Rnx, #24
  -----------------------------------------------
  OR    Xm  , Rnx, #24  , Bcc
  OR    ZERO, Rnx, #28PC, Jcc, IRAM_address
  OR    Xm  , Rnx, #24PC, Jcc, IRAM_address

**NOR, result = ~(op1 | op2)** ::

  NOR   Xmz , Rnx, Rp
  NOR   Xmz , Rnx, Rp   , Bcc
  NOR   Xmz , Rnx, Rp   , Jcc, IRAM_address
  -----------------------------------------------
  NOR   ZERO, Rnx, #28
  NOR   Xm  , Rnx, #24
  -----------------------------------------------
  NOR   Xm  , Rnx, #24  , Bcc
  NOR   ZERO, Rnx, #28PC, Jcc, IRAM_address
  NOR   Xm  , Rnx, #24PC, Jcc, IRAM_address

**ORN, result = (~op1) | op2** ::

  ORN   Xmz , Rnx, Rp
  ORN   Xmz , Rnx, Rp   , Bcc
  ORN   Xmz , Rnx, Rp   , Jcc, IRAM_address
  -----------------------------------------------
  ORN   ZERO, Rnx, #28
  ORN   Xm  , Rnx, #24
  -----------------------------------------------
  ORN   Xm  , Rnx, #24  , Bcc
  ORN   ZERO, Rnx, #28PC, Jcc, IRAM_address
  ORN   Xm  , Rnx, #24PC, Jcc, IRAM_address

**XOR, result = op1 ^ op2** ::

  XOR   Xmz , Rnx, Rp
  XOR   Xmz , Rnx, Rp   , Bcc
  XOR   Xmz , Rnx, Rp   , Jcc, IRAM_address
  ---------------------------------------------
  XOR   ZERO, Rn , #32
  XOR   Rm  , Rnx, #32
  XOR   ZERO, Rnx, #28
  XOR   Xm  , Rnx, #24
  -----------------------------------------------
  XOR   Xm  , Rnx, #24  , Bcc
  XOR   ZERO, Rnx, #28PC, Jcc, IRAM_address
  XOR   Xm  , Rnx, #24PC, Jcc, IRAM_address

**NXOR, result = ~(op1 ^ op2)** ::

  NXOR  Xmz , Rnx, Rp
  NXOR  Xmz , Rnx, Rp   , Bcc
  NXOR  Xmz , Rnx, Rp   , Jcc, IRAM_address
  -------------------------------------------------------------------
  NXOR  ZERO, Rn , #32  // replaced with XOR instructions
  NXOR  Rm  , Rnx, #32  // ...
  NXOR  ZERO, Rnx, #28
  NXOR  Xm  , Rnx, #24
  -------------------------------------------------------------------
  NXOR  Xm  , Rnx, #24  , Bcc
  NXOR  ZERO, Rnx, #28PC, Jcc, IRAM_address
  NXOR  Xm  , Rnx, #24PC, Jcc, IRAM_address

**Supported Conditions** ::

  Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi
  Bcc:    z, nz, xz, nxz

**Note:** Logical instructions update ZF but leave CF unchanged.
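
As a reading aid, the eight result formulas above can be modelled in Python on 32-bit
values. This is only a sketch: it assumes that op1 and op2 are the second and third
operands of the instruction, and the dictionary name LOGICAL is hypothetical ::

  MASK = 0xFFFFFFFF  # results are truncated to 32 bits

  LOGICAL = {
      "AND":  lambda op1, op2: op1 & op2,
      "NAND": lambda op1, op2: ~(op1 & op2) & MASK,
      "ANDN": lambda op1, op2: (~op1 & MASK) & op2,   # complement applies to op1
      "OR":   lambda op1, op2: op1 | op2,
      "NOR":  lambda op1, op2: ~(op1 | op2) & MASK,
      "ORN":  lambda op1, op2: (~op1 & MASK) | op2,   # complement applies to op1
      "XOR":  lambda op1, op2: op1 ^ op2,
      "NXOR": lambda op1, op2: ~(op1 ^ op2) & MASK,
  }

  # NXOR with an immediate equals XOR with the complemented immediate
  assert LOGICAL["NXOR"](0x1234, 0x00FF) == LOGICAL["XOR"](0x1234, ~0x00FF & MASK)

The last line is consistent with the NXOR #32 forms being replaced with XOR instructions.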

9)  EXTUB / EXTSB / EXTUH / EXTSH (Zero/Sign Extensions)
========================================================

The following instructions don't support the .s modifier: ::

  Extub  Xmz, Rn                     // 8-bit (Byte) to 32-bit zero (Unsigned) extension
  Extub  Xmz, Rn, Bcc                // ...
  Extub  Xmz, Rn, Jcc, IRAM_address  // ...
  ---------------------------------------------------------------------------------------
  Extuh  Xmz, Rn                     // 16-bit (Half) to 32-bit zero (Unsigned) extension
  Extuh  Xmz, Rn, Bcc                // ...
  Extuh  Xmz, Rn, Jcc, IRAM_address  // ...
  
The following instructions don't support the .u modifier: ::
  
  Extsb  Xmz, Rn                     // 8-bit (Byte) to 32-bit Signed extension
  Extsb  Xmz, Rn, Bcc                // ...
  Extsb  Xmz, Rn, Jcc, IRAM_address  // ...
  ---------------------------------------------------------------------------------------
  Extsh  Xmz, Rn                     // 16-bit (Half) to 32-bit Signed extension
  Extsh  Xmz, Rn, Bcc                // ...
  Extsh  Xmz, Rn, Jcc, IRAM_address  // ...

**Supported Conditions** ::

  Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi
  Bcc:    z, nz, xz, nxz
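
The four extensions can be sketched in Python as follows. The sketch assumes that the
source byte or half-word is taken from the least significant bits of Rn; the function
names simply mirror the mnemonics ::

  MASK = 0xFFFFFFFF

  def extub(rn):
      return rn & 0xFF                       # zero-extend bits 7:0

  def extuh(rn):
      return rn & 0xFFFF                     # zero-extend bits 15:0

  def extsb(rn):
      b = rn & 0xFF
      return (b - 0x100) & MASK if b & 0x80 else b      # replicate bit 7

  def extsh(rn):
      h = rn & 0xFFFF
      return (h - 0x10000) & MASK if h & 0x8000 else h  # replicate bit 15

For instance, with Rn = 0x12345680, extub gives 0x00000080 while extsb gives 0xFFFFFF80.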

10)  HASH
=========

These instructions don't support the .s modifier. ::

  hash Xmz, Rnx, Rp
  hash Xmz, Rnx, Rp   , Bcc
  hash Xmz, Rnx, Rp   , Jcc, IRAM_address
  -----------------------------------------------
  hash Xmz, Rnx, #24
  hash Xmz, Rnx, #24  , Bcc
  hash Xmz, Rnx, #24PC, Jcc, IRAM_address

10.1)  Hash operation
---------------------

The instruction result is given by the following table:

+------------+---------+------------------------------------+
| op2[18:17] | op2[16] |  Result                            |
+============+=========+====================================+
|     00     |    0    | Op1[6:0] ^ Op1[13: 7]              |
+            +---------+------------------------------------+
|            |    1    | Op1[6:0] ^ Op1[13: 7] ^ Op1[20:14] |
+------------+---------+------------------------------------+
|     01     |    0    | Op1[7:0] ^ Op1[15: 8]              |
+            +---------+------------------------------------+
|            |    1    | Op1[7:0] ^ Op1[15: 8] ^ Op1[23:16] |
+------------+---------+------------------------------------+
|     10     |    0    | Op1[8:0] ^ Op1[17: 9]              |
+            +---------+------------------------------------+
|            |    1    | Op1[8:0] ^ Op1[17: 9] ^ Op1[26:18] |
+------------+---------+------------------------------------+
|     11     |    0    | Op1[9:0] ^ Op1[19:10]              |
+            +---------+------------------------------------+
|            |    1    | Op1[9:0] ^ Op1[19:10] ^ Op1[29:20] |
+------------+---------+------------------------------------+

**Supported Conditions** ::

  Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi
  Bcc:    z, nz, xz, nxz
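
The table follows a regular pattern: op2[18:17] selects a field width of 7, 8, 9 or 10
bits, and op2[16] selects whether two or three consecutive fields of op1 are XORed.
A Python sketch of this reading, assuming the result is zero-extended to 32 bits
(hash_op is a hypothetical name) ::

  def hash_op(op1, op2):
      width = 7 + ((op2 >> 17) & 0x3)      # op2[18:17]: 7, 8, 9 or 10 bits
      mask = (1 << width) - 1
      result = (op1 & mask) ^ ((op1 >> width) & mask)
      if (op2 >> 16) & 1:                  # op2[16]: fold in a third field
          result ^= (op1 >> (2 * width)) & mask
      return result

  # op2[18:17] = 01, op2[16] = 1: Op1[7:0] ^ Op1[15:8] ^ Op1[23:16]
  assert hash_op(0x00123456, (0b01 << 17) | (1 << 16)) == 0x56 ^ 0x34 ^ 0x12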

11)  SATS (SATuration, Signed)
==============================

::

  sats  Xmz, Rnx
  sats  Xmz, Rnx, Bcc
  sats  Xmz, Rnx, Jcc, IRAM_address

**result** = (Rx[31] == 1) ? 0x7FFFFFFF : 0x80000000

**Supported Conditions** ::

  Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi
  Bcc:    z, nz, xz, nxz  
  
12)  Shift / Rotate
===================

The shift value is the 5-lsb of the second operand, thus the shift/rotate amount ranges
from 0 through 31: it can be the 5-lsb of an Rp register or a 5-bit immediate value.

The following table describes the Shift/Rotate instructions:

+-------+---------------------------+----------+-------+----------+
|       |       Description         |           examples          |
|       |                           +----------+-------+----------+
|       |                           | initial  | shift |  result  |
+=======+===========================+==========+=======+==========+
| ROL   | ROtate Left               | 12345678 |  4    | 23456781 |
+-------+---------------------------+----------+-------+----------+
| ROR   | ROtate Right              | 12345678 |  4    | 81234567 |
+-------+---------------------------+----------+-------+----------+
| LSL   | Logical Shift Left        | 12345678 |  4    | 23456780 |
+-------+---------------------------+----------+-------+----------+
| LSL1  | Logical Shift Left        | 12345678 |  4    | 2345678F |
|       | with 1 insertion          |          |       |          |
+-------+---------------------------+----------+-------+----------+
| LSR   | Logical Shift Right       | 12345678 |  4    | 01234567 |
+-------+---------------------------+----------+-------+----------+
| LSR1  | Logical Shift Right       | 12345678 |  4    | F1234567 |
|       | with 1 insertion          |          |       |          |
+-------+---------------------------+----------+-------+----------+
| ASR   | Arithmetic Shift Right    | 12345678 |  4    | 01234567 |
|       |                           +----------+-------+----------+
|       |                           | 89ABCDEF |  4    | F89ABCDE |
+-------+---------------------------+----------+-------+----------+
| LSLX  | LSL  eXtended. The result | 12345678 |  0    | 00000000 |
|       | is the part that would    +----------+-------+----------+
|       | be shifted out by an LSL, | 12345678 |  4    | 00000001 |
|       | its MSB being 0-filled.   +----------+-------+----------+
|       |                           | 12345678 |  28   | 01234567 |
+-------+---------------------------+----------+-------+----------+
| LSL1X | LSL1 eXtended. The result | 12345678 |  0    | FFFFFFFF |
|       | is the part that would    +----------+-------+----------+
|       | be shifted out by a LSL1, | 12345678 |  4    | FFFFFFF1 |
|       | its MSB being 1-filled.   +----------+-------+----------+
|       |                           | 12345678 |  28   | F1234567 |
+-------+---------------------------+----------+-------+----------+
| LSRX  | LSR  eXtended. The result | 12345678 |  0    | 00000000 |
|       | is the part that would    +----------+-------+----------+
|       | be shifted out by an LSR, | 12345678 |  4    | 80000000 |
|       | its LSB being 0-filled.   +----------+-------+----------+
|       |                           | 12345678 |  28   | 23456780 |
+-------+---------------------------+----------+-------+----------+  
| LSR1X | LSR1 eXtended. The result | 12345678 |  0    | FFFFFFFF |
|       | is the part that would    +----------+-------+----------+
|       | be shifted out by a LSR1, | 12345678 |  4    | 8FFFFFFF |
|       | its LSB being 1-filled.   +----------+-------+----------+
|       |                           | 12345678 |  28   | 2345678F |
+-------+---------------------------+----------+-------+----------+
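
The examples of the table (read as hexadecimal values) can be reproduced with the
following Python model. It is a sketch only: the function names mirror the mnemonics,
and the shift amount is reduced to its 5 least significant bits as described above ::

  MASK = 0xFFFFFFFF

  def rol(x, n):
      n &= 31
      return ((x << n) | (x >> (32 - n))) & MASK

  def ror(x, n):
      n &= 31
      return ((x >> n) | (x << (32 - n))) & MASK

  def lsl(x, n):
      return (x << (n & 31)) & MASK

  def lsl1(x, n):
      n &= 31
      return ((x << n) | ((1 << n) - 1)) & MASK        # ones enter from the right

  def lsr(x, n):
      return x >> (n & 31)

  def lsr1(x, n):
      n &= 31
      return (x >> n) | (MASK ^ (MASK >> n))           # ones enter from the left

  def asr(x, n):
      return lsr1(x, n) if x & 0x80000000 else lsr(x, n)

  def lslx(x, n):
      return x >> (32 - (n & 31))                      # bits shifted out by LSL

  def lsl1x(x, n):
      n &= 31
      return (x >> (32 - n)) | ((MASK << n) & MASK)    # same, upper bits 1-filled

  def lsrx(x, n):
      return (x << (32 - (n & 31))) & MASK             # bits shifted out by LSR

  def lsr1x(x, n):
      n &= 31
      return ((x << (32 - n)) & MASK) | (MASK >> n)    # same, lower bits 1-filled

  assert lslx(0x12345678, 4) == 0x00000001
  assert lsr1x(0x12345678, 28) == 0x2345678F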

All shift/rotate instructions accept the same operand combinations ::

  LSL  Xmz, Rnx, Rp
  LSL  Xmz, Rnx, Rp, Bcc
  LSL  Xmz, Rnx, Rp, Jcc, IRAM_address
  -------------------------------------
  LSL  Xmz, Rnx, #5
  LSL  Xmz, Rnx, #5, Bcc
  LSL  Xmz, Rnx, #5, Jcc, IRAM_address

**Supported Conditions** ::

  Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi, nsh32, sh32, se, so
  Bcc:    z, nz, xz, nxz

13)  Shift/Rotate & add/sub
===========================

::

  rol_add  Xmz, Rnx, Rp, #5                     // rotate left then addition 
  rol_add  Xmz, Rnx, Rp, #5, Bcc                // ...
  rol_add  Xmz, Rnx, Rp, #5, Jcc, IRAM_address  // ...
  -----------------------------------------------------------------------------
  lsr_add  Xmz, Rnx, Rp, #5                     // shift right then addition
  lsr_add  Xmz, Rnx, Rp, #5, Bcc                // ...
  lsr_add  Xmz, Rnx, Rp, #5, Jcc, IRAM_address  // ...
  -----------------------------------------------------------------------------
  lsl_add  Xmz, Rnx, Rp, #5                     // shift left then addition
  lsl_add  Xmz, Rnx, Rp, #5, Bcc                // ...
  lsl_add  Xmz, Rnx, Rp, #5, Jcc, IRAM_address  // ...
  -----------------------------------------------------------------------------
  lsl_sub  Xmz, Rnx, Rp, #5                     // shift left then subtraction
  lsl_sub  Xmz, Rnx, Rp, #5, Bcc                // ...
  lsl_sub  Xmz, Rnx, Rp, #5, Jcc, IRAM_address  // ...
  
For all these instructions, the content of Rnx is shifted or rotated by
the #5 immediate value, giving an intermediary result. This intermediary result is then
added to, or subtracted from, the Rp value to give the final result of the instruction.

**Supported Conditions** ::

  Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi
  Bcc:    z, nz, xz, nxz

**NOTE: the z, nz, xz, nxz, pl and mi CONDITIONS ARE EVALUATED AGAINST THE INTERMEDIARY RESULT, NOT AGAINST THE FINAL RESULT**
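
The difference between the two results matters when a condition is attached. The Python
sketch below models lsl_add only and returns both values; the function name and the
returned pair are illustrative assumptions ::

  MASK = 0xFFFFFFFF

  def lsl_add(rnx, rp, imm5):
      inter = (rnx << (imm5 & 31)) & MASK   # intermediary result (used by z, nz, ...)
      final = (inter + rp) & MASK           # final result (written to the destination)
      return inter, final

  inter, final = lsl_add(0x80000000, 5, 1)
  assert inter == 0    # the z condition is therefore true
  assert final == 5    # although the value written to the destination is not zero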

14)  CLZ / CLO / CLS / CAO (bit count)
======================================

These instructions don't support the .s modifier. ::

  CLZ  Xmz, Rnx                     // Count Leading Zeros
  CLZ  Xmz, Rnx, Bcc                // ...
  CLZ  Xmz, Rnx, Jcc, IRAM_address  // ...
  -------------------------------------------------------------------------------------------------
  CLO  Xmz, Rnx                     // Count Leading Ones
  CLO  Xmz, Rnx, Bcc                // ...
  CLO  Xmz, Rnx, Jcc, IRAM_address  // ...
  -------------------------------------------------------------------------------------------------
  CLS  Xmz, Rnx                     // Count Leading Sign: Indicates by how many bits the
  CLS  Xmz, Rnx, Bcc                // source operand can be left-shifted without having
  CLS  Xmz, Rnx, Jcc, IRAM_address  // its sign changed, the result being in the range 0-31.
  -------------------------------------------------------------------------------------------------
  CAO  Xmz, Rnx                     // Count All Ones: counts the number
  CAO  Xmz, Rnx, Bcc                // of ones in the source operand
  CAO  Xmz, Rnx, Jcc, IRAM_address  // ...

**Supported Conditions** ::

  Jcc: t, z, nz, xz, nxz, max, nmax, sz, nsz, spl, smi
  Bcc:    z, nz, xz, nxz

For CLS, the **max** (MAXimum) condition is true when the result is 31. For CLZ, CLO and CAO, the **max**
condition is true when the result is 32. The **nmax** condition is always the opposite of the **max** condition.
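
A Python sketch of the four counts and of the **max** condition, for 32-bit unsigned
inputs. The CLS formula (leading bits equal to the sign bit, minus one) is an
interpretation of the description above, and all names are illustrative ::

  MASK = 0xFFFFFFFF

  def clz(x):
      return 32 - x.bit_length()            # 32 when x == 0

  def clo(x):
      return clz(~x & MASK)                 # 32 when x == 0xFFFFFFFF

  def cls(x):
      leading = clo(x) if x & 0x80000000 else clz(x)
      return leading - 1                    # range 0-31

  def cao(x):
      return bin(x).count("1")

  def is_max(mnemonic, result):
      return result == (31 if mnemonic == "CLS" else 32)

  assert cls(0x00000001) == 30              # 1 << 30 keeps the sign, 1 << 31 flips it
  assert is_max("CLS", cls(0)) and is_max("CLZ", clz(0))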

15)  MUL_STEP / DIV_STEP / MOVD / SWAPD
=======================================

15.1)  mul_step
---------------

::

  mul_step  Dmz, Rnx, Dp, #5
  mul_step  Dmz, Rnx, Dp, #5, Bcc
  mul_step  Dmz, Rnx, Dp, #5, Jcc, IRAM_address

**Action performed** ::

  if (Dp[32] & 1) Dm[31: 0] = Dp[31: 0] + (Rnx << #5)  // if the destination is the ZERO register,
                  Dm[63:32] = Dp[63:32]        >> 1    // ... then no register is affected

**Supported Conditions** ::

  Jcc: t, z, nz, sz, nsz, spl, smi
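
The step can be chained into a shift-and-add multiplication. The Python sketch below is
a model under two assumptions that the listing above does not spell out: Dm[31:0] is a
plain copy of Dp[31:0] when Dp[32] is clear, and the sum is truncated to 32 bits.
The multiply helper is hypothetical and only illustrates one possible use ::

  MASK = 0xFFFFFFFF

  def mul_step(dp, rnx, imm5):
      lo, hi = dp & MASK, (dp >> 32) & MASK
      if hi & 1:                            # Dp[32] is bit 0 of the upper word
          lo = (lo + (rnx << imm5)) & MASK
      return ((hi >> 1) << 32) | lo         # upper word is always shifted right

  def multiply(a, b):
      d = b << 32                           # multiplier in Dp[63:32], sum cleared
      for imm5 in range(32):                # one step per multiplier bit
          d = mul_step(d, a, imm5)
      return d & MASK                       # low 32 bits of a * b

  assert multiply(7, 6) == 42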

15.2)  div_step
---------------

::

  div_step  Dmz, Rnx, Dp, #5
  div_step  Dmz, Rnx, Dp, #5, Bcc
  div_step  Dmz, Rnx, Dp, #5, Jcc, IRAM_address

**Action performed** ::

  if  (Dp[31: 0] >=              (Rnx << #5)){  // the comparison is unsigned
       Dm[31: 0]  =  Dp[31: 0] - (Rnx << #5);   // if the destination is the ZERO register
       Dm[63:32]  = (Dp[63:32] << 1)  | 1   ;   // ... then no register is affected
  }                                             // ...
  else Dm[63:32]  = (Dp[63:32] << 1)        ;   // ...

**Supported Conditions** ::

  Jcc: t, sz, nsz, spl, smi
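
Chaining steps with decreasing shift values gives a restoring division. The Python
sketch below assumes that Dm[31:0] is a plain copy of Dp[31:0] when the comparison
fails, and it only uses shift values for which Rnx << #5 still fits in 32 bits, since
the behaviour beyond that is not described here. The divide helper is hypothetical and
requires a non-zero divisor ::

  MASK = 0xFFFFFFFF

  def div_step(dp, rnx, imm5):
      lo, hi = dp & MASK, (dp >> 32) & MASK
      shifted = rnx << imm5                 # assumed to fit in 32 bits in this sketch
      if lo >= shifted:                     # unsigned comparison
          lo -= shifted
          hi = ((hi << 1) | 1) & MASK       # quotient bit = 1
      else:
          hi = (hi << 1) & MASK             # quotient bit = 0
      return (hi << 32) | lo

  def divide(dividend, divisor):
      d = dividend                          # remainder in Dp[31:0], quotient cleared
      top = 32 - divisor.bit_length()       # largest shift that does not overflow
      for imm5 in range(top, -1, -1):
          d = div_step(d, divisor, imm5)
      return d >> 32, d & MASK              # (quotient, remainder)

  assert divide(100, 7) == (14, 2)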

15.3)  movd
-----------

::

  movd  Dmz, Dp
  movd  Dmz, Dp, Bcc
  movd  Dmz, Dp, Jcc, IRAM_address

**result = Dp**

**Supported Conditions** ::

  Jcc:  t, sz, nsz, spl, smi

15.4)  swapd
------------

::

  swapd  Dmz, Dp
  swapd  Dmz, Dp, Bcc
  swapd  Dmz, Dp, Jcc, IRAM_address

**result = { Dp[31:0], Dp[63:32] }**

**Supported Conditions** ::

  Jcc:  t, sz, nsz, spl, smi

16)  8 x 8 Multiplications
==========================

The result of an 8 x 8 multiplication is initially 16-bit, then this 16-bit result is:

* zero-extended to 32-bit for unsigned x unsigned multiplication,
* sign-extended to 32-bit otherwise.

+-----------+-----------------------+--------------------------+--------------+
| mnemonic  |     result[15:0]      |     multiply variant     | Comment      |
+===========+=======================+==========================+==============+
| mul_ul_ul | op1[ 7:0] x op2[7: 0] |unsigned x unsigned       | .s forbidden |
+-----------+-----------------------+                          +              +
| mul_ul_uh | op1[ 7:0] x op2[15:8] |(zero-extended to 32-bit) |              |
+-----------+-----------------------+                          +              +
| mul_uh_ul | op1[15:8] x op2[ 7:0] |                          |              |
+-----------+-----------------------+                          +              +
| mul_uh_uh | op1[15:8] x op2[15:8] |                          |              |
+-----------+-----------------------+--------------------------+--------------+
| mul_sl_ul | op1[ 7:0] x op2[7: 0] |signed x unsigned         | .u forbidden |
+-----------+-----------------------+                          +              +
| mul_sl_uh | op1[ 7:0] x op2[15:8] |(sign-extended to 32-bit) |              |
+-----------+-----------------------+                          +              +
| mul_sh_ul | op1[15:8] x op2[ 7:0] |                          |              |
+-----------+-----------------------+                          +              +
| mul_sh_uh | op1[15:8] x op2[15:8] |                          |              |
+-----------+-----------------------+--------------------------+              +
| mul_sl_sl | op1[ 7:0] x op2[7: 0] |signed x signed           |              |
+-----------+-----------------------+                          +              +
| mul_sl_sh | op1[ 7:0] x op2[15:8] |(sign-extended to 32-bit) |              |
+-----------+-----------------------+                          +              +
| mul_sh_sl | op1[15:8] x op2[ 7:0] |                          |              |
+-----------+-----------------------+                          +              +
| mul_sh_sh | op1[15:8] x op2[15:8] |                          |              |
+-----------+-----------------------+--------------------------+--------------+

**syntax** ::

  mul_ul_ul  Xmz, Rnx, Rp                     // similar syntax for the other
  mul_ul_ul  Xmz, Rnx, Rp, Bcc                // 8 x 8 multiplication instructions
  mul_ul_ul  Xmz, Rnx, Rp, Jcc, IRAM_address  // ...

**Supported Conditions** ::

  Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi, ms8, nms8, mu8, nmu8
  Bcc:    z, nz, xz, nxz
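
The twelve variants differ only in which byte of each operand is selected and whether it
is read as signed. A Python sketch, in which the helper names and the two-letter
selectors ("ul", "uh", "sl", "sh", taken from the mnemonics) are illustrative ::

  MASK = 0xFFFFFFFF

  def pick_byte(op, high, signed):
      b = (op >> 8) & 0xFF if high else op & 0xFF
      return b - 0x100 if (signed and b & 0x80) else b

  def mul8(op1, op2, sel1, sel2):
      # sel[0] is "u" or "s" (signedness), sel[1] is "l" or "h" (byte selection)
      a = pick_byte(op1, sel1[1] == "h", sel1[0] == "s")
      b = pick_byte(op2, sel2[1] == "h", sel2[0] == "s")
      return (a * b) & MASK                 # the product always fits in 16 bits

  assert mul8(0x00FF, 0x0002, "ul", "ul") == 0x000001FE   # 255 x 2
  assert mul8(0x00FF, 0x0002, "sl", "ul") == 0xFFFFFFFE   # -1 x 2, sign-extended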

17)  CMPB4
==========

::

  cmpb4  Xmz, Rnx, Rp

**Functionality** ::

  result[31:24] = (Rx[31:24] == Rp[31:24]) ? 0x01 : 0x00;
  result[23:16] = (Rx[23:16] == Rp[23:16]) ? 0x01 : 0x00;
  result[15: 8] = (Rx[15: 8] == Rp[15: 8]) ? 0x01 : 0x00;
  result[ 7: 0] = (Rx[ 7: 0] == Rp[ 7: 0]) ? 0x01 : 0x00;

**Supported Conditions** ::

  Bcc:    z, nz, xz, nxz
  Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi
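
The byte-wise comparison can be modelled in Python as follows (the function name mirrors
the mnemonic; this is a sketch of the functionality listed above, not of its encoding) ::

  def cmpb4(rx, rp):
      result = 0
      for shift in (0, 8, 16, 24):
          if ((rx >> shift) & 0xFF) == ((rp >> shift) & 0xFF):
              result |= 0x01 << shift       # 0x01 in each byte lane that matches
      return result

  assert cmpb4(0x11223344, 0x11003344) == 0x01000101

One possible use is searching a word for a given byte: comparing a word against zero
yields a non-zero result exactly when the word contains a zero byte.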

18)  CALL
=========

::

  call  Xmz, Rnx, Rp
  call  Xmz, Rnx, #PC  // #PC is an immediate whose width is the one of the PC

**Functionality**

* result = current PC + 1
* The thread jumps to the IRAM address given by Rnx + Rp or by Rnx + #PC

**Note:** there is no RETURN instruction: a “CALL ZERO, Rnx” instruction is used
instead, where Rnx is the register where the return address has been previously saved.

19)  ACQUIRE / RELEASE
======================

::

  acquire  Rnx, #16
  acquire  Rnx, #16, Jcc, IRAM_address
  release  Rnx, #16
  release  Rnx, #16, Jcc, IRAM_address

**Functionality**

For both instructions, an 8-bit index *i* is calculated as follows:

* tmp[15:0] = Rnx + #16
* *i* = tmp[15:8] ^ tmp[7:0]

Then:

* for ACQUIRE: ATOMIC[ *i* ] = 1,
* for RELEASE: ATOMIC[ *i* ] = 0.

In both cases, the z/nz conditions are evaluated using the initial value of the ATOMIC[ *i* ] bit.

**Supported Jcc Conditions for ACQUIRE**: t, z, nz

**Supported Jcc Conditions for RELEASE**:       nz

**Note:** when ACQUIRE/RELEASE is used correctly, the nz condition is always true for RELEASE.
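
The index calculation and the z/nz evaluation can be sketched in Python. The class name,
the 256-entry bit array (the full range of an 8-bit index) and the treatment of #16 as
an unsigned value are assumptions made for this illustration ::

  class AtomicBits:
      def __init__(self):
          self.bits = [0] * 256             # ATOMIC[i], all initially released

      @staticmethod
      def index(rnx, imm16):
          tmp = (rnx + imm16) & 0xFFFF      # tmp[15:0]
          return (tmp >> 8) ^ (tmp & 0xFF)  # tmp[15:8] ^ tmp[7:0]

      def acquire(self, rnx, imm16):
          i = self.index(rnx, imm16)
          old, self.bits[i] = self.bits[i], 1
          return old == 0                   # z: bit was clear, lock obtained

      def release(self, rnx, imm16):
          i = self.index(rnx, imm16)
          old, self.bits[i] = self.bits[i], 0
          return old != 0                   # nz: bit was set, as expected

  locks = AtomicBits()
  assert locks.acquire(0x1234, 0)           # first acquire succeeds
  assert not locks.acquire(0x1234, 0)       # second one sees the bit already set
  assert locks.release(0x1234, 0)           # nz is true on a correct release

Because of the XOR folding, distinct values can map to the same bit: in this model
0x1234 and 0x3412 both give the index 0x26.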

20)  STOP
=========

::

  stop
  stop  t, IRAM_address  // only the t (True) condition is supported

**Functionality**

The RUN bit corresponding to the thread executing the STOP instruction is cleared,
so the thread is no longer running. If the t condition is present, the thread PC
is also set to the specified jump address.

21)  BOOT / RESUME / CLR_RUN
============================

::

  boot     Rnx, #6
  boot     Rnx, #6, Jcc, IRAM_address
  resume   Rnx, #6
  resume   Rnx, #6, Jcc, IRAM_address
  clr_run  Rnx, #6
  clr_run  Rnx, #6, Jcc, IRAM_address

**Functionality**

All three instructions first generate a 6-bit unsigned index *i*:

* tmp[13:0] = Rnx[13:0] + #6
* *i*       = tmp[13:8] ^ tmp[5:0]

21.1) CLR_RUN
-------------

clr_run simply clears the bit RUN[ *i* ]; the CLR_RUN instruction is then complete.

21.2) BOOT / RESUME
-------------------

If RUN[ *i* ] is initially set, the BOOT/RESUME instruction does nothing more.

Otherwise:

* RUN[ *i* ] is set
* if *i* < 24 then the execution of the thread *i* is resumed:

  * for BOOT instructions: at the IRAM address 0
  * for RESUME instructions: at the current value of PC[ *i* ] (the PC of the thread *i*).

21.3) Supported conditions
--------------------------

The CLR_RUN, BOOT, and RESUME instructions support the same set of conditions ::
   
  Jcc: t, z, nz, xz, nxz, pl, mi, sz, nsz, spl, smi

**Note:** the z, nz, xz, nxz conditions use the nullity/non-nullity of the initial value of the bit RUN[ *i* ].
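
The behaviour described in this chapter can be summarised by a small Python model. It is
a sketch under several assumptions: #6 is treated as unsigned, RUN is modelled as a plain
integer, the PCs as a list, and the names thread_index, start and NB_THREADS are
hypothetical ::

  NB_THREADS = 24                           # limit used in the description above

  def thread_index(rnx, imm6):
      tmp = ((rnx & 0x3FFF) + imm6) & 0x3FFF       # tmp[13:0]
      return ((tmp >> 8) & 0x3F) ^ (tmp & 0x3F)    # tmp[13:8] ^ tmp[5:0]

  def start(run, pc, rnx, imm6, boot):
      # boot=True models BOOT, boot=False models RESUME
      i = thread_index(rnx, imm6)
      was_set = (run >> i) & 1              # initial RUN[i], used by z/nz/xz/nxz
      if not was_set:
          run |= 1 << i
          if boot and i < NB_THREADS:
              pc[i] = 0                     # BOOT restarts at IRAM address 0
      return run, was_set

  def clr_run(run, rnx, imm6):
      i = thread_index(rnx, imm6)
      return run & ~(1 << i), (run >> i) & 1

  pc = [100] * NB_THREADS
  run, was_set = start(0, pc, 0, 5, boot=True)
  assert (run, was_set, pc[5]) == (0x20, 0, 0)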

22)  TIME / TIME_CFG
====================

::

  time      Xmz
  time      Xmz,      t, IRAM_address  // only the t (True) condition is allowed
  time_cfg  Xmz, Rnx
  time_cfg  Xmz, Rnx, t, IRAM_address  // only the t (True) condition is allowed

For both instructions: result = TIME[35:4] (the 32-msb of TIME)

22.1) TIME Increment Configuration
----------------------------------

**This part concerns only the time_cfg instruction**

Setting Rnx[0] clears the TIME[35:0] register. The field Rnx[2:1] is used as follows: ::

  00: keep the current increment configuration
  01: set the configuration such that TIME[35:0] is     incremented every DPU cycle
  10: set the configuration such that TIME[35:0] is     incremented every executed instruction
  11: set the configuration such that TIME[35:0] is not incremented
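
The layout of the time_cfg operand and of the returned value can be sketched in Python.
The helper names and the mode labels ("keep", "cycle", "instruction", "stop") are
hypothetical and only stand for the four encodings listed above ::

  MODES = {"keep": 0b00, "cycle": 0b01, "instruction": 0b10, "stop": 0b11}

  def time_cfg_operand(clear, mode):
      # Rnx[0] = clear request, Rnx[2:1] = increment configuration
      return (MODES[mode] << 1) | (1 if clear else 0)

  def time_result(time36):
      return (time36 >> 4) & 0xFFFFFFFF     # TIME[35:4], the 32 msb of TIME

  # clear TIME and count DPU cycles
  assert time_cfg_operand(True, "cycle") == 0b011

Since the result drops the 4 least significant bits of TIME, one unit of the returned
value corresponds to 16 increments of the counter.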

23)  NOP / BKP
==============

::

  NOP  // Does nothing
  BKP  // Does nothing besides causing a BKP exception