Ri5cy User Manual
November 2017
Revision 1.8
Andreas Traber
Michael Gautschi
Pasquale Davide Schiavone
Copyright and related rights are licensed under the Solderpad Hardware License, Version 0.51 (the “License”); you may not use
this file except in compliance with the License. You may obtain a copy of the License at http://solderpad.org/licenses/SHL-0.51.
Unless required by applicable law or agreed to in writing, software, hardware and materials distributed under this License is
distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under the License.
Document Revisions
Rev. Date Author Description
0.1 25.02.16 Andreas Traber First Draft
0.8 13.05.16 Andreas Traber Added instruction encoding
0.9 19.05.16 Michael Gautschi Typos and general corrections
1.1 12.07.16 P.D. Schiavone Removed pv.ball, and replaced with p.beqimm
1.2 14.11.16 P.D. Schiavone Added register variants of clip, addnorm, and bit
manipulation instructions
1.3 04.01.17 Michael Gautschi Fixed typos, references, foot notes and date style
1.4 08.03.17 P.D. Schiavone Updated to priv spec 1.9 and new IRQ handling
1.5 06.06.17 P.D. Schiavone General updates
1.6 03.07.17 Michael Gautschi Extended with optional FP support
1.7 12.07.17 P.D. Schiavone Revised instructions added in Rev. 1.2
1.8 08.11.17 P.D. Schiavone Add note in HW Loop
Table of Contents
1 Introduction
1.1 Supported Instruction Set
1.2 Optional Floating Point Support
1.3 ASIC Synthesis
1.4 FPGA Synthesis
1.5 Outline
2 Instruction Fetch
2.1 Protocol
3 Load-Store-Unit (LSU)
3.1 Misaligned Accesses
3.2 Protocol
3.3 Post-Incrementing Load and Store Instructions
4 Multiply-Accumulate
5 PULP ALU Extensions
6 Optional private Floating Point Unit (FPU)
6.1 FP CSR
6.2 Floating-point Performance Counters
6.3 Some hints on synthesizing the FPU
7 PULP Hardware Loop Extensions
7.1 CSR Mapping
8 Pipeline
9 Register File
9.1 Latch-based Register File
9.2 FPU Register File
10 Control and Status Registers
10.1 Machine Status (MSTATUS)
10.2 Machine Trap-Vector Base Address (MTVEC)
10.3 Machine Exception PC (MEPC)
10.4 Machine Cause (MCAUSE)
10.5 Privilege Level
10.6 MHARTID/UHARTID
11 Performance Counters
11.1 Performance Counter Mode Register (PCMR)
11.2 Performance Counter Event Register (PCER)
11.3 Performance Counter Counter Register (PCCR0-31)
1 Introduction
RI5CY is a 4-stage in-order 32-bit RISC-V processor core. The ISA of RI5CY was extended to support multiple
additional instructions including hardware loops, post-increment load and store instructions and additional ALU
instructions that are not part of the standard RISC-V ISA.
• Full support for RV32M Integer Multiplication and Division Instruction Set Extension
• Optional full support for RV32F Single Precision Floating Point Extensions
1.5 Outline
This document describes the functionality of the RI5CY core in more detail. First, the instruction and
data interfaces are explained in Chapters 2 and 3. The multiplier and the ALU are then explained in
Chapters 4 and 5. Chapter 7 focuses on the hardware loop extensions and Chapter 9 explains the register
file. Control and status registers are explained in Chapter 10, and Chapter 11 gives an overview of all
performance counters. Chapter 12 deals with exceptions and interrupts, and Chapter 13 summarizes the
accessible debug registers. Finally, Chapter 14 gives an overview of all instruction extensions, their encodings
and meanings.
2 Instruction Fetch
The instruction fetcher of the core is able to supply one instruction to the ID stage per cycle if the instruction cache or
the instruction memory is able to serve one instruction per cycle. The instruction address must be half-word-aligned
due to the support of compressed instructions. It is not possible to jump to instruction addresses that have the LSB
set.
For optimal performance and timing closure reasons, a prefetcher is used which fetches instructions from the
instruction memory or instruction cache.
There are two prefetch flavors available:
• 32-Bit word prefetcher. It stores the fetched words in a FIFO with three entries.
• 128-Bit cache line prefetcher. It stores one 128-bit wide cache line plus 32 bits to allow for cross-cache line
misaligned instructions.
Table 1 describes the signals that are used to fetch instructions. This interface is a simplified version of the interface used by
the LSU, which is described in Chapter 3. The difference is that no writes are possible and thus it needs fewer signals.
2.1 Protocol
The protocol used to communicate with the instruction cache or the instruction memory is the same as the protocol
used by the LSU. See the description of the LSU in Chapter 3.2 for details about the protocol.
3 Load-Store-Unit (LSU)
The LSU of the core takes care of accessing the data memory. Loads and stores on words (32 bit), half words (16 bit)
and bytes (8 bit) are supported.
Table 2 describes the signals that are used by the LSU.
3.2 Protocol
The protocol that is used by the LSU to communicate with a memory works as follows:
The LSU provides a valid address in data_addr_o and sets data_req_o high. The memory then answers with a
data_gnt_i set high as soon as it is ready to serve the request. This may happen in the same cycle as the request was
sent or any number of cycles later. After a grant was received, the address may be changed in the next cycle by the
LSU. In addition, the data_wdata_o, data_we_o and data_be_o signals may be changed as it is assumed that the
memory has already processed and stored that information. After the grant, the memory answers with a
data_rvalid_i set high if data_rdata_i is valid. This may happen one or more cycles after the grant has been received.
Note that data_rvalid_i must also be set when a write was performed, although the data_rdata_i has no meaning in
this case.
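The handshake described above can be sketched as a small cycle-counting model. This is only an illustration in Python, not RTL from the core: the class SimpleMemory, the function lsu_access and the two delay parameters are hypothetical names introduced here; only the signal names in the comments come from the text.

```python
class SimpleMemory:
    """Hypothetical memory model used only for this sketch."""

    def __init__(self, gnt_delay=0, rvalid_delay=1):
        if gnt_delay < 0 or rvalid_delay < 1:
            raise ValueError("grant delay >= 0, rvalid at least 1 cycle after grant")
        self.gnt_delay = gnt_delay        # cycles until data_gnt_i goes high
        self.rvalid_delay = rvalid_delay  # cycles from grant to data_rvalid_i
        self.words = {}                   # word-addressed storage


def lsu_access(mem, addr, we=False, wdata=0):
    """Run one transaction; return (rdata, grant_cycle, rvalid_cycle)."""
    # Request phase: data_req_o stays high with a stable data_addr_o
    # until data_gnt_i is seen, possibly in the same cycle (delay 0).
    grant_cycle = mem.gnt_delay
    if we:
        # The memory has captured data_wdata_o/data_be_o at the grant.
        mem.words[addr] = wdata & 0xFFFFFFFF
    # Response phase: data_rvalid_i is raised for reads and writes alike.
    rvalid_cycle = grant_cycle + mem.rvalid_delay
    rdata = None if we else mem.words.get(addr, 0)  # no meaning for writes
    return rdata, grant_cycle, rvalid_cycle
```

With gnt_delay=0 and rvalid_delay=1 the model corresponds to a memory that grants in the request cycle and answers one cycle later.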
4 Multiply-Accumulate
RI5CY uses a single-cycle 32-bit x 32-bit multiplier with a 32-bit result. All instructions of the RISC-V M instruction set
extension are supported.
The multiplications with upper-word result (MSP of 32-bit x 32-bit multiplication) take 4 cycles to compute. The
division and remainder instructions take between 2 and 32 cycles. The number of cycles depends on the operand
values.
Additionally, RI5CY supports non-standard extensions for multiply-accumulate and half-word multiplications
with an optional post-multiplication shift.
6.1 FP CSR
When using floating-point extensions, the standard specifies a floating-point control and status register (fcsr)
which contains the exceptions that occurred since it was last reset and the rounding mode. fflags and frm
can be accessed directly or through fcsr, which is mapped to those two registers.
Since RI5CY includes an iterative div/sqrt unit, its precision and latency can be controlled through a custom CSR
(fprec). This allows faster division / square-root operations at lower precision. By default, the
single-precision equivalents are computed with a latency of 8 cycles.
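The relation between fcsr, frm and fflags can be illustrated with a short sketch. The bit positions used here (fflags in fcsr[4:0], frm in fcsr[7:5]) follow the standard RISC-V F extension and are not listed in this manual; the function names are hypothetical.

```python
FFLAGS_MASK = 0x1F  # fcsr[4:0]: accrued exception flags (standard RISC-V layout)
FRM_SHIFT = 5       # fcsr[7:5]: rounding mode (standard RISC-V layout)
FRM_MASK = 0x7


def split_fcsr(fcsr):
    """Return (frm, fflags) as seen through the two separate CSRs."""
    return (fcsr >> FRM_SHIFT) & FRM_MASK, fcsr & FFLAGS_MASK


def join_fcsr(frm, fflags):
    """Build the fcsr value from the two fields."""
    return ((frm & FRM_MASK) << FRM_SHIFT) | (fflags & FFLAGS_MASK)
```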
A hardware loop is defined by its start address (pointing to the first instruction in the loop), its end address (pointing to
the instruction that will be executed last in the loop) and a counter that is decremented every time the loop body is
executed. RI5CY contains two hardware loop register sets to support nested hardware loops; each of them can store
these three values in separate flip flops which are mapped in the CSR address space.
If the end address of the two hardware loops is identical, loop 0 has higher priority and only the loop counter for
hardware loop 0 is decremented. As soon as the counter of loop 0 reaches 1 at an end address, meaning it is
decremented to 0 now, loop 1 gets active too. In this case, both counters will be decremented and the core jumps to
the start of loop 1.
In order to use hardware loops, the compiler needs to set up the loop beforehand with the hardware loop setup
instructions described in Section 14.2. Note
that the minimum loop size is two instructions and the last instruction cannot be any jump or branch instruction.
For debugging and context switches, the hardware loop registers are mapped into the CSR address space and thus it
is possible to read and write them via csrr and csrw instructions. Since hardware loop registers could be overwritten
when processing interrupts, the registers have to be saved in the interrupt routine together with the general purpose
registers.
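The loop-end behaviour described in this chapter can be sketched as a next-PC function. This is a simplified model under stated assumptions, not the core's implementation: the function name hwloop_step and the dictionary keys are hypothetical, a loop with count 0 is treated as inactive, and the fall-through address is taken to be pc + 4 (an uncompressed last instruction).

```python
def hwloop_step(pc, loops, insn_len=4):
    """Return the PC that follows the instruction at `pc`.

    `loops` is [loop0, loop1]; each entry is a dict with the keys
    'start', 'end' and 'count'. Counters are updated in place.
    """
    for loop in loops:  # loop 0 is checked first, so it has priority
        if loop["count"] > 0 and pc == loop["end"]:
            loop["count"] -= 1
            if loop["count"] > 0:
                return loop["start"]  # iterations remain: jump back
            # The counter just reached 0: a loop with the same end address
            # and lower priority becomes active, so keep scanning.
    return pc + insn_len
```

When both loops share an end address and loop 0 holds the value 1, the call decrements both counters and returns the start address of loop 1, matching the priority rule in the text.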
8 Pipeline
RI5CY has a fully independent pipeline, meaning that whenever possible data will propagate through the pipeline and
therefore does not suffer from any unneeded stalls.
The pipeline design is easily extendable to incorporate out-of-order completion. E.g., it would be possible to complete
an instruction that only needs the EX stage before the WB stage, that is currently blocked waiting for an rvalid, is
ready. Currently this is not done in RI5CY, but might be added in the future.
Figure 5 shows the relevant control signals for the pipeline operation. The main control signals, the ready signals of
each pipeline stage, are propagating from right to left. Each pipeline stage has two control inputs: an enable and a
clear. The enable activates the pipeline stage and the core moves forward by one instruction. The clear removes the
instruction from the pipeline stage as it is completed. Every pipeline stage is cleared if the ready coming from the
stage to the right is high, and the valid signal of the stage is low. If the valid signal is high, it is enabled.
Every pipeline stage is independent of its left neighbor, meaning that it can finish its execution no matter if a stage to
its left is currently stalled or not. On the other hand, an instruction can only propagate to the next stage if the stage to
its right is ready to receive a new instruction. This means that in order to process an instruction in a stage, its own
stage needs to be ready and so does its right neighbor.
9 Register File
RI5CY has 31 × 32-bit wide registers which form registers x1 to x31. Register x0 is statically bound to 0 and can only
be read; it does not contain any sequential logic.
While the latch-based register file is recommended for ASICs, the flip-flop based register file is recommended for
FPGA synthesis, although both are compatible with either synthesis target. Note the flip-flop based register file is
significantly larger than the latch-based register-file for an ASIC implementation.
It is assumed that there is a clock gating cell for the target technology that is wrapped in a module called
cluster_clock_gating and has the following ports:
• clk_i: Clock Input
• en_i: Clock Enable Input
• test_en_i: Test Enable Input (activates the clock even though en_i is not set)
• clk_o: Gated Clock Output
MIE
Detailed:
Bit # R/W Description
12:11 R MPP: Statically 2’b11 and cannot be altered (read-only).
7 R/W Previous Interrupt Enable: When an exception is encountered, MPIE will be set to IE.
When the mret instruction is executed, the value of MPIE will be stored to IE.
When an exception is encountered, the core jumps to the corresponding handler using the content of the MTVEC as
base address. It is a read-only register which contains the boot address.
Table 8: MTVEC
When an exception is encountered, the current program counter is saved in MEPC, and the core jumps to the
exception address. When an mret instruction is executed, the value from MEPC replaces the current program counter.
MCAUSE register fields: Interrupt (bit 31), Exception Code (bits 4:0).
Detailed:
Bit # R/W Description
31 R Interrupt: This bit is set when the exception was triggered by an interrupt.
4:0 R Exception Code
Table 6: MCAUSE
10.6 MHARTID/UHARTID
CSR Address: 0xF14/0x014
Reset Value: Defined
31 10 5 4 3 0
Cluster ID Core ID
Detailed:
Bit # R/W Description
10:5 R Cluster ID: ID of the cluster
3:0 R Core ID: ID of the core within the cluster
Table 8: MHARTID
11 Performance Counters
Performance counters in RI5CY are placed inside the Control and Status Registers and can be accessed with csrr
and csrw instructions. See Table 9.1 for the address map of the performance counter registers.
PCMR register fields: Global Enable (bit 1), Saturation (bit 0).
Detailed:
Bit # R/W Description
1 R/W Global Enable: Activate/deactivate all performance counters. If this bit is 0, all
performance counters are disabled. After reset, this bit is set.
0 R/W Saturation: If this bit is set, saturating arithmetic is used in the performance
counters. After reset, this bit is set.
Table 9: PCMR
PCER register event fields (one enable bit per event): COMP_INSTR, LD_EXT_CYC, ST_EXT_CYC, TCDM_CONT, BRANCH_TAKEN, JMP_STALL, LD_STALL, FP_CONT, FP_TYPE, BRANCH, CYCLES, FP_DEP, LD_EXT, ST_EXT, FP_WB, INSTR, JUMP, IMISS, LD, ST.
Detailed:
Bit # R/W Description
16 R/W TCDM_CONT
15 R/W ST_EXT_CYC
14 R/W LD_EXT_CYC
Each bit in the PCER register controls one performance counter. If the bit is 1, the counter is enabled and starts
counting events. If it is 0, the counter is disabled and its value won’t change.
In the ASIC there is only one counter register, thus all counter events are masked by PCER and ORed together, i.e. if
one of the enabled events happens, the counter will be increased. If multiple non-masked events happen at the same
time, the counter will only be increased by one.
In order to be able to count separate events on the ASIC, the program can be executed in a loop with different events
configured.
In the FPGA or RTL simulation version, each event has its own counter and can be accessed separately.
PCCR registers support both saturating and wrap-around arithmetic. This is controlled by the saturation bit in PCMR.
In the FPGA, RTL simulation and Virtual-Platform there are individual counters for each event type, i.e. PCCR0-30
each represent a separate register. To save area in the ASIC, there is only one counter and one counter register.
Accessing PCCR0-30 will access the same counter register in the ASIC. Reading/writing from/to PCCR31 in the ASIC
will access the same register as PCCR0-30.
Figure 6 shows how events are first masked with the PCER register and then ORed together to increase the one
performance counter PCCR.
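The masking, OR-reduction and saturation behaviour of the single ASIC counter can be sketched as follows. This is an illustrative model only: the function name asic_counter_step is hypothetical, a 32-bit counter width is assumed, and the Global Enable bit of PCMR is not modelled.

```python
PCCR_MAX = 0xFFFFFFFF  # assumption: the counter register is 32 bits wide


def asic_counter_step(pccr, events, pcer, saturate):
    """Return the counter value after one cycle.

    `events` and `pcer` are bit masks with one bit per event type;
    `saturate` mirrors the Saturation bit of PCMR.
    """
    if (events & pcer) == 0:
        return pccr  # no enabled event fired in this cycle
    if pccr == PCCR_MAX:
        return PCCR_MAX if saturate else 0  # saturate or wrap around
    return pccr + 1  # simultaneous events still count as one
```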
Address Description
0x00-0x7C Interrupts 0 - 31
0x80 Reset
0x84 Illegal Instruction
0x88 ECALL Instruction Executed
Table 12: Interrupt/Exception Offset Vector Table
The base address of the interrupt vector table is given by the boot address. The most significant 3 bytes of the boot
address given to the core are used for the first instruction fetch of the core and as the basis of the interrupt vector
table. The core starts fetching at the address made by concatenating the most significant 3 bytes of the boot address
and the reset value (0x80) as the least significant byte. The boot address can be changed after the first instruction
was fetched to change the interrupt vector table address. It is assumed that the boot address is supplied via a register
to avoid long paths to the instruction fetch unit.
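The address arithmetic described above can be written out as a short sketch. The function names are hypothetical; the offsets and the 4-byte spacing of the interrupt entries are taken from the offset table (0x00-0x7C for interrupts 0 - 31).

```python
def vector_base(boot_addr):
    """Keep the most significant 3 bytes of the boot address."""
    return boot_addr & 0xFFFFFF00


def interrupt_vector(boot_addr, irq_id):
    """Address of the table entry for interrupt `irq_id` (0..31)."""
    if not 0 <= irq_id <= 31:
        raise ValueError("interrupt id must be in 0..31")
    return vector_base(boot_addr) | (irq_id * 4)


def reset_vector(boot_addr):
    """Address of the first instruction fetched after reset (offset 0x80)."""
    return vector_base(boot_addr) | 0x80
```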
12.1 Interrupts
Interrupts can only be enabled/disabled on a global basis and not individually. It is assumed that there is an
event/interrupt controller outside of the core that performs masking and buffering of the interrupt lines. The global
interrupt enable is done via the CSR register MSTATUS.
Multiple interrupt requests are assumed to be handled by the event/interrupt controller. When an interrupt is taken, the
core gives an acknowledge signal to the event/interrupt controller as well as the ID of the interrupt taken.
12.2 Exceptions
The illegal instruction exception and ecall instruction exceptions cannot be disabled and are always active.
12.3 Handling
RI5CY does support nested interrupt/exception handling. Exceptions inside interrupt/exception handlers cause
another exception; thus exceptions during the critical part of your exception
handlers, i.e. before having saved the MEPC and MSTATUS registers, will cause those registers to be overwritten.
Interrupts during interrupt/exception handlers are disabled by default, but can be explicitly enabled if desired.
Upon executing an mret instruction, the core jumps to the program counter saved in the CSR register MEPC and
restores the MPIE value of the register MSTATUS to IE. When entering an interrupt/exception handler, the core sets
MEPC to the current program counter and saves the current value of MIE in MPIE of the MSTATUS register.
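The MSTATUS and MEPC updates on handler entry and on mret can be sketched as follows. This is a minimal model with hypothetical function names. MPIE at bit 7 is taken from the MSTATUS table, MIE at bit 3 is the standard RISC-V position, and clearing MIE on entry is an assumption derived from the statement that interrupts are disabled by default inside handlers. Any update of MPIE on mret is not modelled.

```python
MIE_BIT = 3   # standard RISC-V position of MSTATUS.MIE (assumption here)
MPIE_BIT = 7  # MSTATUS.MPIE, bit 7 as in the MSTATUS table


def trap_enter(mstatus, pc):
    """Return (new_mstatus, mepc) when a handler is entered."""
    mie = (mstatus >> MIE_BIT) & 1
    mstatus &= ~(1 << MPIE_BIT)
    mstatus |= mie << MPIE_BIT   # MPIE <- MIE
    mstatus &= ~(1 << MIE_BIT)   # assumption: interrupts off in the handler
    return mstatus, pc           # MEPC <- current program counter


def mret(mstatus, mepc):
    """Return (new_mstatus, next_pc) when mret is executed."""
    mpie = (mstatus >> MPIE_BIT) & 1
    mstatus &= ~(1 << MIE_BIT)
    mstatus |= mpie << MIE_BIT   # MIE <- MPIE
    return mstatus, mepc         # execution continues at MEPC
```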
13 Debug Unit
13.1 Address Map
Addresses are intended for a bus system with 32-bit wide words.
FPR get more address space than GPR because they can be 64-bit wide even in a 32-bit system.
Addresses have to be aligned to word-boundaries.
Bit layout: HALT at bit 16, SSTE at bit 0; all other bits are reserved.
Detailed:
Bit # R/W Description
16 W1 HALT: When 1 is written, the core enters debug mode; when 0 is written, the core exits debug
mode.
When read, 1 means the core is in debug mode.
0 R/W SSTE: Single-step enable
Table 15: DBG_CTRL register
Bit layout: SLEEP at bit 16, SSTH at bit 0; all other bits are reserved.
Detailed:
Bit # R/W Description
16 R SLEEP: Set when the core is in a sleeping state and waits for an event
0 R/W SSTH: Single-step hit, sticky bit that must be cleared by the external debugger
Table 16: DBG_HIT register
Bit layout: bits 31:16 are to be defined; ECALL is at bit 11; SAF, SAM, LAF, LAM, BP, ILL, IAF and IAM occupy bits 7 down to 0; all other bits are reserved.
Detailed:
Bit # R/W Description
11 R/W ECALL: Environment call from M-Mode
7 R/W SAF: Store Access Fault (together with LAF)
6 R/W SAM: Store Address Misaligned (never traps)
5 R/W LAF: Load Access Fault (together with SAF)
4 R/W LAM: Load Address Misaligned (never traps)
3 R/W BP: EBREAK instruction causes trap
2 R/W ILL: Illegal Instruction
1 R/W IAF: Instruction Access Fault (not implemented)
0 R/W IAM: Instruction Address Misaligned (never traps)
Table 17: DBG_IE register
Bit layout: IRQ at bit 31, CAUSE at bits 4:0; all other bits are reserved.
Detailed:
Bit # R/W Description
31 R IRQ: Interrupt caused us to enter debug mode
4:0 R CAUSE: Exception/interrupt number
Table 18: DBG_CAUSE register
Bit layout: IMPL at bit 0 (reads as 0); all other bits are reserved.
Detailed:
Bit # R/W Description
0 R IMPL: RI5CY does not implement hardware breakpoints. Always read as 0.
Table 19: DBG_BPCTRLx register
Bit layout: NPC occupies bits 31:0.
Detailed:
Bit # R/W Description
31:0 R/W NPC: Next PC to be executed
Table 20: DBG_NPC register
Bit layout: PPC occupies bits 31:0.
Detailed:
Bit # R/W Description
31:0 R PPC: Previous PC, already executed
Table 21: DBG_PPC register
13.4 Interface
debug_halted_o, debug_halt_i and debug_resume_i are intended for cross-triggering between multiple
cores. They are not required for single-core debug, thus debug_halt_i and debug_resume_i can be tied to 0.
debug_halt_i and debug_resume_i should be high for only a single cycle to avoid deadlock issues.
Mnemonic Description
Register-Immediate Loads with Post-Increment
p.lb rD, Imm(rs1!) rD = Sext(Mem8(rs1))
rs1 += Imm[11:0]
p.lbu rD, Imm(rs1!) rD = Zext(Mem8(rs1))
rs1 += Imm[11:0]
p.lh rD, Imm(rs1!) rD = Sext(Mem16(rs1))
rs1 += Imm[11:0]
p.lhu rD, Imm(rs1!) rD = Zext(Mem16(rs1))
rs1 += Imm[11:0]
p.lw rD, Imm(rs1!) rD = Mem32(rs1)
rs1 += Imm[11:0]
Register-Register Loads with Post-Increment
p.lb rD, rs2(rs1!) rD = Sext(Mem8(rs1))
rs1 += rs2
p.lbu rD, rs2(rs1!) rD = Zext(Mem8(rs1))
rs1 += rs2
p.lh rD, rs2(rs1!) rD = Sext(Mem16(rs1))
rs1 += rs2
p.lhu rD, rs2(rs1!) rD = Zext(Mem16(rs1))
rs1 += rs2
p.lw rD, rs2(rs1!) rD = Mem32(rs1)
rs1 += rs2
Register-Register Loads
p.lb rD, rs2(rs1) rD = Sext(Mem8(rs1 + rs2))
p.lbu rD, rs2(rs1) rD = Zext(Mem8(rs1 + rs2))
p.lh rD, rs2(rs1) rD = Sext(Mem16(rs1 + rs2))
p.lhu rD, rs2(rs1) rD = Zext(Mem16(rs1 + rs2))
p.lw rD, rs2(rs1) rD = Mem32(rs1 + rs2)
Mnemonic Description
Register-Immediate Stores with Post-Increment
p.sb rs2, Imm(rs1!) Mem8(rs1) = rs2
rs1 += Imm[11:0]
p.sh rs2, Imm(rs1!) Mem16(rs1) = rs2
rs1 += Imm[11:0]
p.sw rs2, Imm(rs1!) Mem32(rs1) = rs2
rs1 += Imm[11:0]
Register-Register Stores with Post-Increment
p.sb rs2, rs3(rs1!) Mem8(rs1) = rs2
rs1 += rs3
p.sh rs2, rs3(rs1!) Mem16(rs1) = rs2
rs1 += rs3
p.sw rs2, rs3(rs1!) Mem32(rs1) = rs2
rs1 += rs3
Register-Register Stores
p.sb rs2, rs3(rs1) Mem8(rs1 + rs3) = rs2
p.sh rs2, rs3(rs1) Mem16(rs1 + rs3) = rs2
p.sw rs2, rs3(rs1) Mem32(rs1 + rs3) = rs2
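The post-increment semantics in the tables above can be illustrated with a small register and memory model. This is a sketch only, with hypothetical helper names. It assumes that the 12-bit immediate is sign-extended, that memory is little-endian as usual for RISC-V, and that rD and rs1 are different registers.

```python
MASK32 = 0xFFFFFFFF


def sext(value, bits):
    """Sign-extend the low `bits` bits of `value` to a Python int."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def p_lb_postinc(regs, mem, rd, rs1, imm12):
    """p.lb rD, Imm(rs1!): rD = Sext(Mem8(rs1)); rs1 += Imm[11:0]."""
    addr = regs[rs1]
    regs[rd] = sext(mem[addr], 8) & MASK32
    regs[rs1] = (addr + sext(imm12, 12)) & MASK32


def p_sw_postinc(regs, mem, rs2, rs1, imm12):
    """p.sw rs2, Imm(rs1!): Mem32(rs1) = rs2; rs1 += Imm[11:0]."""
    addr = regs[rs1]
    mem[addr:addr + 4] = (regs[rs2] & MASK32).to_bytes(4, "little")
    regs[rs1] = (addr + sext(imm12, 12)) & MASK32
```

Here regs is a list of 32 integers and mem is a bytearray. Note that the access uses the old value of rs1; the increment is applied afterwards.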
14.1.2 Encoding
31 20 19 15 14 12 11 7 6 0
imm[11:0] rs1 funct3 rd opcode
offset base 000 dest 000 1011 p.lb rD, Imm(rs1!)
offset base 100 dest 000 1011 p.lbu rD, Imm(rs1!)
offset base 001 dest 000 1011 p.lh rD, Imm(rs1!)
offset base 101 dest 000 1011 p.lhu rD, Imm(rs1!)
31 25 24 20 19 15 14 12 11 7 6 0
funct7 rs2 rs1 funct3 rd opcode
000 0000 offset base 111 dest 000 1011 p.lb rD, rs2(rs1!)
010 0000 offset base 111 dest 000 1011 p.lbu rD, rs2(rs1!)
000 1000 offset base 111 dest 000 1011 p.lh rD, rs2(rs1!)
010 1000 offset base 111 dest 000 1011 p.lhu rD, rs2(rs1!)
001 0000 offset base 111 dest 000 1011 p.lw rD, rs2(rs1!)
31 25 24 20 19 15 14 12 11 7 6 0
funct7 rs2 rs1 funct3 rd opcode
000 0000 offset base 111 dest 000 0011 p.lb rD, rs2(rs1)
010 0000 offset base 111 dest 000 0011 p.lbu rD, rs2(rs1)
000 1000 offset base 111 dest 000 0011 p.lh rD, rs2(rs1)
010 1000 offset base 111 dest 000 0011 p.lhu rD, rs2(rs1)
001 0000 offset base 111 dest 000 0011 p.lw rD, rs2(rs1)
31 25 24 20 19 15 14 12 11 7 6 0
imm[11:5] rs2 rs1 funct3 imm[4:0] opcode
offset[11:5] src base 000 offset[4:0] 010 1011 p.sb rs2, Imm(rs1!)
offset[11:5] src base 001 offset[4:0] 010 1011 p.sh rs2, Imm(rs1!)
offset[11:5] src base 010 offset[4:0] 010 1011 p.sw rs2, Imm(rs1!)
31 25 24 20 19 15 14 12 11 7 6 0
funct7 rs2 rs1 funct3 rs3 opcode
000 0000 src base 100 offset 010 1011 p.sb rs2, rs3(rs1!)
000 0000 src base 101 offset 010 1011 p.sh rs2, rs3(rs1!)
000 0000 src base 110 offset 010 1011 p.sw rs2, rs3(rs1!)
31 25 24 20 19 15 14 12 11 7 6 0
funct7 rs2 rs1 funct3 rs3 opcode
000 0000 src base 100 offset 010 0011 p.sb rs2, rs3(rs1)
000 0000 src base 101 offset 010 0011 p.sh rs2, rs3(rs1)
000 0000 src base 110 offset 010 0011 p.sw rs2, rs3(rs1)
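As an illustration of how the encoding tables are read, the register-immediate post-increment load rows listed above (p.lb, p.lbu, p.lh, p.lhu) can be assembled with a few shifts. The function and table names are hypothetical; the field positions, funct3 values and opcode are copied from the rows above.

```python
# funct3 values from the register-immediate post-increment load rows
POSTINC_LOAD_FUNCT3 = {"p.lb": 0b000, "p.lbu": 0b100, "p.lh": 0b001, "p.lhu": 0b101}
OPCODE_POSTINC_LOAD = 0b0001011  # "000 1011" in the table


def encode_postinc_load(mnemonic, rd, rs1, imm12):
    """Return the 32-bit instruction word for e.g. p.lb rD, Imm(rs1!)."""
    funct3 = POSTINC_LOAD_FUNCT3[mnemonic]
    return (
        ((imm12 & 0xFFF) << 20)  # imm[11:0] in bits 31:20
        | ((rs1 & 0x1F) << 15)   # base register in bits 19:15
        | (funct3 << 12)         # funct3 in bits 14:12
        | ((rd & 0x1F) << 7)     # destination register in bits 11:7
        | OPCODE_POSTINC_LOAD    # opcode in bits 6:0
    )
```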
14.2.1 Operations
Mnemonic Description
Long Hardware Loop Setup instructions
lp.starti L, uimmL lpstart[L] = PC + (uimmL << 1)
lp.endi L, uimmL lpend[L] = PC + (uimmL << 1)
lp.count L, rs1 lpcount[L] = rs1
lp.counti L, uimmL lpcount[L] = uimmL
Short Hardware Loop Setup Instructions
lp.setup L, rs1, uimmL lpstart[L] = pc + 4
lpend[L] = pc + (uimmL << 1)
lpcount[L] = rs1
lp.setupi L, uimmS, uimmL lpstart[L] = pc + 4
lpend[L] = pc + (uimmS << 1)
lpcount[L] = uimmL
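The short setup instructions compute all three loop registers from the PC of the setup instruction itself. A minimal sketch of that arithmetic, with hypothetical function names and dictionary keys:

```python
def lp_setup(pc, rs1_value, uimmL):
    """lp.setup L, rs1, uimmL, executed at address `pc`."""
    return {
        "start": pc + 4,                     # first instruction after the setup
        "end": pc + ((uimmL & 0xFFF) << 1),  # offset counted in half-words
        "count": rs1_value,
    }


def lp_setupi(pc, uimmS, uimmL):
    """lp.setupi L, uimmS, uimmL, executed at address `pc`."""
    return {
        "start": pc + 4,
        "end": pc + ((uimmS & 0x1F) << 1),   # 5-bit short immediate
        "count": uimmL & 0xFFF,
    }
```

For example, lp.setup at address 0x100 with uimmL = 6 yields a loop body from 0x104 to 0x10C.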
14.2.2 Encoding
31 20 19 15 14 12 11 8 7 6 0
uimmL[11:0] rs1 funct3 0000 L opcode
uimmL[11:0] 00000 000 0000 L 111 1011 lp.starti L, uimmL
uimmL[11:0] 00000 001 0000 L 111 1011 lp.endi L, uimmL
0000 0000 0000 src1 010 0000 L 111 1011 lp.count L, rs1
uimmL[11:0] 00000 011 0000 L 111 1011 lp.counti L, uimmL
uimmL[11:0] src1 100 0000 L 111 1011 lp.setup L, rs1, uimmL
uimmL[11:0] uimmS[4:0] 101 0000 L 111 1011 lp.setupi L, uimmS, uimmL
14.3 ALU
The ALU extensions are split into several subgroups that belong together.
• Bit manipulation instructions are useful to work on single bits or groups of bits within a word, see
Section 14.3.1.
• General ALU instructions try to fuse commonly used sequences into a single instruction and thus
increase the performance of small kernels that use those sequences, see Section 14.3.3.
• Immediate branching instructions are useful to compare a register with an immediate value before
deciding whether to take a branch, see Section 14.3.5.
Mnemonic Description
p.fl1 rD, rs1 rD = bit position of the last bit set in rs1, starting from MSB. If bit 31 is set,
rD will be 31. If only bit 0 is set, rD will be 0.
If rs1 is 0, rD will be 32.
p.clb rD, rs1 rD = count leading bits of rs1
Note: This is the number of consecutive 1’s or 0’s from MSB.
Note: If rs1 is 0, rD will be 0.
p.cnt rD, rs1 rD = Population count of rs1, i.e. number of bits set in rs1
p.ror rD, rs1, rs2 rD = RotateRight(rs1, rs2)
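Three of the operations above have unambiguous definitions and can be modelled in a few lines. This is a sketch with hypothetical function names. For p.ror it assumes that only the low five bits of rs2 are used as the rotate amount. p.clb is left out because its exact result is not fully specified by the description.

```python
MASK32 = 0xFFFFFFFF


def p_fl1(rs1):
    """Position of the most significant set bit; 32 if rs1 is 0."""
    rs1 &= MASK32
    return 32 if rs1 == 0 else rs1.bit_length() - 1


def p_cnt(rs1):
    """Population count: number of bits set in rs1."""
    return bin(rs1 & MASK32).count("1")


def p_ror(rs1, rs2):
    """Rotate rs1 right by rs2 (assumption: amount taken modulo 32)."""
    rs1 &= MASK32
    n = rs2 & 31
    return ((rs1 >> n) | (rs1 << (32 - n))) & MASK32
```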
11 Luimm5[4:0] Iuimm5[4:0] src 001 dest 011 0011 p.extractu rD, rs1, Is3, Is2
11 Luimm5[4:0] Iuimm5[4:0] src 010 dest 011 0011 p.insert rD, rs1, Is3, Is2
11 Luimm5[4:0] Iuimm5[4:0] src 011 dest 011 0011 p.bclr rD, rs1, Is3, Is2
11 Luimm5[4:0] Iuimm5[4:0] src 100 dest 011 0011 p.bset rD, rs1, Is3, Is2
10 00000 src2 src1 000 dest 011 0011 p.extractr rD, rs1, rs2
10 00000 src2 src1 001 dest 011 0011 p.extractur rD, rs1, rs2
10 00000 src2 src1 010 dest 011 0011 p.insertr rD, rs1, rs2
10 00000 src2 src1 011 dest 011 0011 p.bclrr rD, rs1, rs2
10 00000 src2 src1 100 dest 011 0011 p.bsetr rD, rs1, rs2
31 25 24 20 19 15 14 12 11 7 6 0
funct7 rs2 rs1 funct3 rD opcode
000 0100 src2 src1 101 dest 011 0011 p.ror rD, rs1, rs2
000 1000 00000 src1 000 dest 011 0011 p.ff1 rD, rs1
000 1000 00000 src1 001 dest 011 0011 p.fl1 rD, rs1
000 1000 00000 src1 010 dest 011 0011 p.clb rD, rs1
000 1000 00000 src1 011 dest 011 0011 p.cnt rD, rs1
Mnemonic Description
p.slet rD, rs1, rs2 rD = rs1 <= rs2 ? 1 : 0
Note: Comparison is signed
p.sletu rD, rs1, rs2 rD = rs1 <= rs2 ? 1 : 0
Note: Comparison is unsigned
p.min rD, rs1, rs2 rD = rs1 < rs2 ? rs1 : rs2
Note: Comparison is signed
p.minu rD, rs1, rs2 rD = rs1 < rs2 ? rs1 : rs2
Note: Comparison is unsigned
p.max rD, rs1, rs2 rD = rs1 < rs2 ? rs2 : rs1
Note: Comparison is signed
p.maxu rD, rs1, rs2 rD = rs1 < rs2 ? rs2 : rs1
Note: Comparison is unsigned
p.exths rD, rs1 rD = Sext(rs1[15:0])
p.exthz rD, rs1 rD = Zext(rs1[15:0])
p.extbs rD, rs1 rD = Sext(rs1[7:0])
p.extbz rD, rs1 rD = Zext(rs1[7:0])
p.clip rD, rs1, Is2 if rs1 <= -2^(Is2-1), rD = -2^(Is2-1),
else if rs1 >= 2^(Is2-1)-1, rD = 2^(Is2-1)-1,
else rD = rs1
p.clipr rD, rs1, rs2 if rs1 <= -(rs2+1), rD = -(rs2+1),
else if rs1 >=rs2, rD = rs2,
else rD = rs1
p.clipu rD, rs1, Is2 if rs1 <= 0, rD = 0,
else if rs1 >= 2^(Is2-1)-1, rD = 2^(Is2-1)-1,
else rD = rs1
p.clipur rD, rs1, rs2 if rs1 <= 0, rD = 0,
else if rs1 >= rs2, rD = rs2,
else rD = rs1
p.addN rD, rs1, rs2, Is3 rD = (rs1 + rs2) >>> Is3
Note: Arithmetic shift right. Setting Is3 to 1 replaces former
p.avg
p.adduN rD, rs1, rs2, Is3 rD = (rs1 + rs2) >> Is3
Note: Logical shift right. Setting Is3 to 1 replaces former
p.avg
p.addRN rD, rs1, rs2, Is3 rD = (rs1 + rs2 + 2^(Is3-1)) >>> Is3
Note: Arithmetic shift right.
p.adduRN rD, rs1, rs2, Is3 rD = (rs1 + rs2 + 2^(Is3-1)) >> Is3
Note: Logical shift right.
p.addNr rD, rs1, rs2 rD = (rD + rs1) >>> rs2[4:0]
Note: Arithmetic shift right.
p.adduNr rD, rs1, rs2 rD = (rD + rs1) >> rs2[4:0]
Note: Logical shift right.
p.addRNr rD, rs1, rs2 rD = (rD + rs1 + 2^(rs2[4:0]-1)) >>> rs2[4:0]
Note: Arithmetic shift right.
p.adduRNr rD, rs1, rs2 rD = (rD + rs1 + 2^(rs2[4:0]-1)) >> rs2[4:0]
Note: Logical shift right.
p.subN rD, rs1, rs2, Is3 rD = (rs1 - rs2) >>> Is3
Note: Arithmetic shift right.
p.subuN rD, rs1, rs2, Is3 rD = (rs1 - rs2) >> Is3
Note: Logical shift right.
p.subRN rD, rs1, rs2, Is3 rD = (rs1 - rs2 + 2^(Is3-1)) >>> Is3
Note: Arithmetic shift right.
p.subuRN rD, rs1, rs2, Is3 rD = (rs1 - rs2 + 2^(Is3-1)) >> Is3
Note: Logical shift right.
p.subNr rD, rs1, rs2 rD = (rD - rs1) >>> rs2[4:0]
Note: Arithmetic shift right.
p.subuNr rD, rs1, rs2 rD = (rD - rs1) >> rs2[4:0]
Note: Logical shift right.
p.subRNr rD, rs1, rs2 rD = (rD - rs1 + 2^(rs2[4:0]-1)) >>> rs2[4:0]
Note: Arithmetic shift right.
p.subuRNr rD, rs1, rs2 rD = (rD - rs1 + 2^(rs2[4:0]-1)) >> rs2[4:0]
Note: Logical shift right.
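The clip and add-with-normalization formulas above can be checked with a small model. This is a sketch under stated assumptions, not the hardware definition: the function names are hypothetical, operands and results are signed Python integers rather than raw register bit patterns, the immediates Is2 and Is3 are assumed to be at least 1 where the formula uses 2^(Is-1), and the sum is assumed to wrap to 32 bits before the shift.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def p_clip(rs1, is2):
    """Clamp rs1 to [-2^(Is2-1), 2^(Is2-1)-1]; assumes 1 <= is2 <= 31."""
    lo = -(1 << (is2 - 1))
    hi = (1 << (is2 - 1)) - 1
    return max(lo, min(hi, rs1))


def p_addN(rs1, rs2, is3):
    """(rs1 + rs2) >>> Is3, arithmetic shift right."""
    return to_signed32(rs1 + rs2) >> is3


def p_addRN(rs1, rs2, is3):
    """(rs1 + rs2 + 2^(Is3-1)) >>> Is3, with rounding; assumes is3 >= 1."""
    return to_signed32(rs1 + rs2 + (1 << (is3 - 1))) >> is3
```

The rounding variant adds half of the divisor before shifting, so 3 + 4 shifted by 1 gives 4 instead of 3.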
000 1000 00000 src1 101 dest 011 0011 p.exthz rD, rs1
000 1000 00000 src1 110 dest 011 0011 p.extbs rD, rs1
000 1000 00000 src1 111 dest 011 0011 p.extbz rD, rs1
31 25 24 20 19 15 14 12 11 7 6 0
funct7 Is2[4:0] rs1 funct3 rD opcode
000 1010 Iuimm5[4:0] src1 001 dest 011 0011 p.clip rD, rs1, Is2
000 1010 Iuimm5[4:0] src1 010 dest 011 0011 p.clipu rD, rs1, Is2
000 1010 src2 src1 010 dest 011 0011 p.clipr rD, rs1, rs2
000 1010 src2 src1 010 dest 011 0011 p.clipur rD, rs1, rs2
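The clip instructions encoded above saturate rs1 into a range given either by an immediate or by rs2. Their behavior, as stated in the mnemonic table, can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, rs1 and rs2 are assumed to be interpreted as signed 32-bit values, and Is2 is assumed to be at least 1 for p.clipu.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def clipr(rs1, rs2):
    """p.clipr: clamp rs1 to the range [-(rs2+1), rs2]."""
    a, bound = to_signed32(rs1), to_signed32(rs2)
    if a <= -(bound + 1):
        result = -(bound + 1)
    elif a >= bound:
        result = bound
    else:
        result = a
    return result & MASK32


def clipur(rs1, rs2):
    """p.clipur: clamp rs1 to the range [0, rs2]."""
    a, bound = to_signed32(rs1), to_signed32(rs2)
    if a <= 0:
        return 0
    return (bound if a >= bound else a) & MASK32


def clipu(rs1, is2):
    """p.clipu: clamp rs1 to the range [0, 2^(Is2-1)-1]."""
    upper = (1 << (is2 - 1)) - 1  # assumption: Is2 >= 1
    a = to_signed32(rs1)
    if a <= 0:
        return 0
    return upper if a >= upper else a
```

For example, `clipr(x, 127)` saturates `x` to the signed 8-bit range, and `clipur(x, 255)` saturates it to the unsigned 8-bit range.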
31 30 29 25 24 20 19 15 14 12 11 7 6 0
f2 Is3[4:0] rs2 rs1 funct3 rD opcode
00 Luimm5[4:0] src2 src1 010 dest 101 1011 p.addN rD, rs1, rs2, Is3
10 Luimm5[4:0] src2 src1 010 dest 101 1011 p.adduN rD, rs1, rs2, Is3
00 Luimm5[4:0] src2 src1 110 dest 101 1011 p.addRN rD, rs1, rs2, Is3
10 Luimm5[4:0] src2 src1 110 dest 101 1011 p.adduRN rD, rs1, rs2, Is3
00 Luimm5[4:0] src2 src1 011 dest 101 1011 p.subN rD, rs1, rs2, Is3
10 Luimm5[4:0] src2 src1 011 dest 101 1011 p.subuN rD, rs1, rs2, Is3
00 Luimm5[4:0] src2 src1 111 dest 101 1011 p.subRN rD, rs1, rs2, Is3
10 Luimm5[4:0] src2 src1 111 dest 101 1011 p.subuRN rD, rs1, rs2, Is3
01 00000 src2 src1 010 dest 101 1011 p.addNr rD, rs1, rs2
11 00000 src2 src1 010 dest 101 1011 p.adduNr rD, rs1, rs2
01 00000 src2 src1 110 dest 101 1011 p.addRNr rD, rs1, rs2
11 00000 src2 src1 110 dest 101 1011 p.adduRNr rD, rs1, rs2
01 00000 src2 src1 011 dest 101 1011 p.subNr rD, rs1, rs2
11 00000 src2 src1 011 dest 101 1011 p.subuNr rD, rs1, rs2
01 00000 src2 src1 111 dest 101 1011 p.subRNr rD, rs1, rs2
11 00000 src2 src1 111 dest 101 1011 p.subuRNr rD, rs1, rs2
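The field layout in the table header (f2 in bits 31:30, Is3 in 29:25, rs2 in 24:20, rs1 in 19:15, funct3 in 14:12, rD in 11:7, opcode in 6:0) can be turned into a small encoder. The sketch below is illustrative only: `encode_norm` and the two wrappers are hypothetical helper names, and only field values that appear in the rows above are used.

```python
OPCODE_NORM = 0b1011011  # the "101 1011" opcode column of the table


def _check(name, value, width):
    if not 0 <= value < (1 << width):
        raise ValueError(f"{name}={value} does not fit in {width} bits")
    return value


def encode_norm(f2, is3, rs2, rs1, funct3, rd, opcode=OPCODE_NORM):
    """Pack the fields of the add/sub-with-normalization format into 32 bits."""
    return (
        (_check("f2", f2, 2) << 30)
        | (_check("Is3", is3, 5) << 25)
        | (_check("rs2", rs2, 5) << 20)
        | (_check("rs1", rs1, 5) << 15)
        | (_check("funct3", funct3, 3) << 12)
        | (_check("rD", rd, 5) << 7)
        | _check("opcode", opcode, 7)
    )


def encode_p_addn(rd, rs1, rs2, is3):
    """p.addN row: f2 = 00, funct3 = 010."""
    return encode_norm(0b00, is3, rs2, rs1, 0b010, rd)


def encode_p_addurn(rd, rs1, rs2, is3):
    """p.adduRN row: f2 = 10, funct3 = 110."""
    return encode_norm(0b10, is3, rs2, rs1, 0b110, rd)


word = encode_p_addn(rd=10, rs1=11, rs2=12, is3=4)
print(f"p.addN x10, x11, x12, 4 -> 0x{word:08X}")
```

Register numbers are passed as plain integers (x10 is 10); an assembler would normally perform this mapping.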
Mnemonic Description
p.beqimm rs1, Imm5, Imm12 Branch to PC + (Imm12 << 1) if rs1 is equal to
Imm5. Imm5 is signed.
p.bneimm rs1, Imm5, Imm12 Branch to PC + (Imm12 << 1) if rs1 is not equal
to Imm5. Imm5 is signed.
[12] [10:5] [4:0] src1 011 [4:1] [11] 1100011 p.bneimm rs1, Imm5, Imm12
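The immediate branches can be read as the following next-PC computation. This is a minimal sketch under stated assumptions: the function names are hypothetical, Imm12 is assumed to be sign-extended like the offset of the standard RISC-V branches (the table only states that Imm5 is signed), and a branch that is not taken is assumed to fall through to PC + 4.

```python
MASK32 = 0xFFFFFFFF


def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def _branch(pc, taken, imm12):
    if taken:
        # Assumption: Imm12 is a signed offset in units of 2 bytes.
        return (pc + (sext(imm12, 12) << 1)) & MASK32
    return (pc + 4) & MASK32  # assumption: 32-bit instruction, fall through


def p_beqimm(pc, rs1, imm5, imm12):
    """Next PC for p.beqimm: branch if rs1 equals the sign-extended Imm5."""
    return _branch(pc, sext(rs1, 32) == sext(imm5, 5), imm12)


def p_bneimm(pc, rs1, imm5, imm12):
    """Next PC for p.bneimm: branch if rs1 differs from the sign-extended Imm5."""
    return _branch(pc, sext(rs1, 32) != sext(imm5, 5), imm12)
```

Because Imm5 is signed, the comparable constants range from -16 to 15; for instance, an Imm5 field of 0b11111 matches a register holding 0xFFFFFFFF.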
14.4 Multiply-Accumulate
Mnemonic Description
p.mulhhuRN rD, rs1, rs2, Is3 rD[31:0] = (Zext(rs1[31:16]) * Zext(rs2[31:16]) + 2^(Is3-1)) >> Is3
Note: Logical shift right
16-Bit x 16-Bit Multiply-Accumulate
p.macsN rD, rs1, rs2, Is3 rD[31:0] = (Sext(rs1[15:0]) * Sext(rs2[15:0]) + rD) >>> Is3
Note: Arithmetic shift right
p.machhsN rD, rs1, rs2, Is3 rD[31:0] = (Sext(rs1[31:16]) * Sext(rs2[31:16]) + rD) >>> Is3
Note: Arithmetic shift right
p.macsRN rD, rs1, rs2, Is3 rD[31:0] = (Sext(rs1[15:0]) * Sext(rs2[15:0]) + rD + 2^(Is3-1)) >>> Is3
Note: Arithmetic shift right
p.machhsRN rD, rs1, rs2, Is3 rD[31:0] = (Sext(rs1[31:16]) * Sext(rs2[31:16]) + rD + 2^(Is3-1)) >>> Is3
Note: Arithmetic shift right
p.macuN rD, rs1, rs2, Is3 rD[31:0] = (Zext(rs1[15:0]) * Zext(rs2[15:0]) + rD) >> Is3
Note: Logical shift right
p.machhuN rD, rs1, rs2, Is3 rD[31:0] = (Zext(rs1[31:16]) * Zext(rs2[31:16]) + rD) >> Is3
Note: Logical shift right
p.macuRN rD, rs1, rs2, Is3 rD[31:0] = (Zext(rs1[15:0]) * Zext(rs2[15:0]) + rD + 2^(Is3-1)) >> Is3
Note: Logical shift right
p.machhuRN rD, rs1, rs2, Is3 rD[31:0] = (Zext(rs1[31:16]) * Zext(rs2[31:16]) + rD + 2^(Is3-1)) >> Is3
Note: Logical shift right
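The 16-bit x 16-bit multiply-accumulate rows above share one pattern: select a half-word from each source, extend it, multiply, add rD and an optional rounding term, then shift. A hedged Python sketch of that pattern follows. The names `half` and `mac_n` are hypothetical, the upper half-word is taken as bits [31:16] on the assumption that a 16-bit operand is meant, the accumulation is assumed to wrap to 32 bits, and the rounding term is assumed to be 0 when Is3 is 0.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def half(word, high, signed):
    """Return the low or high half-word of a register, sign- or zero-extended."""
    value = (word >> (16 if high else 0)) & 0xFFFF
    if signed and value & 0x8000:
        value -= 1 << 16
    return value


def mac_n(rd, rs1, rs2, shift, signed, high=False, rounding=False):
    """Model of p.mac{s,u}N, p.machh{s,u}N and their RN variants."""
    total = half(rs1, high, signed) * half(rs2, high, signed) + rd
    if rounding and shift > 0:  # assumption: no rounding term for Is3 == 0
        total += 1 << (shift - 1)
    total &= MASK32  # assumption: the accumulation wraps to 32 bits
    if signed:
        return (to_signed32(total) >> shift) & MASK32  # arithmetic shift
    return total >> shift  # logical shift
```

In this model p.macsRN corresponds to `mac_n(rD, rs1, rs2, Is3, signed=True, rounding=True)` and p.machhuN to `mac_n(rD, rs1, rs2, Is3, signed=False, high=True)`.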
010 0001 src2 src1 001 dest 011 0011 p.msu rD, rs1, rs2
31 30 29 25 24 20 19 15 14 12 11 7 6 0
f2 Is3[4:0] rs2 rs1 funct3 rD opcode
10 00000 src2 src1 000 dest 101 1011 p.muls rD, rs1, rs2
11 00000 src2 src1 000 dest 101 1011 p.mulhhs rD, rs1, rs2
10 Luimm5[4:0] src2 src1 000 dest 101 1011 p.mulsN rD, rs1, rs2, Is3
11 Luimm5[4:0] src2 src1 000 dest 101 1011 p.mulhhsN rD, rs1, rs2, Is3
10 Luimm5[4:0] src2 src1 100 dest 101 1011 p.mulsRN rD, rs1, rs2, Is3
11 Luimm5[4:0] src2 src1 100 dest 101 1011 p.mulhhsRN rD, rs1, rs2, Is3
00 00000 src2 src1 000 dest 101 1011 p.mulu rD, rs1, rs2
01 00000 src2 src1 000 dest 101 1011 p.mulhhu rD, rs1, rs2
00 Luimm5[4:0] src2 src1 000 dest 101 1011 p.muluN rD, rs1, rs2, Is3
01 Luimm5[4:0] src2 src1 000 dest 101 1011 p.mulhhuN rD, rs1, rs2, Is3
00 Luimm5[4:0] src2 src1 100 dest 101 1011 p.muluRN rD, rs1, rs2, Is3
01 Luimm5[4:0] src2 src1 100 dest 101 1011 p.mulhhuRN rD, rs1, rs2, Is3
10 Luimm5[4:0] src2 src1 001 dest 101 1011 p.macsN rD, rs1, rs2, Is3
11 Luimm5[4:0] src2 src1 001 dest 101 1011 p.machhsN rD, rs1, rs2, Is3
10 Luimm5[4:0] src2 src1 101 dest 101 1011 p.macsRN rD, rs1, rs2, Is3
11 Luimm5[4:0] src2 src1 101 dest 101 1011 p.machhsRN rD, rs1, rs2, Is3
00 Luimm5[4:0] src2 src1 001 dest 101 1011 p.macuN rD, rs1, rs2, Is3
01 Luimm5[4:0] src2 src1 001 dest 101 1011 p.machhuN rD, rs1, rs2, Is3
00 Luimm5[4:0] src2 src1 101 dest 101 1011 p.macuRN rD, rs1, rs2, Is3
01 Luimm5[4:0] src2 src1 101 dest 101 1011 p.machhuRN rD, rs1, rs2, Is3
14.5 Vectorial
Vectorial instructions perform operations in a SIMD-like manner on multiple sub-word elements at the same
time. This is done by segmenting the data path into smaller parts when 8- or 16-bit operations are
performed.
Vectorial instructions are available in two flavors:
• 8-Bit, to perform four operations on the 4 bytes inside a 32-bit word at the same time
• 16-Bit, to perform two operations on the 2 half-words inside a 32-bit word at the same time
Additionally, there are three modes that influence the second operand:
1. Normal mode, vector-vector operation. Both operands, from rs1 and rs2, are treated as vectors of
bytes or half-words.
2. Scalar replication mode (.sc), vector-scalar operation. Operand 1 is treated as a vector, while
operand 2 is treated as a scalar and replicated two or four times to form a complete vector. The
least-significant byte or half-word of rs2 is used for this purpose.
3. Immediate scalar replication mode (.sci), vector-scalar operation. Operand 1 is treated as a vector,
while operand 2 is treated as a scalar and comes from an immediate. The immediate is either sign-
or zero-extended, depending on the operation. If not specified, the immediate is sign-extended.
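The three modes differ only in how the second operand vector is formed. The sketch below shows one way to model that step; it is an illustration, not the hardware datapath. The names `split` and `operand2` are hypothetical, the immediate is assumed to be the 6-bit Imm6 field, and scalar replication is assumed to use the least-significant element of rs2.

```python
def split(word, width):
    """Split a 32-bit word into elements of `width` bits, element 0 first."""
    mask = (1 << width) - 1
    return [(word >> (i * width)) & mask for i in range(32 // width)]


def operand2(mode, width, rs2=0, imm6=0, signed_imm=True):
    """Form the op2 element list for width 8 (.b) or 16 (.h).

    mode is "vv" (normal), "sc" (scalar replication) or
    "sci" (immediate scalar replication).
    """
    count = 32 // width
    mask = (1 << width) - 1
    if mode == "vv":
        return split(rs2, width)
    if mode == "sc":
        # Assumption: the least-significant element of rs2 is replicated.
        return [rs2 & mask] * count
    if mode == "sci":
        imm = imm6 & 0x3F  # assumption: 6-bit immediate
        if signed_imm and imm & 0x20:
            imm -= 1 << 6  # sign-extend the 6-bit value
        return [imm & mask] * count
    raise ValueError(f"unknown mode: {mode}")


print(operand2("vv", 16, rs2=0x12345678))
print(operand2("sc", 8, rs2=0x12345678))
print(operand2("sci", 16, imm6=0x3F))
```

Elements are returned as raw bit patterns; whether each one is then read as signed or unsigned depends on the instruction that consumes it.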
Mnemonic Description
pv.dotusp[.sc,.sci].h rD = rs1[0] * op2[0] + rs1[1] * op2[1]
Note: rs1 is treated as unsigned, while rs2 is treated as signed
pv.dotusp[.sc,.sci].b rD = rs1[0] * op2[0] + rs1[1] * op2[1] + rs1[2] * op2[2] + rs1[3] * op2[3]
Note: rs1 is treated as unsigned, while rs2 is treated as signed
pv.dotsp[.sc,.sci].h rD = rs1[0] * op2[0] + rs1[1] * op2[1]
Note: All operations are signed
pv.dotsp[.sc,.sci].b rD = rs1[0] * op2[0] + rs1[1] * op2[1] + rs1[2] * op2[2] + rs1[3] * op2[3]
Note: All operations are signed
pv.sdotup[.sc,.sci].h rD = rD + rs1[0] * op2[0] + rs1[1] * op2[1]
Note: All operations are unsigned
pv.sdotup[.sc,.sci].b rD = rD + rs1[0] * op2[0] + rs1[1] * op2[1] + rs1[2] * op2[2] + rs1[3] * op2[3]
Note: All operations are unsigned
pv.sdotusp[.sc,.sci].h rD = rD + rs1[0] * op2[0] + rs1[1] * op2[1]
Note: rs1 is treated as unsigned, while rs2 is treated as signed
pv.sdotusp[.sc,.sci].b rD = rD + rs1[0] * op2[0] + rs1[1] * op2[1] + rs1[2] * op2[2] + rs1[3] * op2[3]
Note: rs1 is treated as unsigned, while rs2 is treated as signed
pv.sdotsp[.sc,.sci].h rD = rD + rs1[0] * op2[0] + rs1[1] * op2[1]
Note: All operations are signed
pv.sdotsp[.sc,.sci].b rD = rD + rs1[0] * op2[0] + rs1[1] * op2[1] + rs1[2] * op2[2] + rs1[3] * op2[3]
Note: All operations are signed
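The dot-product rows above differ only in element width, in the signedness of each operand, and in whether rD is accumulated. A compact model makes this explicit. It is a sketch under assumptions: `elements` and `dot` are hypothetical names, op2 is passed as an already formed 32-bit vector, and the result is assumed to wrap to 32 bits.

```python
MASK32 = 0xFFFFFFFF


def elements(word, width, signed):
    """Split a 32-bit word into elements, optionally read as signed values."""
    mask = (1 << width) - 1
    values = []
    for i in range(32 // width):
        value = (word >> (i * width)) & mask
        if signed and value >> (width - 1):
            value -= 1 << width
        values.append(value)
    return values


def dot(rs1, op2, width, rs1_signed, op2_signed, acc=0):
    """Sum of element-wise products, plus an optional accumulator (rD)."""
    a = elements(rs1, width, rs1_signed)
    b = elements(op2, width, op2_signed)
    total = acc + sum(x * y for x, y in zip(a, b))
    return total & MASK32  # assumption: the result wraps to 32 bits


# pv.dotsp.h: both operands signed, no accumulation
print(hex(dot(0x0002FFFF, 0x00030004, 16, True, True)))
# pv.sdotup.b: both operands unsigned, rD = 100 accumulated
print(dot(0x01020304, 0x01010101, 8, False, False, acc=100))
```

In this model pv.dotusp passes `rs1_signed=False, op2_signed=True`, and the pv.sdot* variants pass the previous value of rD as `acc`.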
Shuffle and Pack Instructions
pv.shuffle.h rD[31:16] = rs1[rs2[16]*16+15:rs2[16]*16]
rD[15:0] = rs1[rs2[0]*16+15:rs2[0]*16]
pv.shuffle.sci.h rD[31:16] = rs1[I1*16+15:I1*16]
rD[15:0] = rs1[I0*16+15:I0*16]
Note: I1 and I0 represent bits 1 and 0 of the immediate
pv.shuffle.b rD[31:24] = rs1[rs2[25:24]*8+7:rs2[25:24]*8]
rD[23:16] = rs1[rs2[17:16]*8+7:rs2[17:16]*8]
rD[15:8] = rs1[rs2[9:8]*8+7:rs2[9:8]*8]
rD[7:0] = rs1[rs2[1:0]*8+7:rs2[1:0]*8]
pv.shuffleI0.sci.b rD[31:24] = rs1[7:0]
rD[23:16] = rs1[(I5:I4)*8+7: (I5:I4)*8]
rD[15:8] = rs1[(I3:I2)*8+7: (I3:I2)*8]
rD[7:0] = rs1[(I1:I0)*8+7:(I1:I0)*8]
pv.shuffleI1.sci.b rD[31:24] = rs1[15:8]
rD[23:16] = rs1[(I5:I4)*8+7: (I5:I4)*8]
rD[15:8] = rs1[(I3:I2)*8+7: (I3:I2)*8]
rD[7:0] = rs1[(I1:I0)*8+7:(I1:I0)*8]
pv.shuffleI2.sci.b rD[31:24] = rs1[23:16]
rD[23:16] = rs1[(I5:I4)*8+7: (I5:I4)*8]
rD[15:8] = rs1[(I3:I2)*8+7: (I3:I2)*8]
rD[7:0] = rs1[(I1:I0)*8+7:(I1:I0)*8]
pv.shuffleI3.sci.b rD[31:24] = rs1[31:24]
rD[23:16] = rs1[(I5:I4)*8+7: (I5:I4)*8]
rD[15:8] = rs1[(I3:I2)*8+7: (I3:I2)*8]
rD[7:0] = rs1[(I1:I0)*8+7:(I1:I0)*8]
pv.shuffle2.h rD[31:16] = ((rs2[17] == 1) ? rs1 : rD)[rs2[16]*16+15:rs2[16]*16]
rD[15:0] = ((rs2[1] == 1) ? rs1 : rD)[rs2[0]*16+15:rs2[0]*16]
pv.shuffle2.b rD[31:24] = ((rs2[26] == 1) ? rs1 : rD)[rs2[25:24]*8+7:rs2[25:24]*8]
rD[23:16] = ((rs2[18] == 1) ? rs1 : rD)[rs2[17:16]*8+7:rs2[17:16]*8]
rD[15:8] = ((rs2[10] == 1) ? rs1 : rD)[rs2[9:8]*8+7:rs2[9:8]*8]
rD[7:0] = ((rs2[2] == 1) ? rs1 : rD)[rs2[1:0]*8+7:rs2[1:0]*8]
pv.pack.h rD[31:16] = rs1[15:0]
rD[15:0] = rs2[15:0]
pv.packhi.b rD[31:24] = rs1[7:0]
rD[23:16] = rs2[7:0]
Note: The rest of the bits of rD are untouched and keep their previous value
pv.packlo.b rD[15:8] = rs1[7:0]
rD[7:0] = rs2[7:0]
Note: The rest of the bits of rD are untouched and keep their previous value
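The byte shuffle and pack rows above can be modeled with a few shifts and masks. The following Python sketch is illustrative only; the function names are hypothetical and registers are modeled as 32-bit unsigned integers. For the instructions that keep or read the old destination value (pv.shuffle2.b, pv.packhi.b, pv.packlo.b), the previous rD is passed in explicitly.

```python
def get_byte(word, index):
    """Return byte `index` (0 = least significant) of a 32-bit word."""
    return (word >> (8 * index)) & 0xFF


def shuffle_b(rs1, rs2):
    """pv.shuffle.b: byte i of rD is the rs1 byte selected by rs2[8i+1:8i]."""
    result = 0
    for i in range(4):
        select = (rs2 >> (8 * i)) & 0x3
        result |= get_byte(rs1, select) << (8 * i)
    return result


def shuffle2_b(rd, rs1, rs2):
    """pv.shuffle2.b: bit rs2[8i+2] picks rs1 (1) or the old rD (0) as source."""
    result = 0
    for i in range(4):
        select = (rs2 >> (8 * i)) & 0x3
        source = rs1 if (rs2 >> (8 * i + 2)) & 1 else rd
        result |= get_byte(source, select) << (8 * i)
    return result


def pack_h(rs1, rs2):
    """pv.pack.h: rD[31:16] = rs1[15:0], rD[15:0] = rs2[15:0]."""
    return ((rs1 & 0xFFFF) << 16) | (rs2 & 0xFFFF)


def packhi_b(rd, rs1, rs2):
    """pv.packhi.b: write the upper two bytes, keep rD[15:0]."""
    return (rd & 0x0000FFFF) | ((rs1 & 0xFF) << 24) | ((rs2 & 0xFF) << 16)


def packlo_b(rd, rs1, rs2):
    """pv.packlo.b: write the lower two bytes, keep rD[31:16]."""
    return (rd & 0xFFFF0000) | ((rs1 & 0xFF) << 8) | (rs2 & 0xFF)
```

As a usage example, a selector of 0x00010203 in rs2 makes `shuffle_b` reverse the byte order of rs1, and a `packlo_b` followed by a `packhi_b` on the same destination assembles four independent bytes into one word.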
0 0000 0 0 src2 src1 100 dest 101 0111 pv.add.sc.h rD, rs1, rs2
0 0000 0 Imm6[5:0]s src1 110 dest 101 0111 pv.add.sci.h rD, rs1, Imm6
0 0000 0 0 src2 src1 001 dest 101 0111 pv.add.b rD, rs1, rs2
0 0000 0 0 src2 src1 101 dest 101 0111 pv.add.sc.b rD, rs1, rs2
0 0000 0 Imm6[5:0] src1 111 dest 101 0111 pv.add.sci.b rD, rs1, Imm6
0 0001 0 0 src2 src1 000 dest 101 0111 pv.sub.h rD, rs1, rs2
0 0001 0 0 src2 src1 100 dest 101 0111 pv.sub.sc.h rD, rs1, rs2
0 0001 0 Imm6[5:0]s src1 110 dest 101 0111 pv.sub.sci.h rD, rs1, Imm6
0 0001 0 0 src2 src1 001 dest 101 0111 pv.sub.b rD, rs1, rs2
0 0001 0 0 src2 src1 101 dest 101 0111 pv.sub.sc.b rD, rs1, rs2
0 0001 0 Imm6[5:0] src1 111 dest 101 0111 pv.sub.sci.b rD, rs1, Imm6
0 0010 0 0 src2 src1 000 dest 101 0111 pv.avg.h rD, rs1, rs2
0 0010 0 0 src2 src1 100 dest 101 0111 pv.avg.sc.h rD, rs1, rs2
0 0010 0 Imm6[5:0]s src1 110 dest 101 0111 pv.avg.sci.h rD, rs1, Imm6
0 0010 0 0 src2 src1 001 dest 101 0111 pv.avg.b rD, rs1, rs2
0 0010 0 0 src2 src1 101 dest 101 0111 pv.avg.sc.b rD, rs1, rs2
0 0010 0 Imm6[5:0] src1 111 dest 101 0111 pv.avg.sci.b rD, rs1, Imm6
0 0011 0 0 src2 src1 000 dest 101 0111 pv.avgu.h rD, rs1, rs2
0 0011 0 0 src2 src1 100 dest 101 0111 pv.avgu.sc.h rD, rs1, rs2
0 0011 0 Imm6[5:0]s src1 110 dest 101 0111 pv.avgu.sci.h rD, rs1, Imm6
0 0011 0 0 src2 src1 001 dest 101 0111 pv.avgu.b rD, rs1, rs2
0 0011 0 0 src2 src1 101 dest 101 0111 pv.avgu.sc.b rD, rs1, rs2
0 0011 0 Imm6[5:0] src1 111 dest 101 0111 pv.avgu.sci.b rD, rs1, Imm6
0 0100 0 0 src2 src1 000 dest 101 0111 pv.min.h rD, rs1, rs2
0 0100 0 0 src2 src1 100 dest 101 0111 pv.min.sc.h rD, rs1, rs2
0 0100 0 Imm6[5:0]s src1 110 dest 101 0111 pv.min.sci.h rD, rs1, Imm6
0 0100 0 0 src2 src1 001 dest 101 0111 pv.min.b rD, rs1, rs2
0 0100 0 0 src2 src1 101 dest 101 0111 pv.min.sc.b rD, rs1, rs2
0 0100 0 Imm6[5:0] src1 111 dest 101 0111 pv.min.sci.b rD, rs1, Imm6
0 0101 0 0 src2 src1 000 dest 101 0111 pv.minu.h rD, rs1, rs2
0 0101 0 0 src2 src1 100 dest 101 0111 pv.minu.sc.h rD, rs1, rs2
0 0101 0 Imm6[5:0]s src1 110 dest 101 0111 pv.minu.sci.h rD, rs1, Imm6
0 0101 0 0 src2 src1 001 dest 101 0111 pv.minu.b rD, rs1, rs2
0 0101 0 0 src2 src1 101 dest 101 0111 pv.minu.sc.b rD, rs1, rs2
0 0101 0 Imm6[5:0] src1 111 dest 101 0111 pv.minu.sci.b rD, rs1, Imm6
0 0110 0 0 src2 src1 000 dest 101 0111 pv.max.h rD, rs1, rs2
0 0110 0 0 src2 src1 100 dest 101 0111 pv.max.sc.h rD, rs1, rs2
0 0110 0 Imm6[5:0]s src1 110 dest 101 0111 pv.max.sci.h rD, rs1, Imm6
0 0110 0 0 src2 src1 001 dest 101 0111 pv.max.b rD, rs1, rs2
0 0110 0 0 src2 src1 101 dest 101 0111 pv.max.sc.b rD, rs1, rs2
0 0110 0 Imm6[5:0] src1 111 dest 101 0111 pv.max.sci.b rD, rs1, Imm6
0 0111 0 0 src2 src1 000 dest 101 0111 pv.maxu.h rD, rs1, rs2
0 0111 0 0 src2 src1 100 dest 101 0111 pv.maxu.sc.h rD, rs1, rs2
0 0111 0 Imm6[5:0]s src1 110 dest 101 0111 pv.maxu.sci.h rD, rs1, Imm6
0 0111 0 0 src2 src1 001 dest 101 0111 pv.maxu.b rD, rs1, rs2
0 0111 0 0 src2 src1 101 dest 101 0111 pv.maxu.sc.b rD, rs1, rs2
0 0111 0 Imm6[5:0] src1 111 dest 101 0111 pv.maxu.sci.b rD, rs1, Imm6
0 1000 0 0 src2 src1 000 dest 101 0111 pv.srl.h rD, rs1, rs2
0 1000 0 0 src2 src1 100 dest 101 0111 pv.srl.sc.h rD, rs1, rs2
0 1000 0 Imm6[5:0]s src1 110 dest 101 0111 pv.srl.sci.h rD, rs1, Imm6
0 1000 0 0 src2 src1 001 dest 101 0111 pv.srl.b rD, rs1, rs2
0 1000 0 0 src2 src1 101 dest 101 0111 pv.srl.sc.b rD, rs1, rs2
0 1000 0 Imm6[5:0] src1 111 dest 101 0111 pv.srl.sci.b rD, rs1, Imm6
0 1001 0 0 src2 src1 000 dest 101 0111 pv.sra.h rD, rs1, rs2
0 1001 0 0 src2 src1 100 dest 101 0111 pv.sra.sc.h rD, rs1, rs2
0 1001 0 Imm6[5:0]s src1 110 dest 101 0111 pv.sra.sci.h rD, rs1, Imm6
0 1001 0 0 src2 src1 001 dest 101 0111 pv.sra.b rD, rs1, rs2
0 1001 0 0 src2 src1 101 dest 101 0111 pv.sra.sc.b rD, rs1, rs2
0 1001 0 Imm6[5:0] src1 111 dest 101 0111 pv.sra.sci.b rD, rs1, Imm6
0 1010 0 0 src2 src1 000 dest 101 0111 pv.sll.h rD, rs1, rs2
0 1010 0 0 src2 src1 100 dest 101 0111 pv.sll.sc.h rD, rs1, rs2
0 1010 0 Imm6[5:0]s src1 110 dest 101 0111 pv.sll.sci.h rD, rs1, Imm6
0 1010 0 0 src2 src1 001 dest 101 0111 pv.sll.b rD, rs1, rs2
0 1010 0 0 src2 src1 101 dest 101 0111 pv.sll.sc.b rD, rs1, rs2
0 1010 0 Imm6[5:0] src1 111 dest 101 0111 pv.sll.sci.b rD, rs1, Imm6
0 1011 0 0 src2 src1 000 dest 101 0111 pv.or.h rD, rs1, rs2
0 1011 0 0 src2 src1 100 dest 101 0111 pv.or.sc.h rD, rs1, rs2
0 1011 0 Imm6[5:0]s src1 110 dest 101 0111 pv.or.sci.h rD, rs1, Imm6
0 1011 0 0 src2 src1 001 dest 101 0111 pv.or.b rD, rs1, rs2
0 1011 0 0 src2 src1 101 dest 101 0111 pv.or.sc.b rD, rs1, rs2
0 1011 0 Imm6[5:0] src1 111 dest 101 0111 pv.or.sci.b rD, rs1, Imm6
0 1100 0 0 src2 src1 000 dest 101 0111 pv.xor.h rD, rs1, rs2
0 1100 0 0 src2 src1 100 dest 101 0111 pv.xor.sc.h rD, rs1, rs2
0 1100 0 Imm6[5:0]s src1 110 dest 101 0111 pv.xor.sci.h rD, rs1, Imm6
0 1100 0 0 src2 src1 001 dest 101 0111 pv.xor.b rD, rs1, rs2
0 1100 0 0 src2 src1 101 dest 101 0111 pv.xor.sc.b rD, rs1, rs2
0 1100 0 Imm6[5:0] src1 111 dest 101 0111 pv.xor.sci.b rD, rs1, Imm6
0 1101 0 0 src2 src1 000 dest 101 0111 pv.and.h rD, rs1, rs2
0 1101 0 0 src2 src1 100 dest 101 0111 pv.and.sc.h rD, rs1, rs2
0 1101 0 Imm6[5:0]s src1 110 dest 101 0111 pv.and.sci.h rD, rs1, Imm6
0 1101 0 0 src2 src1 001 dest 101 0111 pv.and.b rD, rs1, rs2
0 1101 0 0 src2 src1 101 dest 101 0111 pv.and.sc.b rD, rs1, rs2
0 1101 0 Imm6[5:0] src1 111 dest 101 0111 pv.and.sci.b rD, rs1, Imm6
0 1110 0 0 00000 src1 000 dest 101 0111 pv.abs.h rD, rs1
0 1110 0 0 00000 src1 001 dest 101 0111 pv.abs.b rD, rs1
0 1111 0 Imm6[5:0] src1 110 dest 101 0111 pv.extract.h rD, Imm6
0 1111 0 Imm6[5:0] src1 111 dest 101 0111 pv.extract.b rD, Imm6
1 0010 0 Imm6[5:0] src1 110 dest 101 0111 pv.extractu.h rD, Imm6
1 0010 0 Imm6[5:0] src1 111 dest 101 0111 pv.extractu.b rD, Imm6
1 0110 0 Imm6[5:0] src1 110 dest 101 0111 pv.insert.h rD, Imm6
1 0110 0 Imm6[5:0] src1 111 dest 101 0111 pv.insert.b rD, Imm6
1 0000 0 0 src2 src1 000 dest 101 0111 pv.dotup.h rD, rs1, rs2
1 0000 0 0 src2 src1 100 dest 101 0111 pv.dotup.sc.h rD, rs1, rs2
1 0000 0 Imm6[5:0]s src1 110 dest 101 0111 pv.dotup.sci.h rD, rs1, Imm6
1 0000 0 0 src2 src1 001 dest 101 0111 pv.dotup.b rD, rs1, rs2
1 0000 0 0 src2 src1 101 dest 101 0111 pv.dotup.sc.b rD, rs1, rs2
1 0000 0 Imm6[5:0] src1 111 dest 101 0111 pv.dotup.sci.b rD, rs1, Imm6
1 0001 0 0 src2 src1 000 dest 101 0111 pv.dotusp.h rD, rs1, rs2
1 0001 0 0 src2 src1 100 dest 101 0111 pv.dotusp.sc.h rD, rs1, rs2
1 0001 0 Imm6[5:0]s src1 110 dest 101 0111 pv.dotusp.sci.h rD, rs1, Imm6
1 0001 0 0 src2 src1 001 dest 101 0111 pv.dotusp.b rD, rs1, rs2
1 0001 0 0 src2 src1 101 dest 101 0111 pv.dotusp.sc.b rD, rs1, rs2
1 0001 0 Imm6[5:0] src1 111 dest 101 0111 pv.dotusp.sci.b rD, rs1, Imm6
1 0011 0 0 src2 src1 000 dest 101 0111 pv.dotsp.h rD, rs1, rs2
1 0011 0 0 src2 src1 100 dest 101 0111 pv.dotsp.sc.h rD, rs1, rs2
1 0011 0 Imm6[5:0]s src1 110 dest 101 0111 pv.dotsp.sci.h rD, rs1, Imm6
1 0011 0 0 src2 src1 001 dest 101 0111 pv.dotsp.b rD, rs1, rs2
1 0011 0 0 src2 src1 101 dest 101 0111 pv.dotsp.sc.b rD, rs1, rs2
1 0011 0 Imm6[5:0] src1 111 dest 101 0111 pv.dotsp.sci.b rD, rs1, Imm6
1 0100 0 0 src2 src1 000 dest 101 0111 pv.sdotup.h rD, rs1, rs2
1 0100 0 0 src2 src1 100 dest 101 0111 pv.sdotup.sc.h rD, rs1, rs2
1 0100 0 Imm6[5:0]s src1 110 dest 101 0111 pv.sdotup.sci.h rD, rs1, Imm6
1 0100 0 0 src2 src1 001 dest 101 0111 pv.sdotup.b rD, rs1, rs2
1 0100 0 0 src2 src1 101 dest 101 0111 pv.sdotup.sc.b rD, rs1, rs2
1 0100 0 Imm6[5:0] src1 111 dest 101 0111 pv.sdotup.sci.b rD, rs1, Imm6
1 0101 0 0 src2 src1 000 dest 101 0111 pv.sdotusp.h rD, rs1, rs2
1 0101 0 0 src2 src1 100 dest 101 0111 pv.sdotusp.sc.h rD, rs1, rs2
1 0101 0 Imm6[5:0]s src1 110 dest 101 0111 pv.sdotusp.sci.h rD, rs1, Imm6
1 0101 0 0 src2 src1 001 dest 101 0111 pv.sdotusp.b rD, rs1, rs2
1 0101 0 0 src2 src1 101 dest 101 0111 pv.sdotusp.sc.b rD, rs1, rs2
1 0101 0 Imm6[5:0] src1 111 dest 101 0111 pv.sdotusp.sci.b rD, rs1, Imm6
1 0111 0 0 src2 src1 000 dest 101 0111 pv.sdotsp.h rD, rs1, rs2
1 0111 0 0 src2 src1 100 dest 101 0111 pv.sdotsp.sc.h rD, rs1, rs2
1 0111 0 Imm6[5:0]s src1 110 dest 101 0111 pv.sdotsp.sci.h rD, rs1, Imm6
1 0111 0 0 src2 src1 001 dest 101 0111 pv.sdotsp.b rD, rs1, rs2
1 0111 0 0 src2 src1 101 dest 101 0111 pv.sdotsp.sc.b rD, rs1, rs2
1 0111 0 Imm6[5:0] src1 111 dest 101 0111 pv.sdotsp.sci.b rD, rs1, Imm6
1 1000 0 0 src2 src1 000 dest 101 0111 pv.shuffle.h rD, rs1, rs2
1 1000 0 Imm6[5:0] src1 110 dest 101 0111 pv.shuffle.sci.h rD, rs1, Imm6
1 1000 0 0 src2 src1 001 dest 101 0111 pv.shuffle.b rD, rs1, rs2
1 1000 0 Imm6[5:0] src1 111 dest 101 0111 pv.shuffleI0.sci.b rD, rs1, Imm6
1 1101 0 Imm6[5:0] src1 111 dest 101 0111 pv.shuffleI1.sci.b rD, rs1, Imm6
1 1110 0 Imm6[5:0] src1 111 dest 101 0111 pv.shuffleI2.sci.b rD, rs1, Imm6
1 1111 0 Imm6[5:0] src1 111 dest 101 0111 pv.shuffleI3.sci.b rD, rs1, Imm6
1 1001 0 0 src2 src1 000 dest 101 0111 pv.shuffle2.h rD, rs1, rs2
1 1001 0 0 src2 src1 001 dest 101 0111 pv.shuffle2.b rD, rs1, rs2
1 1010 0 0 src2 src1 000 dest 101 0111 pv.pack.h rD, rs1, rs2
1 1011 0 0 src2 src1 001 dest 101 0111 pv.packhi.b rD, rs1, rs2
1 1100 0 0 src2 src1 001 dest 101 0111 pv.packlo.b rD, rs1, rs2
Note: Imm6[5:0] is encoded as { Imm6[0], Imm6[5:1] }, LSB at the 25th bit of the instruction
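The bit reordering described in the note can be written out explicitly. The sketch below assumes that the 6-bit field occupies instruction bits 25:20, with Imm6[0] in bit 25 and Imm6[5:1] in bits 24:20; the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
FIELD_SHIFT = 20  # assumption: the field occupies instruction bits 25:20
FIELD_MASK = 0x3F


def imm6_field(imm6):
    """Reorder Imm6[5:0] into the stored form { Imm6[0], Imm6[5:1] }."""
    imm6 &= FIELD_MASK
    return ((imm6 & 1) << 5) | (imm6 >> 1)


def place_imm6(word, imm6):
    """Insert the reordered immediate into a 32-bit instruction word."""
    cleared = word & ~(FIELD_MASK << FIELD_SHIFT) & MASK32
    return cleared | (imm6_field(imm6) << FIELD_SHIFT)


def extract_imm6(word):
    """Recover Imm6[5:0] from an instruction word (inverse of place_imm6)."""
    field = (word >> FIELD_SHIFT) & FIELD_MASK
    return ((field & 0x1F) << 1) | (field >> 5)


print(f"{imm6_field(0b100110):06b}")
```

With this layout an immediate of 1 sets only bit 25 of the instruction, which matches the note's statement that the LSB sits at bit 25.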
Mnemonic Description
pv.cmpeq[.sc,.sci]{.h,.b} rD, rs1, {rs2, Imm6} rD[i] = rs1[i] == op2 ? '1 : '0
pv.cmpne[.sc,.sci]{.h,.b} rD, rs1, {rs2, Imm6} rD[i] = rs1[i] != op2 ? '1 : '0
pv.cmpgt[.sc,.sci]{.h,.b} rD, rs1, {rs2, Imm6} rD[i] = rs1[i] > op2 ? '1 : '0
pv.cmpge[.sc,.sci]{.h,.b} rD, rs1, {rs2, Imm6} rD[i] = rs1[i] >= op2 ? '1 : '0
pv.cmplt[.sc,.sci]{.h,.b} rD, rs1, {rs2, Imm6} rD[i] = rs1[i] < op2 ? '1 : '0
pv.cmple[.sc,.sci]{.h,.b} rD, rs1, {rs2, Imm6} rD[i] = rs1[i] <= op2 ? '1 : '0
pv.cmpgtu[.sc,.sci]{.h,.b} rD, rs1, {rs2, Imm6} rD[i] = rs1[i] > op2 ? '1 : '0
Note: Unsigned comparison
pv.cmpgeu[.sc,.sci]{.h,.b} rD, rs1, {rs2, Imm6} rD[i] = rs1[i] >= op2 ? '1 : '0
Note: Unsigned comparison
pv.cmpltu[.sc,.sci]{.h,.b} rD, rs1, {rs2, Imm6} rD[i] = rs1[i] < op2 ? '1 : '0
Note: Unsigned comparison
pv.cmpleu[.sc,.sci]{.h,.b} rD, rs1, {rs2, Imm6} rD[i] = rs1[i] <= op2 ? '1 : '0
Note: Unsigned comparison
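Each comparison writes an element of all ones when the predicate holds and all zeros otherwise, so the result can be used directly as a bit mask. A minimal Python sketch of this behavior follows; `compare` is a hypothetical helper, op2 is passed as an already formed 32-bit vector, and the predicate is supplied from the standard `operator` module.

```python
import operator


def compare(rs1, op2, width, predicate, signed=True):
    """Element-wise compare; each result element is all ones or all zeros."""
    mask = (1 << width) - 1
    result = 0
    for i in range(32 // width):
        a = (rs1 >> (i * width)) & mask
        b = (op2 >> (i * width)) & mask
        if signed:
            # Reinterpret both elements as two's-complement values.
            if a >> (width - 1):
                a -= 1 << width
            if b >> (width - 1):
                b -= 1 << width
        if predicate(a, b):
            result |= mask << (i * width)
    return result


# pv.cmpeq.h: only the upper half-words are equal
print(hex(compare(0x00050003, 0x00050004, 16, operator.eq)))
# pv.cmpgt.b versus pv.cmpgtu.b on the same inputs
print(hex(compare(0xFF010000, 0x00000000, 8, operator.gt)))
print(hex(compare(0xFF010000, 0x00000000, 8, operator.gt, signed=False)))
```

The last two calls show why the unsigned variants exist: the byte 0xFF is less than zero when read as signed but greater than zero when read as unsigned.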
0 0000 1 0 src2 src1 100 dest 101 0111 pv.cmpeq.sc.h rD, rs1, rs2
0 0000 1 Imm6[5:0] src1 110 dest 101 0111 pv.cmpeq.sci.h rD, rs1, Imm6
0 0000 1 0 src2 src1 001 dest 101 0111 pv.cmpeq.b rD, rs1, rs2
0 0000 1 0 src2 src1 101 dest 101 0111 pv.cmpeq.sc.b rD, rs1, rs2
0 0000 1 Imm6[5:0] src1 111 dest 101 0111 pv.cmpeq.sci.b rD, rs1, Imm6
0 0001 1 0 src2 src1 000 dest 101 0111 pv.cmpne.h rD, rs1, rs2
0 0001 1 0 src2 src1 100 dest 101 0111 pv.cmpne.sc.h rD, rs1, rs2
0 0001 1 Imm6[5:0] src1 110 dest 101 0111 pv.cmpne.sci.h rD, rs1, Imm6
0 0001 1 0 src2 src1 001 dest 101 0111 pv.cmpne.b rD, rs1, rs2
0 0001 1 0 src2 src1 101 dest 101 0111 pv.cmpne.sc.b rD, rs1, rs2
0 0001 1 Imm6[5:0] src1 111 dest 101 0111 pv.cmpne.sci.b rD, rs1, Imm6
0 0010 1 0 src2 src1 000 dest 101 0111 pv.cmpgt.h rD, rs1, rs2
0 0010 1 0 src2 src1 100 dest 101 0111 pv.cmpgt.sc.h rD, rs1, rs2
0 0010 1 Imm6[5:0] src1 110 dest 101 0111 pv.cmpgt.sci.h rD, rs1, Imm6
0 0010 1 0 src2 src1 001 dest 101 0111 pv.cmpgt.b rD, rs1, rs2
0 0010 1 0 src2 src1 101 dest 101 0111 pv.cmpgt.sc.b rD, rs1, rs2
0 0010 1 Imm6[5:0] src1 111 dest 101 0111 pv.cmpgt.sci.b rD, rs1, Imm6
0 0011 1 0 src2 src1 000 dest 101 0111 pv.cmpge.h rD, rs1, rs2
0 0011 1 0 src2 src1 100 dest 101 0111 pv.cmpge.sc.h rD, rs1, rs2
0 0011 1 Imm6[5:0] src1 110 dest 101 0111 pv.cmpge.sci.h rD, rs1, Imm6
0 0011 1 0 src2 src1 001 dest 101 0111 pv.cmpge.b rD, rs1, rs2
0 0011 1 0 src2 src1 101 dest 101 0111 pv.cmpge.sc.b rD, rs1, rs2
0 0011 1 Imm6[5:0] src1 111 dest 101 0111 pv.cmpge.sci.b rD, rs1, Imm6
0 0100 1 0 src2 src1 000 dest 101 0111 pv.cmplt.h rD, rs1, rs2
0 0100 1 0 src2 src1 100 dest 101 0111 pv.cmplt.sc.h rD, rs1, rs2
0 0100 1 Imm6[5:0] src1 110 dest 101 0111 pv.cmplt.sci.h rD, rs1, Imm6
0 0100 1 0 src2 src1 001 dest 101 0111 pv.cmplt.b rD, rs1, rs2
0 0100 1 0 src2 src1 101 dest 101 0111 pv.cmplt.sc.b rD, rs1, rs2
0 0100 1 Imm6[5:0] src1 111 dest 101 0111 pv.cmplt.sci.b rD, rs1, Imm6
0 0101 1 0 src2 src1 000 dest 101 0111 pv.cmple.h rD, rs1, rs2
0 0101 1 0 src2 src1 100 dest 101 0111 pv.cmple.sc.h rD, rs1, rs2
0 0101 1 Imm6[5:0] src1 110 dest 101 0111 pv.cmple.sci.h rD, rs1, Imm6
0 0101 1 0 src2 src1 001 dest 101 0111 pv.cmple.b rD, rs1, rs2
0 0101 1 0 src2 src1 101 dest 101 0111 pv.cmple.sc.b rD, rs1, rs2
0 0101 1 Imm6[5:0] src1 111 dest 101 0111 pv.cmple.sci.b rD, rs1, Imm6
0 0110 1 0 src2 src1 000 dest 101 0111 pv.cmpgtu.h rD, rs1, rs2
0 0110 1 0 src2 src1 100 dest 101 0111 pv.cmpgtu.sc.h rD, rs1, rs2
0 0110 1 Imm6[5:0] src1 110 dest 101 0111 pv.cmpgtu.sci.h rD, rs1, Imm6
0 0110 1 0 src2 src1 001 dest 101 0111 pv.cmpgtu.b rD, rs1, rs2
0 0110 1 0 src2 src1 101 dest 101 0111 pv.cmpgtu.sc.b rD, rs1, rs2
0 0110 1 Imm6[5:0] src1 111 dest 101 0111 pv.cmpgtu.sci.b rD, rs1, Imm6
0 0111 1 0 src2 src1 000 dest 101 0111 pv.cmpgeu.h rD, rs1, rs2
0 0111 1 0 src2 src1 100 dest 101 0111 pv.cmpgeu.sc.h rD, rs1, rs2
0 0111 1 Imm6[5:0] src1 110 dest 101 0111 pv.cmpgeu.sci.h rD, rs1, Imm6
0 0111 1 0 src2 src1 001 dest 101 0111 pv.cmpgeu.b rD, rs1, rs2
0 0111 1 0 src2 src1 101 dest 101 0111 pv.cmpgeu.sc.b rD, rs1, rs2
0 0111 1 Imm6[5:0] src1 111 dest 101 0111 pv.cmpgeu.sci.b rD, rs1, Imm6
0 1000 1 0 src2 src1 000 dest 101 0111 pv.cmpltu.h rD, rs1, rs2
0 1000 1 0 src2 src1 100 dest 101 0111 pv.cmpltu.sc.h rD, rs1, rs2
0 1000 1 Imm6[5:0] src1 110 dest 101 0111 pv.cmpltu.sci.h rD, rs1, Imm6
0 1000 1 0 src2 src1 001 dest 101 0111 pv.cmpltu.b rD, rs1, rs2
0 1000 1 0 src2 src1 101 dest 101 0111 pv.cmpltu.sc.b rD, rs1, rs2
0 1000 1 Imm6[5:0] src1 111 dest 101 0111 pv.cmpltu.sci.b rD, rs1, Imm6
0 1001 1 0 src2 src1 000 dest 101 0111 pv.cmpleu.h rD, rs1, rs2
0 1001 1 0 src2 src1 100 dest 101 0111 pv.cmpleu.sc.h rD, rs1, rs2
0 1001 1 Imm6[5:0] src1 110 dest 101 0111 pv.cmpleu.sci.h rD, rs1, Imm6
0 1001 1 0 src2 src1 001 dest 101 0111 pv.cmpleu.b rD, rs1, rs2
0 1001 1 0 src2 src1 101 dest 101 0111 pv.cmpleu.sc.b rD, rs1, rs2
0 1001 1 Imm6[5:0] src1 111 dest 101 0111 pv.cmpleu.sci.b rD, rs1, Imm6
Note: Imm6[5:0] is encoded as { Imm6[0], Imm6[5:1] }, LSB at the 25th bit of the instruction