In Praise of The RISC-V Reader

I like RISC-V and this book as they are elegant—brief, to the point, and complete. The
book’s commentaries provide a gratuitous history, motivation, and architecture critique.
—C. Gordon Bell, Microsoft and designer of the Digital PDP-11 and VAX-11
instruction set architectures

This book tells what RISC-V can do and why its designers chose to endow it with those
abilities. Even more interesting, the authors tell why RISC-V omits things found in earlier
machines. The reasons are at least as interesting as RISC-V’s endowments and omissions.
—Ivan Sutherland, Turing Award laureate called the father of computer graphics

RISC-V will change the world, and this book will help you become part of that change.
—Professor Michael B. Taylor, University of Washington

RISC-V is a fine choice for students to learn about instruction set architecture and
assembly-level programming, the basic underpinnings for later work in higher-level
languages. This clearly-written book offers a good introduction to RISC-V, augmented with
insightful comments on its evolutionary history and comparisons with other familiar
architectures. Drawing on past experience with other architectures, RISC-V designers were
able to avoid unnecessary, often irregular features, yielding easy pedagogy. Although
simple, it is still powerful enough for widespread use in real applications. Long ago, I used
to teach a first course in assembly programming and if I were doing that now, I’d happily
use this book.
—John Mashey, one of the designers of the MIPS instruction set architecture

This book will be an invaluable reference for anyone working with the RISC-V ISA. The
opcodes are presented in several useful formats for quick reference, making assembly
coding and interpretation easy. In addition, the explanations and examples of how to use
the ISA make the programmer’s job even simpler. The comparisons with other ISAs are
interesting and demonstrate why the RISC-V creators made the design decisions they did.
—Megan Wachs, PhD, SiFive Engineer
Open Reference Card ①

Base Integer Instructions: RV32I and RV64I
Category     Name                     Fmt  RV32I Base            +RV64I
Shifts       Shift Left Logical       R    SLL rd,rs1,rs2        SLLW rd,rs1,rs2
             Shift Left Log. Imm.     I    SLLI rd,rs1,shamt     SLLIW rd,rs1,shamt
             Shift Right Logical      R    SRL rd,rs1,rs2        SRLW rd,rs1,rs2
             Shift Right Log. Imm.    I    SRLI rd,rs1,shamt     SRLIW rd,rs1,shamt
             Shift Right Arithmetic   R    SRA rd,rs1,rs2        SRAW rd,rs1,rs2
             Shift Right Arith. Imm.  I    SRAI rd,rs1,shamt     SRAIW rd,rs1,shamt
Arithmetic   ADD                      R    ADD rd,rs1,rs2        ADDW rd,rs1,rs2
             ADD Immediate            I    ADDI rd,rs1,imm       ADDIW rd,rs1,imm
             SUBtract                 R    SUB rd,rs1,rs2        SUBW rd,rs1,rs2
             Load Upper Imm           U    LUI rd,imm
             Add Upper Imm to PC      U    AUIPC rd,imm
Logical      XOR                      R    XOR rd,rs1,rs2
             XOR Immediate            I    XORI rd,rs1,imm
             OR                       R    OR rd,rs1,rs2
             OR Immediate             I    ORI rd,rs1,imm
             AND                      R    AND rd,rs1,rs2
             AND Immediate            I    ANDI rd,rs1,imm
Compare      Set <                    R    SLT rd,rs1,rs2
             Set < Immediate          I    SLTI rd,rs1,imm
             Set < Unsigned           R    SLTU rd,rs1,rs2
             Set < Imm Unsigned       I    SLTIU rd,rs1,imm
Branches     Branch =                 B    BEQ rs1,rs2,imm
             Branch ≠                 B    BNE rs1,rs2,imm
             Branch <                 B    BLT rs1,rs2,imm
             Branch ≥                 B    BGE rs1,rs2,imm
             Branch < Unsigned        B    BLTU rs1,rs2,imm
             Branch ≥ Unsigned        B    BGEU rs1,rs2,imm
Jump & Link  J&L                      J    JAL rd,imm
             Jump & Link Register     I    JALR rd,rs1,imm
Synch        Synch thread             I    FENCE
             Synch Instr & Data       I    FENCE.I
Environment  CALL                     I    ECALL
             BREAK                    I    EBREAK
Control Status Register (CSR)
             Read/Write               I    CSRRW rd,csr,rs1
             Read & Set Bit           I    CSRRS rd,csr,rs1
             Read & Clear Bit         I    CSRRC rd,csr,rs1
             Read/Write Imm           I    CSRRWI rd,csr,imm
             Read & Set Bit Imm       I    CSRRSI rd,csr,imm
             Read & Clear Bit Imm     I    CSRRCI rd,csr,imm
Loads        Load Byte                I    LB rd,rs1,imm
             Load Halfword            I    LH rd,rs1,imm
             Load Byte Unsigned       I    LBU rd,rs1,imm
             Load Half Unsigned       I    LHU rd,rs1,imm        LWU rd,rs1,imm
             Load Word                I    LW rd,rs1,imm         LD rd,rs1,imm
Stores       Store Byte               S    SB rs1,rs2,imm
             Store Halfword           S    SH rs1,rs2,imm
             Store Word               S    SW rs1,rs2,imm        SD rs1,rs2,imm

RV Privileged Instructions
Category     Name                          Fmt  RV mnemonic
Trap         Mach-mode trap return         R    MRET
             Supervisor-mode trap return   R    SRET
Interrupt    Wait for Interrupt            R    WFI
MMU          Virtual Memory FENCE          R    SFENCE.VMA rs1,rs2

Examples of the 60 RV Pseudoinstructions
Name                              Fmt  Pseudoinstruction
Branch = 0 (BEQ rs,x0,imm)        J    BEQZ rs,imm
Jump (uses JAL x0,imm)            J    J imm
MoVe (uses ADDI rd,rs,0)          R    MV rd,rs
RETurn (uses JALR x0,ra,0)        I    RET

Optional Compressed (16-bit) Instruction Extension: RV32C
Category     Name                     Fmt  RVC                     RISC-V equivalent
Loads        Load Word                CL   C.LW rd′,rs1′,imm       LW rd′,rs1′,imm*4
             Load Word SP             CI   C.LWSP rd,imm           LW rd,sp,imm*4
             Float Load Word          CL   C.FLW rd′,rs1′,imm      FLW rd′,rs1′,imm*4
             Float Load Word SP       CI   C.FLWSP rd,imm          FLW rd,sp,imm*4
             Float Load Double        CL   C.FLD rd′,rs1′,imm      FLD rd′,rs1′,imm*8
             Float Load Double SP     CI   C.FLDSP rd,imm          FLD rd,sp,imm*8
Stores       Store Word               CS   C.SW rs1′,rs2′,imm      SW rs1′,rs2′,imm*4
             Store Word SP            CSS  C.SWSP rs2,imm          SW rs2,sp,imm*4
             Float Store Word         CS   C.FSW rs1′,rs2′,imm     FSW rs1′,rs2′,imm*4
             Float Store Word SP      CSS  C.FSWSP rs2,imm         FSW rs2,sp,imm*4
             Float Store Double       CS   C.FSD rs1′,rs2′,imm     FSD rs1′,rs2′,imm*8
             Float Store Double SP    CSS  C.FSDSP rs2,imm         FSD rs2,sp,imm*8
Arithmetic   ADD                      CR   C.ADD rd,rs1            ADD rd,rd,rs1
             ADD Immediate            CI   C.ADDI rd,imm           ADDI rd,rd,imm
             ADD SP Imm * 16          CI   C.ADDI16SP x0,imm       ADDI sp,sp,imm*16
             ADD SP Imm * 4           CIW  C.ADDI4SPN rd′,imm      ADDI rd′,sp,imm*4
             SUB                      CR   C.SUB rd,rs1            SUB rd,rd,rs1
             AND                      CR   C.AND rd,rs1            AND rd,rd,rs1
             AND Immediate            CI   C.ANDI rd,imm           ANDI rd,rd,imm
             OR                       CR   C.OR rd,rs1             OR rd,rd,rs1
             eXclusive OR             CR   C.XOR rd,rs1            XOR rd,rd,rs1
             MoVe                     CR   C.MV rd,rs1             ADD rd,rs1,x0
             Load Immediate           CI   C.LI rd,imm             ADDI rd,x0,imm
             Load Upper Imm           CI   C.LUI rd,imm            LUI rd,imm
Shifts       Shift Left Imm           CI   C.SLLI rd,imm           SLLI rd,rd,imm
             Shift Right Ari. Imm.    CI   C.SRAI rd,imm           SRAI rd,rd,imm
             Shift Right Log. Imm.    CI   C.SRLI rd,imm           SRLI rd,rd,imm
Branches     Branch=0                 CB   C.BEQZ rs1′,imm         BEQ rs1′,x0,imm
             Branch≠0                 CB   C.BNEZ rs1′,imm         BNE rs1′,x0,imm
Jump         Jump                     CJ   C.J imm                 JAL x0,imm
             Jump Register            CR   C.JR rs1                JALR x0,rs1,0
Jump & Link  J&L                      CJ   C.JAL imm               JAL ra,imm
             Jump & Link Register     CR   C.JALR rs1              JALR ra,rs1,0
System       Env. BREAK               CI   C.EBREAK                EBREAK

Optional Compressed Extension: RV64C
All RV32C (except C.JAL, 4 word loads, 4 word stores) plus:
ADD Word (C.ADDW)            Load Doubleword (C.LD)
ADD Imm. Word (C.ADDIW)      Load Doubleword SP (C.LDSP)
SUBtract Word (C.SUBW)       Store Doubleword (C.SD)
                             Store Doubleword SP (C.SDSP)

32-bit Instruction Formats: R, I, S, B, U, J
16-bit (RVC) Instruction Formats: CR, CI, CSS, CIW, CL, CS, CB, CJ
RISC-V Integer Base (RV32I/64I), privileged, and optional RV32/64C. Registers x1-x31 and the PC are 32 bits wide in RV32I and 64 in
RV64I (x0=0). RV64I adds 12 instructions for the wider data. Every 16-bit RVC instruction maps to an existing 32-bit RISC-V instruction.
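
The claim that every 16-bit RVC instruction maps to an existing 32-bit instruction can be illustrated with one row of the card: C.ADDI rd,imm expands to ADDI rd,rd,imm. The Python sketch below is not taken from the book. The function name expand_c_addi is hypothetical, and the bit positions assume the standard CI format for C.ADDI (funct3 = 000, opcode bits [1:0] = 01) and the standard I format for ADDI (opcode 0010011).

```python
def expand_c_addi(half: int) -> int:
    """Expand a 16-bit C.ADDI encoding into the equivalent 32-bit ADDI encoding.

    Illustrative only: handles just the C.ADDI row of the reference card.
    """
    if half & 0b11 != 0b01 or (half >> 13) & 0b111 != 0b000:
        raise ValueError("not a C.ADDI encoding")

    rd = (half >> 7) & 0x1F                                  # rd is also rs1
    imm = ((half >> 2) & 0x1F) | (((half >> 12) & 1) << 5)   # imm[4:0], imm[5]
    if imm & 0x20:                                           # sign-extend 6-bit imm
        imm -= 0x40

    # I format: imm[11:0] | rs1 | funct3 | rd | opcode
    return ((imm & 0xFFF) << 20) | (rd << 15) | (0b000 << 12) | (rd << 7) | 0b0010011


if __name__ == "__main__":
    for half in (0x0505, 0x1141):   # c.addi a0,1 and c.addi sp,-16
        print(f"{half:#06x} -> {expand_c_addi(half):#010x}")
```

Under these assumptions, 0x0505 (C.ADDI a0,1) expands to 0x00150513 (ADDI a0,a0,1), and 0x1141 (C.ADDI sp,-16) expands to 0xff010113 (ADDI sp,sp,-16).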
Open Reference Card ②

Optional Multiply-Divide Instruction Extension: RVM
Category     Name                     Fmt  RV32M (Multiply-Divide)   +RV64M
Multiply     MULtiply                 R    MUL rd,rs1,rs2            MULW rd,rs1,rs2
             MULtiply High            R    MULH rd,rs1,rs2
             MULtiply High Sign/Uns   R    MULHSU rd,rs1,rs2
             MULtiply High Uns        R    MULHU rd,rs1,rs2
Divide       DIVide                   R    DIV rd,rs1,rs2            DIVW rd,rs1,rs2
             DIVide Unsigned          R    DIVU rd,rs1,rs2
Remainder    REMainder                R    REM rd,rs1,rs2            REMW rd,rs1,rs2
             REMainder Unsigned       R    REMU rd,rs1,rs2           REMUW rd,rs1,rs2

Optional Atomic Instruction Extension: RVA
Category     Name                     Fmt  RV32A (Atomic)            +RV64A
Load         Load Reserved            R    LR.W rd,rs1               LR.D rd,rs1
Store        Store Conditional        R    SC.W rd,rs1,rs2           SC.D rd,rs1,rs2
Swap         SWAP                     R    AMOSWAP.W rd,rs1,rs2      AMOSWAP.D rd,rs1,rs2
Add          ADD                      R    AMOADD.W rd,rs1,rs2       AMOADD.D rd,rs1,rs2
Logical      XOR                      R    AMOXOR.W rd,rs1,rs2       AMOXOR.D rd,rs1,rs2
             AND                      R    AMOAND.W rd,rs1,rs2       AMOAND.D rd,rs1,rs2
             OR                       R    AMOOR.W rd,rs1,rs2        AMOOR.D rd,rs1,rs2
Min/Max      MINimum                  R    AMOMIN.W rd,rs1,rs2       AMOMIN.D rd,rs1,rs2
             MAXimum                  R    AMOMAX.W rd,rs1,rs2       AMOMAX.D rd,rs1,rs2
             MINimum Unsigned         R    AMOMINU.W rd,rs1,rs2      AMOMINU.D rd,rs1,rs2
             MAXimum Unsigned         R    AMOMAXU.W rd,rs1,rs2      AMOMAXU.D rd,rs1,rs2

Two Optional Floating-Point Instruction Extensions: RVF & RVD
Category     Name                       Fmt  RV32{F|D} (SP,DP Fl. Pt.)    +RV64{F|D}
Move         Move from Integer          R    FMV.W.X rd,rs1               FMV.D.X rd,rs1
             Move to Integer            R    FMV.X.W rd,rs1               FMV.X.D rd,rs1
Convert      ConVerT from Int           R    FCVT.{S|D}.W rd,rs1          FCVT.{S|D}.L rd,rs1
             ConVerT from Int Unsigned  R    FCVT.{S|D}.WU rd,rs1         FCVT.{S|D}.LU rd,rs1
             ConVerT to Int             R    FCVT.W.{S|D} rd,rs1          FCVT.L.{S|D} rd,rs1
             ConVerT to Int Unsigned    R    FCVT.WU.{S|D} rd,rs1         FCVT.LU.{S|D} rd,rs1
Load         Load                       I    FL{W,D} rd,rs1,imm
Store        Store                      S    FS{W,D} rs1,rs2,imm
Arithmetic   ADD                        R    FADD.{S|D} rd,rs1,rs2
             SUBtract                   R    FSUB.{S|D} rd,rs1,rs2
             MULtiply                   R    FMUL.{S|D} rd,rs1,rs2
             DIVide                     R    FDIV.{S|D} rd,rs1,rs2
             SQuare RooT                R    FSQRT.{S|D} rd,rs1
Mul-Add      Multiply-ADD               R    FMADD.{S|D} rd,rs1,rs2,rs3
             Multiply-SUBtract          R    FMSUB.{S|D} rd,rs1,rs2,rs3
             Negative Multiply-SUBtract R    FNMSUB.{S|D} rd,rs1,rs2,rs3
             Negative Multiply-ADD      R    FNMADD.{S|D} rd,rs1,rs2,rs3
Sign Inject  SiGN source                R    FSGNJ.{S|D} rd,rs1,rs2
             Negative SiGN source       R    FSGNJN.{S|D} rd,rs1,rs2
             Xor SiGN source            R    FSGNJX.{S|D} rd,rs1,rs2
Min/Max      MINimum                    R    FMIN.{S|D} rd,rs1,rs2
             MAXimum                    R    FMAX.{S|D} rd,rs1,rs2
Compare      compare Float =            R    FEQ.{S|D} rd,rs1,rs2
             compare Float <            R    FLT.{S|D} rd,rs1,rs2
             compare Float ≤            R    FLE.{S|D} rd,rs1,rs2
Categorize   CLASSify type              R    FCLASS.{S|D} rd,rs1
Configure    Read Status                R    FRCSR rd
             Read Rounding Mode         R    FRRM rd
             Read Flags                 R    FRFLAGS rd
             Swap Status Reg            R    FSCSR rd,rs1
             Swap Rounding Mode         R    FSRM rd,rs1
             Swap Flags                 R    FSFLAGS rd,rs1
             Swap Rounding Mode Imm     I    FSRMI rd,imm
             Swap Flags Imm             I    FSFLAGSI rd,imm

Calling Convention
Register   ABI Name   Saver
x0         zero       ---
x1         ra         Caller
x2         sp         Callee
x3         gp         ---
x4         tp         ---
x5-7       t0-2       Caller
x8         s0/fp      Callee
x9         s1         Callee
x10-11     a0-1       Caller
x12-17     a2-7       Caller
x18-27     s2-11      Callee
x28-31     t3-6       Caller
f0-7       ft0-7      Caller
f8-9       fs0-1      Callee
f10-11     fa0-1      Caller
f12-17     fa2-7      Caller
f18-27     fs2-11     Callee
f28-31     ft8-11     Caller

ABI Name         Description
zero             Hardwired zero
ra               Return address
sp               Stack pointer
gp               Global pointer
tp               Thread pointer
t0-6, ft0-11     Temporaries
s0-11, fs0-11    Saved registers
a0-7, fa0-7      Function args

Optional Vector Extension: RVV
Name               Fmt  RV32V/RV64V
SET Vector Len.    R    SETVL rd,rs1
MULtiply High      R    VMULH rd,rs1,rs2
REMainder          R    VREM rd,rs1,rs2
Shift Left Log.    R    VSLL rd,rs1,rs2
Shift Right Log.   R    VSRL rd,rs1,rs2
Shift R. Arith.    R    VSRA rd,rs1,rs2
LoaD               I    VLD rd,rs1,imm
LoaD Strided       R    VLDS rd,rs1,rs2
LoaD indeXed       R    VLDX rd,rs1,rs2
STore              S    VST rd,rs1,imm
STore Strided      R    VSTS rd,rs1,rs2
STore indeXed      R    VSTX rd,rs1,rs2
AMO SWAP           R    AMOSWAP rd,rs1,rs2
AMO ADD            R    AMOADD rd,rs1,rs2
AMO XOR            R    AMOXOR rd,rs1,rs2
AMO AND            R    AMOAND rd,rs1,rs2
AMO OR             R    AMOOR rd,rs1,rs2
AMO MINimum        R    AMOMIN rd,rs1,rs2
AMO MAXimum        R    AMOMAX rd,rs1,rs2
Predicate =        R    VPEQ rd,rs1,rs2
Predicate ≠        R    VPNE rd,rs1,rs2
Predicate <        R    VPLT rd,rs1,rs2
Predicate ≥        R    VPGE rd,rs1,rs2
Predicate AND      R    VPAND rd,rs1,rs2
Pred. AND NOT      R    VPANDN rd,rs1,rs2
Predicate OR       R    VPOR rd,rs1,rs2
Predicate XOR      R    VPXOR rd,rs1,rs2
Predicate NOT      R    VPNOT rd,rs1
Pred. SWAP         R    VPSWAP rd,rs1
MOVe               R    VMOV rd,rs1
ConVerT            R    VCVT rd,rs1
ADD                R    VADD rd,rs1,rs2
SUBtract           R    VSUB rd,rs1,rs2
MULtiply           R    VMUL rd,rs1,rs2
DIVide             R    VDIV rd,rs1,rs2
SQuare RooT        R    VSQRT rd,rs1,rs2
Multiply-ADD       R    VFMADD rd,rs1,rs2,rs3
Multiply-SUB       R    VFMSUB rd,rs1,rs2,rs3
Neg. Mul.-SUB      R    VFNMSUB rd,rs1,rs2,rs3
Neg. Mul.-ADD      R    VFNMADD rd,rs1,rs2,rs3
SiGN inJect        R    VSGNJ rd,rs1,rs2
Neg SiGN inJect    R    VSGNJN rd,rs1,rs2
Xor SiGN inJect    R    VSGNJX rd,rs1,rs2
MINimum            R    VMIN rd,rs1,rs2
MAXimum            R    VMAX rd,rs1,rs2
XOR                R    VXOR rd,rs1,rs2
OR                 R    VOR rd,rs1,rs2
AND                R    VAND rd,rs1,rs2
CLASS              R    VCLASS rd,rs1
SET Data Conf.     R    VSETDCFG rd,rs1
EXTRACT            R    VEXTRACT rd,rs1,rs2
MERGE              R    VMERGE rd,rs1,rs2
SELECT             R    VSELECT rd,rs1,rs2

RISC-V calling convention and five optional extensions: 8 RV32M; 11 RV32A; 34 floating-point instructions each for 32- and 64-bit data (RV32F,
RV32D); and 53 RV32V. Using regex notation, {} means set, so FADD.{S|D} is both FADD.S and FADD.D. RV32{F|D} adds registers f0-f31,
whose width matches the widest precision, and a floating-point control and status register fcsr. RV32V adds vector registers v0-v31, vector
predicate registers vp0-vp7, and vector length register vl. RV64 adds a few instructions: RVM gets 4, RVA 11, RVF 6, RVD 6, and RVV 0.
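
The set notation used on this card can be expanded mechanically. The Python sketch below is an illustration, not part of the book: the helper name expand_braces is hypothetical, and it assumes that alternatives inside braces are separated by either | or a comma, the two separators that appear in the card's entries (for example FADD.{S|D} and FL{W,D}).

```python
import itertools
import re


def expand_braces(pattern: str) -> list[str]:
    """Expand card notation such as 'FADD.{S|D}' into every mnemonic it denotes."""
    # re.split with a capture group alternates literal text (even indices)
    # and brace contents (odd indices).
    parts = re.split(r"\{([^{}]*)\}", pattern)
    choices = [
        re.split(r"[|,]", part) if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    for entry in ("FADD.{S|D}", "FCVT.{S|D}.W", "FL{W,D}", "RV32{F|D}"):
        print(entry, "->", expand_braces(entry))
```

A pattern without braces is returned unchanged as a one-element list, so the same helper can be applied to every mnemonic on the card.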
The RISC-V Reader:
An Open Architecture Atlas
Beta Edition, 0.0.1

David Patterson and Andrew Waterman

October 4, 2017
Copyright 2017 Strawberry Canyon LLC. All rights reserved.
No part of this book or its related materials may be reproduced in any form without the
written consent of the copyright holder.

Book version: 0.0.1

The cover background is a photo of the Mona Lisa. It is a portrait of Lisa Gherardini,
painted between 1503 and 1506, by Leonardo da Vinci. The King of France bought it
from Leonardo in about 1530, and it has been on display at the Louvre Museum in Paris
since 1797. The Mona Lisa is considered the best known work of art in the world. Mona Lisa
represents elegance, which we believe is a feature of RISC-V.

Both the print book and ebook were prepared with LaTeX, tex4ht, and Ruby scripts
that use Nokogiri (based on libxml2) to massage the XHTML output and HTTParty to
automatically keep the GitHub Gists and screencast URIs up-to-date in the text. The
necessary Makefiles, style files and most of the scripts are available under the BSD License at
http://github.com/armandofox/latex2ebook.
Arthur Klepchukov designed the covers and graphics for all versions.

Publisher’s Cataloging-in-Publication

Names: Patterson, David A. | Waterman, Andrew, 1986-

Title: The RISC-V reader: an open architecture atlas / David Patterson and Andrew Waterman.
Description: Beta edition, 0.0.1. | [Berkeley, California] : Strawberry Canyon LLC, 2017. |
Includes bibliographical references and index.
Identifiers: ISBN 978-0-9992491-0-9
Subjects: LCSH: Computer architecture. | RISC microprocessors. |
Assembly languages (Electronic computers)
Classification: LCC QA76.9.A73 P38 2017 | DDC 004.22--dc23
Dedication

David Patterson dedicates this book to his parents:

—To my father David, from whom I inherited inventiveness, athleticism, and the courage to fight for what is right; and

—To my mother Lucie, from whom I inherited intelligence, optimism, and my temperament.

Thank you for being such great role models, which taught me what it means to be a good spouse, parent, and grandparent.

Andrew Waterman dedicates this book to his parents, John and Elizabeth, who have been enormously supportive, even while thousands of miles away.

About the Authors
David Patterson retired after 40 years as a Professor of Computer Science at UC Berkeley in
2016, and then joined Google as a distinguished engineer. He also serves as Vice-Chair of the
Board of Directors of the RISC-V Foundation. In the past, he was named Chair of Berkeley’s
Computer Science Division and was elected to be Chair of the Computing Research
Association and President of the Association for Computing Machinery. In the 1980s, he led four
generations of Reduced Instruction Set Computer (RISC) projects, which inspired Berkeley’s
latest RISC to be named “RISC Five.” Along with Andrew Waterman, he was one of the four
architects of RISC-V. Beyond RISC, his best-known research projects are Redundant Arrays
of Inexpensive Disks (RAID) and Networks of Workstations (NOW). This research led to
many papers, 7 books, and more than 35 honors, including election to the National Academy
of Engineering, the National Academy of Sciences, and the Silicon Valley Engineering Hall
of Fame as well as being named a Fellow of the Computer History Museum, ACM, IEEE, and
both AAAS organizations. His teaching awards include the Distinguished Teaching Award
(UC Berkeley), the Karlstrom Outstanding Educator Award (ACM), the Mulligan Education
Medal (IEEE), and the Undergraduate Teaching Award (IEEE). He also won Textbook
Excellence Awards (“Texty”) from the Text and Academic Authors Association for a computer
architecture book and for a software engineering book. He received all his degrees from
UCLA, which awarded him an Outstanding Engineering Academic Alumni Award. He grew
up in Southern California, and for fun he plays soccer and rides bikes with his sons and walks
on the beach with his wife. Originally high-school sweethearts, they celebrated their 50th
wedding anniversary a few days after the Beta edition was published.

Andrew Waterman serves as SiFive’s Chief Engineer and co-founder. SiFive was founded
by the creators of the RISC-V architecture to provide low-cost custom chips based on
RISC-V. He received his PhD in Computer Science from UC Berkeley, where, weary of the vagaries
of existing instruction set architectures, he co-designed the RISC-V ISA and the first RISC-V
microprocessors. Andrew is one of the main contributors to the open-source RISC-V-based
Rocket chip generator, the Chisel hardware construction language, and the RISC-V ports of
the Linux operating system kernel and the GNU C Compiler and C Library. He also has an
MS from UC Berkeley, which was the basis of the RVC extension for RISC-V, and a BSE
from Duke University.

Quick Contents

RISC-V Reference Card
List of Figures
Preface
1 Why RISC-V?
2 RV32I: RISC-V Base Integer ISA
3 RISC-V Assembly Language
4 RV32M: Multiply and Divide
5 RV32FD: Single/Double Floating Point
6 RV32A: Atomic
7 RV32C: Compressed Instructions
8 RV32V: Vector
9 RV64: 64-bit Address Instructions
10 RV32/64 Privileged Architecture
11 Future RISC-V Optional Extensions
Appendix A RISC-V Instruction Listings
Index
Contents

List of Figures

Preface

1 Why RISC-V?
1.1 Introduction
1.2 Modular vs. Incremental ISAs
1.3 ISA Design 101
1.4 An Overview of this Book
1.5 Concluding Remarks
1.6 To Learn More

2 RV32I: RISC-V Base Integer ISA
2.1 Introduction
2.2 RV32I Instruction formats
2.3 RV32I Registers
2.4 RV32I Integer Computation
2.5 RV32I Loads and Stores
2.6 RV32I Conditional Branch
2.7 RV32I Unconditional Jump
2.8 RV32I Miscellaneous
2.9 Comparing RV32I, ARM-32, MIPS-32, and x86-32
2.10 Concluding Remarks
2.11 To Learn More

3 RISC-V Assembly Language
3.1 Introduction
3.2 Calling convention
3.3 Assembly
3.4 Linker
3.5 Static vs. Dynamic Linking
3.6 Loader
3.7 Concluding Remarks
3.8 To Learn More

4 RV32M: Multiply and Divide
4.1 Introduction
4.2 Concluding Remarks
4.3 To Learn More

5 RV32FD: Single/Double Floating Point
5.1 Introduction
5.2 Floating-Point Registers
5.3 Floating-Point Loads, Stores, and Arithmetic
5.4 Floating-Point Moves and Converts
5.5 Miscellaneous Floating-Point Instructions
5.6 Comparing RV32FD, ARM-32, MIPS-32, and x86-32 using DAXPY
5.7 Concluding Remarks
5.8 To Learn More

6 RV32A: Atomic
6.1 Introduction
6.2 Concluding Remarks
6.3 To Learn More

7 RV32C: Compressed Instructions
7.1 Introduction
7.2 Comparing RV32GC, Thumb-2, microMIPS, and x86-32
7.3 Concluding Remarks
7.4 To Learn More

8 RV32V: Vector
8.1 Introduction
8.2 Vector Computation Instructions
8.3 Vector Registers and Dynamic Typing
8.4 Vector Loads and Stores
8.5 Parallelism During Vector Execution
8.6 Conditional Execution of Vector Operations
8.7 Miscellaneous Vector Instructions
8.8 Vector Example: DAXPY in RV32V
8.9 Comparing RV32V, MIPS-32 MSA SIMD, and x86-32 AVX SIMD
8.10 Concluding Remarks
8.11 To Learn More

9 RV64: 64-bit Address Instructions
9.1 Introduction
9.2 Comparison to Other 64-bit ISAs using Insertion Sort
9.3 Program size
9.4 Concluding Remarks
9.5 To Learn More

10 RV32/64 Privileged Architecture
10.1 Introduction
10.2 Machine Mode for Simple Embedded Systems
10.3 Machine-Mode Exception Handling
10.4 User Mode and Process Isolation in Embedded Systems
10.5 Supervisor Mode for Modern Operating Systems
10.6 Page-Based Virtual Memory
10.7 Concluding Remarks
10.8 To Learn More

11 Future RISC-V Optional Extensions
11.1 “B” Standard Extension for Bit Manipulation
11.2 “E” Standard Extension for Embedded
11.3 “H” Privileged Architecture Extension for Hypervisor Support
11.4 “J” Standard Extension for Dynamically Translated Languages
11.5 “L” Standard Extension for Decimal Floating-Point
11.6 “N” Standard Extension for User-Level Interrupts
11.7 “P” Standard Extension for Packed-SIMD Instructions
11.8 “Q” Standard Extension for Quad-Precision Floating-Point
11.9 Concluding Remarks

A RISC-V Instruction Listings

Index

List of Figures

1.1 The corporate members of the RISC-V Foundation . . . . . . . . . . . . . . 3

1.2 Growth of x86 instruction set over its lifetime. . . . . . . . . . . . . . . . . . 3
1.3 Description of the x86-32 ASCII Adjust after Addition (aaa) instruction. . . . 4
1.4 An 8-inch diameter wafer of RISC-V dies designed by SiFive. . . . . . . . . 6
1.5 Relative program sizes for RV32G, ARM-32, x86-32, RV32C, and Thumb-2. 9
1.6 Number of pages and words of ISA manuals . . . . . . . . . . . . . . . . . . 12

2.1 Diagram of the RV32I instructions. . . . . . . . . . . . . . . . . . . . . . . . 15

2.2 RISC-V instruction formats. . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 RV32I opcode map has instruction layout, opcodes, format type, and names. . 16
2.4 The registers of RV32I. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Insertion Sort in C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Number of instructions and code size for Insertion Sort for these ISAs. . . . . 23
2.7 Lessons that RISC-V architects learned from past instruction set mistakes. . . 25
2.8 RV32I code for Insertion Sort in Figure 2.5. . . . . . . . . . . . . . . . . . . 27
2.9 ARM-32 code for Insertion Sort in Figure 2.5. . . . . . . . . . . . . . . . . . 28
2.10 MIPS-32 code for Insertion Sort in Figure 2.5. . . . . . . . . . . . . . . . . . 29
2.11 x86-32 code for Insertion Sort in Figure 2.5. . . . . . . . . . . . . . . . . . . 30

3.1 Steps of translation from C source code to a running program. . . . . . . . . 33

3.2 Assembler mnemonics for RISC-V integer and floating-point registers. . . . . 34
3.3 32 RISC-V pseudoinstructions that rely on x0, the zero register. . . . . . . . 36
3.4 28 RISC-V pseudoinstructions that are independent of x0, the zero register. . 37
3.5 Hello World program in C (hello.c). . . . . . . . . . . . . . . . . . . . . . 38
3.6 Hello World program in RISC-V assembly language (hello.s). . . . . . . . 38
3.7 Hello World program in RISC-V machine language (hello.o). . . . . . . . 38
3.8 Hello World program as RISC-V machine language program after linking. . . 39
3.9 Common RISC-V assembler directives. . . . . . . . . . . . . . . . . . . . . 39
3.10 RV32I allocation of memory to program and data. . . . . . . . . . . . . . . . 40

4.1 Diagram of the RV32M instructions. . . . . . . . . . . . . . . . . . . . . . . 44

4.2 RV32M opcode map has instruction layout, opcodes, format type, and names. 45
4.3 RV32M code to divide by a constant by multiplying. . . . . . . . . . . . . . 45
5.1 Diagram of the RV32F and RV32D instructions. . . . . . . . . . . . . . . . . 49
5.2 RV32F opcode map has instruction layout, opcodes, format type, and names. 50
5.3 RV32D opcode map has instruction layout, opcodes, format type, and names. 51
5.4 The floating-point registers of RV32F and RV32D. . . . . . . . . . . . . . . 52
5.5 Floating-point control and status register. . . . . . . . . . . . . . . . . . . . 53
5.6 RV32F and RV32D conversion instructions. . . . . . . . . . . . . . . . . . . 54
5.7 The floating-point intensive DAXPY program in C. . . . . . . . . . . . . . . 55
5.8 Number of instructions and code size of DAXPY for four ISAs. . . . . . . . 55
5.9 RV32D code for DAXPY in Figure 5.7. . . . . . . . . . . . . . . . . . . . . 57
5.10 ARM-32 code for DAXPY in Figure 5.7. . . . . . . . . . . . . . . . . . . . 57
5.11 MIPS-32 code for DAXPY in Figure 5.7. . . . . . . . . . . . . . . . . . . . 58
5.12 x86-32 code for DAXPY in Figure 5.7. . . . . . . . . . . . . . . . . . . . . . 58

6.1 Diagram of the RV32A instructions. . . . . . . . . . . . . . . . . . . . . . . 60

6.2 RV32A opcode map has instruction layout, opcodes, format type, and names. 61
6.3 Two examples of synchronization. . . . . . . . . . . . . . . . . . . . . . . . 62

7.1 Diagram of the RV32C instructions. . . . . . . . . . . . . . . . . . . . . . . 65

7.2 Instructions and code size for Insertion Sort and DAXPY for compressed ISAs. 66
7.3 RV32C code for Insertion Sort. . . . . . . . . . . . . . . . . . . . . . . . . 68
7.4 RV32DC code for DAXPY. . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.5 RV32C opcode map (bits[1 : 0] = 01) lists layout, opcodes, format, and names. 69
7.6 RV32C opcode map (bits[1 : 0] = 00) lists layout, opcodes, format, and names. 70
7.7 RV32C opcode map (bits[1 : 0] = 10) lists layout, opcodes, format, and names. 70
7.8 Compressed 16-bit RVC instruction formats. . . . . . . . . . . . . . . . . . . 71

8.1 Diagram of the RV32V instructions. . . . . . . . . . . . . . . . . . . . . . . 73

8.2 RV32V encodings of vector register types. . . . . . . . . . . . . . . . . . . . 75
8.3 RV32V code for DAXPY in Figure 5.7. . . . . . . . . . . . . . . . . . . . . 78
8.4 Number of instructions and code size of DAXPY for vector ISAs. . . . . . . 80
8.5 MIPS-32 MSA code for DAXPY in Figure 5.7. . . . . . . . . . . . . . . . . 83
8.6 x86-32 AVX2 code for DAXPY in Figure 5.7. . . . . . . . . . . . . . . . . . 84

9.1 Diagram of the RV64I instructions. . . . . . . . . . . . . . . . . . . . . . . . 87

9.2 Diagrams of the RV64M and RV64A instructions. . . . . . . . . . . . . . . . 87
9.3 Diagram of the RV64F and RV64D instructions. . . . . . . . . . . . . . . . . 88
9.4 Diagram of the RV64C instructions. . . . . . . . . . . . . . . . . . . . . . . 88
9.5 RV64 opcode map of the base instructions and optional extensions. . . . . . . 89
9.6 Number of instructions and code size for Insertion Sort for four ISAs. . . . . 91
9.7 Relative program sizes for RV64G, ARM-64, and x86-64 versus RV64GC. . . 92
9.8 RV64I code for Insertion Sort in Figure 2.5. . . . . . . . . . . . . . . . . . . 95
9.9 ARM-64 code for Insertion Sort in Figure 2.5. . . . . . . . . . . . . . . . . . 96
9.10 MIPS-64 code for Insertion Sort in Figure 2.5. . . . . . . . . . . . . . . . . . 97
9.11 x86-64 code for Insertion Sort in Figure 2.5. . . . . . . . . . . . . . . . . . . 98

10.1 Diagram of the RISC-V privileged instructions. . . . . . . . . . . . . . . . . 100

10.2 RISC-V privileged instruction layout, opcodes, format type, and name. . . . . 101
10.3 RISC-V exception and interrupt causes. . . . . . . . . . . . . . . . . . . . . 102


10.4 The mstatus CSR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

10.5 RISC-V privilege levels and their encoding. . . . . . . . . . . . . . . . . . . 104
10.6 RISC-V code for a simple timer interrupt handler. . . . . . . . . . . . . . . . 106
10.7 A PMP address and configuration register. . . . . . . . . . . . . . . . . . . . 107
10.8 The layout of PMP configurations in the pmpcfg CSRs. . . . . . . . . . . . . 107
10.9 The sstatus CSR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
10.10 An RV32 Sv32 page-table entry (PTE). . . . . . . . . . . . . . . . . . . . . 110
10.11 An RV64 Sv39 page-table entry (PTE). . . . . . . . . . . . . . . . . . . . . 110
10.12 The satp CSR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
10.13 The encoding of the MODE field in the satp CSR. . . . . . . . . . . . . . . 111
10.14 Diagram of the Sv32 address-translation process. . . . . . . . . . . . . . . . 112
10.15 Machine interrupt registers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
10.16 Supervisor interrupt registers. . . . . . . . . . . . . . . . . . . . . . . . . . . 113
10.17 Machine and supervisor trap-vector base-address register (mtvec and stvec) CSRs. . . 114
10.18 Machine and supervisor cause (mcause and scause) CSRs. . . . . . . . . . 114
10.19 The complete algorithm for virtual-to-physical address translation. . . . . . . 115
Preface

Welcome!
RISC-V has been a phenomenon, rapidly growing in popularity since its introduction in 2011.
We thought a slim programmer’s guide would help fuel its ride and encourage newcomers to
understand why it is an attractive instruction set and see how it differs from conventional
instruction set architectures (ISAs) of the past.
Books for other ISAs inspired us, although we hoped that the simplicity of RISC-V would
mean writing much less than the 500+ pages of fine books such as See MIPS Run. At one-
third the overall length, at least by that measure we’ve succeeded. In fact, the ten chapters
that introduce each component of the modular RISC-V instruction set take just 100 pages—
despite averaging nearly one figure per page (75 total)—which makes for quick reading.
After explaining the principles of instruction set design, we show how the RISC-V ar-
chitects learned from the instruction sets of the past 40 years to borrow their good ideas and
avoid their mistakes. ISAs are judged as much by what is omitted as by what is included.
We then introduce each component of this modular architecture in a sequence of chap-
ters. Every chapter has a program in RISC-V assembly language that demonstrates use of
the instructions introduced in that chapter, which makes it easier for the assembly language
programmer to learn RISC-V code. We also often show equivalent programs in ARM, MIPS,
and x86 that highlight the simplicity and cost-energy-performance benefits of RISC-V.
To make the book more fun to read, we include almost 50 sidebars in the page margins
with what we hope are interesting commentaries about the text. We also include about 75
images in the margins to emphasize examples of good ISA design. (Our margins are well-
used!) Finally, for the dedicated reader, we add roughly 25 elaborations throughout the text.
You can delve into these optional sections if you are interested in a topic. These sections
aren’t required to understand the other material in the book, so feel free to skip them if they
don’t catch your interest. For computer architecture buffs, we cite 25 papers and books that
may broaden your horizons. We learned a lot by reading them in order to write this book!

Why So Many Quotes?

We think quotes also make the book more fun to read, so we’ve sprinkled 25 of them throughout
the text. They likewise are an efficient mechanism to pass along wisdom from elders to
novices, and help set cultural standards for good ISA design. We want readers to pick up a bit
of history of the field too, which is why we feature quotes from famous computer scientists
and engineers throughout the text.

Introduction and Reference

We intend this slim book to work as both an introduction and a reference to RISC-V for
students and embedded systems programmers interested in writing RISC-V code. This book
assumes readers have seen at least one instruction set beforehand. If not, you might want to
browse our related introductory architecture book based on RISC-V: Computer Organization
and Design RISC-V Edition: The Hardware Software Interface.
The compact references in this book include:

• Reference Card – This one page (two sides) condensed description of RISC-V covers
both RV32GCV and RV64GCV, which includes the base and all defined extensions:
RVI, RVM, RVA, RVF, RVD, RVC, and even RVV, even though it is still under devel-
opment.
• Instruction Diagrams – These half-page graphical descriptions of each instruction
extension, which are the first figures of the chapters, list the full names of all RISC-V
instructions in a format that lets you easily see the variations of each instruction. See
Figures 2.1, 4.1, 5.1, 6.1, 7.1, 8.1, 9.1, 9.2, 9.3, and 9.4.
• Opcode Maps – These tables show the instruction layout, opcodes, format type, and
instruction mnemonic for each instruction extension in a fraction of a page. See Fig-
ures 2.3, 3.3, 3.4, 4.2, 5.2, 5.3, 6.2, 7.6, 7.5, 7.7, 9.5, and 10.1. (The instruction
diagrams and opcode maps inspired the use of the word atlas in the book’s subtitle.)
• Instruction Glossary – Appendix A is a thorough description of every RISC-V
instruction and pseudoinstruction.1 It includes everything: the operation name
and operands, an English description, a register-transfer language definition, which
RISC-V extension it is in, the full name of the instruction, the instruction format, a
diagram of the instruction showing the opcodes, and references to compact versions of
the instruction. Amazingly, this all fits into less than 50 pages.
• Index – It helps you find the page that describes the instruction explanation, definition,
or diagram either by the full name or by mnemonic. It is organized like a dictionary.

Errata and Supplementary Content

We intend to collect the Errata together and release updates a few times a year. The book’s
website shows the latest version of the book and a brief description of the changes since the
previous version. Previous errata can be reviewed, and new ones reported, on the book’s
website (www.riscvbook.com). We apologize in advance for the problems you find in this
edition, and look forward to your feedback on how to improve this material.

1 The committee defining RV32V did not complete their work in time for the Beta edition, so we omit those
instructions from Appendix A. Chapter 8 is our best guess of what RV32V will be, although it is likely to change a
little.

History of this Book

At the Sixth RISC-V Workshop held May 8–11, 2017 in Shanghai, we saw the need for
such a book. We started a few weeks later. Given Patterson’s much greater experience in
writing books, the plan was for him to write most chapters. Both of us collaborated on the
organization and were first reviewers for each other’s chapters. Patterson authored Chapters 1,
2, 3, 4, 5, 6, 7, 8, 9, 11, the Reference Card, and this Preface, while Waterman wrote Chapter 10
and Appendix A—the largest section of the book—and coded all the programs in the book.
Waterman also maintained the LaTeX pipeline from Armando Fox that let us produce the
book.
We offered a Beta edition of the textbook for 800 UC Berkeley students in the Fall
semester 2017. After incorporating their feedback, the first edition should be published in
time for the Seventh RISC-V Workshop in Silicon Valley from November 28–30, 2017.
RISC-V was a byproduct of a Berkeley research project1 that was developing technology
to make it easier to build parallel hardware and software.

Acknowledgments
We wish to thank Armando Fox for use of his LaTeX pipeline and advice on navigating the
world of self publishing.
Our deepest thanks go to the people who read early drafts of the book and offered helpful
suggestions: Krste Asanović, Nikhil Athreya, C. Gordon Bell, Stuart Hoad, David Kanter,
John Mashey, Ivan Sutherland, Ted Speers, Michael Taylor, Megan Wachs, ... .
Finally, we thank the hundreds of UC Berkeley students for their debugging help and their
continuing interest in this material!

David Patterson and Andrew Waterman

September 1, 2017
Berkeley, California
1 Why RISC-V?

Simplicity is the ultimate sophistication.
-- Leonardo da Vinci

Sidebar: Leonardo da Vinci (1452-1519) was a Renaissance architect, engineer, sculptor, and
painter of the Mona Lisa.

1.1 Introduction
The goal for RISC-V (“RISC five”) is to become a universal instruction set architecture (ISA):
• It should suit all sizes of processors, from the tiniest embedded controller to the fastest
high-performance computer.
• It should work well with a wide variety of popular software stacks and programming
languages.
• It should accommodate all implementation technologies: Field-Programmable Gate
Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), full-custom chips,
and even future device technologies.
• It should be efficient for all microarchitecture styles: microcoded or hardwired control;
in-order, decoupled, or out-of-order pipelines; single or superscalar instruction issue;
and so on.
• It should support extensive specialization to act as a base for customized accelerators,
which rise in importance as Moore’s Law fades.
• It should be stable, in that the base ISA should not change. More importantly, the ISA
cannot be discontinued, as has happened in the past to proprietary ISAs such as the
AMD Am29000, the Digital Alpha, the Digital VAX, the Hewlett Packard PA-RISC,
the Intel i860, the Intel i960, the Motorola 88000, and the Zilog Z8000.

RISC-V is unusual not only because it is a recent ISA (born this decade, when most
alternatives date from the 1970s or 1980s) but also because it is an open ISA. Unlike practically
all prior architectures, its future is free from the fate or the whims of any single corporation,
which has doomed many ISAs in the past. It belongs instead to an open, non-profit foundation.
The goal of the RISC-V Foundation is to maintain the stability of RISC-V, evolve it
slowly and carefully, solely for technical reasons, and try to make it as popular for hardware
as Linux is for operating systems. As a sign of its vitality, Figure 1.1 lists the largest corporate
members of the RISC-V Foundation.

Sidebar: We add sidebars in the margins to offer hopefully interesting commentary. For example,
RISC-V was originally developed for internal use in UC Berkeley research and courses. It became
open because outsiders started using it on their own. The RISC-V architects learned about the
external interest when they started receiving complaints about ISA changes in their coursework,
which was on the web. Only after the architects understood the need did they try to make it an
open ISA standard.

>$50B            | >$5B, <$50B             | >$0.5B, <$5B
Google, USA      | BAE Systems, UK         | AMD, USA
Huawei, China    | MediaTek, Taiwan        | Andes Technology, China
IBM, USA         | Micron Tech., USA       | C-SKY Microsystems, China
Microsoft, USA   | Nvidia, USA             | Integrated Device Tech., USA
Samsung, Korea   | NXP Semi., Netherlands  | Mellanox Technology, Israel
                 | Qualcomm, USA           | Microsemi Corp., USA
                 | Western Digital, USA    |

Figure 1.1: The corporate members of the RISC-V Foundation as of the Sixth RISC-V Workshop in May
2017 ranked by annual sales. The left column companies all exceed $US 50B in annual sales, the middle
column companies sell less than $US 50B but more than $US 5B, and the sales of those in the right column
are less than $US 5B but more than $US 0.5B. The foundation includes another 25 smaller companies, 5
startup companies (Antmicro Ltd, Blockstream, Esperanto Technologies, Greenwaves Technologies, and
SiFive), 4 nonprofit organizations (CSEM, Draper Laboratory, ICT, and lowRISC), and 6 universities (ETH
Zurich, IIT Madras, National University of Defense Technology, Princeton, and UC Berkeley). Most of the
60 organizations have their headquarters outside the US. To learn more, see www.riscv.org.

[Figure 1.2 is a bar chart of the number of x86 instructions (y-axis, 0 to 1600) by year
(x-axis, 1978 to 2014). The bar labels, in increasing order, are 80, 140, 145, 162, 166, 223,
293, 437, 446, 500, 670, 1048, and 1338.]

Figure 1.2: Growth of x86 instruction set over its lifetime. x86 started with 80 instructions in 1978. It grew
16X to 1338 instructions by 2015, and it’s still growing. Amazingly, this graph is conservative. An Intel blog
puts the count at 3600 instructions in 2015 [Rodgers and Uhlig 2017], which would raise the x86 rate to one
new instruction every four days between 1978 and 2015. We count assembly language instructions, and they
presumably count machine language instructions. As Chapter 8 explains, a large part of the growth is
because the x86 ISA relies on SIMD instructions for data level parallelism.

The AL register is the default source and destination.

If the low 4-bits of AL register are > 9,
or the auxiliary carry flag AF = 1,
Then
Add 6 to low 4-bits of AL and discard overflow
Increment AH (the high byte of AX)
Carry flag CF = 1
Auxiliary carry flag AF = 1
Else
CF = AF = 0
Upper 4-bits of AL = 0

Figure 1.3: Description of the x86-32 ASCII Adjust after Addition (aaa) instruction. It performs computer
arithmetic in Binary Coded Decimal (BCD), which has fallen into the dustbin of information technology
history. The x86 also has three related instructions for subtraction (aas), multiplication (aam), and division
(aad). As each is a one-byte instruction, they collectively occupy 1.6% (4/256) of the precious opcode space.
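
The description in Figure 1.3 can be sketched in a few lines of Python. This is an illustrative model rather than a reference implementation: the function name aaa and its tuple return value are our own choices, and we assume, following Intel's documented behavior for 32-bit processors, that the carry out of the low digit goes into AH (the high byte of AX) and that the upper 4 bits of AL are cleared on both paths.

```python
def aaa(ax, af):
    """Model of x86-32 ASCII Adjust after Addition on the 16-bit AX register.

    ax: 16-bit value (AH is the high byte, AL is the low byte)
    af: auxiliary carry flag before the instruction (0 or 1)
    Returns (ax, af, cf) as they stand after the instruction.
    """
    if (ax & 0x0F) > 9 or af:
        # AX + 0x106 adds 6 to the low byte and 1 to the high byte.
        ax = (ax + 0x106) & 0xFFFF
        af = cf = 1
    else:
        af = cf = 0
    ax &= 0xFF0F  # keep AH, clear the upper 4 bits of AL
    return ax, af, cf


# Adding the ASCII digits '7' (0x37) and '5' (0x35) leaves 0x6C in AL.
ax, af, cf = aaa(0x006C, af=0)
print(hex(ax), af, cf)
```

After the adjustment, AH holds 1 and AL holds 2, the two decimal digits of 7 + 5 = 12. A dozen lines of special-case logic for a rarely used operation is a fair picture of what every x86-32 implementation must still carry.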

1.2 Modular vs. Incremental ISAs

Intel was betting its future on a high-end microprocessor, but that was still years away.
To counter Zilog, Intel developed a stop-gap processor and called it the 8086. It was
intended to be short-lived and not have any successors, but that’s not how things turned
out. The high-end processor ended up being late to market, and when it did come out,
it was too slow. So the 8086 architecture lived on—it evolved into a 32-bit processor
and eventually into a 64-bit one. The names kept changing (80186, 80286, i386, i486,
Pentium), but the underlying instruction set remained intact.
—Stephen P. Morse, architect of the 8086 [Morse 2017]

The conventional approach to computer architecture is incremental ISAs, where new proces-
sors must implement not only new ISA extensions but also all extensions of the past. The
purpose is to maintain backwards binary-compatibility so that binary versions of decades-old
programs can still run correctly on the latest processor. This requirement, when combined
with the marketing appeal of announcing new instructions with a new generation of proces-
sors, has led to ISAs that grow substantially in size with age. For example, Figure 1.2 shows
the growth in the number of instructions for a dominant ISA today: the 80x86. It dates back
to 1978, yet it has added about three instructions per month over its long lifetime.
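
The growth rates quoted here and in the caption of Figure 1.2 are simple arithmetic on the instruction counts. The sketch below reproduces them; the variable names are ours, and using 365.25 days per year is an assumption made only to convert years into days.

```python
# Figures quoted in the text and in the caption of Figure 1.2.
FIRST_YEAR, LAST_YEAR = 1978, 2015
START_COUNT = 80          # instructions in the original x86
COUNT_2015 = 1338         # this book's count of assembly language instructions
INTEL_BLOG_COUNT = 3600   # count cited from [Rodgers and Uhlig 2017]

years = LAST_YEAR - FIRST_YEAR
per_month = (COUNT_2015 - START_COUNT) / (years * 12)
days_per_instruction = (years * 365.25) / (INTEL_BLOG_COUNT - START_COUNT)
growth_factor = COUNT_2015 / START_COUNT

print(f"{per_month:.1f} new instructions per month")
print(f"one new instruction every {days_per_instruction:.1f} days (Intel count)")
print(f"{growth_factor:.1f}x growth")
```

The results round to about three instructions per month, roughly one instruction every four days under the larger Intel count, and a little over 16X growth overall.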
This convention means that every implementation of the x86-32 (the name we use for
the 32-bit address version of x86) must implement the mistakes of past extensions, even
when they no longer make sense. For example, Figure 1.3 describes the ASCII Adjust after
Addition (aaa) instruction of the x86, which has long outlived its usefulness.
As an analogy, suppose a restaurant serves only a fixed-price meal, which starts out as a
small dinner of just a hamburger and a milkshake. Over time, it adds fries, and then an ice
cream sundae, followed by salad, pie, wine, vegetarian pasta, steak, beer, ad infinitum until
it becomes a gigantic banquet. It may make little sense in total, but diners can find whatever
they’ve ever eaten in a past meal at that restaurant. The bad news is that diners must pay the
rising cost of the expanding banquet for each dinner.
Beyond being recent and open, RISC-V is unusual since, unlike almost all prior ISAs, it is
modular. At the core is a base ISA, called RV32I, which runs a full software stack. RV32I is
frozen and will never change, which gives compiler writers, operating system developers, and
assembly language programmers a stable target. The modularity comes from optional stan-
dard extensions that hardware can include or not depending on the needs of the application.
This modularity enables very small and low energy implementations of RISC-V, which can
be critical for embedded applications. When the RISC-V compiler is told which extensions
are included, it can generate the best code for that hardware. The convention is to append
the extension letters to the name to indicate which are included. For example, RV32IMFD
adds the multiply (RV32M), single-precision floating point (RV32F), and double-precision
floating point extensions (RV32D) to the mandatory base instructions (RV32I).

Sidebar: If software uses an omitted RISC-V instruction from an optional extension, the
hardware traps and executes the desired function in software as part of a standard library.

Returning to our analogy, RISC-V offers a menu instead of a buffet; the chef need cook
only what the customers want, not a feast for every meal, and the customers pay only for
what they order. RISC-V has no need to add instructions simply for the marketing sizzle. The
RISC-V Foundation decides when to add a new option to the menu, and they will do so only
for solid technical reasons after an extended open discussion by a committee of hardware and
software experts. Even when new choices appear on the menu, they remain optional and do not
become a new requirement for all future implementations, as they would in an incremental ISA.
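
The naming convention described above (for example, RV32IMFD) is regular enough to decode mechanically. The sketch below is a hypothetical helper, not part of any RISC-V tool: the name parse_isa_name is ours, it knows only the extension letters mentioned in this book, and it treats G as shorthand for IMAFD.

```python
import re

# Extension letters named in this book; "G" is shorthand for IMAFD.
KNOWN = "IMAFDCV"


def parse_isa_name(name):
    """Split a name such as 'RV32IMFD' into (address width, set of letters)."""
    match = re.fullmatch(r"RV(32|64)([A-Z]+)", name)
    if match is None:
        raise ValueError(f"not an RV32/RV64 name: {name!r}")
    width = int(match.group(1))
    letters = match.group(2).replace("G", "IMAFD")
    unknown = set(letters) - set(KNOWN)
    if unknown:
        raise ValueError(f"unknown extension letters: {sorted(unknown)}")
    return width, set(letters)


width, extensions = parse_isa_name("RV32IMFD")
print(width, sorted(extensions))
```

A compiler driver could use a decoded set like this to decide which instructions it may emit, which is the sense in which telling the compiler the extensions lets it generate the best code for the hardware.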

1.3 ISA Design 101

Before introducing the RISC-V ISA, it will be helpful to understand the underlying principles
and trade-offs that a computer architect must make while designing an ISA. Below is a list of
the seven measures, along with icons we’ll put in page margins to highlight instances when
RISC-V addresses them in the following chapters. (The back cover of the print book has a
legend for the icons.)
• cost (US dollar coin icon)
• simplicity (wheel)
• performance (speedometer)

• isolation of architecture from implementation (detached halves of a circle)

• room for growth (accordion)
• program size (opposing arrows compressing line)
• ease of programming / compiling / linking (children’s blocks for “as easy as ABC”).

To illustrate what we mean, in this section we’ll show some choices from older ISAs that
look unwise in retrospect and where RISC-V often made much better decisions.
Cost. Processors are implemented as integrated circuits, commonly called chips or dies.
They are called dies because they start life as a piece of a single round wafer, which is diced
into many individual pieces. Figure 1.4 shows a wafer of RISC-V processors. The cost is
very sensitive to the area of the die:

$$\text{cost} \approx f(\text{die area}^2)$$

Obviously, the smaller the die, the more dies per wafer, and most of the cost of the die is
the processed wafer itself. Less obvious is that the smaller the die, the higher the yield, the
fraction of manufactured dies that work. The reason is that the silicon manufacturing will
result in small flaws scattered about the wafer, so the smaller the die, the lower the fraction
that will be flawed.

Figure 1.4: An 8-inch diameter wafer of RISC-V dies designed by SiFive. It has two types of RISC-V dies
using an older, larger processing line. An FE310 die is 2.65 mm×2.72 mm and a SiFive test die is
2.89 mm×2.72 mm. The wafer contains 1846 of the former and 1866 of the latter, totaling 3712 chips.

An architect wants to keep the ISA simple to shrink the size of processors that implement
it. As we shall see in the following chapters, the RISC-V ISA is much simpler
than the ARM-32 ISA. As a concrete example of the impact of simplicity, let’s compare
a RISC-V Rocket processor to an ARM-32 Cortex-A5 processor in the same technology
(TSMC40GPLUS) using the same-sized caches (16 KiB). The RISC-V die is 0.27 mm² versus
0.53 mm² for ARM-32. At around twice the area, the ARM-32 Cortex-A5 die costs approximately
4X (2²) as much as the RISC-V Rocket die. Even a 10% smaller die reduces cost by a
factor of 1.2 (1.1²).
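
The cost comparison above is a short calculation. The sketch below assumes the rule of thumb is exactly quadratic in die area, which is a simplification of the f(die area²) relation, and the helper name relative_cost is ours.

```python
def relative_cost(area_ratio):
    """Cost ratio implied by the rule of thumb cost ~ (die area)^2."""
    return area_ratio ** 2


rocket_mm2 = 0.27      # RISC-V Rocket die area from the text
cortex_a5_mm2 = 0.53   # ARM-32 Cortex-A5 die area from the text

area_ratio = cortex_a5_mm2 / rocket_mm2
print(f"area ratio: {area_ratio:.2f}")
print(f"cost ratio: {relative_cost(area_ratio):.2f}")
print(f"10% difference in area: {relative_cost(1.10):.2f}x the cost")
```

Under that assumption the area ratio of just under 2 gives a cost ratio of just under 4, and a 10% difference in area gives a factor of about 1.2.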
Simplicity. Given the cost sensitivity to complexity, architects want a simple ISA to
reduce die area. Simplicity also reduces chip design time and verification time, which can be
much of the cost of development of the chip. These costs must be added to the cost of the
chip, with this overhead dependent on the number of chips shipped. Simplicity also reduces
the cost of documentation and the difficulty of getting customers to understand how to use
the ISA.

Sidebar: High-end processors can gain performance by combining simple instructions together
without burdening all lower-end implementations with a larger, more complicated ISA. This
technique is called macrofusion, as it fuses “macro” instructions together.

Below is a glaring example of ISA complexity from ARM-32:

    ldmiaeq SP!, {R4-R7, PC}

The instruction stands for LoaD Multiple, Increment-Address, on EQual. It performs 5 data
loads and writes to 6 registers but executes only if the EQ condition code is set. Moreover, it
writes a result to the PC, so it is also performing a conditional branch. Quite a handful!
Ironically, simple instructions are much more likely to be used than complex ones. For
example, x86-32 includes an enter instruction, which was intended to be the first instruction
executed on entering a procedure to create a stack frame for it (see Chapter 3). Most compilers
instead use only these two simple x86-32 instructions:

    push ebp        # Push the frame pointer onto the stack
    mov ebp, esp    # Copy the stack pointer to the frame pointer

Sidebar: A simple processor can be helpful for embedded applications since it is easier to
predict execution time. Assembly-language programmers of microcontrollers often want to
maintain exact timing, so they rely on code taking a predictable number of clock cycles that
they can count by hand.

Performance. Except for the tiny chips for embedded applications, architects are typically
concerned about performance as well as cost. Performance can be factored into three
terms:

$$\frac{\text{instructions}}{\text{program}} \times \frac{\text{average clock cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{clock cycle}} = \frac{\text{time}}{\text{program}}$$

Sidebar: The last factor is the inverse of the clock rate, so a 1 GHz clock rate means the time
per clock cycle is 1 ns (1/10⁹).

Even if a simple ISA might execute more instructions per program than a complex ISA, it
can more than make up for that by having a faster clock cycle or fewer average clock cycles
per instruction (CPI).
For example, for the CoreMark benchmark [Gal-On and Levy 2012] (100,000 iterations),
the performance on the ARM-32 Cortex-A9 is

$$\frac{32.27\text{ B instructions}}{\text{program}} \times \frac{0.79\text{ clock cycles}}{\text{instruction}} \times \frac{0.71\text{ ns}}{\text{clock cycle}} = \frac{18.15\text{ secs}}{\text{program}}$$

For the BOOM implementation of RISC-V, the equation is

$$\frac{29.51\text{ B instructions}}{\text{program}} \times \frac{0.72\text{ clock cycles}}{\text{instruction}} \times \frac{0.67\text{ ns}}{\text{clock cycle}} = \frac{14.26\text{ secs}}{\text{program}}$$

Sidebar: The average number of clock cycles can be less than 1 because the A9 and
BOOM [Celio et al. 2015] are so-called superscalar processors, which execute more than one
instruction per clock cycle.

The ARM processor didn’t execute fewer instructions than RISC-V in this case. As we
shall see, the simple instructions are also the most popular instructions, so ISA simplicity can
win in all metrics. For this program, the RISC-V processor gains nearly 10% in each of the
three factors, which results in a performance advantage of almost 30%. If a simpler ISA also
results in a smaller chip, its cost-performance will be excellent.
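
The three-factor performance equation is easy to evaluate directly. The sketch below plugs in the CoreMark figures quoted above; the function and dictionary names are ours. Because the quoted factors are rounded, their products come out slightly below the totals in the text (18.15 and 14.26 seconds), presumably because the totals were computed from unrounded measurements.

```python
def seconds_per_program(instructions, cpi, cycle_time_s):
    """time/program = instructions/program x cycles/instruction x time/cycle."""
    return instructions * cpi * cycle_time_s


# CoreMark figures quoted in the text (100,000 iterations).
a9 = dict(instructions=32.27e9, cpi=0.79, cycle_time_s=0.71e-9)    # ARM-32 Cortex-A9
boom = dict(instructions=29.51e9, cpi=0.72, cycle_time_s=0.67e-9)  # RISC-V BOOM

t_a9 = seconds_per_program(**a9)
t_boom = seconds_per_program(**boom)

# Ratio of each factor, then of the overall execution time.
for factor in a9:
    print(f"{factor}: A9/BOOM = {a9[factor] / boom[factor]:.3f}")
print(f"A9 {t_a9:.2f} s, BOOM {t_boom:.2f} s, ratio {t_a9 / t_boom:.2f}")
```

Printing the per-factor ratios alongside the total shows how modest gains in each term multiply into the overall advantage.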
Isolation of Architecture from Implementation. The original distinction between ar-
chitecture and implementation, which goes back to the 1960s, is that architecture is what a
machine language programmer needs to know to write a correct program, but not the perfor-
mance of that program. The temptation for an architect is to include instructions in an ISA
that help performance or cost of one implementation at a particular time, but burden different
or future implementations.
For the MIPS-32 ISA, the regrettable example was the delayed branch. Conditional
branches cause problems in pipelined execution because the processor wants to have the next
instruction to execute already in the pipeline, but it can’t decide whether it wants the next
sequential one (if the branch isn’t taken) or the one at the branch target address (if it is taken).
For the first MIPS microprocessor, with a 5-stage pipeline, this indecision could have caused a one
clock-cycle stall of the pipeline. MIPS-32 solved this problem by redefining branch to occur
in the instruction after the next one. Thus, the following instruction is always executed. The
job of the programmer or compiler writer was to put something useful into the delay slot.
Alas, this “solution” didn’t help later MIPS-32 processors with many more pipeline stages
(hence many more instructions fetched before the branch outcome is computed), but it made
life harder for MIPS-32 programmers, compiler writers, and processor designers ever after,
since incremental ISAs demand backwards compatibility (see Section 1.2). In addition, it
makes the MIPS-32 code much harder to understand (see Figure 2.10 on page 29).

Sidebar: Pipelined processors today anticipate branch outcomes using hardware predictors,
which can exceed 90% accuracy and work with any pipeline length. They only need a
mechanism to flush and restart the pipeline when they mispredict.

While architects shouldn’t put in features that help just one implementation at a point in
time, they also shouldn’t put in features that hinder some implementations. For example,
ARM-32 and some other ISAs have a Load Multiple instruction, as mentioned on the previ-
ous page. These instructions can improve performance of single-instruction issue pipelined
designs, but hurt multiple-instruction issue pipelines. The reason is that the straightforward
implementation precludes scheduling the individual loads of a Load Multiple in parallel with
other instructions, reducing instruction throughput of such processors.
Room for Growth. With the ending of Moore’s Law, the only path forward for major
improvements in cost-performance is to add custom instructions for specific domains, such as
deep learning, augmented reality, combinatorial optimization, graphics, and so on. That means
it’s important today for an ISA to reserve opcode space for future enhancements.
In the 1970s and 1980s, when Moore’s Law was in full force, there was little thought
of saving opcode space for future accelerators. Architects instead valued larger address and
immediate fields to reduce the number of instructions executed per program, the first factor
in the performance equation on the prior page.
An example of the impact of paucity of opcode space was when the architects of ARM-32
later tried to reduce code size by adding 16-bit length instructions to the formerly uniform
32-bit length ISA. There was simply no room left. Thus, the only solution was to create a new
ISA first with 16-bit instructions (Thumb) and later a new ISA with both 16-bit and 32-bit
instructions (Thumb-2) using a mode bit to switch between ARM ISAs. To change modes,
the programmer or compiler branches to a byte address with a 1 in the least-significant bit,
which worked because 16-bit and 32-bit instructions should have 0 in that bit.

Sidebar: The ARM-32 instruction ldmiaeq mentioned above is even more complicated, since
when it branches it can also change instruction set modes between ARM-32 and
Thumb/Thumb-2.
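
The mode-switching trick can be pictured as a small decoding step on the branch target. This is a simplified sketch of the idea only, with a hypothetical function name; real ARM hardware has further rules about which branch instructions may change modes and how addresses are aligned.

```python
def branch_exchange(target):
    """Split a branch target into (instruction address, selected mode).

    Illustrative model only: bit 0 of the target picks the instruction set,
    and the remaining bits give the address of the next instruction.
    """
    mode = "Thumb" if target & 1 else "ARM-32"
    address = target & ~1  # clear the least-significant bit
    return address, mode


print(branch_exchange(0x8001))  # odd target selects Thumb
print(branch_exchange(0x8000))  # even target selects ARM-32
```

The least-significant bit is free to carry the mode precisely because every 16-bit or 32-bit instruction starts at an even byte address.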

[Figure 1.5 is a bar chart of code size relative to RV32GC: RISC-V RV32GC (16b & 32b) 1.00,
RISC-V RV32G (32b) 1.37, ARM Thumb2 (16b & 32b) 0.99, ARM-32 (32b) 1.34, and
Intel x86-32 (variable 8b) 1.26.]

Figure 1.5: Relative program sizes for RV32G, ARM-32, x86-32, RV32C, and Thumb-2. The last two ISAs
are aimed at small code size. The programs were the SPEC CPU2006 benchmarks using the GCC compilers.
The small size advantage of Thumb-2 over RV32C is due to the code size savings of Load and Store Multiple
on procedure entry. RV32C excludes them to maintain the one-to-one mapping to instructions of RV32G,
which omits Load and Store Multiple to reduce implementation complexity for high-end processors (see
below). Chapter 7 explains RV32C. RV32G indicates a popular combination of RISC-V extensions (RV32M,
RV32F, RV32D, and RV32A), properly called RV32IMAFD. [Waterman 2016]
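
The comparisons the chapter draws from Figure 1.5 (ARM-32 and RV32G code 6% to 9% larger than x86-32, and x86-32 code 26% larger than the compressed RISC-V code) follow from the bar heights. The sketch below assumes the values as we read them off the figure, and percent_larger is a hypothetical helper.

```python
# Code size relative to RV32GC, as read from the bars of Figure 1.5.
relative_size = {
    "RV32GC": 1.00,
    "RV32G": 1.37,
    "Thumb-2": 0.99,
    "ARM-32": 1.34,
    "x86-32": 1.26,
}


def percent_larger(a, b):
    """How much larger code for ISA a is than code for ISA b, in percent."""
    return (relative_size[a] / relative_size[b] - 1) * 100


for isa in ("ARM-32", "RV32G"):
    print(f"{isa} vs x86-32: {percent_larger(isa, 'x86-32'):.0f}% larger")
print(f"x86-32 vs RV32GC: {percent_larger('x86-32', 'RV32GC'):.0f}% larger")
```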

Program Size. The smaller the program, the smaller the area on a chip needed for the
program memory, which can be a significant cost for embedded devices. Indeed, that issue
inspired ARM architects to retroactively add shorter instructions in the Thumb and Thumb-2
ISAs. Smaller programs also lead to fewer misses in instruction caches, which saves power
since off-chip DRAM accesses use much more energy than on-chip SRAM accesses, and
improves performance as well. Small code size can be one of the goals of ISA architects.
The x86-32 ISA has instructions as short as 1 byte and as long as 15 bytes. One would
expect that the byte-variable length instructions of the x86 should certainly lead to smaller
programs than ISAs limited to 32-bit length instructions, like ARM-32 and RISC-V. Logically,
8-bit variable length instructions should also be smaller than ISAs that offer only 16-bit
and 32-bit instructions, like Thumb-2 and RISC-V using the RV32C extension (see Chapter 7).
Figure 1.5 shows that, while ARM-32 and RISC-V code is 6% to 9% larger than code
for x86-32 when all instructions are 32 bits long, surprisingly x86-32 is 26% larger than the
compressed versions (RV32C and Thumb-2) that offer both 16-bit and 32-bit instructions.
While a new ISA using 8-bit variable instructions would likely lead to smaller code than
RV32C and Thumb-2, the architects of the first x86 in the 1970s had different concerns.
Moreover, given the requirement of backwards binary-compatibility of an incremental ISA
(Section 1.2), the hundreds of new x86-32 instructions are longer than one might expect,
since they bear the burden of a one- or two-byte prefix to squeeze them into the limited free
opcode space of the original x86.

Sidebar: One example 15-byte x86-32 instruction is
lock add dword ptr ds:[esi+ecx*4+0x12345678], 0xefcdab89. It assembles into (in
hexadecimal): 67 66 f0 3e 81 84 8e 78 56 34 12 89 ab cd ef. The last 8 bytes are 2 addresses
and the first 7 bytes specify atomic memory operation, the add operation, 32-bit data, the
data segment register, the 2 address registers, and scaled indexed addressing mode. An
example 1-byte instruction is inc eax that assembles into 40.
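
The byte string of the 15-byte example can be pulled apart to see where the two 32-bit constants from the assembly text end up. This sketch only slices the quoted bytes; it is not an x86 decoder, and the variable names are ours.

```python
# Bytes of the 15-byte instruction quoted in the sidebar.
encoding = bytes.fromhex("67 66 f0 3e 81 84 8e 78 56 34 12 89 ab cd ef")
assert len(encoding) == 15

leading, trailing = encoding[:7], encoding[7:]

# x86 stores multi-byte values least-significant byte first (little-endian).
first_value = int.from_bytes(trailing[:4], "little")
second_value = int.from_bytes(trailing[4:], "little")

print(leading.hex(" "))
print(hex(first_value), hex(second_value))
```

The two recovered values match the 0x12345678 and 0xefcdab89 written in the assembly language form, leaving seven bytes of prefixes, opcode, and addressing-mode information in front of them.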

Ease of programming, compiling, and linking. Since data in a register is so much
faster to access than data in memory, it is critical for compilers to do a good job at register
allocation. That task is much easier when there are many registers rather than fewer. In
that light, ARM-32 has 16 registers and x86-32 has only 8. Most modern ISAs, including
RISC-V, have a relatively generous 32 integer registers. More registers surely make life
easier for compilers and assembly language programmers.
Another issue for compilers and assembly language programmers is figuring out the speed
of a code sequence. As we shall see, RISC-V instructions are typically at most one clock cy-
cle per instruction (ignoring cache misses), while as we saw earlier both ARM-32 and x86-32
have instructions that take many clock cycles even when everything fits in the cache. More-
over, unlike ARM-32 and RISC-V, x86-32 arithmetic instructions can have operands in mem-
ory instead of requiring all operands to be in registers. Complex instructions and operands in
memory make it difficult for processor designers to deliver performance predictability.
It’s useful for an ISA to support position independent code (PIC), because it supports
dynamic linking (see Section 3.5), since shared library code can reside at different addresses
in different programs. PC-relative branches and data addressing are a boon to PIC. While
nearly all ISAs provide PC-relative branches, x86-32 and MIPS-32 omit PC-relative data
addressing.

Elaboration: ARM-32, MIPS-32, and x86-32

Elaborations are optional sections that readers can delve into if they are interested in a topic,
but you don’t need to read them to understand the rest of the book. For example, our ISA
names aren’t the official ones. The 32-bit-address ARM ISA has many versions, with the
first in 1986 and the latest called ARMv7 in 2005. ARM-32 generally refers to the ARMv7
ISA. MIPS also had many 32-bit versions, but we’re referring to the original, called MIPS I.
(“MIPS32” is a different, later ISA than what we call MIPS-32.) Intel’s first 16-bit address
architecture was the 8086 in 1978, which the 80386 ISA expanded to 32-bit addresses in
1985. Our x86-32 notation generally refers to the IA-32, the 32-bit-address version of its
x86 ISA. Given the myriad variants of these ISAs, we find our nonstandard terminology least
confusing.

1.4 An Overview of this Book

This book assumes you have seen other instruction sets before RISC-V. If not, look at our
related introductory architecture book based on RISC-V [Patterson and Hennessy 2017].
Chapter 2 introduces RV32I, the frozen base integer instructions that are the heart of
RISC-V. Chapter 3 explains the remaining RISC-V assembly language beyond that intro-
duced in Chapter 2, including calling conventions and some clever tricks for linking. As-
sembly language includes all of the proper RISC-V instructions plus some useful instructions
that are outside RISC-V. These pseudoinstructions, which are clever variations of real in-
structions, make it easier to write assembly language programs without having to complicate
the ISA.
The next three chapters explain the standard RISC-V extensions that, when added to
RV32I, we collectively call RV32G (G is for general):

• Chapter 4: Multiply and Divide (RV32M)
• Chapter 5: Floating Point (RV32F and RV32D)
• Chapter 6: Atomic (RV32A)

The RISC-V “reference card” on pages 3 and 4 is a handy summary of all RISC-V instructions
in this book: RV32G, RV64G, and RV32/64V.

Margin note: The reference card is also called the green card because of the shade of the
background color of the one-page cardboard summary of ISAs from the 1960s. We kept the
background white for legibility instead of green for historical accuracy.

Chapter 7 describes the optional compressed extension RV32C, an excellent example of
the elegance of RISC-V. By restricting the 16-bit instructions to be short versions of existing
32-bit RV32G instructions, they are almost free. The assembler can pick the instruction size,
allowing the assembly language programmer and the compiler to be oblivious to RV32C.
The hardware decoder to translate 16-bit RV32C instructions into 32-bit RV32G instructions
needs just 400 gates, which is a few percent of even the simplest implementation of RISC-V.
Chapter 8 introduces RV32V, the vector extension. Vector instructions are another example
of ISA elegance as compared to the numerous, brute-force Single Instruction Multiple
Data (SIMD) instructions of ARM-32, MIPS-32, and x86-32. Indeed, hundreds of the in-
structions added to x86-32 in Figure 1.2 were SIMD, and hundreds more are coming. RV32V
is even simpler than most vector ISAs, as it associates the data type and length with the vec-
tor registers instead of embedding them in the opcodes. RV32V may be the most compelling
reason for switching from a conventional SIMD-based ISA to RISC-V.
Chapter 9 shows the 64-bit address version of RISC-V, RV64G. As the chapter explains,
the RISC-V architects needed only to widen the registers and add a few word, doubleword,
or long versions of RV32G instructions to extend the address from 32 to 64 bits.
Chapter 10 explains the system instructions, showing how RISC-V handles paging and
the Machine, User, and Supervisor privilege modes.
The last chapter gives a quick description of the remaining extensions that are currently
under consideration by the RISC-V Foundation.
Next comes the largest section of the book, Appendix A, an instruction set summary in
alphabetical order. It defines the full RISC-V ISA with all extensions mentioned above and
all pseudoinstructions in about 50 pages, a testimony to the simplicity of RISC-V.
We end the book with an index.

1.5 Concluding Remarks

It is easy to see by formal-logical methods that there exist certain [instruction sets] that
are in abstract adequate to control and cause the execution of any sequence of operations
... The really decisive considerations from the present point of view, in selecting an
[instruction set], are more of a practical nature: simplicity of the equipment demanded by
the [instruction set], and the clarity of its application to the actually important problems
together with the speed of its handling of those problems.
—[von Neumann et al. 1947]

Margin note: A previous version of John von Neumann’s well-written report was so influential
that this style of computer is commonly called a von Neumann architecture, although this
report was based on the work of others. It was written three years before the first stored
program computer was operational!

RISC-V is a recent, clean-slate, minimalist, and open ISA informed by mistakes of past ISAs.
The goal of the RISC-V architects is for it to be effective for all computing devices, from the
smallest to the fastest. Following von Neumann’s 70-year-old advice, this ISA emphasizes
simplicity to keep costs low while having plenty of registers and transparent instruction speed
to help compilers and assembly language programmers map actually important problems to
appropriate, quick code.

ISA Pages Words Hours to read Weeks to read

RISC-V 236 76,702 6 0.2
ARM-32 2736 895,032 79 1.9
x86-32 2198 2,186,259 182 4.5

Figure 1.6: Number of pages and words of ISA manuals [Waterman and Asanović 2017a], [Waterman and
Asanović 2017b], [Intel Corporation 2016], [ARM Ltd. 2014]. Hours and weeks to complete assumes reading
at 200 words per minute for 40 hours a week. Based in part on Figure 1 of [Baumann 2017].

One indication of complexity is the size of the documentation. Figure 1.6 shows the
size of the instruction set manuals for RISC-V, ARM-32, and x86-32 measured in pages and
words. If you read manuals as a full-time job—8 hours a day for 5 days a week—it would
take half a month to make a single pass over the ARM-32 manual and a full month for the
x86-32. At this level of intricacy, perhaps no single person fully understands ARM-32 or
x86-32. Using this common-sense metric, RISC-V is 1/12 the complexity of the ARM-32 and 1/10
to 1/30 the complexity of x86-32. Indeed, the summary of RISC-V ISA including all extensions
is only two pages (see the Reference Card).
This minimal, open ISA was unveiled in 2011 and is now backed by a foundation that
will evolve it by adding optional extensions based strictly on technical justifications after
a prolonged debate. The openness enables free, shared implementations of RISC-V, which
lowers costs and the odds of unwanted malicious secrets being hidden in a processor.
However, hardware alone does not a system make. Software development costs likely
dwarf hardware development costs, so while stable hardware is important, stable software is
more so. It needs operating systems, boot-loaders, reference software, and popular software
tools. The foundation offers stability for the overall ISA, and the frozen base means that the
RV32I core that is the target for the software stack will never change. By its broad adoption
and openness, RISC-V can challenge the dominance of the prevailing proprietary ISAs.
Elegant is a word rarely applied to ISAs, but after reading this book, you may agree with
us that it applies to RISC-V. We’ll highlight features that we believe indicate elegance with a
Mona Lisa icon in the margins.

1.6 To Learn More

ARM Ltd. ARM Architecture Reference Manual: ARMv7-A and ARMv7-R Edition, 2014.
URL http://infocenter.arm.com/help/topic/com.arm.doc.ddi0406c/.
A. Baumann. Hardware is the new software. In Proceedings of the 16th Workshop on Hot
Topics in Operating Systems, pages 132–137. ACM, 2017.
C. Celio, D. Patterson, and K. Asanović. The Berkeley Out-of-Order Machine (BOOM):
an industry-competitive, synthesizable, parameterized RISC-V processor. Tech. Rep.
UCB/EECS-2015-167, EECS Department, University of California, Berkeley, 2015.
S. Gal-On and M. Levy. Exploring CoreMark - a benchmark maximizing simplicity and
efficacy. The Embedded Microprocessor Benchmark Consortium, 2012.
Intel Corporation. Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume
2: Instruction Set Reference. September 2016.
S. P. Morse. The Intel 8086 chip and the future of microprocessor design. Computer, 50(4):
8–9, 2017.
D. A. Patterson and J. L. Hennessy. Computer Organization and Design RISC-V Edition:
The Hardware Software Interface. Morgan Kaufmann, 2017.
S. Rodgers and R. Uhlig. X86: Approaching 40 and still going strong, 2017.
J. L. von Neumann, A. W. Burks, and H. H. Goldstine. Preliminary discussion of the logical
design of an electronic computing instrument. Report to the U.S. Army Ordnance Department,
1947.
A. Waterman. Design of the RISC-V Instruction Set Architecture. PhD thesis, EECS Department,
University of California, Berkeley, Jan 2016.
URL http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-1.html.
A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume II:
Privileged Architecture, Version 1.10. May 2017a.
URL https://riscv.org/specifications/privileged-isa/.
A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Version 2.2. May 2017b. URL https://riscv.org/specifications/.

Notes
1 http://parlab.eecs.berkeley.edu
2 RV32I: RISC-V Base Integer ISA

. . . the only way to realistically realize the performance goals and make them accessible
to the user was to design the compiler and the computer at the same time. In this way
features would not be put in the hardware which the software could not use . . .
—Frances Elizabeth “Fran” Allen, 1981

Margin note: Frances Elizabeth “Fran” Allen (1932-) was bestowed the Turing Award primarily
for her work on optimizing compilers. The Turing Award is the greatest prize in Computer
Science.

2.1 Introduction
Figure 2.1 is a one-page graphical representation of the RV32I base instruction set. You can
see the full RV32I instruction set by concatenating the underlined letters from left to right
for each diagram. The set notation using { } lists the possible variations of the instruction,
using either underlined letters or the underscore character _, which means no letter for this
variation. For example

    set less than {_, immediate} {_, unsigned}

(underlined letters: s, l, t, i, u) represents these four RV32I instructions: slt, slti,
sltu, sltiu.
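
The set notation can be read as a small cross product. As a minimal sketch in Python (the
helper name expand is ours for illustration, not part of the book or of any RISC-V tool),
with the empty string standing in for the underscore:

```python
from itertools import product

def expand(stem, *option_sets):
    """Join the stem with one choice from each option set, in every combination."""
    return [stem + "".join(choice) for choice in product(*option_sets)]

# "" plays the role of the underscore: no letter is taken from that set.
slt_family = expand("slt", ["", "i"], ["", "u"])
logical_family = expand("", ["and", "or", "xor"], ["", "i"])

print(sorted(slt_family))
print(sorted(logical_family))
```

The same helper reproduces the six logical instructions named in the caption of Figure 2.1.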

The goal of these diagrams, which will be the first figure of the following chapters, is to
give a quick, insightful overview of the instructions of a chapter.

2.2 RV32I Instruction formats

Figure 2.2 shows the six base instruction formats: R-type for register-register operations;
I-type for short immediates and loads; S-type for stores; B-type for conditional branches;
U-type for long immediates; and J-type for unconditional jumps. Figure 2.3 lists the opcodes
of the RV32I instructions in Figure 2.1 using the formats of Figure 2.2.
Even the instruction formats demonstrate several examples where the simpler RISC-V
ISA improves cost-performance. First, there are only six formats and all instructions are
32 bits long, which simplifies instruction decoding. ARM-32 and particularly x86-32 have
numerous formats, which make decoding expensive in low-end implementations and a perfor-
mance challenge for medium and high-end processor designs. Second, RISC-V instructions
offer three register operands, rather than having one field shared for source and destination,
as with x86-32. When an operation naturally has three distinct operands but the ISA provides

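RV32I (set notation at left, resulting instructions at right)

Integer Computation
  add {_, immediate}                                     add, addi
  subtract                                               sub
  {and, or, exclusive or} {_, immediate}                 and, or, xor, andi, ori, xori
  {shift left logical, shift right arithmetic,
   shift right logical} {_, immediate}                   sll, sra, srl, slli, srai, srli
  load upper immediate                                   lui
  add upper immediate to pc                              auipc
  set less than {_, immediate} {_, unsigned}             slt, slti, sltu, sltiu

Control transfer
  branch {equal, not equal}                              beq, bne
  branch {greater than or equal, less than}
         {_, unsigned}                                   bge, blt, bgeu, bltu
  jump and link {_, register}                            jal, jalr

Loads and Stores
  {load, store} {byte, halfword, word}                   lb, lh, lw, sb, sh, sw
  load {byte, halfword} unsigned                         lbu, lhu

Miscellaneous instructions
  fence loads & stores                                   fence
  fence.instruction & data                               fence.i
  environment {break, call}                              ebreak, ecall
  control status register {read & clear bit,
    read & set bit, read & write} {_, immediate}         csrrc, csrrs, csrrw,
                                                         csrrci, csrrsi, csrrwi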

Figure 2.1: Diagram of the RV32I instructions. The underlined letters are concatenated from left to right to
form RV32I instructions. The curly bracket notation { } means each vertical item in the set is a different
variation of the instruction. The underscore _ within a set means that one option is simply the instruction
name so far without a letter from this set. For example, the notation near the upper left-hand corner
represents the following six instructions: and, or, xor, andi, ori, xori.

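Format  31-25                   24-20  19-15  14-12   11-7         6-0
R-type  funct7                  rs2    rs1    funct3  rd           opcode
I-type  imm[11:0] (31-20)              rs1    funct3  rd           opcode
S-type  imm[11:5]               rs2    rs1    funct3  imm[4:0]     opcode
B-type  imm[12|10:5]            rs2    rs1    funct3  imm[4:1|11]  opcode
U-type  imm[31:12] (31-12)                            rd           opcode
J-type  imm[20|10:1|11|19:12] (31-12)                 rd           opcode

In the B-type, imm[12] is instruction bit 31, imm[10:5] is bits 30-25, imm[4:1] is bits 11-8,
and imm[11] is bit 7. In the J-type, imm[20] is bit 31, imm[10:1] is bits 30-21, imm[11] is
bit 20, and imm[19:12] is bits 19-12.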

Figure 2.2: RISC-V instruction formats. We label each immediate subfield with the bit position (imm[x]) in
the immediate value being produced, rather than the bit position in the instruction’s immediate field as is
usually done. Chapter 10 explains how the control status register instructions use the I-type format slightly
differently. (Figure 2.2 of Waterman and Asanović 2017 is the basis of this figure).
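
To make the formats concrete, here is a minimal Python sketch that pulls the fixed-position
fields out of a 32-bit instruction word and reassembles the immediate for each format. The
function names (bits, sign_extend, fields, immediate) are our own, and the example words were
encoded by hand from Figure 2.3, so treat them as illustrative assumptions rather than
assembler output.

```python
def bits(word, hi, lo):
    """Return instruction bits hi..lo (inclusive) as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    sign = 1 << (width - 1)
    return (value & (sign - 1)) - (value & sign)

def fields(word):
    """Register specifiers and function codes sit at the same bits in every format."""
    return {
        "opcode": bits(word, 6, 0),
        "rd": bits(word, 11, 7),
        "funct3": bits(word, 14, 12),
        "rs1": bits(word, 19, 15),
        "rs2": bits(word, 24, 20),
        "funct7": bits(word, 31, 25),
    }

def immediate(word, fmt):
    """Reassemble the immediate for format I, S, B, U, or J; bit 31 is always the sign."""
    if fmt == "I":
        return sign_extend(bits(word, 31, 20), 12)
    if fmt == "S":
        return sign_extend((bits(word, 31, 25) << 5) | bits(word, 11, 7), 12)
    if fmt == "B":
        raw = ((bits(word, 31, 31) << 12) | (bits(word, 7, 7) << 11)
               | (bits(word, 30, 25) << 5) | (bits(word, 11, 8) << 1))
        return sign_extend(raw, 13)
    if fmt == "U":
        return sign_extend(bits(word, 31, 12) << 12, 32)
    if fmt == "J":
        raw = ((bits(word, 31, 31) << 20) | (bits(word, 19, 12) << 12)
               | (bits(word, 20, 20) << 11) | (bits(word, 30, 21) << 1))
        return sign_extend(raw, 21)
    raise ValueError("unknown format: " + fmt)

addi_word = 0xFFF00093  # addi x1, x0, -1 (hand-encoded)
print(fields(addi_word)["rd"], immediate(addi_word, "I"))
```

Note that fields() never looks at the opcode before extracting rs1, rs2, and rd, which is the
software analogue of starting register reads before decoding finishes.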

31 25 24 20 19 15 14 12 11 7 6 0
imm[31:12] rd 0110111 U lui
imm[31:12] rd 0010111 U auipc
imm[20|10:1|11|19:12] rd 1101111 J jal
imm[11:0] rs1 000 rd 1100111 I jalr
imm[12|10:5] rs2 rs1 000 imm[4:1|11] 1100011 B beq
imm[12|10:5] rs2 rs1 001 imm[4:1|11] 1100011 B bne
imm[12|10:5] rs2 rs1 100 imm[4:1|11] 1100011 B blt
imm[12|10:5] rs2 rs1 101 imm[4:1|11] 1100011 B bge
imm[12|10:5] rs2 rs1 110 imm[4:1|11] 1100011 B bltu
imm[12|10:5] rs2 rs1 111 imm[4:1|11] 1100011 B bgeu
imm[11:0] rs1 000 rd 0000011 I lb
imm[11:0] rs1 001 rd 0000011 I lh
imm[11:0] rs1 010 rd 0000011 I lw
imm[11:0] rs1 100 rd 0000011 I lbu
imm[11:0] rs1 101 rd 0000011 I lhu
imm[11:5] rs2 rs1 000 imm[4:0] 0100011 S sb
imm[11:5] rs2 rs1 001 imm[4:0] 0100011 S sh
imm[11:5] rs2 rs1 010 imm[4:0] 0100011 S sw
imm[11:0] rs1 000 rd 0010011 I addi
imm[11:0] rs1 010 rd 0010011 I slti
imm[11:0] rs1 011 rd 0010011 I sltiu
imm[11:0] rs1 100 rd 0010011 I xori
imm[11:0] rs1 110 rd 0010011 I ori
imm[11:0] rs1 111 rd 0010011 I andi
0000000 shamt rs1 001 rd 0010011 I slli
0000000 shamt rs1 101 rd 0010011 I srli
0100000 shamt rs1 101 rd 0010011 I srai
0000000 rs2 rs1 000 rd 0110011 R add
0100000 rs2 rs1 000 rd 0110011 R sub
0000000 rs2 rs1 001 rd 0110011 R sll
0000000 rs2 rs1 010 rd 0110011 R slt
0000000 rs2 rs1 011 rd 0110011 R sltu
0000000 rs2 rs1 100 rd 0110011 R xor
0000000 rs2 rs1 101 rd 0110011 R srl
0100000 rs2 rs1 101 rd 0110011 R sra
0000000 rs2 rs1 110 rd 0110011 R or
0000000 rs2 rs1 111 rd 0110011 R and
0000 pred succ 00000 000 00000 0001111 I fence
0000 0000 0000 00000 001 00000 0001111 I fence.i
000000000000 00000 000 00000 1110011 I ecall
000000000001 00000 000 00000 1110011 I ebreak
csr rs1 001 rd 1110011 I csrrw
csr rs1 010 rd 1110011 I csrrs
csr rs1 011 rd 1110011 I csrrc
csr zimm 101 rd 1110011 I csrrwi
csr zimm 110 rd 1110011 I csrrsi
csr zimm 111 rd 1110011 I csrrci

Figure 2.3: RV32I opcode map has instruction layout, opcodes, format type, and names. (Table 19.2 of
[Waterman and Asanović 2017] is the basis of this figure.)

only a two-operand instruction, the compiler or assembly language programmer must use an
extra move instruction to preserve the destination operand. Third, in RISC-V the specifiers of
the registers to be read and written are always in the same location in all instructions, which
means the register accesses can begin before decoding the instruction. Many other ISAs
reuse a field as a source in some instructions and as a destination in others (e.g., ARM-32 and
MIPS-32), which forces addition of extra hardware to be placed in a potentially time-critical
path to select the proper field. Fourth, the immediate fields in these formats are always sign
extended, and the sign bit is always in the most significant bit of the instruction. This
decision means sign extension of the immediate, which may also be on a critical timing path, can
proceed before decoding the instruction.

Margin note: Sign-extended immediates even help logical instructions. For example,
x & 0xfffffff0 uses only the single instruction andi in RISC-V, but it requires two
instructions in MIPS-32 (addiu to load the constant, then and), since MIPS zero-extends logical
immediates. ARM-32 needed to add an additional instruction, bic, that performs
rx & ~immediate to compensate for zero-extending immediates.

Elaboration: B- and J-type formats?
As mentioned below, the immediate field is rotated 1 bit for branch instructions, a variation
of the S format that we relabel the B format. The immediate field is rotated 12 bits for jump
instructions, a variation of the U format relabeled the J format. Hence, there are really four
basic formats, but we can conservatively count RISC-V as having six formats.

To help programmers, a bit pattern of all zeros is an illegal RV32I instruction. Thus,
erroneous jumps into zeroed memory regions will immediately trap, an aid to debugging.
Similarly, the bit pattern of all ones is an illegal instruction, which will trap other common
errors such as unprogrammed non-volatile memory devices, disconnected memory buses, or
broken memory chips.
To leave ample room for ISA extensions, the base RV32I ISA uses less than 1/8-th of
the encoding space in the 32-bit instruction word. The architects also carefully picked
the RV32I opcodes so that instructions with common datapath operations share as many of
the same opcode bit values as possible, which simplifies the control logic. Finally, as we
shall see, the branch and jump addresses in the B and J formats must be shifted left 1 bit
so as to multiply the addresses by 2, thereby giving branches and jumps a greater range.
RISC-V rotates the bits in the immediate operands from a natural placement to reduce the
instruction signal fanout and immediate multiplexing cost by almost a factor of two, which
again simplifies datapath logic on low-end implementations.

Margin note: RISC-V implementations all use the same opcodes for the optional extensions
such as RV32M, RV32F, and so on. Non-standard extensions that are unique to a processor are
restricted to a reserved opcode space in RISC-V.

What’s Different? We’ll end each section in this and following chapters with a description
of how RISC-V differs from other ISAs. The contrast is often what RISC-V is missing.
Architects demonstrate good taste by the features they omit as well as by what they include.
The ARM-32 12-bit immediate field is not simply a constant but an input to a function that
produces a constant: 8 bits are zero-extended to full width and then rotated right by the value
in the 4 remaining bits multiplied by 2. The hope was encoding more useful constants in 12
bits would reduce the number of executed instructions. ARM-32 also dedicates a precious
four bits in most instruction formats to conditional execution. Despite being infrequently
used, conditional execution adds to the complexity of out-of-order processors.
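
The ARM-32 immediate function described above can be written out directly. This Python
sketch assumes the usual field split (the upper 4 bits of the 12-bit field are the rotation
count and the lower 8 bits are the value); the function names are ours.

```python
def ror32(value, amount):
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF

def arm32_immediate(imm12):
    """Expand a 12-bit ARM-32 immediate field into the 32-bit constant it encodes."""
    rotate = (imm12 >> 8) & 0xF   # 4-bit rotation count
    imm8 = imm12 & 0xFF           # 8-bit value, zero-extended
    return ror32(imm8, 2 * rotate)

for field in (0x0FF, 0x1FF, 0x4FF):
    print(hex(field), "->", hex(arm32_immediate(field)))
```

By contrast, a RISC-V I-type immediate needs only sign extension, with no rotation step.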

Elaboration: Out-of-order processors

are high-speed, pipelined processors that execute instructions opportunistically instead of in
lock-step program order. A critical feature of such processors is register renaming, which
maps the register names in the program onto a larger number of internal physical registers.
The problem with conditional execution is that the registers in these instructions must be
allocated to internal physical registers whether or not the condition holds, yet internal physical
register availability is a critical performance resource for out-of-order processors.

2.3 RV32I Registers

Figure 2.4 lists the RV32I registers and their names as determined by the RISC-V application
binary interface (ABI). We will use the ABI names in our code examples to make them easier
to read. To the joy of assembly language programmers and compiler writers, RV32I has
31 registers plus x0, which always has the value 0. ARM-32 has merely 16 registers while
x86-32 has only 8!
What’s Different? Dedicating a register to zero is a surprisingly large factor in simplify-
ing the RISC-V ISA. Figure 3.3 on page 36 in Chapter 3 gives many examples of operations
that are native instructions in ARM-32 and x86-32, which don’t have a zero register, but can
be synthesized from RV32I instructions simply by using the zero register as an operand.
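
As a rough illustration of how far the zero register goes, the following Python sketch models
a register file in which x0 always reads as 0, then builds negate, move, load-constant, and
set-if-nonzero out of ordinary sub, addi, and sltu. The class and function names are ours,
and the model covers only these three instructions.

```python
MASK32 = 0xFFFFFFFF

class Registers:
    """32 integer registers; x0 reads as 0 and ignores writes."""
    def __init__(self):
        self._x = [0] * 32

    def read(self, n):
        return 0 if n == 0 else self._x[n]

    def write(self, n, value):
        if n != 0:
            self._x[n] = value & MASK32

def sub(regs, rd, rs1, rs2):
    regs.write(rd, regs.read(rs1) - regs.read(rs2))

def addi(regs, rd, rs1, imm):
    regs.write(rd, regs.read(rs1) + imm)

def sltu(regs, rd, rs1, rs2):
    regs.write(rd, 1 if regs.read(rs1) < regs.read(rs2) else 0)

regs = Registers()
addi(regs, 5, 0, 7)   # load constant: x5 = 0 + 7
addi(regs, 6, 5, 0)   # move:          x6 = x5 + 0
sub(regs, 7, 0, 5)    # negate:        x7 = 0 - x5
sltu(regs, 8, 0, 5)   # nonzero test:  x8 = (0 < x5) unsigned
addi(regs, 0, 5, 1)   # writes to x0 are discarded
print(regs.read(6), hex(regs.read(7)), regs.read(8), regs.read(0))
```
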
The PC is one of ARM-32’s 16 registers, which means that any instruction that changes
a register may also as a side effect be a branch instruction. The PC as a register complicates
hardware branch prediction, whose accuracy is vital for good pipelined-performance, since
every instruction might be a branch instead of 10–20% of instructions executed in programs
for typical ISAs. It also means one less general-purpose register.

Margin note: Pipelining is used by all but the cheapest processors today to get good
performance. Like an industrial assembly line, they get higher throughput by overlapping the
execution of many instructions at once. To pull this off, the processors predict branch
outcomes, which they can do with more than 90% accuracy. When they mispredict, they
re-execute instructions. Early microprocessors had a 5-stage pipeline, which meant 5
instructions overlapped execution. Recent ones have more than 10 pipeline stages.

2.4 RV32I Integer Computation

Appendix A gives details of all of the RISC-V instructions, including formats and opcodes.
In this section, and similar sections of the following chapters, we give an ISA overview that
should be sufficient for knowledgeable assembly language programmers, as well as highlight
the features that demonstrate the seven ISA metrics from Chapter 1.
The simple arithmetic instructions (add, sub), logical instructions (and, or, xor), and
shift instructions (sll, srl, sra) in Figure 2.1 are just as you would expect in any ISA. They
read two 32-bit values from registers and write a 32-bit result to the destination register.
RV32I also offers immediate versions of these instructions. Unlike ARM-32, immediates are
always sign-extended so that they can be negative when needed, which is why there is no
need for an immediate version of sub.
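
A small Python sketch shows why sign extension makes a subtract-immediate unnecessary and
also helps masks such as x & 0xfffffff0. The helper names are ours; each function models only
the arithmetic of the instruction it is named after.

```python
MASK32 = 0xFFFFFFFF

def sign_extend_12(imm):
    """Widen a 12-bit immediate field to a signed Python integer."""
    imm &= 0xFFF
    return imm - 0x1000 if imm & 0x800 else imm

def addi(x, imm):
    """x + sign-extended 12-bit immediate, kept to 32 bits."""
    return (x + sign_extend_12(imm)) & MASK32

def andi(x, imm):
    """x & sign-extended 12-bit immediate, kept to 32 bits."""
    return x & sign_extend_12(imm) & MASK32

print(addi(10, -5))                  # subtracting 5 is just addi with -5
print(hex(andi(0x12345678, 0xFF0)))  # 0xFF0 sign-extends to 0xFFFFFFF0
```
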
Programs can generate a Boolean value from the result of a comparison. To accommodate
such cases, RV32I offers a set less than instruction, which sets the destination register to 1
if the first operand is less than the second, or 0 otherwise. As one would expect, there is a
signed version (slt) and an unsigned version (sltu) for signed and unsigned integers as well
as immediate versions for both (slti, sltiu). As we shall see, while RV32I branches can
check for all relationships between two registers, some conditional expressions involve rela-
tionships between many pairs of registers. The compiler or assembly language programmer
could use slt and the logical instructions and, or, xor to resolve more elaborate conditional
expressions.
The two remaining integer computation instructions in Figure 2.1 help with assembly and
linking. Load upper immediate (lui) loads a 20-bit constant into the most significant 20 bits
of a register. It can be followed by a standard immediate instruction to create a 32-bit constant
from only two 32-bit RV32I instructions. Add upper immediate to PC (auipc) supports two-
instruction sequences to access arbitrary offsets from the PC for both control-flow transfers
and data accesses. The combination of an auipc and the 12-bit immediate in a jalr (see
below) can transfer control to any 32-bit PC-relative address, while an auipc plus the 12-bit
immediate offset in regular load or store instructions can access any 32-bit PC-relative data
address.
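
One subtlety in the lui-plus-immediate pairing is that the 12-bit immediate is sign-extended,
so the upper 20 bits must be adjusted when bit 11 of the constant is set. The Python sketch
below shows one common way to compute the two pieces; the function names are ours and this is
an illustration of the arithmetic, not the output of any particular assembler.

```python
def sign_extend_12(value):
    """Interpret the low 12 bits as a two's-complement number."""
    value &= 0xFFF
    return value - 0x1000 if value & 0x800 else value

def split_constant(value):
    """Split a 32-bit constant into (20-bit lui field, signed 12-bit addi immediate)."""
    value &= 0xFFFFFFFF
    low = sign_extend_12(value)
    high = ((value - low) >> 12) & 0xFFFFF   # compensates for a negative low part
    return high, low

def rebuild(high, low):
    """What lui followed by addi leaves in the register."""
    return ((high << 12) + low) & 0xFFFFFFFF

for constant in (0x12345678, 0xDEADBEEF, 0xFFFFF800):
    high, low = split_constant(constant)
    print(hex(constant), "-> lui", hex(high), ", addi", low)
```

The same split applies to auipc with a 12-bit load, store, or jalr offset, with the PC-relative
distance taking the place of the constant.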

Register (bits 31 to 0)   Description
x0 / zero Hardwired zero
x1 / ra Return address
x2 / sp Stack pointer
x3 / gp Global pointer
x4 / tp Thread pointer
x5 / t0 Temporary
x6 / t1 Temporary
x7 / t2 Temporary
x8 / s0 / fp Saved register, frame pointer
x9 / s1 Saved register
x10 / a0 Function argument, return value
x11 / a1 Function argument, return value
x12 / a2 Function argument
x13 / a3 Function argument
x14 / a4 Function argument
x15 / a5 Function argument
x16 / a6 Function argument
x17 / a7 Function argument
x18 / s2 Saved register
x19 / s3 Saved register
x20 / s4 Saved register
x21 / s5 Saved register
x22 / s6 Saved register
x23 / s7 Saved register
x24 / s8 Saved register
x25 / s9 Saved register
x26 / s10 Saved register
x27 / s11 Saved register
x28 / t3 Temporary
x29 / t4 Temporary
x30 / t5 Temporary
x31 / t6 Temporary
pc Program counter
(Each x register and the pc are 32 bits wide, bits 31 to 0.)

Figure 2.4: The registers of RV32I. Chapter 3 explains the RISC-V calling convention, the rationale behind
the various pointers (sp, gp, tp, fp), Saved registers (s0-s11), and Temporaries (t0-t6). (Figure 2.1 and Table
20.1 of [Waterman and Asanović 2017] are the basis of this figure.)

What’s Different? First, there are no byte or half-word integer computation operations.
The operations are always the full register width. Memory accesses take orders of magnitude
more energy than arithmetic operations, so narrow data accesses can save significant energy,
but narrow operations do not. ARM-32 has the unusual feature of having an option to shift
one of the operands in most arithmetic-logic operations, which complicates the datapath and
is rarely needed [Hohl and Hinds 2016]; RV32I has separate shift instructions.
Nor does RV32I include multiply and divide; they comprise the optional RV32M exten-
sion (see Chapter 4). Unlike ARM-32 and x86-32, the full RISC-V software stack can run
without them, which can shrink embedded chips. While not a hardware issue, the MIPS-32
assembler may replace a multiply with a sequence of shifts and adds to try to improve
performance, which may confuse the programmer seeing instructions executed not found in the
assembly language program. RV32I also omits rotate instructions and detection of integer
arithmetic overflow. Both can be calculated in a few RV32I instructions (see Section 2.6).
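
As a hedged illustration of how rotates and overflow checks decompose into simpler operations,
the Python sketch below mirrors one plausible sequence for each: two shifts and an or for a
rotate, one unsigned comparison for unsigned overflow, and two signed comparisons for signed
overflow. The function names are ours, and these are not necessarily the exact sequences the
book presents.

```python
MASK32 = 0xFFFFFFFF

def to_signed(value):
    """View a 32-bit pattern as a two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def rotate_left(x, n):
    """Rotate left with sll, srl, and or; shift amounts use only the low 5 bits."""
    n &= 31
    left = (x << n) & MASK32                   # sll
    right = (x & MASK32) >> ((32 - n) & 31)    # srl by (32 - n)
    return left | right                        # or

def unsigned_add_overflows(a, b):
    """The truncated sum is smaller than an operand exactly when a carry was lost."""
    total = (a + b) & MASK32
    return total < (a & MASK32)                # sltu / bltu

def signed_add_overflows(a, b):
    """Overflow iff 'b is negative' disagrees with 'sum is less than a'."""
    total = to_signed(a + b)
    return (to_signed(b) < 0) != (total < to_signed(a))   # slti, slt, bne

print(hex(rotate_left(0x12345678, 8)))
print(unsigned_add_overflows(0xFFFFFFFF, 1), signed_add_overflows(0x7FFFFFFF, 1))
```
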

Elaboration: “Bit twiddling” instructions

such as rotate are under consideration by the RISC-V Foundation as part of an optional in-
struction extension called RV32B (see Chapter 11).

Elaboration: xor enables a magic trick.

You can exchange two values without using an intermediate register! This code exchanges
the values of x1 and x2. We leave the proof to the reader. Hint: exclusive or is commutative
(a ⊕ b = b ⊕ a), associative ((a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)), is its own inverse (a ⊕ a = 0), and
has an identity (a ⊕ 0 = a).

    xor x1,x1,x2   # x1' == x1^x2, x2' == x2
    xor x2,x1,x2   # x1' == x1^x2, x2' == x1'^x2 == x1^x2^x2 == x1
    xor x1,x1,x2   # x1'' == x1'^x2' == x1^x2^x1 == x1^x1^x2 == x2, x2' == x1
However fascinating, RISC-V’s ample register set usually lets compilers find a scratch
register, so they rarely use the XOR-swap.
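
For readers who would rather test the trick than prove it, this short Python sketch replays
the three xor steps on ordinary integers (the function name is ours):

```python
def xor_swap(x1, x2):
    """Exchange two values using three exclusive-ors and no temporary."""
    x1 = x1 ^ x2   # x1 now holds x1 ^ x2
    x2 = x1 ^ x2   # (x1 ^ x2) ^ x2 == original x1
    x1 = x1 ^ x2   # (x1 ^ x2) ^ original x1 == original x2
    return x1, x2

print(xor_swap(3, 10))
```

One caveat: the trick relies on the two operands living in different registers. If both names
refer to the same register, the first xor clears it and the value is lost.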

2.5 RV32I Loads and Stores

As well as providing loads and stores of 32-bit words (lw, sw), Figure 2.1 shows that RV32I
has loads for signed and unsigned bytes and halfwords (lb, lbu, lh, lhu) and stores for
bytes and halfwords (sb, sh). Signed bytes and halfwords are sign-extended to 32 bits and
written to the destination registers. This widening of narrow data allows subsequent integer
computation instructions to operate correctly on all 32 bits, even if the natural data types
are narrower. Unsigned bytes and halfwords, useful for text and unsigned integers, are zero-
extended to 32 bits before being written to the destination register.
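
The difference between the signed and unsigned loads is easiest to see on a byte whose top
bit is set. This Python sketch models memory as a bytes object and assumes little-endian
order, as RISC-V uses; the load helper is ours.

```python
def load(memory, address, size, signed):
    """Read `size` bytes (little-endian) and widen the value to a 32-bit pattern."""
    value = int.from_bytes(memory[address:address + size], "little", signed=signed)
    return value & 0xFFFFFFFF

memory = bytes([0x80, 0xFF, 0x00, 0x00])

lb = load(memory, 0, 1, signed=True)     # sign-extended byte
lbu = load(memory, 0, 1, signed=False)   # zero-extended byte
lh = load(memory, 0, 2, signed=True)     # sign-extended halfword
lhu = load(memory, 0, 2, signed=False)   # zero-extended halfword
print(hex(lb), hex(lbu), hex(lh), hex(lhu))
```
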
The only addressing mode for loads and stores is adding a sign-extended 12-bit immediate
to a register, called displacement addressing mode in x86-32 [Irvine 2014].
What’s Different? RV32I omitted the sophisticated addressing modes of ARM-32 and
x86-32. Alas, all ARM-32 addressing modes aren’t available for all data types, but RV32I
addressing does not discriminate against any data type. RISC-V can imitate some x86 ad-
dressing modes. For example, setting the immediate field to 0 has the same effect as the
register-indirect addressing mode. Unlike x86-32, RISC-V has no special stack instructions.
By using one of the 31 registers as the stack pointer (see Figure 2.4), the standard addressing
mode gets most of the benefits of push and pop instructions without the added ISA complex-
ity. Unlike MIPS-32, RISC-V rejected delayed load. Similar in spirit to delayed branches,
MIPS-32 redefined the load so the data is unavailable until two instructions later, when it
would show up in a five-stage pipeline. Whatever benefit it had evaporated for the longer
pipelines that came later.
While ARM-32 and MIPS-32 require data to be aligned naturally to data-sized boundaries
in memory, RISC-V does not. Misaligned accesses are sometimes required when porting
legacy code. One option is to disallow misaligned accesses in the base ISA and then provide
some separate instructions to support misaligned accesses, such as Load Word Left and Load
Word Right of MIPS-32. This option would complicate register access, however, since lwl
and lwr require writing pieces of registers instead of simply full registers. Requiring instead
that the regular loads and stores support misaligned accesses simplified the overall design.

Elaboration: Endianness
RISC-V chose little-endian byte ordering because it is dominant commercially: all x86-32
systems, and Apple iOS, Google Android OS, and Microsoft Windows for ARM are all little-
endian. Since the endian order matters only when accessing the identical data both as a word
and as bytes, endianness affects few programmers.
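
The case where byte order becomes visible, viewing the same data both as a word and as bytes,
can be shown in a few lines of Python (variable names are ours):

```python
word = 0x12345678

little = word.to_bytes(4, "little")   # byte order a RISC-V store word would use
big = word.to_bytes(4, "big")         # the opposite convention, for comparison

print([hex(b) for b in little])
print([hex(b) for b in big])
```

A program that only ever reads the value back as a word sees 0x12345678 under either
convention, which is why endianness affects few programmers.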

2.6 RV32I Conditional Branch

RV32I can compare two registers and branch on the result if they are equal (beq), not equal
(bne), greater than or equal (bge), or less than (blt). The latter two cases are signed
comparisons, but RV32I also offers unsigned versions: bgeu and bltu. The two remaining
relationships (greater than and less than or equal) can be checked simply by reversing the
operands, since x < y means that y > x and x ≥ y implies y ≤ x.

Margin note: bltu allows signed array bounds to be checked with a single instruction, since
any negative index will compare greater than any nonnegative bound!

Since RISC-V instructions must be a multiple of two bytes long (see Chapter 7 to learn
about the optional 2-byte instructions), the branch addressing mode multiplies the 12-bit
immediate by 2, sign-extends it, and then adds it to the PC. PC-relative addressing helps with
position independent code and thereby reduces the work of the linker and loader (Chapter 3).

What’s Different? As noted above, RISC-V excluded the infamous delayed branch of
MIPS-32, Oracle SPARC, and others. It also avoided the condition codes of ARM-32 and
x86-32 for conditional branches. They add extra state that is implicitly set by most
instructions, which needlessly complicates the dependence calculation for out-of-order execution.
Finally, it omitted the loop instructions of the x86-32: loop, loope, loopz, loopne, loopnz.

Elaboration: Multiword addition without condition codes

is done as follows in RV32I by using sltu to calculate the carry-out:
add a0,a2,a4 # add lower 32 bits: a0 = a2 + a4
sltu a2,a0,a2 # a2’ = 1 if (a2+a4) < a2, a2’ = 0 otherwise
add a5,a3,a5 # add upper 32 bits: a5 = a3 + a5
add a1,a2,a5 # add carry-out from lower 32 bits

Elaboration: Reading the PC

The current PC can be obtained by setting the U-immediate field of auipc to 0. For the x86-
32, to read the PC you need to call a function (which pushes the PC to the stack); the callee
then reads the pushed PC from the stack, and finally returns the PC (by popping the stack).
So reading the current PC took 1 store, 2 loads, and 2 taken jumps!

Elaboration: Software checking of overflow

Most but not all programs ignore integer arithmetic overflow, so RISC-V relies on software
overflow checking. Unsigned addition requires only a single additional branch instruction
after the addition: add t0, t1, t2; bltu t0, t1, overflow.
For signed addition, if one operand’s sign is known, overflow checking requires only a single
branch after the addition: addi t0, t1, +imm; blt t0, t1, overflow.
This covers the common case of addition with an immediate operand. For general signed
addition, three additional instructions after the addition are required, observing that the sum
should be less than one of the operands if and only if the other operand is negative.
add t0, t1, t2
slti t3, t2, 0 # t3 = (t2<0)
slt t4, t0, t1 # t4 = (t1+t2<t1)
bne t3, t4, overflow # overflow if (t2<0) && (t1+t2>=t1)
# || (t2>=0) && (t1+t2<t1)

2.7 RV32I Unconditional Jump

The single jump and link instruction (jal) in Figure 2.1 serves dual functions. To support
procedure calls, it saves the address of the next instruction PC+4 into the destination register,
normally the return address register ra (see Figure 2.4). To support unconditional jumps, we
use the zero register (x0) instead of ra as the destination register, as x0 can’t be changed.
Like branches, jal multiplies its 20-bit branch address by 2, sign extends it, and then adds
the result to the PC to form the jump address.
The register version of jump and link (jalr) is similarly multipurpose. It can call a pro-
cedure to a dynamically calculated address or simply perform a procedure return by selecting
the ra as the source register, and the zero register (x0) again as the destination register. Switch
or case statements, which calculate a jump address, can also use jalr with the zero register
as the destination register.
What’s Different? RV32I shunned intricate procedure call instructions, such as the
enter and leave instructions of the x86-32, or register windows as found in the Intel
Itanium, Oracle SPARC, and Cadence Tensilica.

Register windows accelerated function calls by having many more registers than 32. A new
function would get a new set or window of 32 registers on a call. To pass arguments, the
windows overlapped, meaning some registers were in two adjacent windows.

2.8 RV32I Miscellaneous

The Control Status Register instructions (csrrc, csrrs, csrrw, csrrci, csrrsi, csrrwi)
in Figure 2.1 provide easy access to registers that help measure program performance. These
64-bit counters, which can be read 32 bits at a time, measure wall clock time, clock cycles
executed, and number of instructions retired.
void insertion_sort(long a[], size_t n)
{
    for (size_t i = 1, j; i < n; i++) {
        long x = a[i];
        for (j = i; j > 0 && a[j-1] > x; j--) {
            a[j] = a[j-1];
        }
        a[j] = x;
    }
}

Figure 2.5: Insertion Sort in C. While simple, Insertion Sort has many advantages over complicated sorting
algorithms: it is memory efficient and fast for small data sets while being adaptive, stable, and online. GCC
compilers produced the code for the following four figures. We set the optimization flags to reduce code size,
as that produced the easiest to understand code.

ISA           ARM-32  ARM Thumb-2  MIPS-32  microMIPS  x86-32  RV32I  RV32I+RVC

Instructions  19      18           24       24         20      19     19
Bytes         76      46           96       56         45      76     52

Figure 2.6: Number of instructions and code size for Insertion Sort for these ISAs. Chapter 7 describes
ARM Thumb-2, microMIPS, and RV32C.

The ecall instruction makes requests to the supporting execution environment, such
as system calls. Debuggers use the ebreak instruction to transfer control to a debugging
environment.
The fence instruction sequences device I/O and memory accesses as viewed by other
threads and by external devices or coprocessors. The fence.i instruction synchronizes the
instruction and data streams. RISC-V does not guarantee that stores to instruction memory
are visible to instruction fetches in the same processor until a fence.i instruction executes.
Chapter 10 covers the RISC-V system instructions.
What’s Different? RISC-V uses memory mapped I/O instead of the in, ins, insb,
insw and out, outs, outsb, outsw instructions of the x86-32. It supports strings using byte
loads and stores instead of the 16 special string instructions of the x86-32: rep, movs, cmps,
scas, lods, ....

2.9 Comparing RV32I, ARM-32, MIPS-32, and x86-32 using Insertion Sort

We’ve introduced the RISC-V base instruction set, and commented upon its choices as
compared to ARM-32, MIPS-32, and x86-32. We’ll now do a head-to-head comparison.
Figure 2.5 shows Insertion Sort in C, which will be our benchmark. Figure 2.6 is a table that
summarizes the number of instructions and number of bytes in Insertion Sort for the ISAs.
Figures 2.8 to 2.11 show the compiled code for RV32I, ARM-32, MIPS-32, and x86-32.
Despite the emphasis on simplicity, the RISC-V version uses the same or fewer instructions,
and the code sizes of the architectures are quite close. In this example, the compare-and-
execute branches of RISC-V save as many instructions as do the fancier address modes and
the push and pop instructions of ARM-32 and x86-32 in Figures 2.9 and 2.11.

We move the code examples to after the end of the chapter text to maintain the flow of the
writing in this and following chapters.

2.10 Concluding Remarks

Those who cannot remember the past are condemned to repeat it.
—George Santayana, 1905

Figure 2.7 uses the seven metrics of ISA design from Chapter 1 to organize the lessons from
past ISAs listed in the previous sections, and shows the positive outcomes for RV32I. We’re
not implying that RISC-V is the first ISA to have those outcomes. Indeed, RV32I inherits the
following from RISC-I, its great-great-grandparent [Patterson 2017]:

• 32-bit byte-addressable address space
• All instructions are 32 bits long
• 31 registers, all 32 bits wide, with register 0 hardwired to zero
• All operations are between registers (none are register-to-memory)
• Load/store word plus signed and unsigned load/store byte and halfword
• Immediate option for all arithmetic, logical, and shift instructions
• Immediates always sign-extend
• One data addressing mode (register + immediate) and PC-relative branching
• No multiply or divide instructions
• An instruction to load a wide immediate into the upper part of a register so that a 32-bit
constant takes only two instructions

The genealogy of all RISC-V instructions is chronicled in [Chen and Patterson 2016].

RISC-V benefits from starting one-quarter to one-third century later, which allowed its
architects to follow Santayana’s advice to borrow the good ideas but to not repeat the mis-
takes of the past—including those of RISC-I—in the current RISC-V ISA. Moreover, the
RISC-V Foundation will grow the ISA slowly via optional extensions to prevent the rampant
incrementalism that has plagued successful ISAs of the past.
The Lindy effect [Lin 2017] observes that the future life expectancy of a technology or idea
is proportional to its age. It has stood the test of time, so the longer it has survived in the
past, the longer it likely will survive in the future. If that hypothesis holds, RISC
architecture may be a good idea for a long time.

Elaboration: Is RV32I unique?
Early microprocessors had separate chips for floating-point arithmetic, so those instructions
were optional. Moore’s Law soon brought everything on chip, and modularity faded in ISAs.
Subsetting the full ISA in simpler processors and trapping to software to emulate them goes
back decades, with the IBM 360 model 44 and the Digital Equipment microVAX as examples.
RV32I is different in that the full software stack needs only the base instructions, so an
RV32I processor need not trap repeatedly for omitted instructions in RV32G. Probably the
closest ISA to RISC-V in that respect is the Tensilica Xtensa, which is aimed at embedded
applications. Its 80-instruction base ISA is intended to be extended by users with custom
instructions that accelerate their applications. RV32I has a simpler base ISA, has a 64-bit
address version, and offers extensions that target supercomputers as well as microcontrollers.

2.11 To Learn More

Lindy effect, 2017. URL https://en.wikipedia.org/wiki/Lindy_effect.

Mistakes of the Past: ARM-32 (1986), MIPS-32 (1986), x86-32 (1978)
Lessons learned: RV32I (2011)

Cost
  ARM-32:  Integer multiply mandatory.
  MIPS-32: Integer multiply and divide mandatory.
  x86-32:  8-bit and 16-bit operations. Integer multiply and divide mandatory.
  RV32I:   No 8-bit and 16-bit operations. Integer multiply and divide optional (RV32M).

Simplicity
  ARM-32:  No zero register. Conditional instruction execution. Complex data address modes.
           Stack instructions (push/pop). Shift-option for arithmetic/logic instructions.
  MIPS-32: Zero- and sign-extended immediates. Some arithmetic instructions can cause
           overflow traps.
  x86-32:  No zero register. Complex procedure call/return instructions (enter/leave).
           Stack instructions (push/pop). Complex data address modes. Loop instructions.
  RV32I:   Register x0 dedicated to 0. Immediates only sign-extended. One data addressing
           mode. No conditional execution. No complex call/return or stack instructions.
           No traps for arithmetic overflow. Separate shift instructions.

Performance
  ARM-32:  Condition codes for branches. Source and destination registers vary in
           instruction format. Load multiple. Computed immediates. PC a general purpose
           register.
  MIPS-32: Source and destination registers vary in instruction format.
  x86-32:  Condition codes for branches. At most 2 registers per instruction.
  RV32I:   Compare and branch instructions (no condition codes). 3 registers per
           instruction. No load multiple. Source and destination registers fixed in
           instruction format. Constant immediates. PC not a general purpose register.

Isolate architecture from implementation
  ARM-32:  Exposes the pipeline length when writing the PC as a general purpose register.
  MIPS-32: Delayed branch. Delayed load. HI and LO registers just for multiply and divide.
  x86-32:  Registers not general purpose (AX, CX, DX, DI, SI have unique uses).
  RV32I:   No delayed branch. No delayed load. General purpose registers.

Room for growth
  ARM-32:  Limited available opcode space.
  MIPS-32: Limited available opcode space.
  RV32I:   Generous available opcode space.

Program size
  ARM-32:  Only 32-bit instructions (+Thumb-2 as separate ISA).
  MIPS-32: Only 32-bit instructions (+microMIPS as separate ISA).
  x86-32:  Byte-variable instructions, but poor choices.
  RV32I:   32-bit instructions + 16-bit RV32C extension.

Ease of programming / compiling / linking
  ARM-32:  Only 15 registers. Aligned data in memory. Irregular data address modes.
           Inconsistent performance counters.
  MIPS-32: Aligned data in memory. Inconsistent performance counters.
  x86-32:  Only 8 registers. No PC-relative data addressing. Inconsistent performance
           counters.
  RV32I:   31 registers. Data can be unaligned. PC-relative data addressing. Symmetric
           data address mode. Performance counters defined in architecture.

Figure 2.7: Lessons that RISC-V architects learned from past instruction set mistakes. Often the lesson was
simply to avoid ISA “optimizations” of the past. The lessons and mistakes are classified by the seven ISA
metrics from Chapter 1. Many features listed under cost, simplicity, and performance could be swapped
with each other, as it’s a matter of taste, but they are important no matter where they appear.
T. Chen and D. A. Patterson. RISC-V genealogy. Technical Report UCB/EECS-2016-6, EECS
Department, University of California, Berkeley, Jan 2016. URL
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-6.html.

W. Hohl and C. Hinds. ARM Assembly Language: Fundamentals and Techniques. CRC
Press, 2016.
K. R. Irvine. Assembly language for x86 processors. Prentice Hall, 2014.
D. Patterson. How close is RISC-V to RISC-I?, 2017.

A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Version 2.2. May 2017. URL https://riscv.org/specifications/.

Notes
1 http://parlab.eecs.berkeley.edu

# RV32I (19 instructions, 76 bytes, or 52 bytes with RVC)

# a0 points to a[0], a1 is n, a3 points to a[i], a4 is i, a5 is j, a6 is x
0: 00450693 addi a3,a0,4 # a3 is pointer to a[i]
4: 00100713 addi a4,x0,1 # i = 1
Outer Loop:
8: 00b76463 bltu a4,a1,10 # if i < n, jump to Continue Outer loop
Exit Outer Loop:
c: 00008067 jalr x0,x1,0 # return from function
Continue Outer Loop:
10: 0006a803 lw a6,0(a3) # x = a[i]
14: 00068613 addi a2,a3,0 # a2 is pointer to a[j]
18: 00070793 addi a5,a4,0 # j = i
Inner Loop:
1c: ffc62883 lw a7,-4(a2) # a7 = a[j-1]
20: 01185a63 bge a6,a7,34 # if a[j-1] <= a[i], jump to Exit Inner Loop
24: 01162023 sw a7,0(a2) # a[j] = a[j-1]
28: fff78793 addi a5,a5,-1 # j--
2c: ffc60613 addi a2,a2,-4 # decrement a2 to point to a[j]
30: fe0796e3 bne a5,x0,1c # if j != 0, jump to Inner Loop
Exit Inner Loop:
34: 00279793 slli a5,a5,0x2 # multiply a5 by 4
38: 00f507b3 add a5,a0,a5 # a5 is now byte address of a[j]
3c: 0107a023 sw a6,0(a5) # a[j] = x
40: 00170713 addi a4,a4,1 # i++
44: 00468693 addi a3,a3,4 # increment a3 to point to a[i]
48: fc1ff06f jal x0,8 # jump to Outer Loop

Figure 2.8: RV32I code for Insertion Sort in Figure 2.5. The address in hexadecimal is on the left, the
machine language code in hexadecimal is next, and then the assembly language instruction followed by a
comment. RV32I allocates two registers to point to a[j] and a[j-1]. It has plenty of registers, some of
which the ABI sets aside for procedure calls. Unlike the other ISAs, it skips saving and restoring these
registers to memory. While the code size is larger than x86-32, using the optional RV32C instructions (see
Chapter 7) closes the size gap. Note the compare and branch instructions avoid the three compare
instructions that ARM-32 and x86-32 require.

# ARM-32 (19 instructions, 76 bytes; or 18 insns/46 bytes with Thumb-2)

# r0 points to a[0], r1 is n, r2 is j, r3 is i, r4 is x
0: e3a03001 mov r3, #1 # i = 1
4: e1530001 cmp r3, r1 # i vs. n (unnecessary?)
8: e1a0c000 mov ip, r0 # ip = pointer to a[0]
c: 212fff1e bxcs lr # don’t let return address change ISAs
10: e92d4030 push {r4, r5, lr} # save r4, r5, return address
Outer Loop:
14: e5bc4004 ldr r4, [ip, #4]! # x = a[i] ; increment ip
18: e1a02003 mov r2, r3 # j = i
1c: e1a0e00c mov lr, ip # lr = pointer to a[j] (using lr as scratch reg)
Inner Loop:
20: e51e5004 ldr r5, [lr, #-4] # r5 = a[j-1]
24: e1550004 cmp r5, r4 # compare a[j-1] vs. x
28: da000002 ble 38 # if a[j-1]<=a[i], jump to Exit Inner Loop
2c: e2522001 subs r2, r2, #1 # j--
30: e40e5004 str r5, [lr], #-4 # a[j] = a[j-1]
34: 1afffff9 bne 20 # if j != 0, jump to Inner Loop
Exit Inner Loop:
38: e2833001 add r3, r3, #1 # i++
3c: e1530001 cmp r3, r1 # i vs. n
40: e7804102 str r4, [r0, r2, lsl #2] # a[j] = x
44: 3afffff2 bcc 14 # if i < n, jump to Outer Loop
48: e8bd8030 pop {r4, r5, pc} # restore r4, r5, and return address

Figure 2.9: ARM-32 code for Insertion Sort in Figure 2.5. The address in hexadecimal is on the left, the
machine language code in hexadecimal is next, and then the assembly language instruction followed by a
comment. Short on registers, ARM-32 saves two of them on the stack for later reuse along with the return
address. It uses an addressing mode that scales i and j to be byte addresses. Given that a branch has the
potential to change ISAs between ARM-32 and Thumb-2, bxcs first sets the least significant bit of the return
address to 0 before saving it. The condition codes save one compare instruction to check j after
decrementing it, but there are still three compares elsewhere.

# MIPS-32 (24 instructions, 96 bytes, or 56 bytes with microMIPS)

# a0 is pointer to a[0], a1 is n, a2 is pointer to a[i], a3 is pointer to a[j], v0 is j, v1 is i, t0 is x
0: 24860004 addiu a2,a0,4 # a2 is pointer to a[i]
4: 24030001 li v1,1 # i = 1
Outer Loop:
8: 0065102b sltu v0,v1,a1 # set on i < n
c: 14400003 bnez v0,1c # if i<n, jump to Continue Outer Loop
10: 00c03825 move a3,a2 # a3 is pointer to a[j] (slot filled)
14: 03e00008 jr ra # return from function
18: 00000000 nop # branch delay slot unfilled
Continue Outer Loop:
1c: 8cc80000 lw t0,0(a2) # x = a[i]
20: 00601025 move v0,v1 # j = i
Inner Loop:
24: 8ce9fffc lw t1,-4(a3) # t1 = a[j-1]
28: 00000000 nop # load delay slot unfilled
2c: 0109502a slt t2,t0,t1 # set a[i] < a[j-1]
30: 11400005 beqz t2,48 # if a[j-1]<=a[i], jump to Exit Inner Loop
34: 00000000 nop # branch delay slot unfilled
38: 2442ffff addiu v0,v0,-1 # j--
3c: ace90000 sw t1,0(a3) # a[j] = a[j-1]
40: 1440fff8 bnez v0,24 # if j != 0, jump to Inner Loop
44: 24e7fffc addiu a3,a3,-4 # decr. a3 to point to a[j] (slot filled)
Exit Inner Loop:
48: 00021080 sll v0,v0,0x2 # multiply v0 by 4
4c: 00821021 addu v0,a0,v0 # v0 now byte address of a[j]
50: ac480000 sw t0,0(v0) # a[j] = x
54: 24630001 addiu v1,v1,1 # i++
58: 1000ffeb b 8 # jump to Outer Loop
5c: 24c60004 addiu a2,a2,4 # incr. a2 to point to a[i] (slot filled)

Figure 2.10: MIPS-32 code for Insertion Sort in Figure 2.5. The address in hexadecimal is on the left, the
machine language code in hexadecimal is next, and then the assembly language instruction followed by a
comment. The MIPS-32 code has three nop instructions, which adds to its length. Two are due to delayed
branches and one is due to the delayed load. The compiler couldn’t find useful instructions to place in those
delay slots. The delayed branches also make the code harder to understand, since the instruction that follows
is also executed when a branch or jump is taken. For example, the last instruction (addiu) at address 5c is
part of the loop even though it trails the branch instruction.

# x86-32 (20 instructions, 45 bytes)

# eax is j, ebx is x, ecx is pointer to a[0], edx is i
# pointer to a[0] is in memory at address esp+0xc, n is in memory at esp+0x10
0: 56 push esi # save esi on stack (esi needed below)
1: 53 push ebx # save ebx on stack (ebx needed below)
2: ba 01 00 00 00 mov edx,0x1 # i = 1
7: 8b 4c 24 0c mov ecx,[esp+0xc] # ecx is pointer to a[0]
Outer Loop:
b: 3b 54 24 10 cmp edx,[esp+0x10] # compare i vs. n
f: 73 19 jae 2a <Exit Loop> # if i >= n, jump to Exit Outer Loop
11: 8b 1c 91 mov ebx,[ecx+edx*4] # x = a[i]
14: 89 d0 mov eax,edx # j = i
Inner Loop:
16: 8b 74 81 fc mov esi,[ecx+eax*4-0x4] # esi = a[j-1]
1a: 39 de cmp esi,ebx # compare a[j-1] vs. x
1c: 7e 06 jle 24 <Exit Loop> # if a[j-1]<=a[i],jump Exit Inner Loop
1e: 89 34 81 mov [ecx+eax*4],esi # a[j] = a[j-1]
21: 48 dec eax # j--
22: 75 f2 jne 16 <Inner Loop> # if j != 0, jump to Inner Loop
Exit Inner Loop:
24: 89 1c 81 mov [ecx+eax*4],ebx # a[j] = x
27: 42 inc edx # i++
28: eb e1 jmp b <Outer Loop> # jump to Outer Loop
Exit Outer Loop:
2a: 5b pop ebx # restore old value of ebx from stack
2b: 5e pop esi # restore old value of esi from stack
2c: c3 ret # return from function

Figure 2.11: x86-32 code for Insertion Sort in Figure 2.5. The address in hexadecimal is on the left, the
machine language code in hexadecimal is next, and then the assembly language instruction followed by a
comment. Lacking registers, the x86-32 saves two of them on the stack. Moreover, two of the variables
allocated to registers in RV32I are instead kept in memory (n and the pointer to a[0]). It uses the Scaled
Indexed addressing mode to good effect for accessing a[i] and a[j]. Seven of the 20 x86-32 instructions are
one byte long, which gives the x86-32 good code size for this simple program. There are two popular versions
of x86 assembly language: Intel/Microsoft and AT&T/Linux. We use the Intel syntax, in part because it
matches the operand order of RISC-V, ARM-32, and MIPS-32 with the destination on the left and the
source(s) on the right, while the operands are vice versa for AT&T (and the registers prepend a % before
their names). This seemingly trivial matter is nearly a religious issue for some programmers. Pedagogy
drives our choice, not orthodoxy.
3 RISC-V Assembly Language

It’s very satisfying to take a problem we thought difficult and find a simple solution. The
best solutions are always simple.
-- Ivan Sutherland

Ivan Sutherland (1938- ) is called the father of computer graphics for the invention of
Sketchpad, the 1962 forerunner of the graphical user interface in modern computing, which
led to a Turing Award.

3.1 Introduction
Figure 3.1 shows the four classic steps in translation starting from a C program and ending
with a machine-language program ready to run in the computer. This chapter covers the last
three steps, but we begin with the role the assembler plays in the RISC-V calling convention.

3.2 Calling convention

There are six general stages in calling a function [Patterson and Hennessy 2017]:
1. Place the arguments where the function can access them.
2. Jump to the function (using RV32I’s jal).
3. Acquire local storage resources the function needs, saving registers as required.
4. Perform the desired task of the function.
5. Place the function result value where the calling program can access it, restore any
registers, and release any local storage resources.
6. Since a function can be called from several points in a program, return control to the
point of origin (using ret).
To obtain good performance, try to keep variables in registers rather than memory, but on the
other hand, avoid going to memory frequently to save and restore these registers.
RISC-V fortunately has enough registers to offer the best of both worlds: keep operands
in registers yet reduce the need to save and restore them. The insight is to have some registers
that are not guaranteed to be preserved across a function call, called temporary registers, and
some that are, called saved registers. Functions that avoid calling other functions are called
leaf functions. When a leaf function has only a few arguments and local variables, we can
keep everything in registers without “spilling” any to memory. If these conditions don’t hold,
then the program must save register values in memory, but a surprising fraction of function
calls fall into this happy case.

C program (foo.c)
  -> Compiler ->
Assembly program (foo.s)
  -> Assembler ->
Object: machine language module (foo.o), plus Library: machine language module (lib.o)
  -> Linker ->
Executable: machine language program (a.out)
  -> Loader ->
Memory

Figure 3.1: Steps of translation from C source code to a running program. These are the logical steps,
although some steps are combined to accelerate translation. We use the Unix file suffix name convention for
each type of file. The equivalent suffixes in MS-DOS are .C, .ASM, .OBJ, .LIB, and .EXE.

Other registers within a function call must be considered either in the same class as saved
registers, which are preserved across a function call, or in the same class as the temporary
registers, which are not. A function will change the register(s) containing the return value(s),
so they are like temporary registers. There is no reason to preserve the registers to pass
arguments to functions, so they also are like temporaries. The caller can rely on the remaining
registers to be unchanged across a function call: the registers used for the return address
and the stack pointer. Figure 3.2 lists the RISC-V application binary interface (ABI) names
of registers and the convention on whether they are preserved or not across function calls.
Given the ABI conventions, we can see the standard RV32I code for function entry and
exit. The function prologue looks like this:
entry_label:
    addi sp,sp,-framesize    # Allocate space for stack frame
                             # by adjusting stack pointer (sp register)
    sw   ra,framesize-4(sp)  # Save return address (ra register)
    # save other registers to stack if needed
    ...                      # body of the function
If there are too many function arguments and variables to fit in the registers, the prologue
allocates space on the stack for the function frame, as it is called. After the task of the
function is complete, the epilogue undoes the stack frame and returns to the point of origin:

Register  ABI Name  Description                        Preserved across call?

x0        zero      Hard-wired zero                    -
x1        ra        Return address                     No
x2        sp        Stack pointer                      Yes
x3        gp        Global pointer                     -
x4        tp        Thread pointer                     -
x5        t0        Temporary/alternate link register  No
x6–7      t1–2      Temporaries                        No
x8        s0/fp     Saved register/frame pointer       Yes
x9        s1        Saved register                     Yes
x10–11    a0–1      Function arguments/return values   No
x12–17    a2–7      Function arguments                 No
x18–27    s2–11     Saved registers                    Yes
x28–31    t3–6      Temporaries                        No
f0–7      ft0–7     FP temporaries                     No
f8–9      fs0–1     FP saved registers                 Yes
f10–11    fa0–1     FP arguments/return values         No
f12–17    fa2–7     FP arguments                       No
f18–27    fs2–11    FP saved registers                 Yes
f28–31    ft8–11    FP temporaries                     No

Figure 3.2: Assembler mnemonics for RISC-V integer and floating-point registers. RISC-V has enough
registers that the ABI can allocate registers that procedures or methods are free to use without saving or
restoring when they don’t call other procedures or methods themselves. The registers preserved across a
procedure call are also named callee saved, versus caller saved for those that aren’t. Chapter 5 explains the
floating-point f registers. (Table 20.1 of [Waterman and Asanović 2017] is the basis of this figure.)

    # restore registers from stack if needed
    lw   ra,framesize-4(sp)  # Restore return address register
    addi sp,sp,framesize     # De-allocate space for stack frame
    ret                      # Return to calling point
We’ll see an example that follows this ABI shortly, but first we need to explain the re-
maining assembly tasks beyond turning the ABI register names into register numbers.

Elaboration: The saved and temporary registers aren’t contiguous

to support RV32E, an embedded version of RISC-V that has only 16 registers (see Chap-
ter 11). It simply uses register numbers x0 to x15, so some saved and temporary registers are
in this range, and the rest are in the last 16 registers. RV32E is smaller, but has no compiler
support yet, since it doesn’t match RV32I.

3.3 Assembly
The input to this step in Unix is a file with the suffix .s, such as foo.s; for MS-DOS it is
.ASM.
The job of the assembler step of Figure 3.1 is not simply to produce object code from the
instructions that the processor understands, but to extend them to include operations useful
for the assembly language programmer or the compiler writer. This category, based on clever
configurations of regular instructions, is called pseudoinstructions. Figures 3.3 and 3.4 list the
RISC-V pseudoinstructions, with those in the first figure all relying on register x0 to always
be zero while those in the second list do not. For example, ret mentioned above is actually
a pseudoinstruction that the assembler replaces with jalr x0, x1, 0 (see Figure 3.3). The
majority of RISC-V pseudoinstructions depend on x0. As you can see, setting aside one
of the 32 registers to be hardwired to zero greatly simplifies the RISC-V instruction set by
providing many popular operations—such as jump, return, and branch on equal to zero—as
pseudoinstructions.
Figure 3.5 shows the classic “Hello world” program in C. The compiler produces the
assembly language output in Figure 3.6 using the calling convention in Figure 3.2 and the
pseudoinstructions from Figures 3.3 and 3.4.

The “Hello world” program is typically the first program run on a newly designed processor.
Architects traditionally consider running the operating system well enough to print “Hello
world” as a strong sign that their new chip largely works. They email this output to their
management and colleagues, and then they celebrate.

The commands that start with a period are assembler directives. They are commands to
the assembler rather than code to be translated by it. They tell the assembler where to place
code and data, specify text and data constants for use in the program, and so forth. Figure 3.9
shows the assembler directives of RISC-V. For Figure 3.6, the directives are:

• .text: Enter text section.
• .align 2: Align following code to 2^2 bytes.
• .globl main: Declare global symbol “main”.
• .section .rodata: Enter read-only data section.
• .balign 4: Align data section to 4 bytes.
• .string “Hello, %s!\n”: Create this null-terminated string.
• .string “world”: Create this null-terminated string.
The assembler produces the object file in Figure 3.7 using the Executable and Linkable
Format (ELF) standard format [TIS Committee 1995].

Pseudoinstruction    Base Instruction(s)          Meaning

nop                  addi x0, x0, 0               No operation
neg rd, rs           sub rd, x0, rs               Two’s complement
negw rd, rs          subw rd, x0, rs              Two’s complement word
snez rd, rs          sltu rd, x0, rs              Set if ≠ zero
sltz rd, rs          slt rd, rs, x0               Set if < zero
sgtz rd, rs          slt rd, x0, rs               Set if > zero
beqz rs, offset      beq rs, x0, offset           Branch if = zero
bnez rs, offset      bne rs, x0, offset           Branch if ≠ zero
blez rs, offset      bge x0, rs, offset           Branch if ≤ zero
bgez rs, offset      bge rs, x0, offset           Branch if ≥ zero
bltz rs, offset      blt rs, x0, offset           Branch if < zero
bgtz rs, offset      blt x0, rs, offset           Branch if > zero
j offset             jal x0, offset               Jump
jr rs                jalr x0, rs, 0               Jump register
ret                  jalr x0, x1, 0               Return from subroutine
tail offset          auipc x6, offset[31:12]      Tail call far-away subroutine
                     jalr x0, x6, offset[11:0]
rdinstret[h] rd      csrrs rd, instret[h], x0     Read instructions-retired counter
rdcycle[h] rd        csrrs rd, cycle[h], x0       Read cycle counter
rdtime[h] rd         csrrs rd, time[h], x0        Read real-time clock
csrr rd, csr         csrrs rd, csr, x0            Read CSR
csrw csr, rs         csrrw x0, csr, rs            Write CSR
csrs csr, rs         csrrs x0, csr, rs            Set bits in CSR
csrc csr, rs         csrrc x0, csr, rs            Clear bits in CSR
csrwi csr, imm       csrrwi x0, csr, imm          Write CSR, immediate
csrsi csr, imm       csrrsi x0, csr, imm          Set bits in CSR, immediate
csrci csr, imm       csrrci x0, csr, imm          Clear bits in CSR, immediate
frcsr rd             csrrs rd, fcsr, x0           Read FP control/status register
fscsr rs             csrrw x0, fcsr, rs           Write FP control/status register
frrm rd              csrrs rd, frm, x0            Read FP rounding mode
fsrm rs              csrrw x0, frm, rs            Write FP rounding mode
frflags rd           csrrs rd, fflags, x0         Read FP exception flags
fsflags rs           csrrw x0, fflags, rs         Write FP exception flags

Figure 3.3: 32 RISC-V pseudoinstructions that rely on x0, the zero register. Appendix A includes
the RISC-V pseudoinstructions as well as the real instructions. Those that read the 64-bit counters can read
the upper 32 bits in RV32I by using the “h” version of the instructions and the lower 32 bits using the plain
version. (Tables 20.2 and 20.3 of [Waterman and Asanović 2017] are the basis of this figure.)
3.3. ASSEMBLY 37

Pseudoinstruction Base Instruction(s) Meaning

lla rd, symbol   auipc rd, symbol[31:12]; addi rd, rd, symbol[11:0]   Load local address
la rd, symbol   PIC: auipc rd, GOT[symbol][31:12]; l{w|d} rd, rd, GOT[symbol][11:0]. Non-PIC: Same as lla rd, symbol   Load address
l{b|h|w|d} rd, symbol   auipc rd, symbol[31:12]; l{b|h|w|d} rd, symbol[11:0](rd)   Load global
s{b|h|w|d} rd, symbol, rt   auipc rt, symbol[31:12]; s{b|h|w|d} rd, symbol[11:0](rt)   Store global
fl{w|d} rd, symbol, rt   auipc rt, symbol[31:12]; fl{w|d} rd, symbol[11:0](rt)   Floating-point load global
fs{w|d} rd, symbol, rt   auipc rt, symbol[31:12]; fs{w|d} rd, symbol[11:0](rt)   Floating-point store global
li rd, immediate Myriad sequences Load immediate
mv rd, rs addi rd, rs, 0 Copy register
not rd, rs xori rd, rs, -1 One’s complement
sext.w rd, rs addiw rd, rs, 0 Sign extend word
seqz rd, rs sltiu rd, rs, 1 Set if = zero
fmv.s rd, rs fsgnj.s rd, rs, rs Copy single-precision register
fabs.s rd, rs fsgnjx.s rd, rs, rs Single-precision absolute value
fneg.s rd, rs fsgnjn.s rd, rs, rs Single-precision negate
fmv.d rd, rs fsgnj.d rd, rs, rs Copy double-precision register
fabs.d rd, rs fsgnjx.d rd, rs, rs Double-precision absolute value
fneg.d rd, rs fsgnjn.d rd, rs, rs Double-precision negate
bgt rs, rt, offset blt rt, rs, offset Branch if >
ble rs, rt, offset bge rt, rs, offset Branch if ≤
bgtu rs, rt, offset bltu rt, rs, offset Branch if >, unsigned
bleu rs, rt, offset bgeu rt, rs, offset Branch if ≤, unsigned
jal offset jal x1, offset Jump and link
jalr rs jalr x1, rs, 0 Jump and link register
call offset   auipc x1, offset[31:12]; jalr x1, x1, offset[11:0]   Call far-away subroutine
fence fence iorw, iorw Fence on all memory and I/O
fscsr rd, rs csrrw rd, fcsr, rs Swap FP control/status register
fsrm rd, rs csrrw rd, frm, rs Swap FP rounding mode
fsflags rd, rs csrrw rd, fflags, rs Swap FP exception flags

Figure 3.4: 28 RISC-V pseudoinstructions that are independent of x0, the zero register. For la, GOT stands
for Global Offset Table, which holds the runtime address of symbols in dynamically linked libraries.
Appendix A includes the RISC-V pseudoinstructions as well as the real instructions. (Tables 20.2 and 20.3 of
[Waterman and Asanović 2017] are the basis of this figure.)

#include <stdio.h>
int main()
{
    /* Format string matches string1 in Figure 3.6. */
    printf("Hello, %s!\n", "world");
    return 0;
}

Figure 3.5: Hello World program in C (hello.c).

  .text                      # Directive: enter text section
  .align 2                   # Directive: align code to 2^2 bytes
  .globl main                # Directive: declare global symbol main
main:                        # label for start of main
  addi sp,sp,-16             # allocate stack frame
  sw   ra,12(sp)             # save return address
  lui  a0,%hi(string1)       # compute address of
  addi a0,a0,%lo(string1)    #   string1
  lui  a1,%hi(string2)       # compute address of
  addi a1,a1,%lo(string2)    #   string2
  call printf                # call function printf
  lw   ra,12(sp)             # restore return address
  addi sp,sp,16              # deallocate stack frame
  li   a0,0                  # load return value 0
  ret                        # return
  .section .rodata           # Directive: enter read-only data section
  .balign 4                  # Directive: align data section to 4 bytes
string1:                     # label for first string
  .string "Hello, %s!\n"     # Directive: null-terminated string
string2:                     # label for second string
  .string "world"            # Directive: null-terminated string

Figure 3.6: Hello World program in RISC-V assembly language (hello.s).

00000000 <main>:
0: ff010113 addi sp,sp,-16
4: 00112623 sw ra,12(sp)
8: 00000537 lui a0,0x0
c: 00050513 mv a0,a0
10: 000005b7 lui a1,0x0
14: 00058593 mv a1,a1
18: 00000097 auipc ra,0x0
1c: 000080e7 jalr ra
20: 00c12083 lw ra,12(sp)
24: 01010113 addi sp,sp,16
28: 00000513 li a0,0
2c: 00008067 ret

Figure 3.7: Hello World program in RISC-V machine language (hello.o). The six instructions that are later
patched by the linker (locations 8 to 1c) have zero in their address fields. The symbol table included in the
object file records the labels and addresses of all the instructions that need to be edited by the linker.
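
As a rough cross-check of Figures 3.3, 3.4, and 3.7, the sketch below assembles a few pseudoinstructions by hand in C. The helper names (encode_i_type, encode_addi, and so on) and the register constants are illustrative assumptions, not part of any toolchain; the field layout is the RV32I I-type format. Under those assumptions, ret, li a0,0, and mv a0,a0 come out as the words 00008067, 00000513, and 00050513 listed in Figure 3.7.

```c
#include <stdint.h>

/* Register numbers for the ABI names used in Figure 3.7. */
enum { X0 = 0, RA = 1, SP = 2, A0 = 10 };

/* Hypothetical helper: pack an RV32I I-type instruction.
   Layout (high to low): imm[11:0] | rs1 | funct3 | rd | opcode. */
uint32_t encode_i_type(int32_t imm, uint32_t rs1, uint32_t funct3,
                       uint32_t rd, uint32_t opcode)
{
    return (((uint32_t)imm & 0xfffu) << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

/* addi rd, rs1, imm: opcode 0010011, funct3 000. */
uint32_t encode_addi(uint32_t rd, uint32_t rs1, int32_t imm)
{
    return encode_i_type(imm, rs1, 0x0, rd, 0x13);
}

/* jalr rd, rs1, imm: opcode 1100111, funct3 000. */
uint32_t encode_jalr(uint32_t rd, uint32_t rs1, int32_t imm)
{
    return encode_i_type(imm, rs1, 0x0, rd, 0x67);
}

/* Each pseudoinstruction below expands to one base instruction. */
uint32_t encode_nop(void)                    { return encode_addi(X0, X0, 0); }
uint32_t encode_mv(uint32_t rd, uint32_t rs) { return encode_addi(rd, rs, 0); }
uint32_t encode_ret(void)                    { return encode_jalr(X0, RA, 0); }

/* li with an immediate that fits in 12 bits is a single addi from x0. */
uint32_t encode_li_small(uint32_t rd, int32_t imm)
{
    return encode_addi(rd, X0, imm);
}
```

The same helper reproduces the first word of main: encode_addi(SP, SP, -16) yields ff010113.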

000101b0 <main>:
101b0: ff010113 addi sp,sp,-16
101b4: 00112623 sw ra,12(sp)
101b8: 00021537 lui a0,0x21
101bc: a1050513 addi a0,a0,-1520 # 20a10 <string1>
101c0: 000215b7 lui a1,0x21
101c4: a1c58593 addi a1,a1,-1508 # 20a1c <string2>
101c8: 288000ef jal ra,10450 <printf>
101cc: 00c12083 lw ra,12(sp)
101d0: 01010113 addi sp,sp,16
101d4: 00000513 li a0,0
101d8: 00008067 ret

Figure 3.8: Hello World program as RISC-V machine language program after linking. In Unix systems, the
file would be named a.out.

Directive Description
.text Subsequent items are stored in the text section (machine code).
.data Subsequent items are stored in the data section (global variables).
.bss Subsequent items are stored in the bss section (global variables initialized to 0).
.section .foo Subsequent items are stored in the section named .foo.
.align n Align the next datum on a 2^n-byte boundary. For example, .align 2 aligns the next value on a word boundary.
.balign n Align the next datum on an n-byte boundary. For example, .balign 4 aligns the next value on a word boundary.
.globl sym Declare that label sym is global and may be referenced from other files.
.string "str" Store the string str in memory and null-terminate it.
.byte b1,..., bn Store the n 8-bit quantities in successive bytes of memory.
.half w1,...,wn Store the n 16-bit quantities in successive memory halfwords.
.word w1,...,wn Store the n 32-bit quantities in successive memory words.
.dword w1,...,wn Store the n 64-bit quantities in successive memory doublewords.
.float f1,..., fn Store the n single-precision floating-point numbers in successive memory words.
.double d1,..., dn Store the n double-precision floating-point numbers in successive memory doublewords.
.option rvc Compress subsequent instructions (see Chapter 7).
.option norvc Don’t compress subsequent instructions.
.option relax Allow linker relaxations for subsequent instructions.
.option norelax Don’t allow linker relaxations for subsequent instructions.
.option pic Subsequent instructions are position-independent code.
.option nopic Subsequent instructions are position-dependent code.
.option push Push the current setting of all .options to a stack, so that a subsequent .option pop will restore their value.
.option pop Pop the option stack, restoring all .options to their setting at the time of the last .option push.

Figure 3.9: Common RISC-V assembler directives.
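
The two alignment directives differ only in how the boundary is written. A minimal C sketch of the address arithmetic follows; the function names balign_up and align_up are hypothetical, and the code assumes the boundary is a power of two.

```c
#include <stdint.h>

/* Round addr up to the next multiple of `bytes`, as ".balign bytes" does.
   Assumes bytes is a power of two. */
uint32_t balign_up(uint32_t addr, uint32_t bytes)
{
    return (addr + bytes - 1) & ~(bytes - 1);
}

/* ".align n" aligns to 2^n bytes, so it behaves like ".balign (1 << n)". */
uint32_t align_up(uint32_t addr, unsigned n)
{
    return balign_up(addr, (uint32_t)1 << n);
}
```

For instance, both align_up(0x31, 2) and balign_up(0x31, 4) move the location up to the word boundary 0x34, while an address that is already word aligned is left unchanged.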


Address              Region
sp = bfff fff0hex    Stack
                     Dynamic data
                     Static data
     1000 0000hex
                     Text
pc = 0001 0000hex
                     Reserved
     0

Figure 3.10: RV32I allocation of memory to program and data. The high addresses are the top of the figure
and the low addresses are the bottom. In this RISC-V software convention, the stack pointer (sp) starts at
bfff fff0hex and grows down toward the Static data. The text (program code) starts at 0001 0000hex and
includes the statically-linked libraries. The Static data starts immediately above the text region; in this
example, we assume that address is 1000 0000hex. Dynamic data, allocated in C by malloc(), is just above
the Static data. Called the heap, it grows upward toward the stack. It includes the dynamically-linked
libraries.

3.4 Linker
Rather than compile all the source code every time one file changes, the linker allows
individual files to be compiled and assembled separately. It “stitches” the new object code together
with existing machine language modules, such as libraries. It derives its name from one of
its tasks, which is to edit all the links of the jump and link instructions in the object file. In
fact, linker is short for link editor, which was the historical name of this step in Figure 3.1. In
Unix systems, the inputs to the linker are files with the .o suffix (e.g., foo.o, libc.o), and
its output is the a.out file. For MS-DOS, the inputs are files with the suffix .OBJ or .LIB
and the output is a .EXE file.
Figure 3.10 shows the addresses of the regions of memory allocated for code and data
in a typical RISC-V program. The linker must adjust the program and data addresses of
instructions in all the object files to match addresses in this figure. It is less work for the
linker if the input files are position-independent code (PIC). PIC means that all the branches
to instructions and references to data inside the file are correct wherever the code is placed.
As mentioned in Chapter 2, the PC-relative branch of RV32I makes PIC much easier to fulfill.

In addition to the instructions, each object file contains a symbol table that includes all
the labels in the program that must be given addresses as part of the linking process. This list
includes labels to data as well as to code. Figure 3.6 has two data labels to be set (string1
and string2) and two code labels to be assigned (main and printf). Since it’s hard to
specify a 32-bit address within a single 32-bit instruction, the linker must adjust two
instructions per label in the RV32I code, as Figure 3.6 shows: lui and addi for data addresses, and
auipc and jalr for code addresses. Figure 3.8 shows the final linked a.out version of the
object file in Figure 3.7.
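
To see why Figure 3.8 loads string1 (address 20a10hex) with lui a0,0x21 followed by addi a0,a0,-1520, it may help to sketch how a 32-bit address can be split into the two immediates. The C functions below are an illustration with hypothetical names (hi20, lo12, lui_addi), not the linker's actual code. The key assumption is that addi sign-extends its 12-bit immediate, so the upper part must be rounded up whenever the low 12 bits are 800hex or more.

```c
#include <stdint.h>

/* Upper 20 bits for lui. Adding 0x800 compensates for addi
   sign-extending its 12-bit immediate. */
uint32_t hi20(uint32_t addr)
{
    return (addr + 0x800u) >> 12;
}

/* Low 12 bits of the address, interpreted as a signed value for addi. */
int32_t lo12(uint32_t addr)
{
    int32_t low = (int32_t)(addr & 0xfffu);
    return low >= 0x800 ? low - 0x1000 : low;
}

/* The value left in the register by lui followed by addi. */
uint32_t lui_addi(uint32_t hi, int32_t lo)
{
    return (hi << 12) + (uint32_t)lo;
}
```

With these definitions, hi20(0x20a10) is 0x21 and lo12(0x20a10) is -1520, and lui_addi(0x21, -1520) recovers 0x20a10, matching the two patched instructions at 101b8 and 101bc.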
RISC-V compilers support several ABIs, depending on whether the F and D extensions
are present. For RV32, the ABIs are named ilp32, ilp32f, and ilp32d. ilp32 means that the
C language data types int, long, and pointer are all 32 bits; the optional suffix indicates how
floating-point arguments are passed. In ilp32, floating-point arguments are passed in integer
registers. In ilp32f, single-precision floating-point arguments are passed in floating-point
registers. In ilp32d, double-precision floating-point arguments are also passed in floating-
point registers.
Naturally, to pass a floating-point argument in a floating-point register, you need the cor-
responding floating-point ISA extension F or D (see Chapter 5). So, to compile code for
RV32I (GCC flag -march=rv32i), you must use the ilp32 ABI (GCC flag -mabi=ilp32).
On the other hand, having floating-point instructions doesn’t mean the calling convention is
required to use them; so, for example, RV32IFD is compatible with all three ABIs: ilp32,
ilp32f, and ilp32d.
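
The compatibility rule in the last two paragraphs can be written down as a small predicate. This is only a sketch of the rule as stated in the text; the type rv32_abi and the function abi_supported are hypothetical names, and real toolchains check more than this.

```c
#include <stdbool.h>

typedef enum { ABI_ILP32, ABI_ILP32F, ABI_ILP32D } rv32_abi;

/* Returns true when an RV32 ISA with the given floating-point
   extensions can use the requested ABI. */
bool abi_supported(bool has_f, bool has_d, rv32_abi abi)
{
    switch (abi) {
    case ABI_ILP32:  return true;    /* FP arguments travel in integer registers */
    case ABI_ILP32F: return has_f;   /* needs single-precision registers */
    case ABI_ILP32D: return has_d;   /* needs double-precision registers */
    }
    return false;
}
```

Under this rule, RV32I accepts only ilp32, while RV32IFD accepts all three ABIs.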
The linker checks that the program’s ABI matches all of its libraries. Although the com-
piler supports many combinations of ISA extensions and ABIs, only a few sets of libraries
might be installed. Hence, a common pitfall is attempting to link a program without having
compatible libraries installed. The linker will not produce a helpful diagnostic message in
this case; it will simply attempt to link with an incompatible library, then inform you of the
incompatibility. This pitfall generally occurs only when compiling on one computer for a
different computer (cross compiling).

Elaboration: Linker relaxation

The jump and link instruction has a 20-bit PC-relative address field, so a single instruction
can jump far. While the compiler produces two instructions for each external function, quite
often only one instruction is necessary. Since this optimization saves both time and space,
linkers will make passes over the code to replace two instructions with one whenever they can.
Because a pass might shrink the distance between a call and the function so that it now fits in
a single instruction, the linker keeps optimizing the code until there are no further changes.
This process is called linker relaxation, with the name referring to relaxation techniques for
solving systems of equations. In addition to procedure calls, the RISC-V linker relaxes data
addressing to use the global pointer when the datum lies within ±2 KiB of gp, removing a
lui or auipc. It similarly relaxes thread-local storage addressing when the datum lies within
±2 KiB of tp.
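
The range tests behind these relaxations are simple to state. The C sketch below shows the two checks under the assumption that the jump offset is a 20-bit signed count of 2-byte units (roughly ±1 MiB) and that gp-relative accesses use a 12-bit signed byte offset. The function names are hypothetical, and a real linker must also account for alignment and for code moving as it shrinks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Can a single jal at address pc reach target?
   The offset must be even and lie in [-2^20, 2^20). */
bool fits_in_jal(uint32_t pc, uint32_t target)
{
    int64_t offset = (int64_t)target - (int64_t)pc;
    return offset >= -(1 << 20) && offset < (1 << 20) && (offset % 2) == 0;
}

/* Can addr be reached from gp with a 12-bit signed offset (-2048..2047)? */
bool fits_in_gp_offset(uint32_t gp, uint32_t addr)
{
    int64_t offset = (int64_t)addr - (int64_t)gp;
    return offset >= -2048 && offset <= 2047;
}
```

In Figure 3.8, the call from 101c8 to printf at 10450 is only 288hex bytes away, so fits_in_jal holds and the auipc/jalr pair of Figure 3.7 shrinks to a single jal.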

3.5 Static vs. Dynamic Linking

Architects typically measure processor performance using benchmarks that are statically linked
despite most real programs having dynamic links. The excuse is that users interested in
performance should link statically, but it’s a poor justification. It makes more sense to
accelerate performance of real programs, not benchmarks.

The prior section describes static linking, where all potential library code is linked and then
loaded together before execution. Such libraries can be relatively large, so linking a popular
library into multiple programs wastes memory. Moreover, the libraries are bound when
linked, even when they are updated later to fix bugs, forcing the statically-linked code to
use the old, buggy version.
To avoid both problems, most systems today rely on dynamic linking, where the desired
external function is loaded and linked to the program only after it is first called; if it’s never
called, it’s never loaded and linked. Every call after the first one uses a fast link, so the
dynamic overhead is only paid once. Each time a program starts it links in the current version
of the library functions it needs, which is how it can get the newest version. Furthermore,
if multiple programs use the same dynamically linked library, the code in the library need
appear only once in memory.
The code that the compiler generates resembles that for static linking. Instead of jumping
to the real function, it jumps to a short (three-instruction) stub function. The stub function
loads the address of the real function from a table in memory, then jumps to it. However, on
the first call, the table lacks the address of the real function, but instead contains the address
of the dynamic-linking routine. When invoked, the dynamic linker uses the symbol table to
find the real function, copies it into memory, and then updates the table to point to the real
function. Each subsequent call pays only the three-instruction overhead of the stub function.
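
The stub-and-table mechanism can be imitated in C with a function pointer. The sketch below is a toy model, not how a real dynamic linker is written: the names real_square, resolve_then_call, table_entry, and square_stub are all hypothetical, and the "library function" is already in memory rather than being located through a symbol table.

```c
typedef int (*func_ptr)(int);

/* The "real" library function that the program wants to call. */
int real_square(int x) { return x * x; }

/* Counts trips through the slow first-call path. */
int resolver_calls = 0;

/* Forward declaration of the stand-in for the dynamic-linking routine. */
int resolve_then_call(int x);

/* One-entry table: initially it holds the address of the resolver. */
func_ptr table_entry = resolve_then_call;

/* Stand-in for the dynamic linker: find the real function,
   patch the table, then complete the original call. */
int resolve_then_call(int x)
{
    resolver_calls++;
    table_entry = real_square;   /* later calls bypass the resolver */
    return real_square(x);
}

/* Stub: load the address from the table and jump to it. */
int square_stub(int x)
{
    return table_entry(x);
}
```

Calling square_stub twice runs the resolver only on the first call; the second call pays just the cost of the indirect jump through the table.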

3.6 Loader
A program like the one in Figure 3.8 is an executable file kept in the computer’s storage.
When one is to be run, the loader’s job is to load it into memory and jump to the starting
address. The “loader” today is the operating system; stated alternatively, loading a.out is
one of many tasks of an operating system.
Loading is a little trickier for dynamically-linked programs. Instead of simply starting the
program, the operating system starts the dynamic linker. It in turn starts the desired program,
and then handles all first-time external calls, copies the functions into memory, and edits the
program after each call to point it to the correct function.

3.7 Concluding Remarks

Keep it simple, stupid.
—Kelly Johnson, aeronautical engineer who coined the “KISS Principle,” 1960

The assembler enhances the simple RISC-V ISA with 60 pseudoinstructions that make
RISC-V code easier to read and to write without increasing hardware costs. Simply dedicat-
ing one RISC-V register to zero enables many of these helpful operations. The Load Upper
Immediate (lui) and Add Upper Immediate to PC (auipc) instructions make it easier for
the compiler and linker to adjust addresses for external data and functions, and PC-relative
branching makes it easier to help the linker with position-independent code. Having plenty
of registers enables a calling convention that makes function call and return faster by reducing
the number of register spills and restores.
RISC-V offers a tasteful collection of simple but impactful mechanisms that reduce cost,
improve performance, and make it easier to program.

3.8 To Learn More

D. A. Patterson and J. L. Hennessy. Computer Organization and Design RISC-V Edition:
The Hardware Software Interface. Morgan Kaufmann, 2017.
TIS Committee. Tool interface standard (TIS) executable and linking format (ELF) specifi-
cation version 1.2. TIS Committee, 1995.
A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Version 2.2. May 2017. URL https://riscv.org/specifications/.
NOTES 43

Notes
1 http://parlab.eecs.berkeley.edu
4 RV32M: Multiply and Divide

Entities should not be multiplied beyond necessity.
—William of Occam, 1320

William of Occam (1287-1347) was an English theologian who promoted what is now called
“Occam’s razor,” a preference for simplicity in the scientific method.

4.1 Introduction
RV32M adds integer multiply and divide instructions to RV32I. Figure 4.1 is a graphical
representation of the RV32M extension instruction set and Figure 4.2 lists their opcodes.
Divide is straightforward. Recall that

Quotient = (Dividend − Remainder) ÷ Divisor

or alternatively
Dividend = Quotient × Divisor + Remainder
Remainder = Dividend − (Quotient × Divisor)

RV32M has divide instructions for both signed and unsigned integers: divide (div) and
divide unsigned (divu), which place the quotient into the destination register. Less frequently,
programmers want the remainder instead of the quotient, so RV32M offers remainder (rem)
and remainder unsigned (remu), which write the remainder instead of the quotient.

srl can do unsigned division by 2^i. For example, if a2 = 16 (2^4) then srli t2,a1,4
produces the same value as divu t2,a1,a2.

RV32M instruction names:
multiply
multiply high [unsigned | signed unsigned]
divide [unsigned]
remainder [unsigned]

Figure 4.1: Diagram of the RV32M instructions.

4.1. INTRODUCTION 45

31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 000 rd 0110011 R mul
0000001 rs2 rs1 001 rd 0110011 R mulh
0000001 rs2 rs1 010 rd 0110011 R mulhsu
0000001 rs2 rs1 011 rd 0110011 R mulhu
0000001 rs2 rs1 100 rd 0110011 R div
0000001 rs2 rs1 101 rd 0110011 R divu
0000001 rs2 rs1 110 rd 0110011 R rem
0000001 rs2 rs1 111 rd 0110011 R remu

Figure 4.2: RV32M opcode map has instruction layout, opcodes, format type, and names. (Table 19.2 of
[Waterman and Asanović 2017] is the basis of this figure.)

# Compute unsigned division of a0 by 3 using multiplication.

0: aaaab2b7 lui t0,0xaaaab # t0 = 0xaaaaaaab
4: aab28293 addi t0,t0,-1365 # = ~ 2^32 / 1.5
8: 025535b3 mulhu a1,a0,t0 # a1 = ~ (a0 / 1.5)
c: 0015d593 srli a1,a1,0x1 # a1 = (a0 / 3)

Figure 4.3: RV32M code to divide by a constant by multiplying. It takes careful numerical analysis to show
that this algorithm works for any dividend, and for some other divisors, the correction step is more
complicated. The proof of correctness, and the algorithm for generating the reciprocals and correction steps,
is in [Granlund and Montgomery 1994].
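
The four instructions of Figure 4.3 translate almost line for line into C, which makes the trick easy to try on a desktop machine. In this sketch the function names mulhu32 and div3 are hypothetical; the constant aaaaaaabhex and the final shift by one come directly from the figure.

```c
#include <stdint.h>

/* Upper 32 bits of the unsigned 64-bit product, as mulhu computes. */
uint32_t mulhu32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Unsigned division by 3, mirroring the instructions of Figure 4.3. */
uint32_t div3(uint32_t a0)
{
    uint32_t t0 = 0xaaaaaaabu;        /* lui + addi: about 2^32 / 1.5 */
    uint32_t a1 = mulhu32(a0, t0);    /* mulhu: about a0 / 1.5        */
    return a1 >> 1;                   /* srli: a0 / 3                 */
}
```

For example, div3(100) gives 33, and the largest input, ffff ffffhex, gives 5555 5555hex.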

sll can do signed and unsigned multiplication by 2^i. For example, if a2 = 16 (2^4) then
slli t2,a1,4 produces the same value as mul t2,a1,a2.

The multiply equation is simply:

Product = Multiplicand × Multiplier

It’s more complicated than divide because the size of the product is the sum of the sizes of the
multiplier and the multiplicand; multiplying two 32-bit numbers yields a 64-bit product. To
produce a properly signed or unsigned 64-bit product, RISC-V has four multiply instructions.
To get the integer 32-bit product (the lower 32 bits of the full product), use mul. To get the
upper 32 bits of the 64-bit product, use mulh if both operands are signed, mulhu if both
operands are unsigned, and mulhsu if one is signed and the other is unsigned. Since it
would complicate the hardware to write the 64-bit product into two 32-bit registers in one
instruction, RV32M requires two multiply instructions to produce the 64-bit product.
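
One way to keep the four multiply instructions straight is to model each with 64-bit C arithmetic. The functions below are a sketch with hypothetical names (rv_mul, rv_mulh, rv_mulhu, rv_mulhsu); they describe the results the instructions produce, not how hardware computes them.

```c
#include <stdint.h>

/* mul: lower 32 bits of the product; the same bits result whether
   the operands are treated as signed or unsigned. */
uint32_t rv_mul(uint32_t a, uint32_t b)
{
    return a * b;
}

/* mulh: upper 32 bits of signed x signed. */
uint32_t rv_mulh(int32_t a, int32_t b)
{
    return (uint32_t)((uint64_t)((int64_t)a * (int64_t)b) >> 32);
}

/* mulhu: upper 32 bits of unsigned x unsigned. */
uint32_t rv_mulhu(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
}

/* mulhsu: upper 32 bits of signed a x unsigned b. */
uint32_t rv_mulhsu(int32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)((int64_t)a * (int64_t)b) >> 32);
}
```

The full 64-bit product of two unsigned values is then the pair (rv_mulhu(a, b), rv_mul(a, b)), which corresponds to the two-instruction sequence described above.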
For many microprocessors, integer division is a relatively slow operation. As mentioned
above, right shifts can replace unsigned division by powers of 2. It turns out that division
by other constants can be optimized, too, by multiplying by the approximate reciprocal then
applying a correction to the upper half of the product. For example, Figure 4.3 shows the
code for unsigned division by 3.
What’s Different? ARM-32 long had multiply but no divide instruction. Divide didn’t
become mandatory until 2005, almost 20 years after the first ARM processor. MIPS-32
uses special registers (HI and LO) as the sole destination registers for multiply and divide
instructions. While this design reduced the complexity of early MIPS implementations, it
takes an extra move instruction to use the result of the multiply or divide, potentially reducing
performance. The HI and LO registers also increase the architectural state, making it slightly
slower to switch between tasks.

For almost all processors, multiplies are slower than shifts or adds and divides are much
slower than multiplies.

Elaboration: mulh and mulhu can check for overflow in multiplication.

There is no overflow when using mul for unsigned multiplication if the result of mulhu is
zero. Similarly, there is no overflow when using mul for signed multiplication if all bits in
the result of mulh match the sign bit of the result of mul, i.e., equals 0 if positive or ffff ffffhex
if negative.
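
Both overflow tests can be expressed in a few lines of C. In this sketch the function names are hypothetical, and the 64-bit product stands in for the mul/mulhu or mul/mulh instruction pair.

```c
#include <stdbool.h>
#include <stdint.h>

/* Unsigned: the 32-bit product is exact when the upper half (mulhu) is zero. */
bool umul_overflows(uint32_t a, uint32_t b)
{
    uint32_t high = (uint32_t)(((uint64_t)a * b) >> 32);   /* mulhu */
    return high != 0;
}

/* Signed: the 32-bit product is exact when every bit of the upper
   half (mulh) equals the sign bit of the lower half (mul). */
bool smul_overflows(int32_t a, int32_t b)
{
    uint64_t product = (uint64_t)((int64_t)a * (int64_t)b);
    uint32_t low  = (uint32_t)product;            /* mul  */
    uint32_t high = (uint32_t)(product >> 32);    /* mulh */
    uint32_t sign_fill = (low & 0x80000000u) ? 0xffffffffu : 0u;
    return high != sign_fill;
}
```

For example, 65536 × 32768 = 2^31 overflows as a signed product but not as an unsigned one, whereas −65536 × 32768 = −2^31 still fits in a signed 32-bit register.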

Elaboration: It’s also easy to check for divide by zero.

Just add a beqz test of the divisor before the divide. RV32M doesn’t trap on divide by zero
because few programs want that behavior, and the ones that do can easily check for zero in
software. Of course, divides by constants never need checks.

Elaboration: mulhsu is useful for multi-word signed multiplication.

mulhsu generates the upper half of the product when the multiplier is signed and the
multiplicand is unsigned. It is useful as a substep of multi-word signed multiplication when multiplying the
most-significant word of the multiplier (which contains the sign bit) with the less-significant
words of the multiplicand (which are unsigned). This instruction improves performance of
multi-word multiplication by about 15%.

4.2 Concluding Remarks

The cheapest, fastest, and most reliable components are those that aren’t there.
—C. Gordon Bell, architect of prominent minicomputers

To offer the smallest RISC-V processor for embedded applications, multiply and divide are
part of the first optional standard extension of RISC-V. Nevertheless, many RISC-V proces-
sors will include RV32M.

4.3 To Learn More

T. Granlund and P. L. Montgomery. Division by invariant integers using multiplication. In
ACM SIGPLAN Notices, volume 29, pages 61–72. ACM, 1994.

A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Version 2.2. May 2017. URL https://riscv.org/specifications/.

Notes
1 http://parlab.eecs.berkeley.edu
5 RV32F and RV32D: Single- and
Double-Precision Floating Point

Perfection is finally attained not when there is no longer anything to add, but when there
is no longer anything to take away.
—Antoine de Saint Exupéry, L’Avion, 1940

Antoine de Saint Exupéry (1900-1944) was a French writer and aviator best known for the book
The Little Prince.
5.1 Introduction
Although RV32F and RV32D are separate, optional instruction set extensions, they are often
included together. Given single- and double-precision (32- and 64-bit) versions of nearly
all floating-point instructions, for brevity we present them in one chapter. Figure 5.1 is a
graphical representation of the RV32F and RV32D extension instruction sets. Figure 5.2 lists
the opcodes of RV32F and Figure 5.3 lists the opcodes of RV32D. Like virtually all other
modern ISAs, RISC-V obeys the IEEE 754-2008 floating-point standard [IEEE Standards
Committee 2008].

5.2 Floating-Point Registers

RV32F and RV32D use 32 separate f registers instead of the x registers. The main reason
for the two sets of registers is that processors can improve performance by doubling the
register capacity and bandwidth by having two sets of registers without increasing the space
for the register specifier in the cramped RISC-V instruction format. The major impact on
the instruction set is to have new instructions to load and store the f registers and to transfer
data between the x and f registers. Figure 5.4 lists the RV32D and RV32F registers and their
names as determined by the RISC-V ABI.
If a processor has both RV32F and RV32D, the single-precision data uses only the lower
32 bits of the f registers. Unlike x0 in RV32I, register f0 is not hardwired to 0 but is an
alterable register like all the other 31 f registers.
The IEEE 754-2008 standard provides several ways to round floating-point arithmetic,
which are helpful to determine error bounds and in writing numerical libraries. The most
accurate and most common is round to nearest even (RNE). The rounding mode is set in the
floating-point control and status register fcsr. Figure 5.5 shows fcsr and lists the rounding
options. It also holds the accrued exception flags that the standard requires.
What’s Different? Both ARM-32 and MIPS-32 have 32 single-precision floating-point
registers but only 16 double-precision registers. They both map two single-precision registers
into the left and right 32-bit halves of a double-precision register. x86-32 floating-point
arithmetic didn’t have any registers, but used a stack instead. The stack entries were 80 bits
wide to improve accuracy, so loads convert 32-bit or 64-bit operands to 80 bits, and vice versa
for stores. A subsequent version of x86-32 added 8 traditional 64-bit floating-point registers
and associated instructions. Unlike RV32FD and MIPS-32, ARM-32 and x86-32 overlooked
instructions to move data directly between floating-point and integer registers. The only
solution is to store a floating-point register in memory and then load it from memory to an
integer register, and vice versa.

Having only 16 double-precision registers was the most painful ISA error in MIPS according
to John Mashey, one of its architects.

RV32F and RV32D instruction names:
Floating-Point Computation
  float {add | subtract | multiply | divide | square root | minimum | maximum} {.single | .double}
  float [negative] multiply {add | subtract} {.single | .double}
Load and Store
  float {load | store} {word | doubleword}
Conversion
  float convert to {.single | .double} from .word [unsigned]
  float convert to .word [unsigned] from {.single | .double}
  float convert to .single from .double
  float convert to .double from .single
  float move to .single from .x register
  float move to .x register from .single
Comparison
  compare float {equals | less than | less than or equals} {.single | .double}
Other instructions
  float sign injection [negative | exclusive or] {.single | .double}
  float classify {.single | .double}

Figure 5.1: Diagram of the RV32F and RV32D instructions.

Elaboration: RV32FD allows the rounding mode to be set per instruction.

Called static rounding, it helps performance when you only need to change the rounding
mode for one instruction. The default is to use the dynamic rounding mode in the fcsr.
Static rounding is specified as an optional last argument; for example, fadd.s ft0, ft1, ft2, rtz
rounds towards zero, irrespective of fcsr. The caption of Figure 5.5 lists the names of
the rounding modes.

5.3 Floating-Point Loads, Stores, and Arithmetic

RISC-V has two load instructions (flw, fld) and two store instructions (fsw, fsd) for RV32F
and RV32D. They have the same addressing mode and instruction format as lw and sw.
Adding to the standard arithmetic operations (fadd.s, fadd.d, fsub.s, fsub.d,
fmul.s, fmul.d, fdiv.s, fdiv.d), RV32F and RV32D include square root (fsqrt.s,
fsqrt.d). They also have minimum and maximum (fmin.s, fmin.d, fmax.s,
fmax.d), which write the smaller or larger values from the pair of source operands without
using a branch instruction.

Unlike integer arithmetic, the size of the product from a floating-point multiply is the same
as its operands. Also, RV32F and RV32D omit floating-point remainder instructions.
50 CHAPTER 5. RV32FD: SINGLE/DOUBLE FLOATING POINT

31 27 26 25 24 20 19 15 14 12 11 7 6 0
imm[11:0] rs1 010 rd 0000111 I flw
imm[11:5] rs2 rs1 010 imm[4:0] 0100111 S fsw
rs3 00 rs2 rs1 rm rd 1000011 R4 fmadd.s
rs3 00 rs2 rs1 rm rd 1000111 R4 fmsub.s
rs3 00 rs2 rs1 rm rd 1001011 R4 fnmsub.s
rs3 00 rs2 rs1 rm rd 1001111 R4 fnmadd.s
0000000 rs2 rs1 rm rd 1010011 R fadd.s
0000100 rs2 rs1 rm rd 1010011 R fsub.s
0001000 rs2 rs1 rm rd 1010011 R fmul.s
0001100 rs2 rs1 rm rd 1010011 R fdiv.s
0101100 00000 rs1 rm rd 1010011 R fsqrt.s
0010000 rs2 rs1 000 rd 1010011 R fsgnj.s
0010000 rs2 rs1 001 rd 1010011 R fsgnjn.s
0010000 rs2 rs1 010 rd 1010011 R fsgnjx.s
0010100 rs2 rs1 000 rd 1010011 R fmin.s
0010100 rs2 rs1 001 rd 1010011 R fmax.s
1100000 00000 rs1 rm rd 1010011 R fcvt.w.s
1100000 00001 rs1 rm rd 1010011 R fcvt.wu.s
1110000 00000 rs1 000 rd 1010011 R fmv.x.w
1010000 rs2 rs1 010 rd 1010011 R feq.s
1010000 rs2 rs1 001 rd 1010011 R flt.s
1010000 rs2 rs1 000 rd 1010011 R fle.s
1110000 00000 rs1 001 rd 1010011 R fclass.s
1101000 00000 rs1 rm rd 1010011 R fcvt.s.w
1101000 00001 rs1 rm rd 1010011 R fcvt.s.wu
1111000 00000 rs1 000 rd 1010011 R fmv.w.x

Figure 5.2: RV32F opcode map has instruction layout, opcodes, format type, and names. The primary
difference in the encodings between this and the next figure is that bit 12 is 0 for the first two instructions
and bit 25 is 0 for the rest of the instructions, whereas both bits are 1 in RV32D. (Table 19.2 of [Waterman and
Asanović 2017] is the basis of this figure.)

31 27 26 25 24 20 19 15 14 12 11 7 6 0
imm[11:0] rs1 011 rd 0000111 I fld
imm[11:5] rs2 rs1 011 imm[4:0] 0100111 S fsd
rs3 01 rs2 rs1 rm rd 1000011 R4 fmadd.d
rs3 01 rs2 rs1 rm rd 1000111 R4 fmsub.d
rs3 01 rs2 rs1 rm rd 1001011 R4 fnmsub.d
rs3 01 rs2 rs1 rm rd 1001111 R4 fnmadd.d
0000001 rs2 rs1 rm rd 1010011 R fadd.d
0000101 rs2 rs1 rm rd 1010011 R fsub.d
0001001 rs2 rs1 rm rd 1010011 R fmul.d
0001101 rs2 rs1 rm rd 1010011 R fdiv.d
0101101 00000 rs1 rm rd 1010011 R fsqrt.d
0010001 rs2 rs1 000 rd 1010011 R fsgnj.d
0010001 rs2 rs1 001 rd 1010011 R fsgnjn.d
0010001 rs2 rs1 010 rd 1010011 R fsgnjx.d
0010101 rs2 rs1 000 rd 1010011 R fmin.d
0010101 rs2 rs1 001 rd 1010011 R fmax.d
0100000 00001 rs1 rm rd 1010011 R fcvt.s.d
0100001 00000 rs1 rm rd 1010011 R fcvt.d.s
1010001 rs2 rs1 010 rd 1010011 R feq.d
1010001 rs2 rs1 001 rd 1010011 R flt.d
1010001 rs2 rs1 000 rd 1010011 R fle.d
1110001 00000 rs1 001 rd 1010011 R fclass.d
1100001 00000 rs1 rm rd 1010011 R fcvt.w.d
1100001 00001 rs1 rm rd 1010011 R fcvt.wu.d
1101001 00000 rs1 rm rd 1010011 R fcvt.d.w
1101001 00001 rs1 rm rd 1010011 R fcvt.d.wu

Figure 5.3: RV32D opcode map has instruction layout, opcodes, format type, and names. Some
instructions in these two figures do not simply differ by data width. This figure uniquely has fcvt.s.d and
fcvt.d.s while the other has fmv.x.w and fmv.w.x. (Table 19.2 of [Waterman and Asanović 2017] is the
basis of this figure.)

63 32 31 0
f0 / ft0 FP Temporary
f1 / ft1 FP Temporary
f2 / ft2 FP Temporary
f3 / ft3 FP Temporary
f4 / ft4 FP Temporary
f5 / ft5 FP Temporary
f6 / ft6 FP Temporary
f7 / ft7 FP Temporary
f8 / fs0 FP Saved register
f9 / fs1 FP Saved register
f10 / fa0 FP Function argument, return value
f11 / fa1 FP Function argument, return value
f12 / fa2 FP Function argument
f13 / fa3 FP Function argument
f14 / fa4 FP Function argument
f15 / fa5 FP Function argument
f16 / fa6 FP Function argument
f17 / fa7 FP Function argument
f18 / fs2 FP Saved register
f19 / fs3 FP Saved register
f20 / fs4 FP Saved register
f21 / fs5 FP Saved register
f22 / fs6 FP Saved register
f23 / fs7 FP Saved register
f24 / fs8 FP Saved register
f25 / fs9 FP Saved register
f26 / fs10 FP Saved register
f27 / fs11 FP Saved register
f28 / ft8 FP Temporary
f29 / ft9 FP Temporary
f30 / ft10 FP Temporary
f31 / ft11 FP Temporary
32 32

Figure 5.4: The floating-point registers of RV32F and RV32D. The single-precision registers occupy the
rightmost half of the 32 double-precision registers. Chapter 3 explains the RISC-V calling convention for the
floating-point registers, the rationale behind the FP Argument registers (fa0-fa7), FP Saved registers
(fs0-fs11), and FP Temporaries (ft0-ft11). (Table 20.1 of [Waterman and Asanović 2017] is the basis of this
figure.)
5.4. FLOATING-POINT MOVES AND CONVERTS 53

31        8 7                 5 4   3   2   1   0
Reserved    Rounding Mode (frm) Accrued Exceptions (fflags)
                                NV  DZ  OF  UF  NX
24          3                   1   1   1   1   1

Figure 5.5: Floating-point control and status register. It holds the rounding modes and the exception flags.
The rounding modes are round to nearest, ties to even (rte, 000 in frm); round towards zero (rtz, 001); round
down, towards −∞ (rdn, 010); round up, towards +∞ (rup, 011); and round to nearest, ties to max
magnitude (rmm, 100). The five accrued exception flags indicate the exception conditions that have arisen on
any floating-point arithmetic instruction since the field was last reset by software: NV is Invalid Operation;
DZ is Divide by Zero; OF is Overflow; UF is Underflow; and NX is Inexact. (Figure 8.2 of [Waterman and
Asanović 2017] is the basis of this figure.)

Many floating-point algorithms, such as matrix multiply, perform a multiply immediately
followed by an addition or a subtraction. Hence, RISC-V offers instructions that multiply
two operands and then either add a third operand to that product (fmadd.s, fmadd.d) or
subtract it from that product (fmsub.s, fmsub.d) before writing the result. It also has
versions that negate the product before adding or subtracting the third operand: fnmadd.s,
fnmadd.d, fnmsub.s, fnmsub.d. These fused multiply-add instructions are more accurate as
well as faster than separate multiply and add instructions, because they round only once
(after the add) rather than twice (after the multiply, then after the add). These instructions
need a new instruction format to specify 4 registers, called R4. Figures 5.2 and 5.3 show the
R4 format, which is a variation of the R format.
Instead of floating-point branch instructions, RV32F and RV32D supply comparison in-
structions that set an integer register to 1 or 0 based on comparison of two floating-point
registers: feq.s, feq.d, flt.s, flt.d, fle.s, fle.d. These instructions allow an
integer branch instruction to jump based on a floating-point condition. For example, this code
branches to Exit if f1 < f2:
flt.d x5, f1, f2 # x5 = 1 if f1 < f2; otherwise x5 = 0
bne x5, x0, Exit # if x5 != 0, jump to Exit

5.4 Floating-Point Converts and Moves

RV32F and RV32D have instructions that perform all combinations of useful conversions
between 32-bit signed integers, 32-bit unsigned integers, 32-bit floating point, and 64-bit
floating point. Figure 5.6 displays these 10 instructions by source data type and converted
destination data type.
RV32F also offers instructions to move data to x from f registers (fmv.x.w) and vice
versa (fmv.w.x).
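
The difference between a convert and a move is easy to blur, so here is a small C model of both. It is a sketch under stated assumptions: float is IEEE 754 binary32, the function names are hypothetical, and the C cast always truncates toward zero, whereas fcvt.w.s uses whichever rounding mode is selected. A C cast is also undefined for out-of-range values, which the real instruction handles differently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Like fcvt.w.s with round-towards-zero: converts the numeric value. */
int32_t convert_float_to_int(float f) {
    return (int32_t)f;
}

/* Like fcvt.s.w: converts the numeric value the other way. */
float convert_int_to_float(int32_t i) {
    return (float)i;
}

/* Like fmv.x.w: copies the 32 bits unchanged from an f register to an x register. */
uint32_t move_float_bits_to_int(float f) {
    uint32_t bits;
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Like fmv.w.x: copies the 32 bits unchanged from an x register to an f register. */
float move_int_bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For instance, converting 1.0f yields the integer 1, while moving it yields the bit pattern 0x3F800000.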

5.5 Miscellaneous Floating-Point Instructions

RV32F and RV32D offer unusual instructions that help with math libraries as well as provide
useful pseudoinstructions. (The IEEE 754 floating-point standard requires a way to copy and
manipulate signs and to classify floating-point data, which inspired these instructions.)
The first are the sign-injection instructions, which copy everything from the first source
register except the sign bit. The value of the sign bit depends on the instruction:

                                         From
To                         32b signed    32b unsigned   32b floating   64b floating
                           integer (w)   integer (wu)   point (s)      point (d)
32b signed integer (w)     –             –              fcvt.w.s       fcvt.w.d
32b unsigned integer (wu)  –             –              fcvt.wu.s      fcvt.wu.d
32b floating point (s)     fcvt.s.w      fcvt.s.wu      –              fcvt.s.d
64b floating point (d)     fcvt.d.w      fcvt.d.wu      fcvt.d.s       –

Figure 5.6: RV32F and RV32D conversion instructions. The columns list the source data types and the rows
show the converted destination data type.

1. Float sign inject (fsgnj.s, fsgnj.d): the result’s sign bit is rs2’s sign bit.
2. Float sign inject negative (fsgnjn.s, fsgnjn.d): the result’s sign bit is the opposite
of rs2’s sign bit.
3. Float sign inject exclusive-or (fsgnjx.s, fsgnjx.d): the sign bit is the XOR of the
sign bits of rs1 and rs2.
As well as helping with sign manipulation in math libraries, sign-injection instructions
provide three popular floating-point pseudoinstructions (see Figure 3.4 on page 37):
1. Copy floating-point register:
fmv.s rd,rs is really fsgnj.s rd,rs,rs and
fmv.d rd,rs is really fsgnj.d rd,rs,rs.
2. Negate:
fneg.s rd,rs maps to fsgnjn.s rd,rs,rs and
fneg.d rd,rs maps to fsgnjn.d rd,rs,rs.
3. Absolute value (since 0 ⊕ 0 = 0 and 1 ⊕ 1 = 0):
fabs.s rd,rs becomes fsgnjx.s rd,rs,rs and
fabs.d rd,rs becomes fsgnjx.d rd,rs,rs.
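
To see why one family of instructions yields all three pseudoinstructions, the double-precision versions can be modeled on raw bit patterns. This is a sketch assuming IEEE 754 binary64 doubles; the C function names simply echo the instruction names and are not a real API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIGN_BIT 0x8000000000000000ULL

static uint64_t bits_of(double d) {
    uint64_t b;
    assert(sizeof b == sizeof d);
    memcpy(&b, &d, sizeof b);
    return b;
}

static double double_of(uint64_t b) {
    double d;
    memcpy(&d, &b, sizeof d);
    return d;
}

/* Everything but the sign comes from rs1; the sign comes from rs2. */
double fsgnj_d(double rs1, double rs2) {
    return double_of((bits_of(rs1) & ~SIGN_BIT) | (bits_of(rs2) & SIGN_BIT));
}

/* Same, but the sign is the opposite of rs2's sign. */
double fsgnjn_d(double rs1, double rs2) {
    return double_of((bits_of(rs1) & ~SIGN_BIT) | (~bits_of(rs2) & SIGN_BIT));
}

/* Same, but the sign is the XOR of the two sign bits. */
double fsgnjx_d(double rs1, double rs2) {
    return double_of(bits_of(rs1) ^ (bits_of(rs2) & SIGN_BIT));
}

/* The three pseudoinstructions pass the same register twice. */
double fmv_d(double rs)  { return fsgnj_d(rs, rs); }
double fneg_d(double rs) { return fsgnjn_d(rs, rs); }
double fabs_d(double rs) { return fsgnjx_d(rs, rs); }
```

Because fabs_d XORs a sign bit with itself, the result's sign is always 0, matching the parenthetical note in item 3.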
The second unusual floating-point instruction is classify (fclass.s, fclass.d).
Classify instructions are also a great aid to math libraries. They test a source operand to see
which of 10 floating-point properties apply (see the table below), and then write a mask into
the lower 10 bits of the destination integer register with the answer. Only one of the ten bits
is set to 1, with the rest set to 0s.

x[rd] bit Meaning

0 f[rs1] is −∞.
1 f[rs1] is a negative normal number.
2 f[rs1] is a negative subnormal number.
3 f[rs1] is −0.
4 f[rs1] is +0.
5 f[rs1] is a positive subnormal number.
6 f[rs1] is a positive normal number.
7 f[rs1] is +∞.
8 f[rs1] is a signaling NaN.
9 f[rs1] is a quiet NaN.
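
The table above can be expressed as a short C function that returns the same one-hot mask for a double. This is a minimal sketch, assuming IEEE 754 binary64 and the IEEE 754-2008 convention that the most significant fraction bit distinguishes quiet from signaling NaNs; the name fclass_d is borrowed from the instruction for readability.

```c
#include <assert.h>
#include <math.h>    /* INFINITY and NAN, handy for trying the function out */
#include <stdint.h>
#include <string.h>

/* Return a mask with exactly one of bits 0..9 set, following the table above. */
unsigned fclass_d(double value) {
    uint64_t bits;
    assert(sizeof bits == sizeof value);
    memcpy(&bits, &value, sizeof bits);

    int negative = (bits >> 63) != 0;
    uint64_t exponent = (bits >> 52) & 0x7FF;
    uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;   /* low 52 bits */

    if (exponent == 0x7FF) {
        if (fraction == 0)
            return negative ? 1u << 0 : 1u << 7;      /* -inf or +inf */
        return (fraction >> 51) ? 1u << 9 : 1u << 8;  /* quiet or signaling NaN */
    }
    if (exponent == 0) {
        if (fraction == 0)
            return negative ? 1u << 3 : 1u << 4;      /* -0 or +0 */
        return negative ? 1u << 2 : 1u << 5;          /* subnormal */
    }
    return negative ? 1u << 1 : 1u << 6;              /* normal */
}
```

A math library can then test several classes at once, for example `fclass_d(x) & 0x300` to catch either kind of NaN.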
5.6. COMPARING RV32FD, ARM-32, MIPS-32, AND X86-32 USING DAXPY 55

void daxpy(size_t n, double a, const double x[], double y[])

{
for (size_t i = 0; i < n; i++) {
y[i] = a*x[i] + y[i];
}
}

Figure 5.7: The floating-point intensive DAXPY program in C.

ISA ARM-32 ARM Thumb-2 MIPS-32 microMIPS x86-32 RV32FD RV32FD+RV32C

Instructions 10 10 12 12 16 11 11
Per Loop 6 6 7 7 6 7 7
Bytes 40 28 48 32 50 44 28

Figure 5.8: Number of instructions and code size of DAXPY for four ISAs. It lists number of instructions per
loop and total. Chapter 7 describes ARM Thumb-2, microMIPS, and RV32C.

5.6 Comparing RV32FD, ARM-32, MIPS-32, and x86-32 using DAXPY

The name DAXPY comes from the formula itself: Double-precision A times X Plus Y. The
single-precision version is called SAXPY.

We’ll now do a head-to-head comparison using DAXPY as our floating-point benchmark
(Figure 5.7). It calculates Y = a × X + Y in double-precision, where X and Y are vectors
and a is a scalar. Figure 5.8 summarizes the number of instructions and number of bytes in
the DAXPY programs for the four ISAs. Their code is in Figures 5.9 to 5.12.
As was the case for Insertion Sort in Chapter 2, despite its emphasis on simplicity, the
RISC-V version again has about the same or fewer instructions, and the code sizes of the
architectures are quite close. In this example, the compare-and-branch instructions of RISC-V
save as many instructions as do the fancier address modes and the push and pop instructions
of ARM-32 and x86-32.

5.7 Concluding Remarks

Less is More.
—Robert Browning, 1855. The Minimalist school of (building) architecture adopted this
poem as an axiom in the 1980s.

The IEEE 754-2008 floating-point standard [IEEE Standards Committee 2008] defines the
floating-point data types, the accuracy of computation, and the required operations. Its success
greatly reduces the difficulty of porting floating-point programs, and it also means that
the floating-point ISAs are probably more uniform than their counterparts in other chapters.
56 NOTES

Elaboration: 16-bit, 128-bit, and decimal floating-point arithmetic

The revised IEEE floating-point standard (IEEE 754-2008) describes several new formats
beyond single- and double-precision, which it calls binary32 and binary64. The least surprising
addition is quadruple precision, named binary128. RISC-V has a tentative extension
planned for it called RV32Q (see Chapter 11). The standard also provided two more sizes for
binary data interchange, indicating that programmers might store these numbers in memory
or storage but shouldn’t expect to be able to compute in these sizes. They are half-precision
(binary16) and octuple precision (binary256). Despite the standard’s intent, GPUs do com-
pute in half-precision as well as keep them in memory. The plan for RISC-V is to include
half-precision in the vector instructions (RV32V in Chapter 8), with the proviso that pro-
cessors supporting vector half-precision will also add half-precision scalar instructions. The
surprising addition to the revised standard is decimal floating point, for which RISC-V has
set aside RV32L (see Chapter 11). The three self-explanatory decimal formats are called
decimal32, decimal64, and decimal128.

5.8 To Learn More

IEEE Standards Committee. 754-2008 IEEE standard for floating-point arithmetic. IEEE
Computer Society Std, 2008.
A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Version 2.2. May 2017. URL https://riscv.org/specifications/.

Notes
1 http://parlab.eecs.berkeley.edu

# RV32FD (7 insns in loop; 11 insns/44 bytes total; 28 bytes RVC)

# a0 is n, a1 is pointer to x[0], a2 is pointer to y[0], fa0 is a
0: 02050463 beqz a0,28 # if n == 0, jump to Exit
4: 00351513 slli a0,a0,0x3 # a0 = n*8
8: 00a60533 add a0,a2,a0 # a0 = address of y[n] (one past the last element)
Loop:
c: 0005b787 fld fa5,0(a1) # fa5 = x[i]
10: 00063707 fld fa4,0(a2) # fa4 = y[i]
14: 00860613 addi a2,a2,8 # a2++ (increment pointer to y)
18: 00858593 addi a1,a1,8 # a1++ (increment pointer to x)
1c: 72a7f7c3 fmadd.d fa5,fa5,fa0,fa4 # fa5 = a*x[i] + y[i]
20: fef63c27 fsd fa5,-8(a2) # y[i] = a*x[i] + y[i]
24: fea614e3 bne a2,a0,c # if i != n, jump to Loop
Exit:
28: 00008067 ret # return

Figure 5.9: RV32D code for DAXPY in Figure 5.7. The address in hexadecimal is on the left, the machine
language code in hexadecimal is next, and then the assembly language instruction followed by a comment.
The compare-and-branch instructions avoid the two compare instructions in the code of ARM-32 and
x86-32.

# ARM-32 (6 insns in loop; 10 insns/40 bytes total; 28 bytes Thumb-2)

# r0 is n, d0 is a, r1 is pointer to x[0], r2 is pointer to y[0]
0: e3500000 cmp r0, #0 # compare n to 0
4: 0a000006 beq 24 <daxpy+0x24> # if n == 0, jump to Exit
8: e0820180 add r0, r2, r0, lsl #3 # r0 = address of y[n] (one past the last element)
Loop:
c: ecb16b02 vldmia r1!,{d6} # d6 = x[i], increment pointer to x
10: ed927b00 vldr d7,[r2] # d7 = y[i]
14: ee067b00 vmla.f64 d7, d6, d0 # d7 = a*x[i] + y[i]
18: eca27b02 vstmia r2!, {d7} # y[i] = a*x[i] + y[i], incr. ptr to y
1c: e1520000 cmp r2, r0 # i vs. n
20: 1afffff9 bne c <daxpy+0xc> # if i != n, jump to Loop
Exit:
24: e12fff1e bx lr # return

Figure 5.10: ARM-32 code for DAXPY in Figure 5.7. The autoincrement addressing mode of ARM-32 saves
two instructions as compared to RISC-V. Unlike Insertion Sort, there is no need to push and pop registers for
DAXPY in ARM-32.

# MIPS-32 (7 insns in loop; 12 insns/48 bytes total; 32 bytes microMIPS)

# a0 is n, a1 is pointer to x[0], a2 is pointer to y[0], f12 is a
0: 10800009 beqz a0,28 <daxpy+0x28> # if n == 0, jump to Exit
4: 000420c0 sll a0,a0,0x3 # a0 = n*8 (filled branch delay slot)
8: 00c42021 addu a0,a2,a0 # a0 = address of y[n] (one past the last element)
Loop:
c: 24c60008 addiu a2,a2,8 # a2++ (increment pointer to y)
10: d4a00000 ldc1 $f0,0(a1) # f0 = x[i]
14: 24a50008 addiu a1,a1,8 # a1++ (increment pointer to x)
18: d4c2fff8 ldc1 $f2,-8(a2) # f2 = y[i]
1c: 4c406021 madd.d $f0,$f2,$f12,$f0 # f0 = a*x[i] + y[i]
20: 14c4fffa bne a2,a0,c <daxpy+0xc> # if i != n, jump to Loop
24: f4c0fff8 sdc1 $f0,-8(a2) # y[i] = a*x[i] + y[i] (filled delay slot)
Exit:
28: 03e00008 jr ra # return
2c: 00000000 nop # (unfilled branch delay slot)

Figure 5.11: MIPS-32 code for DAXPY in Figure 5.7. Two of the three branch delay slots are filled with
useful instructions. The ability to check for equality between two registers avoids the two compare
instructions found in ARM-32 and x86-32. Unlike integer loads, floating-point loads have no delay slot.

# x86-32 (6 insns in loop; 16 insns/50 bytes total)

# eax is i, n is in memory at esp+0x8, a is in memory at esp+0xc
# pointer to x[0] is in memory at esp+0x14
# pointer to y[0] is in memory at esp+0x18
0: 53 push ebx # save ebx
1: 8b 4c 24 08 mov ecx,[esp+0x8] # ecx has copy of n
5: c5 fb 10 4c 24 0c vmovsd xmm1,[esp+0xc] # xmm1 has a copy of a
b: 8b 5c 24 14 mov ebx,[esp+0x14] # ebx points to x[0]
f: 8b 54 24 18 mov edx,[esp+0x18] # edx points to y[0]
13: 85 c9 test ecx,ecx # compare n to 0
15: 74 19 je 30 <daxpy+0x30> # if n==0, jump to Exit
17: 31 c0 xor eax,eax # i = 0 (since x^x==0)
Loop:
19: c5 fb 10 04 c3 vmovsd xmm0,[ebx+eax*8] # xmm0 = x[i]
1e: c4 e2 f1 a9 04 c2 vfmadd213sd xmm0,xmm1,[edx+eax*8] # xmm0 = a*x[i] + y[i]
24: c5 fb 11 04 c2 vmovsd [edx+eax*8],xmm0 # y[i] = a*x[i] + y[i]
29: 83 c0 01 add eax,0x1 # i++
2c: 39 c1 cmp ecx,eax # compare i vs n
2e: 75 e9 jne 19 <daxpy+0x19> # if i!=n, jump to Loop
Exit:
30: 5b pop ebx # restore ebx
31: c3 ret # return

Figure 5.12: x86-32 code for DAXPY in Figure 5.7. The lack of x86-32 registers is evident in this example,
with four variables allocated to memory that are in registers in the code for the other ISAs. It also
demonstrates x86-32 idioms to compare a register to zero (test ecx,ecx) or to set a register to zero (xor
eax,eax).
6 RV32A: Atomic Instructions

Everything should be made as simple as possible, but no simpler.
—Albert Einstein, 1933

Albert Einstein (1879-1955) was the most famous scientist of the 20th century. He invented
the theory of relativity and advocated building the atomic bomb for World War II.

6.1 Introduction

Our assumption is that you already understand ISA support for multiprocessing, so our job is
just to explain the RV32A instructions and what they do. If you don’t feel you have sufficient
background or need a reminder, study “synchronization (computer science)” on Wikipedia
(https://en.wikipedia.org/wiki/Synchronization_(computer_science)) or read Section 2.1 of
our related RISC-V architecture book [Patterson and Hennessy 2017].
RV32A has two types of atomic operations for synchronization:
• atomic memory operations (AMO), and
• load reserved / store conditional.
Figure 6.1 is a graphical representation of the RV32A extension instruction set and Figure 6.2
lists their opcodes and instruction formats.

RV32A
atomic memory operation {add, and, or, swap, xor, maximum, maximum unsigned, minimum, minimum unsigned}.word
load reserved.word
store conditional.word

Figure 6.1: Diagram of the RV32A instructions.

6.1. INTRODUCTION 61

31 25 24 20 19 15 14 12 11 7 6 0
00010 aq rl 00000 rs1 010 rd 0101111 R lr.w
00011 aq rl rs2 rs1 010 rd 0101111 R sc.w
00001 aq rl rs2 rs1 010 rd 0101111 R amoswap.w
00000 aq rl rs2 rs1 010 rd 0101111 R amoadd.w
00100 aq rl rs2 rs1 010 rd 0101111 R amoxor.w
01100 aq rl rs2 rs1 010 rd 0101111 R amoand.w
01000 aq rl rs2 rs1 010 rd 0101111 R amoor.w
10000 aq rl rs2 rs1 010 rd 0101111 R amomin.w
10100 aq rl rs2 rs1 010 rd 0101111 R amomax.w
11000 aq rl rs2 rs1 010 rd 0101111 R amominu.w
11100 aq rl rs2 rs1 010 rd 0101111 R amomaxu.w

Figure 6.2: RV32A opcode map has instruction layout, opcodes, format type, and names. (Table 19.2 of
[Waterman and Asanović 2017] is the basis of this figure.)

The AMO instructions atomically perform an operation on an operand in memory and set
the destination register to the original memory value. Atomic means there can be no interrupt
between the read and the write of memory, nor could other processors modify the memory
value between the memory read and write of the AMO instruction.
AMOs and LR/SC require naturally aligned memory addresses because it is onerous for
hardware to guarantee atomicity across cache-line boundaries.

Load reserved and store conditional provide an atomic operation across two instructions.
Load reserved reads a word from memory, writes it to the destination register, and records a
reservation on that word in memory. Store conditional stores a word at the address in a source
register provided there exists a load reservation on that memory address. It writes zero to the
destination register if the store succeeded, or a nonzero error code otherwise.
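
The AMO behavior described above, where the destination receives the original memory value, matches the fetch-and-op functions of C11's <stdatomic.h>. The wrappers below are a sketch with hypothetical names; a compiler targeting RV32A may or may not map each one to the like-named AMO instruction, and the default sequentially consistent ordering used here is stronger than a bare AMO provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Each wrapper updates *addr atomically and returns the value it held before. */

int32_t amo_add(_Atomic int32_t *addr, int32_t operand) {
    return atomic_fetch_add(addr, operand);   /* compare amoadd.w */
}

int32_t amo_swap(_Atomic int32_t *addr, int32_t operand) {
    return atomic_exchange(addr, operand);    /* compare amoswap.w */
}

int32_t amo_and(_Atomic int32_t *addr, int32_t operand) {
    return atomic_fetch_and(addr, operand);   /* compare amoand.w */
}

int32_t amo_or(_Atomic int32_t *addr, int32_t operand) {
    return atomic_fetch_or(addr, operand);    /* compare amoor.w */
}

int32_t amo_xor(_Atomic int32_t *addr, int32_t operand) {
    return atomic_fetch_xor(addr, operand);   /* compare amoxor.w */
}
```

Starting from a memory word holding 5, `amo_add(&word, 3)` returns 5 and leaves 8 in memory.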
An obvious question is: Why does RV32A have two ways to perform atomic operations?
The answer is that there are two quite distinct use cases.
Programming language developers assume the underlying architecture can perform an
atomic compare-and-swap operation: Compare a register value to a value in memory ad-
dressed by another register, and if they are equal, then swap a third register value with the one
in memory. They make that assumption because it is a universal synchronization primitive,
in that any other single-word synchronization operation can be synthesized from compare-
and-swap [Herlihy 1991].
While that is a powerful argument for adding such an instruction to an ISA, it requires
three source registers in one instruction. Alas, going from two to three source operands
would complicate the integer datapath, control, and the instruction format. (The three source
operands of RV32FD’s multiply-add instructions affect the floating-point datapath, not the
integer datapath.) Fortunately, load reserved and store conditional have only two source
registers and can implement atomic compare and swap (see top half of Figure 6.3).
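
To illustrate why compare-and-swap is called universal, the sketch below builds an atomic fetch-and-maximum out of nothing but compare-and-swap. It uses C11's atomic_compare_exchange functions, which compilers commonly implement with a load reserved / store conditional loop on ISAs that offer one; the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Compare-and-swap: if *addr equals expected, store desired and return true. */
bool compare_and_swap(_Atomic int32_t *addr, int32_t expected, int32_t desired) {
    return atomic_compare_exchange_strong(addr, &expected, desired);
}

/* Atomic maximum synthesized from compare-and-swap.
   Returns the value that was in memory before the update (if any). */
int32_t fetch_max_via_cas(_Atomic int32_t *addr, int32_t operand) {
    int32_t old = atomic_load(addr);
    /* Retry while memory still holds a smaller value and the swap fails.
       On failure, the call refreshes old with the current memory value. */
    while (old < operand &&
           !atomic_compare_exchange_weak(addr, &old, operand)) {
        /* another thread changed *addr, or the weak exchange failed spuriously */
    }
    return old;
}
```

The weak form is allowed to fail spuriously, which is the same situation as a store conditional that fails and forces a retry.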
The rationale for also having AMO instructions is that they scale better to large multipro-
cessor systems than load reserved and store conditional. They can also be used to implement
reduction operations efficiently. AMOs are useful as well for communicating with I/O de-
vices, because they perform a read and a write in a single atomic bus transaction. This
atomicity can both simplify device drivers and improve I/O performance. The bottom half of
Figure 6.3 shows how to write a critical section using atomic swap.
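
A critical section guarded by atomic swap looks like this in C11. It is a minimal sketch with hypothetical names: the acquire ordering on the exchange plays the role of the aq bit on amoswap.w.aq, and the release ordering on the unlocking store plays the role of the rl bit.

```c
#include <assert.h>
#include <stdatomic.h>

/* 0 means free, 1 means held. */
typedef struct {
    atomic_int locked;
} spinlock;

/* One attempt: swap in 1 and succeed only if the old value was 0. */
int spin_try_lock(spinlock *lock) {
    return atomic_exchange_explicit(&lock->locked, 1, memory_order_acquire) == 0;
}

/* Spin until the swap returns 0, meaning this thread took the lock. */
void spin_lock(spinlock *lock) {
    while (!spin_try_lock(lock)) {
        /* busy-wait; a production lock would back off or yield here */
    }
}

/* Release: write 0 so that earlier accesses are visible before the lock frees. */
void spin_unlock(spinlock *lock) {
    atomic_store_explicit(&lock->locked, 0, memory_order_release);
}
```

A caller brackets the critical section with `spin_lock(&lock)` and `spin_unlock(&lock)`.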
62 REFERENCES

# Compare-and-swap (CAS) memory word M[a0] using lr/sc.

# Expected old value in a1; desired new value in a2.
0: 100526af lr.w a3,(a0) # Load old value
4: 06b69e63 bne a3,a1,80 # Old value equals a1?
8: 18c526af sc.w a3,a2,(a0) # Swap in new value if so
c: fe069ae3 bnez a3,0 # Retry if store failed
... code following successful CAS goes here ...
80: # Unsuccessful CAS.

# Critical section guarded by test-and-set spinlock using an AMO.

0: 00100293 li t0,1 # Initialize lock value
4: 0c55232f amoswap.w.aq t1,t0,(a0) # Attempt to acquire lock
8: fe031ee3 bnez t1,4 # Retry if unsuccessful
... critical section goes here ...
20: 0a05202f amoswap.w.rl x0,x0,(a0) # Release lock.

Figure 6.3: Two examples of synchronization. The first uses load reserved/store conditional lr.w,sc.w to
implement compare-and-swap, and the second uses an atomic swap amoswap.w to implement a mutex.

Elaboration: Memory consistency models

RISC-V has a relaxed memory consistency model, so other threads may view some memory
accesses out of order. Figure 6.2 shows that all RV32A instructions have an acquire bit (aq)
and a release bit (rl). An atomic operation with the aq bit set guarantees that other threads
will see the AMO in-order with subsequent memory accesses. If the rl bit is set, other threads
will see the atomic operation in-order with previous memory accesses. To learn more, [Adve
and Gharachorloo 1996] is an excellent tutorial on the topic.

What’s Different? The original MIPS-32 had no mechanism for synchronization, but
architects added load reserved / store conditional instructions to a later MIPS ISA.

6.2 Concluding Remarks

RV32A is optional, and a RISC-V processor is simpler without it. However, as Einstein said,
everything should be as simple as possible, but no simpler. Many situations require RV32A.

6.3 To Learn More

S. V. Adve and K. Gharachorloo. Shared memory consistency models: A tutorial. Computer,
29(12):66–76, 1996.
M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and
Systems, 1991.
D. A. Patterson and J. L. Hennessy. Computer Organization and Design RISC-V Edition:
The Hardware Software Interface. Morgan Kaufmann, 2017.

A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Version 2.2. May 2017. URL https://riscv.org/specifications/.

Notes
1 http://parlab.eecs.berkeley.edu
7 RV32C: Compressed Instructions

Small is Beautiful.
—E. F. Schumacher, 1973

E. F. Schumacher (1911-1977) wrote this economics book that advocated human-scale,
decentralized, and appropriate technologies. Translated into numerous languages, it was
named one of the 100 most influential books since World War II.

7.1 Introduction

Prior ISAs significantly expanded the number of instructions and instruction formats to shrink
code size: adding short instructions with two operands instead of three, small immediate
fields, and so on. ARM and MIPS invented whole ISAs twice to shrink code: ARM Thumb
and Thumb-2 plus MIPS16 and microMIPS. These new ISAs hampered the processor and
the compiler and increased the cognitive load on the assembly language programmer.
RV32C takes a novel approach: every short instruction must map to one single standard
32-bit RISC-V instruction. Moreover, only the assembler and linker are aware of the 16-
bit instructions, and it is up to them to replace a wide instruction with its narrow cousin.
The compiler writer and assembly language programmer can be blissfully oblivious of the
RV32C instructions and their formats, except that they end up with programs that are smaller
than most. Figure 7.1 is a graphical representation of the RV32C extension instruction set.
The RISC-V architects chose the instructions in the RVC extension to obtain good code
compression across a range of programs, using three observations to fit them into 16 bits.
First, ten popular registers (a0–a5, s0–s1, sp, and ra) are accessed far more than the rest.
Second, many instructions overwrite one of their source operands. Third, immediate operands
tend to be small, and some instructions favor certain immediates. So, many RV32C instruc-
tions can access only the popular registers; some instructions implicitly overwrite a source
operand; and almost all immediates are reduced in size, with loads and stores using only
unsigned offsets in multiples of the operand size.
Figures 7.3 and 7.4 list the RV32C code for Insertion Sort and DAXPY. We show the
RV32C instructions to demonstrate the impact of compression explicitly, but normally these
instructions are invisible in the assembly language program. The comments show the equiv-
alent 32-bit instructions to the RV32C instructions parenthetically. Appendix A includes the
32-bit RISC-V instruction that corresponds to each 16-bit RV32C instruction.
For example, at address 4 in Insertion Sort in Figure 7.3, the assembler replaced the
following 32-bit RV32I instruction:
addi a4,x0,1 # i = 1
with this 16-bit RV32C instruction:

RV32C

Integer Computation
c.add [immediate]
c.add immediate * 16 to stack pointer
c.add immediate * 4 to stack pointer nondestructive
c.subtract
c.{shift left logical | shift right arithmetic | shift right logical} immediate
c.and [immediate]
c.or
c.move
c.exclusive or
c.load [upper] immediate

Control transfer
c.branch {equal | not equal} to zero
c.jump [and link]
c.jump [and link] register

Other instructions
c.environment break

Loads and Stores
c.[float] {load | store} word [using stack pointer]
c.float {load | store} doubleword [using stack pointer]

Figure 7.1: Diagram of the RV32C instructions. The immediate fields of the shift instructions and
c.addi4spn are zero extended; those of the other instructions are sign extended.

c.li a4,1 # (expands to addi a4,x0,1) i = 1

The RV32C load immediate instruction is narrower because it must specify only one register
and a small immediate. The c.li machine code is only four hexadecimal digits in Figure 7.3,
showing that the c.li instruction is indeed 2 bytes long.
Another example is at address 10 in Figure 7.3, where the assembler replaced:
add a2,x0,a3 # a2 is pointer to a[j]
with this 16-bit RV32C instruction:
c.mv a2,a3 # (expands to add a2,x0,a3) a2 is pointer to a[j]
The RV32C move instruction is merely 16 bits long because it specifies only two registers.
While the processor designer can’t ignore RV32C, an implementation trick makes its instructions
inexpensive: a decoder translates all 16-bit instructions into their equivalent 32-bit version
before they execute. Figures 7.5 to 7.8 list the RV32C instruction formats and opcodes that
the decoder translates. The decoder is equivalent to only 400 gates, whereas the tiniest 32-bit
processor, without any RISC-V extensions, is 8000 gates. If the decoder is only 5% of such a tiny
design, it nearly disappears inside a moderate processor that, with caches, is on the order of
100,000 gates.
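
The one-to-one expansion can be modeled in a few lines of C. The sketch below handles only three of the compressed instructions (c.li, c.addi, and c.mv) and returns 0 for everything else; the function names are hypothetical, and a real decoder is combinational hardware covering every RV32C encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Encode the 32-bit instruction addi rd,rs1,imm (I format, opcode 0x13). */
static uint32_t encode_addi(uint32_t rd, uint32_t rs1, int32_t imm) {
    return (((uint32_t)imm & 0xFFFu) << 20) | (rs1 << 15) | (rd << 7) | 0x13u;
}

/* Encode the 32-bit instruction add rd,rs1,rs2 (R format, opcode 0x33). */
static uint32_t encode_add(uint32_t rd, uint32_t rs1, uint32_t rs2) {
    return (rs2 << 20) | (rs1 << 15) | (rd << 7) | 0x33u;
}

/* Expand a 16-bit RV32C instruction to its 32-bit equivalent.
   Only c.addi, c.li, and c.mv are handled in this sketch; others return 0. */
uint32_t expand_rvc(uint16_t c) {
    uint32_t op     = c & 0x3u;            /* bits 1:0   */
    uint32_t funct3 = (c >> 13) & 0x7u;    /* bits 15:13 */
    uint32_t rd     = (c >> 7) & 0x1Fu;    /* bits 11:7  */
    uint32_t rs2    = (c >> 2) & 0x1Fu;    /* bits 6:2   */

    /* CI-format immediate: bit 12 is the sign, bits 6:2 are the low five bits. */
    int32_t imm = (int32_t)((((c >> 12) & 0x1u) << 5) | rs2);
    if (imm & 0x20) imm -= 64;             /* sign extend from 6 bits */

    if (op == 1 && funct3 == 0 && rd != 0) return encode_addi(rd, rd, imm); /* c.addi */
    if (op == 1 && funct3 == 2 && rd != 0) return encode_addi(rd, 0, imm);  /* c.li   */
    if (op == 2 && (c >> 12) == 0x8 && rd != 0 && rs2 != 0)
        return encode_add(rd, 0, rs2);                                       /* c.mv   */
    return 0;
}
```

Feeding it the c.li encoding 0x4705 from Figure 7.3 produces 0x00100713, which is addi a4,x0,1, and the c.mv encoding 0x8636 produces 0x00d00633, which is add a2,x0,a3.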
What’s Different? There are no byte or halfword instructions in RV32C because other
instructions have a bigger influence on code size. The small size advantage of Thumb-2 over
RV32C in Figure 1.5 on page 9 is due to the code size savings of Load and Store Multiple
on procedure entry and exit. RV32C excludes them to maintain the one-to-one mapping to
RV32G instructions, which omits them to reduce implementation complexity for high-end
processors. Since Thumb-2 is a separate ISA from ARM-32, but a processor can switch
between them, the hardware must have two instruction decoders: one for ARM-32 and one
for Thumb-2. RV32GC is a single ISA, so RISC-V processors need only a single decoder.
66 CHAPTER 7. RV32C: COMPRESSED INSTRUCTIONS

Benchmark       ISA           ARM Thumb-2  microMIPS  x86-32  RV32I+RVC
Insertion Sort  Instructions  18           24         20      19
                Bytes         46           56         45      52
DAXPY           Instructions  10           12         16      11
                Bytes         28           32         50      28

Figure 7.2: Instructions and code size for Insertion Sort and DAXPY for compressed ISAs.

Elaboration: Why would architects ever skip RV32C?

Instruction decode can be a bottleneck for superscalar processors that try to fetch several in-
structions per clock cycle. Another example is macrofusion, whereby the instruction decoder
combines RISC-V instructions together to form more powerful instructions for execution (see
Chapter 1). A mix of 16-bit RV32C and 32-bit RV32I instructions can make sophisticated
decoding more difficult to complete within the clock cycle of a high-performance implemen-
tation.

7.2 Comparing RV32GC, Thumb-2, microMIPS, and x86-32

Figure 7.2 summarizes the size of Insertion Sort and DAXPY for these four ISAs.
Of the 19 original RV32I instructions in Insertion Sort, 12 become RV32C, so the code
shrinks from 19 × 4 = 76 bytes to 12 × 2 + 7 × 4 = 52 bytes, saving 24/76 = 32%. DAXPY
shrinks from 11 × 4 = 44 bytes to 8 × 2 + 3 × 4 = 28 bytes, or 16/44 = 36%.
The results for these two small examples are surprisingly in line with Figure 1.5 on page 9
in Chapter 1, which shows that RV32G code is about 37% larger than RV32GC code, for
a larger set of much bigger programs. To achieve that level of savings, over half of the
instructions in the programs had to be RV32C instructions.
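
The byte counts above follow from one formula: each compressed instruction takes 2 bytes and each remaining instruction takes 4. A small C sketch of that arithmetic, with hypothetical function names, reproduces the figures for both examples.

```c
#include <assert.h>

/* Size in bytes when `compressed` of the `total` instructions are 16 bits wide. */
unsigned mixed_code_bytes(unsigned total, unsigned compressed) {
    assert(compressed <= total);
    return compressed * 2 + (total - compressed) * 4;
}

/* Percentage saved relative to an all-32-bit encoding of the same program. */
double percent_saved(unsigned total, unsigned compressed) {
    unsigned full = total * 4;
    return 100.0 * (full - mixed_code_bytes(total, compressed)) / full;
}
```

For Insertion Sort, `mixed_code_bytes(19, 12)` is 52 and `percent_saved(19, 12)` is about 31.6; for DAXPY, `mixed_code_bytes(11, 8)` is 28 and `percent_saved(11, 8)` is about 36.4.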

Elaboration: Is RV32C really unique?

RV32I instructions are indistinguishable in RV32IC. Thumb-2 is actually a separate ISA
with 16-bit instructions plus most but not all of ARMv7. For example, Compare and Branch
on Zero is in Thumb-2 but not ARMv7, and vice versa for Reverse Subtract with Carry.
Nor is microMIPS32 a superset of MIPS32. For example, microMIPS multiplies branch
displacements by two but it’s four in MIPS32. RISC-V always multiplies by two.

7.3 Concluding Remarks

I would have written a shorter letter, but I did not have the time.
—Blaise Pascal, 1656.
He was a mathematician who built one of the first mechanical calculators, which led
Turing Award laureate Niklaus Wirth to name a programming language after him.

RV32C gives RISC-V one of the smallest code sizes today. You can almost think of
RV32C instructions as hardware-assisted pseudoinstructions. However, now the assembler is
hiding them from the assembly language programmer and compiler writer rather than, as in
Chapter 3, expanding the real instruction set with popular operations that make RISC-V code
easier to use and to read. Both approaches aid programmer productivity.
We consider RV32C one of RISC-V’s best examples of a simple, powerful mechanism
that improves its cost-performance.

7.4 To Learn More

A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Version 2.2. May 2017. URL https://riscv.org/specifications/.

Notes
1 http://parlab.eecs.berkeley.edu

# RV32C (19 instructions, 52 bytes)

# a0 points to a[0], a1 is n, a3 points to a[i], a4 is i, a5 is j, a6 is x
0: 00450693 addi a3,a0,4 # a3 is pointer to a[i]
4: 4705 c.li a4,1 # (expands to addi a4,x0,1) i = 1
Outer Loop:
6: 00b76363 bltu a4,a1,c # if i < n, jump to Continue Outer loop
a: 8082 c.ret # (expands to jalr x0,ra,0) return from function
Continue Outer Loop:
c: 0006a803 lw a6,0(a3) # x = a[i]
10: 8636 c.mv a2,a3 # (expands to add a2,x0,a3) a2 is pointer to a[j]
12: 87ba c.mv a5,a4 # (expands to add a5,x0,a4) j = i
InnerLoop:
14: ffc62883 lw a7,-4(a2) # a7 = a[j-1]
18: 01185763 ble a7,a6,26 # if a[j-1] <= a[i], jump to Exit InnerLoop
1c: 01162023 sw a7,0(a2) # a[j] = a[j-1]
20: 17fd c.addi a5,-1 # (expands to addi a5,a5,-1) j--
22: 1671 c.addi a2,-4 # (expands to addi a2,a2,-4)decr a2 to point to a[j]
24: fbe5 c.bnez a5,14 # (expands to bne a5,x0,14)if j!=0,jump to InnerLoop
Exit InnerLoop:
26: 078a c.slli a5,0x2 # (expands to slli a5,a5,0x2) multiply a5 by 4
28: 97aa c.add a5,a0 # (expands to add a5,a5,a0)a5 = byte address of a[j]
2a: 0107a023 sw a6,0(a5) # a[j] = x
2e: 0705 c.addi a4,1 # (expands to addi a4,a4,1) i++
30: 0691 c.addi a3,4 # (expands to addi a3,a3,4) incr a3 to point to a[i]
32: bfd1 c.j 6 # (expands to jal x0,6) jump to Outer Loop

Figure 7.3: RV32C code for Insertion Sort. The twelve 16-bit instructions make the code 32% smaller. The
width of each instruction is evident by the number of hexadecimal characters in the second column. The
RV32C instructions (starting with c.) are shown explicitly in this example, but normally assembly language
programmers and compilers cannot see them.

# RV32DC (11 instructions, 28 bytes)

# a0 is n, a1 is pointer to x[0], a2 is pointer to y[0], fa0 is a
0: cd09 c.beqz a0,1a # (expands to beq a0,x0,1a) if n==0, jump to Exit
2: 050e c.slli a0,0x3 # (expands to slli a0,a0,0x3) a0 = n*8
4: 9532 c.add a0,a2 # (expands to add a0,a0,a2) a0 = address of y[n]
Loop:
6: 2218 c.fld fa4,0(a2) # (expands to fld fa4,0(a2) ) fa4 = y[i]
8: 219c c.fld fa5,0(a1) # (expands to fld fa5,0(a1) ) fa5 = x[i]
a: 0621 c.addi a2,8 # (expands to addi a2,a2,8) a2++ (incr. ptr to y)
c: 05a1 c.addi a1,8 # (expands to addi a1,a1,8) a1++ (incr. ptr to x)
e: 72a7f7c3 fmadd.d fa5,fa5,fa0,fa4 # fa5 = a*x[i] + y[i]
12: fef63c27 fsd fa5,-8(a2) # y[i] = a*x[i] + y[i]
16: fea618e3 bne a2,a0,6 # if i != n, jump to Loop
Exit:
1a: 8082 c.ret # (expands to jalr x0,ra,0) return from function

Figure 7.4: RV32DC code for DAXPY. The eight 16-bit instructions shrink the code by 36%. The width of
each instruction is evident by the number of hexadecimal characters in the second column. The RV32C
instructions (starting with c.) are shown explicitly in this example, but normally they are invisible to the
assembly language programmer and compiler.

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
000 nzimm[5] 0 nzimm[4:0] 01 CI c.nop
000 nzimm[5] rs1/rd≠0 nzimm[4:0] 01 CI c.addi
001 imm[11|4|9:8|10|6|7|3:1|5] 01 CJ c.jal
010 imm[5] rd≠0 imm[4:0] 01 CI c.li
011 nzimm[9] 2 nzimm[4|6|8:7|5] 01 CI c.addi16sp
011 nzimm[17] rd≠{0, 2} nzimm[16:12] 01 CI c.lui
100 nzuimm[5] 00 rs1′/rd′ nzuimm[4:0] 01 CI c.srli
100 nzuimm[5] 01 rs1′/rd′ nzuimm[4:0] 01 CI c.srai
100 imm[5] 10 rs1′/rd′ imm[4:0] 01 CI c.andi
100 0 11 rs1′/rd′ 00 rs2′ 01 CR c.sub
100 0 11 rs1′/rd′ 01 rs2′ 01 CR c.xor
100 0 11 rs1′/rd′ 10 rs2′ 01 CR c.or
100 0 11 rs1′/rd′ 11 rs2′ 01 CR c.and
101 imm[11|4|9:8|10|6|7|3:1|5] 01 CJ c.j
110 imm[8|4:3] rs1′ imm[7:6|2:1|5] 01 CB c.beqz
111 imm[8|4:3] rs1′ imm[7:6|2:1|5] 01 CB c.bnez

Figure 7.5: RV32C opcode map (bits[1 : 0] = 01) lists layout, opcodes, format, and names. rd’, rs1’, and
rs2’ refer to the 10 popular registers a0–a5, s0–s1, sp, and ra. (Table 12.5 of [Waterman and Asanović
2017] is the basis of this figure.)

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
000 0 0 00 CIW Illegal instruction
000 nzuimm[5:4|9:6|2|3] rd’ 00 CIW c.addi4spn
001 uimm[5:3] rs1’ uimm[7:6] rd’ 00 CL c.fld
010 uimm[5:3] rs1’ uimm[2|6] rd’ 00 CL c.lw
011 uimm[5:3] rs1’ uimm[2|6] rd’ 00 CL c.flw
101 uimm[5:3] rs1’ uimm[7:6] rs2’ 00 CL c.fsd
110 uimm[5:3] rs1’ uimm[2|6] rs2’ 00 CL c.sw
111 uimm[5:3] rs1’ uimm[2|6] rs2’ 00 CL c.fsw

Figure 7.6: RV32C opcode map (bits[1 : 0] = 00) lists layout, opcodes, format, and names. rd’, rs1’, and
rs2’ refer to the 10 popular registers a0–a5, s0–s1, sp, and ra. (Table 12.4 of [Waterman and Asanović
2017] is the basis of this figure.)

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
000 nzuimm[5] rs1/rd≠0 nzuimm[4:0] 10 CI c.slli
000 0 rs1/rd≠0 0 10 CI c.slli64
001 uimm[5] rd uimm[4:3|8:6] 10 CSS c.fldsp
010 uimm[5] rd≠0 uimm[4:2|7:6] 10 CSS c.lwsp
011 uimm[5] rd uimm[4:2|7:6] 10 CSS c.flwsp
100 0 rs1≠0 0 10 CJ c.jr
100 0 rd≠0 rs2≠0 10 CR c.mv
100 1 0 0 10 CI c.ebreak
100 1 rs1≠0 0 10 CJ c.jalr
100 1 rs1/rd≠0 rs2≠0 10 CR c.add
101 uimm[5:3|8:6] rs2 10 CSS c.fsdsp
110 uimm[5:2|7:6] rs2 10 CSS c.swsp
111 uimm[5:2|7:6] rs2 10 CSS c.fswsp

Figure 7.7: RV32C opcode map (bits[1 : 0] = 10) lists layout, opcodes, format, and names. (Table 12.6
of [Waterman and Asanović 2017] is the basis of this figure.)

Format  Meaning               Fields, from bit 15 (left) down to bit 0 (right)
CR      Register              funct4 | rd/rs1 | rs2 | op
CI      Immediate             funct3 | imm | rd/rs1 | imm | op
CSS     Stack-relative Store  funct3 | imm | rs2 | op
CIW     Wide Immediate        funct3 | imm | rd’ | op
CL      Load                  funct3 | imm | rs1’ | imm | rd’ | op
CS      Store                 funct3 | imm | rs1’ | imm | rs2’ | op
CB      Branch                funct3 | offset | rs1’ | offset | op
CJ      Jump                  funct3 | jump target | op

Figure 7.8: Compressed 16-bit RVC instruction formats. rd’, rs1’, and rs2’ refer to the 10 popular
registers a0–a5, s0–s1, sp, and ra. (Table 12.1 of [Waterman and Asanović 2017] is the basis of this figure.)
8 RV32V: Vector

I’m all for simplicity. If it’s very complicated I can’t understand it.
—Seymour Cray

Seymour Cray (1925-1996) was architect of the Cray-1 in 1976, the first commercially successful
supercomputer using a vector architecture. The Cray-1 was a gem; it was the world’s fastest computer
even without using the vector instructions.

8.1 Introduction

This chapter focuses on data-level parallelism, where there is plenty of data that the desired
application can compute on concurrently. Arrays are a popular example. While arrays are fundamental
to scientific applications, multimedia programs use them as well. The former use single-
and double-precision floating-point data and the latter often use 8- and 16-bit integer data.
The best known architecture for data-level parallelism is Single Instruction Multiple Data
(SIMD). SIMD first became popular by partitioning 64-bit registers into many 8-, 16-, or 32-bit
pieces and then computing on them in parallel. The opcode supplied the data width and
the operation. Data transfers are simply loads and stores of a single (wide) SIMD register.
The first step of partitioning existing 64-bit registers is tempting because it is straightforward.
To make SIMD faster, architects subsequently widen the registers to compute more
partitions concurrently. Because the SIMD ISAs belong to the incremental school of design,
and the opcode specifies the data width, expanding the SIMD registers also expands the
SIMD instruction set. Each subsequent step of doubling the width of SIMD registers and the
number of SIMD instructions leads ISAs down the path of escalating complexity, which is
borne by processor designers, compiler writers, and assembly language programmers.

The Intel Multimedia Extensions (MMX) in 1997 made SIMD popular. They were embraced and
expanded via Streaming SIMD Extensions (SSE) in 1999 and Advanced Vector Extensions (AVX)
in 2010. MMX fame was fueled by an Intel ad campaign showing disco-dancing workers of a
semiconductor line clad in technicolor clean suits (https://www.youtube.com/watch?v=paU16B-bZEA).

An older and, in our opinion, more elegant alternative to exploit data-level parallelism is
the vector architecture. This chapter provides our rationale for using vectors instead of SIMD
in RISC-V.
Vector computers gather objects from main memory and put them into long, sequential
vector registers. Pipelined execution units compute very efficiently on these vector registers.
Vector architectures then scatter the results back from the vector registers to main memory.
The size of the vector registers is determined by the implementation, rather than baked into
the opcode, as with SIMD. As we shall see, separating the vector length and maximum operations
per clock cycle from the instruction encoding is the crux of the vector architecture:
the vector microarchitect can flexibly design the data-parallel hardware without affecting the
programmer, and the programmer can take advantage of longer vectors without rewriting the
code. In addition, vector architectures have many fewer instructions than SIMD architectures.
Moreover, vector architectures have well-established compiler technology, unlike SIMD.

RV32V

Computation
  vector {add | multiply | multiply high | and | or | xor | minimum | maximum} {.vv | .vs}
  vector convert
  vector {subtract | divide | remainder | shift left logical | shift right arithmetic | shift right logical} {.vv | .vs | .sv}
  vector fused [negative] multiply {add | subtract} {.vvv | .vvs | .vsv | .vss}
  vector sign injection [negative | exclusive or] .vv
  vector class.v
  vector move.vv
  vector square root.v
  vector atomic memory operation {swap | add | and | or | xor | minimum | maximum} {.vv | .vs}

Load and Store
  vector {load | store} [strided | indexed]

Comparison
  vector predicate {equal | not equal | less than | greater than or equal} {.vv | .vs}
  vector predicate {and | and not | or | exclusive or | not}
  vector predicate swap

Miscellaneous instructions
  set vector length
  vector extract.vs
  vector merge.vv
  vector select.vv
  vector set data configuration

Figure 8.1: Diagram of the RV32V instructions. Because of dynamic register typing, this instruction
diagram also works without change for RV64V in Chapter 9.

Vector architectures are rarer than SIMD architectures, so fewer readers know vector
ISAs. Thus, this chapter will have a more tutorial flavor than earlier ones. If you want to dig
deeper into vector architectures, read Chapter 4 and Appendix G of [Hennessy and Patterson
2011]. RV32V also has novel features that simplify the ISA, which requires more explanation
even if you already are familiar with vector architectures.

8.2 Vector Computation Instructions

Figure 8.1 is a graphical representation of the RV32V extension instruction set. The RV32V
encoding has not been finalized, so this edition does not include the usual instruction-layout
diagram.
Virtually every integer and floating-point computation instruction from an earlier chapter
has a vector version: Figure 8.1 inherits operations from RV32I, RV32M, RV32F, RV32D,
and RV32A. There are several types of each vector instruction depending on whether the
source operands are all vectors (.vv suffix) or a vector source operand and a scalar source
operand (.vs suffix). A scalar suffix means an x or f register is an operand along with a
vector register (v). For example, our DAXPY program (Figure 5.7 on page 55 in Chapter 5)
calculates Y = a × X + Y , where X and Y are vectors, and a is a scalar. For vector-scalar
operations, the rs1 field specifies the scalar register to be accessed.
Asymmetric operations like subtraction and division offer a third variation of vector instructions
where the first operand is scalar and the second is vector (.sv suffix). Operations
like Y = a − X use them. They are superfluous for symmetric operations like addition and
multiplication, so those instructions have no .sv version. The fused multiply-add instruc-
tions have three operands, so they have the largest combination of vector and scalar options:
.vvv, .vvs, .vsv, and .vss.
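
To make the suffixes concrete, here is a minimal Python sketch in which lists stand in for vector registers and floats for scalar registers. The function names are hypothetical stand-ins chosen to echo the suffixes; they are not RV32V mnemonics, and the model ignores vl, types, and masks.

```python
# Illustrative model only: Python lists play the role of vector registers.
# All function names below are hypothetical, not RV32V mnemonics.

def sub_vv(v1, v2):
    """Both sources are vectors (.vv): element-wise v1[i] - v2[i]."""
    return [a - b for a, b in zip(v1, v2)]

def sub_vs(v1, s):
    """Vector minus scalar (.vs): v1[i] - s."""
    return [a - s for a in v1]

def sub_sv(s, v2):
    """Scalar minus vector (.sv): s - v2[i], as in Y = a - X."""
    return [s - b for b in v2]

def fmadd_vsv(v1, s, v3):
    """Three-operand fused multiply-add with a scalar multiplier: v1[i]*s + v3[i]."""
    return [a * s + c for a, c in zip(v1, v3)]

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(sub_vs(x, 1.0))        # vector minus scalar
print(sub_sv(1.0, x))        # scalar minus vector gives a different result
print(fmadd_vsv(x, 2.0, y))  # the DAXPY pattern a*x + y
```

Because subtraction is not commutative, sub_vs and sub_sv give different answers for the same inputs, which is why only the asymmetric operations need the third form.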
Readers may notice that Figure 8.1 ignores the data type and width of the vector opera-
tions. The next section explains why.

8.3 Vector Registers and Dynamic Typing

RV32V adds 32 vector registers, whose names start with v, but the number of elements per
vector register varies. That number depends on both the width of the operations and on the
amount of memory dedicated to vector registers, which is up to the processor designer. For
example, if the processor allocated 4096 bytes for vector registers, that is enough for all 32
vector registers to have 16 64-bit elements, 32 32-bit elements, 64 16-bit elements, or 128
8-bit elements.
To keep the number of elements flexible in a vector ISA, a vector processor calculates the
maximum vector length (mvl) that programs use to run properly on processors with differing
amounts of memory for vector registers. The vector length register (vl) sets the number of
elements in a vector for a particular operation, which helps programs when a dimension of
an array is not a multiple of mvl. We’ll demonstrate mvl, vl, and the eight predicate registers
(vpi) in more detail in the following sections.
RV32V takes the novel approach of associating the data type and length with the vector
registers rather than with the instruction opcodes. A program tags the vector registers with
their data type and width before executing the vector computation instructions. Dynamic
register typing slashes the number of vector instructions, important because there are often six
integer and three floating-point versions of each vector instruction as Figure 8.1 shows. As
we shall see in Section 8.9 when we confront the numerous SIMD instructions, a dynamically
typed vector architecture reduces the cognitive load on the assembly language programmer
and the difficulty of the compiler’s code generator.
Another advantage of dynamic typing is that programs can disable unused vector regis-
ters. This feature allocates all the vector memory to the enabled vector registers. For example,
suppose only two vector registers are enabled, they are type 64-bit floats, and the processor
has 1024 bytes of vector register memory. The processor would halve the memory, giving
each vector register 512 bytes or 512/8 = 64 elements and therefore set mvl to 64. Thus, mvl
is dynamic, but its value is set by the processor and cannot be directly changed by software.
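
The two sizing examples in this section reduce to one division. The sketch below is a simplification that assumes every enabled register has the same element width and that the hardware splits vector memory evenly among the enabled registers; the function name is hypothetical.

```python
def max_vector_length(total_bytes, enabled_registers, element_bits):
    """Elements per register when vector memory is split evenly (assumed policy)."""
    bytes_per_register = total_bytes // enabled_registers
    return bytes_per_register // (element_bits // 8)

# 4096 bytes shared by all 32 registers, at each element width:
for bits in (64, 32, 16, 8):
    print(bits, max_vector_length(4096, 32, bits))

# 1024 bytes shared by two enabled 64-bit registers:
print(max_vector_length(1024, 2, 64))
```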
The source and destination registers determine the type and size of the operation and the
result, so conversions are implicit with dynamic typing. For example, a processor can multi-
ply a vector of double-precision floating-point numbers by a single-precision scalar without
first having to convert the operands to the same precision. This bonus benefit reduces the
total number of vector instructions and the number of instructions executed.
The vsetdcfg instruction sets the vector register types. Figure 8.2 shows the vector reg-
ister types available to RV32V plus more types for RV64V (see Chapter 9). RV32V requires
that vector floating-point operations have the scalar versions also. Thus, you must have at
least RV32FV to use the F32 type and RV32FDV to use the F64 type. RV32V introduces
a 16-bit floating-point format type F16. If an implementation supports both RV32V and
RV32F, then it must support both F16 and F32 formats.

Type       Floating Point    Signed Integer    Unsigned Integer

Width      Name   vetype     Name   vetype     Name   vetype
8 bits     –      –          X8     10 100     X8U    11 100
16 bits    F16    01 101     X16    10 101     X16U   11 101
32 bits    F32    01 110     X32    10 110     X32U   11 110
64 bits    F64    01 111     X64    10 111     X64U   11 111

Figure 8.2: RV32V encodings of vector register types. The rightmost three bits of the field show the width of
the data, and the two leftmost bits give its type. X64 and X64U are available only for RV64V. F16 and F32
require the RV32F extension and F64 requires RV32F and RV32D. F16 is the IEEE 754-2008 16-bit
floating-point format (binary16). Setting vetype to 00000 disables the vector registers. (Table 17.4 of
[Waterman and Asanović 2017] is the basis of this figure.)
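
As an illustration of how the 5-bit field in Figure 8.2 splits into a type and a width, here is a small Python decoder. It is a sketch built only from the encodings listed in the figure; the function name and the choice to raise an error for unlisted encodings are assumptions.

```python
# Type bits (leftmost two) as listed in Figure 8.2.
KINDS = {0b01: "F", 0b10: "X", 0b11: "XU"}

def decode_vetype(vetype):
    """Return the Figure 8.2 type name for a 5-bit vetype, or None if disabled."""
    if vetype == 0b00000:
        return None  # 00000 disables the vector register
    kind = KINDS.get(vetype >> 3)
    width_code = vetype & 0b111
    if kind is None or width_code < 0b100:
        raise ValueError(f"encoding {vetype:05b} is not listed in Figure 8.2")
    width = 8 << (width_code - 0b100)  # 100 -> 8, 101 -> 16, 110 -> 32, 111 -> 64
    if kind == "F" and width == 8:
        raise ValueError("Figure 8.2 lists no 8-bit floating-point type")
    return f"X{width}U" if kind == "XU" else f"{kind}{width}"

print(decode_vetype(0b01111))
print(decode_vetype(0b11101))
```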

Elaboration: RV32V can switch context quickly.
One reason vector architectures were less popular than SIMD architectures was concern that
adding large vector registers would stretch the time to save and restore a program on an
interrupt, called a context switch. Dynamic register typing helps. The programmer must tell
the processor which vector registers are being used, which means the processor needs to save
and restore only those registers on a context switch. The RV32V convention is to disable all
vector registers when the vector instructions aren’t being used, which means a processor can
have the performance benefit of vector registers but pay the extra context switch time only
if an interrupt occurs while the vector instructions are executing. Earlier vector architectures
had to pay the worst-case context switch cost of saving and restoring all vector registers
whenever an interrupt occurred.

Concern about slow context switch times led Intel to avoid adding registers in the original
MMX SIMD extension. It simply reused the existing floating-point registers, which meant no extra
context to switch, but a program couldn’t intermix floating-point and multimedia instructions.

8.4 Vector Loads and Stores

The easiest case for vector loads and stores is dealing with single-dimension arrays that are
stored sequentially in memory. Vector load fills a vector register with data from sequential
addresses in memory starting with the address in the vld instruction. The data type associated
with the vector register determines the size of the data elements and the vector length register
vl sets the number of elements to load. Vector store vst does the inverse operation of vld.
For example, if a0 has 1024, and the type of v0 is X32, then vld v0, 0(a0) will generate
the addresses 1024, 1028, 1032, 1036, ... until reaching the limit set by vl.

Each load and store has a 7-bit unsigned immediate offset that is scaled by the element
type in the destination register for loads and the source register for stores.

For multi-dimension arrays, some accesses will not be sequential. If stored in row major
order, sequential column accesses in a two-dimensional array want data elements separated
by the size of the row. Vector architectures support these accesses with strided data transfers:
vlds and vsts. While one could get the same effect as vld and vst by setting the stride
to the size of the element in vlds and vsts, vld and vst guarantee that all accesses will
be sequential, which makes it easier to deliver high memory bandwidth. Another reason is
that providing vld and vst reduces code size and instructions executed for the common case
of unit stride. The strided instructions vlds and vsts specify two source registers, with one
giving the starting address and the other specifying the stride in bytes.
For example, assume the starting address in a0 was address 1024, and the size of a row in
a1 was 64 bytes. vlds v0,a0,a1 would send this sequence of addresses to memory: 1024,
1088 (1024 + 1 × 64), 1152 (1024 + 2 × 64), 1216 (1024 + 3 × 64), and so on until the vector
length register vl tells it to stop. The returning data is written into sequential elements of the
destination vector register.
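
The address streams in the unit-stride and strided examples can be reproduced with a few lines of Python. This is only a model of the address arithmetic described in this section; the function names are hypothetical.

```python
def unit_stride_addresses(base, element_bytes, vl):
    """Addresses touched by a unit-stride load or store: consecutive elements."""
    return [base + i * element_bytes for i in range(vl)]

def strided_addresses(base, stride_bytes, vl):
    """Addresses touched by a strided load or store: one element every stride_bytes."""
    return [base + i * stride_bytes for i in range(vl)]

# vld v0, 0(a0) with a0 = 1024, type X32 (4-byte elements), first four elements
print(unit_stride_addresses(1024, 4, 4))
# vlds v0,a0,a1 with a0 = 1024 and a1 = 64, first four elements
print(strided_addresses(1024, 64, 4))
```

Setting stride_bytes equal to element_bytes makes the two functions agree, which mirrors the remark that vlds and vsts could emulate vld and vst.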
Thus far, we have assumed that the program is working with dense arrays. To support
sparse arrays, vector architectures offer indexed data transfers: vldx and vstx. One source
register for these instructions refers to a vector register and the other to a scalar register. The
scalar register has the starting address of the sparse array, and each element of the vector
register contains the index in bytes of the nonzero elements of the sparse array.
Suppose the starting address in a0 was address 1024, and vector register v1 had these byte
indices in the first 4 elements: 16, 48, 80, 160. vldx v0,a0,v1 would send this sequence of
addresses to memory: 1040 (1024 + 16), 1072 (1024 + 48), 1104 (1024 + 80), 1184 (1024 +
160). It loads the returning data into sequential elements of the destination vector register.
We used sparse arrays as our motivation for indexed loads and stores, but there are many
other algorithms that access data indirectly via a table of indices.

Indexed load is also called gather and indexed store is often named scatter.
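
A small Python model of indexed transfers follows, assuming memory is a dictionary keyed by byte address. The function names echo vldx and vstx but are illustrative helpers, not a definition of the instructions.

```python
def indexed_addresses(base, byte_indices):
    """Each address is the scalar base plus the byte index held in the index vector."""
    return [base + index for index in byte_indices]

def gather(memory, base, byte_indices):
    """Model of an indexed load: results land in sequential destination elements."""
    return [memory[addr] for addr in indexed_addresses(base, byte_indices)]

def scatter(memory, base, byte_indices, values):
    """Model of an indexed store: sequential source elements go to indexed addresses."""
    for addr, value in zip(indexed_addresses(base, byte_indices), values):
        memory[addr] = value

memory = {1040: 1.5, 1072: 2.5, 1104: 3.5, 1184: 4.5}
indices = [16, 48, 80, 160]
print(indexed_addresses(1024, indices))
v0 = gather(memory, 1024, indices)
scatter(memory, 1024, indices, [2 * e for e in v0])
print(memory)
```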

8.5 Parallelism During Vector Execution

While a simple vector processor might execute one vector element at a time, element oper-
ations are independent by definition, and so a processor could theoretically compute all of
them simultaneously. The widest data for RV32G is 64 bits, and today’s vector processors
typically execute two, four, or eight 64-bit elements per clock cycle. Hardware handles the
fringe cases when the vector length is not a multiple of the number of the elements executed
per clock cycle.
As with SIMD, the number of narrower operations per clock cycle grows by the ratio of the wide
data width to the narrow data width. Thus, a vector processor that computes four 64-bit operations
per clock cycle would normally launch eight 32-bit, sixteen 16-bit, or thirty-two 8-bit operations
per clock cycle.
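
The scaling is a single multiplication, sketched here in Python. The function name is hypothetical, and the model assumes the datapath width stays fixed while the element width shrinks.

```python
def operations_per_cycle(wide_ops, wide_bits, narrow_bits):
    """Narrow operations per cycle for a datapath that does wide_ops wide operations."""
    return wide_ops * (wide_bits // narrow_bits)

for bits in (64, 32, 16, 8):
    print(bits, operations_per_cycle(4, 64, bits))
```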
In SIMD, the ISA architect determines the maximum number of data parallel operations
per clock cycle and the number of elements per register. In contrast, the RV32V processor
designer picks both of them without having to change the ISA or the compiler, while every
doubling of SIMD register width doubles the number of SIMD instructions and requires
changes to the SIMD compilers. This hidden flexibility means the identical RV32V program
runs without change on the simplest and most aggressive vector processors.

8.6 Conditional Execution of Vector Operations

Some vector computations include if statements. Rather than rely on conditional branches,
vector architectures include a mask that suppresses operations on some elements of a vector
operation. The predicate instructions in Figure 8.1 perform conditional tests between two
vectors or a vector and a scalar and write into each element of the vector mask a 1 if the
condition holds or a 0 otherwise. (The vector mask must have the same number of elements
as the vector registers.) Any subsequent vector instruction can use that mask, where a 1 in
bit i means that element i is changed by vector operations, and a 0 means that element i is
unchanged.

A program is called vectorizable if most operations are performed by vector instructions.
Gather, scatter, and predicate instructions increase the number of vectorizable programs.

RV32V provides 8 vector predicate registers (vpi) to act as vector masks. The instructions
vpand, vpandn, vpor, vpxor, and vpnot perform logical operations that combine them
to allow efficient processing of nested conditional statements.
RV32V instructions specify either vp0 or vp1 to be the mask that controls a vector operation.
To perform a normal operation on all elements, one of those two predicate registers
must be set to all ones. To swap one of the other six predicate registers quickly into vp0 or
vp1, RV32V has the vpswap instruction. The predicate registers are also enabled dynamically,
and disabling them clears all the predicate registers quickly.
For example, suppose all the even-numbered elements of vector register v3 were negative
integers and all the odd-numbered elements were positive integers. The result of this code:

vplt.vs vp0,v3,x0 # set mask bits when elements of v3 < 0
add.vv,vp0 v0,v1,v2 # change elements of v0 to v1+v2 when true

would set all the even bits of vp0 to 1, all the odd bits to 0, and would replace all the even
elements of v0 with the sum of the corresponding elements of v1 and v2. The odd elements
of v0 would be unchanged.
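
The two-instruction example can be mimicked in Python as follows. Lists model the vector and predicate registers, the sample values are made up for illustration, and the function names are hypothetical.

```python
def predicate_less_than(v, scalar):
    """Model of a vector-scalar compare: mask bit i is 1 when v[i] < scalar."""
    return [1 if element < scalar else 0 for element in v]

def masked_add(vdest, v1, v2, mask):
    """Model of a masked add: element i is replaced only where mask[i] is 1."""
    return [a + b if m else old for old, a, b, m in zip(vdest, v1, v2, mask)]

v3 = [-1, 1, -2, 2]            # even elements negative, odd elements positive
vp0 = predicate_less_than(v3, 0)
v0 = [100, 100, 100, 100]
v1 = [1, 2, 3, 4]
v2 = [10, 20, 30, 40]
print(vp0)
print(masked_add(v0, v1, v2, vp0))
```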

8.7 Miscellaneous Vector Instructions

In addition to the instruction that configures the data types of vector registers mentioned above
(vsetdcfg), RV32V has setvl, which sets the vector length register (vl) and the destination
register to the smaller of the source operand and the maximum vector length (mvl). The reason
for picking the minimum is to decide in loops whether the vector code can run at the maximum
vector length (mvl) or must run at a smaller value to cover the remaining elements. Thus, to
handle the tail, setvl is executed every loop iteration.
RV32V also has three instructions that manipulate elements within a vector register.
Vector select (vselect) produces a new result vector by gathering elements from one
source data vector at the element locations specified by the second source index vector:

# vindices holds values from 0..mvl-1 that select elements from vsrc
vselect vdest, vsrc, vindices

Thus, if the first four elements of v2 contain 8, 0, 4, 2, then vselect v0,v1,v2 will replace
the zeroth element of v0 with the eighth element of v1, the first element of v0 with the zeroth
element of v1, the second element of v0 with the fourth element of v1, and the third element
of v0 with the second element of v1.
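
In Python terms, vector select is an indexed read within a register. The sketch below uses a hypothetical function name and made-up source values chosen so that each element reveals its own position.

```python
def vector_select(vsrc, vindices):
    """Element i of the result is vsrc[vindices[i]]."""
    return [vsrc[index] for index in vindices]

v1 = [100 + k for k in range(10)]  # element k holds 100 + k
v2 = [8, 0, 4, 2]
print(vector_select(v1, v2))
```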
Vector merge (vmerge) resembles vector select, but it uses a vector predicate register to
choose which of the sources to use. It produces a new result vector by gathering elements
from one of two source registers depending on the predicate register. The new element comes
from vsrc1 if the predicate vector register element is 0 or from vsrc2 if it is 1:

# vp0 bit i determines whether new element i for vdest
# comes from vsrc1 (if bit i == 0) or vsrc2 (if bit i == 1)
vmerge,vp0 vdest, vsrc1, vsrc2

Thus, if the first four elements of vp0 contain 1, 0, 0, 1, the first four elements of v1 contain 1,
2, 3, 4, and the first four elements of v2 contain 10, 20, 30, 40, then vmerge,vp0 v0,v1,v2
will make the first four elements of v0 be 10, 2, 3, 40.
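
The merge rule is a per-element choice between two sources, which the following Python sketch reproduces for the example values. The function name is hypothetical.

```python
def vector_merge(predicate, vsrc1, vsrc2):
    """Element i comes from vsrc1 when predicate[i] is 0 and from vsrc2 when it is 1."""
    return [b if p else a for p, a, b in zip(predicate, vsrc1, vsrc2)]

print(vector_merge([1, 0, 0, 1], [1, 2, 3, 4], [10, 20, 30, 40]))
```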
The vector extract instruction takes elements starting from the middle of one vector and
places these at the beginning of a second vector register:

# start is scalar reg holding element starting number of vsrc
vextract vdest, vsrc, start

# a0 is n, a1 is pointer to x[0], a2 is pointer to y[0], fa0 is a

0: li t0, 2<<25
4: vsetdcfg t0 # enable 2 64b Fl.Pt. registers
loop:
8: setvl t0, a0 # vl = t0 = min(mvl, n)
c: vld v0, a1 # load vector x
10: slli t1, t0, 3 # t1 = vl * 8 (in bytes)
14: vld v1, a2 # load vector y
18: add a1, a1, t1 # increment C pointer to x by vl*8
1c: vfmadd v1, v0, fa0, v1 # v1 += v0 * fa0 (y = a * x + y)
20: sub a0, a0, t0 # n -= vl (t0)
24: vst v1, a2 # store Y
28: add a2, a2, t1 # increment C pointer to y by vl*8
2c: bnez a0, loop # repeat if n != 0
30: ret # return

Figure 8.3: RV32V code for DAXPY in Figure 5.7. The machine language is missing because the RV32V
opcodes are yet to be defined.

For example, if vector length vl is 64 and a0 contains 32, then vextract v0,v1,a0 will
copy the last 32 elements of v1 into the first 32 elements of v0.
The vextract instruction assists reductions by following a recursive-halving approach
for any binary associative operator. For example, to sum all the elements of a vector register,
use vector extract to copy the last half of a vector into the first half of another vector register
and halve the vector length. Next, add these two vector registers together and repeat the
recursive-halving with their sum until vector length equals 1. The result in the zeroth element
will be the sum of all the original elements in the vector register.
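
The recursive-halving reduction can be sketched in Python as below. The helper models only the element movement described for vextract, the function names are hypothetical, and the sketch assumes the vector length is a power of two.

```python
def extract(vsrc, start, vl):
    """Model of vector extract: elements start..vl-1 move to the front."""
    return vsrc[start:vl]

def reduce_sum(v):
    """Sum a vector by repeatedly adding its upper half onto its lower half."""
    vl = len(v)
    assert vl > 0 and vl & (vl - 1) == 0, "sketch assumes a power-of-two length"
    acc = list(v)
    while vl > 1:
        half = vl // 2
        upper = extract(acc, half, vl)                     # copy the last half
        vl = half                                          # halve the vector length
        acc = [a + b for a, b in zip(acc[:vl], upper)]     # add the two halves
    return acc[0]

print(reduce_sum(list(range(64))))
```

A 64-element vector needs six halving steps (64 to 32 to 16 and so on down to 1), so the number of vector additions grows with the logarithm of the length rather than with the length itself.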

8.8 Vector Example: DAXPY in RV32V

Figure 8.3 shows the RV32V assembly language for DAXPY (Figure 5.7 on page 55 in Chapter 5),
which we’ll explain a step at a time.

The V in RISC-V is also for vector. The RISC-V architects had extensive positive experience
with vector architectures and were frustrated that SIMD dominated microprocessors. Hence, the
V is for the fifth Berkeley RISC project and because their ISA would highlight vectors.

RV32V DAXPY starts by enabling the vector registers needed for this function. It requires
only two vector registers to hold portions of x and y, which are double-precision
floating-point numbers each 8 bytes wide. The first instruction creates a constant and the
second writes it to the control status register that configures vector registers (vcfgd) to get two
registers of type F64 (see Figure 8.2). By definition, the hardware allocates the configured
registers in numerical order, yielding v0 and v1.
Let’s assume our RV32V processor has 1024 bytes of memory dedicated to vector registers.
The hardware allocates the memory evenly between the two vector registers, which
hold double-precision floating-point numbers (8 bytes). Each vector register has 512/8 = 64
elements, so the processor sets the maximum vector length (mvl) for this function to 64.
The first instruction in the loop sets the vector length for the following vector instructions.
The instruction setvl writes the smaller of the mvl and n into vl and t0. The insight is that
if the number of elements remaining, n, is larger than mvl, the fastest the code can crunch the data
is 64 values at a time, so set vl to mvl. If n is smaller than mvl, then we can’t read or write
beyond the end of x and y, so we should compute only on the last n elements in this final
iteration of the loop. setvl also writes to t0 to help with later loop bookkeeping at location
10.

Vector architectures without setvl have extra strip-mining code to set vl to the last n
elements of the loop and to check if n is initially zero.

The instruction vld at address c is a vector load from the address of x in scalar register
a1. It transfers vl elements of x from memory to v0. The following shift instruction slli
multiplies the vector length by the width of the data in bytes (8) for later use in incrementing
pointers to x and y.
The instruction at address 14 (vld) loads vl elements of y from memory into v1 and the
next instruction (add) increments the pointer to x.
The instruction at address 1c is the jackpot. vfmadd multiplies vl elements of x (v0) by
the scalar a (fa0) and adds each product to vl elements of y (v1) and stores those vl sums
back into y (v1).
All that is left is to store the results in memory and some loop overhead. The instruction
at address 20 (sub) decrements n (a0) by vl to record the number of operations completed in
this iteration of the loop. The following instruction (vst) stores vl results into y in memory.
The instruction at address 28 (add) increments the pointer to y and the following instruction
repeats the loop if n (a0) is not zero. If n is zero, the final instruction ret returns to the
calling site.
The power of vector architecture is that each iteration of this 10-instruction loop launches
3 × 64 = 192 memory accesses and 2 × 64 = 128 floating-point multiplies and additions
(assuming that n is at least 64). That averages about 19 memory accesses and 13 operations
per instruction. As we shall see in the next section, these ratios for SIMD are an order of
magnitude worse.
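
To tie the walkthrough together, here is a Python model of the strip-mined loop in Figure 8.3. It is a sketch rather than a simulator: it assumes mvl is 64, represents memory as Python lists, and counts executed instructions using the static layout of Figure 8.3 (2 setup instructions, 10 per loop iteration, and the final ret). The function name is hypothetical.

```python
def daxpy_rv32v_model(n, a, x, y, mvl=64):
    """Return (new y, instructions executed) for the loop structure of Figure 8.3."""
    y = list(y)
    executed = 2                      # li and vsetdcfg
    done = 0                          # elements already processed
    while True:
        vl = min(mvl, n)              # setvl: vl = t0 = min(mvl, n)
        for k in range(done, done + vl):
            y[k] = a * x[k] + y[k]    # vld x, vld y, vfmadd, vst on vl elements
        done += vl                    # the two pointer increments
        n -= vl                       # sub a0, a0, t0
        executed += 10                # ten instructions per trip through the loop
        if n == 0:                    # bnez a0, loop falls through when n == 0
            break
    executed += 1                     # ret
    return y, executed

result, count = daxpy_rv32v_model(1000, 2.0, [1.0] * 1000, [3.0] * 1000)
print(count)
```

For n = 1000 the loop runs 16 times (15 full strips of 64 and one of 40), giving 2 + 16 × 10 + 1 = 163 executed instructions, which agrees with the dynamic count listed in Figure 8.4. With n = 0 the loop body still executes once, but vl is 0, so the vector instructions in the model touch nothing.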

8.9 Comparing RV32V, MIPS-32 MSA SIMD, and x86-32 AVX SIMD
We’ll now see the contrast between how SIMD and vector execute DAXPY. If you tilt your
head, you can see SIMD as a restricted vector architecture with short vector registers (eight
8-bit “elements”), but it has no vector length register and no strided or indexed data transfers.

ARM-32 has a SIMD extension called NEON but it doesn’t support double-precision
floating-point instructions, so it doesn’t help DAXPY.

MIPS SIMD. Figure 8.5 on page 83 shows the MIPS SIMD Architecture (MSA) version
of DAXPY. Each MSA SIMD instruction can operate on two floating-point numbers since
the MSA registers are 128 bits wide.
Unlike RV32V, because there is no vector length register, MSA requires extra bookkeeping
instructions to check for problem values of n. When n is odd, there is extra code to
compute a single floating-point multiply-add since MSA must operate on pairs of operands.
That code is found in locations 3c to 4c in Figure 8.5. In the unlikely but possible case when
n is zero, the branch at location 10 will skip the main computation loop.

Such bookkeeping code is considered part of strip mining in vector architectures. As
the caption of Figure 8.5 explains, the vector length register vl renders such SIMD bookkeeping
code moot for RV32V. Traditional vector architectures need extra code to handle the
corner case of n = 0. RV32V just makes vector instructions act like nops when n = 0.

If it doesn’t branch around the loop, the instruction at location 18 (splati.d) puts copies
of a in both halves of the SIMD register w2. To add scalar data in SIMD, we need to replicate
it to be as wide as the SIMD register.
Inside the loop, the ld.d instruction at location 1c loads two elements of y into SIMD
register w0 and then increments the pointer to y. It then does a load of two elements of x
into the SIMD register w1. The following instruction at location 28 increments the pointer to
x. The payoff multiply-add instruction at location 2c is next.
The (delayed) branch at the end of the loop tests to see if the pointer to y has been
incremented beyond the last even element of y. If it hasn’t, the loop repeats. The SIMD store
in the delay slot at address 34 writes the result to two elements of y.

ISA                               MIPS-32 MSA   x86-32 AVX2   RV32FDV

Instructions (static)                      22            29        13
Bytes (static)                             88            92        52
Instructions per Main Loop                  7             6        10
Results per Main Loop                       2             4        64
Instructions (dynamic, n=1000)           3511          1517       163

Figure 8.4: Number of instructions and code size of DAXPY for vector ISAs. It lists number of instructions
total (static), code size, number of instructions and results per loop, and number of instructions executed (n
= 1000). microMIPS with MSA shrinks code size to 64 bytes and RV32FDCV reduces it to 40 bytes.

After the main loop terminates, the code checks to see if n is odd. If so, it performs the
last multiply-add using scalar instructions from Chapter 5. The final instruction returns to the
calling site.
The 7-instruction loop at the heart of the MIPS MSA DAXPY code does 6 double-
precision memory accesses and 4 floating-point multiplies and additions. The average is
about 1 memory access and 0.5 operations per instruction.
x86 SIMD. Intel has gone through many generations of SIMD extensions, which we see
in the code in Figure 8.6 on page 84. The SSE expansion to 128-bit SIMD led to the xmm
registers and instructions that can use them, and the expansion to 256-bit SIMD as part of
AVX created the ymm registers and their instructions.
The first group of instructions at addresses 0 to 25 loads the variables from memory, makes
four copies of a in a 256-bit ymm register, and tests to ensure n is at least 4 before entering the
main loop. It uses two SSE instructions and one AVX instruction. (The caption of Figure 8.6
explains how in more detail.)
The main loop does the heart of the DAXPY computation. The AVX instruction vmovapd
at address 27 loads 4 elements of x into ymm0. The AVX instruction vfmadd213pd at address
2c multiplies 4 copies of a (ymm2) times 4 elements of x (ymm0), adds 4 elements of y (in
memory at address ecx+edx*8), and puts the 4 sums into ymm0. The following AVX instruc-
tion at address 32, vmovapd, stores the 4 results into y. The next three instructions increment
counters and repeat the loop if needed.
As was the case for MIPS MSA, the “fringe” code between addresses 3e and 57 deals
with the cases when n is not a multiple of 4. It relies on three SSE instructions.
The 6 instructions of the main loop in the x86-32 AVX2 DAXPY code do 12 double-
precision memory accesses and 8 floating-point multiplies and additions. They average 2
memory accesses and about 1 operation per instruction.
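The three-part structure described above (setup, main SIMD loop, fringe loop) can be sketched in Python. This is only a behavioral model, not a translation of Figure 8.6: the function name `daxpy_simd` and the `width` parameter (the number of doubles per SIMD register, 2 for MSA and 4 for AVX) are assumptions made for illustration.

```python
def daxpy_simd(n, a, x, y, width=4):
    """Model of SIMD DAXPY: y[i] = a*x[i] + y[i], `width` elements per step.

    Returns (main-loop passes, fringe-loop passes) so the bookkeeping
    is visible. `width` is a parameter of this sketch only.
    """
    main = n - (n % width)            # floor(n/width)*width, like the "and" at address 1a
    simd_passes = 0
    for i in range(0, main, width):   # main loop: one SIMD multiply-add per pass
        for lane in range(width):     # real hardware handles these lanes in parallel
            y[i + lane] = a * x[i + lane] + y[i + lane]
        simd_passes += 1
    fringe_passes = 0
    for i in range(main, n):          # fringe loop: leftover elements, one at a time
        y[i] = a * x[i] + y[i]
        fringe_passes += 1
    return simd_passes, fringe_passes
```

With n = 10 and width = 4, the model makes two main-loop passes and two fringe passes; with n below the width, all the work falls to the fringe loop.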

Elaboration: The Illiac IV was the first to show the difficulty of compiling for SIMD.
With 64 parallel 64-bit floating-point units (FPUs), the Illiac IV was planned to have more
than 1 million logic gates before Moore published his law. Its architect originally predicted
1000 million floating-point operations per second (MFLOPS), but actual performance was 15
MFLOPS at best. Costs escalated from the $8M estimated in 1966 to $31M by 1972, despite
the construction of only 64 of the planned 256 FPUs. The project started in 1965 but took
until 1976 to run its first real application, the year the Cray-1 was unveiled. Perhaps the most
infamous supercomputer, it made a top 10 list of engineering disasters [Falk 1976].

8.10 Concluding Remarks

If the code is vectorizable, the best architecture is vector.
—Jim Smith, keynote speech, International Symposium on Computer Architecture, 1994

Figure 8.4 summarizes the number of instructions and number of bytes in the DAXPY
programs for RV32IFDV, MIPS-32 MSA, and x86-32 AVX2. The SIMD computation code is
dwarfed by the bookkeeping code. Two-thirds to three-fourths of the code for MIPS-32 MSA
and x86-32 AVX2 is SIMD overhead, either to prepare the data for the main SIMD loop or to
handle the fringe elements when n is not a multiple of the number of floating-point numbers
in a SIMD register.
RV32V code in Figure 8.3 doesn’t need such bookkeeping code, which halves the number
of instructions. Unlike SIMD, it has a vector length register, which makes the vector instruc-
tions work at any value of n. You might think RV32V would have a problem when n is 0. It
doesn’t because RV32V vector instructions leave everything unchanged when vl = 0.
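The effect of the vector length register can be sketched in Python. This is a hedged model rather than the code of Figure 8.3: the names `setvl` and `daxpy_vector` are illustrative, the loop shape (body executed at least once) is an assumption, and `mvl` stands for the maximum vector length that the processor would supply.

```python
def setvl(remaining, mvl):
    """Model of setvl: vl is the smaller of the remaining work and mvl."""
    return min(remaining, mvl)

def daxpy_vector(n, a, x, y, mvl=64):
    """Strip-mined DAXPY driven by vl; returns the number of loop passes."""
    i = 0
    passes = 0
    while True:
        vl = setvl(n - i, mvl)
        for k in range(i, i + vl):    # vector ops touch exactly vl elements
            y[k] = a * x[k] + y[k]    # nothing happens when vl == 0
        i += vl
        passes += 1
        if i == n:                    # no separate fringe code is needed
            break
    return passes
```

In this model the same loop handles n = 0, odd n, and any value of `mvl`; only the number of passes changes.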
However, the most significant difference between SIMD and vector processing is not the
static code size. The SIMD versions execute 10 to 20 times more instructions than RV32V
because each SIMD loop does only 2 or 4 elements instead of 64 in the vector case. The extra
instruction fetches and instruction decodes mean higher energy to perform the same task.
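A rough back-of-the-envelope check of the loop counts, as a sketch: it uses only the loop lengths quoted earlier in this chapter (7 instructions per pass for MSA, 6 for AVX) and n = 1000, and it ignores the setup and fringe code. The RV32V loop length is not restated here, so only the number of vector passes is computed.

```python
import math

def loop_passes(n, elements_per_pass):
    """Number of main-loop passes needed to cover n elements."""
    return math.ceil(n / elements_per_pass)

n = 1000
msa_loop_instructions = 7 * loop_passes(n, 2)   # 7-instruction loop, 2 doubles per pass
avx_loop_instructions = 6 * loop_passes(n, 4)   # 6-instruction loop, 4 doubles per pass
vector_passes = loop_passes(n, 64)              # 64 doubles per pass in the vector case
```

Under these assumptions the MSA main loop alone executes 3500 instructions and the AVX main loop 1500, while the vector loop runs only 16 times.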
Comparing the results in Figure 8.4 to the scalar versions of DAXPY in Figure 5.8 on
page 29 in Chapter 5, we see that SIMD roughly doubles the size of the code in instructions
and bytes, but the main loop is the same size. The reduction in the dynamic number of
instructions executed is a factor of 2 or 4, depending on the width of the SIMD registers.
However, the RV32V vector code size increases by a factor of 1.2 (with the main loop 1.4X)
but the dynamic instruction count is a factor of 43 smaller!
While dynamic instruction count is a large difference, in our view that is the second most
significant disparity between SIMD and vector. Lacking a vector length register explodes the
number of instructions as well as the bookkeeping code. ISAs like MIPS-32 and x86-32 that
follow the incrementalist doctrine must duplicate all the old SIMD instructions defined for
narrower SIMD registers every time they double the SIMD width. Surely, hundreds of MIPS-
32 and x86-32 instructions were created over many generations of SIMD ISAs and hundreds
more are in their future. The cognitive load on the assembly language programmer of this
brute force approach to ISA evolution must be overwhelming. How can one remember what
vfmadd213pd means and when to use it?
In comparison, RV32V code is unaffected by the size of the memory for vector registers.
Not only is RV32V unchanged if vector memory size expands, you don’t even have to re-
compile. Since the processor supplies the value of maximum vector length mvl, the code in
Figure 8.3 is untouched whether a processor raises the vector memory from 1024 bytes to,
say, 4096 bytes, or drops it to 256 bytes.
Unlike SIMD, where the ISA dictates the required hardware—and changing the ISA
means changing the compiler—the RV32V ISA allows processor designers to choose the
resources for data parallelism for their application without affecting the programmer or com-
piler. One can argue that SIMD violates the ISA design principle from Chapter 1 of isolating
the architecture from implementation.
We think the high contrast in cost-energy-performance, complexity, and ease of program-
ming between the modular vector approach of RV32V and the incrementalist SIMD architec-
tures of ARM-32, MIPS-32, and x86-32 might be the most persuasive argument for RISC-V.

8.11 To Learn More

H. Falk. What went wrong V: Reaching for a gigaflop: The fate of the famed Illiac IV was
shaped by both research brilliance and real-world disasters. IEEE Spectrum, 13(10):65–70,
1976.
J. L. Hennessy and D. A. Patterson. Computer architecture: a quantitative approach. Elsevier,
2011.
A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Version 2.2. May 2017. URL https://riscv.org/specifications/.

Notes
1 http://parlab.eecs.berkeley.edu

# a0 is n, a2 is pointer to x[0], a3 is pointer to y[0], $w13 is a

00000000 <daxpy>:
0: 2405fffe li a1,-2
4: 00852824 and a1,a0,a1 # a1 = floor(n/2)*2 (mask bit 0)
8: 000540c0 sll t0,a1,0x3 # t0 = a1 * 8 (byte offset of element a1)
c: 00e81821 addu v1,a3,t0 # v1 = &y[a1]
10: 10e30009 beq a3,v1,38 # if y==&y[a1] goto Fringe (t0==0 so n is 0 | 1)
14: 00c01025 move v0,a2 # (delay slot) v0 = &x[0]
18: 78786899 splati.d $w2,$w13[0] # w2 = fill SIMD register with copies of a
Loop:
1c: 78003823 ld.d $w0,0(a3) # w0 = 2 elements of y
20: 24e70010 addiu a3,a3,16 # increment C pointer to y by 2 Fl.Pt. numbers
24: 78001063 ld.d $w1,0(v0) # w1 = 2 elements of x
28: 24420010 addiu v0,v0,16 # increment C pointer to x by 2 Fl.Pt. numbers
2c: 7922081b fmadd.d $w0,$w1,$w2 # w0 = w0 + w1 * w2
30: 1467fffa bne v1,a3,1c # if (end of y != ptr to y) go to Loop
34: 7bfe3827 st.d $w0,-16(a3) # (delay slot) store 2 elts of y
Fringe:
38: 10a40005 beq a1,a0,50 # if (n is even) goto Done
3c: 00c83021 addu a2,a2,t0 # (delay slot) a2 = &x[n-1]
40: d4610000 ldc1 $f1,0(v1) # f1 = y[n-1]
44: d4c00000 ldc1 $f0,0(a2) # f0 = x[n-1]
48: 4c206b61 madd.d $f13,$f1,$f13,$f0 # f13 = f1 + f0 * f13 (muladd if n is odd)
4c: f46d0000 sdc1 $f13,0(v1) # y[n-1] = f13 (store odd result)
Done:
50: 03e00008 jr ra # return
54: 00000000 nop # (delay slot)

Figure 8.5: MIPS-32 MSA code for DAXPY in Figure 5.7. The bookkeeping overhead of SIMD is evident
when comparing this code to the RV32V code in Figure 8.3. The first part of the MIPS MSA code (addresses
0 to 18) duplicates the scalar variable a in a SIMD register and checks that n is at least 2 before
entering the main loop. The third part of the MIPS MSA code (addresses 38 to 4c) handles the fringe case
when n is not a multiple of 2. Such bookkeeping code is unneeded in RV32V because the vector length
register vl and the setvl instruction let the loop work for all values of n, whether odd or even.

# eax is i, n is esi, a is xmm1, pointer to x[0] is ebx, pointer to y[0] is ecx

00000000 <daxpy>:
0: 56 push esi
1: 53 push ebx
2: 8b 74 24 0c mov esi,[esp+0xc] # esi = n
6: 8b 5c 24 18 mov ebx,[esp+0x18] # ebx = x
a: c5 fb 10 4c 24 10 vmovsd xmm1,[esp+0x10] # xmm1 = a
10: 8b 4c 24 1c mov ecx,[esp+0x1c] # ecx = y
14: c5 fb 12 d1 vmovddup xmm2,xmm1 # xmm2 = {a,a}
18: 89 f0 mov eax,esi
1a: 83 e0 fc and eax,0xfffffffc # eax = floor(n/4)*4
1d: c4 e3 6d 18 d2 01 vinsertf128 ymm2,ymm2,xmm2,0x1 # ymm2 = {a,a,a,a}
23: 74 19 je 3e # if n < 4 goto Fringe
25: 31 d2 xor edx,edx # edx = 0
Loop:
27: c5 fd 28 04 d3 vmovapd ymm0,[ebx+edx*8] # load 4 elements of x
2c: c4 e2 ed a8 04 d1 vfmadd213pd ymm0,ymm2,[ecx+edx*8] # 4 mul adds
32: c5 fd 29 04 d1 vmovapd [ecx+edx*8],ymm0 # store into 4 elements of y
37: 83 c2 04 add edx,0x4
3a: 39 c2 cmp edx,eax # compare to n
3c: 72 e9 jb 27 # repeat loop if < n
Fringe:
3e: 39 c6 cmp esi,eax # any fringe elements?
40: 76 17 jbe 59 # if (n mod 4) == 0 goto Done
FringeLoop:
42: c5 fb 10 04 c3 vmovsd xmm0,[ebx+eax*8] # load element of x
47: c4 e2 f1 a9 04 c1 vfmadd213sd xmm0,xmm1,[ecx+eax*8] # 1 mul add
4d: c5 fb 11 04 c1 vmovsd [ecx+eax*8],xmm0 # store into element of y
52: 83 c0 01 add eax,0x1 # increment Fringe count
55: 39 c6 cmp esi,eax # compare Loop and Fringe counts
57: 75 e9 jne 42 <daxpy+0x42> # repeat FringeLoop if != 0
Done:
59: 5b pop ebx # function epilogue
5a: 5e pop esi
5b: c3 ret

Figure 8.6: x86-32 AVX2 code for DAXPY in Figure 5.7. The SSE instruction vmovsd at address a loads a
into half of the 128-bit xmm1 register. The SSE instruction vmovddup at address 14 duplicates a into both
halves of xmm2 for later SIMD computation. The AVX instruction vinsertf128 at address 1d makes four
copies of a in ymm2 starting from the two copies of a in xmm2. The three scalar instructions at addresses 42 to
4d (vmovsd, vfmadd213sd, vmovsd) handle the case when mod(n,4) ≠ 0. They perform the DAXPY computation one
element at a time, with the loop repeating until the function has performed exactly n multiply-add
operations. Once again, such code is unnecessary for RV32V because the vector length register vl and the
setvl instruction make the loop work for any value of n.
9 RV64: 64-bit Address Instructions

There is only one mistake that can be made in computer design that is difficult to recover
from—not having enough address bits for memory addressing and memory management.
—C. Gordon Bell, 1976

Sidebar: C. Gordon Bell (1934-) was one of the lead architects of two of the most popular
minicomputer architectures of their day: the Digital Equipment Corporation PDP-11 (16-bit
address), which was announced in 1970, and its successor seven years later, the Digital
Equipment Corporation 32-bit address VAX-11 (Virtual Address eXtension).

9.1 Introduction

Figures 9.1 to 9.4 show graphical representations of the RV64G versions of the RV32G
instructions. These figures illustrate the small increase in the number of instructions to switch
to a 64-bit ISA in RISC-V. The ISAs typically add only a few word, doubleword, or long
versions of the 32-bit instructions and expand all the registers, including the PC, to 64 bits.
Thus, sub in RV64I subtracts two 64-bit numbers rather than two 32-bit numbers as in RV32I.
RV64 is a close but actually different ISA than RV32; it adds a few instructions and the base
instructions do slightly different things.
For example, Insertion Sort for RV64I in Figure 9.8 is quite near the code for RV32I
in Figure 2.8 on page 27 in Chapter 2. It is the same number of instructions and the same
number of bytes. The only changes are that the load and store word instructions become load
and store doublewords, and the address increment goes from 4 for words (4 bytes) to 8 for
doublewords (8 bytes). Figure 9.5 lists the opcodes of the RV64GC instructions in Figures 9.1
to 9.4.
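To make the "only the access width and the stride change" point concrete, here is a small Python sketch of Insertion Sort over a byte-addressed memory. It is an illustration, not the book's C source: the helper names `make_memory` and `read_memory`, and the use of a little-endian `bytearray` as memory, are assumptions. Passing `size=4` mimics the word accesses of RV32I and `size=8` the doubleword accesses of RV64I.

```python
import struct

FORMATS = {4: "<i", 8: "<q"}   # little-endian signed word / doubleword

def insertion_sort(mem, n, size):
    """Sort n signed integers of `size` bytes each, stored in bytearray `mem`."""
    fmt = FORMATS[size]

    def load(addr):                       # lw when size == 4, ld when size == 8
        return struct.unpack_from(fmt, mem, addr)[0]

    def store(addr, value):               # sw when size == 4, sd when size == 8
        struct.pack_into(fmt, mem, addr, value)

    for i in range(1, n):
        x = load(i * size)                # x = a[i]
        j = i
        while j > 0 and load((j - 1) * size) > x:
            store(j * size, load((j - 1) * size))   # a[j] = a[j-1]
            j -= 1
        store(j * size, x)                # a[j] = x

def make_memory(values, size):
    """Hypothetical helper: lay the values out contiguously in memory."""
    mem = bytearray(len(values) * size)
    for k, v in enumerate(values):
        struct.pack_into(FORMATS[size], mem, k * size, v)
    return mem

def read_memory(mem, n, size):
    """Hypothetical helper: read n values back out of memory."""
    return [struct.unpack_from(FORMATS[size], mem, k * size)[0] for k in range(n)]
```

The sorting logic is identical for both sizes; only the format string and the address stride differ, which mirrors the small set of changes between the two assembly listings.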
Despite RV64I having 64-bit addresses and a default data size of 64 bits, 32-bit words are
valid data types in programs. Hence, RV64I needs to support words just as RV32I needs to
support bytes and halfwords. More specifically, since registers are now 64 bits wide, RV64I
adds word versions of addition and subtraction: addw, addiw, subw. They truncate their
results to 32 bits and write the sign-extended result to the destination register. RV64I also
includes word versions of the shift instructions to get a 32-bit shift result instead of a 64-bit
shift result: sllw, slliw, srlw, srliw, sraw, sraiw. To do 64-bit data transfers, it
has load and store doubleword: ld, sd. Finally, just as there are unsigned versions of load
byte and load halfword in RV32I, RV64I must have an unsigned version of load word: lwu.
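The word instructions can be modeled in a few lines of Python. This is a sketch of the behavior described above, with 64-bit register values held as unsigned Python integers; the helper `sext32` and the function names are illustrative, not an official simulator API.

```python
MASK64 = (1 << 64) - 1

def sext32(value):
    """Keep the low 32 bits and sign-extend them to a 64-bit register value."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:
        value |= 0xFFFFFFFF00000000
    return value

def add(a, b):
    return (a + b) & MASK64            # full 64-bit add

def addw(a, b):
    return sext32(a + b)               # 32-bit add, result sign-extended

def subw(a, b):
    return sext32(a - b)

def sllw(a, shamt):
    return sext32(a << (shamt & 31))

def srlw(a, shamt):
    return sext32((a & 0xFFFFFFFF) >> (shamt & 31))

def sraw(a, shamt):
    low = a & 0xFFFFFFFF
    if low & 0x80000000:
        low -= 1 << 32                 # reinterpret the low word as signed
    return sext32(low >> (shamt & 31))

def lw(word):
    return sext32(word)                # load word sign-extends

def lwu(word):
    return word & 0xFFFFFFFF           # load word unsigned zero-extends
```

For example, adding 1 to 0x7FFFFFFF with `add` gives 0x80000000, while `addw` wraps at 32 bits and sign-extends to 0xFFFFFFFF80000000.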
For similar reasons, RV64M needs to add word versions of multiply, divide, and remainder:
mulw, divw, divuw, remw, remuw. To allow the programmer to synchronize on
both words and doublewords, RV64A adds doubleword versions of all 11 of its instructions.

[Diagram of the RV64I instructions, in four groups: Integer Computation, Loads and Stores,
Miscellaneous instructions, and Control transfer.]
Figure 9.1: Diagram of the RV64I instructions. The underlined letters are concatenated from left to right to
form RV64I instructions. The dimmed portions are the old RV32I instructions extended to 64-bit registers
and the dark (red) portions are the new instructions for RV64I.

[Diagrams of the RV64M instructions (multiply, multiply high, divide, and remainder, with word variants)
and the RV64A instructions (atomic memory operations, load reserved, and store conditional, each in
.word and .doubleword forms).]

Figure 9.2: Diagrams of the RV64M and RV64A instructions.


[Diagram of the RV64F and RV64D instructions, in five groups: Floating-Point Computation, Load and Store,
Conversion, Comparison, and Other instructions.]

Figure 9.3: Diagram of the RV64F and RV64D instructions.

[Diagram of the RV64C instructions, in four groups: Integer Computation, Control transfer,
Loads and Stores, and Other instructions.]

Figure 9.4: Diagram of the RV64C instructions.


31 25 24 20 19 15 14 12 11 7 6 0
imm[11:0] rs1 110 rd 0000011 I lwu
imm[11:0] rs1 011 rd 0000011 I ld
imm[11:5] rs2 rs1 011 imm[4:0] 0100011 S sd
000000 shamt rs1 001 rd 0010011 I slli
000000 shamt rs1 101 rd 0010011 I srli
010000 shamt rs1 101 rd 0010011 I srai
imm[11:0] rs1 000 rd 0011011 I addiw
0000000 shamt rs1 001 rd 0011011 I slliw
0000000 shamt rs1 101 rd 0011011 I srliw
0100000 shamt rs1 101 rd 0011011 I sraiw
0000000 rs2 rs1 000 rd 0111011 R addw
0100000 rs2 rs1 000 rd 0111011 R subw
0000000 rs2 rs1 001 rd 0111011 R sllw
0000000 rs2 rs1 101 rd 0111011 R srlw
0100000 rs2 rs1 101 rd 0111011 R sraw

RV64M Standard Extension (in addition to RV32M)

0000001 rs2 rs1 000 rd 0111011 R mulw
0000001 rs2 rs1 100 rd 0111011 R divw
0000001 rs2 rs1 101 rd 0111011 R divuw
0000001 rs2 rs1 110 rd 0111011 R remw
0000001 rs2 rs1 111 rd 0111011 R remuw

RV64A Standard Extension (in addition to RV32A)

00010 aq rl 00000 rs1 011 rd 0101111 R lr.d
00011 aq rl rs2 rs1 011 rd 0101111 R sc.d
00001 aq rl rs2 rs1 011 rd 0101111 R amoswap.d
00000 aq rl rs2 rs1 011 rd 0101111 R amoadd.d
00100 aq rl rs2 rs1 011 rd 0101111 R amoxor.d
01100 aq rl rs2 rs1 011 rd 0101111 R amoand.d
01000 aq rl rs2 rs1 011 rd 0101111 R amoor.d
10000 aq rl rs2 rs1 011 rd 0101111 R amomin.d
10100 aq rl rs2 rs1 011 rd 0101111 R amomax.d
11000 aq rl rs2 rs1 011 rd 0101111 R amominu.d
11100 aq rl rs2 rs1 011 rd 0101111 R amomaxu.d

RV64F Standard Extension (in addition to RV32F)

1100000 00010 rs1 rm rd 1010011 R fcvt.l.s
1100000 00011 rs1 rm rd 1010011 R fcvt.lu.s
1101000 00010 rs1 rm rd 1010011 R fcvt.s.l
1101000 00011 rs1 rm rd 1010011 R fcvt.s.lu

RV64D Standard Extension (in addition to RV32D)

1100001 00010 rs1 rm rd 1010011 R fcvt.l.d
1100001 00011 rs1 rm rd 1010011 R fcvt.lu.d
1110001 00000 rs1 000 rd 1010011 R fmv.x.d
1101001 00010 rs1 rm rd 1010011 R fcvt.d.l
1101001 00011 rs1 rm rd 1010011 R fcvt.d.lu
1111001 00000 rs1 000 rd 1010011 R fmv.d.x

Figure 9.5: RV64 opcode map of the base instructions and optional extensions. It shows instruction layout,
opcodes, format type, and name. (Table 19.2 of [ Waterman and Asanović 2017] is the basis of this figure.)
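The field layout in Figure 9.5 can be exercised with a small encoder. The sketch below is illustrative only: the function names are hypothetical, and the register numbers rely on the standard ABI mapping in which a0 to a7 are x10 to x17. The expected machine words in the usage note are the ones printed next to the corresponding instructions in Figure 9.8.

```python
def encode_r(funct7, rs2, rs1, funct3, rd, opcode):
    """R-type: funct7 | rs2 | rs1 | funct3 | rd | opcode."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

def encode_i(imm, rs1, funct3, rd, opcode):
    """I-type: imm[11:0] | rs1 | funct3 | rd | opcode."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

def encode_s(imm, rs2, rs1, funct3, opcode):
    """S-type: imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode."""
    imm &= 0xFFF
    return ((imm >> 5) << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | ((imm & 0x1F) << 7) | opcode

# ABI register numbers (a0..a7 are x10..x17)
A0, A1, A2, A3, A5, A6, A7 = 10, 11, 12, 13, 15, 16, 17

def ld(rd, imm, rs1):
    return encode_i(imm, rs1, 0b011, rd, 0b0000011)

def sd(rs2, imm, rs1):
    return encode_s(imm, rs2, rs1, 0b011, 0b0100011)

def addw(rd, rs1, rs2):
    return encode_r(0b0000000, rs2, rs1, 0b000, rd, 0b0111011)
```

For instance, `ld(A6, 0, A3)` yields 0x0006b803 and `sd(A7, 0, A2)` yields 0x01163023, matching the words shown for `ld a6,0(a3)` and `sd a7,0(a2)` in the RV64I Insertion Sort listing.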

RV64F and RV64D add integer doublewords to the convert instructions, calling them
longs to prevent confusion with double precision floating-point data: fcvt.l.s,
fcvt.l.d, fcvt.lu.s, fcvt.lu.d, fcvt.s.l, fcvt.s.lu, fcvt.d.l,
fcvt.d.lu. As the integer x registers are now 64 bits wide, they can now hold double
precision floating-point data, so RV64D adds two floating-point moves: fmv.x.d and
fmv.d.x.
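The difference between a move and a convert is easy to blur, so here is a hedged Python sketch. The doubleword moves listed in Figure 9.5 (fmv.x.d and fmv.d.x) copy the 64 bits unchanged, whereas fcvt.l.d changes the representation. The function names are illustrative, and `fcvt_l_d` assumes round-toward-zero and ignores out-of-range values and NaNs, which the real instruction handles according to its rounding-mode field.

```python
import struct

def fmv_x_d(value):
    """Copy the bit pattern of a double into a 64-bit integer register."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]

def fmv_d_x(bits):
    """Copy a 64-bit integer bit pattern into a double register."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def fcvt_l_d(value):
    """Convert a double to a signed 64-bit integer (truncation assumed)."""
    return int(value)
```

Moving 1.0 gives the IEEE 754 bit pattern 0x3FF0000000000000, while converting 1.0 gives the integer 1.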
The one exception to the superset relationship between RV64 and RV32 is the compressed
instructions. RV64C replaced a few RV32C instructions, since other instructions shrank
code more for 64-bit addresses. RV64C drops the compressed jump and link (c.jal) and the
single-precision floating-point load and store word instructions (c.flw, c.fsw, c.flwsp,
and c.fswsp); the integer load and store word instructions (c.lw, c.sw, c.lwsp, c.swsp)
remain. In place of the dropped instructions, RV64C adds the more popular add
and subtract word instructions (c.addw, c.addiw, c.subw) and load and store doubleword
instructions (c.ld, c.sd, c.ldsp, c.sdsp).

Elaboration: The RV64 ABIs are lp64, lp64f, and lp64d.
lp64 means that the C language data types long and pointer are 64 bits; int is still 32 bits. The
suffixes f and d indicate how floating-point arguments are passed, which is the same as for
RV32 (see Chapter 3).

Elaboration: There is no instruction diagram for RV64V
because it exactly matches RV32V due to dynamic register typing. The only change is that
the X64 and X64U dynamic register types in Figure 8.2 on page 75 are available in RV64V
but not RV32V.

9.2 Comparison to Other 64-bit ISAs using Insertion Sort

As Gordon Bell said at the opening of this chapter, the one fatal architecture flaw is running
out of address bits. As programs pushed the limits of a 32-bit address space, architects began
to make 64-bit address versions of their ISAs [Mashey 2009].
The earliest was MIPS in 1991. It extended all registers and the program counter from
32 to 64 bits and added new 64-bit versions of the MIPS-32 instructions. The MIPS-64
assembly language instructions all begin with the letter “d”, such as daddu or dsll (see
Figure 9.10). Programmers can mix MIPS-32 and MIPS-64 instructions in the same program.
MIPS-64 dropped the load delay slot from MIPS-32 (the pipeline stalls on a read-after-write
dependence).
A decade later, it was time for a successor to x86-32. When architects increased the
addressing size, they took the opportunity to make a few more improvements in x86-64:
• Increased the number of integer registers from 8 to 16 (r8–r15);
• Increased the number of SIMD registers from 8 to 16 (xmm8–xmm15); and
• Added PC-relative data addressing to better support position-independent code.
These improvements smoothed some rough edges of x86-32.
You can see the benefits by comparing the x86-32 version of Insertion Sort in Figure 2.11
on page 30 in Chapter 2 to the x86-64 version in Figure 9.11. The newer ISA keeps all the
variables in registers rather than having several in memory, which reduces the instruction
count from 20 to 15 instructions. The code size is actually larger by one byte with the newer
ISA despite having fewer instructions: 46 versus 45. The reason is that to squeeze in the new
opcodes to enable more registers, x86-64 added a prefix byte to identify the new instructions.
The average instruction length increases in x86-64 over x86-32.

ISA           ARM-64  MIPS-64  x86-64  RV64I  RV64I+RV64C
Instructions      16       24      15     19           19
Bytes             64       96      46     76           52

Figure 9.6: Number of instructions and code size for Insertion Sort for four ISAs. ARM Thumb-2 and
microMIPS are 32-bit address ISAs, so are unavailable for ARM-64 and MIPS-64.
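The claim about average instruction length can be checked with simple arithmetic. The sketch below uses only the instruction and byte counts from Figure 9.6 plus the x86-32 numbers quoted in the text (20 instructions, 45 bytes); the dictionary and function names are illustrative.

```python
# (instructions, bytes) for Insertion Sort
SIZES = {
    "ARM-64": (16, 64),
    "MIPS-64": (24, 96),
    "x86-32": (20, 45),
    "x86-64": (15, 46),
    "RV64I": (19, 76),
    "RV64I+RV64C": (19, 52),
}

def average_length(isa):
    """Average bytes per instruction for the given ISA's Insertion Sort."""
    instructions, nbytes = SIZES[isa]
    return nbytes / instructions
```

By this measure x86-32 averages 2.25 bytes per instruction and x86-64 a little over 3, while the fixed-length ISAs sit at exactly 4 and RV64I+RV64C falls below 3.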
ARM faced the same address problem another decade later. Rather than evolve the old
ISA to have 64-bit addresses as did x86-64, they used the opportunity to invent a brand new
ISA. Given a fresh start, they changed many of the awkward ARM-32 traits to give them a
modern ISA:

• Increase the number of integer registers from 15 to 31;

• Remove the PC from the set of registers;
• Provide a register that’s hardwired to zero for most instructions (r31);
• Unlike ARM-32, all ARM-64 data addressing modes work with all data sizes and
types;
• ARM-64 dropped the load and store multiple instructions of ARM-32; and
• ARM-64 omitted the conditional execution option of ARM-32 instructions.

It still shares some weaknesses of ARM-32: condition codes for branch, source and desti-
nation register fields move in the instruction format, conditional move instructions, complex
addressing modes, inconsistent performance counters, and only 32-bit length instructions.
ARM-64 can’t switch to the Thumb-2 ISA, as Thumb-2 only works with 32-bit addresses.
Unlike RISC-V, ARM decided to take a maximalist approach to ISA design. While certainly
a better ISA than ARM-32, it is also bigger. For example, it has more than 1000
instructions and the ARM-64 manual is 3185 pages long [ARM 2015]. Moreover, it is still
growing. There have been three expansions of ARM-64 since its announcement a few years
ago.

Sidebar: Intel didn’t invent the x86-64 ISA. When switching to 64-bit addresses, Intel invented a
new ISA called Itanium that was incompatible with x86-32. Its competitor for x86-32 processors was
locked out of Itanium, so AMD invented a 64-bit version of x86-32 called AMD64. Itanium eventually
failed, so Intel was forced to adopt the AMD64 ISA as the 64-bit address successor of x86-32, which
we call x86-64 [Kerner and Padgett 2007].

The ARM-64 code for Insertion Sort in Figure 9.9 looks closer to the RV64I code or
x86-64 code than to the ARM-32 code. For example, with 31 registers, there is no need to
save and restore registers from the stack. And since the PC is no longer one of the registers,
ARM-64 uses a separate return instruction.
Figure 9.6 is a table that summarizes the number of instructions and number of bytes in
Insertion Sort for the ISAs. Figures 9.8 to 9.11 show the compiled code for RV64I, ARM-64,
MIPS-64, and x86-64. Parenthetical phrases in the comments of these four programs identify
the differences between the 32-bit versions in Chapter 2 and these 64-bit versions.
MIPS-64 needs the most instructions, primarily because of the nop instructions of the
unfilled delayed branch slots. RV64I needs fewer because of the compare-and-branch
instructions and no delayed branch. While ARM-64 and x86-64 need two compare instructions
that are unnecessary for RV64I, their scaling addressing modes avoid address arithmetic
instructions needed in RV64I, giving them the fewest instructions. However, RV64I+RV64C
has much smaller code size, as the next section explains.

[Bar chart of code size relative to RV64GC. RISC-V RV64GC (16b & 32b): 1.00; RISC-V RV64G (32b): 1.35;
ARM-64 (32b): 1.23; Intel x86-64 (variable 8b): 1.34.]

Figure 9.7: Relative program sizes for RV64G, ARM-64, and x86-64 versus RV64GC. This comparison
measures much bigger programs than in Figure 9.6. This graph is the 64-bit address equivalent to the graph
of 32-bit ISAs in Figure 1.5 on page 9 in Chapter 1. RV32C code size almost matches RV64C; it is 1%
smaller. There is no Thumb-2 option for ARM-64, so the code of other 64-bit ISAs significantly exceeds the
size of RV64GC code. The programs measured were the SPEC CPU2006 benchmarks using the GCC
compilers [Waterman 2016].

Elaboration: ARM-64, MIPS-64, and x86-64 aren’t the official names.
The official names are: ARMv8 is what we call ARM-64, MIPS-IV is MIPS-64, and AMD64
is x86-64 (see the sidebar on the previous page for the history of x86-64).

9.3 Program size

Figure 9.7 compares average relative code sizes for RV64, ARM-64, and x86-64. Compare
this figure to Figure 1.5 on page 9 in Chapter 1. First, RV32GC code is almost identical in size
to RV64GC; it is only 1% smaller. This closeness is also true for RV32I and RV64I. While
ARM-64 code is 8% smaller than ARM-32 code, there is no 64-bit address version of Thumb-2,
so all instructions remain 32 bits long. Hence, ARM-64 code is 25% larger than ARM
Thumb-2 code. Code for x86-64 is 7% larger than x86-32 code due to adding prefix opcodes
to x86-64 instructions to accommodate new operations and the expanded set of registers.
RV64GC wins, as ARM-64 code is 23% bigger than RV64GC and x86-64 code is 34% bigger
than RV64GC. That difference is large enough that it will either improve performance due to
lower instruction cache miss rates or reduce cost by allowing a smaller instruction cache that
still provides satisfactory miss rates.

9.4 Concluding Remarks

One of the problems of being a pioneer is you always make mistakes, and I never, never
want to be a pioneer. It’s always best to come second when you can look at the mistakes
the pioneers made.
—Seymour Cray, architect of the first supercomputer, 1976

Running out of address bits is the Achilles heel of computer architecture. Many an
architecture has died from a wound there. ARM-32 and Thumb-2 remain 32-bit architectures, so
they’re no help for big programs. Some ISAs like MIPS-64 and x86-64 survived the
transition, but x86-64 is not a paragon of ISA design and the future of MIPS-64 is unclear at the
time of this writing. ARM-64 is a new large ISA, and time will tell how successful it will be.

Sidebar: MIPS is for sale. Imagination Technologies, which bought the MIPS ISA in 2012 for $100M,
recently announced that it is for sale; no buyers yet.

RISC-V benefited from designing both the 32-bit and the 64-bit architectures together,
whereas older ISAs had to architect them sequentially. Unsurprisingly, the transition between
32-bit and 64-bit is easiest for RISC-V programmers and compiler writers; the RV64I ISA
has virtually all RV32I instructions. Indeed, that is why we can list both RV32GCV and
RV64GCV in only two pages of the Reference Card. More important, the simultaneous
design meant the 64-bit architecture did not have to be squeezed into a cramped 32-bit opcode
space. RV64I has plenty of room for optional instruction extensions, particularly RV64C,
which makes it the leader in code size.
We see the 64-bit architecture as more evidence of RISC-V’s sound design, admittedly
easier to achieve if you start 20 years later so that you can borrow the pioneers’ good ideas as
well as learn from their mistakes.

Elaboration: RV128
RV128 began as an inside joke with the RISC-V architects, simply to show that a 128-bit
address ISA was possible. However, warehouse scale computers may soon have more than
2^64 bytes of semiconductor storage (DRAM and Flash memory), which programmers might
want to access as a memory address. There are also proposals to use a 128-bit address to
improve security [Woodruff et al. 2014]. The RISC-V manual does specify a full 128-bit ISA
called RV128G [Waterman and Asanović 2017]. The additional instructions are basically the
same as needed to go from RV32 to RV64, which Figures 9.1 to 9.4 illustrate. All the registers
also grow to 128 bits, and the new RV128 instructions specify either 128-bit versions of some
instructions (using Q in the name for quadword) or 64-bit versions of others (using D in
the name for doubleword).

9.5 To Learn More

ARM. ARMv8-A Architecture Reference Manual. 2015.
M. Kerner and N. Padgett. A history of modern 64-bit computing. Technical report, CS
Department, University of Washington, Feb 2007. URL
http://courses.cs.washington.edu/courses/csep590/06au/projects/history-64-bit.pdf.
J. Mashey. The long road to 64 bits. Communications of the ACM, 52(1):45–53, 2009.
A. Waterman. Design of the RISC-V Instruction Set Architecture. PhD thesis, EECS
Department, University of California, Berkeley, Jan 2016. URL
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-1.html.
A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Version 2.2. May 2017. URL https://riscv.org/specifications/.
J. Woodruff, R. N. Watson, D. Chisnall, S. W. Moore, J. Anderson, B. Davis, B. Laurie,
P. G. Neumann, R. Norton, and M. Roe. The CHERI capability model: Revisiting RISC
in an age of risk. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International
Symposium on, pages 457–468. IEEE, 2014.

Notes
1 http://parlab.eecs.berkeley.edu

# RV64I (19 instructions, 76 bytes, or 52 bytes with RV64C)

# a1 is n, a0 points to a[0], a3 points to a[i], a4 is i, a5 is j, a6 is x
0: 00850693 addi a3,a0,8 # (8 vs 4) a3 is pointer to a[i]
4: 00100713 li a4,1 # i = 1
Outer Loop:
8: 00b76463 bltu a4,a1,10 # if i < n, jump to Continue Outer loop
Exit Outer Loop:
c: 00008067 ret # return from function
Continue Outer Loop:
10: 0006b803 ld a6,0(a3) # (ld vs lw) x = a[i]
14: 00068613 mv a2,a3 # a2 is pointer to a[j]
18: 00070793 mv a5,a4 # j = i
Inner Loop:
1c: ff863883 ld a7,-8(a2) # (ld vs lw, 8 vs 4) a7 = a[j-1]
20: 01185a63 ble a7,a6,34 # if a[j-1] <= a[i], jump to Exit Inner Loop
24: 01163023 sd a7,0(a2) # (sd vs sw) a[j] = a[j-1]
28: fff78793 addi a5,a5,-1 # j--
2c: ff860613 addi a2,a2,-8 # (8 vs 4) decrement a2 to point to a[j]
30: fe0796e3 bnez a5,1c # if j != 0, jump to Inner Loop
Exit Inner Loop:
34: 00379793 slli a5,a5,0x3 # (8 vs 4) multiply a5 by 8
38: 00f507b3 add a5,a0,a5 # a5 is now byte address of a[j]
3c: 0107b023 sd a6,0(a5) # (sd vs sw) a[j] = x
40: 00170713 addi a4,a4,1 # i++
44: 00868693 addi a3,a3,8 # increment a3 to point to a[i]
48: fc1ff06f j 8 # jump to Outer Loop # continue outer loop

Figure 9.8: RV64I code for Insertion Sort in Figure 2.5. The RV64I assembly language program is very
similar to the RV32I assembly language in Figure 2.8 on page 27 in Chapter 2. We list the differences in
parentheses in the comments. The size of the data is now 8 bytes instead of 4, so three instructions change the
constant 4 to 8. This extra width also stretches two load words (lw) to load doublewords (ld) and two store
words (sw) to store doublewords (sd).

# ARM-64 (16 instructions, 64 bytes)

# x0 points to a[0], x1 is n, x2 is j, x3 is i, x4 is x
0: d2800023 mov x3, #0x1 # i = 1
Outer Loop:
4: eb01007f cmp x3, x1 # compare i vs n
8: 54000043 b.cc 10 # if i < n, jump to Continue Outer loop
Exit Outer Loop:
c: d65f03c0 ret # return from function
Continue Outer Loop:
10: f8637804 ldr x4, [x0, x3, lsl #3] # (x4 vs r4) x = a[i]
14: aa0303e2 mov x2, x3 # (x2 vs r2) j = i
Inner Loop:
18: 8b020c05 add x5, x0, x2, lsl #3 # x5 is pointer to a[j]
1c: f85f80a5 ldur x5, [x5, #-8] # x5 = a[j-1]
20: eb0400bf cmp x5, x4 # compare a[j-1] vs. x
24: 5400008d b.le 34 # if a[j-1]<=a[i], jump to Exit Inner Loop

28: f8227805 str x5, [x0, x2, lsl #3] # a[j] = a[j-1]
2c: f1000442 subs x2, x2, #0x1 # j--
30: 54ffff41 b.ne 18 # if j != 0, jump to Inner Loop
Exit Inner Loop:
34: f8227804 str x4, [x0, x2, lsl #3] # a[j] = x
38: 91000463 add x3, x3, #0x1 # i++
3c: 17fffff2 b 4 # jump to Outer Loop

Figure 9.9: ARM-64 code for Insertion Sort in Figure 2.5. The ARM-64 assembly language program is
different from the ARM-32 assembly language in Figure 2.11 on page 30 in Chapter 2 since it is a new
instruction set. The registers start with x instead of a. The data addressing modes can shift a register by 3
bits to scale the index to a byte address. With 31 registers, there is no need to save and restore registers from
the stack. Since PC is not one of the registers, it uses a separate return instruction. In fact, the code looks
closer to the RV64I code or x86-64 code than to the ARM-32 code.

# MIPS-64 (24 instructions, 96 bytes)

# a0 is pointer to a[0], a1 is n, v0 is j, v1 is i, a4 is x
0: 64860008 daddiu a2,a0,8 # (daddiu vs addiu, 8 vs 4) a2 is pointer to a[i]
4: 24030001 li v1,1 # i = 1
Outer Loop:
8: 0065102b sltu v0,v1,a1 # set on i < n
c: 14400003 bnez v0,1c # if i < n, jump to Continue Outer Loop
10: 00c03825 move a3,a2 # a3 is pointer to a[j] (slot filled)
14: 03e00008 jr ra # return from function
18: 00000000 nop # branch delay slot unfilled
Continue Outer Loop:
1c: dcc80000 ld a4,0(a2) # (ld vs lw) x = a[i]
20: 00601025 move v0,v1 # j = i
Inner Loop:
24: dce9fff8 ld a5,-8(a3) # (ld vs lw, 8 vs. 4, a5 vs t1) a5 = a[j-1]
28: 0109502a slt a6,a4,a5 # (no load delay slot) set a[i] < a[j-1]
2c: 11400005 beqz a6,44 # if a[j-1] <= a[i], jump to Exit Inner Loop
30: 00000000 nop # branch delay slot unfilled
34: 6442ffff daddiu v0,v0,-1 # (daddiu vs addiu) j--
38: fce90000 sd a5,0(a3) # (sd vs sw, a5 vs t1) a[j] = a[j-1]
3c: 1440fff9 bnez v0,24 # if j != 0, jump to Inner Loop (next slot filled)
40: 64e7fff8 daddiu a3,a3,-8 # (daddiu vs addiu, 8 vs 4) decr a3 pointer to a[j]
Exit Inner Loop:
44: 000210f8 dsll v0,v0,0x3 # (dsll vs sll)
48: 0082102d daddu v0,a0,v0 # (daddu vs addu) v0 now byte address of a[j]
4c: fc480000 sd a4,0(v0) # (sd vs sw) a[j] = x
50: 64630001 daddiu v1,v1,1 # (daddiu vs addiu) i++
54: 1000ffec b 8 # jump to Outer Loop (next delay slot filled)
58: 64c60008 daddiu a2,a2,8 # (daddiu vs addiu, 8 vs 4) incr a2 pointer to a[i]
5c: 00000000 nop # Unnecessary(?)

Figure 9.10: MIPS-64 code for Insertion Sort in Figure 2.5. The MIPS-64 assembly language program has
several differences from the MIPS-32 assembly language in Figure 2.10 on page 29 in Chapter 2. First,
most operations for 64-bit data prepend a “d” to their names: daddiu, daddu, dsll. Like Figure 9.8,
three instructions change the constant from 4 to 8 since the size of the data grew from 4 to 8 bytes. Again like
RV64I, the extra width also stretches two load words (lw) to load doublewords (ld) and two store words (sw)
to store doublewords (sd). Finally, MIPS-64 does not have the load delay slot from MIPS-32; the pipeline
stalls on a read-after-write dependence.

# x86-64 (15 instructions, 46 bytes)

# rax is j, rcx is x, rdx is i, rsi is n, rdi is pointer to a[0]
0: ba 01 00 00 00 mov edx,0x1
Outer Loop:
5: 48 39 f2 cmp rdx,rsi # compare i vs. n
8: 73 23 jae 2d <Exit Loop> # if i >= n, jump to Exit Outer Loop
a: 48 8b 0c d7 mov rcx,[rdi+rdx*8] # x = a[i]
e: 48 89 d0 mov rax,rdx # j = i
Inner Loop:
11: 4c 8b 44 c7 f8 mov r8,[rdi+rax*8-0x8] # r8 = a[j-1]
16: 49 39 c8 cmp r8,rcx # compare a[j-1] vs. x
19: 7e 09 jle 24 <Exit Loop> # if a[j-1]<=a[i],jump to Exit InnerLoop
1b: 4c 89 04 c7 mov [rdi+rax*8],r8 # a[j] = a[j-1]
1f: 48 ff c8 dec rax # j--
22: 75 ed jne 11 <Inner Loop> # if j != 0, jump to Inner Loop
Exit InnerLoop:
24: 48 89 0c c7 mov [rdi+rax*8],rcx # a[j] = x
28: 48 ff c2 inc rdx # i++
2b: eb d8 jmp 5 <Outer Loop> # jump to Outer Loop
Exit Outer Loop:
2d: c3 ret # return from function

Figure 9.11: x86-64 code for Insertion Sort in Figure 2.5. The x86-64 assembly language program is quite
different from the x86-32 assembly language in Figure 2.11 on page 30 in Chapter 2. First, unlike RV64I,
the wider registers have different names: rax, rcx, rdx, rsi, rdi, r8. Second, because x86-64 added 8
more registers, there are now enough to keep all the variables in registers instead of in memory. Third, the
x86-64 instructions are longer than for x86-32 since many need to prepend 8 bits or 16 bits to fit the new
instructions in the opcode space. For example, incrementing or decrementing a register (inc, dec) takes 1
byte in x86-32 but 3 bytes in x86-64. Hence, despite having many fewer instructions, the x86-64 code for
Insertion Sort is almost identical in size to x86-32: 45 bytes for x86-32 vs. 46 bytes for x86-64.
10 RV32/64 Privileged Architecture

Simplicity is prerequisite for reliability.
—Edsger W. Dijkstra

Edsger W. Dijkstra (1930–2002) received the 1972 Turing Award for fundamental contributions
to developing programming languages.

10.1 Introduction
The book so far has focused on RISC-V support for general-purpose computation: all of
the instructions we’ve introduced are available in user mode, where application code usually
runs. This chapter introduces two new privilege modes: machine mode, which runs the most
trusted code, and supervisor mode, which provides support for operating systems like Linux,
FreeBSD, and Windows. Both new modes are more privileged than user mode, hence the
title of the chapter. More-privileged modes generally have access to all of the features of
less-privileged modes, and they add additional functionality not available to less-privileged
modes, such as the ability to handle interrupts and perform I/O. Processors typically spend
most of their execution time in their least-privileged mode; interrupts and exceptions transfer
control to more-privileged modes.
Embedded-system runtimes and operating systems use the features of these new modes
to respond to external events, like the arrival of network packets; to support multitasking and
protection between tasks; and to abstract and virtualize hardware features. Given the breadth
of these topics, a thorough programmer’s guide would be an entire additional book; instead,
this chapter aims to hit the high notes of the RISC-V features. Programmers uninterested in
embedded system runtimes and operating systems can either skip or skim this chapter.
Figure 10.1 is a graphical representation of the RISC-V privileged instructions, and Fig-
ure 10.2 lists these instructions’ opcodes. As you can see, the privileged architecture adds

RV32/64 Privileged Instructions

machine-mode trap return (mret)
supervisor-mode trap return (sret)
supervisor-mode fence.virtual memory address (sfence.vma)
wait for interrupt (wfi)

Figure 10.1: Diagram of the RISC-V privileged instructions.


31 27 26 25 24 20 19 15 14 12 11 7 6 0
0001000 00010 00000 000 00000 1110011 R sret
0011000 00010 00000 000 00000 1110011 R mret
0001000 00101 00000 000 00000 1110011 R wfi
0001001 rs2 rs1 000 00000 1110011 R sfence.vma

Figure 10.2: RISC-V privileged instruction layout, opcodes, format type, and name. (Table 6.1 of [Waterman
and Asanović 2017] is the basis of this figure.)

very few instructions; instead, several new control and status registers (CSRs) expose the
additional functionality.
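To see how little encoding space these instructions occupy, the following Python sketch packs the field values from Figure 10.2 into 32-bit instruction words. The helper names encode_r and sfence_vma are ours, chosen only for illustration; the bit patterns themselves are copied from the figure.

```python
# Sketch: assemble the privileged instructions of Figure 10.2.
# encode_r and sfence_vma are hypothetical helper names, not part of any tool.

def encode_r(funct7, rs2, rs1, funct3, rd, opcode):
    """Pack R-type fields into one 32-bit instruction word."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (funct3 << 12) | (rd << 7) | opcode)

SYSTEM = 0b1110011  # the opcode field shared by all four rows of Figure 10.2

SRET = encode_r(0b0001000, 0b00010, 0, 0, 0, SYSTEM)
MRET = encode_r(0b0011000, 0b00010, 0, 0, 0, SYSTEM)
WFI = encode_r(0b0001000, 0b00101, 0, 0, 0, SYSTEM)

def sfence_vma(rs1, rs2):
    """sfence.vma is the only one of the four with register operands."""
    return encode_r(0b0001001, rs2, rs1, 0, 0, SYSTEM)

if __name__ == "__main__":
    for name, word in [("sret", SRET), ("mret", MRET), ("wfi", WFI),
                       ("sfence.vma x0, x0", sfence_vma(0, 0))]:
        print(f"{name:18s} 0x{word:08x}")
```

Three of the four instructions take no operands at all, so each is a single fixed word; only sfence.vma varies with its two register numbers.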
This chapter describes the RV32 and RV64 privileged architectures together. Some con-
cepts differ only in the size of an integer register, so to keep the descriptions concise, we
introduce the term XLEN to refer to the width of an integer register in bits. XLEN is 32 for
RV32 or 64 for RV64.

10.2 Machine Mode for Simple Embedded Systems

Machine mode, abbreviated as M-mode, is the most privileged mode that a RISC-V hart
(hardware thread) can execute in. Harts running in M-mode have full access to memory, I/O,
and low-level system features necessary to boot and configure the system. As such, it is the
only privilege mode that all standard RISC-V processors implement; indeed, simple RISC-V
microcontrollers support only M-mode. Such systems are the focus of this section.

Hart is a contraction of hardware thread. We use the term to distinguish them from software
threads, which most programmers are familiar with. Software threads are time-multiplexed on
harts. Most processor cores have only one hart.

The most important feature of machine mode is the ability to intercept and handle
exceptions: unusual runtime events. RISC-V classifies exceptions into two categories.
Synchronous exceptions arise as a result of instruction execution, as when accessing an invalid
memory address or executing an instruction with an invalid opcode. Interrupts are external
events that are asynchronous with the instruction stream, like a mouse button click.
Exceptions in RISC-V are precise: all instructions prior to the exception completely execute,
and none of the subsequent instructions appear to have begun execution. Figure 10.3 lists the
standard exception causes.
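The encoding in Figure 10.3 is easy to take apart in software. The sketch below splits a cause value into its interrupt bit and exception code, using the mcause bit positions given in the figure. The function name decode_mcause is hypothetical, and codes that Figure 10.3 does not list are reported as "not listed" rather than guessed.

```python
# Sketch: decode an mcause value using the table in Figure 10.3.
# decode_mcause is a hypothetical helper name.

INTERRUPTS = {
    1: "Supervisor software interrupt", 3: "Machine software interrupt",
    5: "Supervisor timer interrupt", 7: "Machine timer interrupt",
    9: "Supervisor external interrupt", 11: "Machine external interrupt",
}
EXCEPTIONS = {
    0: "Instruction address misaligned", 1: "Instruction access fault",
    2: "Illegal instruction", 3: "Breakpoint",
    4: "Load address misaligned", 5: "Load access fault",
    6: "Store address misaligned", 7: "Store access fault",
    8: "Environment call from U-mode", 9: "Environment call from S-mode",
    11: "Environment call from M-mode", 12: "Instruction page fault",
    13: "Load page fault", 15: "Store page fault",
}

def decode_mcause(mcause, xlen=32):
    """Return (is_interrupt, code, description) for an XLEN-bit mcause."""
    is_interrupt = bool((mcause >> (xlen - 1)) & 1)   # mcause[XLEN-1]
    code = mcause & ((1 << (xlen - 1)) - 1)           # mcause[XLEN-2:0]
    table = INTERRUPTS if is_interrupt else EXCEPTIONS
    return is_interrupt, code, table.get(code, "not listed")

if __name__ == "__main__":
    print(decode_mcause(0x80000007))
    print(decode_mcause(2))
```

Because the interrupt flag is the most-significant bit, a handler can also distinguish the two categories with a single signed comparison against zero.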
Five kinds of synchronous exceptions can happen during M-mode execution:
• Access fault exceptions arise when a physical memory address doesn’t support the access
type; for example, attempting to store to a ROM.
• Breakpoint exceptions arise from executing an ebreak instruction, or when an address
or datum matches a debug trigger.
• Environment call exceptions arise from executing an ecall instruction.
• Illegal instruction exceptions result from decoding an invalid opcode.
• Misaligned address exceptions occur when the effective address isn’t divisible by the
access size; for example, amoadd.w with an address of 0x12.

Misaligned instruction address exceptions can’t occur with the C extension because it’s never
possible to jump to an odd address: branches and JAL immediates are always even, and JALR
masks off the least-significant bit of its effective address. Without the C extension, this
exception occurs when jumping to an address that is 2 mod 4.

If you recall Chapter 2’s claim that misaligned loads and stores are permitted, you might
be wondering why misaligned load and store address exceptions are listed in Figure 10.3.
There are two reasons. First, the atomic memory operations in Chapter 6 require naturally
aligned addresses. Second, some implementors choose to omit hardware support for
Interrupt / Exception   Exception Code      Description
mcause[XLEN-1]          mcause[XLEN-2:0]
1                       1                   Supervisor software interrupt
1                       3                   Machine software interrupt
1                       5                   Supervisor timer interrupt
1                       7                   Machine timer interrupt
1                       9                   Supervisor external interrupt
1                       11                  Machine external interrupt
0                       0                   Instruction address misaligned
0                       1                   Instruction access fault
0                       2                   Illegal instruction
0                       3                   Breakpoint
0                       4                   Load address misaligned
0                       5                   Load access fault
0                       6                   Store address misaligned
0                       7                   Store access fault
0                       8                   Environment call from U-mode
0                       9                   Environment call from S-mode
0                       11                  Environment call from M-mode
0                       12                  Instruction page fault
0                       13                  Load page fault
0                       15                  Store page fault

Figure 10.3: RISC-V exception and interrupt causes. The most-significant bit of mcause is set to 1 for
interrupts or 0 for synchronous exceptions, and the least-significant bits identify the interrupt or exception.
Supervisor interrupts and page-fault exceptions are only possible when supervisor mode is implemented (see
Section 10.5). (Table 3.6 of [Waterman and Asanović 2017] is the basis of this figure.)

XLEN-1 XLEN-2 23 22 21 20 19 18 17
SD 0 TSR TW TVM MXR SUM MPRV
1 XLEN-24 1 1 1 1 1 1

16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
XS FS MPP 0 SPP MPIE 0 SPIE UPIE MIE 0 SIE UIE
2 2 2 2 1 1 1 1 1 1 1 1 1

Figure 10.4: The mstatus CSR. The only fields present in simple processors with only Machine mode and
without the F and V extensions are the global interrupt enable, MIE, and MPIE, which after an exception
holds the old value of MIE. XLEN is 32 for RV32, or 64 for RV64. Figure 3.7 of [Waterman and Asanović
2017] is the basis of this figure; see Section 3.1 of that document for a description of the other fields.

misaligned regular loads and stores, because it is a difficult feature to implement and is in-
frequently used. Processors without this hardware rely instead upon an exception handler
to trap and emulate misaligned loads and stores in software, using a sequence of smaller,
aligned loads and stores. Application code is none the wiser: misaligned memory accesses
work as expected, albeit slowly, while the hardware remains simple. Alternatively, more
performant processors can implement misaligned loads and stores in hardware. This imple-
mentation flexibility owes to RISC-V’s decision to permit misaligned loads and stores using
the regular load and store opcodes, following Chapter 1’s guideline to isolate architecture
from implementation.
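To make the trap-and-emulate idea concrete, here is a minimal Python sketch of how a handler might assemble a misaligned 32-bit load from two aligned ones. It assumes little-endian byte order and models memory as a bytes object; the function names are hypothetical, and a real handler would also decode the faulting instruction and write the result back to the destination register.

```python
# Sketch: emulate a misaligned 32-bit load using only aligned 32-bit loads.
# Assumes little-endian memory; function names are hypothetical.

def load_word_aligned(mem, addr):
    """The only access the simple hardware supports: a 4-byte-aligned load."""
    assert addr % 4 == 0, "hardware would trap on this address"
    return int.from_bytes(mem[addr:addr + 4], "little")

def emulate_misaligned_lw(mem, addr):
    """Return the 32-bit value at any byte address."""
    base = addr & ~3                 # aligned word containing the first byte
    shift = 8 * (addr - base)        # bits to discard from the low word
    lo = load_word_aligned(mem, base)
    if shift == 0:
        return lo                    # already aligned: one load suffices
    hi = load_word_aligned(mem, base + 4)
    return ((lo >> shift) | (hi << (32 - shift))) & 0xFFFFFFFF

if __name__ == "__main__":
    memory = bytes(range(16))
    print(hex(emulate_misaligned_lw(memory, 1)))
```

The emulated access costs two loads plus shifting and masking, on top of the trap itself, which is why the text describes this path as working "albeit slowly".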
There are three standard sources of interrupts: software, timer, and external. Software in-
terrupts are triggered by storing to a memory-mapped register and are generally used by one
hart to interrupt another hart, a mechanism other architectures refer to as an interprocessor
interrupt. Timer interrupts are raised when the real-time counter mtime meets or exceeds a hart’s time
comparator, a memory-mapped register named mtimecmp. External interrupts are raised
by a platform-level interrupt controller, to which most external devices are attached. As
different hardware platforms have different memory maps and demand divergent features
of their interrupt controllers, the mechanisms for raising and clearing these interrupts differ
from platform to platform. What is constant across all RISC-V systems is how exceptions
are handled and interrupts are masked, the topic of the next section.

10.3 Machine-Mode Exception Handling

Eight control and status registers (CSRs) are integral to machine-mode exception handling:

• mtvec, Machine Trap Vector, holds the address the processor jumps to when an excep-
tion occurs.
• mepc, Machine Exception PC, points to the instruction where the exception occurred.
• mcause, Machine Exception Cause, indicates which exception occurred.
• mie, Machine Interrupt Enable, lists which interrupts the processor can take and which
it must ignore.
• mip, Machine Interrupt Pending, lists the interrupts currently pending.

Encoding Name Abbreviation

00 User U
01 Supervisor S
11 Machine M

Figure 10.5: RISC-V privilege levels and their encoding.

• mtval, Machine Trap Value, holds additional trap information: the faulting address for
address exceptions, the instruction itself for illegal instruction exceptions, and zero for
other exceptions.
• mscratch, Machine Scratch, holds one word of data for temporary storage.
• mstatus, Machine Status, holds the global interrupt enable, along with a plethora of
other state, as Figure 10.4 shows.

When executing in M-mode, interrupts are only taken if the global interrupt-enable bit,
mstatus.MIE, is set. Furthermore, each interrupt has its own enable bit in the mie CSR. The
bit positions in mie correspond to the interrupt codes in Figure 10.3: for example, mie[7]
corresponds to the M-mode timer interrupt. The mip CSR has the same layout and indicates
which interrupts are currently pending. Putting all three CSRs together, a machine timer
interrupt can be taken if mstatus.MIE=1, mie[7]=1, and mip[7]=1.
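The three-way condition above can be written as a one-line predicate. The following sketch treats the CSRs as plain integers; the function and constant names are ours, and the bit position 7 is the machine timer interrupt code from Figure 10.3.

```python
# Sketch: when can an interrupt be taken in M-mode?
# CSRs are modeled as integers; all names here are hypothetical.

M_TIMER = 7  # machine timer interrupt code, from Figure 10.3

def bit(value, n):
    """Extract bit n of an integer."""
    return (value >> n) & 1

def m_interrupt_taken(mstatus_mie, mie, mip, code):
    """True if interrupt `code` is globally enabled, individually enabled, and pending."""
    return bool(mstatus_mie and bit(mie, code) and bit(mip, code))

if __name__ == "__main__":
    mie = 1 << M_TIMER
    mip = 1 << M_TIMER
    print(m_interrupt_taken(1, mie, mip, M_TIMER))
    print(m_interrupt_taken(0, mie, mip, M_TIMER))
```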
When a hart takes an exception, the hardware atomically undergoes several state transitions:

• The PC of the exceptional instruction is preserved in mepc, and the PC is set to mtvec.
(For synchronous exceptions, mepc points to the instruction that caused the exception;
for interrupts, it points where execution should resume after the interrupt is handled.)
• mcause is set to the exception cause, as encoded in Figure 10.3, and mtval is set to
the faulting address or some other exception-specific word of information.
• Interrupts are disabled by setting MIE=0 in the mstatus CSR, and the previous value
of MIE is preserved in MPIE.
• The pre-exception privilege mode is preserved in mstatus’ MPP field, and the privilege
mode is changed to M. Figure 10.5 shows the encoding of the MPP field. (If the
processor only implements M-mode, this step is effectively skipped.)

RISC-V also supports vectored interrupts, wherein the processor jumps to an interrupt-specific
address, rather than a single entry point. This addressing eliminates the need to read and
decode mcause, speeding up interrupt handling. Setting mtvec[0] to 1 enables this feature;
interrupt cause x then sets the PC to (mtvec-1+4x), instead of the usual mtvec.
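The state transitions above can be modeled in a few lines of Python. This is a sketch under simplifying assumptions, not a description of any particular implementation: the Hart class and take_trap function are hypothetical names, only the mstatus fields mentioned in the text are modeled, and the vectored case assumes the enable bit is mtvec[0], with only interrupts vectored.

```python
# Sketch: the M-mode trap-entry state transitions, modeled on plain integers.
# Hart and take_trap are hypothetical names.
from dataclasses import dataclass

U, S, M = 0b00, 0b01, 0b11   # privilege encodings from Figure 10.5

@dataclass
class Hart:
    pc: int = 0
    priv: int = M
    mtvec: int = 0
    mepc: int = 0
    mcause: int = 0
    mtval: int = 0
    mie_bit: int = 0    # mstatus.MIE
    mpie_bit: int = 0   # mstatus.MPIE
    mpp: int = U        # mstatus.MPP

def take_trap(h, code, is_interrupt, tval=0, xlen=32):
    """Apply the four transitions the hardware performs atomically."""
    h.mepc = h.pc                                  # 1. save PC
    h.mcause = ((1 << (xlen - 1)) | code) if is_interrupt else code
    h.mtval = tval                                 # 2. record cause and value
    h.mpie_bit, h.mie_bit = h.mie_bit, 0           # 3. disable interrupts
    h.mpp, h.priv = h.priv, M                      # 4. switch to M-mode
    vectored = h.mtvec & 1                         # assumed: mtvec[0] enables vectoring
    base = h.mtvec - vectored
    h.pc = base + 4 * code if (vectored and is_interrupt) else base
    return h

if __name__ == "__main__":
    hart = Hart(pc=0x1000, priv=U, mtvec=0x80, mie_bit=1)
    print(take_trap(hart, code=2, is_interrupt=False, tval=0xDEAD))
```

Note that nothing here touches the integer registers: preserving those is left to software, which is why the handler prologue matters.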

To avoid overwriting the contents of the integer registers, the prologue of an interrupt
handler usually begins by swapping an integer register (say, a0) with the mscratch CSR.
Usually, the software will have arranged for mscratch to contain a pointer to additional in-
memory scratch space, which the handler uses to save as many integer registers as its body
will use. After the body executes, the epilogue of an interrupt handler restores the registers
it saved to memory, then again swaps a0 with mscratch, restoring both registers to their
pre-exception values. Finally, the handler returns with mret, an instruction unique to M-
mode. mret sets the PC to mepc, restores the previous interrupt-enable setting by copying
the mstatus MPIE field to MIE, and sets the privilege mode to the value in mstatus’ MPP
field, essentially reversing the actions described in the preceding paragraph.
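Two small pieces of this protocol are worth spelling out: the register swap with mscratch and the effect of mret. The sketch below models both on Python dictionaries; the names csrrw_swap and mret_model are hypothetical, and mret_model implements only the three effects described above, leaving out any other mstatus updates the hardware performs.

```python
# Sketch: the mscratch swap and the effects of mret described in the text.
# csrrw_swap and mret_model are hypothetical names.

def csrrw_swap(regs, csrs, reg, csr):
    """Model `csrrw reg, csr, reg`: exchange an integer register with a CSR."""
    regs[reg], csrs[csr] = csrs[csr], regs[reg]

def mret_model(hart):
    """Apply the three effects of mret named in the text."""
    hart["pc"] = hart["mepc"]      # resume where the exception occurred
    hart["MIE"] = hart["MPIE"]     # restore the previous interrupt enable
    hart["priv"] = hart["MPP"]     # return to the pre-exception privilege mode
    return hart

if __name__ == "__main__":
    regs = {"a0": 42}
    csrs = {"mscratch": 0x80000100}        # pointer to scratch space (example value)
    csrrw_swap(regs, csrs, "a0", "mscratch")   # prologue: a0 now holds the pointer
    csrrw_swap(regs, csrs, "a0", "mscratch")   # epilogue: both values restored
    print(regs, csrs)
```

Performing the same swap twice returns both a0 and mscratch to their original values, which is what lets the handler remain invisible to the interrupted code.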

Figure 10.6 shows RISC-V assembly code for a basic timer interrupt handler following
this pattern. It simply increments the time comparator then returns to the previous task,
whereas a more realistic timer interrupt handler might invoke a scheduler to switch between
tasks. It is not preemptible, so it keeps interrupts disabled throughout the handler. Those
caveats aside, it is a complete example of a RISC-V interrupt handler on a single page!
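The only arithmetic in the handler is a 64-bit increment carried out on two 32-bit halves, since the code in Figure 10.6 uses 32-bit loads and stores. The following Python sketch mirrors that addi/sltu/add sequence; the function name bump_mtimecmp is hypothetical.

```python
# Sketch: add a delta to a 64-bit comparator held as two 32-bit halves,
# mirroring the addi / sltu / add sequence in Figure 10.6.
# bump_mtimecmp is a hypothetical name.

MASK32 = 0xFFFFFFFF

def bump_mtimecmp(lo, hi, delta=1000):
    """Return the new (lo, hi) halves after adding delta."""
    new_lo = (lo + delta) & MASK32        # addi: wraps at 32 bits
    carry = 1 if new_lo < lo else 0       # sltu: the sum wrapped iff it got smaller
    new_hi = (hi + carry) & MASK32        # add: propagate the carry
    return new_lo, new_hi

if __name__ == "__main__":
    print(bump_mtimecmp(0xFFFFFF00, 0))
    print(bump_mtimecmp(0, 5))
```

The unsigned comparison detects the carry without any flags register: for a positive delta smaller than 2^32, the low half wraps around exactly when the result is less than the original value.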
Sometimes it’s desirable to take a higher-priority interrupt while processing a lower-
priority exception. Alas, there’s only one copy of the mepc, mcause, mtval, and mstatus
CSRs; taking a second interrupt would destroy the old values in these registers, causing data
loss without some additional help from software. A preemptible interrupt handler can save
these registers to an in-memory stack before enabling interrupts, then, just prior to exiting,
disable interrupts and restore the registers from the stack.
In addition to the mret instruction we introduced above, M-mode provides just one other
instruction: wfi (Wait For Interrupt). wfi informs the processor that there isn’t any useful
work to do, so it should enter a lower-power mode until any enabled interrupt becomes pending,
i.e., (mie & mip) ≠ 0. RISC-V processors implement this instruction in a variety of ways,
including stopping the clock until an interrupt becomes pending; some simply execute it as a
nop. Hence, wfi is typically used inside a loop.

Elaboration: wfi works whether or not interrupts are globally enabled.

If wfi is executed when interrupts are globally enabled (mstatus.MIE=1), and then an en-
abled interrupt becomes pending, the processor jumps to the exception handler. If, on the
other hand, wfi is executed when interrupts are globally disabled, and then an enabled inter-
rupt becomes pending, the processor continues executing the code following the wfi. This
code typically examines the mip CSR to decide what to do next. This strategy can reduce
interrupt latency as compared to jumping to the exception handler, because there’s no need to
save and restore integer registers.

10.4 User Mode and Process Isolation in Embedded Systems

Although Machine mode is sufficient for simple embedded systems, it is only suitable when
the entire codebase is trusted, since M-mode has unfettered access to the hardware platform.
More often, it is not practical to trust all of the application code, because it is not known
in advance or is too vast to prove correct. So, RISC-V provides mechanisms to protect the
system from the untrusted code, and to protect untrusted processes from each other.
Untrusted code must be forbidden from executing privileged instructions, like mret, and
accessing privileged CSRs, like mstatus, as these would allow the program to take control
of the system. This restriction is accomplished easily enough: an additional privilege mode,
User mode (U-mode), denies access to these features, generating an illegal instruction excep-
tion when attempting to use an M-mode instruction or CSR. Otherwise, U-mode and M-mode
behave very similarly. M-mode software can enter U-mode by setting mstatus.MPP to U
(which, as Figure 10.5 shows, is encoded as 0), then executing an mret instruction. If an
exception occurs in U-mode, control is returned to M-mode.
Untrusted code must also be restricted to access only its own memory. Processors that
implement M and U modes have a feature called Physical Memory Protection (PMP), which
allows M-mode to specify which memory addresses U-mode can access. PMP consists of
several address registers (usually eight to sixteen) and corresponding configuration registers,
which grant or deny read, write, and execute permissions. When a processor in U-mode

# save registers
csrrw a0, mscratch, a0 # save a0; set a0 = &temp storage
sw a1, 0(a0) # save a1
sw a2, 4(a0) # save a2
sw a3, 8(a0) # save a3
sw a4, 12(a0) # save a4

# decode interrupt cause

csrr a1, mcause # read exception cause
bgez a1, exception # branch if not an interrupt
andi a1, a1, 0x3f # isolate interrupt cause
li a2, 7 # a2 = timer interrupt cause
bne a1, a2, otherInt # branch if not a timer interrupt

# handle timer interrupt by incrementing time comparator

la a1, mtimecmp # a1 = &time comparator
lw a2, 0(a1) # load lower 32 bits of comparator
lw a3, 4(a1) # load upper 32 bits of comparator
addi a4, a2, 1000 # increment lower bits by 1000 cycles
sltu a2, a4, a2 # generate carry-out
add a3, a3, a2 # increment upper bits
sw a3, 4(a1) # store upper 32 bits
sw a4, 0(a1) # store lower 32 bits

# restore registers and return

lw a4, 12(a0) # restore a4
lw a3, 8(a0) # restore a3
lw a2, 4(a0) # restore a2
lw a1, 0(a0) # restore a1
csrrw a0, mscratch, a0 # restore a0; mscratch = &temp storage
mret # return from handler

Figure 10.6: RISC-V code for a simple timer interrupt handler. The code assumes that interrupts are
globally enabled by setting mstatus.MIE; that timer interrupts have been enabled by setting mie[7]; that
the mtvec CSR has been set to the address of this handler; and that the mscratch CSR has been set to the
address of a buffer that contains 16 bytes of temporary storage to save the registers. The prologue saves five
registers, preserving a0 in mscratch and a1–a4 in memory. It then decodes the exception cause by
examining mcause: interrupt if mcause<0, or synchronous exception if mcause≥0. If it is an interrupt, it
checks that the lower bits of mcause equal 7, indicating an M-mode timer interrupt. If it is a timer interrupt,
it adds 1000 cycles to the time comparator, so that the next timer interrupt will occur about 1000 timer cycles
in the future. Finally, the epilogue restores a0–a4 and mscratch, then returns whence it came using mret.

XLEN-1 0
address[PhysicalAddressSize-1:2]

7 6 5 4 3 2 1 0
L 0 A X W R

Figure 10.7: A PMP address and configuration register. The address register is right-shifted by 2, and if
physical addresses are less than XLEN-2 bits wide, the upper bits are zeros. The R, W, and X fields grant
read, write, and execute permissions. The A field sets the PMP mode, and the L field locks the PMP and
corresponding address registers.

31 24 23 16 15 8 7 0
PMP3 PMP2 PMP1 PMP0 pmpcfg0

PMP7 PMP6 PMP5 PMP4 pmpcfg1

PMP11 PMP10 PMP9 PMP8 pmpcfg2

PMP15 PMP14 PMP13 PMP12 pmpcfg3

RV32

63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0
PMP7 PMP6 PMP5 PMP4 PMP3 PMP2 PMP1 PMP0 pmpcfg0

PMP15 PMP14 PMP13 PMP12 PMP11 PMP10 PMP9 PMP8 pmpcfg2

RV64

Figure 10.8: The layout of PMP configurations in the pmpcfg CSRs. For RV32 (above), the sixteen
configuration registers are packed into four CSRs. For RV64 (below), they are packed into the two
even-numbered CSRs.

attempts to fetch an instruction, or execute a load or store, the address is compared against
all of the PMP address registers. If the address is greater than or equal to PMP address i, but
less than PMP address i+1, then PMP i+1’s configuration register decides whether that access
may proceed; otherwise, it raises an access exception.
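The range check just described can be sketched in Python as follows. The model makes several assumptions that we state plainly: address registers hold the address shifted right by two bits and configuration bits are laid out as in Figure 10.7; an A field of 1 means the entry is enabled; the lower bound of the first entry is zero; and the function name pmp_allows is hypothetical.

```python
# Sketch: a U-mode PMP check using address ranges, as described in the text.
# pmp_allows is a hypothetical name; bit positions follow Figure 10.7.

R, W, X = 0b001, 0b010, 0b100   # permission bits
A_SHIFT, A_MASK = 3, 0b11       # mode field occupies bits 4:3

def pmp_allows(addr, access, pmpaddr, pmpcfg):
    """Return True if a U-mode access of type `access` to `addr` may proceed."""
    lower = 0                                  # assumed lower bound for entry 0
    for top, cfg in zip(pmpaddr, pmpcfg):
        upper = top << 2                       # registers hold address >> 2
        enabled = ((cfg >> A_SHIFT) & A_MASK) == 1
        if enabled and lower <= addr < upper:
            return bool(cfg & access)          # this entry decides
        lower = upper
    return False                               # no match: access exception

if __name__ == "__main__":
    pmpaddr = [0x1000 >> 2, 0x2000 >> 2]
    pmpcfg = [(1 << A_SHIFT) | R | X,          # 0x0000-0x0FFF: read/execute
              (1 << A_SHIFT) | R | W]          # 0x1000-0x1FFF: read/write
    print(pmp_allows(0x0800, X, pmpaddr, pmpcfg))
    print(pmp_allows(0x1800, X, pmpaddr, pmpcfg))
```

In this example the two entries give a U-mode task an executable code region and a separate writable data region, and any access outside both is refused.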
Figure 10.7 shows the layout of a PMP address and configuration register. Both are CSRs,
with the address registers named pmpaddr0 to pmpaddrN, where N+1 is the number of PMPs
implemented. The address registers are shifted right two bits because PMPs have a four-byte
granularity. The configuration registers are densely packed in the CSRs to accelerate context
switching, as Figure 10.8 shows. A PMP’s configuration consists of R, W, and X bits, which
when set permit loads, stores, and fetches, respectively, and a mode field, A, which when 0
disables this PMP or when 1 enables it. The PMP configuration also supports other modes
and can be locked, features described in [Waterman and Asanović 2017].
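Because the configurations are packed several to a CSR, software must compute which CSR and which byte hold a given PMP’s settings. The sketch below derives both from the layout in Figure 10.8 and then decodes the byte per Figure 10.7. The function names are hypothetical, and the CSRs are modeled as a dictionary keyed by the number in the CSR’s name (0 for pmpcfg0, and so on).

```python
# Sketch: locate and decode PMP i's configuration byte (Figures 10.7 and 10.8).
# All function names are hypothetical.

def pmpcfg_location(i, xlen):
    """Return (pmpcfg CSR number, bit offset) for PMP i."""
    per_csr = xlen // 8                          # 4 bytes on RV32, 8 on RV64
    csr = (i // per_csr) * (xlen // 32)          # RV64 uses only even-numbered CSRs
    return csr, 8 * (i % per_csr)

def read_pmp_config(pmpcfg_csrs, i, xlen):
    """Extract PMP i's 8-bit configuration from the packed CSRs."""
    csr, shift = pmpcfg_location(i, xlen)
    return (pmpcfg_csrs[csr] >> shift) & 0xFF

def decode_pmp_config(byte):
    """Split a configuration byte into the fields of Figure 10.7."""
    return {"R": bool(byte & 0x01), "W": bool(byte & 0x02),
            "X": bool(byte & 0x04), "A": (byte >> 3) & 0b11,
            "L": bool(byte & 0x80)}

if __name__ == "__main__":
    print(pmpcfg_location(5, 32), pmpcfg_location(5, 64))
    print(decode_pmp_config(0x0F))
```

With this packing, a context switch that changes every PMP’s permissions needs only four CSR writes on RV32, or two on RV64.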

10.5 Supervisor Mode for Modern Operating Systems

The PMP scheme described in the previous section is attractive for embedded systems because
it provides memory protection at relatively low cost, but it has several drawbacks that
limit its use in general-purpose computing. Since PMP supports only a fixed number of
memory regions, it doesn’t scale to complex applications. And since these regions must be
contiguous in physical memory, the system can suffer from memory fragmentation. Finally,
PMP doesn’t efficiently support paging to secondary storage.

Fragmentation occurs when memory is available, but not in large enough contiguous chunks to
be useful.

More sophisticated RISC-V processors handle these problems the same way as nearly all
general-purpose architectures: using page-based virtual memory. This feature forms the core
of supervisor mode (S-mode), an optional privilege mode designed to support modern
operating systems, such as Linux, FreeBSD, and Windows. S-mode is more privileged
than U-mode, but less privileged than M-mode. Like U-mode, S-mode software can’t use
M-mode CSRs and instructions, and is subject to PMP restrictions. This section covers S-mode
interrupts and exceptions, and the next section details the S-mode virtual-memory system.
By default, all exceptions, regardless of privilege mode, transfer control to the M-mode
exception handler. Most exceptions in a Unix system, though, should invoke the operating
system, which runs in S-mode. The M-mode exception handler could re-route exceptions to
S-mode, but this extra code would slow down the handling of most exceptions. So, RISC-V
provides an exception delegation mechanism, by which interrupts and synchronous exceptions
can be delegated to S-mode selectively, bypassing M-mode software altogether.

Why not unconditionally delegate interrupts to S-mode? One reason is virtualization: if
M-mode wants to virtualize a device for S-mode, its interrupts should go to M-mode, not
S-mode.

The mideleg (Machine Interrupt Delegation) CSR controls which interrupts are delegated
to S-mode. Like mip and mie, each bit in mideleg corresponds to the exception code
of the same number in Figure 10.3. For example, mideleg[5] corresponds to the S-mode
timer interrupt; if set, S-mode timer interrupts will transfer control to the S-mode exception
handler, rather than the M-mode exception handler.
Any interrupt delegated to S-mode can be masked by S-mode software. The sie (Supervisor
Interrupt Enable) and sip (Supervisor Interrupt Pending) CSRs are S-mode CSRs that
are subsets of the mie and mip CSRs. They have the same layout as their M-mode counterparts,
but only the bits corresponding to interrupts that have been delegated in mideleg are
readable and writable through sie and sip. The bits corresponding to interrupts that haven’t
been delegated are always zero.

S-mode doesn’t directly control timer and software interrupts but instead uses the ecall
instruction to request M-mode to set up timers or send interprocessor interrupts on its behalf.
This software convention is part of the Supervisor Binary Interface.

M-mode can also delegate synchronous exceptions to S-mode using the medeleg (Machine
Exception Delegation) CSR. The mechanism is analogous to interrupt delegation, but
the bits in medeleg correspond instead to the synchronous exception codes in Figure 10.3.
For example, setting medeleg[15] will delegate store page faults to S-mode.
Note that exceptions will never transfer control to a less-privileged mode, no matter the
delegation settings. An exception that occurs in M-mode is always handled in M-mode. An
exception that occurs in S-mode might be handled by either M-mode or S-mode, depending
on the delegation configuration, but never U-mode.
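Putting the delegation rules together, the mode that handles a trap depends on the cause, the delegation CSRs, and the mode in which the trap occurred. The following Python sketch captures that decision, along with the masked view of mie that S-mode sees through sie. The function names are hypothetical; the privilege encodings come from Figure 10.5.

```python
# Sketch: which mode handles a trap, given the delegation CSRs.
# trap_target and sie_view are hypothetical names.

U, S, M = 0b00, 0b01, 0b11   # privilege encodings from Figure 10.5

def trap_target(code, is_interrupt, cur_priv, medeleg, mideleg):
    """Return the privilege mode whose handler receives the trap."""
    deleg = mideleg if is_interrupt else medeleg
    delegated = (deleg >> code) & 1
    # Traps never move to a less-privileged mode, so M-mode traps stay in M.
    if delegated and cur_priv != M:
        return S
    return M

def sie_view(mie, mideleg):
    """What S-mode reads through sie: only the delegated bits of mie."""
    return mie & mideleg

if __name__ == "__main__":
    medeleg = 1 << 15        # delegate store page faults
    mideleg = 1 << 5         # delegate supervisor timer interrupts
    print(trap_target(15, False, U, medeleg, mideleg))
    print(trap_target(15, False, M, medeleg, mideleg))
```

Since only U, S, and M exist, the "never less privileged" rule reduces to a single check: a delegated trap goes to S-mode unless it occurred in M-mode.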
S-mode has several exception-handling CSRs, sepc, stvec, scause, sscratch, stval,
and sstatus, which perform the same function as their M-mode counterparts described in
Section 10.3. Figure 10.9 shows the layout of the sstatus register. The supervisor exception
return instruction, sret, behaves the same as mret, but it acts on the S-mode exception-
handling CSRs instead of the M-mode ones.
The act of taking an exception is also very similar to M-mode. If a hart takes an excep-
tion and it is delegated to S-mode, the hardware atomically undergoes several similar state

XLEN-1 XLEN-2 20 19 18 17
SD 0 MXR SUM 0
1 XLEN-21 1 1 1

16 15 14 13 12 9 8 7 6 5 4 3 2 1 0
XS[1:0] FS[1:0] 0 SPP 0 SPIE UPIE 0 SIE UIE
2 2 4 1 2 1 1 2 1 1

Figure 10.9: The sstatus CSR. sstatus is a subset of mstatus (Figure 10.4), hence the similar layout. SIE
and SPIE hold the current and pre-exception interrupt enables, analogous to MIE and MPIE in mstatus.
XLEN is 32 for RV32, or 64 for RV64. Figure 4.2 of [Waterman and Asanović 2017] is the basis of this figure;
see Section 4.1 of that document for a description of the other fields.

transitions, using S-mode CSRs instead of M-mode ones:

• The PC of the exceptional instruction is preserved in sepc, and the PC is set to stvec.
• scause is set to the exception cause, as encoded in Figure 10.3, and stval is set to
the faulting address or some other exception-specific word of information.
• Interrupts are disabled by setting SIE=0 in the sstatus CSR, and the previous value
of SIE is preserved in SPIE.
• The pre-exception privilege mode is preserved in sstatus’ SPP field, and the privilege
mode is changed to S.

10.6 Page-Based Virtual Memory

S-mode provides a conventional virtual memory system that divides memory into fixed-size
pages for the purposes of address translation and memory protection. When paging is en-
abled, most addresses (including load and store effective addresses and the PC) are virtual
addresses that must be translated into physical addresses in order to access physical memory.
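With fixed-size pages, translation begins by splitting a virtual address into page-number bits and an offset within the page. As a sketch, the Python below performs that split for the Sv32 scheme this section goes on to describe, with 4 KiB pages and two 10-bit index levels. The function name split_sv32 is hypothetical, and the field names vpn1 and vpn0 are our labels for the upper and lower index.

```python
# Sketch: split a 32-bit virtual address for a two-level, radix-2^10 page table
# with 4 KiB pages (the Sv32 parameters). split_sv32 is a hypothetical name.

OFFSET_BITS = 12    # 4 KiB pages
INDEX_BITS = 10     # 2^10 entries per page table

def split_sv32(va):
    """Return (vpn1, vpn0, offset) for a 32-bit virtual address."""
    offset = va & ((1 << OFFSET_BITS) - 1)
    vpn0 = (va >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    vpn1 = (va >> (OFFSET_BITS + INDEX_BITS)) & ((1 << INDEX_BITS) - 1)
    return vpn1, vpn0, offset

if __name__ == "__main__":
    print(split_sv32(0x12345678))
    # Sizes implied by the split:
    print(1 << OFFSET_BITS, "bytes per base page")
    print(1 << (OFFSET_BITS + INDEX_BITS), "bytes per megapage")
```

The upper index selects one of the 2^10 megapages, the lower index selects a base page within it, and the offset passes through translation unchanged.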
Virtual addresses are translated to physical addresses by traversing a high-radix tree known
as the page table. A leaf node in the page table indicates whether the virtual address maps to
a physical page, and, if so, which privilege modes and access types have permission to access
the page. Accessing a page that is unmapped or grants insufficient permissions results in a
page fault exception.

4 KiB pages have been popular for five decades starting with the IBM 360 model 67. Atlas, the
first computer with paging, had 3 KiB pages (it had 6-byte words). We find it remarkable that,
after a half-century of exponential growth in computer performance and memory capacity, the
page size remains virtually unchanged.

RISC-V paging schemes are named SvX, where X is the size of a virtual address in bits.
RV32’s paging scheme, Sv32, supports a 4 GiB virtual-address space, which is divided into
2^10 megapages of size 4 MiB. Each megapage is subdivided into 2^10 base pages (the
fundamental unit of paging), each 4 KiB. Hence, Sv32’s page table is a two-level tree of radix 2^10.
Each entry in the page table is four bytes, so a page table is itself 4 KiB. It’s no coincidence
that a page table is exactly the size of a page: this design simplifies operating-system memory
allocation.
Figure 10.10 shows the layout of an Sv32 page-table entry (PTE), which has the following
fields, explained from right to left:

• The V bit indicates whether the rest of this PTE is valid (V=1). If V=0, any virtual-
address translation that traverses this PTE results in a page fault.
110 CHAPTER 10. RV32/64 PRIVILEGED ARCHITECTURE

Bits    31:20   19:10   9:8  7  6  5  4  3  2  1  0
Field   PPN[1]  PPN[0]  RSW  D  A  G  U  X  W  R  V
Width   12      10      2    1  1  1  1  1  1  1  1

Figure 10.10: An RV32 Sv32 page-table entry (PTE).

Bits    63:54     53:28   27:19   18:10   9:8  7  6  5  4  3  2  1  0
Field   Reserved  PPN[2]  PPN[1]  PPN[0]  RSW  D  A  G  U  X  W  R  V
Width   10        26      9       9       2    1  1  1  1  1  1  1  1

Figure 10.11: An RV64 Sv39 page-table entry (PTE).

• The R, W, and X bits indicate whether the page has read, write, and execute permis-
sions, respectively. If all three bits are 0, this PTE is a pointer to the next level of the
page table; otherwise, it’s a leaf of the tree.
• The U bit indicates whether this page is a user page. If U=0, U-mode cannot access
this page, but S-mode can. If U=1, U-mode can access this page, but S-mode cannot.
• The G bit indicates this mapping exists in all virtual-address spaces, information the
hardware can use to improve address-translation performance. It is typically only used
for pages that belong to the operating system.
• The A bit indicates whether the page has been accessed since the last time the A bit
was cleared.
• The D bit indicates whether the page has been dirtied (i.e., written) since the last time
the D bit was cleared.
• The RSW field is reserved for the operating system’s use; the hardware ignores it.
• The PPN field holds a physical page number, which is part of a physical address. If this
PTE is a leaf, the PPN is part of the translated physical address. Otherwise, the PPN
gives the address of the next level of the page table. (Figure 10.10 divides the PPN into
two subfields to simplify the description of the address-translation algorithm.)

[Margin note: The OS relies on the A and D bits to decide which pages to swap to secondary
storage. Periodically clearing the A bits helps the OS approximate which pages have been
least-recently used. The D bit indicates a page is even more expensive to swap out, because it
must be written back to secondary storage.]

RV64 supports multiple paging schemes, but we describe only the most popular one,
Sv39. Sv39 uses the same 4 KiB base page as Sv32. The page-table entries double in size to
eight bytes so they can hold bigger physical addresses. To maintain the invariant that a page
table is exactly the size of a page, the radix of the tree correspondingly falls to 2^9. The tree
is three levels deep. Sv39’s 512 GiB address space is divided into 2^9 gigapages, each 1 GiB.
Each gigapage is subdivided into 2^9 megapages, which in Sv39 are slightly smaller than in
Sv32: 2 MiB. Each megapage is subdivided into 2^9 4 KiB base pages.

[Margin note: The other RV64 paging schemes simply add more levels to the page table. Sv48
is nearly identical to Sv39, but its virtual-address space is 2^9 times bigger and its page table
is one level deeper.]

Figure 10.11 shows the layout of an Sv39 PTE. It’s identical to an Sv32 PTE, except the
PPN field has been widened to 44 bits to support 56-bit physical addresses, or 2^26 GiB of
physical-address space.
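
As a rough illustration of the PTE fields described above, the following Python sketch unpacks an Sv32 or Sv39 PTE. The function name decode_pte, the dictionary keys, and the example value are hypothetical choices made for this sketch; only the bit positions come from Figures 10.10 and 10.11.

```python
# Sketch: unpack the fields of an Sv32 or Sv39 PTE (all names are illustrative).
FLAG_NAMES = "VRWXUGAD"  # bit 0 (V) up to bit 7 (D)

def decode_pte(pte, ppn_bits):
    """ppn_bits is 22 for Sv32 (PPN[1] and PPN[0]) or 44 for Sv39."""
    fields = {name: (pte >> bit) & 1 for bit, name in enumerate(FLAG_NAMES)}
    fields["RSW"] = (pte >> 8) & 0b11
    fields["PPN"] = (pte >> 10) & ((1 << ppn_bits) - 1)
    # R=W=X=0 means the PTE points to the next level of the page table.
    fields["is_leaf"] = bool(fields["R"] or fields["W"] or fields["X"])
    return fields

# A valid, readable, writable user leaf that maps physical page 0x80000.
example = decode_pte((0x80000 << 10) | 0b0001_0111, ppn_bits=22)
```

The sketch does not check the V bit before interpreting the other fields; a real page-table walker would do so first.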

RV32
Bits    31     30:22   21:0
Field   MODE   ASID    PPN
Width   1      9       22

RV64
Bits    63:60  59:44   43:0
Field   MODE   ASID    PPN
Width   4      16      44

Figure 10.12: The satp CSR. Figures 4.11 and 4.12 of [Waterman and Asanović 2017] are the bases for this
figure.

RV32
Value  Name  Description
0      Bare  No translation or protection.
1      Sv32  Page-based 32-bit virtual addressing.

RV64
Value  Name  Description
0      Bare  No translation or protection.
8      Sv39  Page-based 39-bit virtual addressing.
9      Sv48  Page-based 48-bit virtual addressing.

Figure 10.13: The encoding of the MODE field in the satp CSR. Table 4.3 of [Waterman and Asanović 2017]
is the basis for this figure.

Elaboration: Unused address bits

Since Sv39’s virtual addresses are narrower than an RV64 integer register, you might
wonder what becomes of the remaining 25 bits. Sv39 mandates that address bits 63–39
be copies of bit 38. Thus, the valid virtual addresses are 0000_0000_0000_0000 hex to
0000_003f_ffff_ffff hex and ffff_ffc0_0000_0000 hex to ffff_ffff_ffff_ffff hex.
The gap between these two ranges is, of course, 2^25 times bigger than the size of the two
ranges combined, seemingly wasting 99.999997% of the values a 64-bit register can repre-
sent. Why not make better use of those extra 25 bits? The answer is that, as programs grow
to require more than 512 GiB of virtual-address space, architects want to increase the ad-
dress space without breaking backwards compatibility. If we allowed programs to store extra
data in the upper 25 bits, it would be impossible to later reclaim those bits to hold bigger
addresses. Allowing data storage in unused address bits is a grievous error, but one that has
recurred many times in computing history.
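
The rule that bits 63–39 must copy bit 38 can be checked mechanically. The following is a minimal sketch; the function name is_canonical_sv39 is an assumption made for this example, and the argument is taken to be an unsigned 64-bit value.

```python
def is_canonical_sv39(va):
    """True if bits 63-39 of the 64-bit value va all equal bit 38."""
    upper = va >> 38                 # bits 63..38, 26 bits in total
    return upper == 0 or upper == (1 << 26) - 1

# Share of 64-bit values that are not valid Sv39 addresses: 1 - 2^39 / 2^64.
wasted_fraction = 1 - 2.0 ** -25
```

Checking 26 bits at once (bit 38 together with the 25 bits above it) avoids extracting bit 38 separately: the address is valid exactly when those bits are all zeros or all ones.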

An S-mode CSR, satp (Supervisor Address Translation and Protection), controls the
paging system. As Figure 10.12 shows, satp has three fields. The MODE field enables
paging and selects the page-table depth; Figure 10.13 shows its encoding. The ASID (Address
Space Identifier) field is optional and can be used to reduce the cost of context switches.
Finally, the PPN field holds the physical address of the root page table, divided by the 4 KiB
page size. Typically, M-mode software will write zero to satp before entering S-mode for
the first time, disabling paging, then S-mode software will write it again after setting up the
page tables.
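
To make the satp layout concrete, here is a small sketch that packs the three fields according to Figure 10.12. The helper name make_satp and the example root-table address are hypothetical; the MODE values are those of Figure 10.13.

```python
SV32, SV39 = 1, 8  # MODE encodings from Figure 10.13

def make_satp(mode, asid, root_table_addr, xlen):
    """Pack MODE, ASID, and PPN into a satp value (Figure 10.12 layout)."""
    ppn = root_table_addr >> 12          # physical address divided by 4 KiB
    if xlen == 32:
        return (mode << 31) | (asid << 22) | ppn
    return (mode << 60) | (asid << 44) | ppn

# Root page table assumed to live at physical address 0x8000_1000.
satp32 = make_satp(SV32, 0, 0x8000_1000, xlen=32)
satp64 = make_satp(SV39, 1, 0x8000_1000, xlen=64)
```

Writing zero (MODE = Bare) corresponds to the paging-disabled state that M-mode software typically establishes before entering S-mode.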

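Virtual address (VA):
Bits    31:22    21:12    11:0
Field   VPN[1]   VPN[0]   offset

Physical address (PA):
Bits    33:12   11:0
Field   PPN     offset

[Diagram: satp locates the first-level page table, VPN[1] selects a PTE in it, that PTE
locates the second-level page table, VPN[0] selects the leaf PTE there, and the leaf PTE’s
PPN is joined with the page offset to form the PA.]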

Figure 10.14: Diagram of the Sv32 address-translation process.

When paging is enabled in the satp register, S-mode and U-mode virtual addresses are
translated into physical addresses by traversing the page table, starting at the root. Fig-
ure 10.14 depicts this process:
1. satp.PPN gives the base address of the first-level page table, and VA[31:22] gives the
first-level index, so the processor reads the PTE located at address (satp.PPN×4096
+ VA[31:22]×4).
2. That PTE contains the base address of the second-level page table and VA[21:12] gives
the second-level index, so the processor reads the leaf PTE located at (PTE.PPN×4096
+ VA[21:12]×4).
3. The leaf PTE’s PPN field and the page offset (the twelve least-significant bits
of the original virtual address) form the final result: the physical address is
(LeafPTE.PPN×4096 + VA[11:0]).
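
The three steps above can be sketched in a few lines of Python. This is a simplified illustration, not the full algorithm: it skips the valid-bit, permission, and superpage checks, and the dictionary mem is a hypothetical stand-in for physical memory that maps a PTE's address to its 32-bit value.

```python
def sv32_translate(satp_ppn, va, mem):
    vpn1 = (va >> 22) & 0x3FF                     # VA[31:22]
    vpn0 = (va >> 12) & 0x3FF                     # VA[21:12]
    offset = va & 0xFFF                           # VA[11:0]
    pte1 = mem[satp_ppn * 4096 + vpn1 * 4]        # step 1
    pte0 = mem[(pte1 >> 10) * 4096 + vpn0 * 4]    # step 2
    return (pte0 >> 10) * 4096 + offset           # step 3

# Root table in physical page 0x80000, second-level table in page 0x80001,
# and a leaf that maps to physical page 0x12345 (all values are made up).
mem = {
    0x80000 * 4096 + 1 * 4: (0x80001 << 10) | 0x01,   # pointer PTE (V=1)
    0x80001 * 4096 + 2 * 4: (0x12345 << 10) | 0x0F,   # leaf PTE (V, R, W, X)
}
pa = sv32_translate(0x80000, (1 << 22) | (2 << 12) | 0xABC, mem)
```

With VPN[1] = 1, VPN[0] = 2, and offset 0xABC, the result is physical page 0x12345 joined with the same offset.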
The processor then performs the physical memory access. The translation process is
almost the same for Sv39 as for Sv32, but with larger PTEs and one more level of indirec-
tion. Figure 10.19, at the end of this chapter, gives a complete description of the page-table
traversal algorithm, detailing the exceptional conditions and the special case of superpage
translations.
Bits        mip field   mie field   Width
XLEN-1:12   WIRI        WPRI        XLEN-12
11          MEIP        MEIE        1
10          WIRI        WPRI        1
9           SEIP        SEIE        1
8           UEIP        UEIE        1
7           MTIP        MTIE        1
6           WIRI        WPRI        1
5           STIP        STIE        1
4           UTIP        UTIE        1
3           MSIP        MSIE        1
2           WIRI        WPRI        1
1           SSIP        SSIE        1
0           USIP        USIE        1

Figure 10.15: Machine interrupt registers. They are XLEN-bit read/write registers that hold pending
interrupts (mip) and the interrupt enable bits (mie) CSRs. Only the bits corresponding to lower-privilege
software interrupts (USIP, SSIP), timer interrupts (UTIP, STIP), and external interrupts (UEIP, SEIP) in
mip are writable through this CSR address; the remaining bits are read-only.

Bits        sip field   sie field   Width
XLEN-1:10   WIRI        WPRI        XLEN-10
9           SEIP        SEIE        1
8           UEIP        UEIE        1
7:6         WIRI        WPRI        2
5           STIP        STIE        1
4           UTIP        UTIE        1
3:2         WIRI        WPRI        2
1           SSIP        SSIE        1
0           USIP        USIE        1

Figure 10.16: Supervisor interrupt registers. They are XLEN-bit read/write registers that hold pending
interrupts (sip) and the interrupt enable bits (sie) CSRs.

That’s almost all there is to the RISC-V paging system, save for one wrinkle. If all
instruction fetches, loads, and stores resulted in several page-table accesses, then paging
would reduce performance substantially! All modern processors reduce this overhead with
an address-translation cache (often called a TLB, for Translation Lookaside Buffer). To
reduce the cost of this cache, most processors don’t automatically keep it coherent with the
page table; if the operating system modifies the page table, the cache becomes stale. S-mode
adds one more instruction to solve this problem: sfence.vma informs the processor
that software may have modified the page tables, so the processor can flush the translation
caches accordingly. It takes two optional arguments, which narrow the scope of the cache
flush: rs1 indicates which virtual address’ translation has been modified in the page table,
and rs2 gives the address-space identifier of the process whose page table has been modified.
If x0 is given for both, the entire translation cache is flushed.
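
The scoping behavior of sfence.vma can be pictured with a toy software model. The class below is purely illustrative: TranslationCache, fill, and the (asid, vpn) keying are assumptions of this sketch, and actual hardware may organize its cache differently and may flush more entries than requested.

```python
class TranslationCache:
    """Toy model of an address-translation cache keyed by (asid, vpn)."""

    def __init__(self):
        self.entries = {}

    def fill(self, asid, vpn, ppn):
        self.entries[(asid, vpn)] = ppn

    def sfence_vma(self, vaddr=None, asid=None):
        """None plays the role of x0: no restriction on that dimension."""
        vpn = None if vaddr is None else vaddr >> 12
        # Keep only the entries that fall outside the requested scope.
        self.entries = {
            (a, v): p for (a, v), p in self.entries.items()
            if (vpn is not None and v != vpn) or (asid is not None and a != asid)
        }

tlb = TranslationCache()
tlb.fill(asid=1, vpn=0x10, ppn=0xAAA)
tlb.fill(asid=1, vpn=0x11, ppn=0xBBB)
tlb.fill(asid=2, vpn=0x10, ppn=0xCCC)
tlb.sfence_vma(vaddr=0x10000, asid=1)    # drops only the (1, 0x10) entry
remaining = sorted(tlb.entries)
```

The model ignores global mappings (the G bit), which hardware treats specially because they belong to every address space.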

Elaboration: Address-translation cache coherence in multiprocessors

sfence.vma only affects the address-translation hardware for the hart that executed the in-
struction. When a hart changes a page table that another hart is using, the first hart must use
an interprocessor interrupt to inform the second hart that it should execute an sfence.vma
instruction. This procedure is often referred to as a TLB shootdown.

10.7 Concluding Remarks

Study after study shows that the very best designers produce structures that are faster,
smaller, simpler, clearer, and produced with less effort. The differences between the great
and the average approach an order of magnitude.
—Fred Brooks, Jr., 1986.
Brooks is a Turing Award laureate and an architect of the IBM System/360 family of
computers, which demonstrated the importance of differentiating architecture from im-
plementation. Descendants of that 1964 architecture are still selling today.

The modularity of the RISC-V privileged architectures caters to the needs of a variety of
systems. The minimalist Machine mode supports bare-metal embedded applications at low
cost. The additional User mode and Physical Memory Protection together enable multitask-
ing in more sophisticated embedded systems. Finally, Supervisor mode and page-based
virtual memory provide the flexibility needed to host modern operating systems.

Bits    XLEN-1:2         1:0
Field   BASE[XLEN-1:2]   MODE
Width   XLEN-2           2

Figure 10.17: Machine and supervisor trap-vector base-address register (mtvec and stvec) CSRs. They are
XLEN-bit read/write registers that hold trap vector configuration, consisting of a vector base address
(BASE) and a vector mode (MODE). The value in the BASE field must always be aligned on a 4-byte
boundary. MODE = 0 means all exceptions set the PC to BASE. MODE = 1 sets the PC to
(BASE + (4 × cause)) on asynchronous interrupts.
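
The two vector modes in the caption can be expressed as a short function. This is a sketch under assumptions: trap_target is a hypothetical name, and MODE values other than 0 and 1 are simply treated like MODE = 0 here.

```python
def trap_target(tvec, cause, is_interrupt):
    """Return the PC a trap jumps to, given an mtvec or stvec value."""
    base = tvec & ~0b11              # BASE with the two MODE bits cleared
    mode = tvec & 0b11
    if mode == 1 and is_interrupt:
        return base + 4 * cause      # vectored: one 4-byte slot per cause
    return base                      # direct mode, or a synchronous exception
```

For example, with a vector base of 0x8000_0100 and MODE = 1, an interrupt with cause 7 would land at 0x8000_011C, while a synchronous exception would still go to 0x8000_0100.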

Bits    XLEN-1      XLEN-2:0
Field   Interrupt   Exception Code
Width   1           XLEN-1

Figure 10.18: Machine and supervisor cause (mcause and scause) CSRs. When a trap is taken, the CSR is
written with a code indicating the event that caused the trap. The Interrupt bit is set if the trap was caused
by an interrupt. The Exception Code field contains a code identifying the last exception. Tables 3.6 and 4.2 of
[Waterman and Asanović 2017] map the code values to the reason for the traps.
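
A trap handler typically begins by splitting the cause register into its two fields. A minimal sketch, with decode_cause as a hypothetical helper name:

```python
def decode_cause(cause, xlen):
    """Split an mcause or scause value into (is_interrupt, exception_code)."""
    is_interrupt = bool(cause >> (xlen - 1))       # top bit
    code = cause & ((1 << (xlen - 1)) - 1)         # remaining XLEN-1 bits
    return is_interrupt, code
```

Because the Interrupt bit is the most-significant bit, assembly-language handlers can also test it with a single signed comparison against zero.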

10.8 To Learn More

A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual Volume
II: Privileged Architecture Version 1.10. May 2017. URL
https://riscv.org/specifications/privileged-isa/.

Notes
1 http://parlab.eecs.berkeley.edu

1. Let a be satp.ppn × PAGESIZE, and let i = LEVELS − 1.

2. Let pte be the value of the PTE at address a + va.vpn[i] × PTESIZE.
3. If pte.v = 0, or if pte.r = 0 and pte.w = 1, stop and raise a page-fault exception.
4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to step 5. Otherwise, this PTE is a
pointer to the next level of the page table. Let i = i − 1. If i < 0, stop and raise a page-fault
exception. Otherwise, let a = pte.ppn × PAGESIZE and go to step 2.
5. A leaf PTE has been found. Determine if the requested memory access is allowed by the pte.r,
pte.w, pte.x, and pte.u bits, given the current privilege mode and the value of the SUM and
MXR fields of the mstatus register. If not, stop and raise a page-fault exception.
6. If i > 0 and pte.ppn[i − 1 : 0] ≠ 0, this is a misaligned superpage; stop and raise a page-fault
exception.
7. If pte.a = 0, or if the memory access is a store and pte.d = 0, then either:
• Raise a page-fault exception, or:
• Set pte.a to 1 and, if the memory access is a store, also set pte.d to 1.
8. The translation is successful. The translated physical address is given as follows:
• pa.pgoff = va.pgoff.
• If i > 0, then this is a superpage translation and pa.ppn[i − 1 : 0] = va.vpn[i − 1 : 0].
• pa.ppn[LEVELS − 1 : i] = pte.ppn[LEVELS − 1 : i].

Figure 10.19: The complete algorithm for virtual-to-physical address translation. va is the virtual address
input and pa is the physical address output. The PAGESIZE constant is 2^12. For Sv32, LEVELS=2 and
PTESIZE=4, whereas for Sv39, LEVELS=3 and PTESIZE=8. Section 4.3.2 of [Waterman and Asanović
2017] is the basis for this figure.
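
The algorithm of Figure 10.19 can be sketched in Python as follows. Several simplifications are assumptions of this sketch rather than part of the architecture: read_pte is a hypothetical callback that returns the PTE stored at a physical address, the SUM and MXR fields are ignored (so S-mode can never touch U=1 pages), step 7 always takes the "raise a page fault" option, and reserved PTE bits are assumed to be zero.

```python
class PageFault(Exception):
    pass

PAGESIZE = 4096
SCHEMES = {"Sv32": (2, 4, 10), "Sv39": (3, 8, 9)}  # LEVELS, PTESIZE, bits per VPN

def translate(scheme, satp_ppn, va, read_pte, access="load", user=True):
    levels, ptesize, vpn_bits = SCHEMES[scheme]
    mask = (1 << vpn_bits) - 1
    vpn = [(va >> (12 + vpn_bits * k)) & mask for k in range(levels)]
    a, i = satp_ppn * PAGESIZE, levels - 1                   # step 1
    while True:
        pte = read_pte(a + vpn[i] * ptesize)                 # step 2
        v, r, w, x, u = [(pte >> b) & 1 for b in range(5)]
        if not v or (not r and w):                           # step 3
            raise PageFault(hex(va))
        if r or x:                                           # step 4: leaf found
            break
        i -= 1
        if i < 0:
            raise PageFault(hex(va))
        a = (pte >> 10) * PAGESIZE
    allowed = {"load": r, "store": w, "fetch": x}[access]    # step 5 (simplified)
    if not allowed or bool(u) != user:
        raise PageFault(hex(va))
    ppn = pte >> 10
    low = (1 << (vpn_bits * i)) - 1                          # covers ppn[i-1:0]
    if i > 0 and ppn & low:                                  # step 6
        raise PageFault(hex(va))
    if not (pte >> 6) & 1 or (access == "store" and not (pte >> 7) & 1):
        raise PageFault(hex(va))                             # step 7
    return ((ppn | ((va >> 12) & low)) << 12) | (va & 0xFFF)  # step 8

# Made-up Sv39 tables: the root points to a second table holding a 2 MiB megapage.
mem = {
    0x80000 * PAGESIZE + 3 * 8: (0x80001 << 10) | 0x01,   # pointer (V)
    0x80001 * PAGESIZE + 5 * 8: (0x40200 << 10) | 0xDF,   # leaf (V,R,W,X,U,A,D)
}
va = (3 << 30) | (5 << 21) | (7 << 12) | 0x123
pa = translate("Sv39", 0x80000, va, lambda addr: mem.get(addr, 0))
```

Because the leaf is found one level above the bottom of the tree (i = 1), VPN[0] of the virtual address passes straight through into the physical page number, which is what makes this a superpage translation.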
11 Future RISC-V Optional Extensions

Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it.
—Alan Perlis, 1982

[Margin note: Alan Perlis (1922–1990) was the first recipient of the Turing Award (1966),
conferred for his influence on advanced programming languages and compilers. In 1958 he
helped design ALGOL, which has influenced virtually every imperative programming language
including C and Java.]

The RISC-V Foundation will develop at least eight optional extensions.

11.1 “B” Standard Extension for Bit Manipulation

The B extension offers bit manipulation, including insert, extract, and test bit fields; rotations;
funnel shifts; bit and byte permutations; count leading and trailing zeros; and count bits set.

11.2 “E” Standard Extension for Embedded

To reduce the cost of low-end cores, the E extension (RV32E) has 16 fewer registers than
RV32I. RV32E is why the saved and temporary registers are split between the registers 0-15
and 16-31 (Figure 3.2).

11.3 “H” Privileged Architecture Extension for Hypervisor Support

The H extension to the privileged architecture adds a new hypervisor mode and a second level
of page-based address translation to improve the efficiency of running multiple operating
systems on the same machine.

11.4 “J” Standard Extension for Dynamically Translated Languages

Many popular languages are usually implemented via dynamic translation, including Java
and JavaScript. These languages can benefit from additional ISA support for dynamic checks
and garbage collection. (J stands for Just-In-Time compiler.)

11.5 “L” Standard Extension for Decimal Floating-Point

The L extension is intended to support decimal floating-point arithmetic as defined in the
IEEE 754-2008 standard. The problem with binary numbers is that they cannot represent
some common decimal fractions, such as 0.1. The motivation for RV32L is that the compu-
tation radix can be identical to the radix of the input and output.

11.6 “N” Standard Extension for User-Level Interrupts

The N extension allows interrupts and exceptions that occur in user-level programs to transfer
control directly to a user-level trap handler without invoking the outer execution environment.
User-level interrupts are mainly intended to support secure embedded systems with only M-
mode and U-mode present (Chapter 10). However, they can also support user-level trap
handling in systems running Unix-like operating systems. When used in a Unix environment,
conventional signal handling would likely remain, but user-level interrupts could be used as a
building block for future extensions that generate user-level events such as garbage collection
barriers, integer overflow, and floating-point traps.

11.7 “P” Standard Extension for Packed-SIMD Instructions

The P extension subdivides the existing architectural registers to provide data-parallel compu-
tation on smaller data types. Packed-SIMD designs represent a reasonable design point when
reusing existing wide datapath resources. However, if significant additional resources are to
be devoted to data-parallel execution, Chapter 8 shows that designs for vector architectures
are a better choice, and architects should use the RVV extension.

11.8 “Q” Standard Extension for Quad-Precision Floating-Point

The Q extension adds 128-bit quad-precision binary floating-point instructions compliant
with the IEEE 754-2008 arithmetic standard. The floating-point registers are now extended
to hold either a single, double, or quad-precision floating-point value. The quad-precision
binary floating-point extension requires RV64IFD.

11.9 Concluding Remarks

Simplify, simplify.
—Henry David Thoreau, an eminent writer of the 19th century, 1854

Having an open, standards-like committee approach to expanding RISC-V hopefully will
mean that the feedback and debate will occur before the instructions are finalized rather than
afterwards, when it’s too late to change. In the ideal case, a few members will implement the
proposal before it is ratified, which FPGAs make much easier to do. Proposing instruction
extensions via the RISC-V Foundation committees will also be a fair amount of work, which
will keep the rate of change slow, unlike what happened to x86-32 (see Figure 1.2 on page 3
in Chapter 1). Don’t forget that everything in this chapter will be optional, no matter how many
extensions are adopted.
Our hope is that RISC-V can evolve with technological demands while maintaining its
reputation as a simple, efficient ISA. If it succeeds, RISC-V will be a significant break
from the incremental ISAs of the past.
A RISC-V Instruction Listings

Simplicity is the keynote of all true elegance.
—Coco Chanel, 1923

[Margin note: Coco Chanel (1883-1971): Founder of the Chanel fashion brand, her pursuit of
expensive simplicity shaped 20th-century fashion.]

This appendix lists all the instructions for RV32/64I, all the extensions covered in this book
(RVM, RVA, RVF, RVD, RVC, and RVV), and all the pseudoinstructions. Each entry has
the instruction name, operands, a register-transfer level definition, instruction format type,
English description, compressed versions (if any), and a figure showing the actual layout
with opcodes. We think you have everything you need to understand all the instructions
in these compact summaries. However, if you want even more detail, refer to the official
RISC-V specifications [Waterman and Asanović 2017].
To help readers find the desired instruction in this appendix, the header of the left
(even) page contains the first instruction from the top of that page and the header on the
right (odd) page contains the last instruction from the bottom of that page. The format
is similar to the headers of dictionaries, which helps you search for the page that your
word is on. For example, the header of the next even page shows AMOADD.W, the first
instruction on the page, and the header of the following odd page shows AMOMINU.D, the
last instruction on that page. These are the two pages where you would find any of these
10 instructions: amoadd.w, amoand.d, amoand.w, amomax.d, amomax.w, amomaxu.d,
amomaxu.w, amomin.d, amomin.w, and amominu.d.
RISC-V INSTRUCTIONS: AMOADD.D 119

add rd, rs1, rs2 x[rd] = x[rs1] + x[rs2]

Add. R-type, RV32I and RV64I.
Adds register x[rs2] to register x[rs1] and writes the result to x[rd]. Arithmetic overflow is
ignored.
Compressed forms: c.add rd, rs2; c.mv rd, rs2
31:25    24:20  19:15  14:12  11:7  6:0
0000000  rs2    rs1    000    rd    0110011

addi rd, rs1, immediate x[rd] = x[rs1] + sext(immediate)

Add Immediate. I-type, RV32I and RV64I.
Adds the sign-extended immediate to register x[rs1] and writes the result to x[rd]. Arithmetic
overflow is ignored.
Compressed forms: c.li rd, imm; c.addi rd, imm; c.addi16sp imm; c.addi4spn rd, imm
31:20            19:15  14:12  11:7  6:0
immediate[11:0]  rs1    000    rd    0010011

addiw rd, rs1, immediate x[rd] = sext((x[rs1] + sext(immediate))[31:0])

Add Word Immediate. I-type, RV64I only.
Adds the sign-extended immediate to x[rs1], truncates the result to 32 bits, and writes the
sign-extended result to x[rd]. Arithmetic overflow is ignored.
Compressed form: c.addiw rd, imm
31:20            19:15  14:12  11:7  6:0
immediate[11:0]  rs1    000    rd    0011011

addw rd, rs1, rs2 x[rd] = sext((x[rs1] + x[rs2])[31:0])

Add Word. R-type, RV64I only.
Adds register x[rs2] to register x[rs1], truncates the result to 32 bits, and writes the sign-
extended result to x[rd]. Arithmetic overflow is ignored.
Compressed form: c.addw rd, rs2
31:25    24:20  19:15  14:12  11:7  6:0
0000000  rs2    rs1    000    rd    0111011
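
The difference between the full-width and word-sized additions above is easiest to see in executable form. The sketch below models registers as non-negative Python integers; the helper names sext and to_reg are assumptions of this example, and addiw and addw are modeled for RV64 only.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def to_reg(value, xlen=64):
    """Wrap a Python int into an XLEN-bit register image."""
    return value & ((1 << xlen) - 1)

def add(x_rs1, x_rs2, xlen=64):
    return to_reg(x_rs1 + x_rs2, xlen)                 # overflow is ignored

def addiw(x_rs1, imm12):
    return to_reg(sext(x_rs1 + sext(imm12, 12), 32))   # truncate, then sign-extend

def addw(x_rs1, x_rs2):
    return to_reg(sext(x_rs1 + x_rs2, 32))
```

For instance, addw applied to 0x7FFF_FFFF and 1 overflows the 32-bit result into the sign bit, so the 64-bit register receives 0xFFFF_FFFF_8000_0000.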

amoadd.d rd, rs2, (rs1) x[rd] = AMO64(M[x[rs1]] + x[rs2])

Atomic Memory Operation: Add Doubleword. R-type, RV64A only.
Atomically, let t be the value of the memory doubleword at address x[rs1], then set that
memory doubleword to t + x[rs2]. Set x[rd] to t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
00000  aq  rl  rs2    rs1    011    rd    0101111

amoadd.w rd, rs2, (rs1) x[rd] = AMO32(M[x[rs1]] + x[rs2])

Atomic Memory Operation: Add Word. R-type, RV32A and RV64A.
Atomically, let t be the value of the memory word at address x[rs1], then set that memory
word to t + x[rs2]. Set x[rd] to the sign extension of t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
00000  aq  rl  rs2    rs1    010    rd    0101111

amoand.d rd, rs2, (rs1) x[rd] = AMO64(M[x[rs1]] & x[rs2])

Atomic Memory Operation: AND Doubleword. R-type, RV64A only.
Atomically, let t be the value of the memory doubleword at address x[rs1], then set that
memory doubleword to the bitwise AND of t and x[rs2]. Set x[rd] to t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
01100  aq  rl  rs2    rs1    011    rd    0101111

amoand.w rd, rs2, (rs1) x[rd] = AMO32(M[x[rs1]] & x[rs2])

Atomic Memory Operation: AND Word. R-type, RV32A and RV64A.
Atomically, let t be the value of the memory word at address x[rs1], then set that memory
word to the bitwise AND of t and x[rs2]. Set x[rd] to the sign extension of t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
01100  aq  rl  rs2    rs1    010    rd    0101111

amomax.d rd, rs2, (rs1) x[rd] = AMO64(M[x[rs1]] MAX x[rs2])

Atomic Memory Operation: Maximum Doubleword. R-type, RV64A only.
Atomically, let t be the value of the memory doubleword at address x[rs1], then set that
memory doubleword to the larger of t and x[rs2], using a two’s complement comparison. Set
x[rd] to t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
10100  aq  rl  rs2    rs1    011    rd    0101111

amomax.w rd, rs2, (rs1) x[rd] = AMO32(M[x[rs1]] MAX x[rs2])

Atomic Memory Operation: Maximum Word. R-type, RV32A and RV64A.
Atomically, let t be the value of the memory word at address x[rs1], then set that memory
word to the larger of t and x[rs2], using a two’s complement comparison. Set x[rd] to the
sign extension of t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
10100  aq  rl  rs2    rs1    010    rd    0101111

amomaxu.d rd, rs2, (rs1) x[rd] = AMO64(M[x[rs1]] MAXU x[rs2])

Atomic Memory Operation: Maximum Doubleword, Unsigned. R-type, RV64A only.
Atomically, let t be the value of the memory doubleword at address x[rs1], then set that
memory doubleword to the larger of t and x[rs2], using an unsigned comparison. Set x[rd] to
t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
11100  aq  rl  rs2    rs1    011    rd    0101111

amomaxu.w rd, rs2, (rs1) x[rd] = AMO32(M[x[rs1]] MAXU x[rs2])

Atomic Memory Operation: Maximum Word, Unsigned. R-type, RV32A and RV64A.
Atomically, let t be the value of the memory word at address x[rs1], then set that memory
word to the larger of t and x[rs2], using an unsigned comparison. Set x[rd] to the sign
extension of t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
11100  aq  rl  rs2    rs1    010    rd    0101111

amomin.d rd, rs2, (rs1) x[rd] = AMO64(M[x[rs1]] MIN x[rs2])

Atomic Memory Operation: Minimum Doubleword. R-type, RV64A only.
Atomically, let t be the value of the memory doubleword at address x[rs1], then set that
memory doubleword to the smaller of t and x[rs2], using a two’s complement comparison.
Set x[rd] to t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
10000  aq  rl  rs2    rs1    011    rd    0101111

amomin.w rd, rs2, (rs1) x[rd] = AMO32(M[x[rs1]] MIN x[rs2])

Atomic Memory Operation: Minimum Word. R-type, RV32A and RV64A.
Atomically, let t be the value of the memory word at address x[rs1], then set that memory
word to the smaller of t and x[rs2], using a two’s complement comparison. Set x[rd] to the
sign extension of t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
10000  aq  rl  rs2    rs1    010    rd    0101111

amominu.d rd, rs2, (rs1) x[rd] = AMO64(M[x[rs1]] MINU x[rs2])

Atomic Memory Operation: Minimum Doubleword, Unsigned. R-type, RV64A only.
Atomically, let t be the value of the memory doubleword at address x[rs1], then set that
memory doubleword to the smaller of t and x[rs2], using an unsigned comparison. Set x[rd]
to t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
11000  aq  rl  rs2    rs1    011    rd    0101111

amominu.w rd, rs2, (rs1) x[rd] = AMO32(M[x[rs1]] MINU x[rs2])

Atomic Memory Operation: Minimum Word, Unsigned. R-type, RV32A and RV64A.
Atomically, let t be the value of the memory word at address x[rs1], then set that memory
word to the smaller of t and x[rs2], using an unsigned comparison. Set x[rd] to the sign
extension of t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
11000  aq  rl  rs2    rs1    010    rd    0101111

amoor.d rd, rs2, (rs1) x[rd] = AMO64(M[x[rs1]] | x[rs2])

Atomic Memory Operation: OR Doubleword. R-type, RV64A only.
Atomically, let t be the value of the memory doubleword at address x[rs1], then set that
memory doubleword to the bitwise OR of t and x[rs2]. Set x[rd] to t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
01000  aq  rl  rs2    rs1    011    rd    0101111

amoor.w rd, rs2, (rs1) x[rd] = AMO32(M[x[rs1]] | x[rs2])

Atomic Memory Operation: OR Word. R-type, RV32A and RV64A.
Atomically, let t be the value of the memory word at address x[rs1], then set that memory
word to the bitwise OR of t and x[rs2]. Set x[rd] to the sign extension of t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
01000  aq  rl  rs2    rs1    010    rd    0101111

amoswap.d rd, rs2, (rs1) x[rd] = AMO64(M[x[rs1]] SWAP x[rs2])

Atomic Memory Operation: Swap Doubleword. R-type, RV64A only.
Atomically, let t be the value of the memory doubleword at address x[rs1], then set that
memory doubleword to x[rs2]. Set x[rd] to t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
00001  aq  rl  rs2    rs1    011    rd    0101111

amoswap.w rd, rs2, (rs1) x[rd] = AMO32(M[x[rs1]] SWAP x[rs2])

Atomic Memory Operation: Swap Word. R-type, RV32A and RV64A.
Atomically, let t be the value of the memory word at address x[rs1], then set that memory
word to x[rs2]. Set x[rd] to the sign extension of t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
00001  aq  rl  rs2    rs1    010    rd    0101111

amoxor.d rd, rs2, (rs1) x[rd] = AMO64(M[x[rs1]] ^ x[rs2])

Atomic Memory Operation: XOR Doubleword. R-type, RV64A only.
Atomically, let t be the value of the memory doubleword at address x[rs1], then set that
memory doubleword to the bitwise XOR of t and x[rs2]. Set x[rd] to t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
00100  aq  rl  rs2    rs1    011    rd    0101111

amoxor.w rd, rs2, (rs1) x[rd] = AMO32(M[x[rs1]] ^ x[rs2])

Atomic Memory Operation: XOR Word. R-type, RV32A and RV64A.
Atomically, let t be the value of the memory word at address x[rs1], then set that memory
word to the bitwise XOR of t and x[rs2]. Set x[rd] to the sign extension of t.
31:27  26  25  24:20  19:15  14:12  11:7  6:0
00100  aq  rl  rs2    rs1    010    rd    0101111
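
The word-sized AMO entries above share one pattern: read the old value t, store the combined value, and return the sign extension of t. The following single-threaded sketch models that data flow on RV64 for a few of the operations. It does not model atomicity or the aq and rl ordering bits, the minimum and maximum variants are omitted, and amo_w, AMO_OPS, and the dictionary-based memory are names invented for this example.

```python
import operator

AMO_OPS = {
    "add": operator.add, "and": operator.and_, "or": operator.or_,
    "xor": operator.xor, "swap": lambda t, src: src,
}

def amo_w(op, memory, addr, x_rs2):
    """Model amo<op>.w on RV64: updates memory[addr], returns the new x[rd]."""
    t = memory[addr]                                    # old 32-bit word
    memory[addr] = AMO_OPS[op](t, x_rs2) & 0xFFFF_FFFF
    signed = t - (1 << 32) if t >> 31 else t            # sign extension of t
    return signed & 0xFFFF_FFFF_FFFF_FFFF

memory = {0x100: 0xFFFF_FFFF}
old = amo_w("add", memory, 0x100, 1)    # memory word wraps around to 0
```

The returned value is always the old memory contents, which is what lets software build primitives such as fetch-and-add or test-and-set from these instructions.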

and rd, rs1, rs2 x[rd] = x[rs1] & x[rs2]

AND. R-type, RV32I and RV64I.
Computes the bitwise AND of registers x[rs1] and x[rs2] and writes the result to x[rd].
Compressed form: c.and rd, rs2
31:25    24:20  19:15  14:12  11:7  6:0
0000000  rs2    rs1    111    rd    0110011

andi rd, rs1, immediate x[rd] = x[rs1] & sext(immediate)

AND Immediate. I-type, RV32I and RV64I.
Computes the bitwise AND of the sign-extended immediate and register x[rs1] and writes the
result to x[rd].
Compressed form: c.andi rd, imm
31:20            19:15  14:12  11:7  6:0
immediate[11:0]  rs1    111    rd    0010011

auipc rd, immediate x[rd] = pc + sext(immediate[31:12] << 12)

Add Upper Immediate to PC. U-type, RV32I and RV64I.
Adds the sign-extended 20-bit immediate, left-shifted by 12 bits, to the pc, and writes the
result to x[rd].
31:12             11:7  6:0
immediate[31:12]  rd    0010111
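
A brief sketch of the auipc computation, assuming registers are modeled as non-negative Python integers; the function and parameter names are illustrative only.

```python
def auipc(pc, imm20, xlen=64):
    """x[rd] = pc + sext(immediate[31:12] << 12), wrapped to XLEN bits."""
    upper = (imm20 & 0xFFFFF) << 12
    if upper >> 31:                  # sign-extend the 32-bit value
        upper -= 1 << 32
    return (pc + upper) & ((1 << xlen) - 1)
```

With a 20-bit immediate of 0xFFFFF, the shifted value is negative, so the result is the pc minus 0x1000.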

beq rs1, rs2, offset if (rs1 == rs2) pc += sext(offset)

Branch if Equal. B-type, RV32I and RV64I.
If register x[rs1] equals register x[rs2], set the pc to the current pc plus the sign-extended
offset.
Compressed form: c.beqz rs1, offset
31:25            24:20  19:15  14:12  11:7            6:0
offset[12|10:5]  rs2    rs1    000    offset[4:1|11]  1100011

beqz rs1, offset if (rs1 == 0) pc += sext(offset)

Branch if Equal to Zero. Pseudoinstruction, RV32I and RV64I.
Expands to beq rs1, x0, offset.

bge rs1, rs2, offset if (rs1 ≥s rs2) pc += sext(offset)

Branch if Greater Than or Equal. B-type, RV32I and RV64I.
If register x[rs1] is at least x[rs2], treating the values as two’s complement numbers, set the
pc to the current pc plus the sign-extended offset.
31:25            24:20  19:15  14:12  11:7            6:0
offset[12|10:5]  rs2    rs1    101    offset[4:1|11]  1100011

bgeu rs1, rs2, offset if (rs1 ≥u rs2) pc += sext(offset)

Branch if Greater Than or Equal, Unsigned. B-type, RV32I and RV64I.
If register x[rs1] is at least x[rs2], treating the values as unsigned numbers, set the pc to the
current pc plus the sign-extended offset.
31:25            24:20  19:15  14:12  11:7            6:0
offset[12|10:5]  rs2    rs1    111    offset[4:1|11]  1100011

bgez rs1, offset if (rs1 ≥s 0) pc += sext(offset)

Branch if Greater Than or Equal to Zero. Pseudoinstruction, RV32I and RV64I.
Expands to bge rs1, x0, offset.

bgt rs1, rs2, offset if (rs1 >s rs2) pc += sext(offset)

Branch if Greater Than. Pseudoinstruction, RV32I and RV64I.
Expands to blt rs2, rs1, offset.

bgtu rs1, rs2, offset if (rs1 >u rs2) pc += sext(offset)

Branch if Greater Than, Unsigned. Pseudoinstruction, RV32I and RV64I.
Expands to bltu rs2, rs1, offset.

bgtz rs2, offset if (rs2 >s 0) pc += sext(offset)

Branch if Greater Than Zero. Pseudoinstruction, RV32I and RV64I.
Expands to blt x0, rs2, offset.

ble rs1, rs2, offset if (rs1 ≤s rs2) pc += sext(offset)

Branch if Less Than or Equal. Pseudoinstruction, RV32I and RV64I.
Expands to bge rs2, rs1, offset.

bleu rs1, rs2, offset if (rs1 ≤u rs2) pc += sext(offset)

Branch if Less Than or Equal, Unsigned. Pseudoinstruction, RV32I and RV64I.
Expands to bgeu rs2, rs1, offset.

blez rs2, offset if (rs2 ≤s 0) pc += sext(offset)

Branch if Less Than or Equal to Zero. Pseudoinstruction, RV32I and RV64I.
Expands to bge x0, rs2, offset.

blt rs1, rs2, offset if (rs1 <s rs2) pc += sext(offset)

Branch if Less Than. B-type, RV32I and RV64I.
If register x[rs1] is less than x[rs2], treating the values as two’s complement numbers, set the
pc to the current pc plus the sign-extended offset.
31:25            24:20  19:15  14:12  11:7            6:0
offset[12|10:5]  rs2    rs1    100    offset[4:1|11]  1100011

bltz rs1, offset if (rs1 <s 0) pc += sext(offset)

Branch if Less Than Zero. Pseudoinstruction, RV32I and RV64I.
Expands to blt rs1, x0, offset.

bltu rs1, rs2, offset if (rs1 <u rs2) pc += sext(offset)

Branch if Less Than, Unsigned. B-type, RV32I and RV64I.
If register x[rs1] is less than x[rs2], treating the values as unsigned numbers, set the pc to the
current pc plus the sign-extended offset.
31:25            24:20  19:15  14:12  11:7            6:0
offset[12|10:5]  rs2    rs1    110    offset[4:1|11]  1100011

bne rs1, rs2, offset if (rs1 ≠ rs2) pc += sext(offset)

Branch if Not Equal. B-type, RV32I and RV64I.
If register x[rs1] does not equal register x[rs2], set the pc to the current pc plus the sign-
extended offset.
Compressed form: c.bnez rs1, offset
31:25            24:20  19:15  14:12  11:7            6:0
offset[12|10:5]  rs2    rs1    001    offset[4:1|11]  1100011
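
The scrambled offset bits of the B-type layouts in this appendix are easier to follow in code. The sketch below assembles a branch instruction word from its fields; encode_branch and FUNCT3 are names invented for this example, and register operands are given as plain register numbers.

```python
FUNCT3 = {"beq": 0b000, "bne": 0b001, "blt": 0b100,
          "bge": 0b101, "bltu": 0b110, "bgeu": 0b111}

def encode_branch(name, rs1, rs2, offset):
    """Assemble a B-type word; offset is a signed, even byte count."""
    assert offset % 2 == 0 and -4096 <= offset < 4096
    imm = offset & 0x1FFF                          # 13-bit two's complement
    return (((imm >> 12) & 1) << 31                # offset[12]
            | ((imm >> 5) & 0x3F) << 25            # offset[10:5]
            | rs2 << 20 | rs1 << 15 | FUNCT3[name] << 12
            | ((imm >> 1) & 0xF) << 8              # offset[4:1]
            | ((imm >> 11) & 1) << 7               # offset[11]
            | 0b1100011)                           # opcode

word = encode_branch("beq", rs1=1, rs2=2, offset=8)
```

Bit 0 of the offset is never stored because branch targets are always a multiple of two bytes.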

bnez rs1, offset if (rs1 ≠ 0) pc += sext(offset)

Branch if Not Equal to Zero. Pseudoinstruction, RV32I and RV64I.
Expands to bne rs1, x0, offset.

c.add rd, rs2 x[rd] = x[rd] + x[rs2]

Add. RV32IC and RV64IC.
Expands to add rd, rd, rs2. Invalid when rd=x0 or rs2=x0.
15 13 12 11 7 6 2 1 0
100 1 rd rs2 10

c.addi rd, imm x[rd] = x[rd] + sext(imm)

Add Immediate. RV32IC and RV64IC.
Expands to addi rd, rd, imm.
15 13 12 11 7 6 2 1 0
000 imm[5] rd imm[4:0] 01

c.addi16sp imm x[2] = x[2] + sext(imm)

Add Immediate, Scaled by 16, to Stack Pointer. RV32IC and RV64IC.
Expands to addi x2, x2, imm. Invalid when imm=0.
15 13 12 11 7 6 2 1 0
011 imm[9] 00010 imm[4|6|8:7|5] 01

c.addi4spn rd′, uimm x[8+rd′] = x[2] + uimm

Add Immediate, Scaled by 4, to Stack Pointer, Nondestructive. RV32IC and RV64IC.
Expands to addi rd, x2, uimm, where rd=8+rd′. Invalid when uimm=0.
15 13 12 5 4 2 1 0
000 uimm[5:4|9:6|2|3] rd′ 00

c.addiw rd, imm x[rd] = sext((x[rd] + sext(imm))[31:0])

Add Word Immediate. RV64IC only.
Expands to addiw rd, rd, imm. Invalid when rd=x0.
15 13 12 11 7 6 2 1 0
001 imm[5] rd imm[4:0] 01

c.and rd′, rs2′ x[8+rd′] = x[8+rd′] & x[8+rs2′]

AND. RV32IC and RV64IC.
Expands to and rd, rd, rs2, where rd=8+rd′ and rs2=8+rs2′.
15 10 9 7 6 5 4 2 1 0
100011 rd′ 11 rs2′ 01
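
Several compressed instructions name their registers with three-bit fields that select x8 through x15. A small sketch of expanding a c.and halfword, assuming a hypothetical helper expand_c_and, illustrates the 8+ mapping:

```python
def expand_c_and(halfword):
    """Return (rd, rs2) register numbers for a 16-bit c.and encoding."""
    assert (halfword >> 10) & 0x3F == 0b100011   # bits 15:10
    assert (halfword >> 5) & 0x3 == 0b11         # bits 6:5 select AND
    assert halfword & 0x3 == 0b01                # bits 1:0
    rd = 8 + ((halfword >> 7) & 0x7)             # three-bit field, bits 9:7
    rs2 = 8 + ((halfword >> 2) & 0x7)            # three-bit field, bits 4:2
    return rd, rs2


rd, rs2 = expand_c_and(0x8C65)
print(f"and x{rd}, x{rd}, x{rs2}")   # and x8, x8, x9
```

The same 8+ offset applies to every three-bit register field in the compressed entries of this listing.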

c.addw rd′, rs2′ x[8+rd′] = sext((x[8+rd′] + x[8+rs2′])[31:0])

Add Word. RV64IC only.
Expands to addw rd, rd, rs2, where rd=8+rd′ and rs2=8+rs2′.
15 10 9 7 6 5 4 2 1 0
100111 rd′ 01 rs2′ 01

c.andi rd′, imm x[8+rd′] = x[8+rd′] & sext(imm)

AND Immediate. RV32IC and RV64IC.
Expands to andi rd, rd, imm, where rd=8+rd′.
15 13 12 11 10 9 7 6 2 1 0
100 imm[5] 10 rd′ imm[4:0] 01

c.beqz rs1′, offset if (x[8+rs1′] == 0) pc += sext(offset)

Branch if Equal to Zero. RV32IC and RV64IC.
Expands to beq rs1, x0, offset, where rs1=8+rs1′.
15 13 12 10 9 7 6 2 1 0
110 offset[8|4:3] rs1′ offset[7:6|2:1|5] 01

c.bnez rs1′, offset if (x[8+rs1′] ≠ 0) pc += sext(offset)

Branch if Not Equal to Zero. RV32IC and RV64IC.
Expands to bne rs1, x0, offset, where rs1=8+rs1′.
15 13 12 10 9 7 6 2 1 0
111 offset[8|4:3] rs1′ offset[7:6|2:1|5] 01

c.ebreak RaiseException(Breakpoint)
Environment Breakpoint. RV32IC and RV64IC.
Expands to ebreak.
15 13 12 11 7 6 2 1 0
100 1 00000 00000 10

c.fld rd′, uimm(rs1′) f[8+rd′] = M[x[8+rs1′] + uimm][63:0]

Floating-point Load Doubleword. RV32DC and RV64DC.
Expands to fld rd, uimm(rs1), where rd=8+rd′ and rs1=8+rs1′.
15 13 12 10 9 7 6 5 4 2 1 0
001 uimm[5:3] rs1′ uimm[7:6] rd′ 00

c.fldsp rd, uimm(x2) f[rd] = M[x[2] + uimm][63:0]

Floating-point Load Doubleword, Stack-Pointer Relative. RV32DC and RV64DC.
Expands to fld rd, uimm(x2).
15 13 12 11 7 6 2 1 0
001 uimm[5] rd uimm[4:3|8:6] 10

c.flw rd′, uimm(rs1′) f[8+rd′] = M[x[8+rs1′] + uimm][31:0]

Floating-point Load Word. RV32FC only.
Expands to flw rd, uimm(rs1), where rd=8+rd′ and rs1=8+rs1′.
15 13 12 10 9 7 6 5 4 2 1 0
011 uimm[5:3] rs1′ uimm[2|6] rd′ 00

c.flwsp rd, uimm(x2) f[rd] = M[x[2] + uimm][31:0]

Floating-point Load Word, Stack-Pointer Relative. RV32FC only.
Expands to flw rd, uimm(x2).
15 13 12 11 7 6 2 1 0
011 uimm[5] rd uimm[4:2|7:6] 10

c.fsd rs2′, uimm(rs1′) M[x[8+rs1′] + uimm][63:0] = f[8+rs2′]

Floating-point Store Doubleword. RV32DC and RV64DC.
Expands to fsd rs2, uimm(rs1), where rs2=8+rs2′ and rs1=8+rs1′.
15 13 12 10 9 7 6 5 4 2 1 0
101 uimm[5:3] rs1′ uimm[7:6] rs2′ 00

c.fsdsp rs2, uimm(x2) M[x[2] + uimm][63:0] = f[rs2]

Floating-point Store Doubleword, Stack-Pointer Relative. RV32DC and RV64DC.
Expands to fsd rs2, uimm(x2).
15 13 12 7 6 2 1 0
101 uimm[5:3|8:6] rs2 10

c.fsw rs2′, uimm(rs1′) M[x[8+rs1′] + uimm][31:0] = f[8+rs2′]

Floating-point Store Word. RV32FC only.
Expands to fsw rs2, uimm(rs1), where rs2=8+rs2′ and rs1=8+rs1′.
15 13 12 10 9 7 6 5 4 2 1 0
111 uimm[5:3] rs1′ uimm[2|6] rs2′ 00

c.fswsp rs2, uimm(x2) M[x[2] + uimm][31:0] = f[rs2]

Floating-point Store Word, Stack-Pointer Relative. RV32FC only.
Expands to fsw rs2, uimm(x2).
15 13 12 7 6 2 1 0
111 uimm[5:2|7:6] rs2 10

c.j offset pc += sext(offset)

Jump. RV32IC and RV64IC.
Expands to jal x0, offset.
15 13 12 2 1 0
101 offset[11|4|9:8|10|6|7|3:1|5] 01

c.jal offset x[1] = pc+2; pc += sext(offset)

Jump and Link. RV32IC only.
Expands to jal x1, offset.
15 13 12 2 1 0
001 offset[11|4|9:8|10|6|7|3:1|5] 01

c.jalr rs1 t = pc+2; pc = x[rs1]; x[1] = t

Jump and Link Register. RV32IC and RV64IC.
Expands to jalr x1, 0(rs1). Invalid when rs1=x0.
15 13 12 11 7 6 2 1 0
100 1 rs1 00000 10

c.jr rs1 pc = x[rs1]

Jump Register. RV32IC and RV64IC.
Expands to jalr x0, 0(rs1). Invalid when rs1=x0.
15 13 12 11 7 6 2 1 0
100 0 rs1 00000 10

c.ld rd′, uimm(rs1′) x[8+rd′] = M[x[8+rs1′] + uimm][63:0]

Load Doubleword. RV64IC only.
Expands to ld rd, uimm(rs1), where rd=8+rd′ and rs1=8+rs1′.
15 13 12 10 9 7 6 5 4 2 1 0
011 uimm[5:3] rs1′ uimm[7:6] rd′ 00

c.ldsp rd, uimm(x2) x[rd] = M[x[2] + uimm][63:0]

Load Doubleword, Stack-Pointer Relative. RV64IC only.
Expands to ld rd, uimm(x2). Invalid when rd=x0.
15 13 12 11 7 6 2 1 0
011 uimm[5] rd uimm[4:3|8:6] 10

c.li rd, imm x[rd] = sext(imm)

Load Immediate. RV32IC and RV64IC.
Expands to addi rd, x0, imm.
15 13 12 11 7 6 2 1 0
010 imm[5] rd imm[4:0] 01

c.lui rd, imm x[rd] = sext(imm[17:12] << 12)

Load Upper Immediate. RV32IC and RV64IC.
Expands to lui rd, imm. Invalid when rd=x2 or imm=0.
15 13 12 11 7 6 2 1 0
011 imm[17] rd imm[16:12] 01

c.lw rd′, uimm(rs1′) x[8+rd′] = sext(M[x[8+rs1′] + uimm][31:0])

Load Word. RV32IC and RV64IC.
Expands to lw rd, uimm(rs1), where rd=8+rd′ and rs1=8+rs1′.
15 13 12 10 9 7 6 5 4 2 1 0
010 uimm[5:3] rs1′ uimm[2|6] rd′ 00

c.lwsp rd, uimm(x2) x[rd] = sext(M[x[2] + uimm][31:0])

Load Word, Stack-Pointer Relative. RV32IC and RV64IC.
Expands to lw rd, uimm(x2). Invalid when rd=x0.
15 13 12 11 7 6 2 1 0
010 uimm[5] rd uimm[4:2|7:6] 10

c.mv rd, rs2 x[rd] = x[rs2]

Move. RV32IC and RV64IC.
Expands to add rd, x0, rs2. Invalid when rs2=x0.
15 13 12 11 7 6 2 1 0
100 0 rd rs2 10

c.or rd′, rs2′ x[8+rd′] = x[8+rd′] | x[8+rs2′]

OR. RV32IC and RV64IC.
Expands to or rd, rd, rs2, where rd=8+rd′ and rs2=8+rs2′.
15 10 9 7 6 5 4 2 1 0
100011 rd′ 10 rs2′ 01

c.sd rs2′, uimm(rs1′) M[x[8+rs1′] + uimm][63:0] = x[8+rs2′]

Store Doubleword. RV64IC only.
Expands to sd rs2, uimm(rs1), where rs2=8+rs2′ and rs1=8+rs1′.
15 13 12 10 9 7 6 5 4 2 1 0
111 uimm[5:3] rs1′ uimm[7:6] rs2′ 00

c.sdsp rs2, uimm(x2) M[x[2] + uimm][63:0] = x[rs2]

Store Doubleword, Stack-Pointer Relative. RV64IC only.
Expands to sd rs2, uimm(x2).
15 13 12 7 6 2 1 0
111 uimm[5:3|8:6] rs2 10

c.slli rd, uimm x[rd] = x[rd] << uimm

Shift Left Logical Immediate. RV32IC and RV64IC.
Expands to slli rd, rd, uimm.
15 13 12 11 7 6 2 1 0
000 uimm[5] rd uimm[4:0] 10

c.srai rd′, uimm x[8+rd′] = x[8+rd′] >>s uimm

Shift Right Arithmetic Immediate. RV32IC and RV64IC.
Expands to srai rd, rd, uimm, where rd=8+rd′.
15 13 12 11 10 9 7 6 2 1 0
100 uimm[5] 01 rd′ uimm[4:0] 01

c.srli rd′, uimm x[8+rd′] = x[8+rd′] >>u uimm

Shift Right Logical Immediate. RV32IC and RV64IC.
Expands to srli rd, rd, uimm, where rd=8+rd′.
15 13 12 11 10 9 7 6 2 1 0
100 uimm[5] 00 rd′ uimm[4:0] 01

c.sub rd′, rs2′ x[8+rd′] = x[8+rd′] - x[8+rs2′]

Subtract. RV32IC and RV64IC.
Expands to sub rd, rd, rs2, where rd=8+rd′ and rs2=8+rs2′.
15 10 9 7 6 5 4 2 1 0
100011 rd′ 00 rs2′ 01

c.subw rd′, rs2′ x[8+rd′] = sext((x[8+rd′] - x[8+rs2′])[31:0])

Subtract Word. RV64IC only.
Expands to subw rd, rd, rs2, where rd=8+rd′ and rs2=8+rs2′.
15 10 9 7 6 5 4 2 1 0
100111 rd′ 00 rs2′ 01

c.sw rs2′, uimm(rs1′) M[x[8+rs1′] + uimm][31:0] = x[8+rs2′]

Store Word. RV32IC and RV64IC.
Expands to sw rs2, uimm(rs1), where rs2=8+rs2′ and rs1=8+rs1′.
15 13 12 10 9 7 6 5 4 2 1 0
110 uimm[5:3] rs1′ uimm[2|6] rs2′ 00

c.swsp rs2, uimm(x2) M[x[2] + uimm][31:0] = x[rs2]

Store Word, Stack-Pointer Relative. RV32IC and RV64IC.
Expands to sw rs2, uimm(x2).
15 13 12 7 6 2 1 0
110 uimm[5:2|7:6] rs2 10

c.xor rd′, rs2′ x[8+rd′] = x[8+rd′] ^ x[8+rs2′]

Exclusive-OR. RV32IC and RV64IC.
Expands to xor rd, rd, rs2, where rd=8+rd′ and rs2=8+rs2′.
15 10 9 7 6 5 4 2 1 0
100011 rd′ 01 rs2′ 01

call rd, symbol x[rd] = pc+8; pc = &symbol

Call. Pseudoinstruction, RV32I and RV64I.
Writes the address of the next instruction (pc+8) to x[rd], then sets the pc to symbol. Expands
to auipc rd, offsetHi then jalr rd, offsetLo(rd). If rd is omitted, x1 is implied.
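
The split of the pc-relative distance into offsetHi and offsetLo has one subtlety: jalr sign-extends its 12-bit offset, so the upper part must be rounded rather than truncated. A minimal sketch for RV32, assuming hypothetical helpers split_call_offset and call_target:

```python
def split_call_offset(offset):
    """Split a pc-relative byte offset into (offsetHi, offsetLo).

    offsetHi is the 20-bit field given to auipc; offsetLo is the signed
    12-bit field given to jalr.
    """
    hi = (offset + 0x800) >> 12        # round so that lo lands in [-2048, 2047]
    lo = offset - (hi << 12)
    return hi & 0xFFFFF, lo


def call_target(pc, hi, lo):
    """Address reached by auipc rd, hi followed by jalr rd, lo(rd) on RV32."""
    hi_signed = hi - (1 << 20) if hi & 0x80000 else hi
    return (pc + (hi_signed << 12) + lo) & 0xFFFFFFFF


hi, lo = split_call_offset(-4100)
print(hex(hi), lo, hex(call_target(0x10000, hi, lo)))
```

Without the +0x800 rounding, any offset whose low 12 bits are 0x800 or more would land 4096 bytes short of the intended target.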

csrr rd, csr x[rd] = CSRs[csr]

Control and Status Register Read. Pseudoinstruction, RV32I and RV64I.
Copies control and status register csr to x[rd]. Expands to csrrs rd, csr, x0.

csrc csr, rs1 CSRs[csr] &= ∼x[rs1]

Control and Status Register Clear. Pseudoinstruction, RV32I and RV64I.
For each bit set in x[rs1], clear the corresponding bit in control and status register csr.
Expands to csrrc x0, csr, rs1.

csrci csr, zimm[4:0] CSRs[csr] &= ∼zimm

Control and Status Register Clear Immediate. Pseudoinstruction, RV32I and RV64I.
For each bit set in the five-bit zero-extended immediate, clear the corresponding bit in control
and status register csr. Expands to csrrci x0, csr, zimm.

csrrc rd, csr, rs1 t = CSRs[csr]; CSRs[csr] = t &∼x[rs1]; x[rd] = t

Control and Status Register Read and Clear. I-type, RV32I and RV64I.
Let t be the value of control and status register csr. Write the bitwise AND of t and the ones’
complement of x[rs1] to the csr, then write t to x[rd].
31 20 19 15 14 12 11 7 6 0
csr rs1 011 rd 1110011

csrrci rd, csr, zimm[4:0] t = CSRs[csr]; CSRs[csr] = t &∼zimm; x[rd] = t

Control and Status Register Read and Clear Immediate. I-type, RV32I and RV64I.
Let t be the value of control and status register csr. Write the bitwise AND of t and the ones’
complement of the five-bit zero-extended immediate zimm to the csr, then write t to x[rd].
(Bits 5 and above in the csr are not modified.)
31 20 19 15 14 12 11 7 6 0
csr zimm[4:0] 111 rd 1110011

csrrs rd, csr, rs1 t = CSRs[csr]; CSRs[csr] = t | x[rs1]; x[rd] = t

Control and Status Register Read and Set. I-type, RV32I and RV64I.
Let t be the value of control and status register csr. Write the bitwise OR of t and x[rs1] to
the csr, then write t to x[rd].
31 20 19 15 14 12 11 7 6 0
csr rs1 010 rd 1110011

csrrsi rd, csr, zimm[4:0] t = CSRs[csr]; CSRs[csr] = t | zimm; x[rd] = t

Control and Status Register Read and Set Immediate. I-type, RV32I and RV64I.
Let t be the value of control and status register csr. Write the bitwise OR of t and the five-bit
zero-extended immediate zimm to the csr, then write t to x[rd]. (Bits 5 and above in the csr
are not modified.)
31 20 19 15 14 12 11 7 6 0
csr zimm[4:0] 110 rd 1110011

csrrw rd, csr, rs1 t = CSRs[csr]; CSRs[csr] = x[rs1]; x[rd] = t

Control and Status Register Read and Write. I-type, RV32I and RV64I.
Let t be the value of control and status register csr. Copy x[rs1] to the csr, then write t to
x[rd].
31 20 19 15 14 12 11 7 6 0
csr rs1 001 rd 1110011

csrrwi rd, csr, zimm[4:0] x[rd] = CSRs[csr]; CSRs[csr] = zimm

Control and Status Register Read and Write Immediate. I-type, RV32I and RV64I.
Copies the control and status register csr to x[rd], then writes the five-bit zero-extended
immediate zimm to the csr.
31 20 19 15 14 12 11 7 6 0
csr zimm[4:0] 101 rd 1110011
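
The six CSR instructions above share one pattern: read the old value, write a modified value, return the old value. A toy model, assuming a hypothetical CsrFile class and a 32-bit register width, may make the differences easier to see:

```python
MASK = 0xFFFFFFFF  # assumes XLEN = 32 for this sketch


class CsrFile:
    """Toy model of the CSR read-modify-write instructions."""

    def __init__(self):
        self.csrs = {}

    def csrrw(self, csr, value):
        t = self.csrs.get(csr, 0)
        self.csrs[csr] = value & MASK
        return t                           # value written to x[rd]

    def csrrs(self, csr, value):
        t = self.csrs.get(csr, 0)
        self.csrs[csr] = (t | value) & MASK
        return t

    def csrrc(self, csr, value):
        t = self.csrs.get(csr, 0)
        self.csrs[csr] = t & ~value & MASK
        return t

    # Immediate forms use a five-bit zero-extended immediate as the operand.
    def csrrwi(self, csr, zimm):
        return self.csrrw(csr, zimm & 0x1F)

    def csrrsi(self, csr, zimm):
        return self.csrrs(csr, zimm & 0x1F)

    def csrrci(self, csr, zimm):
        return self.csrrc(csr, zimm & 0x1F)


MSCRATCH = 0x340  # example CSR address
f = CsrFile()
f.csrrw(MSCRATCH, 0b1010)
old = f.csrrs(MSCRATCH, 0b0101)
print(bin(old), bin(f.csrs[MSCRATCH]))
```

The pseudoinstructions csrr, csrw, csrs, and csrc in this listing are these same operations with x0 as either the source or the destination.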

csrs csr, rs1 CSRs[csr] |= x[rs1]

Control and Status Register Set. Pseudoinstruction, RV32I and RV64I.
For each bit set in x[rs1], set the corresponding bit in control and status register csr. Expands
to csrrs x0, csr, rs1.

csrsi csr, zimm[4:0] CSRs[csr] |= zimm

Control and Status Register Set Immediate. Pseudoinstruction, RV32I and RV64I.
For each bit set in the five-bit zero-extended immediate, set the corresponding bit in control
and status register csr. Expands to csrrsi x0, csr, zimm.

csrw csr, rs1 CSRs[csr] = x[rs1]

Control and Status Register Write. Pseudoinstruction, RV32I and RV64I.
Copies x[rs1] to control and status register csr. Expands to csrrw x0, csr, rs1.

csrwi csr, zimm[4:0] CSRs[csr] = zimm

Control and Status Register Write Immediate. Pseudoinstruction, RV32I and RV64I.
Copies the five-bit zero-extended immediate to control and status register csr. Expands to
csrrwi x0, csr, zimm.

div rd, rs1, rs2 x[rd] = x[rs1] ÷s x[rs2]

Divide. R-type, RV32M and RV64M.
Divides x[rs1] by x[rs2], rounding towards zero, treating the values as two’s complement
numbers, and writes the quotient to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 100 rd 0110011

divu rd, rs1, rs2 x[rd] = x[rs1] ÷u x[rs2]

Divide, Unsigned. R-type, RV32M and RV64M.
Divides x[rs1] by x[rs2], rounding towards zero, treating the values as unsigned numbers,
and writes the quotient to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 101 rd 0110011
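
Rounding towards zero differs from the floor division that many languages provide. The sketch below assumes XLEN = 32 and a nonzero divisor (the zero-divisor case is not described in these entries), and the function names are illustrative:

```python
XLEN = 32
MASK = (1 << XLEN) - 1


def to_signed(value):
    """Interpret an XLEN-bit pattern as a two's complement number."""
    value &= MASK
    return value - (1 << XLEN) if value >> (XLEN - 1) else value


def div(rs1, rs2):
    """Signed divide, rounding towards zero. Assumes rs2 != 0."""
    a, b = to_signed(rs1), to_signed(rs2)
    assert b != 0
    q = abs(a) // abs(b)               # magnitude of the truncated quotient
    if (a < 0) != (b < 0):
        q = -q
    return q & MASK


def divu(rs1, rs2):
    """Unsigned divide. Assumes rs2 != 0."""
    assert rs2 & MASK != 0
    return (rs1 & MASK) // (rs2 & MASK)


minus_seven = -7 & MASK
print(to_signed(div(minus_seven, 2)), -7 // 2)   # -3 -4
print(hex(divu(minus_seven, 2)))                 # 0x7ffffffc
```

The same bit pattern gives very different quotients under div and divu, which is why the two instructions exist separately.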

divuw rd, rs1, rs2 x[rd] = sext(x[rs1][31:0] ÷u x[rs2][31:0])

Divide Word, Unsigned. R-type, RV64M only.
Divides the lower 32 bits of x[rs1] by the lower 32 bits of x[rs2], rounding towards zero,
treating the values as unsigned numbers, and writes the sign-extended 32-bit quotient to x[rd].

31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 101 rd 0111011

divw rd, rs1, rs2 x[rd] = sext(x[rs1][31:0] ÷s x[rs2][31:0])

Divide Word. R-type, RV64M only.
Divides the lower 32 bits of x[rs1] by the lower 32 bits of x[rs2], rounding towards zero,
treating the values as two’s complement numbers, and writes the sign-extended 32-bit quotient
to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 100 rd 0111011

ebreak RaiseException(Breakpoint)
Environment Breakpoint. I-type, RV32I and RV64I.
Makes a request of the debugger by raising a Breakpoint exception.
31 20 19 15 14 12 11 7 6 0
000000000001 00000 000 00000 1110011

ecall RaiseException(EnvironmentCall)
Environment Call. I-type, RV32I and RV64I.
Makes a request of the execution environment by raising an Environment Call exception.
31 20 19 15 14 12 11 7 6 0
000000000000 00000 000 00000 1110011

fabs.d rd, rs1 f[rd] = |f[rs1]|

Floating-point Absolute Value. Pseudoinstruction, RV32D and RV64D.
Writes the absolute value of the double-precision floating-point number in f[rs1] to f[rd].
Expands to fsgnjx.d rd, rs1, rs1.

fabs.s rd, rs1 f[rd] = |f[rs1]|

Floating-point Absolute Value. Pseudoinstruction, RV32F and RV64F.
Writes the absolute value of the single-precision floating-point number in f[rs1] to f[rd].
Expands to fsgnjx.s rd, rs1, rs1.
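
The expansion to fsgnjx works because that instruction keeps the exponent and significand of its first operand and XORs the two sign bits; with both operands equal, the sign always comes out as zero. A bit-level sketch for single precision, with illustrative helper names:

```python
import struct

SIGN = 0x80000000


def f32_bits(x):
    """Bit pattern of x rounded to single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits):
    """Single-precision value of a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fsgnjx_s(a_bits, b_bits):
    """Body of a; sign is sign(a) XOR sign(b)."""
    return a_bits ^ (b_bits & SIGN)


def fabs_s(a_bits):
    """fabs.s rd, rs1 modeled as fsgnjx.s rd, rs1, rs1."""
    return fsgnjx_s(a_bits, a_bits)


print(bits_f32(fabs_s(f32_bits(-2.5))))   # 2.5
```

No arithmetic is performed, so the operation only touches the sign bit and leaves the remaining bits, including NaN payloads, unchanged.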

fadd.d rd, rs1, rs2 f[rd] = f[rs1] + f[rs2]

Floating-point Add, Double-Precision. R-type, RV32D and RV64D.
Adds the double-precision floating-point numbers in registers f[rs1] and f[rs2] and writes the
rounded double-precision sum to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 rm rd 1010011

fadd.s rd, rs1, rs2 f[rd] = f[rs1] + f[rs2]

Floating-point Add, Single-Precision. R-type, RV32F and RV64F.
Adds the single-precision floating-point numbers in registers f[rs1] and f[rs2] and writes the
rounded single-precision sum to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0000000 rs2 rs1 rm rd 1010011

fclass.d rd, rs1 x[rd] = classify_d(f[rs1])

Floating-point Classify, Double-Precision. R-type, RV32D and RV64D.
Writes to x[rd] a mask indicating the class of the double-precision floating-point number in
f[rs1]. See the description of fclass.s for the interpretation of the value written to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1110001 00000 rs1 001 rd 1010011

fclass.s rd, rs1 x[rd] = classify_s(f[rs1])

Floating-point Classify, Single-Precision. R-type, RV32F and RV64F.
Writes to x[rd] a mask indicating the class of the single-precision floating-point number in
f[rs1]. Exactly one bit in x[rd] is set, per the following table:
x[rd] bit Meaning
0 f[rs1] is −∞.
1 f[rs1] is a negative normal number.
2 f[rs1] is a negative subnormal number.
3 f[rs1] is −0.
4 f[rs1] is +0.
5 f[rs1] is a positive subnormal number.
6 f[rs1] is a positive normal number.
7 f[rs1] is +∞.
8 f[rs1] is a signaling NaN.
9 f[rs1] is a quiet NaN.

31 25 24 20 19 15 14 12 11 7 6 0
1110000 00000 rs1 001 rd 1010011
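
The classification table can be read directly off the IEEE 754 fields. As a sketch (fclass_s is an illustrative name, and the input is the raw 32-bit pattern held in f[rs1]):

```python
def fclass_s(bits):
    """Return the one-hot class mask for a single-precision bit pattern."""
    sign = (bits >> 31) & 0x1
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0xFF:
        if frac == 0:
            bit = 0 if sign else 7                 # -inf / +inf
        else:
            bit = 9 if frac & 0x400000 else 8      # quiet / signaling NaN
    elif exp == 0:
        if frac == 0:
            bit = 3 if sign else 4                 # -0 / +0
        else:
            bit = 2 if sign else 5                 # subnormals
    else:
        bit = 1 if sign else 6                     # normal numbers
    return 1 << bit


print(bin(fclass_s(0x3F800000)))   # 1.0 is a positive normal: bit 6
```

Because the result is one-hot, software can test for several classes at once by ANDing x[rd] with a mask, for example 0x300 for either kind of NaN.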

fcvt.d.l rd, rs1 f[rd] = f64_s64(x[rs1])

Floating-point Convert to Double from Long. R-type, RV64D only.
Converts the 64-bit two’s complement integer in x[rs1] to a double-precision floating-point
number and writes it to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1101001 00010 rs1 rm rd 1010011

fcvt.d.lu rd, rs1 f[rd] = f64_u64(x[rs1])

Floating-point Convert to Double from Unsigned Long. R-type, RV64D only.
Converts the 64-bit unsigned integer in x[rs1] to a double-precision floating-point number
and writes it to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1101001 00011 rs1 rm rd 1010011

fcvt.d.s rd, rs1 f[rd] = f64_f32(f[rs1])

Floating-point Convert to Double from Single. R-type, RV32D and RV64D.
Converts the single-precision floating-point number in f[rs1] to a double-precision floating-
point number and writes it to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0100001 00000 rs1 rm rd 1010011

fcvt.d.w rd, rs1 f[rd] = f64_s32(x[rs1])

Floating-point Convert to Double from Word. R-type, RV32D and RV64D.
Converts the 32-bit two’s complement integer in x[rs1] to a double-precision floating-point
number and writes it to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1101001 00000 rs1 rm rd 1010011

fcvt.d.wu rd, rs1 f[rd] = f64_u32(x[rs1])

Floating-point Convert to Double from Unsigned Word. R-type, RV32D and RV64D.
Converts the 32-bit unsigned integer in x[rs1] to a double-precision floating-point number
and writes it to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1101001 00001 rs1 rm rd 1010011

fcvt.l.d rd, rs1 x[rd] = s64_f64(f[rs1])

Floating-point Convert to Long from Double. R-type, RV64D only.
Converts the double-precision floating-point number in register f[rs1] to a 64-bit two’s
complement integer and writes it to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1100001 00010 rs1 rm rd 1010011

fcvt.l.s rd, rs1 x[rd] = s64_f32(f[rs1])

Floating-point Convert to Long from Single. R-type, RV64F only.
Converts the single-precision floating-point number in register f[rs1] to a 64-bit two’s
complement integer and writes it to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1100000 00010 rs1 rm rd 1010011

fcvt.lu.d rd, rs1 x[rd] = u64_f64(f[rs1])

Floating-point Convert to Unsigned Long from Double. R-type, RV64D only.
Converts the double-precision floating-point number in register f[rs1] to a 64-bit unsigned
integer and writes it to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1100001 00011 rs1 rm rd 1010011

fcvt.lu.s rd, rs1 x[rd] = u64_f32(f[rs1])

Floating-point Convert to Unsigned Long from Single. R-type, RV64F only.
Converts the single-precision floating-point number in register f[rs1] to a 64-bit unsigned
integer and writes it to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1100000 00011 rs1 rm rd 1010011

fcvt.s.d rd, rs1 f[rd] = f32_f64(f[rs1])

Floating-point Convert to Single from Double. R-type, RV32D and RV64D.
Converts the double-precision floating-point number in f[rs1] to a single-precision floating-
point number and writes it to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0100000 00001 rs1 rm rd 1010011

fcvt.s.l rd, rs1 f[rd] = f32_s64(x[rs1])

Floating-point Convert to Single from Long. R-type, RV64F only.
Converts the 64-bit two’s complement integer in x[rs1] to a single-precision floating-point
number and writes it to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1101000 00010 rs1 rm rd 1010011

fcvt.s.lu rd, rs1 f[rd] = f32_u64(x[rs1])

Floating-point Convert to Single from Unsigned Long. R-type, RV64F only.
Converts the 64-bit unsigned integer in x[rs1] to a single-precision floating-point number and
writes it to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1101000 00011 rs1 rm rd 1010011

fcvt.s.w rd, rs1 f[rd] = f32_s32(x[rs1])

Floating-point Convert to Single from Word. R-type, RV32F and RV64F.
Converts the 32-bit two’s complement integer in x[rs1] to a single-precision floating-point
number and writes it to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1101000 00000 rs1 rm rd 1010011

fcvt.s.wu rd, rs1 f[rd] = f32_u32(x[rs1])

Floating-point Convert to Single from Unsigned Word. R-type, RV32F and RV64F.
Converts the 32-bit unsigned integer in x[rs1] to a single-precision floating-point number and
writes it to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1101000 00001 rs1 rm rd 1010011

fcvt.w.d rd, rs1 x[rd] = sext(s32_f64(f[rs1]))

Floating-point Convert to Word from Double. R-type, RV32D and RV64D.
Converts the double-precision floating-point number in register f[rs1] to a 32-bit two’s
complement integer and writes the sign-extended result to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1100001 00000 rs1 rm rd 1010011

fcvt.wu.d rd, rs1 x[rd] = sext(u32_f64(f[rs1]))

Floating-point Convert to Unsigned Word from Double. R-type, RV32D and RV64D.
Converts the double-precision floating-point number in register f[rs1] to a 32-bit unsigned
integer and writes the sign-extended result to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1100001 00001 rs1 rm rd 1010011

fcvt.w.s rd, rs1 x[rd] = sext(s32_f32(f[rs1]))

Floating-point Convert to Word from Single. R-type, RV32F and RV64F.
Converts the single-precision floating-point number in register f[rs1] to a 32-bit two’s
complement integer and writes the sign-extended result to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1100000 00000 rs1 rm rd 1010011

fcvt.wu.s rd, rs1 x[rd] = sext(u32_f32(f[rs1]))

Floating-point Convert to Unsigned Word from Single. R-type, RV32F and RV64F.
Converts the single-precision floating-point number in register f[rs1] to a 32-bit unsigned
integer and writes the sign-extended result to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1100000 00001 rs1 rm rd 1010011

fdiv.d rd, rs1, rs2 f[rd] = f[rs1] ÷ f[rs2]

Floating-point Divide, Double-Precision. R-type, RV32D and RV64D.
Divides the double-precision floating-point number in register f[rs1] by f[rs2] and writes the
rounded double-precision quotient to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0001101 rs2 rs1 rm rd 1010011

fdiv.s rd, rs1, rs2 f[rd] = f[rs1] ÷ f[rs2]

Floating-point Divide, Single-Precision. R-type, RV32F and RV64F.
Divides the single-precision floating-point number in register f[rs1] by f[rs2] and writes the
rounded single-precision quotient to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0001100 rs2 rs1 rm rd 1010011

fence pred, succ Fence(pred, succ)

Fence Memory and I/O. I-type, RV32I and RV64I.
Renders preceding memory and I/O accesses in the predecessor set observable to other
threads and devices before subsequent memory and I/O accesses in the successor set become
observable. Bits 3, 2, 1, and 0 in these sets correspond to device input, device output, memory
reads, and memory writes, respectively. The instruction fence r,rw, for example, orders
older reads with younger reads and writes, and is encoded with pred=0010 and succ=0011.
If the arguments are omitted, a full fence iorw, iorw is implied.
31 28 27 24 23 20 19 15 14 12 11 7 6 0
0000 pred succ 00000 000 00000 0001111
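
The predecessor and successor sets are four-bit masks over the letters i, o, r, and w. A short sketch of turning the assembly operands into the instruction word, with access_set and encode_fence as illustrative names:

```python
ACCESS_BITS = {"i": 0b1000, "o": 0b0100, "r": 0b0010, "w": 0b0001}


def access_set(spec):
    """Convert a string such as 'rw' into the four-bit set encoding."""
    value = 0
    for ch in spec:
        value |= ACCESS_BITS[ch]
    return value


def encode_fence(pred="iorw", succ="iorw"):
    """Build the 32-bit fence word; omitted operands mean a full fence."""
    return (access_set(pred) << 24) | (access_set(succ) << 20) | 0b0001111


print(hex(encode_fence("r", "rw")))   # 0x230000f
print(hex(encode_fence()))            # 0xff0000f
```

The first line reproduces the fence r,rw example from the text, with pred=0010 and succ=0011.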

fence.i Fence(Store, Fetch)

Fence Instruction Stream. I-type, RV32I and RV64I.
Renders stores to instruction memory observable to subsequent instruction fetches.
31 20 19 15 14 12 11 7 6 0
000000000000 00000 001 00000 0001111

feq.d rd, rs1, rs2 x[rd] = f[rs1] == f[rs2]

Floating-point Equals, Double-Precision. R-type, RV32D and RV64D.
Writes 1 to x[rd] if the double-precision floating-point number in f[rs1] equals the number in
f[rs2], and 0 if not.
31 25 24 20 19 15 14 12 11 7 6 0
1010001 rs2 rs1 010 rd 1010011

feq.s rd, rs1, rs2 x[rd] = f[rs1] == f[rs2]

Floating-point Equals, Single-Precision. R-type, RV32F and RV64F.
Writes 1 to x[rd] if the single-precision floating-point number in f[rs1] equals the number in
f[rs2], and 0 if not.
31 25 24 20 19 15 14 12 11 7 6 0
1010000 rs2 rs1 010 rd 1010011

fld rd, offset(rs1) f[rd] = M[x[rs1] + sext(offset)][63:0]

Floating-point Load Doubleword. I-type, RV32D and RV64D.
Loads a double-precision floating-point number from memory address x[rs1] + sign-
extend(offset) and writes it to f[rd].
Compressed forms: c.fldsp rd, offset; c.fld rd, offset(rs1)
31 20 19 15 14 12 11 7 6 0
offset[11:0] rs1 011 rd 0000111

fle.d rd, rs1, rs2 x[rd] = f[rs1] ≤ f[rs2]

Floating-point Less Than or Equal, Double-Precision. R-type, RV32D and RV64D.
Writes 1 to x[rd] if the double-precision floating-point number in f[rs1] is less than or equal
to the number in f[rs2], and 0 if not.
31 25 24 20 19 15 14 12 11 7 6 0
1010001 rs2 rs1 000 rd 1010011

fle.s rd, rs1, rs2 x[rd] = f[rs1] ≤ f[rs2]

Floating-point Less Than or Equal, Single-Precision. R-type, RV32F and RV64F.
Writes 1 to x[rd] if the single-precision floating-point number in f[rs1] is less than or equal
to the number in f[rs2], and 0 if not.
31 25 24 20 19 15 14 12 11 7 6 0
1010000 rs2 rs1 000 rd 1010011

flt.d rd, rs1, rs2 x[rd] = f[rs1] < f[rs2]

Floating-point Less Than, Double-Precision. R-type, RV32D and RV64D.
Writes 1 to x[rd] if the double-precision floating-point number in f[rs1] is less than the number
in f[rs2], and 0 if not.
31 25 24 20 19 15 14 12 11 7 6 0
1010001 rs2 rs1 001 rd 1010011

flt.s rd, rs1, rs2 x[rd] = f[rs1] < f[rs2]

Floating-point Less Than, Single-Precision. R-type, RV32F and RV64F.
Writes 1 to x[rd] if the single-precision floating-point number in f[rs1] is less than the number
in f[rs2], and 0 if not.
31 25 24 20 19 15 14 12 11 7 6 0
1010000 rs2 rs1 001 rd 1010011

flw rd, offset(rs1) f[rd] = M[x[rs1] + sext(offset)][31:0]

Floating-point Load Word. I-type, RV32F and RV64F.
Loads a single-precision floating-point number from memory address x[rs1] + sign-
extend(offset) and writes it to f[rd].
Compressed forms: c.flwsp rd, offset; c.flw rd, offset(rs1)
31 20 19 15 14 12 11 7 6 0
offset[11:0] rs1 010 rd 0000111

fmadd.d rd, rs1, rs2, rs3 f[rd] = f[rs1]×f[rs2]+f[rs3]

Floating-point Fused Multiply-Add, Double-Precision. R4-type, RV32D and RV64D.
Multiplies the double-precision floating-point numbers in f[rs1] and f[rs2], adds the unrounded
product to the double-precision floating-point number in f[rs3], and writes the
rounded double-precision result to f[rd].
31 27 26 25 24 20 19 15 14 12 11 7 6 0
rs3 01 rs2 rs1 rm rd 1000011

fmadd.s rd, rs1, rs2, rs3 f[rd] = f[rs1]×f[rs2]+f[rs3]

Floating-point Fused Multiply-Add, Single-Precision. R4-type, RV32F and RV64F.
Multiplies the single-precision floating-point numbers in f[rs1] and f[rs2], adds the unrounded
product to the single-precision floating-point number in f[rs3], and writes the
rounded single-precision result to f[rd].
31 27 26 25 24 20 19 15 14 12 11 7 6 0
rs3 00 rs2 rs1 rm rd 1000011
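
The phrase "adds the unrounded product" is what separates a fused multiply-add from a multiply followed by an add: only one rounding happens. The sketch below illustrates this in double precision (as for fmadd.d), using exact rational arithmetic to stand in for the unrounded product. It assumes finite inputs and the default round-to-nearest mode; the function names are illustrative.

```python
from fractions import Fraction


def fused_madd(a, b, c):
    """a*b + c with a single rounding at the end (finite inputs only)."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)                # one correctly rounded conversion


def unfused_madd(a, b, c):
    """a*b rounded to double first, then added to c and rounded again."""
    return a * b + c


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0
print(fused_madd(a, b, c), unfused_madd(a, b, c))
```

Here the exact product is 1 - 2^-60, which rounds to 1.0 as a double, so the unfused version returns 0.0 while the fused version preserves the small difference of -2^-60.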

fmax.d rd, rs1, rs2 f[rd] = max(f[rs1], f[rs2])

Floating-point Maximum, Double-Precision. R-type, RV32D and RV64D.
Copies the larger of the double-precision floating-point numbers in registers f[rs1] and f[rs2]
to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0010101 rs2 rs1 001 rd 1010011

fmax.s rd, rs1, rs2 f[rd] = max(f[rs1], f[rs2])

Floating-point Maximum, Single-Precision. R-type, RV32F and RV64F.
Copies the larger of the single-precision floating-point numbers in registers f[rs1] and f[rs2]
to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0010100 rs2 rs1 001 rd 1010011

fmin.d rd, rs1, rs2 f[rd] = min(f[rs1], f[rs2])

Floating-point Minimum, Double-Precision. R-type, RV32D and RV64D.
Copies the smaller of the double-precision floating-point numbers in registers f[rs1] and
f[rs2] to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0010101 rs2 rs1 000 rd 1010011

fmin.s rd, rs1, rs2 f[rd] = min(f[rs1], f[rs2])

Floating-point Minimum, Single-Precision. R-type, RV32F and RV64F.
Copies the smaller of the single-precision floating-point numbers in registers f[rs1] and f[rs2]
to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0010100 rs2 rs1 000 rd 1010011

fmsub.d rd, rs1, rs2, rs3 f[rd] = f[rs1]×f[rs2]-f[rs3]

Floating-point Fused Multiply-Subtract, Double-Precision. R4-type, RV32D and RV64D.
Multiplies the double-precision floating-point numbers in f[rs1] and f[rs2], subtracts the
double-precision floating-point number in f[rs3] from the unrounded product, and writes the
rounded double-precision result to f[rd].
31 27 26 25 24 20 19 15 14 12 11 7 6 0
rs3 01 rs2 rs1 rm rd 1000111

fmsub.s rd, rs1, rs2, rs3 f[rd] = f[rs1]×f[rs2]-f[rs3]

Floating-point Fused Multiply-Subtract, Single-Precision. R4-type, RV32F and RV64F.
Multiplies the single-precision floating-point numbers in f[rs1] and f[rs2], subtracts the
single-precision floating-point number in f[rs3] from the unrounded product, and writes the
rounded single-precision result to f[rd].
31 27 26 25 24 20 19 15 14 12 11 7 6 0
rs3 00 rs2 rs1 rm rd 1000111

fmul.d rd, rs1, rs2 f[rd] = f[rs1] × f[rs2]

Floating-point Multiply, Double-Precision. R-type, RV32D and RV64D.
Multiplies the double-precision floating-point numbers in registers f[rs1] and f[rs2] and
writes the rounded double-precision product to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0001001 rs2 rs1 rm rd 1010011

fmul.s rd, rs1, rs2 f[rd] = f[rs1] × f[rs2]

Floating-point Multiply, Single-Precision. R-type, RV32F and RV64F.
Multiplies the single-precision floating-point numbers in registers f[rs1] and f[rs2] and writes
the rounded single-precision product to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0001000 rs2 rs1 rm rd 1010011

fmv.d rd, rs1 f[rd] = f[rs1]

Floating-point Move. Pseudoinstruction, RV32D and RV64D.
Copies the double-precision floating-point number in f[rs1] to f[rd]. Expands to fsgnj.d rd,
rs1, rs1.

fmv.d.x rd, rs1 f[rd] = x[rs1][63:0]

Floating-point Move Doubleword from Integer. R-type, RV64D only.
Copies the double-precision floating-point number in register x[rs1] to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1111001 00000 rs1 000 rd 1010011

fmv.s rd, rs1 f[rd] = f[rs1]

Floating-point Move. Pseudoinstruction, RV32F and RV64F.
Copies the single-precision floating-point number in f[rs1] to f[rd]. Expands to fsgnj.s rd,
rs1, rs1.

fmv.w.x rd, rs1 f[rd] = x[rs1][31:0]

Floating-point Move Word from Integer. R-type, RV32F and RV64F.
Copies the single-precision floating-point number in register x[rs1] to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1111000 00000 rs1 000 rd 1010011

fmv.x.d rd, rs1 x[rd] = f[rs1][63:0]

Floating-point Move Doubleword to Integer. R-type, RV64D only.
Copies the double-precision floating-point number in register f[rs1] to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
1110001 00000 rs1 000 rd 1010011

fmv.x.w rd, rs1 x[rd] = sext(f[rs1][31:0])

Floating-point Move Word to Integer. R-type, RV32F and RV64F.
Copies the single-precision floating-point number in register f[rs1] to x[rd], sign-extending
the result for RV64F.
31 25 24 20 19 15 14 12 11 7 6 0
1110000 00000 rs1 000 rd 1010011
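
Unlike the fcvt instructions, the fmv.x.* and fmv.*.x moves copy raw bits without converting the value. A sketch of fmv.x.w, assuming an illustrative helper of the same name and an xlen parameter that selects RV32 or RV64 behavior:

```python
import struct


def fmv_x_w(value, xlen=64):
    """Model fmv.x.w: copy the single-precision bit pattern to an integer
    register, sign-extending bit 31 when the register is 64 bits wide."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    if xlen == 64 and bits & 0x80000000:
        bits |= 0xFFFFFFFF00000000
    return bits


print(hex(fmv_x_w(1.0)))             # 0x3f800000
print(hex(fmv_x_w(-1.0)))            # 0xffffffffbf800000
print(hex(fmv_x_w(-1.0, xlen=32)))   # 0xbf800000
```

A value conversion such as fcvt.w.s would instead produce the integers 1 and -1 for these inputs.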

fneg.d rd, rs1 f[rd] = -f[rs1]

Floating-point Negate. Pseudoinstruction, RV32D and RV64D.
Writes the opposite of the double-precision floating-point number in f[rs1] to f[rd]. Expands
to fsgnjn.d rd, rs1, rs1.

fneg.s rd, rs1 f[rd] = -f[rs1]

Floating-point Negate. Pseudoinstruction, RV32F and RV64F.
Writes the opposite of the single-precision floating-point number in f[rs1] to f[rd]. Expands
to fsgnjn.s rd, rs1, rs1.

fnmadd.d rd, rs1, rs2, rs3 f[rd] = -f[rs1]×f[rs2]-f[rs3]

Floating-point Fused Negative Multiply-Add, Double-Precision. R4-type, RV32D and
RV64D.
Multiplies the double-precision floating-point numbers in f[rs1] and f[rs2], negates the result,
subtracts the double-precision floating-point number in f[rs3] from the unrounded product,
and writes the rounded double-precision result to f[rd].
31 27 26 25 24 20 19 15 14 12 11 7 6 0
rs3 01 rs2 rs1 rm rd 1001111

fnmadd.s rd, rs1, rs2, rs3 f[rd] = -f[rs1]×f[rs2]-f[rs3]

Floating-point Fused Negative Multiply-Add, Single-Precision. R4-type, RV32F and RV64F.
Multiplies the single-precision floating-point numbers in f[rs1] and f[rs2], negates the result,
subtracts the single-precision floating-point number in f[rs3] from the unrounded product,
and writes the rounded single-precision result to f[rd].
31 27 26 25 24 20 19 15 14 12 11 7 6 0
rs3 00 rs2 rs1 rm rd 1001111

fnmsub.d rd, rs1, rs2, rs3 f[rd] = -f[rs1]×f[rs2]+f[rs3]

Floating-point Fused Negative Multiply-Subtract, Double-Precision. R4-type, RV32D and
RV64D.
Multiplies the double-precision floating-point numbers in f[rs1] and f[rs2], negates the
result, adds the unrounded product to the double-precision floating-point number in f[rs3], and
writes the rounded double-precision result to f[rd].
31 27 26 25 24 20 19 15 14 12 11 7 6 0
rs3 01 rs2 rs1 rm rd 1001011

fnmsub.s rd, rs1, rs2, rs3 f[rd] = -f[rs1]×f[rs2]+f[rs3]

Floating-point Fused Negative Multiply-Subtract, Single-Precision. R4-type, RV32F and
RV64F.
Multiplies the single-precision floating-point numbers in f[rs1] and f[rs2], negates the result,
adds the unrounded product to the single-precision floating-point number in f[rs3], and writes
the rounded single-precision result to f[rd].
31 27 26 25 24 20 19 15 14 12 11 7 6 0
rs3 00 rs2 rs1 rm rd 1001011
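
The fused entries above say the product stays unrounded until the final add or subtract. The sketch below illustrates why that matters, using exact rational arithmetic to stand in for the single rounding step. It is a rough model only: it assumes finite inputs and round-to-nearest, and it ignores NaNs, infinities, signed zeros, and the rm field. The function names are illustrative.

```python
from fractions import Fraction

def fnmadd_d(a: float, b: float, c: float) -> float:
    """-(a*b) - c, computed exactly and rounded once at the end."""
    return float(-(Fraction(a) * Fraction(b)) - Fraction(c))

def fnmsub_d(a: float, b: float, c: float) -> float:
    """-(a*b) + c, computed exactly and rounded once at the end."""
    return float(-(Fraction(a) * Fraction(b)) + Fraction(c))

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
# The exact product is 1 - 2**-60, which is not representable as a double.
fused = fnmsub_d(a, b, 1.0)   # keeps the 2**-60 term until the final rounding
unfused = -(a * b) + 1.0      # a*b rounds to 1.0 first, so the term is lost
```

Here fused evaluates to 2**-60 while unfused evaluates to 0.0, which is the difference between one rounding and two.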

frcsr rd x[rd] = CSRs[fcsr]

Floating-point Read Control and Status Register. Pseudoinstruction, RV32F and RV64F.
Copies the floating-point control and status register to x[rd]. Expands to csrrs rd, fcsr, x0.

frflags rd x[rd] = CSRs[fflags]

Floating-point Read Exception Flags. Pseudoinstruction, RV32F and RV64F.
Copies the floating-point exception flags to x[rd]. Expands to csrrs rd, fflags, x0.

frrm rd x[rd] = CSRs[frm]

Floating-point Read Rounding Mode. Pseudoinstruction, RV32F and RV64F.
Copies the floating-point rounding mode to x[rd]. Expands to csrrs rd, frm, x0.

fscsr rd, rs1 t = CSRs[fcsr]; CSRs[fcsr] = x[rs1]; x[rd] = t

Floating-point Swap Control and Status Register. Pseudoinstruction, RV32F and RV64F.
Copies x[rs1] to the floating-point control and status register, then copies the previous value
of the floating-point control and status register to x[rd]. Expands to csrrw rd, fcsr, rs1. If rd
is omitted, x0 is assumed.

fsd rs2, offset(rs1) M[x[rs1] + sext(offset)] = f[rs2][63:0]

Floating-point Store Doubleword. S-type, RV32D and RV64D.
Stores the double-precision floating-point number in register f[rs2] to memory at address
x[rs1] + sign-extend(offset).
Compressed forms: c.fsdsp rs2, offset; c.fsd rs2, offset(rs1)
31 25 24 20 19 15 14 12 11 7 6 0
offset[11:5] rs2 rs1 011 offset[4:0] 0100111

fsflags rd, rs1 t = CSRs[fflags]; CSRs[fflags] = x[rs1]; x[rd] = t

Floating-point Swap Exception Flags. Pseudoinstruction, RV32F and RV64F.
Copies x[rs1] to the floating-point exception flags register, then copies the previous floating-
point exception flags to x[rd]. Expands to csrrw rd, fflags, rs1. If rd is omitted, x0 is
assumed.

fsgnj.d rd, rs1, rs2 f[rd] = {f[rs2][63], f[rs1][62:0]}

Floating-point Sign Inject, Double-Precision. R-type, RV32D and RV64D.
Constructs a new double-precision floating-point number from the exponent and significand
of f[rs1], taking the sign from f[rs2], and writes it to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0010001 rs2 rs1 000 rd 1010011

fsgnj.s rd, rs1, rs2 f[rd] = {f[rs2][31], f[rs1][30:0]}

Floating-point Sign Inject, Single-Precision. R-type, RV32F and RV64F.
Constructs a new single-precision floating-point number from the exponent and significand
of f[rs1], taking the sign from f[rs2], and writes it to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0010000 rs2 rs1 000 rd 1010011

fsgnjn.d rd, rs1, rs2 f[rd] = {∼f[rs2][63], f[rs1][62:0]}

Floating-point Sign Inject-Negate, Double-Precision. R-type, RV32D and RV64D.
Constructs a new double-precision floating-point number from the exponent and significand
of f[rs1], taking the opposite sign of f[rs2], and writes it to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0010001 rs2 rs1 001 rd 1010011

fsgnjn.s rd, rs1, rs2 f[rd] = {∼f[rs2][31], f[rs1][30:0]}

Floating-point Sign Inject-Negate, Single-Precision. R-type, RV32F and RV64F.
Constructs a new single-precision floating-point number from the exponent and significand
of f[rs1], taking the opposite sign of f[rs2], and writes it to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0010000 rs2 rs1 001 rd 1010011

fsgnjx.d rd, rs1, rs2 f[rd] = {f[rs1][63] ^ f[rs2][63], f[rs1][62:0]}

Floating-point Sign Inject-XOR, Double-Precision. R-type, RV32D and RV64D.
Constructs a new double-precision floating-point number from the exponent and significand
of f[rs1], taking the sign from the XOR of the signs of f[rs1] and f[rs2], and writes it to f[rd].

31 25 24 20 19 15 14 12 11 7 6 0
0010001 rs2 rs1 010 rd 1010011

fsgnjx.s rd, rs1, rs2 f[rd] = {f[rs1][31] ^ f[rs2][31], f[rs1][30:0]}

Floating-point Sign Inject-XOR, Single-Precision. R-type, RV32F and RV64F.
Constructs a new single-precision floating-point number from the exponent and significand
of f[rs1], taking the sign from the XOR of the signs of f[rs1] and f[rs2], and writes it to f[rd].

31 25 24 20 19 15 14 12 11 7 6 0
0010000 rs2 rs1 010 rd 1010011
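
The three sign-injection instructions differ only in how they pick bit 31 (or bit 63) of the result. As a sketch, they can be modeled on raw 32-bit patterns as below. The helper names are illustrative. The fmv.s and fneg.s expansions are the ones given in their entries; the fabs.s expansion (fsgnjx.s rd, rs1, rs1) is taken from the ISA's pseudoinstruction list rather than from this passage.

```python
SIGN = 0x8000_0000   # bit 31 of a single-precision pattern
MAG = 0x7FFF_FFFF    # exponent and significand, bits 30:0

def fsgnj_s(rs1: int, rs2: int) -> int:
    """Take the sign from rs2 and everything else from rs1."""
    return (rs2 & SIGN) | (rs1 & MAG)

def fsgnjn_s(rs1: int, rs2: int) -> int:
    """Take the opposite of rs2's sign and everything else from rs1."""
    return (~rs2 & SIGN) | (rs1 & MAG)

def fsgnjx_s(rs1: int, rs2: int) -> int:
    """Take the XOR of both signs and everything else from rs1."""
    return ((rs1 ^ rs2) & SIGN) | (rs1 & MAG)

# Pseudoinstructions fall out by passing the same register twice.
def fmv_s(rs1: int) -> int:
    return fsgnj_s(rs1, rs1)

def fneg_s(rs1: int) -> int:
    return fsgnjn_s(rs1, rs1)

def fabs_s(rs1: int) -> int:
    return fsgnjx_s(rs1, rs1)
```

Because these operate on bit patterns only, they need no rounding mode, which is why their funct3 field selects the variant instead of carrying rm.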

fsqrt.d rd, rs1 f[rd] = √f[rs1]

Floating-point Square Root, Double-Precision. R-type, RV32D and RV64D.
Computes the square root of the double-precision floating-point number in register f[rs1] and
writes the rounded double-precision result to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0101101 00000 rs1 rm rd 1010011

fsqrt.s rd, rs1 f[rd] = √f[rs1]

Floating-point Square Root, Single-Precision. R-type, RV32F and RV64F.
Computes the square root of the single-precision floating-point number in register f[rs1] and
writes the rounded single-precision result to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0101100 00000 rs1 rm rd 1010011

fsrm rd, rs1 t = CSRs[frm]; CSRs[frm] = x[rs1]; x[rd] = t

Floating-point Swap Rounding Mode. Pseudoinstruction, RV32F and RV64F.
Copies x[rs1] to the floating-point rounding mode register, then copies the previous floating-
point rounding mode to x[rd]. Expands to csrrw rd, frm, rs1. If rd is omitted, x0 is assumed.

fsub.d rd, rs1, rs2 f[rd] = f[rs1] - f[rs2]

Floating-point Subtract, Double-Precision. R-type, RV32D and RV64D.
Subtracts the double-precision floating-point number in register f[rs2] from f[rs1] and writes
the rounded double-precision difference to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0000101 rs2 rs1 rm rd 1010011

fsub.s rd, rs1, rs2 f[rd] = f[rs1] - f[rs2]

Floating-point Subtract, Single-Precision. R-type, RV32F and RV64F.
Subtracts the single-precision floating-point number in register f[rs2] from f[rs1] and writes
the rounded single-precision difference to f[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0000100 rs2 rs1 rm rd 1010011

fsw rs2, offset(rs1) M[x[rs1] + sext(offset)] = f[rs2][31:0]

Floating-point Store Word. S-type, RV32F and RV64F.
Stores the single-precision floating-point number in register f[rs2] to memory at address
x[rs1] + sign-extend(offset).
Compressed forms: c.fswsp rs2, offset; c.fsw rs2, offset(rs1)
31 25 24 20 19 15 14 12 11 7 6 0
offset[11:5] rs2 rs1 010 offset[4:0] 0100111

j offset pc += sext(offset)
Jump. Pseudoinstruction, RV32I and RV64I.
Sets the pc to the current pc plus the sign-extended offset. Expands to jal x0, offset.

jal rd, offset x[rd] = pc+4; pc += sext(offset)

Jump and Link. J-type, RV32I and RV64I.
Writes the address of the next instruction (pc+4) to x[rd], then sets the pc to the current pc
plus the sign-extended offset. If rd is omitted, x1 is assumed.
Compressed forms: c.j offset; c.jal offset
31 12 11 7 6 0
offset[20|10:1|11|19:12] rd 1101111
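
The field label offset[20|10:1|11|19:12] means the immediate bits are stored out of order in bits 31:12 of the instruction. A small encoder and decoder, sketched below with illustrative function names, may make the permutation easier to follow.

```python
def encode_jal(rd: int, offset: int) -> int:
    """Assemble a jal instruction word from rd and a byte offset."""
    if offset % 2 or not -(1 << 20) <= offset < (1 << 20):
        raise ValueError("offset must be even and fit in 21 signed bits")
    imm = offset & 0x1FFFFF                   # two's-complement bits 20:0
    return (((imm >> 20) & 0x1) << 31         # offset[20]
            | ((imm >> 1) & 0x3FF) << 21      # offset[10:1]
            | ((imm >> 11) & 0x1) << 20       # offset[11]
            | ((imm >> 12) & 0xFF) << 12      # offset[19:12]
            | (rd & 0x1F) << 7                # rd
            | 0b1101111)                      # opcode

def decode_jal_offset(word: int) -> int:
    """Recover the signed byte offset from a jal instruction word."""
    imm = (((word >> 31) & 0x1) << 20
           | ((word >> 21) & 0x3FF) << 1
           | ((word >> 20) & 0x1) << 11
           | ((word >> 12) & 0xFF) << 12)
    return imm - (1 << 21) if imm & (1 << 20) else imm
```

For example, encode_jal(0, 0) gives 0x0000006f, the encoding of j with a zero offset. Bit 0 of the offset is never stored because jump targets are always even.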

jalr rd, offset(rs1) t = pc+4; pc = (x[rs1] + sext(offset)) & ∼1; x[rd] = t

Jump and Link Register. I-type, RV32I and RV64I.
Sets the pc to x[rs1] + sign-extend(offset), masking off the least-significant bit of the
computed address, then writes the previous pc+4 to x[rd]. If rd is omitted, x1 is assumed.
Compressed forms: c.jr rs1; c.jalr rs1
31 20 19 15 14 12 11 7 6 0
offset[11:0] rs1 000 rd 1100111

jr rs1 pc = x[rs1]
Jump Register. Pseudoinstruction, RV32I and RV64I.
Sets the pc to x[rs1]. Expands to jalr x0, 0(rs1).

la rd, symbol x[rd] = &symbol

Load Address. Pseudoinstruction, RV32I and RV64I.
Loads the address of symbol into x[rd]. When assembling position-independent code, it
expands into a load from the Global Offset Table: for RV32I, auipc rd, offsetHi then lw rd,
offsetLo(rd); for RV64I, auipc rd, offsetHi then ld rd, offsetLo(rd). Otherwise, it expands
into auipc rd, offsetHi then addi rd, rd, offsetLo.

lb rd, offset(rs1) x[rd] = sext(M[x[rs1] + sext(offset)][7:0])

Load Byte. I-type, RV32I and RV64I.
Loads a byte from memory at address x[rs1] + sign-extend(offset) and writes it to x[rd], sign-
extending the result.
31 20 19 15 14 12 11 7 6 0
offset[11:0] rs1 000 rd 0000011

lbu rd, offset(rs1) x[rd] = M[x[rs1] + sext(offset)][7:0]

Load Byte, Unsigned. I-type, RV32I and RV64I.
Loads a byte from memory at address x[rs1] + sign-extend(offset) and writes it to x[rd], zero-
extending the result.
31 20 19 15 14 12 11 7 6 0
offset[11:0] rs1 100 rd 0000011

ld rd, offset(rs1) x[rd] = M[x[rs1] + sext(offset)][63:0]

Load Doubleword. I-type, RV64I only.
Loads eight bytes from memory at address x[rs1] + sign-extend(offset) and writes them to
x[rd].
Compressed forms: c.ldsp rd, offset; c.ld rd, offset(rs1)
31 20 19 15 14 12 11 7 6 0
offset[11:0] rs1 011 rd 0000011

lh rd, offset(rs1) x[rd] = sext(M[x[rs1] + sext(offset)][15:0])

Load Halfword. I-type, RV32I and RV64I.
Loads two bytes from memory at address x[rs1] + sign-extend(offset) and writes them to
x[rd], sign-extending the result.
31 20 19 15 14 12 11 7 6 0
offset[11:0] rs1 001 rd 0000011

lhu rd, offset(rs1) x[rd] = M[x[rs1] + sext(offset)][15:0]

Load Halfword, Unsigned. I-type, RV32I and RV64I.
Loads two bytes from memory at address x[rs1] + sign-extend(offset) and writes them to
x[rd], zero-extending the result.
31 20 19 15 14 12 11 7 6 0
offset[11:0] rs1 101 rd 0000011
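
The byte and halfword loads differ only in whether the loaded value is sign-extended or zero-extended. The sketch below models that on a 32-bit register; it assumes little-endian memory, and the load helper and the XLEN constant are illustrative names.

```python
XLEN = 32
MASK = (1 << XLEN) - 1

def load(mem: bytes, addr: int, size: int, signed: bool) -> int:
    """Read size bytes at addr (little-endian) and extend to XLEN bits."""
    value = int.from_bytes(mem[addr:addr + size], "little", signed=signed)
    return value & MASK   # register image in two's complement

def lb(mem: bytes, addr: int) -> int:
    return load(mem, addr, 1, signed=True)

def lbu(mem: bytes, addr: int) -> int:
    return load(mem, addr, 1, signed=False)

def lh(mem: bytes, addr: int) -> int:
    return load(mem, addr, 2, signed=True)

def lhu(mem: bytes, addr: int) -> int:
    return load(mem, addr, 2, signed=False)
```

Loading the byte 0x80 with lb fills the upper 24 bits with ones, while lbu leaves them zero.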

li rd, immediate x[rd] = immediate

Load Immediate. Pseudoinstruction, RV32I and RV64I.
Loads a constant into x[rd], using as few instructions as possible. For RV32I, it expands to
lui and/or addi; for RV64I, the expansion can be as long as the eight-instruction sequence
lui, addi, slli, addi, slli, addi, slli, addi.
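
For the RV32I case, the one subtlety is that addi sign-extends its 12-bit immediate, so the lui part must be rounded up whenever bit 11 of the constant is set. The sketch below shows one way to compute the two immediates; split_hi_lo and materialize are hypothetical helper names, and a real assembler would also skip whichever instruction turns out to be unnecessary.

```python
def split_hi_lo(value: int) -> tuple[int, int]:
    """Split a 32-bit constant into a lui immediate and an addi immediate."""
    value &= 0xFFFF_FFFF
    hi = ((value + 0x800) >> 12) & 0xFFFFF   # 20-bit immediate for lui
    lo = value & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000                         # addi sees this as negative
    return hi, lo

def materialize(hi: int, lo: int) -> int:
    """Replay lui followed by addi on a 32-bit register."""
    return ((hi << 12) + lo) & 0xFFFF_FFFF
```

For 0x12345FFF this yields hi = 0x12346 and lo = -1: lui overshoots by one page and addi subtracts 1 to land on the target. The same split applies to the offsetHi and offsetLo pair used by auipc and addi in la and lla.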

lla rd, symbol x[rd] = &symbol

Load Local Address. Pseudoinstruction, RV32I and RV64I.
Loads the address of symbol into x[rd]. Expands into auipc rd, offsetHi then addi rd, rd,
offsetLo.

lr.d rd, (rs1) x[rd] = LoadReserved64(M[x[rs1]])

Load-Reserved Doubleword. R-type, RV64A only.
Loads the eight bytes from memory at address x[rs1], writes them to x[rd], and registers a
reservation on that memory doubleword.
31 27 26 25 24 20 19 15 14 12 11 7 6 0
00010 aq rl 00000 rs1 011 rd 0101111

lr.w rd, (rs1) x[rd] = LoadReserved32(M[x[rs1]])

Load-Reserved Word. R-type, RV32A and RV64A.
Loads the four bytes from memory at address x[rs1], writes them to x[rd], sign-extending the
result, and registers a reservation on that memory word.
31 27 26 25 24 20 19 15 14 12 11 7 6 0
00010 aq rl 00000 rs1 010 rd 0101111

lw rd, offset(rs1) x[rd] = sext(M[x[rs1] + sext(offset)][31:0])

Load Word. I-type, RV32I and RV64I.
Loads four bytes from memory at address x[rs1] + sign-extend(offset) and writes them to
x[rd]. For RV64I, the result is sign-extended.
Compressed forms: c.lwsp rd, offset; c.lw rd, offset(rs1)
31 20 19 15 14 12 11 7 6 0
offset[11:0] rs1 010 rd 0000011

lwu rd, offset(rs1) x[rd] = M[x[rs1] + sext(offset)][31:0]

Load Word, Unsigned. I-type, RV64I only.
Loads four bytes from memory at address x[rs1] + sign-extend(offset) and writes them to
x[rd], zero-extending the result.
31 20 19 15 14 12 11 7 6 0
offset[11:0] rs1 110 rd 0000011

lui rd, immediate x[rd] = sext(immediate[31:12] << 12)

Load Upper Immediate. U-type, RV32I and RV64I.
Writes the sign-extended 20-bit immediate, left-shifted by 12 bits, to x[rd], zeroing the lower
12 bits.
Compressed form: c.lui rd, imm
31 12 11 7 6 0
immediate[31:12] rd 0110111

mret ExceptionReturn(Machine)
Machine-mode Exception Return. R-type, RV32I and RV64I privileged architectures.
Returns from a machine-mode exception handler. Sets the pc to CSRs[mepc], the
privilege mode to CSRs[mstatus].MPP, CSRs[mstatus].MIE to CSRs[mstatus].MPIE, and
CSRs[mstatus].MPIE to 1; and, if user mode is supported, sets CSRs[mstatus].MPP to 0.

31 25 24 20 19 15 14 12 11 7 6 0
0011000 00010 00000 000 00000 1110011
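
The state changes that mret performs can be sketched as a sequence of field updates on mstatus. The Hart class below is a hypothetical container. The bit positions (MIE at bit 3, MPIE at bit 7, MPP at bits 12:11) and the privilege encodings are taken from the RISC-V privileged specification rather than from this entry.

```python
from dataclasses import dataclass

MIE = 1 << 3
MPIE = 1 << 7
MPP_SHIFT = 11
MPP_MASK = 0b11 << MPP_SHIFT

@dataclass
class Hart:
    pc: int
    priv: int      # 0 = user, 1 = supervisor, 3 = machine
    mstatus: int
    mepc: int

def mret(h: Hart, user_mode_supported: bool = True) -> None:
    h.pc = h.mepc                                  # pc <- mepc
    h.priv = (h.mstatus & MPP_MASK) >> MPP_SHIFT   # privilege <- MPP
    if h.mstatus & MPIE:                           # MIE <- MPIE
        h.mstatus |= MIE
    else:
        h.mstatus &= ~MIE
    h.mstatus |= MPIE                              # MPIE <- 1
    if user_mode_supported:
        h.mstatus &= ~MPP_MASK                     # MPP <- 0
```

The sret entry follows the same pattern with sepc, SPP, SIE, and SPIE.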

mul rd, rs1, rs2 x[rd] = x[rs1] × x[rs2]

Multiply. R-type, RV32M and RV64M.
Multiplies x[rs1] by x[rs2] and writes the product to x[rd]. Arithmetic overflow is ignored.
31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 000 rd 0110011

mulh rd, rs1, rs2 x[rd] = (x[rs1] s ×s x[rs2]) >>s XLEN

Multiply High. R-type, RV32M and RV64M.
Multiplies x[rs1] by x[rs2], treating the values as two’s complement numbers, and writes the
upper half of the product to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 001 rd 0110011

mulhsu rd, rs1, rs2 x[rd] = (x[rs1] s ×u x[rs2]) >>s XLEN

Multiply High Signed-Unsigned. R-type, RV32M and RV64M.
Multiplies x[rs1] by x[rs2], treating x[rs1] as a two’s complement number and x[rs2] as an
unsigned number, and writes the upper half of the product to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 010 rd 0110011

mulhu rd, rs1, rs2 x[rd] = (x[rs1] u ×u x[rs2]) >>u XLEN

Multiply High Unsigned. R-type, RV32M and RV64M.
Multiplies x[rs1] by x[rs2], treating the values as unsigned numbers, and writes the upper
half of the product to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 011 rd 0110011
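
The three multiply-high variants differ only in how each operand is interpreted before the full-width product is formed. Python's unbounded integers make that easy to sketch; the helper names and the xlen parameter are illustrative.

```python
def to_signed(x: int, xlen: int) -> int:
    """Interpret the low xlen bits of x as a two's complement number."""
    x &= (1 << xlen) - 1
    return x - (1 << xlen) if x >> (xlen - 1) else x

def mulh(a: int, b: int, xlen: int = 32) -> int:
    """Signed x signed, upper half of the 2*xlen-bit product."""
    mask = (1 << xlen) - 1
    return ((to_signed(a, xlen) * to_signed(b, xlen)) >> xlen) & mask

def mulhsu(a: int, b: int, xlen: int = 32) -> int:
    """Signed x unsigned, upper half of the product."""
    mask = (1 << xlen) - 1
    return ((to_signed(a, xlen) * (b & mask)) >> xlen) & mask

def mulhu(a: int, b: int, xlen: int = 32) -> int:
    """Unsigned x unsigned, upper half of the product."""
    mask = (1 << xlen) - 1
    return (((a & mask) * (b & mask)) >> xlen) & mask
```

The same bit pattern 0xFFFFFFFF in both operands gives three different upper halves: 0 for mulh (-1 times -1), 0xFFFFFFFF for mulhsu, and 0xFFFFFFFE for mulhu.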

mulw rd, rs1, rs2 x[rd] = sext((x[rs1] × x[rs2])[31:0])

Multiply Word. R-type, RV64M only.
Multiplies x[rs1] by x[rs2], truncates the product to 32 bits, and writes the sign-extended
result to x[rd]. Arithmetic overflow is ignored.
31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 000 rd 0111011

mv rd, rs1 x[rd] = x[rs1]

Move. Pseudoinstruction, RV32I and RV64I.
Copies register x[rs1] to x[rd]. Expands to addi rd, rs1, 0.

neg rd, rs2 x[rd] = -x[rs2]

Negate. Pseudoinstruction, RV32I and RV64I.
Writes the two’s complement of x[rs2] to x[rd]. Expands to sub rd, x0, rs2.

negw rd, rs2 x[rd] = sext((-x[rs2])[31:0])

Negate Word. Pseudoinstruction, RV64I only.
Computes the two’s complement of x[rs2], truncates the result to 32 bits, and writes the
sign-extended result to x[rd]. Expands to subw rd, x0, rs2.

nop Nothing
No operation. Pseudoinstruction, RV32I and RV64I.
Merely advances the pc to the next instruction. Expands to addi x0, x0, 0.

not rd, rs1 x[rd] = ∼x[rs1]

NOT. Pseudoinstruction, RV32I and RV64I.
Writes the ones’ complement of x[rs1] to x[rd]. Expands to xori rd, rs1, -1.

or rd, rs1, rs2 x[rd] = x[rs1] | x[rs2]

OR. R-type, RV32I and RV64I.
Computes the bitwise inclusive-OR of registers x[rs1] and x[rs2] and writes the result to
x[rd].
Compressed form: c.or rd, rs2
31 25 24 20 19 15 14 12 11 7 6 0
0000000 rs2 rs1 110 rd 0110011

ori rd, rs1, immediate x[rd] = x[rs1] | sext(immediate)

OR Immediate. I-type, RV32I and RV64I.
Computes the bitwise inclusive-OR of the sign-extended immediate and register x[rs1] and
writes the result to x[rd].
31 20 19 15 14 12 11 7 6 0
immediate[11:0] rs1 110 rd 0010011

rdcycle rd x[rd] = CSRs[cycle]

Read Cycle Counter. Pseudoinstruction, RV32I and RV64I.
Writes the number of cycles that have elapsed to x[rd]. Expands to csrrs rd, cycle, x0.

rdcycleh rd x[rd] = CSRs[cycleh]

Read Cycle Counter High. Pseudoinstruction, RV32I only.
Writes the number of cycles that have elapsed, shifted right by 32 bits, to x[rd]. Expands to
csrrs rd, cycleh, x0.

rdinstret rd x[rd] = CSRs[instret]

Read Instructions-Retired Counter. Pseudoinstruction, RV32I and RV64I.
Writes the number of instructions that have retired to x[rd]. Expands to csrrs rd, instret, x0.

rdinstreth rd x[rd] = CSRs[instreth]

Read Instructions-Retired Counter High. Pseudoinstruction, RV32I only.
Writes the number of instructions that have retired, shifted right by 32 bits, to x[rd]. Expands
to csrrs rd, instreth, x0.

rdtime rd x[rd] = CSRs[time]

Read Time. Pseudoinstruction, RV32I and RV64I.
Writes the current time to x[rd]. The timer frequency is platform-dependent. Expands to
csrrs rd, time, x0.

rdtimeh rd x[rd] = CSRs[timeh]

Read Time High. Pseudoinstruction, RV32I only.
Writes the current time, shifted right by 32 bits, to x[rd]. The timer frequency is platform-
dependent. Expands to csrrs rd, timeh, x0.

rem rd, rs1, rs2 x[rd] = x[rs1] %s x[rs2]

Remainder. R-type, RV32M and RV64M.
Divides x[rs1] by x[rs2], rounding towards zero, treating the values as two’s complement
numbers, and writes the remainder to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 110 rd 0110011

remu rd, rs1, rs2 x[rd] = x[rs1] %u x[rs2]

Remainder, Unsigned. R-type, RV32M and RV64M.
Divides x[rs1] by x[rs2], rounding towards zero, treating the values as unsigned numbers,
and writes the remainder to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 111 rd 0110011

remuw rd, rs1, rs2 x[rd] = sext(x[rs1][31:0] %u x[rs2][31:0])

Remainder Word, Unsigned. R-type, RV64M only.
Divides the lower 32 bits of x[rs1] by the lower 32 bits of x[rs2], rounding towards zero,
treating the values as unsigned numbers, and writes the sign-extended 32-bit remainder to
x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 111 rd 0111011

remw rd, rs1, rs2 x[rd] = sext(x[rs1][31:0] %s x[rs2][31:0])

Remainder Word. R-type, RV64M only.
Divides the lower 32 bits of x[rs1] by the lower 32 bits of x[rs2], rounding towards zero,
treating the values as two’s complement numbers, and writes the sign-extended 32-bit
remainder to x[rd].
31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 110 rd 0111011
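
Rounding towards zero means the remainder takes the sign of the dividend, which differs from Python's % operator (it rounds towards negative infinity). The sketch below models rem, remu, and remw for nonzero divisors only; zero divisors are outside this sketch, and the helper names are illustrative.

```python
def to_signed(x: int, xlen: int) -> int:
    """Interpret the low xlen bits of x as a two's complement number."""
    x &= (1 << xlen) - 1
    return x - (1 << xlen) if x >> (xlen - 1) else x

def rem(a: int, b: int, xlen: int = 32) -> int:
    """Signed remainder with the quotient rounded towards zero."""
    sa, sb = to_signed(a, xlen), to_signed(b, xlen)
    q = abs(sa) // abs(sb)
    if (sa < 0) != (sb < 0):
        q = -q
    return (sa - q * sb) & ((1 << xlen) - 1)

def remu(a: int, b: int, xlen: int = 32) -> int:
    """Unsigned remainder."""
    mask = (1 << xlen) - 1
    return (a & mask) % (b & mask)

def remw(a: int, b: int) -> int:
    """RV64 only: 32-bit signed remainder, sign-extended to 64 bits."""
    r = rem(a & 0xFFFF_FFFF, b & 0xFFFF_FFFF, 32)
    return to_signed(r, 32) & 0xFFFF_FFFF_FFFF_FFFF
```

For -7 and 2, rem yields -1 (0xFFFFFFFF as a 32-bit pattern), whereas Python's -7 % 2 evaluates to 1.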

ret pc = x[1]
Return. Pseudoinstruction, RV32I and RV64I.
Returns from a subroutine. Expands to jalr x0, 0(x1).

sb rs2, offset(rs1) M[x[rs1] + sext(offset)] = x[rs2][7:0]

Store Byte. S-type, RV32I and RV64I.
Stores the least-significant byte in register x[rs2] to memory at address x[rs1] + sign-
extend(offset).
31 25 24 20 19 15 14 12 11 7 6 0
offset[11:5] rs2 rs1 000 offset[4:0] 0100011

sc.d rd, rs2, (rs1) x[rd] = StoreConditional64(M[x[rs1]], x[rs2])

Store-Conditional Doubleword. R-type, RV64A only.
Stores the eight bytes in register x[rs2] to memory at address x[rs1], provided there exists
a load reservation on that memory address. Writes 0 to x[rd] if the store succeeded, or a
nonzero error code otherwise.
31 27 26 25 24 20 19 15 14 12 11 7 6 0
00011 aq rl rs2 rs1 011 rd 0101111

sc.w rd, rs2, (rs1) x[rd] = StoreConditional32(M[x[rs1]], x[rs2])

Store-Conditional Word. R-type, RV32A and RV64A.
Stores the four bytes in register x[rs2] to memory at address x[rs1], provided there exists
a load reservation on that memory address. Writes 0 to x[rd] if the store succeeded, or a
nonzero error code otherwise.
31 27 26 25 24 20 19 15 14 12 11 7 6 0
00011 aq rl rs2 rs1 010 rd 0101111
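
The load-reserved and store-conditional pair is typically used in a retry loop. The toy model below is only a sketch under strong assumptions: one hart, one reservation at a time, 32-bit registers, and a reservation that is cleared by any store to the same address. The Memory class and atomic_add are hypothetical names, and real implementations track reservations in hardware-specific ways.

```python
class Memory:
    def __init__(self) -> None:
        self.words: dict[int, int] = {}
        self.reservation: int | None = None   # reserved address, if any

    def lr_w(self, addr: int) -> int:
        """Load a word and register a reservation on its address."""
        self.reservation = addr
        return self.words.get(addr, 0)

    def store(self, addr: int, value: int) -> None:
        """Ordinary store; in this model it cancels a matching reservation."""
        self.words[addr] = value & 0xFFFF_FFFF
        if self.reservation == addr:
            self.reservation = None

    def sc_w(self, addr: int, value: int) -> int:
        """Store only if the reservation still holds; 0 means success."""
        ok = self.reservation == addr
        self.reservation = None
        if ok:
            self.words[addr] = value & 0xFFFF_FFFF
            return 0
        return 1

def atomic_add(mem: Memory, addr: int, delta: int) -> int:
    """Retry lr.w/sc.w until the store succeeds; return the old value."""
    while True:
        old = mem.lr_w(addr)
        if mem.sc_w(addr, old + delta) == 0:
            return old
```

The retry loop is the reason sc writes a status code to x[rd]: software branches back to the lr when the code is nonzero.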

sd rs2, offset(rs1) M[x[rs1] + sext(offset)] = x[rs2][63:0]

Store Doubleword. S-type, RV64I only.
Stores the eight bytes in register x[rs2] to memory at address x[rs1] + sign-extend(offset).
Compressed forms: c.sdsp rs2, offset; c.sd rs2, offset(rs1)
31 25 24 20 19 15 14 12 11 7 6 0
offset[11:5] rs2 rs1 011 offset[4:0] 0100011

seqz rd, rs1 x[rd] = (x[rs1] == 0)

Set if Equal to Zero. Pseudoinstruction, RV32I and RV64I.
Writes 1 to x[rd] if x[rs1] equals 0, or 0 if not. Expands to sltiu rd, rs1, 1.

sext.w rd, rs1 x[rd] = sext(x[rs1][31:0])

Sign-extend Word. Pseudoinstruction, RV64I only.
Reads the lower 32 bits of x[rs1], sign-extends them, and writes the result to x[rd]. Expands
to addiw rd, rs1, 0.

sfence.vma rs1, rs2 Fence(Store, AddressTranslation)

Fence Virtual Memory. R-type, RV32I and RV64I privileged architectures.
Orders preceding stores to the page tables with subsequent virtual-address translations. When
rs2=0, translations for all address spaces are affected; otherwise, only translations for the
address space identified by x[rs2] are ordered. When rs1=0, translations for all virtual addresses in
the selected address spaces are ordered; otherwise, only translations for the page containing
virtual address x[rs1] in the selected address spaces are ordered.
31 25 24 20 19 15 14 12 11 7 6 0
0001001 rs2 rs1 000 00000 1110011

sgtz rd, rs2 x[rd] = (x[rs2] >s 0)

Set if Greater Than Zero. Pseudoinstruction, RV32I and RV64I.
Writes 1 to x[rd] if x[rs2] is greater than 0, or 0 if not. Expands to slt rd, x0, rs2.

sh rs2, offset(rs1) M[x[rs1] + sext(offset)] = x[rs2][15:0]

Store Halfword. S-type, RV32I and RV64I.
Stores the two least-significant bytes in register x[rs2] to memory at address x[rs1] + sign-
extend(offset).
31 25 24 20 19 15 14 12 11 7 6 0
offset[11:5] rs2 rs1 001 offset[4:0] 0100011

sw rs2, offset(rs1) M[x[rs1] + sext(offset)] = x[rs2][31:0]

Store Word. S-type, RV32I and RV64I.
Stores the four least-significant bytes in register x[rs2] to memory at address x[rs1] + sign-
extend(offset).
Compressed forms: c.swsp rs2, offset; c.sw rs2, offset(rs1)
31 25 24 20 19 15 14 12 11 7 6 0
offset[11:5] rs2 rs1 010 offset[4:0] 0100011
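
In the S-type layout used by sb, sh, and sw, the 12-bit offset is split around the register fields: offset[11:5] sits in bits 31:25 and offset[4:0] in bits 11:7. The sketch below assembles a store from its fields; encode_store and the funct3 constants are illustrative names.

```python
SB, SH, SW = 0b000, 0b001, 0b010   # funct3 values from the entries above

def encode_store(funct3: int, rs1: int, rs2: int, offset: int) -> int:
    """Assemble an S-type integer store instruction word."""
    if not -2048 <= offset < 2048:
        raise ValueError("offset must fit in 12 signed bits")
    imm = offset & 0xFFF                  # two's-complement bits 11:0
    return ((imm >> 5) << 25              # offset[11:5]
            | (rs2 & 0x1F) << 20          # rs2
            | (rs1 & 0x1F) << 15          # rs1
            | (funct3 & 0x7) << 12        # width selector
            | (imm & 0x1F) << 7           # offset[4:0]
            | 0b0100011)                  # opcode
```

As an example, encode_store(SW, rs1=2, rs2=10, offset=4) produces 0x00a12223, which corresponds to sw x10, 4(x2).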

sll rd, rs1, rs2 x[rd] = x[rs1] << x[rs2]

Shift Left Logical. R-type, RV32I and RV64I.
Shifts register x[rs1] left by x[rs2] bit positions. The vacated bits are filled with zeros, and
the result is written to x[rd]. The least-significant five bits of x[rs2] (or six bits for RV64I)
form the shift amount; the upper bits are ignored.
31 25 24 20 19 15 14 12 11 7 6 0
0000000 rs2 rs1 001 rd 0110011

slli rd, rs1, shamt x[rd] = x[rs1] << shamt

Shift Left Logical Immediate. I-type, RV32I and RV64I.
Shifts register x[rs1] left by shamt bit positions. The vacated bits are filled with zeros, and
the result is written to x[rd]. For RV32I, the instruction is only legal when shamt[5]=0.
Compressed form: c.slli rd, shamt
31 26 25 20 19 15 14 12 11 7 6 0
000000 shamt rs1 001 rd 0010011

slliw rd, rs1, shamt x[rd] = sext((x[rs1] << shamt)[31:0])

Shift Left Logical Word Immediate. I-type, RV64I only.
Shifts x[rs1] left by shamt bit positions. The vacated bits are filled with zeros, the result is
truncated to 32 bits, and the sign-extended 32-bit result is written to x[rd]. The instruction is
only legal when shamt[5]=0.
31 26 25 20 19 15 14 12 11 7 6 0
000000 shamt rs1 001 rd 0011011

sllw rd, rs1, rs2 x[rd] = sext((x[rs1] << x[rs2][4:0])[31:0])

Shift Left Logical Word. R-type, RV64I only.
Shifts the lower 32 bits of x[rs1] left by x[rs2] bit positions. The vacated bits are filled with
zeros, and the sign-extended 32-bit result is written to x[rd]. The least-significant five bits of
x[rs2] form the shift amount; the upper bits are ignored.
31 25 24 20 19 15 14 12 11 7 6 0
0000000 rs2 rs1 001 rd 0111011

slt rd, rs1, rs2 x[rd] = x[rs1] <s x[rs2]

Set if Less Than. R-type, RV32I and RV64I.
Compares x[rs1] and x[rs2] as two’s complement numbers, and writes 1 to x[rd] if x[rs1] is
smaller, or 0 if not.
31 25 24 20 19 15 14 12 11 7 6 0
0000000 rs2 rs1 010 rd 0110011

slti rd, rs1, immediate x[rd] = x[rs1] <s sext(immediate)

Set if Less Than Immediate. I-type, RV32I and RV64I.
Compares x[rs1] and the sign-extended immediate as two’s complement numbers, and writes
1 to x[rd] if x[rs1] is smaller, or 0 if not.
31 20 19 15 14 12 11 7 6 0
immediate[11:0] rs1 010 rd 0010011

sltiu rd, rs1, immediate x[rd] = x[rs1] <u sext(immediate)

Set if Less Than Immediate, Unsigned. I-type, RV32I and RV64I.
Compares x[rs1] and the sign-extended immediate as unsigned numbers, and writes 1 to x[rd]
if x[rs1] is smaller, or 0 if not.
31 20 19 15 14 12 11 7 6 0
immediate[11:0] rs1 011 rd 0010011

sltu rd, rs1, rs2 x[rd] = x[rs1] <u x[rs2]

Set if Less Than, Unsigned. R-type, RV32I and RV64I.
Compares x[rs1] and x[rs2] as unsigned numbers, and writes 1 to x[rd] if x[rs1] is smaller,
or 0 if not.
31 25 24 20 19 15 14 12 11 7 6 0
0000000 rs2 rs1 011 rd 0110011

sltz rd, rs1 x[rd] = (x[rs1] <s 0)

Set if Less Than Zero. Pseudoinstruction, RV32I and RV64I.
Writes 1 to x[rd] if x[rs1] is less than zero, or 0 if not. Expands to slt rd, rs1, x0.

snez rd, rs2 x[rd] = (x[rs2] ≠ 0)

Set if Not Equal to Zero. Pseudoinstruction, RV32I and RV64I.
Writes 0 to x[rd] if x[rs2] equals 0, or 1 if not. Expands to sltu rd, x0, rs2.
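
The four zero-comparison pseudoinstructions (seqz, snez, sltz, sgtz) all reduce to the set-if-less-than family. A sketch on 32-bit register values, with illustrative helper names, shows why each expansion works.

```python
XLEN = 32
MASK = (1 << XLEN) - 1

def signed(x: int) -> int:
    """Interpret the low XLEN bits of x as a two's complement number."""
    x &= MASK
    return x - (1 << XLEN) if x >> (XLEN - 1) else x

def slt(a: int, b: int) -> int:
    return int(signed(a) < signed(b))

def sltu(a: int, b: int) -> int:
    return int((a & MASK) < (b & MASK))

def sltiu(a: int, imm: int) -> int:
    # The immediate is sign-extended first, then compared as unsigned.
    return sltu(a, imm & MASK)

def seqz(a: int) -> int:
    return sltiu(a, 1)   # only 0 is below 1 when unsigned

def snez(a: int) -> int:
    return sltu(0, a)    # 0 is below every nonzero unsigned value

def sltz(a: int) -> int:
    return slt(a, 0)

def sgtz(a: int) -> int:
    return slt(0, a)
```

Note that sgtz(0xFFFFFFFF) is 0, because the pattern is -1 when read as two's complement.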

sra rd, rs1, rs2 x[rd] = x[rs1] >>s x[rs2]

Shift Right Arithmetic. R-type, RV32I and RV64I.
Shifts register x[rs1] right by x[rs2] bit positions. The vacated bits are filled with copies of
x[rs1]’s most-significant bit, and the result is written to x[rd]. The least-significant five bits
of x[rs2] (or six bits for RV64I) form the shift amount; the upper bits are ignored.
31 25 24 20 19 15 14 12 11 7 6 0
0100000 rs2 rs1 101 rd 0110011

srai rd, rs1, shamt x[rd] = x[rs1] >>s shamt

Shift Right Arithmetic Immediate. I-type, RV32I and RV64I.
Shifts register x[rs1] right by shamt bit positions. The vacated bits are filled with copies of
x[rs1]’s most-significant bit, and the result is written to x[rd]. For RV32I, the instruction is
only legal when shamt[5]=0.
Compressed form: c.srai rd, shamt
31 26 25 20 19 15 14 12 11 7 6 0
010000 shamt rs1 101 rd 0010011

sraiw rd, rs1, shamt x[rd] = sext(x[rs1][31:0] >>s shamt)

Shift Right Arithmetic Word Immediate. I-type, RV64I only.
Shifts the lower 32 bits of x[rs1] right by shamt bit positions. The vacated bits are filled with
copies of x[rs1][31], and the sign-extended 32-bit result is written to x[rd]. The instruction
is only legal when shamt[5]=0.
31 26 25 20 19 15 14 12 11 7 6 0
010000 shamt rs1 101 rd 0011011

sraw rd, rs1, rs2 x[rd] = sext(x[rs1][31:0] >>s x[rs2][4:0])

Shift Right Arithmetic Word. R-type, RV64I only.
Shifts the lower 32 bits of x[rs1] right by x[rs2] bit positions. The vacated bits are filled with
x[rs1][31], and the sign-extended 32-bit result is written to x[rd]. The least-significant five
bits of x[rs2] form the shift amount; the upper bits are ignored.
31 25 24 20 19 15 14 12 11 7 6 0
0100000 rs2 rs1 101 rd 0111011

sret ExceptionReturn(Supervisor)
Supervisor-mode Exception Return. R-type, RV32I and RV64I privileged architectures.
Returns from a supervisor-mode exception handler. Sets the pc to CSRs[sepc], the privilege
mode to CSRs[sstatus].SPP, CSRs[sstatus].SIE to CSRs[sstatus].SPIE, CSRs[sstatus].SPIE
to 1, and CSRs[sstatus].SPP to 0.
31 25 24 20 19 15 14 12 11 7 6 0
0001000 00010 00000 000 00000 1110011

srl rd, rs1, rs2 x[rd] = x[rs1] >>u x[rs2]

Shift Right Logical. R-type, RV32I and RV64I.
Shifts register x[rs1] right by x[rs2] bit positions. The vacated bits are filled with zeros, and
the result is written to x[rd]. The least-significant five bits of x[rs2] (or six bits for RV64I)
form the shift amount; the upper bits are ignored.
31 25 24 20 19 15 14 12 11 7 6 0
0000000 rs2 rs1 101 rd 0110011

srli rd, rs1, shamt x[rd] = x[rs1] >>u shamt

Shift Right Logical Immediate. I-type, RV32I and RV64I.
Shifts register x[rs1] right by shamt bit positions. The vacated bits are filled with zeros, and
the result is written to x[rd]. For RV32I, the instruction is only legal when shamt[5]=0.
Compressed form: c.srli rd, shamt
31 26 25 20 19 15 14 12 11 7 6 0
000000 shamt rs1 101 rd 0010011

srliw rd, rs1, shamt x[rd] = sext(x[rs1][31:0] >>u shamt)

Shift Right Logical Word Immediate. I-type, RV64I only.
Shifts the lower 32 bits of x[rs1] right by shamt bit positions. The vacated bits are filled with
zeros, and the sign-extended 32-bit result is written to x[rd]. The instruction is only legal
when shamt[5]=0.
31 26 25 20 19 15 14 12 11 7 6 0
000000 shamt rs1 101 rd 0011011

srlw rd, rs1, rs2 x[rd] = sext(x[rs1][31:0] >>u x[rs2][4:0])

Shift Right Logical Word. R-type, RV64I only.
Shifts the lower 32 bits of x[rs1] right by x[rs2] bit positions. The vacated bits are filled with
zeros, and the sign-extended 32-bit result is written to x[rd]. The least-significant five bits of
x[rs2] form the shift amount; the upper bits are ignored.
31 25 24 20 19 15 14 12 11 7 6 0
0000000 rs2 rs1 101 rd 0111011
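
The register shifts use only the low five bits of x[rs2] (six for RV64I), and the arithmetic and logical right shifts differ in what fills the vacated bits. The sketch below models sll, srl, sra, and srlw; the function names and the xlen parameter are illustrative.

```python
def sll(value: int, amount: int, xlen: int = 32) -> int:
    """Shift left; only log2(xlen) bits of the amount are used."""
    return (value << (amount & (xlen - 1))) & ((1 << xlen) - 1)

def srl(value: int, amount: int, xlen: int = 32) -> int:
    """Shift right, filling vacated bits with zeros."""
    return (value & ((1 << xlen) - 1)) >> (amount & (xlen - 1))

def sra(value: int, amount: int, xlen: int = 32) -> int:
    """Shift right, filling vacated bits with copies of the sign bit."""
    mask = (1 << xlen) - 1
    value &= mask
    if value >> (xlen - 1):
        value -= 1 << xlen            # reinterpret as negative
    return (value >> (amount & (xlen - 1))) & mask

def srlw(value: int, amount: int) -> int:
    """RV64 only: logical shift of the low 32 bits, then sign-extend."""
    result = (value & 0xFFFF_FFFF) >> (amount & 31)
    if result & 0x8000_0000:
        result |= 0xFFFF_FFFF_0000_0000
    return result
```

With xlen=32, sll(1, 33) returns 2 because the shift amount 33 is reduced to 1; with xlen=64 the same call shifts by the full 33 positions.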

sub rd, rs1, rs2 x[rd] = x[rs1] - x[rs2]

Subtract. R-type, RV32I and RV64I.
Subtracts register x[rs2] from register x[rs1] and writes the result to x[rd]. Arithmetic
overflow is ignored.
Compressed form: c.sub rd, rs2
31 25 24 20 19 15 14 12 11 7 6 0
0100000 rs2 rs1 000 rd 0110011

subw rd, rs1, rs2 x[rd] = sext((x[rs1] - x[rs2])[31:0])

Subtract Word. R-type, RV64I only.
Subtracts register x[rs2] from register x[rs1], truncates the result to 32 bits, and writes the
sign-extended result to x[rd]. Arithmetic overflow is ignored.
Compressed form: c.subw rd, rs2
31 25 24 20 19 15 14 12 11 7 6 0
0100000 rs2 rs1 000 rd 0111011

tail symbol pc = &symbol; clobber x[6]

Tail call. Pseudoinstruction, RV32I and RV64I.
Sets the pc to symbol, overwriting x[6] in the process. Expands to auipc x6, offsetHi then
jalr x0, offsetLo(x6).

wfi while (noInterruptsPending) idle

Wait for Interrupt. R-type, RV32I and RV64I privileged architectures.
Idles the processor to save energy if no enabled interrupts are currently pending.
31 25 24 20 19 15 14 12 11 7 6 0
0001000 00101 00000 000 00000 1110011

xor rd, rs1, rs2 x[rd] = x[rs1] ^ x[rs2]

Exclusive-OR. R-type, RV32I and RV64I.
Computes the bitwise exclusive-OR of registers x[rs1] and x[rs2] and writes the result to
x[rd].
Compressed form: c.xor rd, rs2
31 25 24 20 19 15 14 12 11 7 6 0
0000000 rs2 rs1 100 rd 0110011

xori rd, rs1, immediate x[rd] = x[rs1] ^ sext(immediate)

Exclusive-OR Immediate. I-type, RV32I and RV64I.
Computes the bitwise exclusive-OR of the sign-extended immediate and register x[rs1] and
writes the result to x[rd].
31 20 19 15 14 12 11 7 6 0
immediate[11:0] rs1 100 rd 0010011
Index

ABI, see also application binary interface
Add, 18, 119
  immediate, 18, 119
  immediate word, 86, 119
  upper immediate to PC, 123
  word, 86, 119
add, 18, see also c.add, 119
Add upper immediate to PC, 18
addi, see also Add immediate, see also c.addi16sp, see also c.addi4spn, see also c.addi, see also c.li, 119
addiw, see also Add immediate word, see also c.addiw, 119
addw, see also Add word, see also c.addw, 119
ALGOL, 116
Allen, Fran, 14
AMD64, 92
amoadd.d, see also Atomic Memory Operation Add Doubleword, 119
amoadd.w, see also Atomic Memory Operation Add Word, 120
amoand.d, see also Atomic Memory Operation And Doubleword, 120
amoand.w, see also Atomic Memory Operation And Word, 120
amomax.d, see also Atomic Memory Operation Maximum Doubleword, 120
amomax.w, see also Atomic Memory Operation Maximum Word, 120
amomaxu.d, see also Atomic Memory Operation Maximum Unsigned Doubleword, 121
amomaxu.w, see also Atomic Memory Operation Maximum Unsigned Word, 121
amomin.d, see also Atomic Memory Operation Minimum Doubleword, 121
amomin.w, see also Atomic Memory Operation Minimum Word, 121
amominu.d, see also Atomic Memory Operation Minimum Unsigned Doubleword, 121
amominu.w, see also Atomic Memory Operation Minimum Unsigned Word, 122
amoor.d, see also Atomic Memory Operation Or Doubleword, 122
amoor.w, see also Atomic Memory Operation Or Word, 122
amoswap.d, see also Atomic Memory Operation Swap Doubleword, 122
amoswap.w, see also Atomic Memory Operation Swap Word, 122
amoxor.d, see also Atomic Memory Operation Exclusive Or Doubleword, 123
amoxor.w, see also Atomic Memory Operation Exclusive Or Word, 123
And, 18, 123
  immediate, 18, 123
and, see also c.and, 123
andi, 18, see also And immediate, see also c.andi, 123
application binary interface, 18, 26, 33, 34, 48
Application Specific Integrated Circuits, 2
architecture, 8
ARM
  code size, 9, 92
  Cortex-A5, 7
  Cortex-A9, 8
  instruction reference manual
    number of pages, 12
  Load Multiple, 7, 8
  number of registers, 10
  Thumb, 8, 9
  Thumb-2, 8, 9
ARMv8, 92
ASIC, see also Application Specific Integrated Circuits
assembler directives, 35, 35
Atomic Memory Operation
  Add
    Doubleword, 86, 119
    Word, 60, 120
  And
    Doubleword, 86, 120
    Word, 60, 120
  Exclusive Or
    Doubleword, 86, 123
    Word, 60, 123
  Maximum
    Doubleword, 86, 120
    Word, 60, 120
  Maximum Unsigned
    Doubleword, 86, 121
    Word, 60, 121
  Minimum
    Doubleword, 86, 121
    Word, 60, 121
  Minimum Unsigned
    Doubleword, 86, 121
    Word, 60, 122
  Or
    Doubleword, 86, 122
    Word, 60, 122
  Swap
    Doubleword, 86, 122
    Word, 60, 122
auipc, see also Add upper immediate to PC, 123

backwards binary-compatibility, 4
Bell, C. Gordon, 46, 86
beq, see also Branch if equal, see also c.beqz, 124
beqz, 35, 124
bge, see also Branch if greater or equal, 124

bgeu, see also Branch if greater or c.sd, 86, see also sd, 131 yield, 7
equal unsigned, 124 c.sdsp, 86, see also sd, 131 div, 135
bgez, 35, 124 c.slli, see also slli, 131 Divide, 44, 135
bgt, 35, 124 c.srai, see also srai, 132 unsigned, 44, 135
bgtu, 35, 124 c.srli, see also srli, 132 unsigned word, 86, 136
bgtz, 35, 125 c.sub, see also sub, 132 using shift right, 44
ble, 35, 125 c.subw, 86, see also subw, 132 word, 86, 136
bleu, 35, 125 c.sw, see also sw, 132 divu, see also Divide unsigned, 135
blez, 35, 125 c.swsp, see also sw, 132 divuw, see also Divide unsigned
blt, see also Branch if less than, 125 c.xor, see also xor, 133 word, 136
bltu, see also Branch if less than un- call, 35, 133 divw, see also Divide word, 136
signed, 125 Callee saved registers, 34 dynamic linking, 41
bltz, 35, 125 Caller saved registers, 34 dynamic register typing, 74, 90
bne, see also Branch if not equal, see Calling conventions, 32
also c.bnez, 126 Chanel, Coco, 118 ease of programming, compiling,
bnez, 35, 126 chip, see also die, 7 and linking, see also instruction set
Branch Compilers architecture, principles of design,
if equal, 21, 124 Turing Award, 116 ease of programming, compiling,
if greater or equal, 21, 124 context switch, 75 and linking
if greater or equal unsigned, 21, Control and Status Register ebreak, 136
124 read and clear, 22, 133 ecall, 136
if less than, 21, 125 read and clear immediate, 22, 134 Einstein, Albert, 60
if less than unsigned, 21, 125 read and set, 22, 134 ELF, see also executable and link-
if not equal, 21, 126 read and set immediate, 22, 134 able format
branch prediction, 8, 18 read and write, 22, 134 endianness, 21
Brooks, Fred, 113 read and write immediate, 22, 134 epilogue, see also function epilogue
Browning, Robert, 55 CoreMark benchmark, 8 Exception, 101
cost, see also instruction set archi- Exception Return
tecture, principles of design, cost Machine, 104, 154
c.add, see also add, 126 Supervisor, 108, 163
Cray, Seymour, 72, 93
c.addi, see also addi, 126 Exclusive Or, 18, 165
csrc, 35, 133
c.addi16sp, see also addi, 126 immediate, 18, 165
csrci, 35, 133
c.addi4spn, see also addi, 126 csrr, 35, 133 executable and linkable format, 35
c.addiw, 86, see also addiw, 127 csrrc, see also Control and Status
c.addw, 86, see also addw, 127 Register read and clear, 133 fabs.d, 35, 136
c.and, see also and, 127 csrrci, see also Control and Status fabs.s, 35, 136
c.andi, see also andi, 127 Register read and clear immediate, fadd.d, see also Floating-point Add
c.beqz, see also beq, 127 134 double-precision, 137
c.bnez, see also bne, 127 csrrs, see also Control and Status fadd.s, see also Floating-point Add
c.ebreak, see also ebreak, 128 Register read and set, 134 single-precision, 137
c.fld, see also fld, 128 csrrsi, see also Control and Status fclass.d, see also Floating-point
c.fldsp, see also fld, 128 Register read and set immediate, Classify double-precision, 137
c.flw, see also flw, 128 134 fclass.s, see also Floating-point
c.flwsp, see also flw, 128 csrrw, see also Control and Status Classify single-precision, 137
c.fsd, see also fsd, 128 Register read and write, 134 fcvt.d.l, see also Floating-point Con-
c.fsdsp, see also fsd, 129 csrrwi, see also Control and Status vert double from long, 138
c.fsw, see also fsw, 129 Register read and write immediate, fcvt.d.lu, see also Floating-point
c.fswsp, see also fsw, 129 134 Convert double from long unsigned,
c.j, see also jal, 129 csrs, 35, 135 138
c.jal, see also jal, 129 csrsi, 35, 135 fcvt.d.s, see also Floating-point
c.jalr, see also jalr, 129 csrw, 35, 135 Convert double from single, 138
c.jr, see also jalr, 130 csrwi, 35, 135 fcvt.d.w, see also Floating-point
c.ld, 86, see also ld, 130 Convert double from word, 138
c.ldsp, 86, see also ld, 130 da Vinci, Leonardo, 2 fcvt.d.wu, see also Floating-point
c.li, see also addi, 130 data-level parallelism, 72 Convert double from word
c.lui, see also lui, 130 de Saint Exupéry L’Avion, Antoine, unsigned, 138
c.lw, see also lw, 130 48 fcvt.l.d, see also Floating-point Con-
c.lwsp, see also lw, 131 delay slot, 8 vert long from double, 139
c.mv, see also add, 131 delayed branch, 8 fcvt.l.s, see also Floating-point Con-
c.or, see also or, 131 die, see also chip, 7 vert long from single, 139

fcvt.lu.d, see also Floating-point standard, 48 Fused negative multiply-subtract

Convert long unsigned from double, sign-injection, 54 double-precision, 48, 147
139 static rounding mode, 49 single-precision, 48, 148
fcvt.lu.s, see also Floating-point Floating-point, 48 half precision, 56
Convert long unsigned from single, Add Less or Equals
139 double-precision, 48, 137 double-precision, 48, 143
fcvt.s.d, see also Floating-point single-precision, 48, 137 single-precision, 48, 143
Convert single from double, 139 binary128, 56 Less Than
fcvt.s.l, see also Floating-point Con- binary16, 56 double-precision, 48, 143
vert single from long, 140 binary256, 56 single-precision, 48, 143
fcvt.s.lu, see also Floating-point binary32, 56 Load
Convert single from long unsigned, binary64, 56 doubleword, 48, 142
140 Classify word, 48, 143
fcvt.s.w, 140 double-precision, 48, 137 Maximum
fcvt.s.wu, see also Floating-point single-precision, 48, 137 double-precision, 48, 144
Convert single from word unsigned, Convert single-precision, 48, 144
140 double from long, 48, 86, 138 Minimum
fcvt.w.d, see also Floating-point double from long unsigned, 48, double-precision, 48, 144
Convert word from double, 140 86, 138 single-precision, 48, 145
fcvt.w.s, see also Floating-point double from single, 48, 138 Move
Convert word from single, 141 double from word, 48, 138 doubleword from integer, 48,
fcvt.wu.d, see also Floating-point double from word unsigned, 48, 146
Convert word unsigned from dou- 138 doubleword to integer, 48, 146
ble, 141 long from double, 48, 86, 139
word from integer, 48, 146
fcvt.wu.s, see also Floating-point long from single, 48, 86, 139
word to integer, 48, 146
Convert word unsigned from single, long unsigned from double, 48,
Multiply
141 86, 139
double-precision, 48, 145
fdiv.d, see also Floating-point Di- long unsigned from single, 48,
vide double-precision, 141 single-precision, 48, 145
86, 139
fdiv.s, see also Floating-point Divide octuple precision, 56
single from double, 48, 139
single-precision, 141 quadruple precision, 56
single from long, 48, 86, 140
Fence Sign-inject
single from long unsigned, 48,
Instruction Stream, 142 86, 140 double-precision, 48, 149
Memory and I/O, 142 single from word unsigned, 48, single-precision, 48, 149
Virtual Memory, 113, 159 140 Sign-inject negative
fence, 35, see also Fence Memory word from double, 48, 140 double-precision, 48, 149
and I/O, 142 word from single, 48, 141 single-precision, 48, 149
fence.i, see also Fence Instruction word unsigned from double, 48, Sign-inject XOR
Stream, 142 141 double-precision, 48, 150
feq.d, see also Floating-point Equals word unsigned from single, 48, single-precision, 48, 150
double-precision, 142 141 Square root
feq.s, see also Floating-point Equals decimal128, 56 double-precision, 48, 150
single-precision, 142 decimal32, 56 single-precision, 48, 150
Field-Programmable Gate Array, 2 decimal64, 56 Store
fld, see also c.fldsp, see also c.fld, Divide doubleword, 48, 148
see also Floating-point load double- double-precision, 48, 141 word, 48, 151
word, 142 single-precision, 48, 141 Subtract
fle.d, see also Floating-point Less Equals double-precision, 48, 151
or Equals double-precision, see also double-precision, 48, 142 single-precision, 48, 151
Floating-point Less Than double- single-precision, 48, 142 floating-point
precision, 143 Fused multiply-add control and status register, 48
fle.s, see also Floating-point Less double-precision, 48, 144 flt.d, 143
or Equals single-precision, see also single-precision, 48, 144 flt.s, 143
Floating-point Less Than single- Fused multiply-subtract flw, see also c.flwsp, see also c.flw,
precision, 143 double-precision, 48, 145 see also Floating-point load word,
Floating-Point single-precision, 48, 145 143
dynamic rounding mode, 49 Fused negative multiply-add fmadd.d, see also Floating-
fused multiply-add, 53 double-precision, 48, 147 point fused multiply-add double-
IEEE 754-2008 floating-point single-precision, 48, 147 precision, 144

fmadd.s, see also Floating- Sign-inject double-precision, 149 ease of programming, compil-
point fused multiply-add single- fsgnj.s, see also Floating-point Sign- ing, and linking, 10, 17, 18, 21, 24,
precision, 144 inject single-precision, 149 54, 72, 74–76, 81, 90, 91, 93, 105,
fmax.d, see also Floating-point max- fsgnjn.d, see also Floating-point 113
imum double-precision, see also Sign-inject negative double- isolation of architecture from
Floating-point maximum single- precision, 149 implementation, 8, 24, 72, 81, 101,
precision, 144 fsgnjn.s, see also Floating-point 117
fmax.s, 144 Sign-inject negative single- performance, 7, 14, 17, 24, 45,
fmin.d, see also Floating-point min- precision, 149 48, 53, 55, 61, 72, 76, 78, 79, 81,
imum double-precision, 144 fsgnjx.d, see also Floating-point 90, 92
fmin.s, see also Floating-point mini- Sign-inject XOR double-precision, program size, 9, 24, 64, 66, 90,
mum single-precision, 145 150 92, 93
fmsub.d, see also Floating-point fsgnjx.s, see also Floating-point room for growth, 8, 17, 24, 93
fused multiply-subtract double- Sign-inject XOR single-precision, simplicity, 7, 11, 12, 18, 20–24,
precision, 145 150 55, 61, 64, 73, 81, 105, 108, 113,
fmsub.s, see also Floating-point fsqrt.d, see also Floating-point 117
fused multiply-subtract single- Square Root double-precision, 150 mistakes of the past, 24
precision, 145 fsqrt.s, see also Floating-point modularity, 5
fmul.d, see also Floating-point Mul- Square Root single-precision, 150 open, 2
tiply double-precision, 145 fsrm, 35, 150 principles of design
fmul.s, see also Floating-point Mul- fsub.d, see also Floating-point Sub- cost, 42
tiply single-precision, 145, see tract double-precision, 151 ease of programming, compil-
also Floating-point Subtract single- fsub.s, 151 ing, and linking, 40, 42, 116
precision fsw, see also c.fswsp, see also c.fsw, performance, 32, 42, 116, 117
fmv.d, 35, 146 see also Floating-point store word, room for growth, 117
fmv.d.x, see also Floating-point 151 simplicity, 35
move doubleword from integer, 146 function epilogue, 35 Interrupt, 103
fmv.s, 35, 146 function prologue, 33 ISA, see instruction set architecture
fmv.w.x, see also Floating-point Fused multiply-add, 53 isolation of architecture from imple-
move word from integer, 146 mentation, see also instruction set
gather, 76 architecture, principles of design,
fmv.x.d, see also Floating-point
isolation of architecture from imple-
move doubleword to integer, 146
Hart, 101 mentation
fmv.x.w, see also Floating-point
Itanium, 91
move word to integer, 146
IEEE 754-2008 floating-point stan-
fneg.d, 35, 147
dard, 48 j, 35, 151
fneg.s, 35, 147 Illiac IV, 80 jal, 35, see also c.jal, see also c.j, see
fnmadd.d, see also Floating-point implementation, 8 also Jump and link, 151
fused negative multiply-add double- Instruction diagram jalr, 35, see also c.jalr, see also c.jr,
precision, see also Floating-point Privileged instructions, 101 see also Jump and link register, 152
fused negative multiply-add single- RV32A, 60 Johnson, Kelly, 42
precision, 147 RV32C, 64 jr, 35, 152
fnmadd.s, 147 RV32D, 48 Jump and link, 22, 151
fnmsub.d, see also Floating-point RV32F, 48 register, 22, 152
fused negative multiply-subtract RV32I, 14
double-precision, 147 RV32M, 44 la, 152
fnmsub.s, see also Floating-point RV64A, 86 lb, see also Load byte, 152
fused negative multiply-subtract RV64C, 86 lbu, see also Load byte unsigned,
single-precision, 148 RV64D, 86 152
FPGA, see also Field- RV64F, 86 ld, see also c.ldsp, see also c.ld, see
Programmable Gate Array, 2 RV64I, 86 also Load doubleword, 153
frcsr, 35, 148 RV64M, 86 leaf function, 33
frflags, 35, 148 instruction set architecture, 2 lh, see also Load halfword, 153
frrm, 35, 148 backwards binary-compatibility, 4 lhu, see also Load halfword un-
fscsr, 35, 148 elegance, 12, 24, 42, 67, 81, 93, signed, 153
fsd, see also c.fsdsp, see also c.fsd, 117 li, 35, 153
see also Floating-point store dou- incremental, 4 Lindy effect, 24
bleword, 148 metrics of design, 5 linker relaxation, 41
fsflags, 35, 149 cost, 5, 14, 17, 20, 21, 24, 46, 65, little-endian, 21
fsgnj.d, see also Floating-point 92, 112 lla, 153

Load Occam, William of, 44 fscsr, 35, 148

byte, 20, 152 Or, 18, 156 fsflags, 35, 149
byte unsigned, 20, 152 immediate, 18, 156 fsrm, 35, 150
doubleword, 86, 153 or, see also c.or, 156 j, 35, 151
halfword, 20, 153 ori, 18, see also Or immediate, 156 jr, 35, 152
halfword unsigned, 20, 153 out-of-order processors, 17 la, 152
reserved li, 35, 153
doubleword, 86, 153 Page, 109 lla, 153
word, 60, 154 Page fault, 109 mv, 35, 155
upper immediate, 18, 154 Page table, 109 neg, 35, 156
word, 20, 154 Pascal, Blaise, 66 negw, 35, 156
word unsigned, 20, 154 performance, see also instruction set nop, 35, 156
Load upper immediate, 18 architecture, principles of design, not, 35, 156
lr.d, see also Load reserved double- performance rdcycle, 35, 156
word, 153 CoreMark benchmark, 8 rdcycleh, 35, 157
lr.w, see also Load reserved word, equation, 7 rdinstret, 35, 157
154 Perlis, Alan, 116 rdinstreth, 35, 157
lui, see also c.lui, see also Load PIC, see also position independent rdtime, 35, 157
upper immediate, 154 code, 40 rdtimeh, 35, 157
lw, see also c.lwsp, see also c.lw, see pipelined processor, 18 ret, 35, 158
also Load word, 154 position independent code, 10, 21, seqz, 35, 159
lwu, see also Load word unsigned, 40, 91 sext.w, 35, 159
154 Privilege mode, 100 sgtz, 35, 159
Machine mode, 101 sltz, 35, 162
User mode, 105 snez, 35, 162
Machine mode, 101
program size, see also instruction set tail, 35, 164
macrofusion, 7, 7, 66
architecture, principles of design,
metrics of ISA design, see instruc-
program size
tion set architecture rdcycle, 35, 156
Programming languages
metrics of design, 5 rdcycleh, 35, 157
Turing Award, 116
microMIPS, 55 rdinstret, 35, 157
prologue, see also function prologue
MIPS rdinstreth, 35, 157
Pseudoinstruction, 35
assembler, 20 rdtime, 35, 157
beqz, 35, 124
delayed branch, 8, 21, 26, 56, 94 rdtimeh, 35, 157
bgez, 35, 124
delayed load, 21, 26, 94 registers
bgt, 35, 124
MIPS MSA, 80 number of, 10
bgtu, 35, 124
MIPS-IV, 92 bgtz, 35, 125 rem, see also Remainder, 157
Moore’s Law, 2 ble, 35, 125 Remainder, 44, 157
mret, see also Exception Return Ma- bleu, 35, 125 unsigned, 44, 157
chine, 154 blez, 35, 125 unsigned word, 86, 158
mul, see also Multiply, 155 bltz, 35, 125 word, 86, 158
mulh, see also Multiply high, 155 bnez, 35, 126 remu, see also Remainder unsigned,
mulhsu, see also Multiply high call, 35, 133 157
signed-unsigned, 155 csrc, 35, 133 remuw, see also Remainder un-
mulhu, see also Multiply high un- csrci, 35, 133 signed word, 158
signed, 155 csrr, 35, 133 remw, see also Remainder word, 158
Multiply, 45, 155 csrs, 35, 135 ret, 35, 158
high, 45, 155 csrsi, 35, 135 RISC-V
high signed-unsigned, 45, 155 csrw, 35, 135 application binary interface, 18,
high unsigned, 45, 155 csrwi, 35, 135 26, 33, 34, 48
multi-word, 46 fabs.d, 35, 136 assembler directives, 35
using shift left, 45 fabs.s, 35, 136 BOOM, 8
word, 86, 155 fence, 35 Calling conventions, 35
mulw, see also Multiply word, 155 fmv.d, 35, 146 code size, 9, 92
mv, 35, 155 fmv.s, 35, 146 Foundation, 2
fneg.d, 35, 147 function epilogue, 35
neg, 35, 156 fneg.s, 35, 147 function prologue, 33
negw, 35, 156 frcsr, 35, 148 heap region, 40
nop, 35, 156 frflags, 35, 148 instruction reference manual
not, 35, 156 frrm, 35, 148 number of pages, 12

instruction set naming scheme, 5 sd, see also c.sdsp, see also c.sd, see srai, see also c.srai, see also Shift
lessons learned, 24 also Store doubleword, 159 right arithmetic immediate, 162
Linker, 40 seqz, 35, 159 sraiw, see also Shift right arithmetic
Loader, 42 Set less than, 18, 161 immediate word, 162
long, 90 immediate, 18, 161 sraw, see also Shift right arithmetic
macrofusion, 7 immediate unsigned, 18, 161 word, 163
memory allocation, 40 unsigned, 18, 161 sret, see also Exception Return Su-
modularity, 5 seven metrics of ISA design, see in- pervisor, 163
number of registers, 10 struction set architecture srl, see also Shift right logical, 163
pseudoinstruction, 35 metrics of design, 5 srli, see also c.srli, see also Shift
pseudoinstructions, 10 sext.w, 35, 159 right logical immediate, 163
Rocket, 7 sfence.vma, see also Fence Virtual srliw, see also Shift right logical im-
RV128, 93 Memory, 159 mediate word, 163
RV32A, 60 sgtz, 35, 159 srlw, see also Shift right logical
RV32C, 9, 11, 64 sh, see also Store halfword, 160 word, 164
RV32D, 48 Shift static linking, 41
RV32F, 48 left logical, 18, 160 Store
RV32G, 9, 11 left logical immediate, 18, 160 byte, 20, 158
RV32I, 14 left logical immediate word, 86, conditional
RV32M, 44 160 doubleword, 86, 158
RV32V, 11, 72 left logical word, 86, 161 word, 60, 159
RV64A, 86 right arithmetic, 18, 162 doubleword, 86, 159
RV64C, 86, 92 right arithmetic immediate, 18, halfword, 20, 160
RV64D, 86 162 word, 20, 160
RV64F, 86 strip mining, 79
right arithmetic immediate word,
RV64G, 11 sub, see also Subtract, 18, see also
86, 162
RV64I, 86 c.sub, see also Subtract, 164
right arithmetic word, 86, 163
RV64M, 86 Subtract, 18, 164
right logical, 18, 163
saved registers, 33 word, 86, 164
right logical immediate, 18, 163
stack region, 40 subw, see also c.subw, see also Sub-
right logical immediate word, 86,
static region, 40 tract word, 164
163
temporary registers, 33 superscalar, 2, 8, 66
right logical word, 86, 164
text region, 40 Sutherland, Ivan, 32
SIMD, see also Single Instruction
sw, see also c.swsp, see also c.sw,
RISC-V ABI, see RISC-V Multiple Data, 11
see also Store word, 160
Application Binary Interface, 41 simplicity, see also instruction set
RISC-V Application Binary Inter- architecture, principles of design, tail, 35, 164
face simplicity Thoreau, Henry David, 117
ilp32, 41 Single Instruction Multiple Data, 2, Thumb-2, 55
ilp32d, 41 11, 72 TLB, 113
ilp32f, 41 sll, see also Shift left logical, 160 TLB shootdown, 113
lp64, 90 slli, see also c.slli, see also Shift left Translation Lookaside Buffer, 113
lp64d, 90 logical immediate, 160 Turing Award
lp64f, 90 slliw, see also Shift left logical im- Allen, Fran, 14
RISC-V Foundation, 2 mediate word, 160 Brooks, Fred, 113
room for growth, see also instruc- sllw, see also Shift left logical word, Dijkstra, Edsger W., 100
tion set architecture, principles of 161 Perlis, Alan, 116
design, room for growth slt, see also Set less than, 161 Sutherland, Ivan, 32
RV128, 93 slti, see also Set less than immediate, Wirth, Niklaus, 66
RV32C, 55 161
RV32V, 80 sltiu, see also Set less than immedi- User mode, 105
ate unsigned, 161
Santayana, George, 24 sltu, see also Set less than unsigned, Vector
sb, see also Store byte, 158 161 gather, 76
sc.d, see also Store conditional dou- sltz, 35, 162 indexed load, 76
bleword, 158 Small is Beautiful, 64 indexed store, 76
sc.w, see also Store conditional Smith, Jim, 81 scatter, 76
word, 159 snez, 35, 162 strided load, 75
scatter, 76 sra, see also Shift right arithmetic, strided store, 75
Schumacher, E. F., 64 162 strip-mining, 79

vectorizable, 76, 81 Wirth, Niklaus, 66 x86-32 AVX2, 80

Vector architecture, 72 x86-64
context switch, 75 x86 AMD64, 91
dynamic register typing, 74, 90 aaa instruction, 4 XLEN, 101
type encoding, 74 aad instruction, 4 xor, 18, see also c.xor, see also Ex-
vectorizable, 76, 81 aam instruction, 4 clusive Or, 165
Virtual address, 109 aas instruction, 4
xor properties, 20
Virtual memory, 109 code size, 9, 92
xor register exchange, 20
von Neumann architecture, 11 enter instruction, 7
instruction reference manual xori, 18, see also Exclusive Or im-
von Neumann, John, 11
number of pages, 12 mediate, 165
Wait for Interrupt, 105, 164 ISA growth, 2
wfi, see also Wait for Interrupt, 164 number of registers, 10
William of Occam, 44 position independent code, 10 yield, 7
