The RISC-V Reader
I like RISC-V and this book as they are elegant—brief, to the point, and complete. The
book’s commentaries provide a gratuitous history, motivation, and architecture critique.
—C. Gordon Bell, Microsoft and designer of the Digital PDP-11 and VAX-11
instruction set architectures
This book tells what RISC-V can do and why its designers chose to endow it with those
abilities. Even more interesting, the authors tell why RISC-V omits things found in earlier
machines. The reasons are at least as interesting as RISC-V’s endowments and omissions.
—Ivan Sutherland, Turing Award laureate called the father of computer graphics
RISC-V will change the world, and this book will help you become part of that change.
—Professor Michael B. Taylor, University of Washington
RISC-V is a fine choice for students to learn about instruction set architecture and
assembly-level programming, the basic underpinnings for later work in higher-level languages.
This clearly-written book offers a good introduction to RISC-V, augmented with
insightful comments on its evolutionary history and comparisons with other familiar
architectures. Drawing on past experience with other architectures, RISC-V designers were
able to avoid unnecessary, often irregular features, yielding easy pedagogy. Although simple,
it is still powerful enough for widespread use in real applications. Long ago, I used
to teach a first course in assembly programming and if I were doing that now, I’d happily
use this book.
—John Mashey, one of the designers of the MIPS instruction set architecture
This book will be an invaluable reference for anyone working with the RISC-V ISA. The
opcodes are presented in several useful formats for quick reference, making assembly
coding and interpretation easy. In addition, the explanations and examples of how to use
the ISA make the programmer’s job even simpler. The comparisons with other ISAs are
interesting and demonstrate why the RISC-V creators made the design decisions they did.
—Megan Wachs, PhD, SiFive Engineer
Open Reference Card ①
Base Integer Instructions: RV32I and RV64I RV Privileged Instructions
Category Name Fmt RV32I Base +RV64I Category Name Fmt RV mnemonic
Shifts Shift Left Logical R SLL rd,rs1,rs2 SLLW rd,rs1,rs2 Trap Mach-mode trap return R MRET
Shift Left Log. Imm. I SLLI rd,rs1,shamt SLLIW rd,rs1,shamt Supervisor-mode trap return R SRET
Shift Right Logical R SRL rd,rs1,rs2 SRLW rd,rs1,rs2 Interrupt Wait for Interrupt R WFI
Shift Right Log. Imm. I SRLI rd,rs1,shamt SRLIW rd,rs1,shamt MMU Virtual Memory FENCE R SFENCE.VMA rs1,rs2
Shift Right Arithmetic R SRA rd,rs1,rs2 SRAW rd,rs1,rs2 Examples of the 60 RV Pseudoinstructions
Shift Right Arith. Imm. I SRAI rd,rs1,shamt SRAIW rd,rs1,shamt Branch = 0 (BEQ rs,x0,imm) J BEQZ rs,imm
Arithmetic ADD R ADD rd,rs1,rs2 ADDW rd,rs1,rs2 Jump (uses JAL x0,imm) J J imm
ADD Immediate I ADDI rd,rs1,imm ADDIW rd,rs1,imm MoVe (uses ADDI rd,rs,0) R MV rd,rs
SUBtract R SUB rd,rs1,rs2 SUBW rd,rs1,rs2 RETurn (uses JALR x0,ra,0) I RET
Load Upper Imm U LUI rd,imm Optional Compressed (16-bit) Instruction Extension: RV32C
Add Upper Imm to PC U AUIPC rd,imm Category Name Fmt RVC RISC-V equivalent
Logical XOR R XOR rd,rs1,rs2 Loads Load Word CL C.LW rd′,rs1′,imm LW rd′,rs1′,imm*4
XOR Immediate I XORI rd,rs1,imm Load Word SP CI C.LWSP rd,imm LW rd,sp,imm*4
OR R OR rd,rs1,rs2 Float Load Word CL C.FLW rd′,rs1′,imm FLW rd′,rs1′,imm*4
OR Immediate I ORI rd,rs1,imm Float Load Word SP CI C.FLWSP rd,imm FLW rd,sp,imm*4
AND R AND rd,rs1,rs2 Float Load Double CL C.FLD rd′,rs1′,imm FLD rd′,rs1′,imm*8
AND Immediate I ANDI rd,rs1,imm Float Load Double SP CI C.FLDSP rd,imm FLD rd,sp,imm*8
Compare Set < R SLT rd,rs1,rs2 Stores Store Word CS C.SW rs1′,rs2′,imm SW rs1′,rs2′,imm*4
Set < Immediate I SLTI rd,rs1,imm Store Word SP CSS C.SWSP rs2,imm SW rs2,sp,imm*4
Set < Unsigned R SLTU rd,rs1,rs2 Float Store Word CS C.FSW rs1′,rs2′,imm FSW rs1′,rs2′,imm*4
Set < Imm Unsigned I SLTIU rd,rs1,imm Float Store Word SP CSS C.FSWSP rs2,imm FSW rs2,sp,imm*4
Branches Branch = B BEQ rs1,rs2,imm Float Store Double CS C.FSD rs1′,rs2′,imm FSD rs1′,rs2′,imm*8
Branch ≠ B BNE rs1,rs2,imm Float Store Double SP CSS C.FSDSP rs2,imm FSD rs2,sp,imm*8
Branch < B BLT rs1,rs2,imm Arithmetic ADD CR C.ADD rd,rs1 ADD rd,rd,rs1
Branch ≥ B BGE rs1,rs2,imm ADD Immediate CI C.ADDI rd,imm ADDI rd,rd,imm
Branch < Unsigned B BLTU rs1,rs2,imm ADD SP Imm * 16 CI C.ADDI16SP x0,imm ADDI sp,sp,imm*16
Branch ≥ Unsigned B BGEU rs1,rs2,imm ADD SP Imm * 4 CIW C.ADDI4SPN rd',imm ADDI rd',sp,imm*4
Jump & Link J&L J JAL rd,imm SUB CR C.SUB rd,rs1 SUB rd,rd,rs1
Jump & Link Register I JALR rd,rs1,imm AND CR C.AND rd,rs1 AND rd,rd,rs1
Synch Synch thread I FENCE AND Immediate CI C.ANDI rd,imm ANDI rd,rd,imm
Synch Instr & Data I FENCE.I OR CR C.OR rd,rs1 OR rd,rd,rs1
Environment CALL I ECALL eXclusive OR CR C.XOR rd,rs1 XOR rd,rd,rs1
BREAK I EBREAK MoVe CR C.MV rd,rs1 ADD rd,rs1,x0
Load Immediate CI C.LI rd,imm ADDI rd,x0,imm
Control Status Register (CSR) Load Upper Imm CI C.LUI rd,imm LUI rd,imm
Read/Write I CSRRW rd,csr,rs1 Shifts Shift Left Imm CI C.SLLI rd,imm SLLI rd,rd,imm
Read & Set Bit I CSRRS rd,csr,rs1 Shift Right Ari. Imm. CI C.SRAI rd,imm SRAI rd,rd,imm
Read & Clear Bit I CSRRC rd,csr,rs1 Shift Right Log. Imm. CI C.SRLI rd,imm SRLI rd,rd,imm
Read/Write Imm I CSRRWI rd,csr,imm Branches Branch=0 CB C.BEQZ rs1′,imm BEQ rs1',x0,imm
Read & Set Bit Imm I CSRRSI rd,csr,imm Branch≠0 CB C.BNEZ rs1′,imm BNE rs1',x0,imm
Read & Clear Bit Imm I CSRRCI rd,csr,imm Jump Jump CJ C.J imm JAL x0,imm
Jump Register CR C.JR rs1 JALR x0,rs1,0
Jump & Link J&L CJ C.JAL imm JAL ra,imm
Loads Load Byte I LB rd,rs1,imm Jump & Link Register CR C.JALR rs1 JALR ra,rs1,0
Load Halfword I LH rd,rs1,imm System Env. BREAK CI C.EBREAK EBREAK
Load Byte Unsigned I LBU rd,rs1,imm +RV64I Optional Compressed Extension: RV64C
Load Half Unsigned I LHU rd,rs1,imm LWU rd,rs1,imm All RV32C (except C.JAL, 4 word loads, 4 word stores) plus:
Load Word I LW rd,rs1,imm LD rd,rs1,imm ADD Word (C.ADDW) Load Doubleword (C.LD)
Stores Store Byte S SB rs1,rs2,imm ADD Imm. Word (C.ADDIW) Load Doubleword SP (C.LDSP)
Store Halfword S SH rs1,rs2,imm SUBtract Word (C.SUBW) Store Doubleword (C.SD)
Store Word S SW rs1,rs2,imm SD rs1,rs2,imm Store Doubleword SP (C.SDSP)
32-bit Instruction Formats 16-bit (RVC) Instruction Formats
R CR
I CI
S CSS
B CIW
U CL
J CS
CB
CJ
RISC-V Integer Base (RV32I/64I), privileged, and optional RV32/64C. Registers x1-x31 and the PC are 32 bits wide in RV32I and 64 in
RV64I (x0=0). RV64I adds 12 instructions for the wider data. Every 16-bit RVC instruction maps to an existing 32-bit RISC-V instruction.
Open Reference Card ②
Optional Multiply-Divide Instruction Extension: RVM Optional Vector Extension: RVV
Category Name Fmt RV32M (Multiply-Divide) +RV64M Name Fmt RV32V/RV64V
Multiply MULtiply R MUL rd,rs1,rs2 MULW rd,rs1,rs2 SET Vector Len. R SETVL rd,rs1
MULtiply High R MULH rd,rs1,rs2 MULtiply High R VMULH rd,rs1,rs2
MULtiply High Sign/Uns R MULHSU rd,rs1,rs2 REMainder R VREM rd,rs1,rs2
MULtiply High Uns R MULHU rd,rs1,rs2 Shift Left Log. R VSLL rd,rs1,rs2
Divide DIVide R DIV rd,rs1,rs2 DIVW rd,rs1,rs2 Shift Right Log. R VSRL rd,rs1,rs2
DIVide Unsigned R DIVU rd,rs1,rs2 Shift R. Arith. R VSRA rd,rs1,rs2
Remainder REMainder R REM rd,rs1,rs2 REMW rd,rs1,rs2 LoaD I VLD rd,rs1,imm
REMainder Unsigned R REMU rd,rs1,rs2 REMUW rd,rs1,rs2 LoaD Strided R VLDS rd,rs1,rs2
October 4, 2017
Copyright 2017 Strawberry Canyon LLC. All rights reserved.
No part of this book or its related materials may be reproduced in any form without the
written consent of the copyright holder.
The cover background is a photo of the Mona Lisa. It is a portrait of Lisa Gherardini,
painted between 1503 and 1506 by Leonardo da Vinci. The King of France bought it
from Leonardo in about 1530, and it has been on display at the Louvre Museum in Paris
since 1797. The Mona Lisa is considered the best-known work of art in the world. Mona Lisa
represents elegance, which we believe is a feature of RISC-V.
Both the print book and ebook were prepared with LaTeX, tex4ht, and Ruby scripts
that use Nokogiri (based on libxml2) to massage the XHTML output and HTTParty to
automatically keep the GitHub Gists and screencast URIs up-to-date in the text. The necessary
Makefiles, style files and most of the scripts are available under the BSD License at
http://github.com/armandofox/latex2ebook.
Arthur Klepchukov designed the covers and graphics for all versions.
Publisher’s Cataloging-in-Publication
About the Authors
David Patterson retired after 40 years as a Professor of Computer Science at UC Berkeley in
2016, and then joined Google as a distinguished engineer. He also serves as Vice-Chair of the
Board of Directors of the RISC-V Foundation. In the past, he was named Chair of Berkeley’s
Computer Science Division and was elected to be Chair of the Computing Research Association
and President of the Association for Computing Machinery. In the 1980s, he led four
generations of Reduced Instruction Set Computer (RISC) projects, which inspired Berkeley’s
latest RISC to be named “RISC Five.” Along with Andrew Waterman, he was one of the four
architects of RISC-V. Beyond RISC, his best-known research projects are Redundant Arrays
of Inexpensive Disks (RAID) and Networks of Workstations (NOW). This research led to
many papers, 7 books, and more than 35 honors, including election to the National Academy
of Engineering, the National Academy of Sciences, and the Silicon Valley Engineering Hall
of Fame as well as being named a Fellow of the Computer History Museum, ACM, IEEE, and
both AAAS organizations. His teaching awards include the Distinguished Teaching Award
(UC Berkeley), the Karlstrom Outstanding Educator Award (ACM), the Mulligan Education
Medal (IEEE), and the Undergraduate Teaching Award (IEEE). He also won Textbook Excellence
Awards (“Texty”) from the Text and Academic Authors Association for a computer
architecture book and for a software engineering book. He received all his degrees from
UCLA, which awarded him an Outstanding Engineering Academic Alumni Award. He grew
up in Southern California, and for fun he plays soccer and rides bikes with his sons and walks
on the beach with his wife. Originally high-school sweethearts, they celebrated their 50th
wedding anniversary a few days after the Beta edition was published.
Andrew Waterman serves as SiFive’s Chief Engineer and co-founder. SiFive was founded
by the creators of the RISC-V architecture to provide low-cost custom chips based on RISC-V.
He received his PhD in Computer Science from UC Berkeley, where, weary of the vagaries
of existing instruction set architectures, he co-designed the RISC-V ISA and the first RISC-V
microprocessors. Andrew is one of the main contributors to the open-source RISC-V-based
Rocket chip generator, the Chisel hardware construction language, and the RISC-V ports of
the Linux operating system kernel and the GNU C Compiler and C Library. He also has an
MS from UC Berkeley, which was the basis of the RVC extension for RISC-V, and a BSE
from Duke University.
Quick Contents
List of Figures ix
Preface xii
1 Why RISC-V? 2
6 RV32A: Atomic 60
8 RV32V: Vector 72
Index 166
Contents
List of Figures x
Preface xii
1 Why RISC-V? 2
1.1 Introduction 2
1.2 Modular vs. Incremental ISAs 4
1.3 ISA Design 101 5
1.4 An Overview of this Book 10
1.5 Concluding Remarks 11
1.6 To Learn More 12
6 RV32A: Atomic 60
6.1 Introduction 60
6.2 Concluding Remarks 62
6.3 To Learn More 62
8 RV32V: Vector 72
8.1 Introduction 72
8.2 Vector Computation Instructions 73
8.3 Vector Registers and Dynamic Typing 74
8.4 Vector Loads and Stores 75
8.5 Parallelism During Vector Execution 76
8.6 Conditional Execution of Vector Operations 76
8.7 Miscellaneous Vector Instructions 77
8.8 Vector Example: DAXPY in RV32V 78
8.9 Comparing RV32V, MIPS-32 MSA SIMD, and x86-32 AVX SIMD 79
8.10 Concluding Remarks 81
8.11 To Learn More 82
10 RV32/64 Privileged Architecture 100
10.1 Introduction 100
10.2 Machine Mode for Simple Embedded Systems 101
10.3 Machine-Mode Exception Handling 103
10.4 User Mode and Process Isolation in Embedded Systems 105
10.5 Supervisor Mode for Modern Operating Systems 108
10.6 Page-Based Virtual Memory 109
10.7 Concluding Remarks 113
10.8 To Learn More 114
Index 166
List of Figures
Welcome!
RISC-V has been a phenomenon, rapidly growing in popularity since its introduction in 2011.
We thought a slim programmer’s guide would help fuel its ride and encourage newcomers to
understand why it is an attractive instruction set and see how it differs from conventional
instruction set architectures (ISAs) of the past.
Books for other ISAs inspired us, although we hoped that the simplicity of RISC-V would
mean writing much less than the 500+ pages of fine books such as See MIPS Run. At one-third
the overall length, at least by that measure we’ve succeeded. In fact, the ten chapters
that introduce each component of the modular RISC-V instruction set take just 100 pages,
despite averaging nearly one figure per page (75 total), which makes for quick reading.
After explaining the principles of instruction set design, we show how the RISC-V
architects learned from the instruction sets of the past 40 years to borrow their good ideas and
avoid their mistakes. ISAs are judged as much by what is omitted as by what is included.
We then introduce each component of this modular architecture in a sequence of chapters.
Every chapter has a program in RISC-V assembly language that demonstrates use of
the instructions introduced in that chapter, which makes it easier for the assembly language
programmer to learn RISC-V code. We also often show equivalent programs in ARM, MIPS,
and x86 that highlight the simplicity and cost-energy-performance benefits of RISC-V.
To make the book more fun to read, we include almost 50 sidebars in the page margins
with what we hope are interesting commentaries about the text. We also include about 75
images in the margins to emphasize examples of good ISA design. (Our margins are well-used!)
Finally, for the dedicated reader, we add roughly 25 elaborations throughout the text.
You can delve into these optional sections if you are interested in a topic. These sections
aren’t required to understand the other material in the book, so feel free to skip them if they
don’t catch your interest. For computer architecture buffs, we cite 25 papers and books that
may broaden your horizons. We learned a lot by reading them in order to write this book!
of history of the field too, which is why we feature quotes from famous computer scientists
and engineers throughout the text.
• Reference Card – This one page (two sides) condensed description of RISC-V covers
both RV32GCV and RV64GCV, which includes the base and all defined extensions:
RVI, RVM, RVA, RVF, RVD, RVC, and even RVV, even though it is still under development.
• Instruction Diagrams – These half-page graphical descriptions of each instruction
extension, which are the first figures of the chapters, list the full names of all RISC-V
instructions in a format that lets you easily see the variations of each instruction. See
Figures 2.1, 4.1, 5.1, 6.1, 7.1, 8.1, 9.1, 9.2, 9.3, and 9.4.
• Opcode Maps – These tables show the instruction layout, opcodes, format type, and
instruction mnemonic for each instruction extension in a fraction of a page. See
Figures 2.3, 3.3, 3.4, 4.2, 5.2, 5.3, 6.2, 7.6, 7.5, 7.7, 9.5, and 10.1. (The instruction
diagrams and opcode maps inspired the use of the word atlas in the book’s subtitle.)
• Instruction Glossary – Appendix A is a thorough description of every RISC-V
instruction and pseudoinstruction.1 It includes everything: the operation name
and operands, an English description, a register-transfer language definition, which
RISC-V extension it is in, the full name of the instruction, the instruction format, a
diagram of the instruction showing the opcodes, and references to compact versions of
the instruction. Amazingly, this all fits into less than 50 pages.
• Index – It helps you find the page that describes the instruction explanation, definition,
or diagram either by the full name or by mnemonic. It is organized like a dictionary.
1 The committee defining RV32V did not complete their work in time for the Beta edition, so we omit those
instructions from Appendix A. Chapter 8 is our best guess of what RV32V will be, although it is likely to change a
little.
Acknowledgments
We wish to thank Armando Fox for use of his LaTeX pipeline and advice on navigating the
world of self-publishing.
Our deepest thanks go to the people who read early drafts of the book and offered helpful
suggestions: Krste Asanović, Nikhil Athreya, C. Gordon Bell, Stuart Hoad, David Kanter,
John Mashey, Ivan Sutherland, Ted Speers, Michael Taylor, Megan Wachs, ... .
Finally, we thank the hundreds of UC Berkeley students for their debugging help and their
continuing interest in this material!
Figure 1.1: The corporate members of the RISC-V Foundation as of the Sixth RISC-V Workshop in May
2017 ranked by annual sales. The left column companies all exceed $US 50B in annual sales, the middle
column companies sell less than $US 50B but more than $US 5B, and the sales of those in the right column
are less than $US 5B but more than $US 0.5B. The foundation includes another 25 smaller companies, 5
startup companies (Antmicro Ltd, Blockstream, Esperanto Technologies, Greenwaves Technologies, and
SiFive), 4 nonprofit organizations (CSEM, Draper Laboratory, ICT, and lowRISC), and 6 universities (ETH
Zurich, IIT Madras, National University of Defense Technology, Princeton, and UC Berkeley). Most of the
60 organizations have their headquarters outside the US. To learn more, see www.riscv.org.
[Plot for Figure 1.2: number of x86 instructions (vertical axis, 0 to 1600) by year (horizontal axis,
1978 to 2014). Data labels in the plot: 80, 140, 145, 162, 166, 223, 293, 437, 446, 500, 670, 1048, 1338.]
Figure 1.2: Growth of x86 instruction set over its lifetime. x86 started with 80 instructions in 1978. It grew
16X to 1338 instructions by 2015, and it’s still growing. Amazingly, this graph is conservative. An Intel blog
puts the count at 3600 instructions in 2015 [Rodgers and Uhlig 2017], which would raise the x86 rate to one
new instruction every four days between 1978 and 2015. We count assembly language instructions, and they
presumably count machine language instructions. As Chapter 8 explains, a large part of the growth is
because the x86 ISA relies on SIMD instructions for data level parallelism.
4 CHAPTER 1. WHY RISC-V?
Figure 1.3: Description of the x86-32 ASCII Adjust after Addition (aaa) instruction. It performs computer
arithmetic in Binary Coded Decimal (BCD), which has fallen into the dustbin of information technology
history. The x86 also has three related instructions for subtraction (aas), multiplication (aam), and division
(aad). As each is a one-byte instruction, they collectively occupy 1.6% (4/256) of the precious opcode space.
The conventional approach to computer architecture is incremental ISAs, where new processors
must implement not only new ISA extensions but also all extensions of the past. The
purpose is to maintain backwards binary-compatibility so that binary versions of decades-old
programs can still run correctly on the latest processor. This requirement, when combined
with the marketing appeal of announcing new instructions with a new generation of processors,
has led to ISAs that grow substantially in size with age. For example, Figure 1.2 shows
the growth in the number of instructions for a dominant ISA today: the 80x86. It dates back
to 1978, yet it has added about three instructions per month over its long lifetime.
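The growth rates quoted for Figure 1.2 follow from simple arithmetic on the instruction counts. The sketch below reproduces them in Python. It assumes a lifetime of 1978 to 2015 and a 365-day year; both are our simplifications, and the function names are ours, not the book's stated method.

```python
# Back-of-the-envelope check of the x86 growth rates quoted in the text.
# Assumption: the lifetime spans 1978 to 2015 and a year has 365 days.

START_YEAR, END_YEAR = 1978, 2015
START_COUNT = 80  # instructions in 1978 (Figure 1.2)


def per_month(end_count: int) -> float:
    """Average number of instructions added per month."""
    months = (END_YEAR - START_YEAR) * 12
    return (end_count - START_COUNT) / months


def days_per_instruction(end_count: int) -> float:
    """Average number of days between new instructions."""
    days = (END_YEAR - START_YEAR) * 365
    return days / (end_count - START_COUNT)


if __name__ == "__main__":
    print(f"growth factor: {1338 / START_COUNT:.1f}x")
    print(f"1338 instructions: {per_month(1338):.1f} per month")
    print(f"3600 instructions: one every {days_per_instruction(3600):.1f} days")
```

With the 1338-instruction count this gives a little under three instructions per month, and with the 3600-instruction count roughly one new instruction every four days, in line with the figures in the text.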
This convention means that every implementation of the x86-32 (the name we use for
the 32-bit address version of x86) must implement the mistakes of past extensions, even
when they no longer make sense. For example, Figure 1.3 describes the ASCII Adjust after
Addition (aaa) instruction of the x86, which has long outlived its usefulness.
As an analogy, suppose a restaurant serves only a fixed-price meal, which starts out as a
small dinner of just a hamburger and a milkshake. Over time, it adds fries, and then an ice
cream sundae, followed by salad, pie, wine, vegetarian pasta, steak, beer, ad infinitum until
it becomes a gigantic banquet. It may make little sense in total, but diners can find whatever
they’ve ever eaten in a past meal at that restaurant. The bad news is that diners must pay the
rising cost of the expanding banquet for each dinner.
Beyond being recent and open, RISC-V is unusual since, unlike almost all prior ISAs, it is
modular. At the core is a base ISA, called RV32I, which runs a full software stack. RV32I is
1.3. ISA DESIGN 101 5
frozen and will never change, which gives compiler writers, operating system developers, and
assembly language programmers a stable target. The modularity comes from optional stan-
dard extensions that hardware can include or not depending on the needs of the application.
This modularity enables very small and low-energy implementations of RISC-V, which can
be critical for embedded applications. When told which extensions are included, the RISC-V
compiler can generate the best code for that hardware. The convention is to append
the extension letters to the name to indicate which are included. For example, RV32IMFD
adds the multiply (RV32M), single-precision floating point (RV32F), and double-precision
floating point extensions (RV32D) to the mandatory base instructions (RV32I).
Sidebar: If software uses an omitted RISC-V instruction from an optional extension,
the hardware traps and executes the desired function in software as part of a standard library.
Returning to our analogy, RISC-V offers a menu instead of a buffet; the chef need cook
only what the customers want, not a feast for every meal, and the customers pay only for
what they order. RISC-V has no need to add instructions simply for the marketing sizzle. The
RISC-V Foundation decides when to add a new option to the menu, and they will do so only
for solid technical reasons after an extended open discussion by a committee of hardware and
software experts. Even when new choices appear on the menu, they remain optional and not
a new requirement for all future implementations, like incremental ISAs.
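The naming convention described above (a base followed by extension letters, as in RV32IMFD) is easy to decode mechanically. The Python sketch below is illustrative only: the helper name `parse_isa_name` and the letter table are our own, it recognizes only the extension letters mentioned in this book's front matter (I, M, A, F, D, C, V), and it does not handle shorthands such as the G in RV32GCV.

```python
# Sketch: split an ISA name such as "RV32IMFD" into its base and extensions.
# The helper name and the letter table are illustrative assumptions; only the
# extensions named in the text (I, M, A, F, D, C, V) are recognised.

import re
from typing import List, Tuple

EXTENSION_NAMES = {
    "I": "base integer",
    "M": "multiply-divide",
    "A": "atomic",
    "F": "single-precision floating point",
    "D": "double-precision floating point",
    "C": "compressed",
    "V": "vector",
}


def parse_isa_name(name: str) -> Tuple[int, List[str]]:
    """Return (register width, extension letters) for a name like RV32IMFD."""
    match = re.fullmatch(r"RV(32|64)([A-Z]+)", name.upper())
    if match is None:
        raise ValueError(f"not an RV32/RV64 ISA name: {name!r}")
    width, letters = int(match.group(1)), list(match.group(2))
    if letters[0] != "I":
        # In this sketch the mandatory base must be spelled out first.
        raise ValueError("the mandatory base I must come first")
    unknown = [c for c in letters if c not in EXTENSION_NAMES]
    if unknown:
        raise ValueError(f"unknown extension letters: {unknown}")
    return width, letters


if __name__ == "__main__":
    width, letters = parse_isa_name("RV32IMFD")
    print(width, [EXTENSION_NAMES[c] for c in letters])
```

A compiler driver could use a table like this to decide which instructions it may emit for a given target, which is the point of the menu analogy: only the listed extensions are assumed to exist in hardware.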
To illustrate what we mean, in this section we’ll show some choices from older ISAs that
look unwise in retrospect and where RISC-V often made much better decisions.
Cost. Processors are implemented as integrated circuits, commonly called chips or dies.
They are called dies because they start life as a piece of a single round wafer, which is diced
into many individual pieces. Figure 1.4 shows a wafer of RISC-V processors. The cost is
very sensitive to the area of the die, growing roughly with its square: $\text{cost} \approx f(\text{die area}^2)$
Obviously, the smaller the die, the more dies per wafer, and most of the cost of the die is
the processed wafer itself. Less obvious is that the smaller the die, the higher the yield, the
fraction of manufactured dies that work. The reason is that the silicon manufacturing will
result in small flaws scattered about the wafer, so the smaller the die, the lower the fraction
that will be flawed.
Figure 1.4: An 8-inch diameter wafer of RISC-V dies designed by SiFive. It has two types of RISC-V dies
using an older, larger processing line. An FE310 die is 2.65 mm×2.72 mm and a SiFive test die is
2.89 mm×2.72 mm. The wafer contains 1846 of the former and 1866 of the latter, totaling 3712 chips.
An architect wants to keep the ISA simple to shrink the size of processors that implement
it. As we shall see in the following chapters, the RISC-V ISA is a much simpler ISA
than the ARM-32 ISA. As a concrete example of the impact of simplicity, let’s compare
a RISC-V Rocket processor to an ARM-32 Cortex-A5 processor in the same technology
(TSMC40GPLUS) using the same-sized caches (16 KiB). The RISC-V die is 0.27 mm² versus
0.53 mm² for ARM-32. At around twice the area, the ARM-32 Cortex-A5 die costs approximately
4X (2²) as much as the RISC-V Rocket die. Even a 10% smaller die reduces cost by a
factor of 1.2 (1.1²).
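The cost comparison above rests on the rule of thumb that die cost grows with the square of die area. A minimal Python sketch of that arithmetic follows; the function name is ours, and the quadratic model is the simplification used in the text rather than a full manufacturing cost model.

```python
# Sketch of the quadratic cost rule of thumb used in the text:
# relative cost is taken to scale with the square of relative die area.
# The function name is ours; the areas are the ones quoted above.

def relative_cost(area_a_mm2: float, area_b_mm2: float) -> float:
    """Cost of die A relative to die B, assuming cost ~ (die area)^2."""
    return (area_a_mm2 / area_b_mm2) ** 2


if __name__ == "__main__":
    rocket, cortex_a5 = 0.27, 0.53  # mm^2, from the comparison above
    print(f"area ratio: {cortex_a5 / rocket:.2f}")
    print(f"cost ratio: {relative_cost(cortex_a5, rocket):.2f}")
    print(f"10% larger die: {relative_cost(1.1, 1.0):.2f}x the cost")
```

The area ratio comes out slightly under 2, so the computed cost ratio is slightly under the rounded 4X quoted in the text.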
Simplicity. Given the cost sensitivity to complexity, architects want a simple ISA to
reduce die area. Simplicity also reduces chip design time and verification time, which can be
much of the cost of development of the chip. These costs must be added to the cost of the
chip, with this overhead dependent on the number of chips shipped. Simplicity also reduces
the cost of documentation and the difficulty of getting customers to understand how to use
the ISA.
Sidebar: High-end processors can gain performance by combining simple instructions together
without burdening all lower-end implementations with a larger, more complicated ISA. This
technique is called macrofusion, as it fuses “macro” instructions together.
Below is a glaring example of ISA complexity from ARM-32:
    ldmiaeq SP!, {R4-R7, PC}
The instruction stands for LoaD Multiple, Increment-Address, on EQual. It performs 5 data
loads and writes to 6 registers but executes only if the EQ condition code is set. Moreover, it
writes a result to the PC, so it is also performing a conditional branch. Quite a handful!
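To make the amount of work hidden in that single instruction concrete, here is a toy Python model of it. It is a sketch under simplifying assumptions, not an ARM emulator: registers and memory are plain dictionaries, the EQ condition is reduced to one boolean flag, and the function name `ldmiaeq_sp` is ours.

```python
# Sketch of what "ldmiaeq SP!, {R4-R7, PC}" does, as described in the text.
# This is a toy model, not an ARM emulator: registers and memory are plain
# dictionaries keyed by name and byte address, and the EQ condition is
# modelled as a single Z flag.

def ldmiaeq_sp(regs: dict, memory: dict, z_flag: bool) -> dict:
    """Return the register state after the conditional load-multiple."""
    if not z_flag:            # EQ condition fails: the instruction does nothing
        return dict(regs)
    new_regs = dict(regs)
    address = regs["SP"]
    for name in ("R4", "R5", "R6", "R7", "PC"):   # 5 data loads
        new_regs[name] = memory[address]
        address += 4          # increment-after addressing, 4 bytes per word
    new_regs["SP"] = address  # "!" writes the final address back: 6th write
    return new_regs


if __name__ == "__main__":
    regs = {"R4": 0, "R5": 0, "R6": 0, "R7": 0, "SP": 0x1000, "PC": 0x8000}
    stack = {0x1000 + 4 * i: value
             for i, value in enumerate([11, 22, 33, 44, 0x9000])}
    print(ldmiaeq_sp(regs, stack, z_flag=True))
    print(ldmiaeq_sp(regs, stack, z_flag=False))
```

Even in this reduced form the instruction is a conditional test, five loads, a stack-pointer update, and a branch (the write to PC) rolled into one.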
Ironically, simple instructions are much more likely to be used than complex ones. For
example, x86-32 includes an enter instruction, which was intended to be the first instruction
executed on entering a procedure to create a stack frame for it (see Chapter 3). Most compilers
instead use only these two simple x86-32 instructions:
    push ebp        # Push the frame pointer onto the stack
    mov ebp, esp    # Copy the stack pointer to the frame pointer
Sidebar: A simple processor can be helpful for embedded applications since it is easier to
predict execution time. Assembly-language programmers of microcontrollers often want to
maintain exact timing, so they rely on code taking a predictable number of clock cycles that
they can count by hand.
Performance. Except for the tiny chips for embedded applications, architects are typically
concerned about performance as well as cost. Performance can be factored into three
terms:
$$\frac{\text{instructions}}{\text{program}} \times \frac{\text{average clock cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{clock cycle}} = \frac{\text{time}}{\text{program}}$$
Sidebar: The last factor is the inverse of the clock rate, so a 1 GHz clock rate means the time
per clock cycle is 1 ns ($1/10^9$).
Even if a simple ISA might execute more instructions per program than a complex ISA, it
can more than make up for that by having a faster clock cycle or fewer average clock cycles
per instruction (CPI).
For example, for the CoreMark benchmark [Gal-On and Levy 2012] (100,000 iterations),
the performance on the ARM-32 Cortex-A9 is
$$\frac{32.27\,\text{B instructions}}{\text{program}} \times \frac{0.79\ \text{clock cycles}}{\text{instruction}} \times \frac{0.71\ \text{ns}}{\text{clock cycle}} = \frac{18.15\ \text{secs}}{\text{program}}$$
Sidebar: The average number of clock cycles can be less than 1 because the A9 and
BOOM [Celio et al. 2015] are so-called superscalar processors, which execute more than one
instruction per clock cycle.
For the BOOM implementation of RISC-V, the equation is
$$\frac{29.51\,\text{B instructions}}{\text{program}} \times \frac{0.72\ \text{clock cycles}}{\text{instruction}} \times \frac{0.67\ \text{ns}}{\text{clock cycle}} = \frac{14.26\ \text{secs}}{\text{program}}$$
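The three-factor performance equation can be checked directly against the CoreMark numbers quoted above. The Python sketch below multiplies the published factors; the function name is ours. Because the factors are rounded to two or four significant digits, the products land close to, but not exactly on, the totals printed in the text.

```python
# Sketch: the three-factor performance equation from the text.
# Inputs are the rounded CoreMark figures quoted above, so the products
# differ slightly from the published totals (18.15 s and 14.26 s).

def time_per_program(instructions: float, cpi: float, cycle_time_s: float) -> float:
    """instructions/program x clock cycles/instruction x seconds/clock cycle."""
    return instructions * cpi * cycle_time_s


if __name__ == "__main__":
    a9 = time_per_program(32.27e9, 0.79, 0.71e-9)
    boom = time_per_program(29.51e9, 0.72, 0.67e-9)
    print(f"ARM-32 Cortex-A9: {a9:.2f} s")
    print(f"RISC-V BOOM:      {boom:.2f} s")
    print(f"speedup:          {a9 / boom:.2f}x")
```

Changing any one factor while holding the others fixed shows how instruction count, CPI, and clock period each contribute to the final time.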
The ARM processor didn’t execute fewer instructions than RISC-V in this case. As we
shall see, the simple instructions are also the most popular instructions, so ISA simplicity can
win in all metrics. For this program, the RISC-V processor gains nearly 10% in each of the
three factors, which results in a performance advantage of almost 30%. If a simpler ISA also
results in a smaller chip, its cost-performance will be excellent.
Isolation of Architecture from Implementation. The original distinction between ar-
chitecture and implementation, which goes back to the 1960s, is that architecture is what a
machine language programmer needs to know to write a correct program, but not the perfor-
mance of that program. The temptation for an architect is to include instructions in an ISA
that help performance or cost of one implementation at a particular time, but burden different
or future implementations.
For the MIPS-32 ISA, the regrettable example was the delayed branch. Conditional
branches cause problems in pipelined execution because the processor wants to have the next
instruction to execute already in the pipeline, but it can’t decide whether it wants the next
sequential one (if the branch isn’t taken) or the one at the branch target address (if it is taken).
For their first microprocessor with a 5-stage pipeline, this indecision could have caused a one
clock-cycle stall of the pipeline. MIPS-32 solved this problem by redefining branch to occur
in the instruction after the next one. Thus, the following instruction is always executed. The
job of the programmer or compiler writer was to put something useful into the delay slot.
Alas, this “solution” didn’t help later MIPS-32 processors with many more pipeline stages
(hence many more instructions fetched before the branch outcome is computed), but it made
life harder for MIPS-32 programmers, compiler writers, and processor designers ever after,
since incremental ISAs demand backwards compatibility (see Section 1.2). In addition, it
makes the MIPS-32 code much harder to understand (see Figure 2.10 on page 29).
Sidebar: Pipelined processors today anticipate branch outcomes using hardware
predictors, which can exceed 90% accuracy and work with any pipeline
length. They only need a mechanism to flush and restart the pipeline when
they mispredict.
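The effect of a delay slot on what actually executes can be seen in a toy interpreter. The Python sketch below is purely illustrative: the instruction format and the `run` helper are invented for this example and model only the one rule described above, namely that the instruction after a branch always executes.

```python
# Toy illustration of a branch delay slot, as described for MIPS-32.
# The instruction format and the run() helper are invented for this sketch;
# real MIPS-32 semantics have more detail than is modelled here.

def run(program, delay_slot: bool):
    """Execute a list of (op, arg) tuples and return the ops executed.

    ("branch", target) jumps to index `target`; every other op just logs
    itself. With delay_slot=True the instruction after a branch always
    executes before control transfers to the target.
    """
    trace, pc = [], 0
    while pc < len(program):
        op, arg = program[pc]
        if op == "branch":
            if delay_slot and pc + 1 < len(program):
                trace.append(program[pc + 1][0])  # delay slot always runs
            pc = arg
        else:
            trace.append(op)
            pc += 1
    return trace


PROGRAM = [
    ("add", None),
    ("branch", 4),     # taken branch to index 4
    ("sub", None),     # sits in the delay slot
    ("mul", None),     # skipped either way
    ("store", None),
]

if __name__ == "__main__":
    print(run(PROGRAM, delay_slot=False))
    print(run(PROGRAM, delay_slot=True))
```

The same program text yields different behavior depending on whether the ISA defines a delay slot, which is why the feature leaks an implementation detail into the architecture.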
While architects shouldn’t put features that help just one implementation at a point in
time, they also shouldn’t put in features that hinder some implementations. For example,
ARM-32 and some other ISAs have a Load Multiple instruction, as mentioned on the previ-
ous page. These instructions can improve performance of single-instruction issue pipelined
designs, but hurt multiple-instruction issue pipelines. The reason is that the straightforward
implementation precludes scheduling the individual loads of a Load Multiple in parallel with
other instructions, reducing instruction throughput of such processors.
Room for Growth. With the ending of Moore’s Law, the only path forward for major
improvements in cost-performance is to add custom instructions for specific domains, such as
deep learning, augmented reality, combinatorial optimization, graphics, and so on. That means
it’s important today for an ISA to reserve opcode space for future enhancements.
In the 1970s and 1980s, when Moore’s Law was in full force, there was little thought
of saving opcode space for future accelerators. Architects instead valued larger address and
immediate fields to reduce the number of instructions executed per program, the first factor
in the performance equation on the prior page.
An example of the impact of paucity of opcode space was when the architects of ARM-32
later tried to reduce code size by adding 16-bit length instructions to the formerly uniform
32-bit length ISA. There was simply no room left. Thus, the only solution was to create a new
ISA first with 16-bit instructions (Thumb) and later a new ISA with both 16-bit and 32-bit
instructions (Thumb-2) using a mode bit to switch between ARM ISAs. To change modes,
the programmer or compiler branches to a byte address with a 1 in the least-significant bit,
which worked because 16-bit and 32-bit instructions should have 0 in that bit.

Margin note: The ARM-32 instruction ldmiaeq mentioned above is even more complicated,
since when it branches it can also change instruction set modes between ARM-32 and
Thumb/Thumb-2.
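The mode-switch rule just described (a branch target with a 1 in the least-significant bit selects the other instruction set) can be sketched in a few lines of Python. This is only an illustration of that one rule; the function name decode_branch_target is hypothetical, and the sketch does not model the rest of ARM's interworking behavior.

```python
def decode_branch_target(target):
    """Split a branch target into (instruction-set mode, real address).

    Models only the rule in the text: instructions are at least 16-bit
    aligned, so bit 0 of a real instruction address is always 0 and can
    be reused as a mode flag.
    """
    mode = "Thumb" if target & 1 else "ARM"
    address = target & ~1  # clear the flag bit to recover the address
    return mode, address


print(decode_branch_target(0x8001))
print(decode_branch_target(0x8000))
```

Both calls recover the same address, 0x8000; only the reported mode differs.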
[Bar chart: Code Size Relative to RV32GC]
RISC-V RV32GC (16b & 32b)     1.00
RISC-V RV32G (32b)            1.37
ARM Thumb-2 (16b & 32b)       0.99
ARM-32 (32b)                  1.34
Intel x86-32 (variable 8b)    1.26
Figure 1.5: Relative program sizes for RV32G, ARM-32, x86-32, RV32C, and Thumb-2. The last two ISAs
are aimed at small code size. The programs were the SPEC CPU2006 benchmarks using the GCC compilers.
The small size advantage of Thumb-2 over RV32C is due to the code size savings of Load and Store Multiple
on procedure entry. RV32C excludes them to maintain the one-to-one mapping to instructions of RV32G,
which omits Load and Store Multiple to reduce implementation complexity for high-end processors (see
below). Chapter 7 explains RV32C. RV32G indicates a popular combination of RISC-V extensions (RV32M,
RV32F, RV32D, and RV32A), properly called RV32IMAFD. [Waterman 2016]
Program Size. The smaller the program, the smaller the area on a chip needed for the
program memory, which can be a significant cost for embedded devices. Indeed, that issue
inspired ARM architects to retroactively add shorter instructions in the Thumb and Thumb-2
ISAs. Smaller programs also lead to fewer misses in instruction caches, which saves power
since off-chip DRAM accesses use much more energy than on-chip SRAM accesses, and
improves performance as well. Small code size can be one of the goals of ISA architects.
The x86-32 ISA has instructions as short as 1 byte and as long as 15 bytes. One would
expect that the byte-variable length instructions of the x86 should certainly lead to smaller
programs than ISAs limited to 32-bit length instructions, like ARM-32 and RISC-V. Logically,
8-bit variable length instructions should also be smaller than ISAs that offer only 16-bit
and 32-bit instructions, like Thumb-2 and RISC-V using the RV32C extension (see Chapter 7).
Figure 1.5 shows that, while ARM-32 and RISC-V code is 6% to 9% larger than code
for x86-32 when all instructions are 32 bits long, surprisingly x86-32 is 26% larger than the
compressed versions (RV32C and Thumb-2) that offer both 16-bit and 32-bit instructions.
While a new ISA using 8-bit variable instructions would likely lead to smaller code than
RV32C and Thumb-2, the architects of the first x86 in the 1970s had different concerns.
Moreover, given the requirement of backwards binary-compatibility of an incremental ISA
(Section 1.2), the hundreds of new x86-32 instructions are longer than one might expect,
since they bear the burden of a one- or two-byte prefix to squeeze them into the limited free
opcode space of the original x86.

Margin note: One example 15-byte x86-32 instruction is
lock add dword ptr ds:[esi+ecx*4+0x12345678], 0xefcdab89. It assembles into (in
hexadecimal): 67 66 f0 3e 81 84 8e 78 56 34 12 89 ab cd ef. The last 8 bytes are 2 addresses
and the first 7 bytes specify atomic memory operation, the add operation, 32-bit data, the
data segment register, the 2 address registers, and scaled indexed addressing mode. An
example 1-byte instruction is inc eax that assembles into 40.
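The percentages quoted from Figure 1.5 can be checked with a little arithmetic. The sketch below assumes the bar heights are read as 1.00 (RV32GC), 1.37 (RV32G), 0.99 (Thumb-2), 1.34 (ARM-32), and 1.26 (x86-32); that pairing of values to bars is our reading of the chart, and the helper name percent_larger is made up for the example.

```python
# Code size relative to RV32GC, as read from the bars of Figure 1.5
# (the pairing of numbers to bars is an assumption of this sketch).
relative_size = {
    "RV32GC": 1.00,
    "RV32G": 1.37,
    "Thumb-2": 0.99,
    "ARM-32": 1.34,
    "x86-32": 1.26,
}


def percent_larger(a, b):
    """How much larger, in percent, code for ISA a is than code for ISA b."""
    return (relative_size[a] / relative_size[b] - 1.0) * 100.0


for isa in ("ARM-32", "RV32G"):
    print(f"{isa} vs x86-32: {percent_larger(isa, 'x86-32'):.0f}% larger")
print(f"x86-32 vs RV32GC: {percent_larger('x86-32', 'RV32GC'):.0f}% larger")
```

Under those readings the ratios round to the 6%, 9%, and 26% figures given in the text.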
The RISC-V “reference card” on pages 3 and 4 is a handy summary of all RISC-V instructions
in this book: RV32G, RV64G, and RV32/64V.

Margin note: The reference card is also called the green card because of the shade of the
background color of the one-page cardboard summary of ISAs from the 1960s. We kept the
background white for legibility instead of green for historical accuracy.

Chapter 7 describes the optional compressed extension RV32C, an excellent example of
the elegance of RISC-V. By restricting the 16-bit instructions to be short versions of existing
32-bit RV32G instructions, they are almost free. The assembler can pick the instruction size,
allowing the assembly language programmer and the compiler to be oblivious to RV32C.
The hardware decoder to translate 16-bit RV32C instructions into 32-bit RV32G instructions
needs just 400 gates, which is a few percent of even the simplest implementation of RISC-V.
Chapter 8 introduces RV32V, the vector extension. Vector instructions are another example
of ISA elegance as compared to the numerous, brute-force Single Instruction Multiple
Data (SIMD) instructions of ARM-32, MIPS-32, and x86-32. Indeed, hundreds of the in-
structions added to x86-32 in Figure 1.2 were SIMD, and hundreds more are coming. RV32V
is even simpler than most vector ISAs, as it associates the data type and length with the vec-
tor registers instead of embedding them in the opcodes. RV32V may be the most compelling
reason for switching from a conventional SIMD-based ISA to RISC-V.
Chapter 9 shows the 64-bit address version of RISC-V, RV64G. As the chapter explains,
the RISC-V architects needed only to widen the registers and add a few word, doubleword,
or long versions of RV32G instructions to extend the address from 32 to 64 bits.
Chapter 10 explains the system instructions, showing how RISC-V handles paging and
the Machine, User, and Supervisor privilege modes.
The last chapter gives a quick description of the remaining extensions that are currently
under consideration by the RISC-V Foundation.
Next comes the largest section of the book, Appendix A, an instruction set summary in
alphabetical order. It defines the full RISC-V ISA with all extensions mentioned above and
all pseudoinstructions in about 50 pages, a testimony to the simplicity of RISC-V.
We end the book with an index.
Figure 1.6: Number of pages and words of ISA manuals [Waterman and Asanović 2017a], [Waterman and
Asanović 2017b], [Intel Corporation 2016], [ARM Ltd. 2014]. Hours and weeks to complete assumes reading
at 200 words per minute for 40 hours a week. Based in part on Figure 1 of [Baumann 2017].
One indication of complexity is the size of the documentation. Figure 1.6 shows the
size of the instruction set manuals for RISC-V, ARM-32, and x86-32 measured in pages and
words. If you read manuals as a full-time job—8 hours a day for 5 days a week—it would
take half a month to make a single pass over the ARM-32 manual and a full month for the
x86-32. At this level of intricacy, perhaps no single person fully understands ARM-32 or
x86-32. Using this common-sense metric, RISC-V is 1/12 the complexity of the ARM-32 and 1/10
to 1/30 the complexity of x86-32. Indeed, the summary of RISC-V ISA including all extensions
is only two pages (see the Reference Card).
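The reading-time estimate in Figure 1.6 follows from the stated assumptions of 200 words per minute and 40 hours a week. As a rough sketch, the function below applies that formula; the 960,000-word input is a made-up round number chosen for illustration and is not a word count from the figure.

```python
def reading_time(words, wpm=200, hours_per_week=40):
    """Return (hours, weeks) needed for one pass over a manual."""
    hours = words / wpm / 60
    return hours, hours / hours_per_week


# Hypothetical 960,000-word manual, not a figure taken from the book.
hours, weeks = reading_time(960_000)
print(f"{hours:.0f} hours, {weeks:.1f} weeks")
```

A manual of that hypothetical size would take two full working weeks, which is the "half a month" scale the text mentions.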
This minimal, open ISA was unveiled in 2011 and is now backed by a foundation that
will evolve it by adding optional extensions based strictly on technical justifications after
a prolonged debate. The openness enables free, shared implementations of RISC-V, which
lowers costs and the odds of unwanted malicious secrets being hidden in a processor.
However, hardware alone does not a system make. Software development costs likely
dwarf hardware development costs, so while stable hardware is important, stable software is
more so. RISC-V needs operating systems, boot-loaders, reference software, and popular software
tools. The foundation offers stability for the overall ISA, and the frozen base means that the
RV32I core that is the target for the software stack will never change. By its broad adoption
and openness, RISC-V can challenge the dominance of the prevailing proprietary ISAs.
Elegant is a word rarely applied to ISAs, but after reading this book, you may agree with
us that it applies to RISC-V. We’ll highlight features that we believe indicate elegance with a
Mona Lisa icon in the margins.
S. P. Morse. The Intel 8086 chip and the future of microprocessor design. Computer, 50(4):
8–9, 2017.
D. A. Patterson and J. L. Hennessy. Computer Organization and Design RISC-V Edition:
The Hardware Software Interface. Morgan Kaufmann, 2017.
S. Rodgers and R. Uhlig. X86: Approaching 40 and still going strong, 2017.
J. L. von Neumann, A. W. Burks, and H. H. Goldstine. Preliminary discussion of the logical
design of an electronic computing instrument. Report to the U.S. Army Ordnance Depart-
ment, 1947.
A. Waterman. Design of the RISC-V Instruction Set Architecture. PhD thesis, EECS Department,
University of California, Berkeley, Jan 2016. URL
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-1.html.
A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual Volume
II: Privileged Architecture Version 1.10. May 2017a. URL
https://riscv.org/specifications/privileged-isa/.
A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Version 2.2. May 2017b. URL https://riscv.org/specifications/.
Notes
1 http://parlab.eecs.berkeley.edu
2 RV32I: RISC-V Base Integer ISA
. . . the only way to realistically realize the performance goals and make them accessible
to the user was to design the compiler and the computer at the same time. In this way
features would not be put in the hardware which the software could not use . . .
—Frances Elizabeth “Fran” Allen, 1981

Margin note: Frances Elizabeth “Fran” Allen (1932-) was bestowed the Turing Award
primarily for her work on optimizing compilers. The Turing Award is the greatest prize in
Computer Science.
2.1 Introduction
Figure 2.1 is a one-page graphical representation of the RV32I base instruction set. You can
see the full RV32I instruction set by concatenating the underlined letters from left to right
for each diagram. The set notation using { } lists the possible variations of the instruction,
using either underlined letters or the underscore character _, which means no letter for this
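variation. For example, the diagram for set less than, whose optional suffixes are
{ _, immediate } and { _, unsigned }, represents the four RV32I instructions slt, slti, sltu,
and sltiu.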
[Figure 2.1 diagram. Surviving labels: RV32I. Integer Computation: add { _, immediate };
subtract; { and, or, exclusive or } { _, immediate }. Loads and Stores: { load, store }
{ byte, halfword, word }; load { byte, halfword } unsigned.]
Figure 2.1: Diagram of the RV32I instructions. The underlined letters are concatenated from left to right to
form RV32I instructions. The curly bracket notation { } means each vertical item in the set is a different
variation of the instruction. The underscore _ within a set means that one option is simply the instruction
name so far without a letter from this set. For example, the notation near the upper left-hand corner
represents the following six instructions: and, or, xor, andi, ori, xori.
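The curly-bracket notation amounts to taking one choice from each set and concatenating the pieces from left to right. A minimal sketch of that expansion follows; the helper name expand is hypothetical, and the sketch takes the already-abbreviated letters as input rather than deriving them from the underlined letters in the figure.

```python
from itertools import product


def expand(*option_sets):
    """Concatenate one choice from each set, left to right.

    "_" stands for "no letter for this variation", as in Figure 2.1.
    """
    names = []
    for choice in product(*option_sets):
        names.append("".join(part for part in choice if part != "_"))
    return names


# The example from the caption: {and, or, xor} followed by {_, i}.
print(expand(["and", "or", "xor"], ["_", "i"]))
# Set less than: slt followed by {_, i} and then {_, u}.
print(expand(["slt"], ["_", "i"], ["_", "u"]))
```

The first call yields the six logical instructions named in the caption (in a different order), and the second yields slt, sltu, slti, and sltiu.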
Bits     31-25   24-20  19-15  14-12   11-7  6-0
R-type   funct7  rs2    rs1    funct3  rd    opcode
Figure 2.2: RISC-V instruction formats. We label each immediate subfield with the bit position (imm[x]) in
the immediate value being produced, rather than the bit position in the instruction’s immediate field as is
usually done. Chapter 10 explains how the control status register instructions use the I-type format slightly
differently. (Figure 2.2 of Waterman and Asanović 2017 is the basis of this figure).
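To make the R-type layout concrete, here is a small sketch that packs the six fields into a 32-bit word. The function name encode_r_type is an assumption for illustration; the field values for add come from the add row of Figure 2.3 (funct7 0000000, funct3 000, opcode 0110011).

```python
def encode_r_type(funct7, rs2, rs1, funct3, rd, opcode):
    """Pack the six R-type fields into one 32-bit instruction word."""
    assert 0 <= funct7 < 128 and 0 <= funct3 < 8 and 0 <= opcode < 128
    assert all(0 <= r < 32 for r in (rs2, rs1, rd))
    return (
        (funct7 << 25)    # bits 31-25
        | (rs2 << 20)     # bits 24-20
        | (rs1 << 15)     # bits 19-15
        | (funct3 << 12)  # bits 14-12
        | (rd << 7)       # bits 11-7
        | opcode          # bits 6-0
    )


# add x5, x6, x7
word = encode_r_type(0b0000000, 7, 6, 0b000, 5, 0b0110011)
print(f"{word:08x}")
```

Because the register specifiers always sit in the same bit positions, the same shifts work for every R-type instruction; only funct7, funct3, and opcode change.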
31 25 24 20 19 15 14 12 11 7 6 0
imm[31:12] rd 0110111 U lui
imm[31:12] rd 0010111 U auipc
imm[20|10:1|11|19:12] rd 1101111 J jal
imm[11:0] rs1 000 rd 1100111 I jalr
imm[12|10:5] rs2 rs1 000 imm[4:1|11] 1100011 B beq
imm[12|10:5] rs2 rs1 001 imm[4:1|11] 1100011 B bne
imm[12|10:5] rs2 rs1 100 imm[4:1|11] 1100011 B blt
imm[12|10:5] rs2 rs1 101 imm[4:1|11] 1100011 B bge
imm[12|10:5] rs2 rs1 110 imm[4:1|11] 1100011 B bltu
imm[12|10:5] rs2 rs1 111 imm[4:1|11] 1100011 B bgeu
imm[11:0] rs1 000 rd 0000011 I lb
imm[11:0] rs1 001 rd 0000011 I lh
imm[11:0] rs1 010 rd 0000011 I lw
imm[11:0] rs1 100 rd 0000011 I lbu
imm[11:0] rs1 101 rd 0000011 I lhu
imm[11:5] rs2 rs1 000 imm[4:0] 0100011 S sb
imm[11:5] rs2 rs1 001 imm[4:0] 0100011 S sh
imm[11:5] rs2 rs1 010 imm[4:0] 0100011 S sw
imm[11:0] rs1 000 rd 0010011 I addi
imm[11:0] rs1 010 rd 0010011 I slti
imm[11:0] rs1 011 rd 0010011 I sltiu
imm[11:0] rs1 100 rd 0010011 I xori
imm[11:0] rs1 110 rd 0010011 I ori
imm[11:0] rs1 111 rd 0010011 I andi
0000000 shamt rs1 001 rd 0010011 I slli
0000000 shamt rs1 101 rd 0010011 I srli
0100000 shamt rs1 101 rd 0010011 I srai
0000000 rs2 rs1 000 rd 0110011 R add
0100000 rs2 rs1 000 rd 0110011 R sub
0000000 rs2 rs1 001 rd 0110011 R sll
0000000 rs2 rs1 010 rd 0110011 R slt
0000000 rs2 rs1 011 rd 0110011 R sltu
0000000 rs2 rs1 100 rd 0110011 R xor
0000000 rs2 rs1 101 rd 0110011 R srl
0100000 rs2 rs1 101 rd 0110011 R sra
0000000 rs2 rs1 110 rd 0110011 R or
0000000 rs2 rs1 111 rd 0110011 R and
0000 pred succ 00000 000 00000 0001111 I fence
0000 0000 0000 00000 001 00000 0001111 I fence.i
000000000000 00000 000 00000 1110011 I ecall
000000000001 00000 000 00000 1110011 I ebreak
csr rs1 001 rd 1110011 I csrrw
csr rs1 010 rd 1110011 I csrrs
csr rs1 011 rd 1110011 I csrrc
csr zimm 101 rd 1110011 I csrrwi
csr zimm 110 rd 1110011 I csrrsi
csr zimm 111 rd 1110011 I csrrci
Figure 2.3: RV32I opcode map has instruction layout, opcodes, format type, and names. (Table 19.2 of
[Waterman and Asanović 2017] is the basis of this figure.)
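The branch rows of Figure 2.3 list the immediate as imm[12|10:5] in the upper bits and imm[4:1|11] in bits 11 to 7. The sketch below shows one way to scatter a byte offset into those positions and gather it back. The function names are hypothetical; the default opcode 1100011 and the funct3 values are taken from the beq to bgeu rows of the figure.

```python
def encode_b_type(offset, rs2, rs1, funct3, opcode=0b1100011):
    """Encode a conditional branch; offset is a signed, even byte offset."""
    assert offset % 2 == 0 and -4096 <= offset < 4096
    imm = offset & 0x1FFF  # 13-bit two's complement; bit 0 is always 0
    return (
        (((imm >> 12) & 0x1) << 31)    # imm[12]
        | (((imm >> 5) & 0x3F) << 25)  # imm[10:5]
        | (rs2 << 20)
        | (rs1 << 15)
        | (funct3 << 12)
        | (((imm >> 1) & 0xF) << 8)    # imm[4:1]
        | (((imm >> 11) & 0x1) << 7)   # imm[11]
        | opcode
    )


def decode_b_offset(word):
    """Recover the signed byte offset from a B-type instruction word."""
    imm = (
        (((word >> 31) & 0x1) << 12)
        | (((word >> 25) & 0x3F) << 5)
        | (((word >> 8) & 0xF) << 1)
        | (((word >> 7) & 0x1) << 11)
    )
    return imm - 0x2000 if imm & 0x1000 else imm  # sign-extend from bit 12


# beq x1, x2, +8 bytes (funct3 = 000)
print(f"{encode_b_type(8, 2, 1, 0b000):08x}")
```

Note that the sign bit of the offset lands in bit 31 of the instruction, so sign extension does not depend on which format is being decoded.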
2.2. RV32I INSTRUCTION FORMATS 17
only a two-operand instruction, the compiler or assembly language programmer must use an
extra move instruction to preserve the destination operand. Third, in RISC-V the specifiers of
the registers to be read and written are always in the same location in all instructions, which
means the register accesses can begin before decoding the instruction. Many other ISAs
reuse a field as a source in some instructions and as a destination in others (e.g., ARM-32 and
MIPS-32), which forces addition of extra hardware to be placed in a potentially time-critical
path to select the proper field. Fourth, the immediate fields in these formats are always sign
extended, and the sign bit is always in the most significant bit of the instruction. This decision
means sign extension of the immediate, which may also be on a critical timing path, can
proceed before decoding the instruction.

Margin note: Sign-extended immediates even help logical instructions. For example,
x & 0xfffffff0 uses only the single instruction andi in RISC-V, but it requires two
instructions in MIPS-32 (addiu to load the constant, then and), since MIPS zero-extends
logical immediates. ARM-32 needed to add an additional instruction, bic, that performs
rx & ~immediate to compensate for zero-extending immediates.

Elaboration: B- and J-type formats?
As mentioned below, the immediate field is rotated 1 bit for branch instructions, a variation
of the S format that we relabel the B format. The immediate field is rotated
12 bits for jump instructions, a variation of the U format relabeled J format. Hence, there are
really four basic formats, but we can conservatively count RISC-V as having six formats.
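The benefit of sign-extending immediates for logical instructions can be seen with the mask 0xfffffff0 mentioned in the margin note. The following sketch uses hypothetical helper names (sign_extend, zero_extend) and models only the arithmetic, not any instruction encoding.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def zero_extend(value, bits):
    """Keep the low `bits` bits and fill the rest with zeros."""
    return value & ((1 << bits) - 1)


x = 0x12345678

# A 12-bit immediate of 0xFF0 sign-extends to -16, i.e. 0xfffffff0.
risc_v_mask = sign_extend(0xFF0, 12) & MASK32
print(hex(risc_v_mask), hex(x & risc_v_mask))

# A zero-extended 16-bit immediate can never set the upper 16 bits,
# so even its largest similar value cannot reproduce that mask.
zero_extended_mask = zero_extend(0xFFF0, 16)
print(hex(zero_extended_mask), hex(x & zero_extended_mask))
```

With sign extension one immediate clears the low four bits of x; with zero extension the upper half of x is lost as well, which is why a separate instruction is needed to build the constant.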
To help programmers, a bit pattern of all zeros is an illegal RV32I instruction. Thus,
erroneous jumps into zeroed memory regions will immediately trap, an aid to debugging.
Similarly, the bit pattern of all ones is an illegal instruction, which will trap other common
errors such as unprogrammed non-volatile memory devices, disconnected memory buses, or
broken memory chips.
To leave ample room for ISA extensions, the base RV32I ISA uses less than 1/8-th of
the encoding space in the 32-bit instruction word. The architects also carefully picked
the RV32I opcodes so that instructions with common datapath operations share as many of
the same opcode bit values as possible, which simplifies the control logic. Finally, as we
shall see, the branch and jump addresses in the B and J formats must be shifted left 1 bit
so as to multiply the addresses by 2, thereby giving branches and jumps a greater range.
RISC-V rotates the bits in the immediate operands from a natural placement to reduce the
instruction signal fanout and immediate multiplexing cost by almost a factor of two, which
again simplifies datapath logic on low-end implementations.

Margin note: RISC-V implementations all use the same opcodes for the optional extensions
such as RV32M, RV32F, and so on. Non-standard extensions that are unique to a processor are
restricted to a reserved opcode space in RISC-V.

What’s Different? We’ll end each section in this and following chapters with a description
of how RISC-V differs from other ISAs. The contrast is often what RISC-V is missing.
Architects demonstrate good taste by the features they omit as well as by what they include.
The ARM-32 12-bit immediate field is not simply a constant but an input to a function that
produces a constant: 8 bits are zero-extended to full width and then rotated right by the value
in the 4 remaining bits multiplied by 2. The hope was encoding more useful constants in 12
bits would reduce the number of executed instructions. ARM-32 also dedicates a precious
four bits in most instruction formats to conditional execution. Despite being infrequently
used, conditional execution adds to the complexity of out-of-order processors.
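The ARM-32 immediate function described above is easy to state in code. This sketch assumes the usual ARM field split, which the text does not spell out: the low 8 bits hold the value and the upper 4 bits hold the rotation count. The function name arm32_immediate is hypothetical.

```python
def arm32_immediate(imm12):
    """Expand an ARM-32 12-bit immediate field into its 32-bit constant."""
    rotate = (imm12 >> 8) & 0xF  # assumed: upper 4 bits give rotation / 2
    value = imm12 & 0xFF         # assumed: lower 8 bits, zero-extended
    shift = 2 * rotate
    # Rotate right by `shift` within 32 bits.
    return ((value >> shift) | (value << (32 - shift))) & 0xFFFFFFFF


print(hex(arm32_immediate(0x0FF)))  # no rotation
print(hex(arm32_immediate(0x4FF)))  # 0xFF rotated right by 8
```

A RISC-V immediate, by contrast, needs only sign extension, so no rotator sits between the instruction word and the datapath.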
31 0
x0 / zero Hardwired zero
x1 / ra Return address
x2 / sp Stack pointer
x3 / gp Global pointer
x4 / tp Thread pointer
x5 / t0 Temporary
x6 / t1 Temporary
x7 / t2 Temporary
x8 / s0 / fp Saved register, frame pointer
x9 / s1 Saved register
x10 / a0 Function argument, return value
x11 / a1 Function argument, return value
x12 / a2 Function argument
x13 / a3 Function argument
x14 / a4 Function argument
x15 / a5 Function argument
x16 / a6 Function argument
x17 / a7 Function argument
x18 / s2 Saved register
x19 / s3 Saved register
x20 / s4 Saved register
x21 / s5 Saved register
x22 / s6 Saved register
x23 / s7 Saved register
x24 / s8 Saved register
x25 / s9 Saved register
x26 / s10 Saved register
x27 / s11 Saved register
x28 / t3 Temporary
x29 / t4 Temporary
x30 / t5 Temporary
x31 / t6 Temporary
Each register above is 32 bits wide (bits 31 to 0).
pc (also 32 bits wide, bits 31 to 0)
Figure 2.4: The registers of RV32I. Chapter 3 explains the RISC-V calling convention, the rationale behind
the various pointers (sp, gp, tp, fp), Saved registers (s0-s11), and Temporaries (t0-t6). (Figure 2.1 and Table
20.1 of [Waterman and Asanović 2017] are the basis of this figure.)
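The numbering pattern in Figure 2.4 is regular enough to capture in a short function, which can be handy when reading disassembly that mixes x-numbers and ABI names. The function name abi_name is an assumption for this sketch; the mapping itself is read directly off the figure.

```python
def abi_name(n):
    """Return the ABI name of integer register x<n>, following Figure 2.4."""
    fixed = {0: "zero", 1: "ra", 2: "sp", 3: "gp", 4: "tp"}
    if n in fixed:
        return fixed[n]
    if 5 <= n <= 7:
        return f"t{n - 5}"    # t0-t2
    if 8 <= n <= 9:
        return f"s{n - 8}"    # s0 (also called fp) and s1
    if 10 <= n <= 17:
        return f"a{n - 10}"   # a0-a7
    if 18 <= n <= 27:
        return f"s{n - 16}"   # s2-s11
    if 28 <= n <= 31:
        return f"t{n - 25}"   # t3-t6
    raise ValueError(f"RV32I has no register x{n}")


print([abi_name(n) for n in (0, 2, 8, 10, 18, 31)])
```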
What’s Different? First, there are no byte or half-word integer computation operations.
The operations are always the full register width. Memory accesses take orders of magnitude
more energy than arithmetic operations, so narrow data accesses can save significant energy,
but narrow operations do not. ARM-32 has the unusual feature of having an option to shift
one of the operands in most arithmetic-logic operations, which complicates the datapath and
is rarely needed [Hohl and Hinds 2016]; RV32I has separate shift instructions.
Nor does RV32I include multiply and divide; they comprise the optional RV32M exten-
sion (see Chapter 4). Unlike ARM-32 and x86-32, the full RISC-V software stack can run
without them, which can shrink embedded chips. While not a hardware issue, the MIPS-32
assembler may replace a multiply with a sequence of shifts and adds to try to improve performance,
which may confuse the programmer seeing instructions executed not found in the
assembly language program. RV32I also omits rotate instructions and detection of integer
arithmetic overflow. Both can be calculated in a few RV32I instructions (see Section 2.6).
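As a rough illustration of why rotates and overflow checks need only a few basic operations, the sketch below models 32-bit registers in Python. It is our own sketch of the usual approach, not the instruction sequences of Section 2.6; the function names are hypothetical and the instruction names in the comments only suggest which RV32I operations would play each role.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """View a 32-bit register value as a two's complement integer."""
    return x - (1 << 32) if x & 0x80000000 else x


def rotate_left(x, n):
    """Rotate left with two shifts and an OR (roughly sll, srl, or)."""
    n &= 31
    return ((x << n) | (x >> ((32 - n) & 31))) & MASK32


def add_overflows_unsigned(a, b):
    total = (a + b) & MASK32  # what add leaves in the register
    return total < a          # one unsigned compare (roughly bltu)


def add_overflows_signed(a, b):
    total = (a + b) & MASK32
    b_negative = to_signed(b) < 0                  # roughly slti
    sum_smaller = to_signed(total) < to_signed(a)  # roughly slt
    return b_negative != sum_smaller               # roughly bne


print(hex(rotate_left(0x80000001, 1)))
print(add_overflows_unsigned(0xFFFFFFFF, 1), add_overflows_signed(0x7FFFFFFF, 1))
```

The signed check relies on the fact that adding a negative number should make the sum smaller and adding a non-negative number should not; any disagreement signals overflow.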
register-indirect addressing mode. Unlike x86-32, RISC-V has no special stack instructions.
By using one of the 31 registers as the stack pointer (see Figure 2.4), the standard addressing
mode gets most of the benefits of push and pop instructions without the added ISA complex-
ity. Unlike MIPS-32, RISC-V rejected delayed load. Similar in spirit to delayed branches,
MIPS-32 redefined the load so the data is unavailable until two instructions later, when it
would show up in a five-stage pipeline. Whatever benefit it had evaporated for the longer
pipelines that came later.
While ARM-32 and MIPS-32 require data to be aligned naturally to data-sized boundaries
in memory, RISC-V does not. Misaligned accesses are sometimes required when porting
legacy code. One option is to disallow misaligned accesses in the base ISA and then provide
separate instructions to support misaligned accesses, such as Load Word Left and Load
Word Right of MIPS-32. This option would complicate register access, however, since lwl
and lwr require writing pieces of registers instead of simply full registers. Requiring instead
that the regular loads and stores support misaligned accesses simplified the overall design.
Elaboration: Endianness
RISC-V chose little-endian byte ordering because it is dominant commercially: all x86-32
systems, and Apple iOS, Google Android OS, and Microsoft Windows for ARM are all little-
endian. Since the endian order matters only when accessing the identical data both as a word
and as bytes, endianness affects few programmers.
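The point about endianness mattering only when the same data is viewed both as a word and as bytes can be seen in a short sketch. The value 0x12345678 is an arbitrary example.

```python
word = 0x12345678

# Byte order in memory, lowest address first.
little = word.to_bytes(4, "little")  # the RISC-V choice
big = word.to_bytes(4, "big")

print([hex(b) for b in little])
print([hex(b) for b in big])

# Reading the bytes back as a word with the matching order is lossless.
print(hex(int.from_bytes(little, "little")))
```

A program that only ever loads and stores whole words sees the same value either way; the difference shows up only in the byte at each individual address.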
What’s Different? As noted above, RISC-V excluded the infamous delayed branch of
MIPS-32, Oracle SPARC, and others. It also avoided the condition codes of ARM-32 and
x86-32 for conditional branches. They add extra state that is implicitly set by most instructions,
which needlessly complicates the dependence calculation for out-of-order execution. Finally,
it omitted the loop instructions of the x86-32: loop, loope, loopz, loopne, loopnz.
Figure 2.5: Insertion Sort in C. While simple, Insertion Sort has many advantages over complicated sorting
algorithms: it is memory efficient and fast for small data sets while being adaptive, stable, and online. GCC
compilers produced the code for the following four figures. We set the optimization flags to reduce code size,
as that produced the easiest to understand code.
Figure 2.6: Number of instructions and code size for Insertion Sort for these ISAs. Chapter 7 describes
ARM Thumb-2, microMIPS, and RV32C.
The ecall instruction makes requests to the supporting execution environment, such
as system calls. Debuggers use the ebreak instruction to transfer control to a debugging
environment.
The fence instruction sequences device I/O and memory accesses as viewed by other
threads and by external devices or coprocessors. The fence.i instruction synchronizes the
instruction and data streams. RISC-V does not guarantee that stores to instruction memory
are visible to instruction fetches in the same processor until a fence.i instruction executes.
Chapter 10 covers the RISC-V system instructions.
What’s Different? RISC-V uses memory mapped I/O instead of the in, ins, insb,
insw and out, outs, outsb, outsw instructions of the x86-32. It supports strings using byte
loads and stores instead of the 16 special string instructions of the x86-32: rep, movs, cmps,
scas, lods, ....
Margin note: The genealogy of all RISC-V instructions is chronicled in [Chen and
Patterson 2016].

Figure 2.7 uses the seven metrics of ISA design from Chapter 1 to organize the lessons from
past ISAs listed in the previous sections, and shows the positive outcomes for RV32I. We’re
not implying that RISC-V is the first ISA to have those outcomes. Indeed, RV32I inherits the
following from RISC-I, its great-great-grandparent [Patterson 2017]:
• 32-bit byte-addressable address space
• All instructions are 32 bits long
• 31 registers, all 32 bits wide, with register 0 hardwired to zero
• All operations are between registers (none are register-to-memory)
• Load/store word plus signed and unsigned load/store byte and halfword
• Immediate option for all arithmetic, logical, and shift instructions
• Immediates always sign-extend
• One data addressing mode (register + immediate) and PC-relative branching
• No multiply or divide instructions
• An instruction to load a wide immediate into the upper part of a register so that a 32-bit
constant takes only two instructions
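The last bullet interacts with the rule that immediates always sign-extend: when a 32-bit constant is split into an upper 20-bit part and a lower 12-bit part, the upper part must be bumped by one whenever the lower part will be treated as negative. The sketch below assumes the two instructions are lui and addi (both listed in Figure 2.3); the helper names split_constant and rebuild are made up for the example.

```python
def split_constant(value):
    """Split a 32-bit constant into (20-bit upper part, signed 12-bit lower part)."""
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    if lo >= 0x800:   # the 12-bit immediate will sign-extend to lo - 4096
        lo -= 0x1000
    hi = ((value - lo) >> 12) & 0xFFFFF  # compensates for a negative lo
    return hi, lo


def rebuild(hi, lo):
    """What the register holds after loading hi into bits 31-12, then adding lo."""
    return ((hi << 12) + lo) & 0xFFFFFFFF


for constant in (0x12345678, 0xDEADBEEF):
    hi, lo = split_constant(constant)
    print(f"{constant:#010x} -> upper {hi:#07x}, lower {lo}, rebuilt {rebuild(hi, lo):#010x}")
```

For 0xDEADBEEF the low 12 bits (0xEEF) have their top bit set, so the upper part becomes 0xDEADC rather than 0xDEADB and the lower part is a small negative number.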
RISC-V benefits from starting one-quarter to one-third century later, which allowed its
architects to follow Santayana’s advice to borrow the good ideas but to not repeat the mis-
takes of the past—including those of RISC-I—in the current RISC-V ISA. Moreover, the
RISC-V Foundation will grow the ISA slowly via optional extensions to prevent the rampant
incrementalism that has plagued successful ISAs of the past.
Margin note: The Lindy effect [Lin 2017] observes that the future life expectancy of a
technology or idea is proportional to its age. It has stood the test of time, so the longer it
has survived in the past, the longer it likely will survive in the future. If that hypothesis
holds, RISC architecture may be a good idea for a long time.

Elaboration: Is RV32I unique?
Early microprocessors had separate chips for floating-point arithmetic, so those instructions
were optional. Moore’s Law soon brought everything on chip, and modularity faded in ISAs.
Subsetting the full ISA in simpler processors and trapping to software to emulate them goes
back decades, with the IBM 360 model 44 and the Digital Equipment microVAX as examples.
RV32I is different in that the full software stack needs only the base instructions, so an
RV32I processor need not trap repeatedly for omitted instructions in RV32G. Probably the
closest ISA to RISC-V in that respect is the Tensilica Xtensa, which is aimed at embedded
applications. Its 80-instruction base ISA is intended to be extended by users with custom
instructions that accelerate their applications. RV32I has a simpler base ISA, has a 64-bit
address version, and offers extensions that target supercomputers as well as microcontrollers.
Figure 2.7: Lessons that RISC-V architects learned from past instruction set mistakes. Often the lesson was
simply to avoid ISA “optimizations” of the past. The lessons and mistakes are classified by the seven ISA
metrics from Chapter 1. Many features listed under cost, simplicity, and performance could be swapped
with each other, as it’s a matter of taste, but they are important no matter where they appear.
W. Hohl and C. Hinds. ARM Assembly Language: Fundamentals and Techniques. CRC
Press, 2016.
K. R. Irvine. Assembly language for x86 processors. Prentice Hall, 2014.
D. Patterson. How close is RISC-V to RISC-I?, 2017.
A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Version 2.2. May 2017. URL https://riscv.org/specifications/.
Notes
1 http://parlab.eecs.berkeley.edu
Figure 2.8: RV32I code for Insertion Sort in Figure 2.5. The address in hexadecimal is on the left, the
machine language code in hexadecimal is next, and then the assembly language instruction followed by a
comment. RV32I allocates two registers to point to a[j] and a[j-1]. It has plenty of registers, some of
which the ABI sets aside for procedure calls. Unlike the other ISAs, it skips saving and restoring these
registers to memory. While the code size is larger than x86-32, using the optional RV32C instructions (see
Chapter 7) closes the size gap. Note the compare and branch instructions avoid the three compare
instructions that ARM-32 and x86-32 require.
Figure 2.9: ARM-32 code for Insertion Sort in Figure 2.5. The address in hexadecimal is on the left, the
machine language code in hexadecimal is next, and then the assembly language instruction followed by a
comment. Short on registers, ARM-32 saves two of them on the stack for later reuse along with the return
address. It uses an addressing mode that scales i and j to be byte addresses. Given that a branch has the
potential to change ISAs between ARM-32 and Thumb-2, bxcs first sets the least significant bit of the return
address to 0 before saving it. The condition codes save one compare instruction to check j after
decrementing it, but there are still three compares elsewhere.
Figure 2.10: MIPS-32 code for Insertion Sort in Figure 2.5. The address in hexadecimal is on the left, the
machine language code in hexadecimal is next, and then the assembly language instruction followed by a
comment. The MIPS-32 code has three nop instructions, which adds to its length. Two are due to delayed
branches and one is due to the delayed load. The compiler couldn’t find useful instructions to place in those
delay slots. The delayed branches also make the code harder to understand, since the instruction that follows
is also executed when a branch or jump is taken. For example, the last instruction (addiu) at address 5c is
part of the loop even though it trails the branch instruction.
Figure 2.11: x86-32 code for Insertion Sort in Figure 2.5. The address in hexadecimal is on the left, the
machine language code in hexadecimal is next, and then the assembly language instruction followed by a
comment. Lacking registers, the x86-32 saves two of them on the stack. Moreover, two of the variables
allocated to registers in RV32I are instead kept in memory (n and the pointer to a[0]). It uses the Scaled
Indexed addressing mode to good effect for accessing a[i] and a[j]. Seven of the 20 x86-32 instructions are
one byte long, which gives the x86-32 good code size for this simple program. There are two popular versions
of x86 assembly language: Intel/Microsoft and AT&T/Linux. We use the Intel syntax, in part because it
matches the operand order of RISC-V, ARM-32, and MIPS-32 with the destination on the left and the
source(s) on the right, while the operands are vice versa for AT&T (and the registers prepend a % before
their names). This seemingly trivial matter is nearly a religious issue for some programmers. Pedagogy
drives our choice, not orthodoxy.
3 RISC-V Assembly Language
It’s very satisfying to take a problem we thought difficult and find a simple solution. The
best solutions are always simple.
—Ivan Sutherland

Margin note: Ivan Sutherland (1938-) is called the father of computer graphics for the
invention of Sketchpad, the 1962 forerunner of the graphical user interface in modern
computing, which led to a Turing Award.

3.1 Introduction
Figure 3.1 shows the four classic steps in translation starting from a C program and ending
with a machine-language program ready to run in the computer. This chapter covers the last
three steps, but we begin with the role the assembler plays in the RISC-V calling convention.
C program (foo.c) -> Compiler -> Assembly program (foo.s) -> Assembler -> Linker -> Loader -> Memory
Figure 3.1: Steps of translation from C source code to a running program. These are the logical steps,
although some steps are combined to accelerate translation. We use the Unix file suffix name convention for
each type of file. The equivalent suffixes in MS-DOS are .C, .ASM, .OBJ, .LIB, and .EXE.
then the program must save register values in memory, but a surprising fraction of function
calls fall into this happy case.
Other registers within a function call must be considered either in the same class as saved
registers, which are preserved across a function call, or in the same class as the temporary
registers, which are not. A function will change the register(s) containing the return value(s),
so they are like temporary registers. There is no reason to preserve the registers to pass
arguments to functions, so they also are like temporaries. The caller can rely on the remaining
registers to be unchanged across a function call: the registers used for the return address
and the stack pointer. Figure 3.2 lists the RISC-V application binary interface (ABI) names
of registers and the convention on whether they are preserved or not across function calls.
Given the ABI conventions, we can see the standard RV32I code for function entry and
exit. The function prologue looks like this:
entry_label:
addi sp,sp,-framesize # Allocate space for stack frame
# by adjusting stack pointer (sp register)
sw ra,framesize-4(sp) # Save return address (ra register)
# save other registers to stack if needed
... # body of the function
If there are too many function arguments and variables to fit in the registers, the prologue
allocates space on the stack for the function frame, as it is called. After the task of the
function is complete, the epilogue undoes the stack frame and returns to the point of origin:
Figure 3.2: Assembler mnemonics for RISC-V integer and floating-point registers. RISC-V has enough
registers that the ABI can allocate registers that procedures or methods are free to use without saving or
restoring when they don’t call other procedures or methods themselves. The registers preserved across a
procedure call are also called callee saved, versus caller saved for those that aren’t. Chapter 5 explains the
floating-point f registers. (Table 20.1 of [Waterman and Asanović 2017] is the basis of this figure.)
3.3 Assembly
The input to this step in Unix is a file with the suffix .s, such as foo.s; for MS-DOS it is
.ASM.
The job of the assembler step of Figure 3.1 is not simply to produce object code from the
instructions that the processor understands, but to extend them to include operations useful
for the assembly language programmer or the compiler writer. This category, based on clever
configurations of regular instructions, is called pseudoinstructions. Figures 3.3 and 3.4 list the
RISC-V pseudoinstructions, with those in the first figure all relying on register x0 to always
be zero while those in the second list do not. For example, ret mentioned above is actually
a pseudoinstruction that the assembler replaces with jalr x0, x1, 0 (see Figure 3.3). The
majority of RISC-V pseudoinstructions depend on x0. As you can see, setting aside one
of the 32 registers to be hardwired to zero greatly simplifies the RISC-V instruction set by
providing many popular operations—such as jump, return, and branch on equal to zero—as
pseudoinstructions.
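As a rough illustration of how a pseudoinstruction is just a particular configuration of a real instruction, the C sketch below builds the 32-bit encodings for ret, li (with a small immediate), and mv from the I-type field layout. The function and constant names are hypothetical helpers invented for this sketch, not part of any assembler.

```c
#include <stdint.h>

/* Hypothetical helper: pack the fields of an RV32I I-type instruction,
   laid out as imm[11:0] | rs1 | funct3 | rd | opcode. */
uint32_t encode_itype(int32_t imm, uint32_t rs1, uint32_t funct3,
                      uint32_t rd, uint32_t opcode)
{
    return (((uint32_t)imm & 0xfffu) << 20) | ((rs1 & 0x1fu) << 15) |
           ((funct3 & 0x7u) << 12) | ((rd & 0x1fu) << 7) | (opcode & 0x7fu);
}

/* Register numbers and opcodes used below (names chosen for this sketch). */
enum { REG_X0 = 0, REG_RA = 1 };
enum { OPCODE_OP_IMM = 0x13, OPCODE_JALR = 0x67 };

/* ret is the pseudoinstruction for jalr x0, x1, 0. */
uint32_t encode_ret(void)
{
    return encode_itype(0, REG_RA, 0, REG_X0, OPCODE_JALR);
}

/* li rd, imm with a 12-bit immediate is addi rd, x0, imm. */
uint32_t encode_li_small(uint32_t rd, int32_t imm)
{
    return encode_itype(imm, REG_X0, 0, rd, OPCODE_OP_IMM);
}

/* mv rd, rs is addi rd, rs, 0 (this one does not need x0). */
uint32_t encode_mv(uint32_t rd, uint32_t rs)
{
    return encode_itype(0, rs, 0, rd, OPCODE_OP_IMM);
}
```

Calling encode_ret() gives 0x00008067 and encode_li_small(10, 0) gives 0x00000513, which are the words listed for ret and li a0,0 in the Hello World object code.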
Figure 3.5 shows the classic “Hello world” program in C. The compiler produces the
assembly language output in Figure 3.6 using the calling convention in Figure 3.2 and the
pseudoinstructions from Figures 3.3 and 3.4.

The “Hello world” program is typically the first program run on a newly designed processor.
Architects traditionally consider running the operating system well enough to print “Hello
world” as a strong sign that their new chip largely works. They email this output to their
management and colleagues, and then they celebrate.

The commands that start with a period are assembler directives. They are commands to
the assembler rather than code to be translated by it. They tell the assembler where to place
code and data, specify text and data constants for use in the program, and so forth. Figure 3.9
shows the assembler directives of RISC-V. For Figure 3.6, the directives are:
• .text: Enter text section.
• .align 2: Align following code to 2^2 bytes.
• .globl main: Declare global symbol “main”.
• .section .rodata: Enter read-only data section.
Figure 3.3: 32 RISC-V pseudoinstructions that rely on x0, the zero register. Appendix A includes
the RISC-V pseudoinstructions as well as the real instructions. Those that read the 64-bit counters can read
the upper 32 bits in RV32I by using the “h” version of the instructions and the lower 32 bits using the plain
version. (Tables 20.2 and 20.3 of [Waterman and Asanović 2017] are the basis of this figure.)
Figure 3.4: 28 RISC-V pseudoinstructions that are independent of x0, the zero register. For la, GOT stands
for Global Offset Table, which holds the runtime address of symbols in dynamically linked libraries.
Appendix A includes the RISC-V pseudoinstructions as well as the real instructions. (Tables 20.2 and 20.3 of
[Waterman and Asanović 2017] are the basis of this figure.)
#include <stdio.h>
int main()
{
printf("Hello, %s\n", "world");
return 0;
}
00000000 <main>:
0: ff010113 addi sp,sp,-16
4: 00112623 sw ra,12(sp)
8: 00000537 lui a0,0x0
c: 00050513 mv a0,a0
10: 000005b7 lui a1,0x0
14: 00058593 mv a1,a1
18: 00000097 auipc ra,0x0
1c: 000080e7 jalr ra
20: 00c12083 lw ra,12(sp)
24: 01010113 addi sp,sp,16
28: 00000513 li a0,0
2c: 00008067 ret
Figure 3.7: Hello World program in RISC-V machine language (hello.o). The six instructions that are later
patched by the linker (locations 8 to 1c) have zero in their address fields. The symbol table included in the
object file records the labels and addresses of all the instructions that need to be edited by the linker.
000101b0 <main>:
101b0: ff010113 addi sp,sp,-16
101b4: 00112623 sw ra,12(sp)
101b8: 00021537 lui a0,0x21
101bc: a1050513 addi a0,a0,-1520 # 20a10 <string1>
101c0: 000215b7 lui a1,0x21
101c4: a1c58593 addi a1,a1,-1508 # 20a1c <string2>
101c8: 288000ef jal ra,10450 <printf>
101cc: 00c12083 lw ra,12(sp)
101d0: 01010113 addi sp,sp,16
101d4: 00000513 li a0,0
101d8: 00008067 ret
Figure 3.8: Hello World program as RISC-V machine language program after linking. In Unix systems, the
file would be named a.out.
Directive          Description
.text              Subsequent items are stored in the text section (machine code).
.data              Subsequent items are stored in the data section (global variables).
.bss               Subsequent items are stored in the bss section (global variables initialized to 0).
.section .foo      Subsequent items are stored in the section named .foo.
.align n           Align the next datum on a 2^n-byte boundary. For example, .align 2 aligns the next value on a word boundary.
.balign n          Align the next datum on an n-byte boundary. For example, .balign 4 aligns the next value on a word boundary.
.globl sym         Declare that label sym is global and may be referenced from other files.
.string "str"      Store the string str in memory and null-terminate it.
.byte b1,...,bn    Store the n 8-bit quantities in successive bytes of memory.
.half w1,...,wn    Store the n 16-bit quantities in successive memory halfwords.
.word w1,...,wn    Store the n 32-bit quantities in successive memory words.
.dword w1,...,wn   Store the n 64-bit quantities in successive memory doublewords.
.float f1,...,fn   Store the n single-precision floating-point numbers in successive memory words.
.double d1,...,dn  Store the n double-precision floating-point numbers in successive memory doublewords.
.option rvc        Compress subsequent instructions (see Chapter 7).
.option norvc      Don’t compress subsequent instructions.
.option relax      Allow linker relaxations for subsequent instructions.
.option norelax    Don’t allow linker relaxations for subsequent instructions.
.option pic        Subsequent instructions are position-independent code.
.option nopic      Subsequent instructions are position-dependent code.
.option push       Push the current setting of all .options to a stack, so that a subsequent .option pop will restore their value.
.option pop        Pop the option stack, restoring all .options to their setting at the time of the last .option push.
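The difference between .align n and .balign n in the table above is only in how the boundary is written. A minimal C sketch of the address arithmetic follows; the function names are hypothetical, and the sketch assumes the .balign argument is a power of two.

```c
#include <stdint.h>

/* Like .balign n: round addr up to the next multiple of n bytes.
   Assumption for this sketch: n is a power of two. */
uint32_t balign_up(uint32_t addr, uint32_t n)
{
    return (addr + (n - 1u)) & ~(n - 1u);
}

/* Like .align n: round addr up to the next multiple of 2^n bytes. */
uint32_t align_up(uint32_t addr, unsigned n)
{
    return balign_up(addr, (uint32_t)1u << n);
}
```

Under these assumptions, align_up(addr, 2) and balign_up(addr, 4) always agree, since both describe a word boundary.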
sp = bfff fff0hex     Stack
                      Dynamic data
1000 0000hex          Static data
pc = 0001 0000hex     Text
0                     Reserved
Figure 3.10: RV32I allocation of memory to program and data. The high addresses are the top of the figure
and the low addresses are the bottom. In this RISC-V software convention, the stack pointer (sp) starts at
bfff fff0hex and grows down toward the Static data. The text (program code) starts at 0001 0000hex and
includes the statically-linked libraries. The Static data starts immediately above the text region; in this
example, we assume that address is 1000 0000hex . Dynamic data, allocated in C by malloc(), is just above
the Static data. Called the heap, it grows upward toward the stack. It includes the dynamically-linked
libraries.
3.4 Linker
Rather than compile all the source code every time one file changes, the linker allows individ-
ual files to be compiled and assembled separately. It “stitches” the new object code together
with existing machine language modules, such as libraries. It derives its name from one of
its tasks, which is to edit all the links of the jump and link instructions in the object file. In
fact, linker is short for link editor, which was the historical name of this step in Figure 3.1. In
Unix systems, the inputs to the linker are files with the .o suffix (e.g., foo.o, libc.o), and
its output is the a.out file. For MS-DOS, the inputs are files with the suffix .OBJ or .LIB
and the output is a .EXE file.
Figure 3.10 shows the addresses of the regions of memory allocated for code and data
in a typical RISC-V program. The linker must adjust the program and data addresses of
instructions in all the object files to match addresses in this figure. It is less work for the
linker if the input files are position independent code (PIC). PIC means that all the branches
to instructions and references to data inside the file are correct wherever the code is placed.
As mentioned in Chapter 2, the PC-relative branch of RV32I makes PIC much easier to fulfill.
In addition to the instructions, each object file contains a symbol table that includes all
the labels in the program that must be given addresses as part of the linking process. This list
includes labels to data as well as to code. Figure 3.6 has two data labels to be set (string1
and string2) and two code labels to be assigned (main and printf). Since it’s hard to
specify a 32-bit address within a single 32-bit instruction, the linker must adjust two instruc-
tions per label in the RV32I code, as Figure 3.6 shows: lui and addi for data addresses, and
auipc and jalr for code addresses. Figure 3.8 shows the final linked a.out version of the
object file in Figure 3.7.
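To make the two-instructions-per-label point concrete, here is a hedged C sketch of how a 32-bit address can be split into the 20-bit value for lui and the signed 12-bit immediate for the following addi. Because addi sign-extends its immediate, the upper part is rounded up whenever the low 12 bits are 0x800 or more. The helper names are hypothetical; a real linker does this through relocation records rather than functions like these.

```c
#include <stdint.h>

/* Upper 20 bits for lui, rounded so that adding the sign-extended
   low part lands exactly on addr. */
uint32_t hi20(uint32_t addr)
{
    return (addr + 0x800u) >> 12;
}

/* Signed 12-bit immediate for the addi that follows the lui. */
int32_t lo12(uint32_t addr)
{
    int32_t low = (int32_t)(addr & 0xfffu);
    return low >= 0x800 ? low - 0x1000 : low;
}

/* The value the lui/addi pair produces at run time (modulo 2^32). */
uint32_t lui_addi(uint32_t hi, int32_t lo)
{
    return (hi << 12) + (uint32_t)lo;
}

/* Offset for a PC-relative jump from the instruction at pc to target. */
int32_t pc_relative_offset(uint32_t target, uint32_t pc)
{
    return (int32_t)(target - pc);
}
```

For the linked Hello World program, hi20(0x20a10) is 0x21 and lo12(0x20a10) is -1520, matching lui a0,0x21 and addi a0,a0,-1520 in Figure 3.8. Likewise pc_relative_offset(0x10450, 0x101c8) is 0x288, the offset visible in the encoding of the jal to printf.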
RISC-V compilers support several ABIs, depending on whether the F and D extensions
are present. For RV32, the ABIs are named ilp32, ilp32f, and ilp32d. ilp32 means that the
C language data types int, long, and pointer are all 32 bits; the optional suffix indicates how
floating-point arguments are passed. In ilp32, floating-point arguments are passed in integer
registers. In ilp32f, single-precision floating-point arguments are passed in floating-point
registers. In ilp32d, double-precision floating-point arguments are also passed in floating-
point registers.
Naturally, to pass a floating-point argument in a floating-point register, you need the cor-
responding floating-point ISA extension F or D (see Chapter 5). So, to compile code for
RV32I (GCC flag -march=rv32i), you must use the ilp32 ABI (GCC flag -mabi=ilp32).
On the other hand, having floating-point instructions doesn’t mean the calling convention is
required to use them; so, for example, RV32IFD is compatible with all three ABIs: ilp32,
ilp32f, and ilp32d.
The linker checks that the program’s ABI matches all of its libraries. Although the com-
piler supports many combinations of ISA extensions and ABIs, only a few sets of libraries
might be installed. Hence, a common pitfall is attempting to link a program without having
compatible libraries installed. The linker will not produce a helpful diagnostic message in
this case; it will simply attempt to link with an incompatible library, then inform you of the
incompatibility. This pitfall generally occurs only when compiling on one computer for a
different computer (cross compiling).
3.5 Static vs. Dynamic Linking
The prior section describes static linking, where all potential library code is linked and then
loaded together before execution. Such libraries can be relatively large, so linking a popular
library into multiple programs wastes memory. Moreover, the libraries are bound when
linked, even when they are updated later to fix bugs, forcing the statically-linked code to
use the old, buggy version.
To avoid both problems, most systems today rely on dynamic linking, where the desired
external function is loaded and linked to the program only after it is first called; if it’s never
called, it’s never loaded and linked. Every call after the first one uses a fast link, so the
dynamic overhead is only paid once. Each time a program starts it links in the current version
of the library functions it needs, which is how it can get the newest version. Furthermore,
if multiple programs use the same dynamically linked library, the code in the library need
appear only once in memory.

Architects typically measure processor performance using benchmarks that are statically
linked despite most real programs having dynamic links. The excuse is that users interested
in performance should link statically, but it’s a poor justification. It makes more sense to
accelerate performance of real programs, not benchmarks.
The code that the compiler generates resembles that for static linking. Instead of jumping
to the real function, it jumps to a short (three-instruction) stub function. The stub function
loads the address of the real function from a table in memory, then jumps to it. However, on
the first call, the table lacks the address of the real function, but instead contains the address
of the dynamic-linking routine. When invoked, the dynamic linker uses the symbol table to
find the real function, copies it into memory, and then updates the table to point to the real
function. Each subsequent call pays only the three-instruction overhead of the stub function.
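The stub-and-table mechanism just described can be imitated in C with a function pointer. This is only a sketch of the idea under simplifying assumptions: there is a single table entry, the "library" function real_square and all other names are hypothetical, and the resolver already knows which function to bind instead of searching a symbol table and loading code.

```c
typedef int (*func_ptr)(int);

/* Hypothetical library function the program ultimately wants. */
static int real_square(int x)
{
    return x * x;
}

static int resolve_and_call(int x);

/* One-entry table. Before the first call it holds the address of the
   dynamic-linking routine rather than the real function. */
static func_ptr table_entry = resolve_and_call;
static int resolver_runs = 0;

/* Stand-in for the dynamic linker: bind the real function, update the
   table, then complete the call that triggered the lookup. */
static int resolve_and_call(int x)
{
    resolver_runs++;
    table_entry = real_square;
    return table_entry(x);
}

/* The stub: load the target address from the table and jump to it. */
int square_stub(int x)
{
    return table_entry(x);
}

/* How many times the slow path ran (exposed for illustration only). */
int resolver_run_count(void)
{
    return resolver_runs;
}
```

The first call to square_stub takes the slow path through the resolver; every later call pays only the indirect jump through the table, which is the point of the design.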
3.6 Loader
A program like the one in Figure 3.8 is an executable file kept in the computer’s storage.
When one is to be run, the loader’s job is to load it into memory and jump to the starting
address. The “loader” today is the operating system; stated alternatively, loading a.out is
one of many tasks of an operating system.
Loading is a little trickier for dynamically-linked programs. Instead of simply starting the
program, the operating system starts the dynamic linker. It in turn starts the desired program,
and then handles all first-time external calls, copies the functions into memory, and edits the
program after each call to point it to the correct function.
The assembler enhances the simple RISC-V ISA with 60 pseudoinstructions that make
RISC-V code easier to read and to write without increasing hardware costs. Simply dedicat-
ing one RISC-V register to zero enables many of these helpful operations. The Load Upper
Immediate (lui) and Add Upper Immediate to PC (auipc) instructions make it easier for
the compiler and linker to adjust addresses for external data and functions, and PC-relative
branching makes it easier to help the linker with position-independent code. Having plenty
of registers enables a calling convention that makes function call and return faster by reducing
the number of register spills and restores.
RISC-V offers a tasteful collection of simple but impactful mechanisms that reduce cost,
improve performance, and make it easier to program.
Notes
1 http://parlab.eecs.berkeley.edu
4 RV32M: Multiply and Divide
Dividend = Quotient × Divisor + Remainder
or alternatively
Remainder = Dividend − (Quotient × Divisor)
RV32M has divide instructions for both signed and unsigned integers: divide (div) and
divide unsigned (divu), which place the quotient into the destination register. Less frequently,
programmers want the remainder instead of the quotient, so RV32M offers remainder (rem)
and remainder unsigned (remu), which write the remainder instead of the quotient.

srl can do unsigned division by 2^i. For example, if a2 = 16 (2^4) then srli t2,a1,4
produces the same value as divu t2,a1,a2.
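The relationship between quotient and remainder, and the shift shortcut for powers of two, can be checked with a few lines of C. This sketch relies on C's / operator truncating toward zero, which is also how signed division behaves in RV32M for nonzero divisors without overflow; the function names are hypothetical.

```c
#include <stdint.h>

/* Remainder = Dividend - (Quotient * Divisor).
   Assumes divisor != 0 and no INT32_MIN / -1 overflow. */
int32_t remainder_from_quotient(int32_t dividend, int32_t divisor)
{
    int32_t quotient = dividend / divisor;
    return dividend - quotient * divisor;
}

/* Unsigned division by 2^i using a logical right shift, as srli does.
   Assumes i is between 0 and 31. */
uint32_t divu_pow2(uint32_t dividend, unsigned i)
{
    return dividend >> i;
}
```

With truncating division the remainder takes the sign of the dividend, so remainder_from_quotient(-17, 5) is -2 rather than 3.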
RV32M
multiply
multiply high {_, unsigned, signed unsigned}
{divide, remainder} {_, unsigned}
31 25 24 20 19 15 14 12 11 7 6 0
0000001 rs2 rs1 000 rd 0110011 R mul
0000001 rs2 rs1 001 rd 0110011 R mulh
0000001 rs2 rs1 010 rd 0110011 R mulhsu
0000001 rs2 rs1 011 rd 0110011 R mulhu
0000001 rs2 rs1 100 rd 0110011 R div
0000001 rs2 rs1 101 rd 0110011 R divu
0000001 rs2 rs1 110 rd 0110011 R rem
0000001 rs2 rs1 111 rd 0110011 R remu
Figure 4.2: RV32M opcode map has instruction layout, opcodes, format type, and names. (Table 19.2 of
[Waterman and Asanović 2017] is the basis of this figure.)
Figure 4.3: RV32M code to divide by a constant by multiplying. It takes careful numerical analysis to show
that this algorithm works for any dividend, and for some other divisors, the correction step is more
complicated. The proof of correctness, and the algorithm for generating the reciprocals and correction steps,
is in [Granlund and Montgomery 1994].
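As one concrete instance of dividing by a constant through multiplication, the C sketch below divides an unsigned 32-bit value by 3. The choice of divisor and the constant 0xAAAAAAAB, which is (2^33 + 1) / 3, are assumptions made for this illustration and are not necessarily the values used in the figure. The helper mulhu32 mimics what mulhu returns.

```c
#include <stdint.h>

/* High 32 bits of the 64-bit unsigned product, as mulhu would return. */
uint32_t mulhu32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* x / 3 for unsigned 32-bit x: multiply by the scaled reciprocal
   (2^33 + 1) / 3, keep the high word, then shift right by one more bit. */
uint32_t divu_by_3(uint32_t x)
{
    return mulhu32(x, 0xAAAAAAABu) >> 1;
}
```

For this divisor no correction step is needed; other divisors may need one, as the caption notes.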
To offer the smallest RISC-V processor for embedded applications, multiply and divide are
part of the first optional standard extension of RISC-V. Nevertheless, many RISC-V proces-
sors will include RV32M.
A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Version 2.2. May 2017. URL https://riscv.org/specifications/.
5 RV32F and RV32D: Single- and
Double-Precision Floating Point
Perfection is finally attained not when there is no longer anything to add, but when there
is no longer anything to take away.
—Antoine de Saint Exupéry, L’Avion, 1940

Antoine de Saint Exupéry (1900-1944) was a French writer and aviator best known for the
book The Little Prince.
5.1 Introduction
Although RV32F and RV32D are separate, optional instruction set extensions, they are often
included together. Given single- and double-precision (32- and 64-bit) versions of nearly
all floating-point instructions, for brevity we present them in one chapter. Figure 5.1 is a
graphical representation of the RV32F and RV32D extension instruction sets. Figure 5.2 lists
the opcodes of RV32F and Figure 5.3 lists the opcodes of RV32D. Like virtually all other
modern ISAs, RISC-V obeys the IEEE 754-2008 floating-point standard [IEEE Standards
Committee 2008].
into the left and right 32-bit halves of a double-precision register. x86-32 floating-point
arithmetic didn’t have any registers, but used a stack instead. The stack entries were 80 bits
wide to improve accuracy, so loads convert 32-bit or 64-bit operands to 80 bits, and vice versa
for stores. A subsequent version of x86-32 added 8 traditional 64-bit floating-point registers
and associated instructions. Unlike RV32FD and MIPS-32, ARM-32 and x86-32 overlooked
instructions to move data directly between floating-point and integer registers. The only
solution is to store a floating-point register in memory and then load it from memory to an
integer register, and vice versa.

Having only 16 double-precision registers was the most painful ISA error in MIPS according
to John Mashey, one of its architects.
31 27 26 25 24 20 19 15 14 12 11 7 6 0
imm[11:0] rs1 010 rd 0000111 I flw
imm[11:5] rs2 rs1 010 imm[4:0] 0100111 S fsw
rs3 00 rs2 rs1 rm rd 1000011 R4 fmadd.s
rs3 00 rs2 rs1 rm rd 1000111 R4 fmsub.s
rs3 00 rs2 rs1 rm rd 1001011 R4 fnmsub.s
rs3 00 rs2 rs1 rm rd 1001111 R4 fnmadd.s
0000000 rs2 rs1 rm rd 1010011 R fadd.s
0000100 rs2 rs1 rm rd 1010011 R fsub.s
0001000 rs2 rs1 rm rd 1010011 R fmul.s
0001100 rs2 rs1 rm rd 1010011 R fdiv.s
0101100 00000 rs1 rm rd 1010011 R fsqrt.s
0010000 rs2 rs1 000 rd 1010011 R fsgnj.s
0010000 rs2 rs1 001 rd 1010011 R fsgnjn.s
0010000 rs2 rs1 010 rd 1010011 R fsgnjx.s
0010100 rs2 rs1 000 rd 1010011 R fmin.s
0010100 rs2 rs1 001 rd 1010011 R fmax.s
1100000 00000 rs1 rm rd 1010011 R fcvt.w.s
1100000 00001 rs1 rm rd 1010011 R fcvt.wu.s
1110000 00000 rs1 000 rd 1010011 R fmv.x.w
1010000 rs2 rs1 010 rd 1010011 R feq.s
1010000 rs2 rs1 001 rd 1010011 R flt.s
1010000 rs2 rs1 000 rd 1010011 R fle.s
1110000 00000 rs1 001 rd 1010011 R fclass.s
1101000 00000 rs1 rm rd 1010011 R fcvt.s.w
1101000 00001 rs1 rm rd 1010011 R fcvt.s.wu
1111000 00000 rs1 000 rd 1010011 R fmv.w.x
Figure 5.2: RV32F opcode map has instruction layout, opcodes, format type, and names. The primary
difference in the encodings between this and the next figure is bit 12 is a 0 for the first two instructions and
bit 25 is a 0 for the rest of the instructions where both bits are 1 in RV32D. (Table 19.2 of [Waterman and
Asanović 2017] is the basis of this figure.)
5.3. FLOATING-POINT LOADS, STORES, AND ARITHMETIC 51
31 27 26 25 24 20 19 15 14 12 11 7 6 0
imm[11:0] rs1 011 rd 0000111 I fld
imm[11:5] rs2 rs1 011 imm[4:0] 0100111 S fsd
rs3 01 rs2 rs1 rm rd 1000011 R4 fmadd.d
rs3 01 rs2 rs1 rm rd 1000111 R4 fmsub.d
rs3 01 rs2 rs1 rm rd 1001011 R4 fnmsub.d
rs3 01 rs2 rs1 rm rd 1001111 R4 fnmadd.d
0000001 rs2 rs1 rm rd 1010011 R fadd.d
0000101 rs2 rs1 rm rd 1010011 R fsub.d
0001001 rs2 rs1 rm rd 1010011 R fmul.d
0001101 rs2 rs1 rm rd 1010011 R fdiv.d
0101101 00000 rs1 rm rd 1010011 R fsqrt.d
0010001 rs2 rs1 000 rd 1010011 R fsgnj.d
0010001 rs2 rs1 001 rd 1010011 R fsgnjn.d
0010001 rs2 rs1 010 rd 1010011 R fsgnjx.d
0010101 rs2 rs1 000 rd 1010011 R fmin.d
0010101 rs2 rs1 001 rd 1010011 R fmax.d
0100000 00001 rs1 rm rd 1010011 R fcvt.s.d
0100001 00000 rs1 rm rd 1010011 R fcvt.d.s
1010001 rs2 rs1 010 rd 1010011 R feq.d
1010001 rs2 rs1 001 rd 1010011 R flt.d
1010001 rs2 rs1 000 rd 1010011 R fle.d
1110001 00000 rs1 001 rd 1010011 R fclass.d
1100001 00000 rs1 rm rd 1010011 R fcvt.w.d
1100001 00001 rs1 rm rd 1010011 R fcvt.wu.d
1101001 00000 rs1 rm rd 1010011 R fcvt.d.w
1101001 00001 rs1 rm rd 1010011 R fcvt.d.wu
Figure 5.3: RV32D opcode map has instruction layout, opcodes, format type, and names. Some
instructions in these two figures do not simply differ by data width. This figure uniquely has fcvt.s.d and
fcvt.d.s while the other has fmv.x.w and fmv.w.x. (Table 19.2 of [Waterman and Asanović 2017] is the
basis of this figure.)
63 32 31 0
f0 / ft0 FP Temporary
f1 / ft1 FP Temporary
f2 / ft2 FP Temporary
f3 / ft3 FP Temporary
f4 / ft4 FP Temporary
f5 / ft5 FP Temporary
f6 / ft6 FP Temporary
f7 / ft7 FP Temporary
f8 / fs0 FP Saved register
f9 / fs1 FP Saved register
f10 / fa0 FP Function argument, return value
f11 / fa1 FP Function argument, return value
f12 / fa2 FP Function argument
f13 / fa3 FP Function argument
f14 / fa4 FP Function argument
f15 / fa5 FP Function argument
f16 / fa6 FP Function argument
f17 / fa7 FP Function argument
f18 / fs2 FP Saved register
f19 / fs3 FP Saved register
f20 / fs4 FP Saved register
f21 / fs5 FP Saved register
f22 / fs6 FP Saved register
f23 / fs7 FP Saved register
f24 / fs8 FP Saved register
f25 / fs9 FP Saved register
f26 / fs10 FP Saved register
f27 / fs11 FP Saved register
f28 / ft8 FP Temporary
f29 / ft9 FP Temporary
f30 / ft10 FP Temporary
f31 / ft11 FP Temporary
32 32
Figure 5.4: The floating-point registers of RV32F and RV32D. The single-precision registers occupy the
rightmost half of the 32 double-precision registers. Chapter 3 explains the RISC-V calling convention for the
floating-point registers, the rationale behind the FP Argument registers (fa0-fa7), FP Saved registers
(fs0-fs11), and FP Temporaries (ft0-ft11). (Table 20.1 of [Waterman and Asanović 2017] is the basis of this
figure.)
5.4. FLOATING-POINT MOVES AND CONVERTS 53
Bits    31-8       7-5                   4    3    2    1    0
Field   Reserved   Rounding Mode (frm)   NV   DZ   OF   UF   NX
Width   24         3                     1    1    1    1    1
Bits 4-0 together form the Accrued Exceptions field (fflags).
Figure 5.5: Floating-point control and status register. It holds the rounding modes and the exception flags.
The rounding modes are round to nearest, ties to even (rte, 000 in frm); round towards zero (rtz, 001); round
down, towards −∞ (rdn, 010); round up, towards +∞ (rup, 011); and round to nearest, ties to max
magnitude (rmm, 100). The five accrued exception flags indicate the exception conditions that have arisen on
any floating-point arithmetic instruction since the field was last reset by software: NV is Invalid Operation;
DZ is Divide by Zero; OF is Overflow; UF is Underflow; and NX is Inexact. (Figure 8.2 of [Waterman and
Asanović 2017] is the basis of this figure.)
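Given the layout in Figure 5.5, pulling the two fields out of a raw register value is a matter of shifts and masks. The C sketch below assumes the value has already been read into an ordinary integer; the function and constant names are hypothetical.

```c
#include <stdint.h>

/* Accrued exception flag masks, bit positions as in Figure 5.5. */
enum {
    FFLAG_NX = 1 << 0,  /* Inexact */
    FFLAG_UF = 1 << 1,  /* Underflow */
    FFLAG_OF = 1 << 2,  /* Overflow */
    FFLAG_DZ = 1 << 3,  /* Divide by Zero */
    FFLAG_NV = 1 << 4   /* Invalid Operation */
};

/* Rounding mode field (frm): bits 7 to 5. */
uint32_t fcsr_frm(uint32_t fcsr)
{
    return (fcsr >> 5) & 0x7u;
}

/* Accrued exceptions field (fflags): bits 4 to 0. */
uint32_t fcsr_fflags(uint32_t fcsr)
{
    return fcsr & 0x1fu;
}
```

For example, a register value of 0x21 decodes to rounding mode 001 (round towards zero) with only the Inexact flag set.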
                             From
To                           32b signed    32b unsigned   32b floating   64b floating
                             integer (w)   integer (wu)   point (s)      point (d)
32b signed integer (w)       –             –              fcvt.w.s       fcvt.w.d
32b unsigned integer (wu)    –             –              fcvt.wu.s      fcvt.wu.d
32b floating point (s)       fcvt.s.w      fcvt.s.wu      –              fcvt.s.d
64b floating point (d)       fcvt.d.w      fcvt.d.wu      fcvt.d.s       –
Figure 5.6: RV32F and RV32D conversion instructions. The columns list the source data types and the rows
show the converted destination data type.
1. Float sign inject (fsgnj.s, fsgnj.d): the result’s sign bit is rs2’s sign bit.
2. Float sign inject negative (fsgnjn.s, fsgnjn.d): the result’s sign bit is the opposite
of rs2’s sign bit.
3. Float sign inject exclusive-or (fsgnjx.s, fsgnjx.d): the sign bit is the XOR of the
sign bits of rs1 and rs2.
As well as helping with sign manipulation in math libraries, sign-injection instructions
provide three popular floating-point pseudoinstructions (see Figure 3.4 on page 37):
1. Copy floating-point register:
fmv.s rd,rs is really fsgnj.s rd,rs,rs and
fmv.d rd,rs is really fsgnj.d rd,rs,rs.
2. Negate:
fneg.s rd,rs maps to fsgnjn.s rd,rs,rs and
fneg.d rd,rs maps to fsgnjn.d rd,rs,rs.
3. Absolute value (since 0 ⊕ 0 = 0 and 1 ⊕ 1 = 0):
fabs.s rd,rs becomes fsgnjx.s rd,rs,rs and
fabs.d rd,rs becomes fsgnjx.d rd,rs,rs.
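The three sign-injection variants and the pseudoinstructions built from them can be modeled in C by operating on the bit pattern of a float. This sketch assumes float is a 32-bit IEEE 754 single-precision value; the function names mirror the instruction names but are otherwise hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define SIGN_BIT 0x80000000u

/* Reinterpret a float as its 32-bit pattern and back. */
static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* fsgnj.s: everything but the sign from rs1, sign from rs2. */
float fsgnj_s(float rs1, float rs2)
{
    return bits_float((float_bits(rs1) & ~SIGN_BIT) | (float_bits(rs2) & SIGN_BIT));
}

/* fsgnjn.s: sign is the opposite of rs2's sign. */
float fsgnjn_s(float rs1, float rs2)
{
    return bits_float((float_bits(rs1) & ~SIGN_BIT) | (~float_bits(rs2) & SIGN_BIT));
}

/* fsgnjx.s: sign is the XOR of the two sign bits. */
float fsgnjx_s(float rs1, float rs2)
{
    return bits_float(float_bits(rs1) ^ (float_bits(rs2) & SIGN_BIT));
}

/* The pseudoinstructions pass the same register as both sources. */
float fmv_s(float rs)  { return fsgnj_s(rs, rs); }
float fneg_s(float rs) { return fsgnjn_s(rs, rs); }
float fabs_s(float rs) { return fsgnjx_s(rs, rs); }
```

With both operands equal, the XOR of the sign bits is always 0, which is why fsgnjx yields the absolute value.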
The second unusual floating-point instruction is classify (fclass.s, fclass.d).
Classify instructions are also a great aid to math libraries. They test a source operand to see
which of 10 floating-point properties apply (see the table below), and then write a mask into
the lower 10 bits of the destination integer register with the answer. Only one of the ten bits
is set to 1, with the rest set to 0s.
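A software model of single-precision classify might look like the C sketch below. The bit assignments (bit 0 for negative infinity up through bit 9 for quiet NaN) follow our reading of the RISC-V ISA manual and should be checked against it; the sketch also assumes float is a 32-bit IEEE 754 value, and the function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Return a 10-bit mask with exactly one bit set, describing value:
   0: -infinity        5: positive subnormal
   1: negative normal  6: positive normal
   2: negative subnormal  7: +infinity
   3: -0               8: signaling NaN
   4: +0               9: quiet NaN */
uint32_t fclass_s(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);

    uint32_t sign = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xffu;
    uint32_t fraction = bits & 0x7fffffu;

    if (exponent == 0xffu) {
        if (fraction == 0)
            return sign ? 1u << 0 : 1u << 7;              /* infinities */
        return (fraction & 0x400000u) ? 1u << 9 : 1u << 8; /* quiet or signaling NaN */
    }
    if (exponent == 0) {
        if (fraction == 0)
            return sign ? 1u << 3 : 1u << 4;              /* zeros */
        return sign ? 1u << 2 : 1u << 5;                  /* subnormals */
    }
    return sign ? 1u << 1 : 1u << 6;                      /* normal numbers */
}
```

A math library can then test for a whole group of cases with one AND, for example masking with bits 8 and 9 to detect any NaN.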
Figure 5.8: Number of instructions and code size of DAXPY for four ISAs. It lists number of instructions per
loop and total. Chapter 7 describes ARM Thumb-2, microMIPS, and RV32C.
The IEEE 754-2008 floating-point standard [IEEE Standards Committee 2008] defines the
floating-point data types, the accuracy of computation, and the required operations. Its suc-
cess greatly reduces the difficulty of porting floating-point programs, and it also means that
the floating-point ISAs are probably more uniform than are the equivalent in other chapters.
Figure 5.9: RV32D code for DAXPY in Figure 5.7. The address in hexadecimal is on the left, the machine
language code in hexadecimal is next, and then the assembly language instruction followed by a comment.
The compare-and-branch instructions avoid the two compare instructions in the code of ARM-32 and
x86-32.
Figure 5.10: ARM-32 code for DAXPY in Figure 5.7. The autoincrement addressing mode of ARM-32 saves
two instructions as compared to RISC-V. Unlike Insertion Sort, there is no need to push and pop registers for
DAXPY in ARM-32.
Figure 5.11: MIPS-32 code for DAXPY in Figure 5.7. Two of the three branch delay slots are filled with
useful instructions. The ability to check for equality between two registers avoids the two compare
instructions found in ARM-32 and x86-32. Unlike integer loads, floating-point loads have no delay slot.
Figure 5.12: x86-32 code for DAXPY in Figure 5.7. The lack of x86-32 registers is evident in this example,
with four variables allocated to memory that are in registers in the code for the other ISAs. It also
demonstrates x86-32 idioms to compare a register to zero (test ecx,ecx) or to set a register to zero (xor
eax,eax).
6 RV32A: Atomic Instructions
Everything should be made as simple as possible, but no simpler.
—Albert Einstein, 1933

Albert Einstein (1879-1955) was the most famous scientist of the 20th century. He invented
the theory of relativity and advocated building the atomic bomb for World War II.

6.1 Introduction
Our assumption is that you already understand ISA support for multiprocessing, so our job is
just to explain the RV32A instructions and what they do. If you don’t feel you have sufficient
background or need a reminder, study “synchronization (computer science)” on Wikipedia
(https://en.wikipedia.org/wiki/Synchronization_(computer_science)) or read Section 2.1 of
our related RISC-V architecture book [Patterson and Hennessy 2017].
RV32A has two types of atomic operations for synchronization:
• atomic memory operations (AMO), and
• load reserved / store conditional.
Figure 6.1 is a graphical representation of the RV32A extension instruction set and Figure 6.2
lists their opcodes and instruction formats.
RV32A
atomic memory operation {add, and, or, swap, xor, maximum, maximum unsigned, minimum, minimum unsigned} .word
31 25 24 20 19 15 14 12 11 7 6 0
00010 aq rl 00000 rs1 010 rd 0101111 R lr.w
00011 aq rl rs2 rs1 010 rd 0101111 R sc.w
00001 aq rl rs2 rs1 010 rd 0101111 R amoswap.w
00000 aq rl rs2 rs1 010 rd 0101111 R amoadd.w
00100 aq rl rs2 rs1 010 rd 0101111 R amoxor.w
01100 aq rl rs2 rs1 010 rd 0101111 R amoand.w
01000 aq rl rs2 rs1 010 rd 0101111 R amoor.w
10000 aq rl rs2 rs1 010 rd 0101111 R amomin.w
10100 aq rl rs2 rs1 010 rd 0101111 R amomax.w
11000 aq rl rs2 rs1 010 rd 0101111 R amominu.w
11100 aq rl rs2 rs1 010 rd 0101111 R amomaxu.w
Figure 6.2: RV32A opcode map has instruction layout, opcodes, format type, and names. (Table 19.2 of
[Waterman and Asanović 2017] is the basis of this figure.)
The AMO instructions atomically perform an operation on an operand in memory and set
the destination register to the original memory value. Atomic means there can be no interrupt
between the read and the write of memory, nor could other processors modify the memory
value between the memory read and write of the AMO instruction.
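From C, the closest portable analogue to an AMO is a C11 atomic read-modify-write operation, which likewise updates memory and hands back the original value. The sketch below is an illustration under that analogy; compilers targeting RV32A typically map such calls onto AMO instructions, but that mapping is an assumption here, and the wrapper names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Analogue of amoadd.w: add to the word in memory, return its old value. */
int32_t amo_add(_Atomic int32_t *address, int32_t increment)
{
    return atomic_fetch_add(address, increment);
}

/* Analogue of amoswap.w: store a new value, return the old one. */
int32_t amo_swap(_Atomic int32_t *address, int32_t new_value)
{
    return atomic_exchange(address, new_value);
}
```

Because the old value comes back in the same indivisible step, a caller can, for instance, use amo_add to hand out unique ticket numbers to several threads.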
Load reserved and store conditional provide an atomic operation across two instructions.
Load reserved reads a word from memory, writes it to the destination register, and records a
reservation on that word in memory. Store conditional stores a word at the address in a source
register provided there exists a load reservation on that memory address. It writes zero to the
destination register if the store succeeded, or a nonzero error code otherwise.

AMOs and LR/SC require naturally aligned memory addresses because it is onerous for
hardware to guarantee atomicity across cache-line boundaries.
An obvious question is: Why does RV32A have two ways to perform atomic operations?
The answer is that there are two quite distinct use cases.
Programming language developers assume the underlying architecture can perform an
atomic compare-and-swap operation: Compare a register value to a value in memory ad-
dressed by another register, and if they are equal, then swap a third register value with the one
in memory. They make that assumption because it is a universal synchronization primitive,
in that any other single-word synchronization operation can be synthesized from compare-
and-swap [Herlihy 1991].
While that is a powerful argument for adding such an instruction to an ISA, it requires
three source registers in one instruction. Alas, going from two to three source operands
would complicate the integer datapath, control, and the instruction format. (The three source
operands of RV32FD’s multiply-add instructions affect the floating-point datapath, not the
integer datapath.) Fortunately, load reserved and store conditional have only two source
registers and can implement atomic compare and swap (see top half of Figure 6.3).
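As a rough illustration of how load reserved and store conditional can build compare-and-swap, here is a toy Python model. The `Hart` class, its single-reservation bookkeeping, and the `remote_store` helper are assumptions made for this sketch; they only approximate how a real implementation tracks reservations.

```python
class Hart:
    """Toy model of one hart with at most one outstanding reservation."""

    def __init__(self, mem):
        self.mem = mem
        self.reservation = None

    def lr_w(self, addr):
        """Load reserved: read the word and record a reservation on it."""
        self.reservation = addr
        return self.mem[addr]

    def sc_w(self, addr, value):
        """Store conditional: return 0 on success, nonzero on failure."""
        ok = self.reservation == addr
        self.reservation = None
        if ok:
            self.mem[addr] = value
            return 0
        return 1

    def remote_store(self, addr, value):
        """Model another hart writing addr, which cancels our reservation."""
        self.mem[addr] = value
        if self.reservation == addr:
            self.reservation = None


def compare_and_swap(hart, addr, expected, desired):
    """Return True if mem[addr] equaled expected and was replaced by desired."""
    while True:
        old = hart.lr_w(addr)
        if old != expected:
            return False          # value does not match, so fail
        if hart.sc_w(addr, desired) == 0:
            return True           # store succeeded
        # otherwise the reservation was lost, so retry
```

Only two source values are needed per step (an address for the load, an address plus a value for the store), which is the point the text makes about avoiding a three-source instruction.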
The rationale for also having AMO instructions is that they scale better to large multipro-
cessor systems than load reserved and store conditional. They can also be used to implement
reduction operations efficiently. AMOs are useful as well for communicating with I/O de-
vices, because they perform a read and a write in a single atomic bus transaction. This
atomicity can both simplify device drivers and improve I/O performance. The bottom half of
Figure 6.3 shows how to write a critical section using atomic swap.
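The test-and-set idea behind a critical section guarded by atomic swap can be sketched in Python as follows. This is a model under stated assumptions: the `Word` class uses an internal `threading.Lock` purely as a stand-in for the hardware's atomic bus transaction, and `acquire`, `release`, and `run_workers` are hypothetical helper names.

```python
import threading
import time


class Word:
    """A memory word whose amoswap is atomic. The internal lock only
    stands in for the atomicity that hardware provides."""

    def __init__(self, value=0):
        self.value = value
        self._hw = threading.Lock()

    def amoswap(self, new):
        with self._hw:
            old, self.value = self.value, new
            return old


def acquire(lock_word):
    # Swap in 1; if the old value was nonzero the lock was held, so retry.
    while lock_word.amoswap(1) != 0:
        time.sleep(0)  # let other threads run while we spin


def release(lock_word):
    lock_word.amoswap(0)


def run_workers(n_threads=3, n_iters=50):
    lock_word = Word(0)
    shared = {"count": 0}

    def worker():
        for _ in range(n_iters):
            acquire(lock_word)
            tmp = shared["count"]        # critical section: a non-atomic
            shared["count"] = tmp + 1    # read-modify-write
            release(lock_word)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]
```

With the lock in place, every increment is preserved, so `run_workers()` returns `n_threads * n_iters`.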
Figure 6.3: Two examples of synchronization. The first uses load reserved/store conditional lr.w,sc.w to
implement compare-and-swap, and the second uses an atomic swap amoswap.w to implement a mutex.
What’s Different? The original MIPS-32 had no mechanism for synchronization, but
architects added load reserved / store conditional instructions to a later MIPS ISA.
A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Version 2.2. May 2017. URL https://riscv.org/specifications/.
Notes
1 http://parlab.eecs.berkeley.edu
7 RV32C: Compressed Instructions
[Diagram of the RV32C instructions, grouped into Integer Computation, Control transfer, Loads and
Stores, and Other instructions.]
Figure 7.1: Diagram of the RV32C instructions. The immediate fields of the shift instructions and
c.addi4spn are zero extended and sign extended for the other instructions.
Figure 7.2: Instructions and code size for Insertion Sort and DAXPY for compressed ISAs.
RV32C gives RISC-V one of the smallest code sizes today. You can almost think of
them as hardware-assisted pseudoinstructions. However, now the assembler is hiding them
from the assembly language programmer and compiler writer rather than, as in Chapter 3,
expanding the real instruction set with popular operations that make RISC-V code easier to
use and to read. Both approaches aid programmer productivity.
We consider RV32C as one of RISC-V’s best examples of a simple, powerful mechanism
that improves its cost-performance.
Notes
1 http://parlab.eecs.berkeley.edu
Figure 7.3: RV32C code for Insertion Sort. The twelve 16-bit instructions make the code 32% smaller. The
width of each instruction is evident by the number of hexadecimal characters in the second column. The
RV32C instructions (starting with c.) are shown explicitly in this example, but normally assembly language
programmers and compilers cannot see them.
Figure 7.4: RV32DC code for DAXPY. The eight 16-bit instructions shrink the code by 36%. The width of
each instruction is evident by the number of hexadecimal characters in the second column. The RV32C
instructions (starting with c.) are shown explicitly in this example, but normally they are invisible to the
assembly language programmer and compiler.
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
000 nzimm[5] 0 nzimm[4:0] 01 CI c.nop
000 nzimm[5] rs1/rd≠0 nzimm[4:0] 01 CI c.addi
001 imm[11|4|9:8|10|6|7|3:1|5] 01 CJ c.jal
010 imm[5] rd≠0 imm[4:0] 01 CI c.li
011 nzimm[9] 2 nzimm[4|6|8:7|5] 01 CI c.addi16sp
011 nzimm[17] rd≠{0, 2} nzimm[16:12] 01 CI c.lui
100 nzuimm[5] 00 rs1’/rd’ nzuimm[4:0] 01 CI c.srli
100 nzuimm[5] 01 rs1’/rd’ nzuimm[4:0] 01 CI c.srai
100 imm[5] 10 rs1’/rd’ imm[4:0] 01 CI c.andi
100 0 11 rs1’/rd’ 00 rs2’ 01 CR c.sub
100 0 11 rs1’/rd’ 01 rs2’ 01 CR c.xor
100 0 11 rs1’/rd’ 10 rs2’ 01 CR c.or
100 0 11 rs1’/rd’ 11 rs2’ 01 CR c.and
101 imm[11|4|9:8|10|6|7|3:1|5] 01 CJ c.j
110 imm[8|4:3] rs1’ imm[7:6|2:1|5] 01 CB c.beqz
111 imm[8|4:3] rs1’ imm[7:6|2:1|5] 01 CB c.bnez
Figure 7.5: RV32C opcode map (bits[1 : 0] = 01) lists layout, opcodes, format, and names. rd’, rs1’, and
rs2’ refer to the 10 popular registers a0–a5, s0–s1, sp, and ra. (Table 12.5 of [Waterman and Asanović
2017] is the basis of this figure.)
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
000 0 0 00 CIW Illegal instruction
000 nzuimm[5:4|9:6|2|3] rd’ 00 CIW c.addi4spn
001 uimm[5:3] rs1’ uimm[7:6] rd’ 00 CL c.fld
010 uimm[5:3] rs1’ uimm[2|6] rd’ 00 CL c.lw
011 uimm[5:3] rs1’ uimm[2|6] rd’ 00 CL c.flw
101 uimm[5:3] rs1’ uimm[7:6] rs2’ 00 CL c.fsd
110 uimm[5:3] rs1’ uimm[2|6] rs2’ 00 CL c.sw
111 uimm[5:3] rs1’ uimm[2|6] rs2’ 00 CL c.fsw
Figure 7.6: RV32C opcode map (bits[1 : 0] = 00) lists layout, opcodes, format, and names. rd’, rs1’, and
rs2’ refer to the 10 popular registers a0–a5, s0–s1, sp, and ra. (Table 12.4 of [Waterman and Asanović
2017] is the basis of this figure.)
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
000 nzuimm[5] rs1/rd≠0 nzuimm[4:0] 10 CI c.slli
000 0 rs1/rd≠0 0 10 CI c.slli64
001 uimm[5] rd uimm[4:3|8:6] 10 CSS c.fldsp
010 uimm[5] rd≠0 uimm[4:2|7:6] 10 CSS c.lwsp
011 uimm[5] rd uimm[4:2|7:6] 10 CSS c.flwsp
100 0 rs1≠0 0 10 CJ c.jr
100 0 rd≠0 rs2≠0 10 CR c.mv
100 1 0 0 10 CI c.ebreak
100 1 rs1≠0 0 10 CJ c.jalr
100 1 rs1/rd≠0 rs2≠0 10 CR c.add
101 uimm[5:3|8:6] rs2 10 CSS c.fsdsp
110 uimm[5:2|7:6] rs2 10 CSS c.swsp
111 uimm[5:2|7:6] rs2 10 CSS c.fswsp
Figure 7.7: RV32C opcode map (bits[1 : 0] = 10) lists layout, opcodes, format, and names. (Table 12.6
of [Waterman and Asanović 2017] is the basis of this figure.)
Format Meaning 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
CR Register funct4 rd/rs1 rs2 op
CI Immediate funct3 imm rd/rs1 imm op
CSS Stack-relative Store funct3 imm rs2 op
CIW Wide Immediate funct3 imm rd’ op
CL Load funct3 imm rs1’ imm rd’ op
CS Store funct3 imm rs1’ imm rs2’ op
CB Branch funct3 offset rs1’ offset op
CJ Jump funct3 jump target op
Figure 7.8: Compressed 16-bit RVC instruction formats. rd’, rs1’, and rs2’ refer to the 10 popular
registers a0–a5, s0–s1, sp, and ra. (Table 12.1 of [Waterman and Asanović 2017] is the basis of this figure.)
8 RV32V: Vector
I’m all for simplicity. If it’s very complicated I can’t understand it.
—Seymour Cray
Seymour Cray (1925-1996) was the architect of the Cray-1 in 1976, the first commercially successful
supercomputer using a vector architecture. The Cray-1 was a gem; it was the world’s fastest computer
even without using the vector instructions.
8.1 Introduction
This chapter focuses on data-level parallelism, where there is plenty of data that the desired
application can compute on concurrently. Arrays are a popular example. While fundamental
to scientific applications, multimedia programs use arrays as well. The former uses single-
and double-precision floating-point data and the latter often uses 8- and 16-bit integer data.
The best known architecture for data-level parallelism is Single Instruction Multiple Data
(SIMD). SIMD first became popular by partitioning 64-bit registers into many 8-, 16-, or 32-
bit pieces and then computing on them in parallel. The opcode supplied the data width and
the operation. Data transfers are simply loads and stores of a single (wide) SIMD register.
The first step of partitioning existing 64-bit registers is tempting because it is straightforward.
To make SIMD faster, architects subsequently widen the registers to compute more
partitions concurrently. Because the SIMD ISAs belong to the incremental school of design,
and the opcode specifies the data width, expanding the SIMD registers also expands the
SIMD instruction set. Each subsequent step of doubling the width of SIMD registers and the
number of SIMD instructions leads ISAs down the path of escalating complexity, which is
borne by processor designers, compiler writers, and assembly language programmers.
The Intel Multimedia Extensions (MMX) in 1997 made SIMD popular. They were embraced and
expanded via Streaming SIMD Extensions (SSE) in 1999 and Advanced Vector Extensions (AVX)
in 2010. MMX fame was fueled by an Intel ad campaign showing disco-dancing workers of a
semiconductor line clad in technicolor clean suits
(https://www.youtube.com/watch?v=paU16B-bZEA).
An older and, in our opinion, more elegant alternative to exploit data-level parallelism is
the vector architecture. This chapter provides our rationale for using vectors instead of SIMD
in RISC-V.
Vector computers gather objects from main memory and put them into long, sequential
vector registers. Pipelined execution units compute very efficiently on these vector registers.
Vector architectures then scatter the results back from the vector registers to main memory.
The size of the vector registers is determined by the implementation, rather than baked into
the opcode, as with SIMD. As we shall see, separating the vector length and maximum op-
erations per clock cycle from the instruction encoding is the crux of the vector architecture:
the vector microarchitect can flexibly design the data-parallel hardware without affecting the
programmer, and the programmer can take advantage of longer vectors without rewriting the
code. In addition, vector architectures have many fewer instructions than SIMD architectures.
Moreover, vector architectures have well-established compiler technology, unlike SIMD.
[Diagram of the RV32V instructions, grouped into Computation, Load and Store, Comparison, and
Miscellaneous instructions.]
Figure 8.1: Diagram of the RV32V instructions. Because of dynamic register typing, this instruction
diagram also works without change for RV64V in Chapter 9.
Vector architectures are rarer than SIMD architectures, so fewer readers know vector
ISAs. Thus, this chapter will have a more tutorial flavor than earlier ones. If you want to dig
deeper into vector architectures, read Chapter 4 and Appendix G of [Hennessy and Patterson
2011]. RV32V also has novel features that simplify the ISA, which require more explanation
even if you already are familiar with vector architectures.
structions where the first operand is scalar and the second is vector (.sv suffix). Operations
like Y = a − X use them. They are superfluous for symmetric operations like addition and
multiplication, so those instructions have no .sv version. The fused multiply-add instruc-
tions have three operands, so they have the largest combination of vector and scalar options:
.vvv, .vvs, .vsv, and .vss.
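The role of the operand-order suffixes can be seen in a small Python model. The function names below (`vsub_vv`, `vsub_vs`, `vsub_sv`, `vadd_vs`) are hypothetical names chosen for this sketch and are not RV32V mnemonics; vector registers are modeled as plain lists.

```python
def vsub_vv(x, y):
    """Vector minus vector, element by element."""
    return [a - b for a, b in zip(x, y)]


def vsub_vs(x, s):
    """Vector minus scalar: X - a."""
    return [a - s for a in x]


def vsub_sv(s, x):
    """Scalar minus vector: a - X. Subtraction is not symmetric, so this
    cannot be expressed by swapping the operands of vsub_vs."""
    return [s - a for a in x]


def vadd_vs(x, s):
    """Addition is symmetric, so a scalar-first form would be redundant."""
    return [a + s for a in x]
```

For X = [1, 2, 3] and a = 10, the scalar-first subtract gives [9, 8, 7], while the vector-first form gives [-9, -8, -7].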
Readers may notice that Figure 8.1 ignores the data type and width of the vector opera-
tions. The next section explains why.
Figure 8.2: RV32V encodings of vector register types. The rightmost three bits of the field show the width of
the data, and the two leftmost bits give its type. X64 and U64 are available only for RV64V. F16 and F32
require the RV32F extension and F64 requires RV32F and RV32D. F16 is the IEEE 754-2008 16-bit
floating-point format (binary16). Setting vetype to 00000 disables the vector registers. (Table 17.4 of
[Waterman and Asanović 2017] is the basis of this figure.)
length register vl tells it to stop. The returning data is written into sequential elements of the
destination vector register.
Thus far, we have assumed that the program is working with dense arrays. To support
sparse arrays, vector architectures offer indexed data transfers: vldx and vstx. One source
register for these instructions refers to a vector register and the other to a scalar register. The
scalar register has the starting address of the sparse array, and each element of the vector
register contains the index in bytes of the nonzero elements of the sparse array.
Suppose the starting address in a0 was address 1024, and vector register v1 had these byte
indices in the first 4 elements: 16, 48, 80, 160. vldx v0,a0,v1 would send this sequence of
addresses to memory: 1040 (1024 + 16), 1072 (1024 + 48), 1104 (1024 + 80), 1184 (1024 +
160). It loads the returning data into sequential elements of the destination vector register.
Indexed load is also called gather and indexed store is often named scatter.
We used sparse arrays as our motivation for indexed loads and stores, but there are many
other algorithms that access data indirectly via a table of indices.
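The address arithmetic of the indexed transfers can be reproduced with a short Python sketch. Memory is modeled as a dictionary keyed by byte address, and the function names simply mirror the mnemonics; both are assumptions of the sketch, not a definition of the instructions.

```python
def vldx(memory, base, byte_indices):
    """Indexed load (gather): return the addresses touched and the data
    placed into sequential elements of the destination register."""
    addresses = [base + index for index in byte_indices]
    return addresses, [memory[addr] for addr in addresses]


def vstx(memory, base, byte_indices, values):
    """Indexed store (scatter): write sequential elements of the source
    register to base + index for each index."""
    for index, value in zip(byte_indices, values):
        memory[base + index] = value
```

With a base of 1024 and byte indices 16, 48, 80, 160, the gather touches addresses 1040, 1072, 1104, and 1184, matching the sequence in the text.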
must be set to all ones. To swap one of the other six predicate registers quickly into vp0 or
vp1, RV32V has the vpswap instruction. The predicate registers are also enabled dynami-
cally, and disabling them clears all the predicate registers quickly.
For example, suppose all the even-numbered elements of vector register v3 were negative
integers and all the odd-numbered elements were positive integers. The result of this code:
would set all the even bits of vp0 to 1, all the odd bits to 0, and would replace all the even
elements of v0 with the sum of the corresponding elements of v1 and v2. The odd elements
of v0 would be unchanged.
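The effect described above can be modeled in Python. This is a sketch of the semantics only: the helper names `set_predicate_less_than` and `vadd_predicated` are hypothetical, and lists stand in for the vector and predicate registers.

```python
def set_predicate_less_than(v, scalar):
    """Set predicate bit i to 1 where v[i] < scalar, else 0."""
    return [1 if element < scalar else 0 for element in v]


def vadd_predicated(vdest, va, vb, vp):
    """Write va[i] + vb[i] to element i only where vp[i] is 1;
    elements whose predicate bit is 0 keep their old value."""
    return [a + b if p else d for d, a, b, p in zip(vdest, va, vb, vp)]


v3 = [-1, 1, -2, 2]          # even elements negative, odd elements positive
vp0 = set_predicate_less_than(v3, 0)
v0 = vadd_predicated([0, 0, 0, 0], [1, 2, 3, 4], [10, 20, 30, 40], vp0)
```

Here `vp0` has ones in the even positions, and only the even elements of `v0` receive the sums.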
# vindices holds values from 0..mvl-1 that select elements from vsrc
vselect vdest, vsrc, vindices
Thus, if the first four elements of v2 contain 8, 0, 4, 2, then vselect v0,v1,v2 will replace
the zeroth element of v0 with the eighth element of v1, the first element of v0 with the zeroth
element of v1, the second element of v0 with the fourth element of v1, and the third element
of v0 with the second element of v1.
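A one-line Python model makes the permutation concrete. The list representation and the function name are assumptions of this sketch.

```python
def vselect(vsrc, vindices):
    """Element i of the result is vsrc[vindices[i]]."""
    return [vsrc[index] for index in vindices]


# v1 holds 100, 101, ..., 115 so each value reveals the position it came from.
v1 = list(range(100, 116))
v0 = vselect(v1, [8, 0, 4, 2])
```

The result begins with the eighth, zeroth, fourth, and second elements of `v1`, in that order.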
Vector merge (vmerge) resembles vector select, but it uses a vector predicate register to
choose which of the sources to use. It produces a new result vector by gathering elements
from one of two source registers depending on the predicate register. The new element comes
from vsrc1 if the predicate vector register element is 0 or from vsrc2 if it is 1:
Thus, if the first four elements of vp0 contain 1, 0, 0, 1, the first four elements of v1 contain 1,
2, 3, 4, and the first four elements of v2 contain 10, 20, 30, 40, then vmerge,vp0 v0,v1,v2
will make the first four elements of v0 be 10, 2, 3, 40.
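The merge rule can be checked against the example above with a small Python model; the function name and list representation are assumptions of the sketch.

```python
def vmerge(vsrc1, vsrc2, vp):
    """Take element i from vsrc1 where vp[i] is 0 and from vsrc2 where it is 1."""
    return [b if p else a for a, b, p in zip(vsrc1, vsrc2, vp)]


v0 = vmerge([1, 2, 3, 4], [10, 20, 30, 40], [1, 0, 0, 1])
```

This yields 10, 2, 3, 40, as in the text.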
The vector extract instruction takes elements starting from the middle of one vector and
places these at the beginning of a second vector register:
Figure 8.3: RV32V code for DAXPY in Figure 5.7. The machine language is missing because the RV32V
opcodes are yet to be defined.
For example, if vector length vl is 64 and a0 contains 32, then vextract v0,v1,a0 will
copy the last 32 elements of v1 into the first 32 elements of v0.
The vextract instruction assists reductions by following a recursive-halving approach
for any binary associative operator. For example, to sum all the elements of a vector register,
use vector extract to copy the last half of a vector into the first half of another vector register
and halve the vector length. Next, add these two vector registers together and repeat the
recursive-halving with their sum until vector length equals 1. The result in the zeroth element
will be the sum of all the original elements in the vector register.
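The recursive-halving reduction can be sketched in Python as below. The sketch assumes the vector length is a power of two, models registers as lists, and uses hypothetical names (`vextract`, `reduce_vector`).

```python
import operator


def vextract(vsrc, start, vl):
    """Copy elements start..vl-1 of vsrc to the beginning of a new register."""
    return vsrc[start:vl]


def reduce_vector(v, op=operator.add):
    """Reduce v with a binary associative operator by recursive halving.
    Assumes len(v) is a power of two."""
    vl = len(v)
    while vl > 1:
        half = vl // 2
        upper = vextract(v, half, vl)      # last half moved to the front
        vl = half                          # halve the vector length
        v = [op(v[i], upper[i]) for i in range(vl)]
    return v[0]
```

A 64-element register needs only six extract-and-combine steps rather than 63 sequential additions.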
iteration of the loop. setvl also writes to t0 to help with later loop bookkeeping at location
10.
Vector architectures without setvl have extra strip-mining code to set vl to the last n
elements of the loop and to check if n is initially zero.
The instruction vld at address c is a vector load from the address of x in scalar register
a1. It transfers vl elements of x from memory to v0. The following shift instruction slli
multiplies the vector length by the width of the data in bytes (8) for later use in incrementing
pointers to x and y.
The instruction at address 14 (vld) loads vl elements of y from memory into v1 and the
next instruction (add) increments the pointer to x.
The instruction at address 1c is the jackpot. vfmadd multiplies vl elements of x (v0) by
the scalar a (f0) and adds each product to vl elements of y (v1) and stores those vl sums
back into y (v1).
All that is left is to store the results in memory and some loop overhead. The instruction
at address 20 (sub) decrements n (a0) by vl to record the number of operations completed in
this iteration of the loop. The following instruction (vst) stores vl results into y in memory.
The instruction at address 28 (add) increments the pointer to y and the following instruction
repeats the loop if n (a0) is not zero. If n is zero, the final instruction ret returns to the
calling site.
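The loop structure walked through above can be modeled in Python. This is a sketch of the control flow only, not the RV32V code of Figure 8.3: `daxpy_vector` is a hypothetical name, `mvl` is the implementation's maximum vector length, and list slices stand in for vector loads and stores.

```python
def daxpy_vector(n, a, x, y, mvl=64):
    """Compute y = a*x + y in strips of at most mvl elements.
    Returns the number of trips through the loop."""
    i = 0
    trips = 0
    while True:
        vl = min(n, mvl)                 # setvl: vl = min(n, mvl)
        v0 = x[i:i + vl]                 # vld: vl elements of x
        v1 = y[i:i + vl]                 # vld: vl elements of y
        v1 = [a * xe + ye for xe, ye in zip(v0, v1)]   # vfmadd
        y[i:i + vl] = v1                 # vst: vl results back into y
        i += vl                          # advance the pointers
        n -= vl                          # elements still to do
        trips += 1
        if n == 0:                       # loop test at the bottom
            return trips
```

The same function handles any n, including n = 0 (every slice is empty, so nothing changes), and it gives the same results for any value of `mvl`, which mirrors the claim that the code need not change when the vector registers grow or shrink.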
The power of vector architecture is that each iteration of this 10-instruction loop launches
3 × 64 = 192 memory accesses and 2 × 64 = 128 floating-point multiplies and additions
(assuming that n is at least 64). That averages about 19 memory accesses and 13 operations
per instruction. As we shall see in the next section, these ratios for SIMD are an order of
magnitude worse.
8.9 Comparing RV32V, MIPS-32 MSA SIMD, and x86-32 AVX SIMD
We’ll now see the contrast between how SIMD and vector execute DAXPY. If you tilt your
head, you can see SIMD as a restricted vector architecture with short vector registers—eight
8-bit “elements”—but it has no vector length register and no strided or indexed data transfers.
ARM-32 has a SIMD extension called NEON but it doesn’t support double-precision
floating-point instructions, so it doesn’t help DAXPY.
MIPS SIMD. Figure 8.5 on page 83 shows the MIPS SIMD Architecture (MSA) version
of DAXPY. Each MSA SIMD instruction can operate on two floating-point numbers since
the MSA registers are 128 bits wide.
Unlike RV32V, because there is no vector length register, MSA requires extra bookkeeping
instructions to check for problem values of n. When n is odd, there is extra code to
compute a single floating-point multiply-add since MSA must operate on pairs of operands.
That code is found in locations 3c to 4c in Figure 8.5. In the unlikely but possible case when
n is zero, the branch at location 10 will skip the main computation loop.
Such bookkeeping code is considered part of strip mining in vector architectures. As
the caption of Figure 8.5 explains, the vector length register vl renders such SIMD
bookkeeping code moot for RV32V. Traditional vector architectures need extra code to handle
the corner case of n = 0. RV32V just makes vector instructions act like nops when n = 0.
If it doesn’t branch around the loop, the instruction at location 18 (splati.d) puts copies
of a in both halves of the SIMD register w2. To add scalar data in SIMD, we need to replicate
it to be as wide as the SIMD register.
Inside the loop, the ld.d instruction at location 1c loads two elements of y into SIMD
register w0 and then increments the pointer to y. It then does a load of two elements of x
into the SIMD register w1. The following instruction at location 28 increments the pointer to
x. The payoff multiply-add instruction at location 2c is next.
The (delayed) branch at the end of the loop tests to see if the pointer to y has been
incremented beyond the last even element of y. If it hasn’t, the loop repeats. The SIMD store
in the delay slot at address 34 writes the result to two elements of y.
Figure 8.4: Number of instructions and code size of DAXPY for vector ISAs. It lists number of instructions
total (static), code size, number of instructions and results per loop, and number of instructions executed (n
= 1000). microMIPS with MSA shrinks code size to 64 bytes and RV32FDCV reduces it to 40 bytes.
After the main loop terminates, the code checks to see if n is odd. If so, it performs the
last multiply-add using scalar instructions from Chapter 5. The final instruction returns to the
calling site.
The 7-instruction loop at the heart of the MIPS MSA DAXPY code does 6 double-
precision memory accesses and 4 floating-point multiplies and additions. The average is
about 1 memory access and 0.5 operations per instruction.
x86 SIMD. Intel has gone through many generations of SIMD extensions, which we see
in the code in Figure 8.6 on page 84. The SSE expansion to 128-bit SIMD led to the xmm
registers and instructions that can use them, and the expansion to 256-bit SIMD as part of
AVX created the ymm registers and their instructions.
The first group of instructions at addresses 0 to 25 loads the variables from memory, makes
four copies of a in a 256-bit ymm register, and tests to ensure n is at least 4 before entering the
main loop. It uses two SSE instructions and one AVX instruction. (The caption of Figure 8.6
explains how in more detail.)
The main loop does the heart of the DAXPY computation. The AVX instruction vmovapd
at address 27 loads 4 elements of x into ymm0. The AVX instruction vfmadd213pd at address
2c multiplies 4 copies of a (ymm2) times 4 elements of x (ymm0), adds 4 elements of y (in
memory at address ecx+edx*8), and puts the 4 sums into ymm0. The following AVX instruc-
tion at address 32, vmovapd, stores the 4 results into y. The next three instructions increment
counters and repeat the loop if needed.
As was the case for MIPS MSA, the “fringe” code between addresses 3e and 57 deals
with the cases when n is not a multiple of 4. It relies on three SSE instructions.
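The shape of a fixed-width SIMD DAXPY, with its separate fringe handling, can be sketched in Python. This models only the structure described in the text, not the actual AVX or MSA code; `daxpy_simd` and its `width` parameter are names assumed for the sketch.

```python
def daxpy_simd(n, a, x, y, width=4):
    """Compute y = a*x + y with a fixed SIMD width.
    Returns (wide_steps, fringe_steps)."""
    i = 0
    wide_steps = 0
    # Main loop: runs only while a full group of `width` elements remains.
    while n - i >= width:
        for lane in range(width):        # one SIMD instruction's worth of work
            y[i + lane] = a * x[i + lane] + y[i + lane]
        i += width
        wide_steps += 1
    # Fringe code: leftover elements are handled one at a time.
    fringe_steps = 0
    while i < n:
        y[i] = a * x[i] + y[i]
        i += 1
        fringe_steps += 1
    return wide_steps, fringe_steps
```

For n = 1003 and a width of 4, the main loop runs 250 times and the fringe loop 3 times. Doubling the SIMD width would mean changing `width`, which in a real ISA corresponds to new instructions and recompiled code.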
The 6 instructions of the main loop in the x86-32 AVX2 DAXPY code do 12 double-
precision memory accesses and 8 floating-point multiplies and additions. They average 2
memory accesses and about 1 operation per instruction.
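The per-instruction averages quoted for the three main loops follow from simple division, which the Python sketch below reproduces using only the counts given in the text. The dictionary layout and function name are assumptions of the sketch, and the text's figures are rounded more coarsely than the exact quotients.

```python
# (instructions in main loop, memory accesses, floating-point operations)
MAIN_LOOPS = {
    "RV32V": (10, 3 * 64, 2 * 64),
    "MIPS-32 MSA": (7, 6, 4),
    "x86-32 AVX2": (6, 12, 8),
}


def per_instruction(instructions, memory_accesses, flops):
    """Return (memory accesses, operations) per instruction executed."""
    return memory_accesses / instructions, flops / instructions


ratios = {name: per_instruction(*counts) for name, counts in MAIN_LOOPS.items()}
```

The RV32V loop comes out near 19 memory accesses and 13 operations per instruction, against roughly 1 to 2 memory accesses and about 0.6 to 1.3 operations for the two SIMD loops.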
Elaboration: The Illiac IV was the first to show the difficulty of compiling for SIMD.
With 64 parallel 64-bit floating-point units (FPUs), the Illiac IV was planned to have more
than 1 million logic gates before Moore published his law. Its architect originally predicted
1000 million floating-point operations per second (MFLOPS), but actual performance was 15
MFLOPS at best. Costs escalated from the $8M estimated in 1966 to $31M by 1972, despite
the construction of only 64 of the planned 256 FPUs. The project started in 1965 but took
until 1976 to run its first real application, the year the Cray-1 was unveiled. Perhaps the most
infamous supercomputer, it made a top 10 list of engineering disasters [Falk 1976].
Figure 8.4 summarizes the number of instructions and number of bytes in DAXPY of pro-
grams for RV32IFDV, MIPS-32 MSA, and x86-32 AVX2. The SIMD computation code is
dwarfed by the bookkeeping code. Two-thirds to three-fourths of the code for MIPS-32 MSA
and x86-32 AVX2 is SIMD overhead, either to prepare the data for the main SIMD loop or to
handle the fringe elements when n is not a multiple of the number of floating-point numbers
in a SIMD register.
RV32V code in Figure 8.3 doesn’t need such bookkeeping code, which halves the number
of instructions. Unlike SIMD, it has a vector length register, which makes the vector instruc-
tions work at any value of n. You might think RV32V would have a problem when n is 0. It
doesn’t because RV32V vector instructions leave everything unchanged when vl = 0.
However, the most significant difference between SIMD and vector processing is not the
static code size. The SIMD programs execute 10 to 20 times more instructions than RV32V
because each SIMD loop does only 2 or 4 elements instead of 64 in the vector case. The extra
instruction fetches and instruction decodes mean higher energy to perform the same task.
Comparing the results in Figure 8.4 to the scalar versions of DAXPY in Figure 5.8 on
page 29 in Chapter 5, we see that SIMD roughly doubles the size of the code in instructions
and bytes, but the main loop is the same size. The reduction in the dynamic number of
instructions executed is a factor of 2 or 4, depending on the width of the SIMD registers.
However, the RV32V vector code size increases by a factor of 1.2 (with the main loop 1.4X)
but the dynamic instruction count is a factor of 43 smaller!
While dynamic instruction count is a large difference, in our view that is the second most
significant disparity between SIMD and vector. Lacking a vector length register explodes the
number of instructions as well as the bookkeeping code. ISAs like MIPS-32 and x86-32 that
follow the incrementalist doctrine must duplicate all the old SIMD instructions defined for
narrower SIMD registers every time they double the SIMD width. Surely, hundreds of MIPS-
32 and x86-32 instructions were created over many generations of SIMD ISAs and hundreds
more are in their future. The cognitive load on the assembly language programmer of this
brute force approach to ISA evolution must be overwhelming. How can one remember what
vfmadd213pd means and when to use it?
In comparison, RV32V code is unaffected by the size of the memory for vector registers.
Not only is RV32V unchanged if vector memory size expands, you don’t even have to re-
compile. Since the processor supplies the value of maximum vector length mvl, the code in
Figure 8.3 is untouched whether a processor raises the vector memory from 1024 bytes to,
say, 4096 bytes, or drops it to 256 bytes.
Unlike SIMD, where the ISA dictates the required hardware—and changing the ISA
means changing the compiler—the RV32V ISA allows processor designers to choose the
resources for data parallelism for their application without affecting the programmer or com-
piler. One can argue that SIMD violates the ISA design principle from Chapter 1 of isolating
the architecture from implementation.
We think the high contrast in cost-energy-performance, complexity, and ease of program-
ming between the modular vector approach of RV32V and the incrementalist SIMD architec-
tures of ARM-32, MIPS-32, and x86-32 might be the most persuasive argument for RISC-V.
A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Version 2.2. May 2017. URL https://riscv.org/specifications/.
Notes
1 http://parlab.eecs.berkeley.edu
Figure 8.5: MIPS-32 MSA code for DAXPY in Figure 5.7. The bookkeeping overhead of SIMD is evident
when comparing this code to the RV32V code in Figure 8.3. The first part of the MIPS MSA code (addresses
0 to 18) duplicates the scalar variable a in a SIMD register and checks that n is at least 2 before
entering the main loop. The third part of the MIPS MSA code (addresses 38 to 4c) handles the fringe case
when n is not a multiple of 2. Such bookkeeping code is unneeded in RV32V because the vector length
register vl and the setvl instruction let the loop work for all values of n, whether odd or even.
Figure 8.6: x86-32 AVX2 code for DAXPY in Figure 5.7. The SSE instruction vmovsd at address a loads a
into half of the 128-bit xmm1 register. The SSE instruction vmovddup at address 14 duplicates a into both
halves of xmm1 for later SIMD computation. The AVX instruction vinsertf128 at address 1d makes four
copies of a in ymm2 starting from the two copies of a in xmm1. The three AVX instructions at addresses 42 to
4d (vmovsd, vfmadd213sd, vmovsd) handle the case when mod(n,4) ≠ 0. They perform the DAXPY computation one
element at a time, with the loop repeating until the function has performed exactly n multiply-add
operations. Once again, such code is unnecessary for RV32V because the vector length register vl and the
setvl instruction make the loop work for any value of n.
9 RV64: 64-bit Address Instructions
There is only one mistake that can be made in computer design that is difficult to recover
from—not having enough address bits for memory addressing and memory management.
—C. Gordon Bell, 1976
C. Gordon Bell (1934-) was one of the lead architects of two of the most popular minicomputer
architectures of their day: the Digital Equipment Corporation PDP-11 (16-bit address), which was
announced in 1970, and its successor seven years later, the Digital Equipment Corporation 32-bit
address VAX-11 (Virtual Address eXtension).
9.1 Introduction
Figures 9.1 to 9.4 show graphical representations of the RV64G versions of the RV32G
instructions. These figures illustrate the small increase in the number of instructions to switch
to a 64-bit ISA in RISC-V. The ISAs typically add only a few word, doubleword, or long
versions of the 32-bit instructions and expand all the registers, including the PC, to 64 bits.
Thus, sub in RV64I subtracts two 64-bit numbers rather than two 32-bit numbers as in RV32I.
RV64 is a close but actually different ISA than RV32; it adds a few instructions and the base
instructions do slightly different things.
For example, Insertion Sort for RV64I in Figure 9.8 is quite near the code for RV32I
in Figure 2.8 on page 27 in Chapter 2. It is the same number of instructions and the same
number of bytes. The only changes are that the load and store word instructions become load
and store doublewords, and the address increment goes from 4 for words (4 bytes) to 8 for
doublewords (8 bytes). Figure 9.5 lists the opcodes of the RV64GC instructions in Figures 9.1
to 9.4.
Despite RV64I having 64-bit addresses and a default data size of 64 bits, 32-bit words are
valid data types in programs. Hence, RV64I needs to support words just as RV32I needs to
support bytes and halfwords. More specifically, since registers are now 64 bits wide, RV64I
adds word versions of addition and subtraction: addw, addiw, subw. They truncate their
results to 32 bits and write the sign-extended result to the destination register. RV64I also
includes word versions of the shift instructions to get a 32-bit shift result instead of a 64-bit
shift result: sllw, slliw, srlw, srliw, sraw, sraiw. To do 64-bit data transfers, it
has load and store doubleword: ld, sd. Finally, just as there are unsigned versions of load
byte and load halfword in RV32I, RV64I must have an unsigned version of load word: lwu.
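The difference between the 64-bit operations and their word versions can be illustrated with a small Python model of 64-bit registers. The helper names (`sext32`, `add`, `addw`, `lw`, `lwu`) are chosen to mirror the mnemonics, but the functions are only a sketch of the arithmetic, with register values held as unsigned 64-bit integers.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1


def sext32(value):
    """Truncate to 32 bits, then sign-extend into a 64-bit register value."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32
    return value & MASK64


def add(a, b):
    """64-bit add, as add behaves in RV64I."""
    return (a + b) & MASK64


def addw(a, b):
    """Word add: 32-bit result, sign-extended to 64 bits."""
    return sext32(a + b)


def lw(word):
    """Load word: the 32-bit value is sign-extended."""
    return sext32(word)


def lwu(word):
    """Load word unsigned: the 32-bit value is zero-extended."""
    return word & MASK32
```

Adding 1 to 0x7FFFFFFF shows the contrast: `add` gives 0x0000000080000000, while `addw` wraps at 32 bits and gives 0xFFFFFFFF80000000.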
For similar reasons, RV64M needs to add word versions of multiply, divide, and remain-
der: mulw, divw, divuw, remw, remuw. To allow the programmer to synchronize on
both words and doublewords, RV64A adds doubleword versions of all 11 of its instructions.
[Diagram of the RV64I instructions, grouped into Integer Computation, Loads and Stores, Control
transfer, and Miscellaneous instructions.]
Figure 9.1: Diagram of the RV64I instructions. The underlined letters are concatenated from left to right to
form RV64I instructions. The dimmed portions are the old RV32I instructions extended to 64-bit registers,
and the dark (red) portions are the new instructions for RV64I.
[Diagrams of the RV64M, RV64A, and RV64C instructions. RV64C is grouped into Integer Computation,
Control transfer, Loads and Stores, and Other instructions.]
31     25  24  20  19 15  14 12  11     7  6     0
imm[11:0]          rs1    110    rd        0000011  I  lwu
imm[11:0]          rs1    011    rd        0000011  I  ld
imm[11:5]  rs2     rs1    011    imm[4:0]  0100011  S  sd
000000     shamt   rs1    001    rd        0010011  I  slli
000000     shamt   rs1    101    rd        0010011  I  srli
010000     shamt   rs1    101    rd        0010011  I  srai
imm[11:0]          rs1    000    rd        0011011  I  addiw
0000000    shamt   rs1    001    rd        0011011  I  slliw
0000000    shamt   rs1    101    rd        0011011  I  srliw
0100000    shamt   rs1    101    rd        0011011  I  sraiw
0000000    rs2     rs1    000    rd        0111011  R  addw
0100000    rs2     rs1    000    rd        0111011  R  subw
0000000    rs2     rs1    001    rd        0111011  R  sllw
0000000    rs2     rs1    101    rd        0111011  R  srlw
0100000    rs2     rs1    101    rd        0111011  R  sraw
Figure 9.5: RV64 opcode map of the base instructions and optional extensions. It shows instruction layout,
opcodes, format type, and name. (Table 19.2 of [Waterman and Asanović 2017] is the basis of this figure.)
90 CHAPTER 9. RV64: 64-BIT ADDRESS INSTRUCTIONS
RV64F and RV64D add integer doublewords to the convert instructions, calling them
longs to prevent confusion with double-precision floating-point data: fcvt.l.s,
fcvt.l.d, fcvt.lu.s, fcvt.lu.d, fcvt.s.l, fcvt.s.lu, fcvt.d.l,
fcvt.d.lu. As the integer x registers are now 64 bits wide, they can now hold
double-precision floating-point data, so RV64D adds two floating-point moves: fmv.x.d and
fmv.d.x.
The one exception to the superset relationship between RV64 and RV32 is the compressed
instructions. RV64C replaced a few RV32C instructions, since other instructions shrank
code more for 64-bit addresses. RV64C drops the compressed jump and link (c.jal) and the
integer and floating-point load and store word instructions (c.lw, c.sw, c.lwsp, c.swsp,
c.flw, c.fsw, c.flwsp, and c.fswsp). In their place, RV64C adds the more popular add
and subtract word instructions (c.addw, c.addiw, c.subw) and load and store double-
word instructions (c.ld, c.sd, c.ldsp, c.sdsp).
Figure 9.6: Number of instructions and code size for Insertion Sort for four ISAs. ARM Thumb-2 and
microMIPS are 32-bit address ISAs, so are unavailable for ARM-64 and MIPS-64.
count from 20 to 15 instructions. The code size is actually larger by one byte with the newer
ISA despite having fewer instructions: 46 versus 45. The reason is that to squeeze in the new
opcodes to enable more registers, x86-64 added a prefix byte to identify the new instructions.
The average instruction length increases in x86-64 over x86-32.
ARM faced the same address problem another decade later. Rather than evolve the old
ISA to have 64-bit addresses as did x86-64, they used the opportunity to invent a brand new
ISA. Given a fresh start, they changed many of the awkward ARM-32 traits to give them a
modern ISA:
It still shares some weaknesses of ARM-32: condition codes for branch, source and desti-
nation register fields move in the instruction format, conditional move instructions, complex
addressing modes, inconsistent performance counters, and only 32-bit length instructions.
ARM-64 can’t switch to the Thumb-2 ISA, as Thumb-2 only works with 32-bit addresses.
Unlike RISC-V, ARM decided to take a maximalist approach to ISA design. While certainly
a better ISA than ARM-32, it is also bigger. For example, it has more than 1000
instructions and the ARM-64 manual is 3185 pages long [ARM 2015]. Moreover, it is still
growing. There have been three expansions of ARM-64 since its announcement a few years
ago.
Intel didn’t invent the x86-64 ISA. When switching to 64-bit addresses, Intel invented a
new ISA called Itanium that was incompatible with x86-32. Its competitor for x86-32
processors was locked out of Itanium, so AMD invented a 64-bit version of x86-32 called
AMD64. Itanium eventually failed, so Intel was forced to adopt the AMD64 ISA as the 64-bit
address successor of x86-32, which we call x86-64 [Kerner and Padgett 2007].
The ARM-64 code for Insertion Sort in Figure 9.9 looks closer to the RV64I code or
x86-64 code than to the ARM-32 code. For example, with 31 registers, there is no need to
save and restore registers from the stack. And since the PC is no longer one of the registers,
ARM-64 uses a separate return instruction.
Figure 9.6 is a table that summarizes the number of instructions and number of bytes in
Insertion Sort for the ISAs. Figures 9.8 to 9.11 show the compiled code for RV64I, ARM-64,
MIPS-64, and x86-64. Parenthetical phrases in the comments of these four programs identify
the differences between the RV32I versions in Chapter 2 and these RV64I versions.
MIPS-64 needs the most instructions, primarily because of the nop instructions of the
unfilled delayed branch slots. RV64I needs fewer because of the compare-and-branch
instructions and no delayed branch. While ARM-64 and x86-64 need two compare instructions
[Bar chart of relative program size (vertical axis from 0 to 0.8) for RISC-V RV64GC (16b & 32b),
RISC-V RV64G (32b), ARM-64 (32b), and Intel x86-64 (variable 8b).]
Figure 9.7: Relative program sizes for RV64G, ARM-64, and x86-64 versus RV64GC. This comparison
measures much bigger programs than in Figure 9.6. This graph is the 64-bit address equivalent to the graph
of 32-bit ISAs in Figure 1.5 on page 9 in Chapter 1. RV32C code size almost matches RV64C; it is 1%
smaller. There is no Thumb-2 option for ARM-64, so the code of other 64-bit ISAs significantly exceeds the
size of RV64GC code. The programs measured were the SPEC CPU2006 benchmarks using the GCC
compilers [Waterman 2016].
that are unnecessary for RV64I, their scaling addressing modes avoid address arithmetic in-
structions needed in RV64I, giving them the fewest instructions. However, RV64I+RV64C
has much smaller code size, as the next section explains.
RISC-V benefited from designing both the 32-bit and the 64-bit architectures together,
whereas older ISAs had to architect them sequentially. Unsurprisingly, the transition between
32-bit and 64-bit is easiest for RISC-V programmers and compiler writers; the RV64I ISA
has virtually all RV32I instructions. Indeed, that is why we can list both RV32GCV and
RV64GCV in only two pages of the Reference Card. More important, the simultaneous
design meant the 64-bit architecture did not have to be squeezed into a cramped 32-bit opcode
space. RV64I has plenty of room for optional instruction extensions, particularly RV64C,
which makes it the leader in code size.
We see the 64-bit architecture as more evidence of RISC-V’s sound design, admittedly
easier to achieve if you start 20 years later so that you can borrow the pioneers’ good ideas as
well as learn from their mistakes.
Elaboration: RV128
RV128 began as an inside joke with the RISC-V architects, simply to show that a 128-bit
address ISA was possible. However, warehouse scale computers may soon have more than
2^64 bytes of semiconductor storage (DRAM and Flash memory), which programmers might
want to access as a memory address. There are also proposals to use a 128-bit address to
improve security [Woodruff et al. 2014]. The RISC-V manual does specify a full 128-bit ISA
called RV128G [Waterman and Asanović 2017]. The additional instructions are basically the
same as needed to go from RV32 to RV64, which Figures 9.1 to 9.4 illustrate. All the registers
also grow to 128 bits, and the new RV128 instructions specify either 128-bit versions of some
instructions (using Q in the name for quadword) or 64-bit versions of others (using D in
the name for doubleword).
A. Waterman and K. Asanović, editors. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Version 2.2. May 2017. URL https://riscv.org/specifications/.
J. Woodruff, R. N. Watson, D. Chisnall, S. W. Moore, J. Anderson, B. Davis, B. Laurie,
P. G. Neumann, R. Norton, and M. Roe. The CHERI capability model: Revisiting RISC
in an age of risk. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International
Symposium on, pages 457–468. IEEE, 2014.
Notes
1 http://parlab.eecs.berkeley.edu
NOTES 95
Figure 9.8: RV64I code for Insertion Sort in Figure 2.5. The RV64I assembly language program is very
similar to the RV32I assembly language in Figure 2.8 on page 27 in Chapter 2. We list the differences in
parentheses in the comments. The size of the data is now 8 bytes instead of 4, so three instructions change the
constant 4 to 8. This extra width also stretches two load words (lw) to load doublewords (ld) and two store
words (sw) to store doublewords (sd).
28: f8227805 str x5, [x0, x2, lsl #3] # a[j] = a[j-1]
2c: f1000442 subs x2, x2, #0x1 # j--
30: 54ffff41 b.ne 18 # if j != 0, jump to Inner Loop
Exit Inner Loop:
34: f8227804 str x4, [x0, x2, lsl #3] # a[j] = x
38: 91000463 add x3, x3, #0x1 # i++
3c: 17fffff2 b 4 # jump to Outer Loop
Figure 9.9: ARM-64 code for Insertion Sort in Figure 2.5. The ARM-64 assembly language program is
different from the ARM-32 assembly language in Figure 2.11 on page 30 in Chapter 2 since it is a new
instruction set. The registers start with x instead of a. The data addressing modes can shift a register by 3
bits to scale the index to a byte address. With 31 registers, there is no need to save and restore registers from
the stack. Since PC is not one of the registers, it uses a separate return instruction. In fact, the code looks
closer to the RV64I code or x86-64 code than to the ARM-32 code.
Figure 9.10: MIPS-64 code for Insertion Sort in Figure 2.5. The MIPS-64 assembly language program has
several differences from the MIPS-32 assembly language in Figure 2.10 on page 29 in Chapter 2. First,
most operations for 64-bit data prepend a “d” to their names: daddiu, daddu, dsll. Like Figure 9.8,
three instructions change the constant from 4 to 8 since the size of the data grew from 4 to 8 bytes. Again like
RV64I, the extra width also stretches two load words (lw) to load doublewords (ld) and two store words (sw)
to store doublewords (sd). Finally, MIPS-64 does not have the load delay slot from MIPS-32; the pipeline
stalls on a read-after-write dependence.
Figure 9.11: x86-64 code for Insertion Sort in Figure 2.5. The x86-64 assembly language program is quite
different from the x86-32 assembly language in Figure 2.11 on page 30 in Chapter 2. First, unlike RV64I,
the wider registers have different names: rax, rcx, rdx, rsi, rdi, r8. Second, because x86-64 added 8
more registers, there are now enough to keep all the variables in registers instead of in memory. Third, the
x86-64 instructions are longer than for x86-32 since many need to prepend 8 bits or 16 bits to fit the new
instructions in the opcode space. For example, incrementing or decrementing a register (inc, dec) takes 1
byte in x86-32 but 3 bytes in x86-64. Hence, while it has many fewer instructions, the x86-64 code size of
Insertion Sort is almost identical to x86-32: 46 bytes vs. 45 bytes.
10 RV32/64 Privileged Architecture
31 27 26 25  24  20  19 15  14 12  11   7  6     0
0001000      00010   00000  000    00000   1110011  R  sret
0011000      00010   00000  000    00000   1110011  R  mret
0001000      00101   00000  000    00000   1110011  R  wfi
0001001      rs2     rs1    000    00000   1110011  R  sfence.vma
Figure 10.2: RISC-V privileged instruction layout, opcodes, format type, and name. (Table 6.1 of [Waterman
and Asanović 2017] is the basis of this figure.)
very few instructions; instead, several new control and status registers (CSRs) expose the
additional functionality.
This chapter describes the RV32 and RV64 privileged architectures together. Some con-
cepts differ only in the size of an integer register, so to keep the descriptions concise, we
introduce the term XLEN to refer to the width of an integer register in bits. XLEN is 32 for
RV32 or 64 for RV64.
Interrupt / Exception  Exception Code
mcause[XLEN-1]         mcause[XLEN-2:0]  Description
1                      1                 Supervisor software interrupt
1                      3                 Machine software interrupt
1                      5                 Supervisor timer interrupt
1                      7                 Machine timer interrupt
1                      9                 Supervisor external interrupt
1                      11                Machine external interrupt
0                      0                 Instruction address misaligned
0                      1                 Instruction access fault
0                      2                 Illegal instruction
0                      3                 Breakpoint
0                      4                 Load address misaligned
0                      5                 Load access fault
0                      6                 Store address misaligned
0                      7                 Store access fault
0                      8                 Environment call from U-mode
0                      9                 Environment call from S-mode
0                      11                Environment call from M-mode
0                      12                Instruction page fault
0                      13                Load page fault
0                      15                Store page fault
Figure 10.3: RISC-V exception and interrupt causes. The most-significant bit of mcause is set to 1 for
interrupts or 0 for synchronous exceptions, and the least-significant bits identify the interrupt or exception.
Supervisor interrupts and page-fault exceptions are only possible when supervisor mode is implemented (see
Section 10.5). (Table 3.6 of [Waterman and Asanović 2017] is the basis of this figure.)
10.3. MACHINE-MODE EXCEPTION HANDLING 103
Bits          Field  Width
XLEN-1        SD     1
XLEN-2 to 23  0      XLEN-24
22            TSR    1
21            TW     1
20            TVM    1
19            MXR    1
18            SUM    1
17            MPRV   1
16-15         XS     2
14-13         FS     2
12-11         MPP    2
10-9          0      2
8             SPP    1
7             MPIE   1
6             0      1
5             SPIE   1
4             UPIE   1
3             MIE    1
2             0      1
1             SIE    1
0             UIE    1
Figure 10.4: The mstatus CSR. The only fields present in simple processors with only Machine mode and
without the F and V extensions are the global interrupt enable, MIE, and MPIE, which after an exception
holds the old value of MIE. XLEN is 32 for RV32, or 64 for RV64. Figure 3.7 of [Waterman and Asanović
2017] is the basis of this figure; see Section 3.1 of that document for a description of the other fields.
misaligned regular loads and stores, because it is a difficult feature to implement and is in-
frequently used. Processors without this hardware rely instead upon an exception handler
to trap and emulate misaligned loads and stores in software, using a sequence of smaller,
aligned loads and stores. Application code is none the wiser: misaligned memory accesses
work as expected, albeit slowly, while the hardware remains simple. Alternatively, more
performant processors can implement misaligned loads and stores in hardware. This imple-
mentation flexibility owes to RISC-V’s decision to permit misaligned loads and stores using
the regular load and store opcodes, following Chapter 1’s guideline to isolate architecture
from implementation.
There are three standard sources of interrupts: software, timer, and external. Software in-
terrupts are triggered by storing to a memory-mapped register and are generally used by one
hart to interrupt another hart, a mechanism other architectures refer to as an interprocessor
interrupt. Timer interrupts are raised when the real-time counter mtime meets or exceeds a hart’s
time comparator, a memory-mapped register named mtimecmp. External interrupts are raised
by a platform-level interrupt controller, to which most external devices are attached. As
different hardware platforms have different memory maps and demand divergent features
of their interrupt controllers, the mechanisms for raising and clearing these interrupts differ
from platform to platform. What is constant across all RISC-V systems is how exceptions
are handled and interrupts are masked, the topic of the next section.
• mtvec, Machine Trap Vector, holds the address the processor jumps to when an excep-
tion occurs.
• mepc, Machine Exception PC, points to the instruction where the exception occurred.
• mcause, Machine Exception Cause, indicates which exception occurred.
• mie, Machine Interrupt Enable, lists which interrupts the processor can take and which
it must ignore.
• mip, Machine Interrupt Pending, lists the interrupts currently pending.
104 CHAPTER 10. RV32/64 PRIVILEGED ARCHITECTURE
• mtval, Machine Trap Value, holds additional trap information: the faulting address for
address exceptions, the instruction itself for illegal instruction exceptions, and zero for
other exceptions.
• mscratch, Machine Scratch, holds one word of data for temporary storage.
• mstatus, Machine Status, holds the global interrupt enable, along with a plethora of
other state, as Figure 10.4 shows.
When executing in M-mode, interrupts are only taken if the global interrupt-enable bit,
mstatus.MIE, is set. Furthermore, each interrupt has its own enable bit in the mie CSR. The
bit positions in mie correspond to the interrupt codes in Figure 10.3: for example, mie[7]
corresponds to the M-mode timer interrupt. The mip CSR has the same layout and indicates
which interrupts are currently pending. Putting all three CSRs together, a machine timer
interrupt can be taken if mstatus.MIE=1, mie[7]=1, and mip[7]=1.
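The three-way condition above can be written out as a short Python sketch. The function name and the constants are illustrative choices, not part of any RISC-V tool; the bit position of MIE within mstatus (bit 3) is read from Figure 10.4, and the hart is assumed to be executing in M-mode.

```python
MSTATUS_MIE = 1 << 3   # global interrupt-enable bit of mstatus (Figure 10.4)
MACHINE_TIMER = 7      # interrupt code of the M-mode timer interrupt (Figure 10.3)


def interrupt_can_be_taken(mstatus, mie, mip, code):
    """Return True if interrupt `code` can be taken while running in M-mode."""
    globally_enabled = bool(mstatus & MSTATUS_MIE)
    individually_enabled = bool((mie >> code) & 1)
    pending = bool((mip >> code) & 1)
    return globally_enabled and individually_enabled and pending
```

For example, interrupt_can_be_taken(0x8, 0x80, 0x80, MACHINE_TIMER) holds because mstatus.MIE, mie[7], and mip[7] are all set; clearing any one of the three makes it false.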
When a hart takes an exception, the hardware atomically undergoes several state
transitions:
• The PC of the exceptional instruction is preserved in mepc, and the PC is set to mtvec.
(For synchronous exceptions, mepc points to the instruction that caused the exception;
for interrupts, it points where execution should resume after the interrupt is handled.)
• mcause is set to the exception cause, as encoded in Figure 10.3, and mtval is set to
the faulting address or some other exception-specific word of information.
• Interrupts are disabled by setting MIE=0 in the mstatus CSR, and the previous value
of MIE is preserved in MPIE.
• The pre-exception privilege mode is preserved in mstatus’ MPP field, and the
privilege mode is changed to M. Figure 10.5 shows the encoding of the MPP field. (If the
processor only implements M-mode, this step is effectively skipped.)
RISC-V also supports vectored interrupts, wherein the processor jumps to an
interrupt-specific address, rather than a single entry point. This addressing eliminates the
need to read and decode mcause, speeding up interrupt handling. Setting mtvec[0] to 1
enables this feature; interrupt cause x then sets the PC to (mtvec-1+4x), instead of the
usual mtvec.
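The state transitions a hart undergoes when taking an exception can be modeled in a few lines of Python. This is a sketch under stated assumptions, not a description of any real implementation: the Hart class and take_trap function are hypothetical names, the mstatus bit positions are read from Figure 10.4, the value 3 for Machine mode in MPP is taken from the privileged specification rather than from this text, and the trap vector is assumed to be in non-vectored mode.

```python
from dataclasses import dataclass

MIE_BIT, MPIE_BIT, MPP_SHIFT = 3, 7, 11  # mstatus field positions (Figure 10.4)
M_MODE = 3                               # assumed MPP encoding of Machine mode


@dataclass
class Hart:
    """Hypothetical container for the state touched by a trap."""
    pc: int = 0
    priv: int = M_MODE
    mstatus: int = 0
    mepc: int = 0
    mcause: int = 0
    mtval: int = 0
    mtvec: int = 0


def take_trap(hart, cause, tval=0, is_interrupt=False, xlen=32):
    """Apply the four state transitions of trap entry to `hart`."""
    # 1. Preserve the PC in mepc and jump to the trap vector.
    hart.mepc = hart.pc
    hart.pc = hart.mtvec
    # 2. Record the cause (interrupt flag in the top bit) and the trap value.
    hart.mcause = (int(is_interrupt) << (xlen - 1)) | cause
    hart.mtval = tval
    # 3. Copy MIE to MPIE, then clear MIE to disable interrupts.
    # 4. Save the old privilege mode in MPP and switch to M-mode.
    old_mie = (hart.mstatus >> MIE_BIT) & 1
    hart.mstatus &= ~((1 << MIE_BIT) | (1 << MPIE_BIT) | (3 << MPP_SHIFT))
    hart.mstatus |= (old_mie << MPIE_BIT) | (hart.priv << MPP_SHIFT)
    hart.priv = M_MODE
```

Starting from a U-mode hart with interrupts enabled (mstatus = 0x8), an illegal-instruction trap (cause 2) leaves mstatus = 0x80: MIE is clear, MPIE holds the old MIE, and MPP holds 0 for U-mode.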
To avoid overwriting the contents of the integer registers, the prologue of an interrupt
handler usually begins by swapping an integer register (say, a0) with the mscratch CSR.
Usually, the software will have arranged for mscratch to contain a pointer to additional in-
memory scratch space, which the handler uses to save as many integer registers as its body
will use. After the body executes, the epilogue of an interrupt handler restores the registers
it saved to memory, then again swaps a0 with mscratch, restoring both registers to their
pre-exception values. Finally, the handler returns with mret, an instruction unique to M-
mode. mret sets the PC to mepc, restores the previous interrupt-enable setting by copying
the mstatus MPIE field to MIE, and sets the privilege mode to the value in mstatus’ MPP
field, essentially reversing the actions described in the preceding paragraph.
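As a companion sketch, the three effects of mret named above can be modeled as a pure function. The function name and return convention are illustrative, the mstatus bit positions are again read from Figure 10.4, and the model deliberately covers only the effects this text describes.

```python
MIE_BIT, MPIE_BIT, MPP_SHIFT = 3, 7, 11  # mstatus field positions (Figure 10.4)


def mret(mstatus, mepc):
    """Return the (pc, privilege mode, mstatus) that result from executing mret."""
    mpie = (mstatus >> MPIE_BIT) & 1
    mpp = (mstatus >> MPP_SHIFT) & 3
    # Copy MPIE into MIE to restore the previous interrupt-enable setting.
    new_mstatus = (mstatus & ~(1 << MIE_BIT)) | (mpie << MIE_BIT)
    # The PC comes from mepc and the privilege mode from MPP.
    return mepc, mpp, new_mstatus
```

With mstatus = 0x80 (MPIE set, MPP = 0) and mepc = 0x1000, the result is a PC of 0x1000, privilege mode 0, and mstatus = 0x88, so interrupts are enabled again.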
10.4. USER MODE AND PROCESS ISOLATION IN EMBEDDED SYSTEMS 105
Figure 10.6 shows RISC-V assembly code for a basic timer interrupt handler following
this pattern. It simply increments the time comparator then returns to the previous task,
whereas a more realistic timer interrupt handler might invoke a scheduler to switch between
tasks. It is not preemptible, so it keeps interrupts disabled throughout the handler. Those
caveats aside, it is a complete example of a RISC-V interrupt handler on a single page!
Sometimes it’s desirable to take a higher-priority interrupt while processing a lower-
priority exception. Alas, there’s only one copy of the mepc, mcause, mtval, and mstatus
CSRs; taking a second interrupt would destroy the old values in these registers, causing data
loss without some additional help from software. A preemptible interrupt handler can save
these registers to an in-memory stack before enabling interrupts, then, just prior to exiting,
disable interrupts and restore the registers from the stack.
In addition to the mret instruction we introduced above, M-mode provides just one other
instruction: wfi (Wait For Interrupt). wfi informs the processor that there isn’t any useful
work to do, so it should enter a lower-power mode until any enabled interrupt becomes
pending, i.e., (mie & mip) ≠ 0. RISC-V processors implement this instruction in a variety of ways,
including stopping the clock until an interrupt becomes pending; some simply execute it as a
nop. Hence, wfi is typically used inside a loop.
# save registers
csrrw a0, mscratch, a0 # save a0; set a0 = &temp storage
sw a1, 0(a0) # save a1
sw a2, 4(a0) # save a2
sw a3, 8(a0) # save a3
sw a4, 12(a0) # save a4
Figure 10.6: RISC-V code for a simple timer interrupt handler. The code assumes that interrupts are
globally enabled by setting mstatus.MIE; that timer interrupts have been enabled by setting mie[7]; that
the mtvec CSR has been set to the address of this handler; and that the mscratch CSR has been set to the
address of a buffer that contains 16 bytes of temporary storage to save the registers. The prologue saves five
registers, preserving a0 in mscratch and a1–a4 in memory. It then decodes the exception cause by
examining mcause: interrupt if mcause<0, or synchronous exception if mcause≥0. If it is an interrupt, it
checks that the lower bits of mcause equal 7, indicating an M-mode timer interrupt. If it is a timer interrupt,
it adds 1000 cycles to the time comparator, so that the next timer interrupt will occur about 1000 timer cycles
in the future. Finally, the epilogue restores a0–a4 and mscratch, then returns whence it came using mret.
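The cause-decoding step that the caption describes can be sketched in Python as follows. The function names are illustrative; the handler itself performs this test in assembly by checking the sign of mcause, which is equivalent to testing the most-significant bit.

```python
def decode_mcause(mcause, xlen=32):
    """Split mcause into (is_interrupt, exception code) as laid out in Figure 10.3."""
    is_interrupt = bool((mcause >> (xlen - 1)) & 1)   # most-significant bit
    code = mcause & ((1 << (xlen - 1)) - 1)           # remaining low bits
    return is_interrupt, code


def is_machine_timer_interrupt(mcause, xlen=32):
    """True when mcause denotes an interrupt whose code is 7 (M-mode timer)."""
    is_interrupt, code = decode_mcause(mcause, xlen)
    return is_interrupt and code == 7
```

On RV32, mcause = 0x80000007 decodes to an interrupt with code 7, whereas mcause = 7 is the synchronous store access fault.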
PMP address register:
Bits   XLEN-1 to 0
Field  address[PhysicalAddressSize-1:2]

PMP configuration register:
Bits   7  6-5  4-3  2  1  0
Field  L  0    A    X  W  R
Figure 10.7: A PMP address and configuration register. The address register is right-shifted by 2, and if
physical addresses are less than XLEN-2 bits wide, the upper bits are zeros. The R, W, and X fields grant
read, write, and execute permissions. The A field sets the PMP mode, and the L field locks the PMP and
corresponding address registers.
RV32, pmpcfg0:
Bits   31-24  23-16  15-8  7-0
Field  PMP3   PMP2   PMP1  PMP0

RV64, pmpcfg0:
Bits   63-56  55-48  47-40  39-32  31-24  23-16  15-8  7-0
Field  PMP7   PMP6   PMP5   PMP4   PMP3   PMP2   PMP1  PMP0
Figure 10.8: The layout of PMP configurations in the pmpcfg CSRs. For RV32 (above), the sixteen
configuration registers are packed into four CSRs. For RV64 (below), they are packed into the two
even-numbered CSRs.
attempts to fetch an instruction, or execute a load or store, the address is compared against
all of the PMP address registers. If the address is greater than or equal to PMP address i, but
less than PMP address i+1, then PMP i+1’s configuration register decides whether that access
may proceed; otherwise, it raises an access exception.
Figure 10.7 shows the layout of a PMP address and configuration register. Both are CSRs,
with the address registers named pmpaddr0 to pmpaddrN, where N+1 is the number of PMPs
implemented. The address registers are shifted right two bits because PMPs have a four-byte
granularity. The configuration registers are densely packed in the CSRs to accelerate context
switching, as Figure 10.8 shows. A PMP’s configuration consists of R, W, and X bits, which
when set permit loads, stores, and fetches, respectively, and a mode field, A, which when 0
disables this PMP or when 1 enables it. The PMP configuration also supports other modes
and can be locked, features described in [Waterman and Asanović 2017].
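The PMP check described above can be sketched in Python. This is a simplified model under several assumptions: the function name and argument layout are invented for illustration, only the mode A=1 described in the text is handled, the lower bound of the first PMP is taken to be address zero (as the privileged specification states for this mode), the lock bit is ignored, and an access that matches no enabled PMP is denied.

```python
R, W, X = 1 << 0, 1 << 1, 1 << 2   # permission bits of a PMP configuration (Figure 10.7)
A_SHIFT, A_MASK = 3, 0x3           # the A (mode) field occupies bits 4-3
A_ENABLED = 1                      # mode value that enables the PMP


def pmp_allows(addr, access, pmpaddr, pmpcfg):
    """Return True if the access (R, W, or X) to byte address `addr` may proceed.

    pmpaddr holds the address-register values (byte address shifted right by 2);
    pmpcfg holds the matching 8-bit configuration values.
    """
    lower = 0  # assumed lower bound of the first region
    for reg, cfg in zip(pmpaddr, pmpcfg):
        upper = reg << 2  # undo the right shift by two bits
        mode = (cfg >> A_SHIFT) & A_MASK
        if mode == A_ENABLED and lower <= addr < upper:
            return bool(cfg & access)  # this PMP decides the access
        lower = upper
    return False  # no PMP matched: access exception
```

As a usage example, two PMPs with bounds 0x1000 and 0x2000 can make the first 4 KiB readable and executable and the next 4 KiB readable and writable; a store to 0x800 is then refused while a store to 0x1800 proceeds.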
Bits          Field    Width
XLEN-1        SD       1
XLEN-2 to 20  0        XLEN-21
19            MXR      1
18            SUM      1
17            0        1
16-15         XS[1:0]  2
14-13         FS[1:0]  2
12-9          0        4
8             SPP      1
7-6           0        2
5             SPIE     1
4             UPIE     1
3-2           0        2
1             SIE      1
0             UIE      1
Figure 10.9: The sstatus CSR. sstatus is a subset of mstatus (Figure 10.4), hence the similar layout. SIE
and SPIE hold the current and pre-exception interrupt enables, analogous to MIE and MPIE in mstatus.
XLEN is 32 for RV32, or 64 for RV64. Figure 4.2 of [Waterman and Asanović 2017] is the basis of this figure;
see Section 4.1 of that document for a description of the other fields.
• The PC of the exceptional instruction is preserved in sepc, and the PC is set to stvec.
• scause is set to the exception cause, as encoded in Figure 10.3, and stval is set to
the faulting address or some other exception-specific word of information.
• Interrupts are disabled by setting SIE=0 in the sstatus CSR, and the previous value
of SIE is preserved in SPIE.
• The pre-exception privilege mode is preserved in sstatus’ SPP field, and the privilege
mode is changed to S.
• The V bit indicates whether the rest of this PTE is valid (V=1). If V=0, any virtual-
address translation that traverses this PTE results in a page fault.
Sv32 PTE layout (Figure 10.10):
Bits   31-20   19-10   9-8  7  6  5  4  3  2  1  0
Field  PPN[1]  PPN[0]  RSW  D  A  G  U  X  W  R  V
Width  12      10      2    1  1  1  1  1  1  1  1

Sv39 PTE layout (Figure 10.11):
Bits   63-54     53-28   27-19   18-10   9-8  7  6  5  4  3  2  1  0
Field  Reserved  PPN[2]  PPN[1]  PPN[0]  RSW  D  A  G  U  X  W  R  V
Width  10        26      9       9       2    1  1  1  1  1  1  1  1
• The R, W, and X bits indicate whether the page has read, write, and execute
permissions, respectively. If all three bits are 0, this PTE is a pointer to the next level of the
page table; otherwise, it’s a leaf of the tree.
• The U bit indicates whether this page is a user page. If U=0, U-mode cannot access
this page, but S-mode can. If U=1, U-mode can access this page, but S-mode cannot.
• The G bit indicates this mapping exists in all virtual-address spaces, information the
hardware can use to improve address-translation performance. It is typically only used
for pages that belong to the operating system.
• The A bit indicates whether the page has been accessed since the last time the A bit
was cleared.
• The D bit indicates whether the page has been dirtied (i.e., written) since the last time
the D bit was cleared.
• The RSW field is reserved for the operating system’s use; the hardware ignores it.
• The PPN field holds a physical page number, which is part of a physical address. If this
PTE is a leaf, the PPN is part of the translated physical address. Otherwise, the PPN
gives the address of the next level of the page table. (Figure 10.10 divides the PPN into
two subfields to simplify the description of the address-translation algorithm.)
The OS relies on the A and D bits to decide which pages to swap to secondary storage.
Periodically clearing the A bits helps the OS approximate which pages have been
least-recently used. The D bit indicates a page is even more expensive to swap out, because it
must be written back to secondary storage.
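To make the PTE fields concrete, here is a minimal Python sketch that unpacks an Sv32 page-table entry. The function name and the dictionary it returns are illustrative assumptions; the bit positions follow the Sv32 layout in which V is bit 0, D is bit 7, RSW occupies bits 9-8, and the PPN occupies bits 31-10.

```python
FLAG_NAMES = "VRWXUGAD"  # flag at bit 0 is V, at bit 7 is D


def decode_sv32_pte(pte):
    """Unpack a 32-bit Sv32 PTE into its flags, RSW field, and PPN."""
    flags = {name: bool((pte >> bit) & 1) for bit, name in enumerate(FLAG_NAMES)}
    return {
        "flags": flags,
        "rsw": (pte >> 8) & 0x3,          # reserved for the operating system
        "ppn": (pte >> 10) & 0x3FFFFF,    # 22-bit physical page number
        "valid": flags["V"],
        # A PTE with R, W, and X all clear points to the next level of the table.
        "leaf": flags["R"] or flags["W"] or flags["X"],
    }
```

For instance, a PTE built as (0x20 << 10) | 1 is valid but not a leaf, so its PPN (0x20) names the next-level page table rather than a translated page.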
RV64 supports multiple paging schemes, but we describe only the most popular one,
Sv39. Sv39 uses the same 4 KiB base page as Sv32. The page-table entries double in size to
eight bytes so they can hold bigger physical addresses. To maintain the invariant that a page
table is exactly the size of a page, the radix of the tree correspondingly falls to 2^9. The tree
is three levels deep. Sv39’s 512 GiB address space is divided into 2^9 gigapages, each 1 GiB.
Each gigapage is subdivided into 2^9 megapages, which in Sv39 are slightly smaller than in
Sv32: 2 MiB. Each megapage is subdivided into 2^9 4 KiB base pages.
The other RV64 paging schemes simply add more levels to the page table. Sv48 is nearly
identical to Sv39, but its virtual-address space is 2^9 times bigger and its page table is one
level deeper.
Figure 10.11 shows the layout of an Sv39 PTE. It’s identical to an Sv32 PTE, except the
PPN field has been widened to 44 bits to support 56-bit physical addresses, or 2^26 GiB of
physical-address space.
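The Sv39 sizes quoted above all follow from the 4 KiB page and the 8-byte PTE. The short Python calculation below, with variable names chosen purely for illustration, works through the arithmetic.

```python
KiB, MiB, GiB = 2**10, 2**20, 2**30

PAGE_SIZE = 4 * KiB    # base page size shared by Sv32 and Sv39
PTE_SIZE = 8           # Sv39 page-table entries are eight bytes

# A page table fills exactly one page, which fixes the radix of the tree.
ENTRIES = PAGE_SIZE // PTE_SIZE          # 512 entries, i.e. 2**9

megapage = ENTRIES * PAGE_SIZE           # one second-level entry covers a megapage
gigapage = ENTRIES * megapage            # one first-level entry covers a gigapage
address_space = ENTRIES * gigapage       # the three-level tree covers it all

# A 44-bit PPN plus the 12-bit page offset gives the physical address width.
physical_address_bits = 44 + 12
```

The results are 512 entries per table, 2 MiB megapages, 1 GiB gigapages, a 512 GiB (2^39 byte) virtual-address space, and 56-bit physical addresses, which span 2^26 GiB.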
10.6. PAGE-BASED VIRTUAL MEMORY 111
RV32:
Bits   31    30-22  21-0
Field  MODE  ASID   PPN
Width  1     9      22

RV64:
Bits   63-60  59-44  43-0
Field  MODE   ASID   PPN
Width  4      16     44
Figure 10.12: The satp CSR. Figures 4.11 and 4.12 of [Waterman and Asanović 2017] are the bases for this
figure.
RV32
Value Name Description
0 Bare No translation or protection.
1 Sv32 Page-based 32-bit virtual addressing.
RV64
Value Name Description
0 Bare No translation or protection.
8 Sv39 Page-based 39-bit virtual addressing.
9 Sv48 Page-based 48-bit virtual addressing.
Figure 10.13: The encoding of the MODE field in the satp CSR. Table 4.3 of [Waterman and Asanović 2017]
is the basis for this figure.
An S-mode CSR, satp (Supervisor Address Translation and Protection), controls the
paging system. As Figure 10.12 shows, satp has three fields. The MODE field enables
paging and selects the page-table depth; Figure 10.13 shows its encoding. The ASID (Address
Space Identifier) field is optional and can be used to reduce the cost of context switches.
Finally, the PPN field holds the physical address of the root page table, divided by the 4 KiB
page size. Typically, M-mode software will write zero to satp before entering S-mode for
the first time, disabling paging, then S-mode software will write it again after setting up the
page tables.
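As an illustration of how the three fields combine, the following Python sketch assembles a satp value from the field widths in Figure 10.12. The function name make_satp and its parameters are hypothetical; real software would write the result with a CSR instruction.

```python
def make_satp(mode, asid, root_table_addr, xlen=32):
    """Pack MODE, ASID, and the root page table's address into a satp value."""
    ppn = root_table_addr >> 12  # physical address divided by the 4 KiB page size
    if xlen == 32:
        # RV32: MODE is bit 31, ASID is bits 30-22, PPN is bits 21-0.
        return ((mode & 0x1) << 31) | ((asid & 0x1FF) << 22) | (ppn & 0x3FFFFF)
    # RV64: MODE is bits 63-60, ASID is bits 59-44, PPN is bits 43-0.
    return ((mode & 0xF) << 60) | ((asid & 0xFFFF) << 44) | (ppn & ((1 << 44) - 1))
```

For example, enabling Sv32 (MODE=1) with a root page table at physical address 0x80000000 yields 0x80080000, and writing zero selects Bare mode, which disables paging.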
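[Figure 10.14: diagram of Sv32 address translation. The virtual address (VA) consists of VPN[1]
(bits 31-22), VPN[0] (bits 21-12), and the page offset (bits 11-0). Starting from satp, one PTE
is read from each of two page tables. The physical address (PA) consists of the PPN (bits 33-12)
and the page offset (bits 11-0).]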
When paging is enabled in the satp register, S-mode and U-mode virtual addresses are
translated into physical addresses by traversing the page table, starting at the root. Fig-
ure 10.14 depicts this process:
1. satp.PPN gives the base address of the first-level page table, and VA[31:22] gives the
first-level index, so the processor reads the PTE located at address (satp.PPN×4096
+ VA[31:22]×4).
2. That PTE contains the base address of the second-level page table and VA[21:12] gives
the second-level index, so the processor reads the leaf PTE located at (PTE.PPN×4096
+ VA[21:12]×4).
3. The leaf PTE’s PPN field and the page offset (the twelve least-significant bits
of the original virtual address) form the final result: the physical address is
(LeafPTE.PPN×4096 + VA[11:0]).
The processor then performs the physical memory access. The translation process is
almost the same for Sv39 as for Sv32, but with larger PTEs and one more level of indirec-
tion. Figure 10.19, at the end of this chapter, gives a complete description of the page-table
traversal algorithm, detailing the exceptional conditions and the special case of superpage
translations.
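The three numbered steps above translate directly into code. The Python sketch below is a simplified model under stated assumptions: the function name and the read_pte callback (which returns the 32-bit PTE stored at a physical address) are invented for illustration, and the validity, permission, and superpage checks of the complete algorithm are omitted.

```python
PAGE_SIZE = 4096  # 4 KiB pages
PTE_SIZE = 4      # Sv32 PTEs are four bytes


def sv32_translate(satp_ppn, va, read_pte):
    """Translate a 32-bit virtual address with a two-level Sv32 page-table walk."""
    vpn1 = (va >> 22) & 0x3FF   # VA[31:22], first-level index
    vpn0 = (va >> 12) & 0x3FF   # VA[21:12], second-level index
    offset = va & 0xFFF         # VA[11:0], page offset

    # Step 1: read the first-level PTE from the root page table.
    pte = read_pte(satp_ppn * PAGE_SIZE + vpn1 * PTE_SIZE)
    # Step 2: its PPN (bits 31-10) locates the second-level table; read the leaf.
    leaf = read_pte((pte >> 10) * PAGE_SIZE + vpn0 * PTE_SIZE)
    # Step 3: combine the leaf's PPN with the page offset.
    return (leaf >> 10) * PAGE_SIZE + offset
```

A usage example with physical memory modeled as a dictionary: if the root table sits at page 0x10, its entry 1 points to a second-level table at page 0x20, and entry 3 of that table maps to page 0x12345, then virtual address 0x00403ABC translates to 0x12345ABC.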
That’s almost all there is to the RISC-V paging system, save for one wrinkle. If all
instruction fetches, loads, and stores resulted in several page-table accesses, then paging
would reduce performance substantially! All modern processors reduce this overhead with
an address-translation cache (often called a TLB, for Translation Lookaside Buffer). To
reduce the cost of this cache, most processors don’t automatically keep it coherent with the
page table—if the operating system modifies the page table, the cache becomes stale. S-
mode adds one more instruction to solve this problem: sfence.vma informs the processor
10.7. CONCLUDING REMARKS 113
Bits          mip field  mie field  Width
XLEN-1 to 12  WIRI       WPRI       XLEN-12
11            MEIP       MEIE       1
10            WIRI       WPRI       1
9             SEIP       SEIE       1
8             UEIP       UEIE       1
7             MTIP       MTIE       1
6             WIRI       WPRI       1
5             STIP       STIE       1
4             UTIP       UTIE       1
3             MSIP       MSIE       1
2             WIRI       WPRI       1
1             SSIP       SSIE       1
0             USIP       USIE       1
Figure 10.15: Machine interrupt registers. They are XLEN-bit read/write registers that hold pending
interrupts (mip) and the interrupt enable bits (mie) CSRs. Only the bits corresponding to lower-privilege
software interrupts (USIP, SSIP), timer interrupts (UTIP, STIP), and external interrupts (UEIP, SEIP) in
mip are writable through this CSR address; the remaining bits are read-only.
Bits          sip field  sie field  Width
XLEN-1 to 10  WIRI       WPRI       XLEN-10
9             SEIP       SEIE       1
8             UEIP       UEIE       1
7-6           WIRI       WPRI       2
5             STIP       STIE       1
4             UTIP       UTIE       1
3-2           WIRI       WPRI       2
1             SSIP       SSIE       1
0             USIP       USIE       1
Figure 10.16: Supervisor interrupt registers. They are XLEN-bit read/write registers that hold pending
interrupts (sip) and the interrupt enable bits (sie) CSRs.
that software may have modified the page tables, so the processor can flush the translation
caches accordingly. It takes two optional arguments, which narrow the scope of the cache
flush: rs1 indicates which virtual address’ translation has been modified in the page table,
and rs2 gives the address-space identifier of the process whose page table has been modified.
If x0 is given for both, the entire translation cache is flushed.
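The way the two arguments narrow a flush can be illustrated with a small software model. This is only a sketch, not how hardware implements a translation cache: the tlb_entry layout, the eight-entry size, and the name sfence_vma_model are assumptions made for illustration, and the use_va and use_asid flags stand in for naming a register other than x0. Handling of global mappings is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 8      /* hypothetical size, chosen for illustration */
#define PAGE_SHIFT  12     /* 4 KiB pages */

/* One cached translation; this layout is an assumption, not a hardware format. */
struct tlb_entry {
    bool     valid;
    uint16_t asid;   /* address-space identifier the entry belongs to */
    uint64_t vpn;    /* virtual page number */
    uint64_t ppn;    /* physical page number */
};

/*
 * Model of the flush requested by sfence.vma rs1, rs2.
 * use_va   == false plays the role of rs1 = x0 (match every address).
 * use_asid == false plays the role of rs2 = x0 (match every address space).
 * Returns the number of entries invalidated.
 */
size_t sfence_vma_model(struct tlb_entry tlb[TLB_ENTRIES],
                        bool use_va, uint64_t va,
                        bool use_asid, uint16_t asid)
{
    size_t flushed = 0;
    for (size_t i = 0; i < TLB_ENTRIES; i++) {
        if (!tlb[i].valid)
            continue;
        if (use_va && tlb[i].vpn != (va >> PAGE_SHIFT))
            continue;   /* different page: keep it */
        if (use_asid && tlb[i].asid != asid)
            continue;   /* different address space: keep it */
        tlb[i].valid = false;
        flushed++;
    }
    return flushed;
}
```

Calling the model with both flags false empties the whole array, which mirrors the x0, x0 case; supplying an address, an ASID, or both leaves unrelated entries cached.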
10.7 Concluding Remarks
The modularity of the RISC-V privileged architectures caters to the needs of a variety of
systems. The minimalist Machine mode supports bare-metal embedded applications at low
cost. The additional User mode and Physical Memory Protection together enable multitasking
in more sophisticated embedded systems. Finally, Supervisor mode and page-based
virtual memory provide the flexibility needed to host modern operating systems.
XLEN-1         2  1  0
BASE[XLEN-1:2]    MODE
XLEN-2            2
Figure 10.17: Machine and supervisor trap-vector base-address register (mtvec and stvec) CSRs. They are
XLEN-bit read/write registers that hold trap vector configuration, consisting of a vector base address
(BASE) and a vector mode (MODE). The value in the BASE field must always be aligned on a 4-byte
boundary. MODE = 0 means all exceptions set the PC to BASE. MODE = 1 sets the PC to
(BASE + (4 × cause)) on asynchronous interrupts.
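The MODE rule in the Figure 10.17 caption reduces to a few lines of arithmetic. The sketch below is illustrative only: trap_target_pc and its parameters are hypothetical names, and MODE values other than 0 and 1 are treated like MODE = 0, which is an assumption rather than something the caption states.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute the PC that a trap jumps to, following the Figure 10.17 caption.
 * tvec is the raw mtvec or stvec value; cause is the exception code.
 */
uint64_t trap_target_pc(uint64_t tvec, bool is_interrupt, uint64_t cause)
{
    uint64_t base = tvec & ~(uint64_t)0x3;  /* BASE field, 4-byte aligned */
    uint64_t mode = tvec & 0x3;             /* MODE field, low two bits */

    if (mode == 1 && is_interrupt)
        return base + 4 * cause;            /* vectored: one slot per cause */
    return base;  /* MODE = 0, a synchronous exception, or (assumed) other MODEs */
}
```

With MODE = 1, only asynchronous interrupts are spread across the vector table; synchronous exceptions still land on BASE.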
XLEN-1     XLEN-2       0
Interrupt  Exception Code
1          XLEN-1
Figure 10.18: Machine and supervisor cause (mcause and scause) CSRs. When a trap is taken, the CSR is
written with a code indicating the event that caused the trap. The Interrupt bit is set if the trap was caused
by an interrupt. The Exception Code field contains a code identifying the last exception. Tables 3.6 and 4.2 of
[Waterman and Asanović 2017] map the code values to the reason for the traps.
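A trap handler usually begins by splitting the cause value into the two fields of Figure 10.18. The following is a minimal sketch of that split; the trap_cause struct and decode_cause function are hypothetical helpers, and the raw value is passed in as an argument rather than read from the CSR so that the code runs on any host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of mcause/scause; a hypothetical convenience type. */
struct trap_cause {
    bool     is_interrupt;  /* top bit of the CSR */
    uint64_t code;          /* remaining XLEN-1 bits */
};

/* Split a raw cause value taken from an XLEN-bit CSR (xlen is 32 or 64). */
struct trap_cause decode_cause(uint64_t raw, unsigned xlen)
{
    uint64_t interrupt_bit = (uint64_t)1 << (xlen - 1);
    struct trap_cause c;
    c.is_interrupt = (raw & interrupt_bit) != 0;
    c.code = raw & (interrupt_bit - 1);   /* mask off the Interrupt bit */
    return c;
}
```

For example, on RV32 the raw value 0x80000007 decodes to an interrupt with code 7, while 0x00000008 decodes to a synchronous exception with code 8.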
Notes
1 http://parlab.eecs.berkeley.edu
Figure 10.19: The complete algorithm for virtual-to-physical address translation. va is the virtual address
input and pa is the physical address output. The PAGESIZE constant is 2^12. For Sv32, LEVELS=2 and
PTESIZE=4, whereas for Sv39, LEVELS=3 and PTESIZE=8. Section 4.3.2 of [Waterman and Asanović
2017] is the basis for this figure.
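The body of the figure does not survive in this text, so the following is a hedged sketch of an Sv32 walk built from the constants in the caption. It assumes the usual Sv32 layout (10-bit VPN slices, PTE flag bits V, R, W, X in the low bits, and the PPN starting at bit 10) and omits the permission, privilege, and accessed/dirty checks. The function sv32_translate and the mem array, which stands in for physical memory starting at address 0, are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGESIZE 4096u   /* 2^12 */
#define LEVELS   2       /* Sv32 */
#define PTESIZE  4       /* Sv32 */

#define PTE_V 0x1u       /* valid */
#define PTE_R 0x2u       /* readable */
#define PTE_W 0x4u       /* writable */
#define PTE_X 0x8u       /* executable */

/*
 * Translate va using the page table rooted at root_ppn (the role played by
 * the PPN in satp). Returns true and stores the result in *pa on success;
 * returns false where hardware would raise a page-fault exception.
 */
bool sv32_translate(const uint32_t *mem, size_t mem_words,
                    uint32_t root_ppn, uint32_t va, uint64_t *pa)
{
    uint64_t a = (uint64_t)root_ppn * PAGESIZE;

    for (int i = LEVELS - 1; i >= 0; i--) {
        uint32_t vpn_i = (va >> (12 + 10 * i)) & 0x3ffu;
        uint64_t pte_addr = a + (uint64_t)vpn_i * PTESIZE;
        if (pte_addr / PTESIZE >= mem_words)
            return false;                       /* outside the modeled memory */
        uint32_t pte = mem[pte_addr / PTESIZE];

        if (!(pte & PTE_V) || (!(pte & PTE_R) && (pte & PTE_W)))
            return false;                       /* invalid or reserved encoding */

        uint64_t ppn = pte >> 10;               /* 22-bit PPN field */
        if (pte & (PTE_R | PTE_X)) {            /* leaf PTE */
            uint64_t span_mask = ((uint64_t)1 << (12 + 10 * i)) - 1;
            if (((ppn << 12) & span_mask) != 0)
                return false;                   /* misaligned superpage */
            *pa = (ppn << 12) | (va & span_mask);
            return true;
        }
        a = ppn * PAGESIZE;                     /* pointer to the next level */
    }
    return false;                               /* no leaf found at any level */
}
```

A leaf found at level 1 is the superpage case: the low 22 bits of the virtual address pass through unchanged, so a single PTE maps 4 MiB.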
11 Future RISC-V Optional Extensions
Fools ignore complexity. Pragmatists suffer it. Some can avoid it. Geniuses remove it.
—Alan Perlis, 1982
Alan Perlis (1922–1990) was the first recipient of the Turing Award (1966), conferred for his
influence on advanced programming languages and compilers. In 1958 he helped design ALGOL,
which has influenced virtually every imperative programming language including C and Java.
The RISC-V Foundation will develop at least eight optional extensions.
11.1 “B” Standard Extension for Bit Manipulation
The B extension offers bit manipulation, including insert, extract, and test bit fields; rotations;
funnel shifts; bit and byte permutations; count leading and trailing zeros; and count bits set.
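To give a sense of what such instructions would replace, here are portable C versions of three of the listed operations. These are ordinary software routines, not definitions of any B instruction, and the function names are hypothetical; each one costs several base-ISA instructions, which is the overhead a dedicated instruction would remove.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bit positions (n is taken modulo 32). */
uint32_t rotate_left32(uint32_t x, unsigned n)
{
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Count leading zero bits; returns 32 when x is zero. */
unsigned count_leading_zeros32(uint32_t x)
{
    unsigned count = 0;
    for (uint32_t mask = 0x80000000u; mask != 0 && !(x & mask); mask >>= 1)
        count++;
    return count;
}

/* Count the bits set in x by clearing the lowest set bit on each iteration. */
unsigned count_bits_set32(uint32_t x)
{
    unsigned count = 0;
    while (x) {
        x &= x - 1;
        count++;
    }
    return count;
}
```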
c.ebreak RaiseException(Breakpoint)
Environment Breakpoint. RV32IC and RV64IC.
Expands to ebreak.
15 13 12 11  7 6   2 1 0
100   1  00000 00000 10
31   25 24 20 19 15 14 12 11  7 6     0
0000001 rs2   rs1   101   rd    0111011
ebreak RaiseException(Breakpoint)
Environment Breakpoint. I-type, RV32I and RV64I.
Makes a request of the debugger by raising a Breakpoint exception.
31        20 19 15 14 12 11  7 6     0
000000000001 00000 000   00000 1110011
ecall RaiseException(EnvironmentCall)
Environment Call. I-type, RV32I and RV64I.
Makes a request of the execution environment by raising an Environment Call exception.
31        20 19 15 14 12 11  7 6     0
000000000000 00000 000   00000 1110011
31   25 24 20 19 15 14 12 11  7 6     0
1110000 00000 rs1   001   rd    1010011
31   25 24 20 19 15 14 12 11  7 6     0
0010001 rs2   rs1   010   rd    1010011
31   25 24 20 19 15 14 12 11  7 6     0
0010000 rs2   rs1   010   rd    1010011
fsqrt.d rd, rs1 f[rd] = √f[rs1]
Floating-point Square Root, Double-Precision. R-type, RV32D and RV64D.
Computes the square root of the double-precision floating-point number in register f[rs1] and
writes the rounded double-precision result to f[rd].
31   25 24 20 19 15 14 12 11  7 6     0
0101101 00000 rs1   rm    rd    1010011
fsqrt.s rd, rs1 f[rd] = √f[rs1]
Floating-point Square Root, Single-Precision. R-type, RV32F and RV64F.
Computes the square root of the single-precision floating-point number in register f[rs1] and
writes the rounded single-precision result to f[rd].
31   25 24 20 19 15 14 12 11  7 6     0
0101100 00000 rs1   rm    rd    1010011
j offset pc += sext(offset)
Jump. Pseudoinstruction, RV32I and RV64I.
Sets the pc to the current pc plus the sign-extended offset. Expands to jal x0, offset.
jr rs1 pc = x[rs1]
Jump Register. Pseudoinstruction, RV32I and RV64I.
Sets the pc to x[rs1]. Expands to jalr x0, 0(rs1).
mret ExceptionReturn(Machine)
Machine-mode Exception Return. R-type, RV32I and RV64I privileged architectures.
Returns from a machine-mode exception handler. Sets the pc to CSRs[mepc], the
privilege mode to CSRs[mstatus].MPP, CSRs[mstatus].MIE to CSRs[mstatus].MPIE, and
CSRs[mstatus].MPIE to 1; and, if user mode is supported, sets CSRs[mstatus].MPP to 0.
31   25 24 20 19 15 14 12 11  7 6     0
0011000 00010 00000 000   00000 1110011
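The state changes in the mret description can be written out as a short model. This is a sketch for reading along, not an emulator: the machine_state struct holds only the fields the description mentions, its name and layout are hypothetical, and the privilege encoding (0 for User, 1 for Supervisor, 3 for Machine) is stated here as an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* The slice of machine state that mret touches; a hypothetical model type. */
struct machine_state {
    uint64_t pc;
    uint64_t mepc;
    unsigned priv;   /* current privilege mode: 0 = U, 1 = S, 3 = M */
    unsigned mpp;    /* mstatus.MPP */
    bool     mie;    /* mstatus.MIE */
    bool     mpie;   /* mstatus.MPIE */
};

/* Apply the updates listed in the mret entry, in the order it gives them. */
void mret_model(struct machine_state *s, bool user_mode_supported)
{
    s->pc   = s->mepc;   /* resume where the trap was taken */
    s->priv = s->mpp;    /* restore the previous privilege mode */
    s->mie  = s->mpie;   /* restore the previous interrupt enable */
    s->mpie = true;
    if (user_mode_supported)
        s->mpp = 0;      /* 0 encodes User mode */
}
```

The order matters in one place: the privilege mode must be taken from MPP before MPP is reset to 0.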
nop Nothing
No operation. Pseudoinstruction, RV32I and RV64I.
Merely advances the pc to the next instruction. Expands to addi x0, x0, 0.
ret pc = x[1]
Return. Pseudoinstruction, RV32I and RV64I.
Returns from a subroutine. Expands to jalr x0, 0(x1).
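The expansions of nop, jr, and ret all produce I-type instructions, so one small encoder is enough to show the machine words they become. This is a sketch, not an assembler: the function names are hypothetical, the field positions follow the I-type diagrams shown for ebreak and ecall, and the opcode values for addi (0010011) and jalr (1100111) are taken from the base ISA rather than from the entries above. The j pseudoinstruction is not covered, since jal uses a different format.

```c
#include <stdint.h>

#define OPCODE_OP_IMM 0x13u  /* 0010011, used by addi */
#define OPCODE_JALR   0x67u  /* 1100111, used by jalr */

/* Assemble an I-type instruction word from its five fields. */
uint32_t encode_i_type(int32_t imm, unsigned rs1, unsigned funct3,
                       unsigned rd, unsigned opcode)
{
    return (((uint32_t)imm & 0xfffu) << 20) |
           ((rs1 & 0x1fu) << 15) |
           ((funct3 & 0x7u) << 12) |
           ((rd & 0x1fu) << 7) |
           (opcode & 0x7fu);
}

/* nop expands to addi x0, x0, 0 */
uint32_t encode_nop(void)
{
    return encode_i_type(0, 0, 0, 0, OPCODE_OP_IMM);
}

/* jr rs1 expands to jalr x0, 0(rs1) */
uint32_t encode_jr(unsigned rs1)
{
    return encode_i_type(0, rs1, 0, 0, OPCODE_JALR);
}

/* ret expands to jalr x0, 0(x1) */
uint32_t encode_ret(void)
{
    return encode_jr(1);
}
```

Because the destination is always x0, none of these pseudoinstructions writes a register; only the pc changes (or, for nop, nothing at all).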
sret ExceptionReturn(Supervisor)
Supervisor-mode Exception Return. R-type, RV32I and RV64I privileged architectures.
Returns from a supervisor-mode exception handler. Sets the pc to CSRs[sepc], the privilege
mode to CSRs[sstatus].SPP, CSRs[sstatus].SIE to CSRs[sstatus].SPIE, CSRs[sstatus].SPIE
to 1, and CSRs[sstatus].SPP to 0.
31   25 24 20 19 15 14 12 11  7 6     0
0001000 00010 00000 000   00000 1110011