Bernard Goossens
Guide to Computer Processor Architecture
A RISC-V Approach, with High-Level Synthesis
Undergraduate Topics in Computer Science
Series Editor
Ian Mackie, University of Sussex, Brighton, UK
Advisory Editors
Samson Abramsky, Department of Computer Science, University of Oxford, Oxford, UK
Chris Hankin, Department of Computing, Imperial College London, London, UK
Mike Hinchey, Lero—The Irish Software Research Centre, University of Limerick, Limerick, Ireland
Dexter C. Kozen, Department of Computer Science, Cornell University, Ithaca, NY, USA
Andrew Pitts, Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
Hanne Riis Nielson, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark
Steven S. Skiena, Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
Iain Stewart, Department of Computer Science, Durham University, Durham, UK
Joseph Migga Kizza, College of Engineering and Computer Science, The University of Tennessee-Chattanooga, Chattanooga, TN, USA
‘Undergraduate Topics in Computer Science’ (UTiCS) delivers high-quality
instructional content for undergraduates studying in all areas of computing and
information science. From core foundational and theoretical material to final-year
topics and applications, UTiCS books take a fresh, concise, and modern approach
and are ideal for self-study or for a one- or two-semester course. The texts are all
authored by established experts in their fields, reviewed by an international advisory
board, and contain numerous examples and problems, many of which include fully
worked solutions.
The UTiCS concept relies on high-quality, concise books in softback format, and
generally a maximum of 275–300 pages. For undergraduate textbooks that are
likely to be longer, more expository, Springer continues to offer the highly regarded
Texts in Computer Science series, to which we refer potential authors.
Bernard Goossens
Guide to Computer Processor Architecture
A RISC-V Approach, with High-Level Synthesis
Bernard Goossens
Université de Perpignan
Perpignan, France
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
This book is a new textbook on processor architecture. What is new is not the topic, even though up-to-date multicore and multithreaded designs are depicted, but the way processor architecture is presented.
This book can be related to the famous Douglas Comer textbook on Operating Systems (OS) [1, 2]. As Douglas Comer did to present the design of an OS, I use a DIY approach to present processor designs.
In his book, Douglas Comer builds a full OS from scratch, with C source code.
In the present book, I aim to make you build your own processors, also from scratch
and also with C source code.
All you need is a computer, an optional development board, and a set of freely available software tools to transform C programs into equivalent FPGA (field-programmable gate array) implementations.
If you don't have a development board, you can still simulate the processors presented in the book.
In the 70s (of the twentieth century of course), it became possible for a single
person to build a full OS (Kenneth L. Thompson created Unix in 1970), and better
than that, to write a How-To book giving a complete recipe to implement a
Unix-like OS (Douglas Comer published Xinu in 1984).
Two improvements in computer hardware and software eventually made it feasible: the availability of personal computers and the C programming language. Together, they gave full access to the hardware.
Nowadays, an FPGA plays the role of the personal computer of the 70s: It
gives access to the logic gates. The High-Level Synthesis tool (HLS) plays the role
of the C compiler of the 70s: It gives access to the FPGA through a high-level
language.
The Douglas Comer book explained how to build an OS using a self-made example named Xinu. Even though Xinu was claimed not to be Unix (Xinu is a recursive acronym meaning "Xinu Is Not Unix"), it behaved very much like it, giving the reader and implementer the opportunity to compare his/her own realization to the Unix reference.
Following the same idea, I have chosen a reference processor to be able to compare the FPGA-based processors proposed in the book to real RISC-V industrial products.
RISC-V is an open-source Instruction Set Architecture (ISA), which means that you can build a RISC-V processor, use it, and even sell it, without the permission of any computer manufacturer. It would not be so for Intel's X86 or ARM's v7 or v8.
Moreover, RISC-V defines multiple nested ISA subsets. A processor may implement any level of this matryoshka-like organization of the ISA.
In this book, you will implement one of the most basic subsets, namely RV32I (a set of machine instructions to compute on 32-bit integer words). But you will know enough to be able to expand your processor to 64-bit words, to add a floating-point computation subset, and many more, according to the RISC-V specification [3].
Moreover, the subset you will be implementing is enough to boot an OS like
Linux (which is not part of this book though).
All the processors designed in this book are provided as open-source projects (either the simulation-only versions or the full FPGA-based projects to be tested on a Xilinx-based development board), available in the goossens-book-ip-projects repository of the https://github.com/goossens-springer github.
A full chapter of the book is devoted to the installation of the RISC-V tools: the gnu toolchain including the RISC-V cross-compiler, the spike simulator, the gdb debugger, and the RISC-V-tests official test and benchmark suite provided by the RISC-V international organization (https://riscv.org/).
The implementations proposed in the book are compared from a performance
perspective, applying the famous Hennessy–Patterson “quantitative approach” [4].
The book presents an adaptation of a benchmark suite to the development board
no-OS environment.
The same benchmark suite is used throughout the book to test and compare the performance of the successive designs. These comparisons highlight the cycles per instruction (CPI) term in the processor performance equation.
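For reference, the performance equation popularized by [4] expresses a program's execution time T as the product of its instruction count IC, its average cycles per instruction CPI, and the processor cycle time Tc: T = IC × CPI × Tc. The microarchitectures compared in this book mainly act on the CPI factor.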
Such comparisons, based on actually implemented softcores, are more convincing for students than similar evaluations relying on simulation with no real hardware constraints.
The different microarchitectures described in the book introduce the general
concepts related to pipelining: branch delay and cancelation, bypassing in-flight
values, load delay, multicycle operators, and more generally, filling the pipeline
stages.
The chapter devoted to the RISC-V RV32I is also an introduction to assembly
programming. RISC-V codes are obtained from the compilation of C patterns
(expressions, tests, loops, and functions) and analyzed.
Throughout the book, some exercises are proposed which can serve as semester
projects, like extending the given implementations to the RISC-V M or F ISA
subsets.
Such a box signals some experimentation the reader can do from the
resources available in the goossens-book-ip-projects/2022.1 folder in the
https://github.com/goossens-springer github.
The Vitis_HLS projects are pre-built (you just need to select the testbench
file you want to use for your IP simulation) (IP means Intellectual Property,
i.e., your component).
The Vivado projects are also pre-built with drivers to directly test your IPs
on the development board. The expected results are in the book.
The book is a very detailed introduction to High-Level Synthesis. HLS will certainly become the standard way to produce RTL, progressively replacing Verilog/VHDL, just as high-level languages progressively replaced assembly language in the 50s and 60s.
Chapter 2 of the book presents the Xilinx HLS environment in the Vitis tool
suite. Based on an IP implementation example, it travels through all the steps from
HLS to the Xilinx IP integrator Vivado and the Xilinx Vitis IDE (Integrated Design
Environment) to upload the bitstream on the FPGA.
The book explains how to implement, simulate, synthesize, run on FPGA, and even debug HLS softcore projects without the need to go down to Verilog/VHDL or chronograms. HLS is today a mature tool which gives engineers the ability to quickly develop FPGA prototypes. Thanks to HLS and to the simplicity of the RV32I instruction nucleus in the RISC-V ISA, the development of a RISC-V processor in HLS is a one engineer-month task, whereas implementing an ARM processor in VHDL was more like a one-year job.
The book explains the major pragmas used by the HLS synthesizer (ARRAY_PARTITION, DEPENDENCE, INTERFACE, INLINE, LATENCY, PIPELINE, UNROLL).
Part I of the book concerns individual IPs, and Part II is devoted to Systems-on-Chip built from multiple core and memory IPs interconnected by an AXI interconnection component.
The book is also an introduction to RISC-V. A full chapter is devoted to the
presentation of the RV32I ISA.
The market growth of RISC-V processors is already impressive in the domain of embedded computing. The future of RISC-V might mirror the progression of Unix in the domain of operating systems. At least, the first steps are comparable, with the same vendor-independent and open-source philosophy.
However, this book does not contain any implementation of the techniques found in
the most advanced processors, like superscalar execution, out-of-order execution,
speculation, branch prediction, or value prediction (however, these concepts are at
least defined).
The reason is that these microarchitectural features are too complex to fit in a
small FPGA like the one I used. For example, a cost-effective out-of-order design
requires a superscalar pipeline, an advanced branch predictor, and a hierarchical
memory altogether.
For the same reason, there is no implementation of advanced parallelism management units like shared memory management (i.e., cache coherency).
The book does not include any implementation of caches or complex arithmetic
operators (multiplication, division, or floating-point unit). They can fit on the FPGA
(at least in a single core and single thread processor). They are left as an exercise for
the reader.
The book is divided into two parts and 14 chapters, including an introduction and a
conclusion. Part I, from Chaps. 1 to 10, is devoted to single core processors. Part II,
from Chaps. 11 to 14, presents some multicore implementations.
Chapter 1 is the introduction. It presents what an FPGA is and how HLS works
to transform a C program into a bitstream to configure the FPGA.
The two following chapters give the necessary indications to build the full
environment used in the book to develop the RISC-V processors.
Chapter 2 is related to the Xilinx Vitis FPGA tools (the Vitis_HLS FPGA
synthesizer, the Vivado FPGA integrator, and the Vitis IDE FPGA programmer).
Chapter 3 presents the RISC-V tools (the Gnu toolchain, the Spike simulator,
and the OpenOCD/gdb debugger), their installation, and the way to use them.
Chapter 4 presents the RISC-V architecture (more precisely, the RV32I ISA) and
the assembly language programming.
Chapter 5 shows the three main steps in building a processor: fetching, decoding,
and executing. The construction is incremental.
The general principles of HLS programming, in contrast to classic programming,
are explained in the first section of Chap. 5.
Chapter 6 completes Chap. 5 with the addition of a data memory to form the first full RISC-V processor IP. The implemented microarchitecture has the simplest non-pipelined organization.
Chapter 7 explains how you should test your processor IPs, using small RISC-V codes to check each instruction format individually, and also using the official RISC-V-tests pieces of code provided by the RISC-V organization. Finally, you should run some benchmarks to test your IP behavior on real applications.
I have combined the RISC-V-tests and the mibench benchmarks [12] to form a
suite which is used both to test and to compare the different implementations
throughout the book.
At the end of Chap. 7, you will find many hints on how to debug HLS codes and
IPs on the FPGA.
Chapter 8 describes pipelined microarchitectures, starting with a two-stage
pipeline and ending with a four-stage pipeline.
Chapter 9 pushes pipelining a step further to handle multicycle instructions. The multicycle pipeline RISC-V IP has six stages. It is a necessary improvement of the pipeline organization to run RISC-V multicycle instructions like the ones found in the F and D floating-point extensions, or to implement cache levels building a hierarchical memory.
Chapter 10 presents a multiple hart IP (a hart is a HARdware Thread). Multithreading is a technique to help fill the pipeline and improve the processor throughput. The implemented IP is able to run from two to eight threads simultaneously.
Chapter 11 starts the second part. It describes the AXI interconnection system
and how multiple IPs can be connected together in Vivado and exchange data on
the FPGA.
Chapter 12 presents a multicore IP based on the multicycle six-stage pipeline.
The IP can host from two to eight cores running either independent applications or
parallelized ones.
Chapter 13 shows a multicore multihart IP. The IP can host two cores with four
harts each or four cores with two harts each.
Chapter 14 concludes by showing how you can use your RISC-V processor implementations to play with your development board, lighting LEDs (Light-Emitting Diodes) as push buttons are pressed.
An acronym section is added in the front matter to give the meaning of the abbreviations used in the book.
A few exercises are proposed in the book (with no given solution) which should
be viewed by professors as project or laboratory suggestions.
References
1. Comer, D.: Operating System Design: The Xinu Approach. Prentice Hall International, Englewood Cliffs, New Jersey (1984)
2. Comer, D.: Operating System Design: The Xinu Approach, 2nd edn. Chapman and Hall/CRC Press (2015)
3. https://riscv.org/specifications/isa-spec-pdf/
4. Hennessy, J.L., Patterson, D.A.: Computer Architecture: A Quantitative Approach, 6th edn. Morgan Kaufmann (2017)
5. Hennessy, J.L., Patterson, D.A.: Computer Organization and Design: The Hardware/Software Interface, 6th edn. Morgan Kaufmann (2020)
6. Faggin, F.: The Birth of the Microprocessor. Byte 17(3), 145–150 (1992)
7. Faggin, F.: The Making of the First Microprocessor. IEEE Solid-State Circuits Magazine (2009). https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=4776530
8. Patterson, D., Ditzel, D.: The Case for the Reduced Instruction Set Computer. ACM SIGARCH Computer Architecture News 8(6), 5–33 (1980)
9. Chow, P., Horowitz, M.: Architectural Tradeoffs in the Design of MIPS-X. ISCA '87 (1987)
10. Tullsen, D.M., Eggers, S.J., Levy, H.M.: Simultaneous Multithreading: Maximizing On-Chip Parallelism. 22nd Annual International Symposium on Computer Architecture, pp. 392–403. IEEE (1995)
11. Tendler, J.M., Dodson, J.S., Fields Jr., J.S., Le, H., Sinharoy, B.: POWER4 System Microarchitecture. IBM Journal of Research and Development 46(1), 5–26 (2002)
12. https://vhosts.eecs.umich.edu/mibench/
Acknowledgements
I want to thank my reviewers for the invaluable selfless work they have done, reading and correcting an ever-changing manuscript. They were all great! So, the only acceptable order of presentation is the alphabetical one (based on the last name). If you find some mistakes in the textbook, they are all mine of course.
Yves Benhamou is a computer engineer in a domain not at all related to processor architecture. I asked Yves to be the neophyte reader, going through the steps of the first constructions, like installing the software and making the first example run on the Pynq-Z2 board I had given him. Moreover, Yves mainly works on Windows and he had to start with a fresh new Ubuntu installation, so he could give me a lot of very important remarks for readers in the same situation, about missing commands, files, and needed environment preparations. Yves, I hope you had fun discovering FPGAs, softcore designs, and HLS! On my side, I was very happy to see that again you were eager to make it run!
Johannes Schoder is a Ph.D. student at Friedrich Schiller University, Jena, Germany. He contacted me in June 2021 to ask for free access to my "Out-of-Order RISC-V Core Developed with HLS" which I had presented at the second RISC-V week in Paris in October 2019. I was already writing the book, but it was still at a very preliminary stage. Later, after Johannes had been through the code of what is in the book the "multicycle_pipeline_ip", I proposed that he review a part of the textbook. He accepted enthusiastically. Johannes, you did a great job!
Arnaud Tisserand is "Directeur de Recherche" in the French CNRS (Centre National de la Recherche Scientifique). He is an expert in architecture, focusing his research work on chip accelerators for arithmetic and cryptography. I asked him to review the FPGA part of the textbook. It turns out that Arnaud wants to use HLS more intensely for his future designs (mainly because HLS produces prototypes very quickly; this speed in production is invaluable for Ph.D. students), so he proposed to review a bit more material and even to try some of the proposed implementations. Thank you so much Arnaud for your help!
I also want to thank the XUP (Xilinx University Program) and more specifically its manager, Cathal Mac Cabe. Cathal is an engineer at Xilinx/AMD. I have been in contact with him through mail, as all the members of the XUP are, asking him many questions on HLS and Vitis. Cathal does a great job at XUP. He is very helpful to
all the worldwide academic community of Xilinx product users. The XUP program is crucial for many little research teams like mine. When a professor plans to teach FPGAs or HLS to undergraduate students, he/she must convince his/her colleagues that devoting a few thousand euros of the yearly budget to buying development boards is worth the spending. This is where XUP plays a major role. By providing free boards, it helps the pedagogical teams to set up preliminary experiments which can serve as persuasive arguments for the faculty members who are not in the computer science domain. So, thank you again Cathal for the great supporting job (and for the free boards)!
There are many more people who directly or indirectly influenced the textbook. Of course, my colleagues of the LIRMM laboratory (Laboratoire d'Informatique, Robotique et Microélectronique de Montpellier) and more particularly the current members of the DALI team (Digits, Architecture et Logiciels Informatiques) in Perpignan: Dushan Bikov, Matthieu Carrère, Youssef Fakhreddine, Philippe Langlois, Kenelm Louetsi, Christophe Nègre, David Parello, Guillaume Révy, and Vincent Zucca.
Last but not least, I thank my wife, Urszula, for her everyday support. It is not easy for the family to accept sacrificing so much time, including weekends and holidays, to this kind of long ongoing project. Urszula, you took such an important part in the realization of this book!
Acronyms
The given definitions are all taken from online Wikipedia. The “In short” versions
are all mine. They intend to be less general but a bit more practical.
ABI Application Binary Interface: an interface between two binary program
modules. In short, an ABI fixes a general frame to build applications
from the processor architecture.
ALU Arithmetic and Logic Unit: A combinational digital electronic circuit that
performs arithmetic and bitwise operations on integer binary numbers.
AXI Advanced eXtensible Interface: a parallel high-performance,
synchronous, high-frequency, multi-master, multi-slave communication
interface, mainly designed for on-chip communication. In short, an IP
interconnection system.
CLB Configurable Logic Block: A fundamental building block of field-
programmable gate array (FPGA) technology. Logic blocks can be
configured by the engineer to provide reconfigurable logic gates. In
short, the elementary building structure in FPGAs.
CPU Central Processing Unit: The electronic circuitry within a computer that
executes instructions that make up a computer program. In short, the
processor core.
ELF Executable and Linkable Format: It is a common standard file format for
executable files, object code, shared libraries, and core dumps. In 1999, it
was chosen as the standard binary file format for Unix and Unix-like
systems on x86 processors by the 86open project. By design, the ELF
format is flexible, extensible, and cross-platform. For instance, it
supports different endiannesses and address sizes, so it does not exclude
any particular central processing unit (CPU) or Instruction Set Archi-
tecture. This has allowed it to be adopted by many different operating
systems on many different hardware platforms. In short, the loadable
format of all the executable files in Linux or MacOS systems.
FPGA Field-Programmable Gate Array: An integrated circuit designed to be
configured by a customer or a designer after manufacturing. In short, it is
a programmable chip.
GUI Graphical User Interface: A form of user interface that allows users to interact with electronic devices through graphical icons and audio indicators such as primary notation, instead of text-based user interfaces, typed command labels or text navigation.
HDL Hardware Description Language: A specialized computer language used
to describe the structure and behavior of electronic circuits, and most
commonly, digital logic circuits. In short: an HDL is to integrated
circuits what a programming language is to algorithms.
HLS High-Level Synthesis: An automated design process that interprets an
algorithmic description of a desired behavior and creates digital hardware
that implements that behavior. In short, implementing hardware with a
program written in a high-level language like C or C++.
IP Intellectual Property: A category of property that includes intangible
creations of the human intellect. In short, a component.
ISA Instruction Set Architecture: An abstract model of a computer. It is also
referred to as architecture or computer architecture. A realization of an
ISA, such as a central processing unit (CPU), is called an implemen-
tation. In short, a processor architecture is defined by an ISA, i.e., a
machine language (or assembly language).
LAB Logic Array Block: see the CLB entry. The LAB is to the Altera FPGA manufacturer what the CLB is to the Xilinx FPGA manufacturer.
LUT Lookup Table: An array that replaces runtime computation with a simpler array indexing operation. An n-input lookup table can implement any of the 2^(2^n) Boolean functions of n variables. In an FPGA, a 6-input LUT (LUT-6) is a 64-bit addressable table, which is addressed with a 6-bit word built from the Boolean values of six variables. The addressed bit gives the Boolean value of the function for the input combination forming the address.
OoO Out-of-Order: A paradigm used in most high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In short, a hardware organization to run instructions according to their producer-to-consumer dependencies.
OS Operating System: A system software that manages computer hardware,
software resources, and provides common services for computer
programs. In short, Linux, Windows, or MacOS.
PCB Printed Circuit Board: A printed circuit board (PCB) mechanically
supports and electrically connects electrical or electronic components
using conductive tracks, pads, and other features etched from one or
more sheet layers of copper laminated onto and/or between sheet layers
of a non-conductive substrate. In short, your development board.
RAM Random Access Memory: A form of computer memory that can be read
and changed in any order, typically used to store working data and
machine code. In short, the processor’s main memory.
RAW Read After Write dependency (or true dependency): An instruction refers
to a result that has not yet been calculated or retrieved.
In this first part, I present a set of four implementations of the RV32I RISC-V ISA:
non-pipelined, pipelined, multicycle, and multihart (i.e., multithreaded). Each
defines a single core IP which has been simulated and synthesized with the
Vitis HLS tool, placed and routed by the Vivado tool, and tested on the Xilinx FPGA
available on a Pynq-Z1/Pynq-Z2 development board.
1 Introduction: What Is an FPGA, What Is High-Level Synthesis or HLS?
Abstract
This chapter shows what an FPGA is and how it is structured from Configurable Logic Blocks or CLBs (in the Xilinx terminology; LABs, i.e. Logic Array Blocks, in Altera FPGAs). It also shows how hardware is mapped on the CLB resources and how a C program can be used to describe a circuit. An HLS tool transforms the C source code into an intermediate code in VHDL or Verilog, and a placement and routing tool builds the bitstream to be sent to configure the FPGA.
-- half adder behavior (the entity declaration with ports a, b, s, c is assumed)
architecture BHV of half_adder is
begin
  s <= a xor b;
  c <= a and b;
end BHV;
We want to add two 1-bit words, i.e. s = a + b, where a and b are Boolean variables.
The sum s is a 2-bit word, composed of the modulo 2 sum and the carry bit.
For example, if the pair (a, b) is (1, 1), their sum s in binary is 10, which is the
concatenation of the carry bit (1) and the modulo 2 sum (0).
Let us first concentrate on the modulo 2 sum.
We can define the modulo 2 sum as a Boolean function of two variables, whose truth table is presented as Table 1.1, where the arguments of the function are in blue and the values of the function are in red.
For example, the last line of the table says that if a = b = 1 then s = 0, i.e.
s(1, 1) = 0.
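To make the two Boolean functions concrete, here is a minimal C sketch (mine, not one of the book's listings; the half_adder name is an assumption): the modulo 2 sum is the XOR of the two inputs, and the carry is their AND.
void half_adder(unsigned int a, unsigned int b,
                unsigned int *s, unsigned int *c) {
  *s = a ^ b; /* modulo 2 sum */
  *c = a & b; /* carry bit */
}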
A LUT (acronym of Look-Up Table) is a hardware device which is similar to a memory. This memory is filled with the truth table of a Boolean function. For example, a 4-bit LUT (or LUT-2) can store the truth table of a Boolean function of two variables (the red values in Table 1.1).
More generally, a 2^n-bit LUT (or LUT-n) can contain the truth table of a Boolean function of n variables (it fits the 2^n truth values).
For example, NOT is a single-variable Boolean function. It is represented by a two-line truth table and a LUT-1, i.e. two truth values (NOT(0)=1 and NOT(1)=0).
The two-variable Boolean function AND can be extended to three variables (a
AND b AND c). The truth table has eight lines, as shown in Table 1.2.
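(The table itself is reconstructed here from its definition: the three-variable AND is 1 only when a = b = c = 1.)
Table 1.2 The truth table of the three-variable AND
a b c s
0 0 0 0
0 0 1 0
0 1 0 0
0 1 1 0
1 0 0 0
1 0 1 0
1 1 0 0
1 1 1 1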
Table 1.1 The truth table of the modulo 2 sum of two 1-bit words
a b s
0 0 0
0 1 1
1 0 1
1 1 0
Table 1.3 The truth table of the operator “IF a THEN b ELSE c”
a b c s
0 0 0 0
0 0 1 1
0 1 0 0
0 1 1 1
1 0 0 0
1 0 1 0
1 1 0 1
1 1 1 1
Let us continue our construction of an adder. This time, let us try to build a full adder,
that is, a hardware cell calculating the modulo 2 sum of two bits and an input carry.
The carry is a new input variable which extends the truth table from four to eight
rows. The modulo 2 sum of the three input bits a, b, and ci is their XOR (a ⊕ b ⊕ ci ).
Rather than being built with a LUT-3, the full adder is built with two LUT-2. It
calculates not only the modulo 2 sum of its three inputs but also their carry.
For example, if a = 0 and b = 1, the modulo 2 sum is s = 1. But if there is a
carry in, say ci = 1, then the modulo 2 sum is s = 0 and there is a carry out co = 1.
A first LUT-2 is used to store the truth table of the Boolean function generate. As
its name suggests, the generate function value is 1 when the sum of the two sources
a and b generates a carry, i.e. when a = b = 1, meaning a ∧ b is true (a AND b).
A second LUT-2 stores the truth table of the Boolean function propagate. The
propagate function value is 1 when both sources a and b can propagate an inbound
carry but cannot generate one, which happens when either of the two sources is a 1
but not both simultaneously. The propagate function is the XOR (a ⊕ b).
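As an illustrative C sketch (my own, not the book's code, continuing the hypothetical half_adder style above), the full adder can be expressed directly with the generate and propagate functions: the carry out is the propagated carry in when propagate is 1, and the generated carry otherwise.
void full_adder(unsigned int a, unsigned int b, unsigned int ci,
                unsigned int *s, unsigned int *co) {
  unsigned int generate  = a & b;  /* carry created by the sources   */
  unsigned int propagate = a ^ b;  /* carry forwarded by the sources */
  *s  = propagate ^ ci;            /* modulo 2 sum of a, b, and ci   */
  *co = propagate ? ci : generate; /* the multiplexer of Fig. 1.4    */
}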
We link the two tables as shown in Fig. 1.4. In the figure, the addressed table
entries are shown in red. The values out of the LUTs are propagated to the mux box
and to the XOR gate on the right.
The box labeled mux is a multiplexer, that is, equivalent to the Boolean function
“IF x THEN y ELSE z”. The input coming from the left side of the box is the x
selector. The entries coming from the lower edge of the box are the choices y (the
right input) and z (the left input). If the selector x is a 0, the choice on the left (z) is
found at the output. Otherwise, it is the right choice which crosses the multiplexer
(y).
Thus, if a = 0 and b = 1, the propagate function value is 1 and the generate
function value is 0 (these values are in red in the figure). If the incoming carry ci
is 1, the multiplexer selector (1) passes the right choice ci and the outgoing carry is
co = 1.
The structure of the prefabricated circuit composed of the two LUTs and the two
gates brings out the propagation of the input carry ci towards the output co when the
multiplexer chooses its right input, i.e. when propagate is 1. This carry propagation mode is very efficient because the input signal ci is found at the output after a single gate crossing.
Fig. 1.4 The full adder: two LUT-2 holding the generate and propagate truth tables, linked by the carry multiplexer (mux) and the output XOR gate; the red values illustrate the case a = 0, b = 1, ci = 1
The gate on the right of the figure is an XOR. It produces the modulo 2 sum of
propagate and ci . Since propagate is itself an XOR, the output s is the modulo 2
sum of the three inputs a, b, and ci (a ⊕ b ⊕ ci ).
In the example presented, the combination of the two LUTs, the multiplexer and
the XOR gate calculates two bits, one representing the modulo 2 sum of the three
inputs and the other being the outgoing carry. By sticking these two bits together,
we form the 2-bit sum of the three inputs (0 + 1 + 1 = 10 in binary).
The organization of the full adder proposed above corresponds to what is found
in a CLB, i.e. a Configurable Logic Block, which is the basic building block of the
FPGA (in [1], pages 19 and 20, you have the exact description of the CLBs you find
in the FPGA used in this book).
A CLB combines a LUT with a fast carry propagation mechanism. It is a kind
of Swiss army knife, which computes logic functions with the LUT and arithmetic
functions with the carry propagation.
The programming or configuration of the CLB is the filling of the LUTs with the truth values of the desired Boolean functions (in the example, the propagate and generate functions; the Swiss army knife may divide the LUT into two halves to install two Boolean functions).
Let us try to extend our adder from a 1-bit word adder to a 2-bit word adder.
Let A = a1a0 and B = b1b0, for example A = 10, with a1 = 1 and a0 = 0 (i.e. 1 ∗ 2^1 + 0 ∗ 2^0, or 2 in decimal), and B = 01 (i.e. 0 ∗ 2^1 + 1 ∗ 2^0, or 1 in decimal). The sum A + B + ci is the 3-bit word co s1 s0 (for example 10 + 01 + 1 = 100, or in decimal 2 + 1 + 1 = 4).
By combining two CLBs configured as a full adder, with the output of the first
(co0 in Fig. 1.5) connected to the input of the second (ci1 in Fig. 1.5) and inputs a0
and b0 for the first and a1 and b1 for the second, three bits are output, forming the
co s1 s0 sum of the two 2-bit words A = a1 a0 and B = b1 b0 and an incoming carry
ci .
Figure 1.5 shows this 2-bit adder.
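In the same illustrative C style (again my sketch, reusing the full_adder function introduced above), the 2-bit adder chains two full adders through the carry:
void adder_2bit(unsigned int a1, unsigned int a0,
                unsigned int b1, unsigned int b0,
                unsigned int ci, unsigned int *co,
                unsigned int *s1, unsigned int *s0) {
  unsigned int c0;
  full_adder(a0, b0, ci, s0, &c0); /* CLB0: low bits, carry out c0 */
  full_adder(a1, b1, c0, s1, co);  /* CLB1: high bits, carry in c0 */
}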
An FPGA contains a matrix of CLBs (see the left part of Fig. 1.6).
For example, the Zynq XC7Z020 from Xilinx is a SoC (System-on-Chip, i.e.
a circuit containing several components: processors, memories, USB and Ethernet
interfaces, and an FPGA of course) whose programmable part (the FPGA) contains
6650 CLBs.
One can imagine that the CLBs are organized in a square of more or less 80
columns and 80 rows (the exact geometry is not described). For a detailed presentation
of FPGAs (including their history), you can refer to the Hideharu Amano book [2].
Each CLB contains two identical, parallel and independent SLICEs (centre part
of Fig. 1.6). Each slice mainly consists of four LUT-6 and eight flip-flops (the red
squares labeled FF—for Flip-Flop—in the centre and right part of Fig. 1.6). Each
flip-flop is a 1-bit clocked memory point, which collects the output of the LUT.
Each LUT-6 can represent a six-variable Boolean function or can be split into two
LUT-5, each representing a five-variable Boolean function.
The LUTs of the same slice are linked together by a carry propagation chain
identical to that of Figs. 1.4 and 1.5 (including a multiplexer and an XOR gate).
Fig. 1.5 The 2-bit adder: two full-adder CLBs (CLB0 and CLB1) chained by the carry (the carry out co0 of CLB0 feeds the carry in ci1 of CLB1)
Fig. 1.6 From left to right: the CLB matrix of an FPGA, a CLB with its two slices, and a slice with its LUTs and flip-flops (FF)
Fig. 1.7 A 16-bit adder: the operands are distributed by nibbles (A0 = a03...a00 and B0 = b03...b00, up to A3 = a15...a12 and B3 = b15...b12) to four chained CLBs (left); each LUT holds the generate and propagate functions, the multiplexers (m) propagate the carry, and the XOR gates (x) output the sum bits s0 to s3 (right)
A LUT can be partially filled. It may contain only four useful bits out of the 64
available to represent a Boolean function with two variables. But it is better not to
waste this resource.
By continuing to extend the 2-bit adder, one can build an adder of any size.
A 16-bit adder links four CLBs of a single column, using 16 LUTs in series, each
LUT containing the two tables in Fig. 1.4 (notice that the adder uses only one of the
two available slices in each CLB).
The left part of Fig. 1.7 shows how the two 16-bit words to be added are distributed by nibbles in the four CLBs (for example, the A3 = a15a14a13a12 nibble inputs the highest CLB in the figure). The first CLB in the chain (the lowest in the figure) receives an input carry ci = 0.
The right part of the figure shows the addition of the first nibbles (a3a2a1a0 + b3b2b1b0) in the LUTs (each LUT is split into two half LUTs, the first containing the generate function and the second containing the propagate function). The multiplexers (boxes labeled m) propagate the carry from the input ci to the output co. The XOR gates (boxes labeled x) provide the modulo 2 sum bits, which may be stored in the flip-flops (rightmost boxes, labeled s0 through s3).
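The ripple-carry structure generalizes to any width; as a last C sketch (mine, not the book's), a 16-bit adder iterates the full adder over the bits, just as the hardware chains the LUTs:
void adder_16bit(unsigned short a, unsigned short b,
                 unsigned int ci, unsigned int *co,
                 unsigned short *s) {
  unsigned int sum_bit, carry = ci;
  *s = 0;
  for (int i = 0; i < 16; i++) { /* one full adder per bit */
    full_adder((a >> i) & 1, (b >> i) & 1, carry, &sum_bit, &carry);
    *s |= (unsigned short)(sum_bit << i);
  }
  *co = carry; /* carry out of the last CLB */
}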
1.5 Programming an FPGA
For example, the C function shown in Listing 1.2 builds a 32-bit adder.
Listing 1.2 A function defining a 32-bit adder
void adder_ip(unsigned int a,
              unsigned int b,
              unsigned int *c) {
  *c = a + b;
}
References
1. https://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf
2. Amano, H.: Principles and Structures of FPGAs. Springer (2018)
2 Setting up and Using the Vitis_HLS, Vivado, and Vitis IDE Tools
Abstract
This chapter gives you the basic instructions to set up the Xilinx tools to implement some circuit on an FPGA and to test it on a development board. It is presented as a lab that you should carry out. The aim is to learn how to use the Vitis/Vivado tools to design, implement, and run an IP.
You should first order your development board on which you will later upload your RISC-V processor design.
Any development board with an FPGA and a USB connection can fit.
I use a Pynq-Z1 board from Digilent equipped with a Xilinx Zynq XC7Z020 FPGA [1]. The FPGA is large enough to host the RV32I ISA subset. An equivalent Pynq-Z2 board (from TUL [2]) with the same FPGA would also be fine and very close to my Pynq-Z1.
The Basys3 from Digilent [3] has a Xilinx Artix-7 XC7A35T FPGA. It is also suited to the book's goals.
Older boards like the Zybo (Zynq XC7Z020 FPGA), the Zedboard (Zynq XC7Z020
FPGA), or the Nexys4 (Artix-7 XC7A100T FPGA) are also suitable.
More expensive boards like the ZCU 102/104/106 are very large. They can host
more ambitious IPs than the ones proposed in the book (for example, more than eight
cores or harts in a multicore or multihart processor).
More generally, any board embedding an FPGA with at least 10K LUTs is (more
than) large enough to host an RV32I RISC-V core (the more the LUTs, the larger the
FPGA). However, to implement the multicore designs presented in the second part
of the book, a larger FPGA with at least 30K LUTs is needed.
In the Xilinx Zynq-7000 FPGA family, the XC7Z010 FPGA has 18K LUTs. The
XC7Z020 FPGA has 53K LUTs.
In the Xilinx Artix-7 FPGA family, the XC7A35T has 33K LUTs. The XC7A100T
has 101K LUTs.
In the Xilinx UltraScale+ FPGA family, the XCZU7EV has 230K LUTs (ZCU104/106 boards). The XCZU9EG has 274K LUTs (ZCU102 board).
I have tested two types of development boards: the ones embedding an Artix-7
series FPGA (e.g. Nexys4 and Basys3) and the ones embedding a Zynq-7000 series
FPGA (Pynq-Z1, Pynq-Z2, Zybo, and Zedboard).
The difference comes from the way they are interfaced.
The Artix-7 based boards give access to the programmable part of the FPGA through a microblaze processor (the microblaze is a Xilinx softcore processor featuring a MIPS-like CPU).
The Zynq based boards are interfaced through a Zynq7 Processing System IP,
placed between an embedded Cortex-A9 ARM processor and the programmable
part of the FPGA (you can find a very good description of what is inside a Zynq and
what you can do with it, for example on a Zedboard, in the Zynq Book [4]).
It does not make much difference in the programming because both processors,
microblaze or ARM, run programs built from C codes. But there are differences in
the way the IP you will develop should be connected to the interface system IP, either
microblaze or Zynq.
If you are a university professor, you may ask for a free board (you will receive a Pynq-Z2 board) through the XUP Xilinx University Program [5].
Otherwise, the Pynq-Z2 board costs around 200€ (210$ or 170£; these are the 2022 Q2 prices) (the Pynq-Z1 board is no longer sold). The Basys3 has a smaller FPGA but still large enough to host all the RV32I based RISC-V processors presented in this book. It is sold at 130€ (140$ or 110£).
The development board is the only element you will have to purchase (and this book; but you already have it). Everything else is free.
If you apply to the XUP, it will probably take a few weeks before you receive your board at home. This is why I started with this step. But meanwhile, you have plenty of duties you can carry out (all the sections in this chapter except Sect. 2.7).
(If you already know how to use Vitis/Vivado, i.e. Vitis_HLS, Vivado, and Vitis IDE,
and if you have already installed Vitis on your computer, you can jump to Chap. 3 to
install the RISC-V tools.)
First of all, the following is an important notice concerning the compatibility of
the different softwares with the Operating System.
In this book, I assume Linux, Ubuntu (any version from the 16.04 should be
compatible with Vitis; I use Ubuntu 22.04 LTS ‘Jammy Jellyfish’ in this book). I
also assume Vitis 2022.1 or a later version (if you have an older version, some of my
HLS codes might not be synthesizable, but they surely can be simulated within the
Vitis HLS tool).
If you are using Windows, you will have to find software installation procedures
with your preferred browser.
For the RISC-V simulator, the standard tool spike is not available for Windows.
As far as I know, people run spike within a Linux virtual machine inside Windows.
Maybe you should consider this option for the RISC-V simulations (and learn a bit
of Linux from the commands used in this book).
If you are using MacOS X, you have to install the Xilinx tools through a Linux
virtual machine.
If you are using another Linux distribution (e.g. Debian), the given explanations
should more or less work.
The Xilinx Vitis software is available for Windows and Linux. If you use Windows,
you should find a few differences on how to start the Xilinx tools. Once inside the
Vitis software, there is no difference between Linux and Windows.
The Vitis software [6] is freely downloadable from the Xilinx site at the URL
shown in Listing 2.1 (you will have to register at Xilinx to download).
Listing 2.1 Xilinx URL from where to download the Vitis software
https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/
vitis.html
On Ubuntu 20.04 and 22.04, before installing Vitis, you must install the libtinfo.so.5 library ("sudo apt-get install libtinfo5"); otherwise, the installer hangs, as mentioned at https://support.xilinx.com/s/article/76616?language=en_US.
I assume you will install the software in the /opt/Xilinx folder.
If you are working on a Linux computer, you downloaded a file named Xilinx_Unified_2022.1_0420_0327_Lin64.bin (your version name might differ from mine, especially if you get a later one). You must set it to executable and run it with the commands in Listing 2.2 (in sudo mode to install in the /opt/Xilinx folder) (the commands are available in the install_vitis.txt file in the goossens-book-ip-projects/2022.1/chapter_2 folder).
Listing 2.2 Run the Vitis installation software
$ cd $HOME/Downloads
$ chmod u+x Xilinx_Unified_2022.1_0420_0327_Lin64.bin
$ sudo ./Xilinx_Unified_2022.1_0420_0327_Lin64.bin
...
$
Within the installer, you should choose to install Vitis (first choice in the Select
Product to Install page).
The installation is a rather long process (i.e. several hours, but mainly depending on
the speed of your internet connection; on my computer with a Gb speed connection,
it took 12h45 to download 65.77 GB, i.e. 1.43 MB per second; the installation itself
took 30 min).
The installation requires a lot of disk space (272 GB), which you can reduce a
bit by deselecting some design tools (you really need Vitis, Vivado, and Vitis HLS)
and some devices (for the PynqZ1/Z2 boards you need the SoCs devices and for the
Basys3 board, you need the 7 Series ones).
Since April 2022, git has been upgraded to face a security vulnerability. If you have not freshly installed or recently upgraded your version of git, you should do it (your version should be 2.35.2 or later; check with "git --version"). Run the commands in Listing 2.3 (they are available in the install_git.txt file in the chapter_2 folder).
Listing 2.3 upgrading git
$ sudo add-apt-repository -y ppa:git-core/ppa
$ sudo apt-get update
$ sudo apt-get install git -y
$
From the github sites given in Listings 2.4–2.6, download one of these three files: pynq-z1.zip, pynq-z2.zip, master.zip (the first one if you have a Pynq-Z1 board, the second one if you have a Pynq-Z2 board, the third one if you have a Basys3 board; the master.zip file from the Digilent site also contains the definitions of many other boards among which the Nexys4, Zybo, and Zedboard; if you do not find the zip files, search with your preferred browser for "pynq-z1 pynq-z2 basys3 board file").
Listing 2.4 github from where to download the pynq-z1.zip file
https://github.com/cathalmccabe/pynq-z1_board_files
Extract the zip and place the extracted folder (the main folder and its sub-folders)
into your Vitis installation in the /opt/Xilinx/Vivado/2022.1/data/boards/board_files
directory (create the missing board_files folder).
To install the book resources, run the commands in Listing 2.7 (available in install_book_resources.txt in the chapter_2 folder; as you have not cloned the resources yet, you should get the file from the github repository).
All the source files and shell command files related to the my_adder_ip can be found
in the chapter_2 folder.
The Vitis software is a huge piece of code, with many usages among which you
will use just a tiny part. To quickly focus on what is useful for you, you will design
a small but complete IP example.
The example IP is the adder presented in the introduction, which receives two
32-bit integers and outputs their sum modulo 232 .
You will write two C pieces of code, one which describes the component and the
other to use it.
The C code can be input through any text editing tool, but I recommend using the Vitis_HLS tool Graphical User Interface (GUI).
Before you build your project, make sure that the build-essential package is
installed. To check, just try to install it again by running the command in Listing 2.8
(this command is in the install_build-essential.txt file in the chapter_2 folder).
Listing 2.8 Install the build-essential package
$ sudo apt-get install build-essential
...
$
To start the GUI environment, type in a terminal the commands shown in Listing
2.9 (these commands are in the start_vitis_hls.txt file in the chapter_2 folder).
Listing 2.9 Set the required environment and start Vitis_HLS
$ cd /opt/Xilinx/Vitis_HLS/2022.1
$ source settings64.sh
$ cd $HOME/goossens-book-ip-projects/2022.1
$ vitis_hls &
...
$
On the Debian distribution of Linux, you may have to update the LD_LIBRARY_PATH environment variable before launching Vitis_HLS (export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH).
Once in the Vitis_HLS tool (see Fig. 2.1), you are in the Vitis_HLS Welcome Page
where you click on Project/Create Project.
This opens a dialog box (New Vitis_HLS Project/Project Configuration, see
Fig. 2.2) which you fill with the name of your project (my_adder_ip; I systematically
name my Vitis_HLS project with the _ip suffix). You click on the Next button.
The next dialog box opens (New Vitis_HLS Project/Add/Remove Design Files,
see Fig. 2.3), where you name the top function (e.g. my_adder_ip; I systematically
give the same name to the Vitis_HLS project and its top function). This name will
be given to the built IP. You can leave the Design Files frame empty (you will add
design files later). Click on Next.
In the next dialog box (New Vitis_HLS Project/Add/Remove Testbench Files,
see Fig. 2.4), you can leave the TestBench Files frame empty (you will add testbench
files later). Click on Next.
In the Solution Configuration dialog box (see Fig. 2.5), you have to select which development board you are targeting. In the Part Selection frame, click on the "..." box.
In the Device Selection dialog box (see Fig. 2.6), click on the Boards button.
Fig. 2.3 The Add/Remove Design Files dialog box to name the top function
Fig. 2.5 The Solution Configuration dialog box to select your development board
Fig. 2.6 The Device Selection dialog box to select your development board
In the Search frame (see Fig. 2.7), type z1 or z2 or basys3 (according to your board; even if you are still waiting for it, you can proceed). Select your board (in my case, Pynq-Z1, xc7z020clg400-1) and click on OK (if you do not see your board in the proposed selection, it means you have not installed the board files properly: go back to Sect. 2.3).
You are back to the Solution Configuration box. Click on Finish to create your
Vitis_HLS project skeleton (see Fig. 2.8).
2.5.2 Creating an IP
The next step after preparing the project is to fill it with some content. You will
add two pieces, one which will represent your adder IP and a second devoted to its
verification through simulation.
After the project creation, the Vitis_HLS tool opens the Vitis_HLS 2022.1 -
my_adder_ip eclipse-based window (see Fig. 2.9).
I will not describe the full possibilities. What I will describe now only concerns
the editing of the source and testbench files.
The IP should be designed through a void top function. This top function depicts a component.
A component has a pinout and this pinout is the only way to interact with the component. There is no backdoor. You should be able neither to observe nor to modify the inside of the component, which means that no observer nor modifier function should be provided.
A component may be combinational or sequential. In the former case, the outputs
are a combination of the inputs. In the latter case, the outputs are a combination of
the inputs and an internal state which evolves. The internal state is memorized. A
sequential component is clocked and at the start of each clock cycle, the internal
state is updated.
The internal state is initialized through input pins. It should never be directly
manipulated from outside the IP, neither to set it nor to observe its values.
For example, a processor is a sequential component. Its internal state includes the
register file. The register file is not visible from outside the processor. The processor
definition may include an initialization phase (i.e. a reset phase) to clear the register
file and an ending phase to dump it to memory (i.e. a halt phase). But the external
world should have no access at all to the register file.
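As a minimal sketch of this principle (my own illustration, not one of the book's IPs; the counter_ip name is hypothetical), a sequential component can be modeled by a top function whose internal state lives in a static variable, initialized through an input pin and never directly accessible from outside:
void counter_ip(unsigned int reset, unsigned int *out) {
  static unsigned int state = 0; /* internal state, invisible from outside */
  if (reset)
    state = 0;         /* initialization through an input pin */
  else
    state = state + 1; /* the state evolves at each activation */
  *out = state;        /* the output combines inputs and internal state */
}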
The input arguments are values and the output argument is a pointer.
When your component is simulated, you will use a main function to call the my_adder_ip function. This call has a hardware signification: apply inputs to the component and give it the time to produce its output and save it into the unsigned int c location.
In the main function, after the call to my_adder_ip, you will print the *c value to mimic the electronic observation of the c pins.
In the FPGA, the a, b, and c arguments will be implemented as memory points connected to the my_adder_ip chip inside the programmable part.
Within the Vitis_HLS GUI, right-click on the Source button in the Explorer
frame and choose New Source File... (see Fig. 2.10).
In the navigation window, navigate to the my_adder_ip folder, open it, name your
new file my_adder_ip.cpp and click on Save (see Fig. 2.11).
The new file is added to the sources as shown in Fig. 2.12.
The my_adder_ip.cpp tab in the central frame lets you edit your top function. Copy/paste the my_adder_no_pragma_ip.cpp file from the chapter_2 folder.
Listing 2.11 The my_adder_ip top-function code
void my_adder_ip(unsigned int a,
                 unsigned int b,
                 unsigned int *c) {
  *c = a + b;
}
2.5.3 Simulating an IP
Before you implement the IP on the FPGA, it is better to take the time to test it. You will see that placing and routing a design on an FPGA may be a rather long process (from a few seconds to a few hours, according to the design complexity and your computer efficiency). To avoid repeated waits, it is a good practice to debug your HLS code with the Vitis_HLS tool simulation facility. For such a simulation, you
Fig. 2.11 Adding a new my_adder_ip.cpp source file in the my_adder_ip folder
need to provide a main function within a testbench file. The role of this function is
to create the component, run it and observe its behaviour.
Right-click on the Test Bench button in the Explorer frame. Select New Test
Bench File... (see Fig. 2.14).
In the navigation window, name your new testbench file
testbench_my_adder_ip.cpp and click on Save (see Fig. 2.15).
The new testbench file is added as shown in Fig. 2.16.
Click on the testbench_my_adder_ip.cpp tab in the central frame and fill it with
the code in Listing 2.12 (copy/paste the testbench_my_adder_ip.cpp file from the
chapter_2 folder).
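A minimal testbench consistent with the run output quoted below can be sketched as follows (my reconstruction, assuming only the my_adder_ip prototype, not the book's exact listing):
#include <stdio.h>
void my_adder_ip(unsigned int a, unsigned int b, unsigned int *c);
int main() {
  unsigned int a = 10000, b = 20000, c;
  my_adder_ip(a, b, &c);              /* apply inputs, let the IP compute */
  printf("%u + %u is %u\n", a, b, c); /* observe the c pins */
  return 0;
}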
The result of the run (i.e., the print of "10,000 + 20,000 is 30,000") appears in the middle of the output shown in the my_adder_ip_csim.log tab.
On Linux/Ubuntu, if the simulation complains about missing files, it means you have to install some missing libraries. For example, if features.h is the requested missing file, install the g++-multilib library (sudo apt-get install g++-multilib). To know what to install, search the web for the missing file message.
2.5.4 Synthesizing an IP
To transform your C code into a synthesizable IP, you need to add some indications to
the synthesizer. For example, you must specify how the pinout will be mapped on the
FPGA. These indications are given through pragmas. The Vitis_HLS environment
provides many HLS pragmas I will progressively use. For the moment, I will only
use the HLS INTERFACE pragma for the pinout.
Update the my_adder_ip.cpp file with the pragmas shown in Listing 2.13 (copy/paste the my_adder_ip.cpp file in the chapter_2 folder): one HLS INTERFACE pragma for each argument of the my_adder_ip top function and one more, named return, for the control of the IP start and stop.
Listing 2.13 The HLS INTERFACE pragma
void my_adder_ip(unsigned int a,
                 unsigned int b,
                 unsigned int *c) {
#pragma HLS INTERFACE s_axilite port=a
#pragma HLS INTERFACE s_axilite port=b
#pragma HLS INTERFACE s_axilite port=c
#pragma HLS INTERFACE s_axilite port=return
  *c = a + b;
}
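With s_axilite, each argument (and the start/stop control added by the return port) is mapped to a memory-mapped register of an AXI4-Lite slave interface; this is how the host processor will later feed a and b and read c.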
In parallel, a and b are read on the input pins (b_read(read) and a_read(read)), then the addition is done (labeled add_ln8(+), referring to line 8 in the my_adder_ip.cpp file) and last, the result is copied to the c output pin (c_write_ln8(write)). The duration is one FPGA cycle (cycle 0), i.e. 10 ns. The schedule shows that the result is computed before the end of the cycle.
If you click on the add_ln8(+) line, you will highlight blue arrows showing the
dependencies between the inputs, the computation and the output (see Fig. 2.26).
If you click on the Properties tab in the lowest frame, you display properties of
the output signals involved in the addition (see Fig. 2.27): bit width of the sum (32
bits), delay (2.55 ns).
If you right-click on the add_ln8(+) line and select the proposed Goto Source (see Fig. 2.28), you display the source code in the C Source tab in the lowest frame. This is the code which the add_ln8(+) comes from (see Fig. 2.29). The involved line is highlighted with a blue background.
After you have successfully synthesized your adder you can try to cosimulate it.
Cosimulation runs two simulations. One simulation is the former C simulation
we have already done, which produces the expected result. The other simulation is
a logical simulation from the HDL description of the circuit and a simplified library
model of the FPGA. Its output is compared to the C simulation one. If they match,
it means that the signal propagation in the FPGA produced the same final value as
the one computed by the C simulation.
To run the cosimulation, in the Flow Navigator frame, click on C/RTL COSIMULATION/Run Cosimulation (see Fig. 2.30).
In the dialog box (see Fig. 2.31), just click on the bottom OK button.
The Cosimulation Report for your my_adder_ip should show Pass in the Status
entry of the displayed General Information (see Fig. 2.32).
The IP can be exported to the library of available components, usable in the Vivado tool.
To export the RTL file built by the synthesis, click on IMPLEMENTATION/Export RTL (see Fig. 2.33).
In the dialog box click on OK (see Fig. 2.34).
Click on the Console tab. The Console frame (see Fig. 2.35) should report that
the export has finished (which means successfully).
The last step before you test your design on the board FPGA is to run the exported
RTL. Click on IMPLEMENTATION/Run Implementation (see Fig. 2.36).
In the Run Implementation dialog box, change the default selection by checking
the RTL Synthesis, Place & Route button and click on OK (see Fig. 2.37).
The report shows the final resource usage and IP timing as shown in Fig. 2.38
(these are still estimations; they may differ from the real implementation done in
Vivado; moreover, you may have different values if you are working on a different
model of FPGA or a different version of Vitis_HLS). The design uses 128 LUTs and
150 FFs (Flip-Flops). No RAM block is used (BRAM).
2.6 Creating a Design with Vivado
(If you already know Vivado and either Vitis IDE or Vivado SDK, you can jump to
Chap. 3 to install the RISC-V tools.)
To program your FPGA, you need to generate a bitstream file from your exported
RTL. This should be done in another tool named Vivado.
To start the Vivado GUI, type the commands shown in Listing 2.14 in a terminal
(they can be found in the start_vivado.txt file in the chapter_2 folder).
Listing 2.14 Start the Vivado GUI tool
$ cd /opt/Xilinx/Vitis_HLS/2022.1
$ source settings64.sh
$ cd $HOME/goossens-book-ip-projects/2022.1
$ vivado &
...
$
A new window opens (see Fig. 2.39), in which you select Quick Start/Create
Project.
Fig. 2.29 The C source file which the add_ln8(+) comes from
Fig. 2.40 The Vivado Project Name page (set name and place)
The following (until further notice) applies to the Zynq-based boards (e.g. Pynq-Z1,
Pynq-Z2, Zedboard, Zybo). For the Artix-7 based boards (e.g. Basys3 or Nexys4),
jump to page 57.
On the right, you can see a Diagram frame, with a “+” button (see the position
of the pointer in Fig. 2.47). Click on it.
Scroll all the way down the proposed list and select ZYNQ7 Processing System
(see Fig. 2.48).
A ZYNQ component is placed in the center of the Diagram frame (see Fig. 2.49).
This component will match an equivalent feature in the Zynq-based development
board FPGA. This component is used to interface your adder IP with the embedded
ARM processor (the ARM processor itself interfaces the ZYNQ7 Processing System
IP with the host computer).
A green line on top of the Diagram frame proposes to Run Block Automation.
This is to connect your component to the board environment. Click on it. In the
dialog box (see Fig. 2.50), keep the settings unchanged and click on OK.
Fig. 2.50 Connecting the ZYNQ7 Processing System IP to the development board environment
The diagram frame shows the Zynq7 Processing System IP connected to some
output pads (see Fig. 2.51).
The following applies (until further notice) to the Artix-7 based boards (e.g.
Basys3 or Nexys4). For Zynq based boards, jump to the next paragraph.
For the Artix-7 based boards, add the MicroBlaze IP instead of the Zynq7 Process-
ing System IP. After Run Block Automation, in Options/Debug Module, select
Debug & UART. After Run Connection Automation, select the diff_clk_rtl pad in
the diagram and delete it. In the BLOCK DESIGN frame, Board tab, double-click
on System Clock to have it connected in your design. You can then continue as for
the other boards.
The following applies (until further notice) to all the boards.
In the upper line menu, select Tools/Settings (see Fig. 2.52).
In the Project Settings frame (see Fig. 2.53), expand the IP entry.
Click on Repository (see Fig. 2.54).
In the right frame, click on the “+” button (see Figs. 2.54 and 2.55).
Select the proposed folder, i.e. the one in which you saved your Vitis_HLS
my_adder_ip project (see Fig. 2.56).
Click on the Select button. A message box informs you that an IP repository has
been added to the set of IPs available to the Vivado project (see Fig. 2.57).
Click on OK and on OK again in the Settings page.
In the Diagram frame, click on the “+” button (see Fig. 2.58).
Scroll down and select My_adder_ip (this is your adder; see Fig. 2.59).
The Diagram frame shows the added My_adder_ip (see Fig. 2.60).
Fig. 2.56 Add the my_adder_ip Vitis_HLS project repository to the IP library
Fig. 2.61 The complete block design after clicking on the Regenerate Layout button
The bitstream generation is a multi-step process which successively runs the
synthesis, the implementation (i.e. place and route), and the generation of the
bitstream itself.
The No Implementation Results Available warning box pops up (see Fig. 2.67).
Click on Yes.
The Launch Runs window pops up (see Fig. 2.68; the number of jobs may differ:
it is related to the number of cores of your computer). Click on OK.
You can follow the progression in the upper right corner (see Fig. 2.69).
Once the generation is done, a dialog box informs you about the success and
proposes you a set of possible continuations. Select View Reports (see Fig. 2.70).
In the bottom frame, Reports tab, scroll down to select the Implementa-
tion/impl_1/Place Design/Utilization - Place Design line (see Fig. 2.71).
A tab opens in the upper right frame. Scroll down to 1. Slice Logic. The IP uses
503 LUTs and 443 FFs (see Fig. 2.72; your values may differ if you are using a
different model of FPGA or if you are using a different version of Vivado). These
are the real values of the implementation cost on the FPGA. They are different from
the ones given in the Vitis_HLS tool.
In the upper line menu, select File/Export/Export Hardware (see Fig. 2.73). In
the pop up window, click on Next.
In the next window, check the Include bitstream option (the bitstream should be
included) and click on Next (see Fig. 2.74).
In the next window, click on Next. In the last window, click on Finish (see
Fig. 2.75).
Second, you should configure the power source of your board as coming from the
USB link. A jumper should be configured to connect two pins labeled USB. On the
Pynq-Z1, the jumper is JP5. It is on the right of the power switch. On the Pynq-Z2,
the jumper is J9. It is on the right of the SW1 switch. On the Basys3, the jumper is
JP2. It is on the left of the power button. On the Zybo-Z7-20, the jumper is J16. It is
on the right of the power switch.
Then, you can plug your board into your computer through a powering USB port. On
the board, you should find a micro-USB connector labeled “PROG”. On the Pynq-
Z1, the connector is just above the RJ45 ETHERNET connector. On the Pynq-Z2, it
is under the RJ45 connector. On the Basys3, it is on the right of the power switch.
On the Zybo-Z7-20 board, it is under the power switch.
Switch the power to the on position. A red LED should light up (otherwise, use
another USB port until you find a powering one).
You must install the USB cable driver. Run the commands shown in Listing 2.15
(they are in the install_cable_driver.txt file) with the board plugged to the computer
and switched on (red LED on).
Listing 2.15 Installing the USB cable driver
$ cd /opt/Xilinx/Vitis/2022.1/data/xicom
$ cd cable_drivers/lin64/install_script/install_drivers
$ sudo ./install_drivers
$
Then, switch the board off, unplug it, replug it and switch it on again.
The following applies (until further notice) to Zynq based boards, not to Artix-7
ones (if you have an Artix-7 based board, e.g. Basys3 or Nexys4, jump to page 74).
On your computer, in a terminal, run the putty serial terminal emulator as shown
in Listing 2.16 (in sudo mode; you may have to install putty first from the Linux
repositories: "sudo apt-get install putty").
Listing 2.16 Communicating with the board (1)
$ sudo putty
Fig. 2.77 Setting the USB serial line and communication speed
A new empty terminal should open, labeled /dev/ttyUSB1 - PuTTY (see Fig. 2.78).
This is where the board will print its messages while running.
The following applies (until further notice) to all the boards.
In a terminal, run the commands shown in Listing 2.17 (they are in the start_vitis_
ide.txt file).
Listing 2.17 Starting Vitis IDE
$ cd /opt/Xilinx/Vitis/2022.1
$ source settings64.sh
$ vitis &
...
$
Fig. 2.81 Selecting the Create a new platform from hardware (XSA) tab
Browse to find your bitstream file (XSA file in the my_adder_ip folder) and click
on Open (see Fig. 2.82).
In the New Application Project/Platform page, leave the Platform name box
unchanged (design_1_wrapper). Click on Next.
In the New Application Project/Application Project Details page, fill the
Application project name (e.g. z1_00; I name my application projects with the
development board name prefix, e.g. z1_, followed by an increasing counter starting
from 00 and incrementing for each new version). Click on Next (see Fig. 2.83).
In the Domain page, click on Next (see Fig. 2.84).
In the Templates page, select the Hello World template and click on Finish (see
Fig. 2.85).
A new window labeled workspace_my_adder_ip - z1_00/src/helloworld.c -
Vitis IDE opens. In the left panel, Explorer tab, expand src and open file helloworld.c
(see Fig. 2.86).
This is a basic driver to be downloaded on the FPGA and run (see Fig. 2.87). It
initializes the platform (init_platform call) and prints the Hello World message in
the /dev/ttyUSB1 - PuTTY communication window (with an Artix-7 based board
(e.g. Basys3), the message prints in the Vitis IDE window, Console frame).
Later, we will update this program to make it print our adder result.
Click on the Launch Target Connection button (see Fig. 2.88; the button is on the
top part of the window, under the main menu).
In the Target Connections dialog box, expand the Hardware Server entry. Double-
click on the Local [default] entry (see Fig. 2.89).
In the Target Connection Details window, click on OK (see Fig. 2.90). Close the
Target Connections dialog box.
In the left panel, Explorer tab, right click on your system folder name. Scroll
down and select Run As/1 Launch Hardware (see Fig. 2.93).
On the board, a green LED lights up. In the /dev/ttyUSB1 - PuTTY communication
window (or in the Vitis IDE window, Console frame on an Artix-7 based board like
the Basys3), you should read Hello World and on the next line Successfully ran Hello
World application (see Fig. 2.94).
You will now update your helloworld application to make it print the adder result,
as shown in Listing 2.18 (you can copy/paste the helloworld.c file in the chapter_2
folder).
Listing 2.18 The updated helloworld.c file
#include <stdio.h>
#include "xmy_adder_ip.h"
#include "xparameters.h"
XMy_adder_ip_Config *cfg_ptr;
XMy_adder_ip         ip;
int main() {
  cfg_ptr = XMy_adder_ip_LookupConfig(XPAR_XMY_ADDER_IP_0_DEVICE_ID);
  XMy_adder_ip_CfgInitialize(&ip, cfg_ptr);
  XMy_adder_ip_Set_a(&ip, 10000);
  XMy_adder_ip_Set_b(&ip, 20000);
  XMy_adder_ip_Start(&ip);
  while (!XMy_adder_ip_IsDone(&ip));
  printf("%d + %d is %d\n",
         (int)XMy_adder_ip_Get_a(&ip),
         (int)XMy_adder_ip_Get_b(&ip),
         (int)XMy_adder_ip_Get_c(&ip));
  return 0;
}
File xmy_adder_ip.h contains the driver interface, i.e. a set of function prototypes
to drive the IP on the FPGA. This file is automatically built by the Vitis_HLS tool
and exported to the Vitis IDE tool.
There are functions to create an IP, to set its inputs, to start its run, to wait for its
end and to get its outputs. All the functions are prefixed by X (for Xilinx I guess) and
the IP name (e.g. My_adder_ip_).
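As an indication (a sketch: the exact prototypes are generated by Vitis_HLS and may differ in detail), the xmy_adder_ip.h interface follows this pattern:

#include "xil_types.h"   /* u16, u32 */

XMy_adder_ip_Config *XMy_adder_ip_LookupConfig(u16 DeviceId);
int  XMy_adder_ip_CfgInitialize(XMy_adder_ip *InstancePtr,
                                XMy_adder_ip_Config *ConfigPtr);
void XMy_adder_ip_Set_a(XMy_adder_ip *InstancePtr, u32 Data);
void XMy_adder_ip_Set_b(XMy_adder_ip *InstancePtr, u32 Data);
u32  XMy_adder_ip_Get_a(XMy_adder_ip *InstancePtr);
u32  XMy_adder_ip_Get_b(XMy_adder_ip *InstancePtr);
u32  XMy_adder_ip_Get_c(XMy_adder_ip *InstancePtr);
void XMy_adder_ip_Start(XMy_adder_ip *InstancePtr);
u32  XMy_adder_ip_IsDone(XMy_adder_ip *InstancePtr);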
Functions XMy_adder_ip_LookupConfig and XMy_adder_ip_CfgInitialize serve
to allocate and initialize the adder IP.
Functions XMy_adder_ip_Set_a and XMy_adder_ip_Set_b serve to send input
values a and b from the Zynq component to the adder IP component through the
axilite connection.
Function XMy_adder_ip_Start starts the adder IP.
Function XMy_adder_ip_IsDone is used to wait until the adder IP has finished
its job.
Function XMy_adder_ip_Get_c reads the final c value in the adder IP (transmis-
sion to the Zynq component through the axilite connection).
If you use an Artix-7 based board like the Basys3, you should replace the printf
function by xil_printf (the MicroBlaze environment provides only a lightweight
version of printf) and include the xil_printf.h file. The output of the run is not
displayed in the putty window but in the Vitis IDE one, in the Console frame.
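A minimal sketch of the changed lines (the driver calls are unchanged; only the header and the print function differ):

#include "xil_printf.h"
...
xil_printf("%d + %d is %d\n",
           (int)XMy_adder_ip_Get_a(&ip),
           (int)XMy_adder_ip_Get_b(&ip),
           (int)XMy_adder_ip_Get_c(&ip));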
Once the updated helloworld.c code has been saved, in the Explorer tab, right-
click on the z1_00_system entry and select Build Project (as you already did to
compile the former helloworld.c code). When the console informs you that Build
Finished, right-click again on the z1_00_system entry and select Run As/1 Launch
Hardware. You should see "10000 + 20000 is 30000" printed on the /dev/ttyUSB1
- PuTTY window.
3 Installing and Using the RISC-V Tools
Abstract
This chapter gives you the basic instructions to set up the RISC-V tools, i.e. the
RISC-V toolchain and the RISC-V simulator/debugger. The toolchain includes a
cross-compiler to produce RISC-V RV32I machine code. The spike simulator/de-
bugger is useful to run RISC-V codes with no RISC-V hardware. The result of a
spike simulation is to be compared to the result of a run on an FPGA implemen-
tation of a RISC-V processor IP.
Once you have implemented a RISC-V processor IP, you will need to compare its
behaviour to the official specification. If you do not want to buy RISC-V hardware,
simulation is a good option.
For the RISC-V ISA, there are many simulators available. What you need is a
tool to run your RISC-V code on your x86 machine. Soon enough, you will also
need a way to trace your execution (list the RISC-V instructions run), run step by
step and visualize the registers and the memory. This tool is more a debugger than
a simulator.
Whichever simulator you choose, you will first need to be able to produce RISC-
V binaries on your host computer. Hence, a cross-compiler, and more generally
a toolchain (compiler, loader, assembler, elf dumper ...), is absolutely crucial. The
Gnu Project has developed the riscv-gnu-toolchain.
When the required commands have been installed and git has been upgraded (refer
back to 2.2), you can start the installation of the toolchain.
First, you must clone the riscv-gnu-toolchain git project on your computer. The
cloning creates a new folder in the current directory. Run the "git clone" command
in Listing 3.2.
Listing 3.2 Cloning the riscv-gnu-toolchain code
$ cd
$ git clone https://github.com/riscv/riscv-gnu-toolchain
...
$
Once this is done, you can build the RISC-V compiler. Configure and make, as
shown in Listing 3.3 (it takes more than one hour to build the toolchain; there is a
first long silent step which clones files from the git repository: 15 minutes on my
computer).
Listing 3.3 Building the riscv-gnu-toolchain
$ cd $HOME/riscv-gnu-toolchain
$ ./configure --prefix=/opt/riscv --enable-multilib --with-arch=rv32i
...
$ sudo make
...
$
To try your tool, you can compile the hello.c source code shown in Listing 3.4 (it
is in the chapter_3 folder).
Listing 3.4 A hello.c file
#include <stdio.h>
void main() {
  printf("hello world\n");
}
Close your session and log in again. After login, the PATH variable includes the
added path. You can check it by printing the PATH variable ("echo $PATH").
To compile the hello.c file, use the riscv32-unknown-elf-gcc compiling command
in Listing 3.6 (the current directory should be the chapter_3 folder) (the compilation
command for hello is available in the compile_hello.txt file in the chapter_3 folder).
Listing 3.6 Compiling with riscv32-unknown-elf-gcc
$ riscv32-unknown-elf-gcc hello.c -o hello
$
The executable file is hello. To produce a dump file of your hello binary, use
the riscv32-unknown-elf-objdump command in Listing 3.7 (the > character is a
redirection of the output to the hello.dump file rather than the terminal) (the dump
command for hello is available in the compile_hello.txt file in the chapter_3 folder).
Listing 3.7 Dumping the executable file with objdump
$ riscv32-unknown-elf-objdump -d hello > hello.dump
$
The hello.dump file contains the readable version of the executable file (you can
edit it). This is a way to produce RISC-V code in hexadecimal, safer than hand
encoding.
To run your RISC-V hello, you need a simulator.
Then you can build the simulator. Run the commands in Listing 3.9 (it takes about
five minutes to build spike).
Listing 3.9 Building the spike simulator
...
$ make
...
$ sudo make install
...
$
With the -h option ("spike -h"), you can have a look at the available options. You
can run in debug mode (-d) to progress instruction by instruction and print the
registers ("reg 0" prints the content of all the registers of core 0; "reg 0 a0" prints
the content of register a0 of core 0).
srli t4,t1,0x1c
srai t5,t1,0x1c
ret
To compile the test_op_imm.s file, run the command in Listing 3.14 (the current
directory should be the chapter_3 folder).
Listing 3.14 Compiling test_op_imm.s with gcc
$ riscv32-unknown-elf-gcc test_op_imm.s -o test_op_imm
Run it step by step with the commands in Listing 3.15 (you must first localize
the main function starting address: use "riscv32-unknown-elf-objdump -t" and pipe
into grep ("riscv32-unknown-elf-objdump -t test_op_imm | grep main"); on my
machine, the address is 0x1018c).
Listing 3.15 Debug test_op_imm with spike
$ riscv32-unknown-elf-objdump -t test_op_imm | grep main
0001018c g     .text  00000000 main
$ spike -d /opt/riscv/riscv32-unknown-elf/bin/pk test_op_imm
: untiln pc 0 1018c
...
core   0: 0x00010120 (0x06c000ef) jal     pc + 0x6c
:
core   0: 0x0001018c (0x00500593) li      a1, 5
: reg 0 a1
0x00000005
:
core   0: 0x00010190 (0x00158613) addi    a2, a1, 1
: reg 0 a2
0x00000006
: q
$
The first command "untiln pc X Y" runs the program until the pc of core X reaches
address Y.
The 0x1018c address is the main entry point (this address may differ on your
machine if you use a different compiler version or a different pk version than mine:
try "riscv32-unknown-elf-gcc --version"; in the book, I use riscv32-unknown-elf-
gcc (GCC) 11.1.0).
The n after "until" means noisy: the code run is printed (use "until pc X Y" to run
silently). The run stops after the execution of the instruction at address 0x10120 and
before the run of the instruction at address 0x1018c.
When no command is entered, i.e. the enter key is typed, the current instruction
is run and printed. Hence, typing enter repeatedly runs the program step by step.
The "reg C R" command prints the content of register R in core C. The register
name can be symbolic. In the example, I asked to successively print registers a1 and
a2 (of core 0).
The q command is quit.
The spike simulator printings when you run the program step by step from main
until the return are shown in Listing 3.16 (the "reg 0" command prints all the registers;
in your execution some register values other than the ones written by test_op_imm
may differ).
Listing 3.16 Run test_op_imm until return from main and print the registers
$ spike -d /opt/riscv/riscv32-unknown-elf/bin/pk test_op_imm
: untiln pc 0 1018c
...
core   0: 0x00010120 (0x06c000ef) jal     pc + 0x6c
:
core   0: 0x0001018c (0x00500593) li      a1, 5
:
core   0: 0x00010190 (0x00158613) addi    a2, a1, 1
...
core   0: 0x000101bc (0x41c35f13) srai    t5, t1, 28
:
core   0: 0x000101c0 (0x00008067) ret
: reg 0
zero: 0x00000000  ra : 0x00010124  sp : 0x7ffffda0  gp : 0x00011db0
tp  : 0x00000000  t0 : 0x00000000  t1 : 0xb0000000  t2 : 0x00000001
s0  : 0x00000000  s1 : 0x00000000  a0 : 0x00000001  a1 : 0x00000005
a2  : 0x00000006  a3 : 0x00000004  a4 : 0x00000003  a5 : 0x00000007
a6  : 0x0000000b  a7 : 0x00000001  s2 : 0x00000000  s3 : 0x00000000
s4  : 0x00000000  s5 : 0x00000000  s6 : 0x00000000  s7 : 0x00000000
s8  : 0x00000000  s9 : 0x00000000  s10: 0x00000000  s11: 0x00000000
t3  : 0x00000000  t4 : 0x0000000b  t5 : 0xfffffffb  t6 : 0x00000000
: q
$
When you compile with gcc, the linker adds the _start code which calls your main
function. Moreover, it places the code at an address compatible with the OS mapping
(e.g. 0x1018c in the test_op_imm example).
In contrast, the codes you build for a processor IP implemented on an FPGA are
assumed to be running on bare metal (no OS), directly starting with the main function
and ending when it returns. The code is to be placed at address 0.
For example, assume you want to compile the test_op_imm.s RISC-V assembly
file shown in the preceding section to run it on your processor IP on the FPGA.
The gcc compiler can be requested to link the executable code at any given address
with the "-Ttext address" option as shown in Listing 3.17 (the current directory should
be the chapter_3 folder).
The warning message is not important.
Listing 3.17 Compile and base the main address at 0
$ riscv32-unknown-elf-gcc -nostartfiles -Ttext 0 test_op_imm.s -o test_op_imm_0.elf
/opt/riscv/lib/gcc/riscv32-unknown-elf/11.1.0/../../../../riscv32-unknown-elf/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000000000
$
Of course, such an executable file is not to be run with spike, only with your
processor IP on the FPGA. To be run with spike/pk, the executable file must be
obtained with a classic riscv32-unknown-elf-gcc compilation command ("riscv32-
unknown-elf-gcc test_op_imm.s -o test_op_imm").
You can check your standalone, based-at-0 test_op_imm_0.elf executable file
(in ELF format) by dumping it with riscv32-unknown-elf-objdump (the Linux cat
command, short for concatenate, copies its input files to the standard output; it is
used here to view a file). Run the commands in Listing 3.18.
Listing 3.18 Dump the based 0 executable file test_op_imm_0.elf
$ riscv32-unknown-elf-objdump -d test_op_imm_0.elf > test_op_imm_0.dump
$ cat test_op_imm_0.dump

Disassembly of section .text:
You can transform the ELF file test_op_imm_0.elf into a binary file test_op_imm_
0_text.bin (i.e. remove the ELF format around the binary code) with riscv32-
unknown-elf-objcopy and dump the binary file with od (octal dump; the leftmost
column lists the addresses in octal). To build the test_op_imm_0_text.bin file and
have a look at it, run the commands in Listing 3.19.
Listing 3.19 From ELF to binary
$ riscv32-unknown-elf-objcopy -O binary test_op_imm_0.elf test_op_imm_0_text.bin
$ od -t x4 test_op_imm_0_text.bin
0000000 00500593 00158613 00c67693 fff68713
0000020 00576793 00c7c813 00d83893 00b83293
0000040 01c81313 ff632393 7e633e13 01c35e93
0000060 41c35f13 00008067
0000070
$
This binary file can be translated into a hexadecimal file with hexdump. To build
the test_op_imm_0_text.hex file, run the commands in Listing 3.20.
Listing 3.20 From binary to hexadecimal
$ hexdump -v -e '"0x" /4 "%08x" ",\n"' test_op_imm_0_text.bin > test_op_imm_0_text.hex
$ cat test_op_imm_0_text.hex
0x00500593,
0x00158613,
0x00c67693,
0xfff68713,
0x00576793,
0x00c7c813,
0x00d83893,
0x00b83293,
0x01c81313,
0xff632393,
0x7e633e13,
0x01c35e93,
0x41c35f13,
0x00008067,
$
This hex file has the appropriate structure to become an array initializer in a C
program, listing the values to populate a code RAM array.
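For instance (a sketch: the array name and size are illustrative, not imposed by the book), the file can be included verbatim as the initializer of a code RAM array:

/* each line of the hex file is "0x...," so the file body forms a
   valid brace-enclosed initializer list */
unsigned int code_ram[1024] = {
#include "test_op_imm_0_text.hex"
};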
The spike simulator can be connected to the standard gdb debugger. The gdb de-
bugger has all the facilities you will need.
First, you should install gdb with the command in Listing 3.21.
Listing 3.21 Installation of gdb
$ sudo apt-get install build-essential gdb
...
$
The gdb debugger is mainly an interface between you and the machine running
some code you want to debug. The machine can be a real one (i.e. either the host
on which gdb is running or an external hardware like your development board) or a
simulator like spike.
To debug with gdb, a code must be compiled with the -g option.
If gdb is interfacing the host, it simply accesses the machine on which it is running.
If gdb is interfacing some external hardware, there should be some tool to access
the external hardware for gdb. OpenOCD (Open On-Chip Debugger) is a sort of
universal interface between a debugger like gdb and an external processor.
When gdb is interfacing the spike simulator, the same OpenOCD tool is used to
connect the debugger to the simulator.
As a result, you will run spike, OpenOCD, and gdb together. The gdb debugger
reads the code to be run from the executable file and sends a run request to OpenOCD
which forwards it to spike. The spike simulator runs the code it is requested to and
updates its simulated machine state. The gdb debugger can read this new state
through some new request sent to OpenOCD and forwarded to spike, like reading
the registers or the memory.
Second, after the gdb installation, you should install OpenOCD. Run the commands
in Listing 3.22.
Listing 3.22 Installation of OpenOCD
$ cd
$ git clone https://git.code.sf.net/p/openocd/code openocd-code
...
$ cd openocd-code
$ ./bootstrap
...
$ ./configure --enable-jtag_vpi --enable-remote-bitbang
...
$ make
...
$ sudo make install
...
$
The code to debug should be placed in a memory area compatible with the simulated
machine. As spike is run on the host machine, it defines a memory space compatible
with the hosting OS, in which the code to be simulated should be placed. The memory
address used by spike in Ubuntu for its simulated code is 0x10010000.
The spike.lds file (in the chapter_3 folder) is shown in Listing 3.23. It defines a
text section starting at address 0x10010000 and a data section immediately following
the text one.
The structure of linker description files is explained in 7.3.1.4.
Listing 3.23 The spike.lds file
$ cat spike.lds
OUTPUT_ARCH("riscv")
SECTIONS
{
  . = 0x10010000;
  .text : { *(.text) }
  .data : { *(.data) }
}
$
You can compile the test_op_imm.s file to build an executable which can be simu-
lated with spike (the -g option adds a table of symbols to the executable file which
is used by gdb). Run the command in Listing 3.24 (the current directory should be
the chapter_3 folder).
Listing 3.24 Using the spike.lds file to link for spike
$ riscv32-unknown-elf-gcc -g -nostartfiles -T spike.lds test_op_imm.s -o test_op_imm
$
The OpenOCD configuration file spike.cfg (in the chapter_3 folder) describes the
target to be debugged, as shown in Listing 3.25.
Listing 3.25 The spike.cfg file
$ cat spike.cfg
set _CHIPNAME riscv
jtag newtap $_CHIPNAME cpu -irlen 5 -expected-id 0x10e31913
set _TARGETNAME $_CHIPNAME.cpu
target create $_TARGETNAME riscv -chain-position $_TARGETNAME
gdb_report_data_abort enable
init
halt
$
To connect spike, OpenOCD, and gdb, you will need to open three terminals suc-
cessively.
In a first terminal with the chapter_3 folder as the current directory, start spike
with a code to be simulated placed at address 0x10010000 (the same address as the
one set in the spike.lds file), with a memory size of 0x10000. Run the command in
Listing 3.26.
Listing 3.26 Starting spike
$ spike --rbb-port=9824 -m0x0010010000:0x10000 test_op_imm
Listening for remote bitbang connection on port 9824.
warning: tohost and fromhost symbols not in ELF; can't communicate with target
To start the debugging session within gdb, you need to connect gdb to OpenOCD. In
the terminal where gdb is running, run the "target remote localhost:3333" command
in Listing 3.29.
Listing 3.29 Connect gdb to OpenOCD
(gdb) target remote localhost:3333
warning: No executable has been specified and target does not support
determining executable automatically. Try using the "file" command.
0x00000000 in ?? ()
(gdb)
Then, you must specify which executable file is to be debugged (answer "y" to the
question). Run the "file test_op_imm" command in Listing 3.30.
Listing 3.30 Debug the test_op_imm executable file
(gdb) file test_op_imm
A program is being debugged already.
Are you sure you want to change the file? (y or n) y
Reading symbols from test_op_imm...
(gdb)
Load the linker information (necessary to initialize pc). Run the "load"
command in Listing 3.31.
Listing 3.31 Load the linker information
(gdb) load
Loading section .text, size 0x38 lma 0x10010000
Start address 0x10010000, load size 56
Transfer rate: 448 bits in <1 sec, 56 bytes/write.
(gdb)
You can list the source code (command "l"; the "l" abbreviation stands for "list"; as
a general rule in gdb, you can use the shortest unambiguous prefix of any command;
for the "load" command above, you should at least type "lo"). Run the "l" command
in Listing 3.32.
Listing 3.32 List the instructions to run
(gdb) l
1       .globl main
2       main:
3       li a1,5
4       addi a2,a1,1
5       andi a3,a2,12
6       addi a4,a3,-1
7       ori a5,a4,5
8       xori a6,a5,12
9       sltiu a7,a6,13
10      sltiu t0,a6,11
(gdb)
You can visualize pc. Run the "p $pc" command ("p" for "print") in Listing 3.33.
Listing 3.33 Print pc
(gdb) p $pc
$1 = (void (*)()) 0x10010000 <main>
(gdb)
You can run one instruction. Run the "si" command in Listing 3.34 ("si" stands
for step instruction; the instruction run is the one preceding the printed one; line 3 is
run, i.e. "li a1,5"; the debugger is stopped on line 4).
Listing 3.34 Run one machine instruction
(gdb) si
[riscv.cpu] Found 4 triggers
4       addi a2,a1,1
(gdb)
If you type enter, the last command is repeated (in the example, "si" is repeated
and a new instruction is run). Repeat the "si" command, as shown in Listing 3.35.
Listing 3.35 Run another machine instruction
(gdb)
5       andi a3,a2,12
(gdb)
You can print registers a1 and a2 (command "info reg"). Run the "info reg"
commands in Listing 3.36.
Listing 3.36 Print registers
(gdb) info reg a1
a1             0x5      5
(gdb) info reg a2
a2             0x6      6
(gdb)
You can place a breakpoint on the last instruction. Run the "b 16" command in
Listing 3.37.
Listing 3.37 Place a breakpoint
(gdb) l
1       .globl main
2       main:
3       li a1,5
4       addi a2,a1,1
5       andi a3,a2,12
6       addi a4,a3,-1
7       ori a5,a4,5
8       xori a6,a5,12
9       sltiu a7,a6,13
10      sltiu t0,a6,11
(gdb)
11      slli t1,a6,0x1c
12      slti t2,t1,-10
13      sltiu t3,t1,2022
14      srli t4,t1,0x1c
15      srai t5,t1,0x1c
16      ret
(gdb) b 16
Breakpoint 1 at 0x10010034: file test_op_imm.s, line 16.
(gdb)
You can continue running up to the breakpoint. Run the "c" command in Listing
3.38.
Listing 3.38 Continuing the run until the breakpoint
(gdb) c
Continuing.

Breakpoint 1, main () at test_op_imm.s:16
16      ret
(gdb)
You end the different runs in the three terminals with ctrl-c (or q for gdb; two
successive ctrl-c for spike). You close the terminals with ctrl-d.
Compile the basicmath_simple.c file with the command in Listing 3.40 (the
current directory should be the chapter_3 folder).
Listing 3.40 Compile the basicmath_simple.c file
$ riscv32-unknown-elf-gcc -nostartfiles -T spike.lds -g -O3 basicmath_simple.c -o basicmath_simple -lm
$
Start OpenOCD in a second terminal with the command in Listing 3.42 (the
current directory should be the chapter_3 folder).
Listing 3.42 Start OpenOCD
$ openocd -f spike.cfg
Open On-Chip Debugger 0.11.0+dev-00550-gd27d66b
...
Info : starting gdb server for riscv.cpu on 3333
Start gdb in a third terminal with the command in Listing 3.43 (the current direc-
tory should be the chapter_3 folder).
Listing 3.43 Start gdb
$ riscv32-unknown-elf-gdb
GNU gdb (GDB) 10.1
...
Type "apropos word" to search for commands related to "word".
(gdb)
Connect to OpenOCD, specify the executable, and load the linker information
with the commands in Listing 3.44.
Listing 3.44 Continue gdb
(gdb) target remote localhost:3333
Remote debugging using localhost:3333
warning: No executable has been specified and target does not support
determining executable automatically. Try using the "file" command.
0x00000000 in ?? ()
(gdb) file basicmath_simple
A program is being debugged already.
Are you sure you want to change the file? (y or n) y
Reading symbols from basicmath_simple...
(gdb) load
Loading section .text, size 0x8390 lma 0x10010000
Loading section .text.startup, size 0x58 lma 0x10018390
...
Loading section .sdata._impure_ptr, size 0x4 lma 0x1001ccc8
Start address 0x10010000, load size 52417
Transfer rate: 2 KB/sec, 1310 bytes/write.
(gdb)
Set pc and sp with the commands in Listing 3.45 (pc is set at main, i.e.
.text.startup; if the value returned by your gdb is not 0x10018390, update the pc
initialization accordingly; sp is set at the end of the memory, i.e. 0x10020000).
Listing 3.45 Set pc and sp
(gdb) set $pc = 0x10018390
(gdb) set $sp = 0x10020000
(gdb)
List the source code. As the -g option has been added to the compile command
with basicmath_simple.c as the input file, gdb works on the C source file rather than
on the assembly code, as it did for test_op_imm. Run the commands in Listing
3.46.
Listing 3.46 List the source code
(gdb) l
18        else {
19          *solutions = 1;
20          x[0] = pow(sqrt(R2_Q3) + fabs(R), 1/3.0);
21          x[0] += Q / x[0];
Place a breakpoint on the return instruction in main, line 33 with the command
in Listing 3.47.
Listing 3.47 Place a breakpoint at the end of the run
(gdb) b 33
Breakpoint 1 at 0x100183dc: file basicmath_simple.c, line 33.
(gdb)
You continue the run until the breakpoint with the gdb c command in Listing
3.48.
Listing 3.48 Continue the run up to the breakpoint
(gdb) c
Continuing.

Breakpoint 1, main () at basicmath_simple.c:33
33        return;
(gdb)
The result is somewhere on the stack (at the place allocated for array x by the run; x
should contain three double values corresponding to 2.0, 6.0, and 2.5, i.e. in hexadec-
imal: 0x4000000000000000, 0x4018000000000000, and 0x4004000000000001).
To find the result, you can dump the memory used by the stack with the gdb
"x" command in Listing 3.49 (the "x/16x 0x1001ffc0" command dumps 16 words in
hexadecimal format, starting at address 0x1001ffc0).
Listing 3.49 Dump the stack
(gdb) x/16x 0x1001ffc0
0x1001ffc0:  0x1001ffd4  0x1001ffd8  0x00000000  0x00000000
0x1001ffd0:  0x00000000  0x00000003  0x00000000  0x40000000
0x1001ffe0:  0x00000000  0x40180000  0x00000001  0x40040000
0x1001fff0:  0x00000000  0x00000000  0x00000000  0x00000000
(gdb)
The three values are located at addresses 0x1001ffd8-df, 0x1001ffe0-e7, and
0x1001ffe8-ef (low order 32-bit word first, i.e. little endian).
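To check the little-endian reconstruction on the host, you can reassemble the first pair of dumped words into a double (a self-contained sketch, not part of the book's sources):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main() {
  /* words as dumped at 0x1001ffd8 and 0x1001ffdc: low word first */
  uint32_t w[2] = {0x00000000, 0x40000000};
  double d;
  memcpy(&d, w, sizeof(d)); /* reinterpret the 8 bytes as a double */
  printf("%f\n", d);        /* prints 2.000000 on a little-endian host */
  return 0;
}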
4 The RISC-V Architecture
Abstract
This chapter briefly presents the RISC-V architecture and more precisely, its
RV32I instruction set with examples taken from the compiler translations of small
C codes.
The RV32I ISA does not include a Floating-Point Unit (FPU). The compiler and the
libraries provide the necessary functions to simulate the floating-point operations
from integer ones. In this book, I will test the proposed processors on benchmarks
involving floating-point computations.
Figure 4.1 shows the mapping of the integer register file. There are 32 registers (32
bits wide in the RV32I ISA), named x0 to x31. The first 16 registers are in black in
the figure and the last 16 are in red. In the RV32I ISA, the 32 registers are
all available. In the RVE (or RV32E) ISA (E stands for embedded), the register file
is restricted to the registers in black.
In the figure, the alias column, the role column, and the save column define the
RISC-V Application Binary Interface (ABI). The ABI is a general frame to build
applications from the processor architecture.
Each register has an ABI name (the alias column) related to the suggested role of
the register (the role column).
The save column shows which registers should be saved in the stack. When a
saved register is used in a function, it should first be copied to the stack. It should
also be restored (from the stack copy) before returning from the function. In such a
way, its content is preserved across function calls.
Non-saved registers either contain global values, valid throughout the application,
or temporary values, only valid in a local computation within a single function level.
The instruction set does not really distinguish the 32 registers except for the first
of them. This means that all the instructions apply to all the registers and manipulate
them the same way. But the first register has a special meaning.
Register zero or x0 contains the value 0 and cannot be changed. You cannot write
to register zero (if you use this register as a destination, your write operation is
simply discarded). Each time you need to mention the 0 special value, you can use
register zero. For example, if you want to say something like "if (x==0) goto label",
you would write "beq a0, zero, label" (i.e. branch to label if a0 and zero are equal),
with register a0 holding the value of variable x.
Even though the architecture does not give any particular role to the registers
through the ISA, the ABI does impose some common usage for all the registers.
The programmer is strongly invited to follow the ABI constraints to keep his/her
programs compatible with the other available pieces of software (e.g. libraries) and
with the different translators like compilers or assemblers.
Register ra or x1 should be used to save the return address, i.e. the destination
of call instructions (JAL and JALR). The usage is to write "jal ra, foo" to call the foo
function and save the return address into register ra. However, you could as well
write "jal t0, foo" and save the return address into the t0 register. You can also write
"jal zero, label". In this case, you do not save any return address and the jal instruction
is used as a jump to label.
Register sp or x2 should be used as a stack pointer, i.e. it should contain the
address of the top of stack and move according to space allocations and liberations
(e.g. "addi sp,sp,-16" run at the start of a function allocates 16 bytes on the top of
stack; "addi sp,sp,16" run at the end of the function de-allocates the 16 bytes; in the
function, the 16 bytes can be used as a frame to save and restore values with store
and load instructions).
Registers a0 to a7 (or x10 to x17) should be used to hold function arguments (and
results for a0 and a1).
Registers t0 to t6 (or x5 to x7 and x28 to x31) should be used to hold temporary
values.
Registers s0 to s11 (or x8, x9, and x18 to x27) should be preserved across functions
(if one of these registers is used in a function, it should be saved in the function stack
frame at the function start and restored at the function end).
Operating systems may use registers x3 or gp as a global pointer (i.e. for the
global data of the running process), x4 or tp as a thread pointer (data for the running
thread), and x8 or s0 or fp as a frame pointer, i.e. a portion of stack allocated for a
function to hold its arguments and locals.
The instruction set is traditionally partitioned into three subsets: computing instruc-
tions, control flow instructions, and memory access instructions.
Figure 4.2 shows examples of RISC-V instructions. Computing instructions are
in red, control flow instructions are in brown, and memory accesses are in green.
The instructions are themselves data, i.e. words to be placed into memory.
The RV32I ISA is a set of 32-bit instructions. Each instruction is defined as a
32-bit word. All the semantic details concerning an instruction are encoded in these
32 bits. The way the encoding is done is called a format.
The RV32I instruction encoding ([1], Chap. 24) defines four formats: Register or
R-TYPE, Immediate or I-TYPE, Upper or U-TYPE, and Store or S-TYPE.
Figure 4.3 shows the decomposition of the 32-bit instruction word into the main
fields according to the format (the B-TYPE is a variant of the S-TYPE and the
J-TYPE is a variant of the U-TYPE).
The two low order bits are set to 11 (there are instructions with their two low or-
der bits different from 11 but not in the RV32I nucleus; they belong to the RVC
extension—the C stands for Compressed—which encodes instructions in 16-bit
words).
Figs. 4.4-4.7 (example encodings):
"sub a0, a1, a2" (R-TYPE, opcode OP) = [0100000, 01100, 01011, 000, 01010, 0110011], i.e. 0x40c58533
"addi a0, a0, -1" (I-TYPE, opcode OP_IMM) = [111111111111, 01010, 000, 01010, 0010011], i.e. 0xfff50513
"lw a0, 4(a1)" (I-TYPE, opcode LOAD) = [000000000100, 01011, 010, 01010, 0000011], i.e. 0x0045a503
"jalr ra, 4(a0)" (I-TYPE, opcode JALR) = [000000000100, 01010, 000, 00001, 1100111], i.e. 0x004500e7
The last four terms keep the same meaning as in the R-TYPE format (the opcode
term is OP_IMM instead of OP).
The imm12 term is the immediate value, i.e. the constant involved in the compu-
tation instruction. It is a signed 12-bit field (hence the constant range is from 0x800
to 0x7ff, or from −2^11 to 2^11 − 1).
For example (see Fig. 4.5), "addi a0, a0, -1" is the quintuplet [0b111111111111,
0b01010, 0b000, 0b01010, 0b0010011] (i.e. 0xfff50513 in hexadecimal).
Instruction codes ending in 0x13 or 0x93 are I-TYPE ALU instructions (opcode
OP_IMM).
The load instructions (see Fig. 4.6) also have the I-TYPE format (e.g. "lw a0, 4(a1)"
is the quintuplet [0b000000000100, 0b01011, 0b010, 0b01010, 0b0000011], i.e.
0x0045a503).
Instruction codes ending in 0x03 or 0x83 are I-TYPE load instructions (opcode
LOAD).
The indirect jump instructions JALR (see Fig. 4.7) are also of the I-TYPE format
(e.g. "jalr ra, 4(a0)" is the quintuplet [0b000000000100, 0b01010, 0b000, 0b00001,
0b1100111], i.e. 0x004500e7).
Instruction codes ending in 0x67 or 0xe7 are I-TYPE jalr instructions (opcode
JALR).
Example S-TYPE, B-TYPE, and J-TYPE encodings:
"sw a0, 4(a1)" (S-TYPE, opcode STORE) = [0000000, 01010, 01011, 010, 00100, 0100011], i.e. 0x00a5a223
"beq a1, a0, 12" (B-TYPE, opcode BRANCH) = [0, 000000, 01010, 01011, 000, 0110, 0, 1100011], i.e. 0x00a58663
"jal ra, 12" (J-TYPE, opcode JAL) = [0, 0000000110, 0, 00000000, 00001, 1101111], i.e. 0x00c000ef
Fig. 4.12 Decoding RV32I instructions from the least significant byte
Instruction codes ending in 0x6f or 0xef are J-TYPE “jal” instructions (opcode
JAL).
The best way to learn an assembly language is to use the compiler. You write a piece
of C which you compile with a moderate optimization level (i.e. -O0 or -O1) and the
-S option (to produce the assembly source file). Then, you just have to look at the
assembly file and compare it to the C source.
4.2.1 Expressions
Compile the C piece of code shown in Listing 4.1 (all the C source files in this chapter
are in the chapter_4 folder; all the compilation commands are in the compile.txt
file in the same folder; the C source code to compute delta is in the exp.c file; the
compilation command for exp.c builds exp.s).
Listing 4.1 Compiling a C expression
void main() {
  int a=3, b=5, c=2, delta;
  delta = b*b - 4*a*c;
}
The "exp.s" file produced is shown in Listing 4.2 (from the original "exp.s" file,
some unimportant details have been removed and comments have been added).
Listing 4.2 An expression in RISC-V assembly language
...
main:   addi sp,sp,-32      /* allocate 32 bytes on the stack */
        sw   ra,28(sp)      /* save ra on the stack */
        sw   s0,24(sp)      /* save s0 on the stack */
        sw   s1,20(sp)      /* save s1 on the stack */
        addi s0,sp,32       /* copy the stack pointer to s0 */
        li   a5,3           /* a5=3; "a" is initialized */
        sw   a5,-20(s0)     /* place "a" on the stack */
        li   a5,5           /* a5=5; "b" is initialized */
        sw   a5,-24(s0)     /* place "b" on the stack */
        li   a5,2           /* a5=2; "c" is initialized */
        sw   a5,-28(s0)     /* place "c" on the stack */
        lw   a1,-24(s0)     /* load "b" into a1 */
        lw   a0,-24(s0)     /* load "b" into a0 */
        call __mulsi3       /* multiply a0 by a1, result in a0 */
        mv   a5,a0          /* a5=a0 ("b*b") */
        mv   s1,a5          /* s1=a5 ("b*b") */
        lw   a1,-28(s0)     /* load "c" into a1 */
        lw   a0,-20(s0)     /* load "a" into a0 */
        call __mulsi3       /* multiply a0 by a1, result in a0 */
        mv   a5,a0          /* a5=a0 ("a*c") */
        slli a5,a5,2        /* a5=a5<<2; "4*a*c" */
        sub  a5,s1,a5       /* a5=s1-a5; "b*b-4*a*c" */
        sw   a5,-32(s0)     /* place "b*b-4*a*c" on the stack */
        nop                 /* no operation */
        lw   ra,28(sp)      /* restore ra */
        lw   s0,24(sp)      /* restore s0 */
        lw   s1,20(sp)      /* restore s1 */
        addi sp,sp,32       /* free the 32 bytes from the stack */
        jr   ra             /* return */
As you can see, it is easy to read, but tedious, mostly because there are so many
unnecessary transfers to and from memory.
This is due to the low level of optimization I have used to compile (-O0). With a
higher level (even only -O1), the compiler would have simplified the code by simply
doing nothing (as the delta result is not used).
I could trick the compiler into computing delta by printing its value. But
even in this case, the compiler would be smarter than me and simply compute delta
itself (as it is a few simple operations on constants) and print the resulting constant.
However, this simple example gives a lot of information: how to write to and read
from memory, how to call a function (__mulsi3 which is provided by the standard
library; the __mulsi3 function multiplies its two arguments in registers a0 and a1
and returns the product into register a0).
It also shows how to initialize (li) and move (mv) registers, to allocate and free
space on the stack ("addi sp, sp, constant") and many other details.
4.2.2 Tests
Compile the C piece of code shown in Listing 4.3 (the C source file is test.c; the
compilation command in the compile.txt file builds test.s).
Listing 4.3 Compiling C tests
#include <stdio.h>
void main() {
  int a=3, b=5, c=2, delta;
  delta = b*b - 4*a*c;
  if (delta<0)       printf("no real solution\n");
  else if (delta==0) printf("one solution\n");
  else               printf("two solutions\n");
}
The "test.s" file produced is shown in Listing 4.4 (I removed the computation of
"b*b-4*a*c" which you can see in "exp.s").
Listing 4.4 A few tests in RISC-V assembly language
.LC0:   .string "no real solution"
        .align 2
.LC1:   .string "one solution"
        .align 2
.LC2:   .string "two solutions"
...
main:   addi sp,sp,-32
        sw   ra,28(sp)
        sw   s0,24(sp)
        sw   s1,20(sp)
        addi s0,sp,32
        ...                 /* the "b*b-4*a*c" computation */
        sw   a5,-32(s0)     /* place "b*b-4*a*c" on the stack */
        lw   a5,-32(s0)     /* load "b*b-4*a*c" into a5 */
        bge  a5,zero,.L2    /* if (b*b-4*a*c>=0) goto .L2 */
        lui  a5,%hi(.LC0)   /* a0 = .LC0 */
        addi a0,a5,%lo(.LC0)
        call puts           /* puts("no real solution") */
        j    .L5            /* goto .L5 */
.L2:    lw   a5,-32(s0)     /* load "b*b-4*a*c" into a5 */
Apart from the branch and jump instructions, this example shows how to initialize
a register with an address in the code (e.g. "a0 = .LC0"). This is done in two steps.
First, the register is set with the upper part of the address ("lui a5,%hi(.LC0)"; %hi
is a directive to the assembler which extracts the 20 upper bits of the .LC0 address).
Second, the lower part is added ("addi a0,a5,%lo(.LC0)"; %lo extracts the 12 lower
bits of the .LC0 address).
There are also examples of .align and .string assembler directives. Any word
starting with a dot is either a label (if it starts a line) or an assembler directive (if it
does not start a line).
Assembler directives are not RISC-V instructions. The assembler builds a text
section (i.e. a future code memory) with the hexadecimal values of the instructions.
It also builds a data section with the data values found in the text (e.g. what follows
the .LC0 label).
The ".align 2" directive aligns the next construction (either a RISC-V instruction
or some data) on the next even address (i.e. it moves one byte forward if the current
construction ends on an odd address, otherwise it is ineffective).
The .string directive builds a data in the data section from the string which follows.
4.2.3 Loops
The program shown in Listing 4.5 computes the 10th term of the Fibonacci sequence.
Listing 4.5 Compiling C for loop
#include <stdio.h>
void main() {
  int i, un, unm1=1, unm2=0;
  for (i=2; i<=10; i++) {
    un   = unm1 + unm2;
    unm2 = unm1;
    unm1 = un;
  }
  printf("fibonacci(10)=%d\n", un);
}
Compile with the -O1 optimization level (the C source file is loop.c; the compi-
lation command in the compile.txt file builds loop.s). The RISC-V assembly code
produced by the compilation is shown in Listing 4.6.
Compile the code shown in Listing 4.7 with optimization level 1 and build the
assembly file (the C source file is fib.c; the compilation command builds the fib.s
file).
Listing 4.7 Compiling C function calls
#include <stdio.h>
unsigned int fibonacci(unsigned int n) {
  unsigned int i, un, unm1=1, unm2=0;
  if (n==0) return unm2;
  if (n==1) return unm1;
  for (i=2; i<=n; i++) {
    un   = unm1 + unm2;
    unm2 = unm1;
    unm1 = un;
  }
  return (un);
}
void main() {
  printf("fibonacci(0)=%d\n",  fibonacci(0));
  printf("fibonacci(1)=%d\n",  fibonacci(1));
  printf("fibonacci(10)=%d\n", fibonacci(10));
  printf("fibonacci(11)=%d\n", fibonacci(11));
  printf("fibonacci(12)=%d\n", fibonacci(12));
}
The RISC-V assembly translation of the main function is shown in Listing 4.9.
Listing 4.9 Function calls in RISC-V assembly language: the main function
...
main:   addi sp,sp,-16
        sw   ra,12(sp)
        li   a0,0           /* n=0 */
        call fibonacci      /* fibonacci(n) */
        mv   a1,a0          /* a1 = fibonacci(n) */
        lui  a0,%hi(.LC0)
        addi a0,a0,%lo(.LC0)
        call printf         /* printf("fibonacci(0)=%d\n", fibonacci(n)) */
        li   a0,1           /* n=1 */
        call fibonacci      /* fibonacci(n) */
        mv   a1,a0          /* a1 = fibonacci(n) */
        lui  a0,%hi(.LC1)
        addi a0,a0,%lo(.LC1)
        call printf         /* printf("fibonacci(1)=%d\n", fibonacci(n)) */
        li   a0,10          /* n=10 */
        call fibonacci      /* fibonacci(n) */
        mv   a1,a0          /* a1 = fibonacci(n) */
        lui  a0,%hi(.LC2)
        addi a0,a0,%lo(.LC2)
        call printf         /* printf("fibonacci(10)=%d\n", fibonacci(n)) */
        li   a0,11          /* n=11 */
        call fibonacci      /* fibonacci(n) */
        mv   a1,a0          /* a1 = fibonacci(n) */
        lui  a0,%hi(.LC3)
        addi a0,a0,%lo(.LC3)
        call printf         /* printf("fibonacci(11)=%d\n", fibonacci(n)) */
        li   a0,12          /* n=12 */
        call fibonacci      /* fibonacci(n) */
        mv   a1,a0          /* a1 = fibonacci(n) */
        lui  a0,%hi(.LC4)
        addi a0,a0,%lo(.LC4)
        call printf         /* printf("fibonacci(12)=%d\n", fibonacci(n)) */
        lw   ra,12(sp)
        addi sp,sp,16
        jr   ra
The calls apply the ABI constraints, i.e. they transmit arguments through registers
a0 to a7, starting with a0, and return their result in a0.
There is much more to learn about assembly programming but this is enough to
be able to build a processor.
References
1. https://riscv.org/specifications/isa-spec-pdf/
2. https://www.cs.virginia.edu/~robins/Turing_Paper_1936.pdf
3. D. Patterson, A. Waterman: The RISC-V Reader: An Open Architecture Atlas (Strawberry
Canyon, 2017)
5 Building a Fetching, Decoding, and Executing Processor
Abstract
This chapter prepares the building of your first RISC-V processor. First, a fetching
machine is implemented. It is only able to fetch successive words from a code
memory. Second, the fetching machine is upgraded to include a decoding mech-
anism. Third, the fetching and decoding machine is completed with an execution
engine to run computation and control instructions, but not yet memory accessing
ones.
If you are familiar with classic programming in C/C++, you will find surprising
methods in HLS programming.
In classic programming, you improve your code either to save time (I mean exe-
cution time, i.e. running the program faster) or (not exclusively) to save space, i.e.
use fewer memory words.
The piece of code in Listing 5.1 shows a loop representing the simulation of a
processor. The loop iteration is composed of four functions representing succes-
sive phases in the processing of one instruction: fetch the instruction from the code
memory, decode it, execute it, and set an is_running continuation condition.
It is a do ... while loop because it has to be run at least once and because its
termination depends on the processed instruction.
Listing 5.1 A basic processor simulation loop
do {
  fetch(pc, code_ram, &instruction);
  decode(instruction, &d_i);
  execute(pc, reg_file, data_ram, d_i, &pc);
  running_cond_update(instruction, reg_file, &is_running);
} while (is_running);
In classic programming, the execution time of such a loop is the sum of the
execution times of all the iterations.
For a given program to be simulated by the loop, the number of iterations cannot
be changed. It is equal to the number of simulated instructions, i.e. the number of
instructions of the program run. Hence, the only way to improve the execution time
of the simulation is to speed up at least one iteration by improving the code of any
of the four functions.
The simulation loop can be used in an HLS tool to define an IP. In this case, each
iteration of the do ... while loop represents a cycle of the IP.
This is where the first very fundamental difference between a software simulator
and an IP defined by an HLS program resides.
In an IP, all the iterations have the same duration, the IP cycle time, defined by the critical path, i.e. the longest delay along any path of successive gates from the cycle start (i.e. the iteration start) to the cycle end (i.e. the iteration end).
When looking at the iteration which successively calls the four functions, it seems that the delay is the same for all the iterations. However, the execute function is coded to be able to execute any instruction of the processor instruction set. An addition instruction does not use the same computing unit as a logical and, for example. The delay of an adder is probably much longer than the delay of a simple AND gate (in this matter, hardware differs from software: in C, the "a = b & c" expression is considered as time equivalent to the "a = b + c" expression).
Figure 5.1 shows two different paths, one crossing an AND gate and the other
crossing an adder. The execute unit has many other unshown paths to take care of all
the other instructions in the ISA. For the two shown paths, the longest delay is the
adder one. Hence, the critical path will be at least as long as the path crossing the
adder.
For the IP, the run time of any instruction fits in the critical path, the logical and as well as the addition.
Improving the IP requires shortening the critical path, i.e. improving the execution time of the longest iteration. Comparatively, improving the execution time of any iteration improves the software simulation time.
In the software simulator, reducing the execution time of one iteration benefits the overall execution time. In the IP, reducing the delay of all the paths except the critical one does not improve the IP cycle at all.
A second difference concerns conditional computations of the form "if (condition) dosomething();". In the IP, the "if (condition)" may lengthen the critical path, hence degrade the cycle. Removing it changes the semantics of course. But this change can have no impact on the IP behaviour. This is the case if the dosomething function has only a local impact on the current iteration and no global impact on the next iterations or on the computation after the loop exit.
In software simulation, the condition should not be removed because its elimina-
tion would degrade the performance. In the IP, the “if (condition)” removal should
be considered. It implies computing more (i.e. uselessly computing “dosomething”
when the “condition” is false). But in hardware, computing more can improve the
cycle time. Including useless computations is a good choice if it helps to shorten the
critical path.
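As an illustration, here is a minimal sketch of the transformation (the dosomething body and the function names are illustrative assumptions, not taken from the book's code):

int dosomething(int x) { return 3 * x + 1; } /* illustrative body */

/* conditional version: the test may lengthen the critical path */
int step_with_if(int x, int condition) {
  if (condition) x = dosomething(x);
  return x;
}

/* compute more, then select: dosomething is always evaluated and a
   simple multiplexer picks the final value */
int step_select(int x, int condition) {
  int y = dosomething(x);
  return condition ? y : x;
}

In software, step_select always pays for the dosomething computation; in hardware, both versions instantiate the dosomething logic anyway, and the second one may shorten the control path.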
A third difference is illustrated by the way loops are executed. In software simulation,
there is no parallelism. Iterations are run one after the other. In an IP, a for loop with
a static number of iterations is unrolled.
That means an IP runs all the iterations of the loop in Listing 5.3 in parallel (dosomething(0) ... dosomething(15)) and the iteration counter i is eliminated (no "i++" update, no "i < 16" test).
Listing 5.3 for loop computation
for (i=0; i<16; i++)
  dosomething(i);
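Conceptually, the synthesizer treats the loop of Listing 5.3 as if it had been written in the fully unrolled form sketched below (dosomething is assumed to be some user function); the sixteen calls become sixteen hardware instances which can operate in parallel when they are independent:

void dosomething(int i); /* assumed user function */

void unrolled_equivalent(void) {
  dosomething(0);  dosomething(1);  dosomething(2);  dosomething(3);
  dosomething(4);  dosomething(5);  dosomething(6);  dosomething(7);
  dosomething(8);  dosomething(9);  dosomething(10); dosomething(11);
  dosomething(12); dosomething(13); dosomething(14); dosomething(15);
  /* no "i++" update and no "i < 16" test remain */
}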
In an IP, parallelism also concerns the way functions are run, as illustrated in
Listing 5.4. If functions f and g are independent (i.e. they could be permuted without
changing the result, which is true if and only if f does not modify any variable used
by g and conversely, g does not modify any variable used by f ), they are implemented
to be run in parallel in the IP but still sequentially called in the software simulation.
Listing 5.4 Independent functions
f(...);
g(...);
Consider a fill function which sets the three fields of a structure. The execution time of the fill function simulation is the cumulated time of the three field settings. The execution time on the FPGA is either independent of the fill function (if its run is not on the critical path) or related only to the longest of the three field value computations, if these computations are independent of one another.
If you add a fourth field to the structure, the simulation time is increased but the IP critical path is probably not impacted (it is impacted only if the fill function run is on the critical path and if the new field computation is longer than the three other ones).
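A minimal sketch of what such a fill function could look like (the structure and the field computations are illustrative assumptions):

typedef struct {
  int a;
  int b;
  int c;
} three_fields_t;

void fill(int x, three_fields_t *s) {
  /* the three settings are mutually independent, so the synthesizer
     can compute them in parallel: the FPGA delay is the delay of the
     longest of the three, not their sum */
  s->a = x + 1;
  s->b = x << 3;
  s->c = x & 0xff;
}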
The execution time of a program applied to some data on a processor is given by the
number of cycles of the run multiplied by the cycle duration. To highlight how the
ISA and the processor implementation can impact the performance, this equation to
compute the execution time can be further detailed by using three terms instead of
two.
The execution time (in seconds) of a program P applied to a data D on a CPU C is the product of three values: the number of machine instructions run (nmi(P, D, C)), the average number of processor cycles to run one instruction (cpi(P, D, C), the number of Cycles Per Instruction or CPI), and the duration (in seconds) of a cycle (c(C)), as formalized in Eq. 5.1.
time(P, D, C) = nmi(P, D, C) ∗ cpi(P, D, C) ∗ c(C) (5.1)
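For example, with hypothetical values nmi(P, D, C) = 1,000,000 instructions, cpi(P, D, C) = 3, and c(C) = 30 ns, the run takes 1,000,000 ∗ 3 ∗ 30 ns = 90 ms.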
To improve the overall performance of a CPU C, its cycle c(C) can be decreased.
This improvement can be obtained by shortening the critical path either through a
microarchitectural improvement (i.e. eliminating some gates on the path), or through
a technological improvement (i.e. reducing the size of the transistors to decrease the
delay to cross the critical path because of its reduced length). It can also be achieved
by increasing the voltage to increase the processor component frequency through
overclocking.
The performance is improved only if, at the same time, the number of cycles of the run is not increased (if we assume that the number of instructions run nmi(P, D, C) is unchanged, an increase of the number of cycles comes from an increase of the CPI).
On the FPGA, we cannot reduce the size of the transistors of course. It is possible
to increase the frequency of the programmable part of the FPGA but I will not
investigate this possibility in the book. Hence, the only way to improve the designs
is to eliminate gates on the critical path through microarchitectural improvements.
Throughout the book, I will present successive implementations of a RISC-V
processor. I will apply Eq. 5.1 to a set of program examples and compare the computed
execution times to illustrate the improvements.
Now that you know how to run an IP on a development board (Chap. 2), I will assume
you are able to test all the codes given in the book. They have all been tested on my
Pynq-Z1 board.
The code presented in the book is not the full code. Only the parts which deserve
some explanations are presented in the listings throughout the book.
Each IP project is defined as a subfolder in the goossens-book-ip-projects/2022.1 folder. Each subfolder contains the full code of the corresponding IP project, which I recommend you have a look at.
For example, the three IP projects presented in this chapter, i.e. fetching_ip,
fetching_decoding_ip, and fde_ip, have their full code in the three folders with the
same names.
Moreover, in each IP folder you will find prebuilt Vitis_HLS projects to help you
simulate the proposed IPs without typing the code. You will also find predesigned
Vivado projects including the Vitis_HLS prebuilt IPs. In these Vivado projects you
will find pregenerated bitstreams.
Hence, to test the processor IPs defined in the book, you only have to set the
helloworld drivers in the Vitis IDE workspaces from helloworld.c files included in
the IP folders, as explained in Sect. 2.7.
A processor is made of three complementary paths: the path to update the program
counter (pc), the path to update the register file, and the path to update the data
memory.
The fetching_ip design concentrates on the first path. Its duty is also to read the instruction addressed by pc from the code memory. In each cycle, a new instruction is fetched (i.e. read) and, in parallel, the new pc is computed.
A fetching IP is a component which reads instructions from a memory containing instruction words. The fetching process continues while a continuation condition remains true. This continuation condition should be related to the fetched instructions themselves.
(Fig. 5.2: the fetching_ip hardware: the pc separation register holding the current pc, the code RAM addressed by pc, and the "+1" adder computing next pc from current pc)
In the fetching_ip design, I will define that condition as: “the fetched instruction
is not a RET ” (RET is the return from function pseudo-instruction).
The fetching IP moves along the memory. It starts from an initial address and
moves forward (during each IP cycle, the address is incremented by the size of the
fetched instruction, i.e. four bytes or one word as I restrict the recognized ISA to
the RV32I set, excluding the RVC standard extension devoted to compressed 16-bit
instructions).
Figure 5.2 shows the hardware I will implement in the fetching_ip design. The
leftmost vertical rectangle represents a separation register, which is a clocked circuit.
The dotted line represents the separation between the register input and the output.
At the beginning of each IP cycle, the present state at the left of the dotted line
is copied to the right side. During the cycle, the left and the right sides are isolated
from each other: what is computed after the "+1" square, labeled next pc, is driven
to the input of the pc separation register but does not cross the dotted line before the
next cycle starts.
The code RAM box is a memory circuit. The addressed word is copied to the
instruction output.
The "+1" box is an adder circuit. It outputs the incremented value of its input.
All the source files related to the fetching_ip can be found in the fetching_ip folder.
The pinout of the top function is associated with a set of INTERFACE pragmas (one pragma for each argument).
HLS pragmas are pieces of information for the synthesizer, as mentioned in Sect. 2.5.4. They are ignored by the C compiler when the code is compiled before running an IP simulation. The synthesizer uses the pragmas to build its translation to RTL.
The INTERFACE pragmas are used by the synthesizer to organize the way the IP is connected to its surroundings. They offer many protocols to exchange data from/to the IP. I will progressively introduce these different protocols.
The axilite protocol used in the fetching_ip is related to the AXI communication
interface (I present the AXI interface in Chap. 11).
The s_axilite name designates a slave, i.e. the fetching_ip component is a slave
device of the AXI interconnect. A slave receives requests (i.e. reads or writes) from
masters and serves them.
An IP using an AXI interconnection interface can be connected to an AXI inter-
connect IP which links IPs together and allows them to communicate.
In the preceding chapter, I connected the adder_ip to the Zynq7 Processing System IP through an AXI interconnect IP (in fact, the interconnection was done automatically by Vivado). Later in this chapter, I connect the fetching_ip the same way.
The top function arguments using the s_axilite interface should be typed with the C/C++ basic integer types (i.e. int, unsigned int, char, short, and pointers). You should not use the template types provided by Vitis through the ap_int.h header file (these types are presented in Sect. 5.3.2.4). The reason is that the AXI interface uses a standard 32-bit bus with byte enable, hence IP arguments should have a one-, two-, or four-byte width.
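As an illustration, here is a reduced sketch of such a pinout (the argument list matches the start_pc and code_ram arguments described below; the full top function is in the fetching_ip.cpp file):

#define LOG_CODE_RAM_SIZE 16
#define CODE_RAM_SIZE (1 << LOG_CODE_RAM_SIZE)

void fetching_ip(
    unsigned int start_pc,
    unsigned int code_ram[CODE_RAM_SIZE]) {
#pragma HLS INTERFACE s_axilite port=start_pc
#pragma HLS INTERFACE s_axilite port=code_ram
#pragma HLS INTERFACE s_axilite port=return
  /* ... the fetch / execute / running_cond_update loop goes here ... */
}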
To run the fetching_ip, its input arguments should be set, i.e. the start_pc variable
with the starting point of the code to be run and the code_ram memory with the
RISC-V code itself.
The Zynq IP sends these initial values to the fetching_ip through the AXI inter-
connect IP as shown in Fig. 5.3.
The Zynq IP writes to the AXI interconnect bridge. The write base address iden-
tifies the fetching_ip component. The write address offset identifies the fetching_ip
top function argument, i.e. either start_pc or code_ram. Multiple writes are neces-
sary to initialize the fetching_ip component.
For example, to initialize the start_pc argument as address 0, the Zynq IP writes
word 0 to address *(fetching_ip + start_pc).
130 5 Building a Fetching, Decoding, and Executing Processor
5.3.2.6 Parallelism
The fetching IP top function successively calls three functions: fetch, execute, and
running_cond_update.
The fetch function reads an instruction, the execute function computes the next
pc, and the running_cond_update function computes the is_running condition to
continue or end the do ... while loop.
This succession of function calls leads to a sequential run according to the C semantics, and this is how the IP will be simulated. In hardware however, the gates and circuits implementing the functions are linked together if they have some producer/consumer dependencies between their arguments.
When two functions are independent, they are run in parallel. The same holds for all the computations implementing an IP.
While the synthesizer implements your code, it points out the independent computations, hence their parallelism. When you think about your code, you should always keep in mind the hardware reordering of the computations.
The Schedule Viewer of the Vitis_HLS tool will help you visualize when an operation is done, how long it lasts, and what it depends on.
This is one of the nice things when it comes to using HLS tools: to access memory, just access the C array representing the memory. As a general rule, each time you need to access some hardware unit, just use a C variable or a C operator and let the synthesizer transform your C code into hardware.
For example, if you want to compute a product between integer values, just use
the C multiplication operator on two integer variables (e.g. a*b). The synthesizer
will miraculously build the necessary integer multiplier in the FPGA. If you want to
multiply two floating-point numbers, just type the variables as float or double and
the synthesizer will use a floating-point multiplier instead of an integer one.
Listing 5.9 The fetch.cpp file
# include " debug_fetching_ip .h"
# include " fetching_ip .h"
# ifndef __SYNTHESIS__
# ifdef D E B U G _ F E T C H
# i n c l u d e < stdio .h >
# endif
# endif
void fetch (
c o d e _ a d d r e s s _ t pc ,
i n s t r u c t i o n _ t * code_ram ,
instruction_t * instruction ){
# p r a g m a HLS I N L I N E off
* i n s t r u c t i o n = c o d e _ r a m [ pc ];
# ifndef __SYNTHESIS__
# ifdef D E B U G _ F E T C H
p r i n t f ( " %04 d : %08 x \ n " , ( int ) ( pc < <2) , * i n s t r u c t i o n ) ;
# endif
# endif
}
To minimize the number of flip-flops used to hold the pc register, the two
cleared low-order bits are kept implicit. Thus, the pc register has a width of
LOG_CODE_RAM_SIZE bits (it is a word pointer; the code_address_t type is de-
fined as ap_uint<LOG_CODE_RAM_SIZE> in the fetching_ip.h file; see Listing
5.8).
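A short sketch of the corresponding declaration and of the word/byte address relation (the two implicit zero bits reappear when a byte address is needed, as in the debugging printf of Listing 5.9):

#include "ap_int.h"
#define LOG_CODE_RAM_SIZE 16
typedef ap_uint<LOG_CODE_RAM_SIZE> code_address_t; /* a word pointer */

/* byte address = word address with the two implicit zero bits appended */
unsigned int byte_address(code_address_t pc) {
  return ((unsigned int)pc) << 2;
}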
The execute function is defined in the execute.cpp file and shown in Listing 5.11.
It only computes the next pc.
Listing 5.11 The execute function
void execute(
    code_address_t pc,
    code_address_t *next_pc) {
#pragma HLS INLINE off
  *next_pc = compute_next_pc(pc);
}
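The compute_next_pc function it calls (Listing 5.12, not reproduced here) reduces to the word increment; a minimal sketch:

static code_address_t compute_next_pc(code_address_t pc) {
#pragma HLS INLINE
  return pc + 1; /* one word, i.e. four bytes, as pc is a word pointer */
}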
The next_pc variable always points to the next instruction in the code memory.
The fetching IP reads successive addresses. In the fetching_decoding_ip design
presented in Sect. 5.4, I will add the possibility to change the control flow through
branches and jumps.
At the end of the do ... while loop (see Listing 5.7), the running_cond_update
function sets the loop continuation condition.
For a processor, this should correspond to the last instruction run, i.e. the return
from the main function.
In the fetching_ip, I stop the run when the first RET instruction is met.
The running_cond_update function is defined in the fetching_ip.cpp file and
shown in Listing 5.13.
Listing 5.13 The running_cond_update function
static void running_cond_update(
    instruction_t instruction,
    bit_t *is_running) {
#pragma HLS INLINE off
  *is_running = (instruction != RET);
}
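The RET constant compared against is the 32-bit encoding of the "jalr x0,0(ra)" pseudo-instruction, which appears as 00008067 in the run traces of Listings 5.57 and 5.59; it is presumably defined along these lines in the IP header:

#define RET 0x00008067 /* "ret", i.e. "jalr x0,0(ra)" */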
Experimentation
To simulate the fetching_ip, in Vitis_HLS select Open Project, navigate to the
fetching_ip/fetching_ip folder and click on Open (the folder to open is the second
fetching_ip in the hierarchy; the opened folder contains a .apc folder). Then in
the Vitis_HLS Explorer frame, right-click on TestBench, Add Test Bench File,
and add the testbench_fetching_ip.cpp file in the fetching_ip folder. In the Flow
Navigator frame, click on Run C Simulation and OK. The result of the run should
print in the fetching_ip_csim.log tab.
The testbench_fetching_ip.cpp file (see Listing 5.14) contains the main function
to simulate this IP. The main function calls the fetching_ip top function which runs
the RISC-V code placed in the code_ram array.
The array is initialized with a #include directive to the preprocessor. The test_op_imm_0_text.hex hexadecimal file to be included should have been built as explained in Sect. 3.1.3, with the objcopy and hexdump commands applied to the test_op_imm.elf file (this file is the result of compiling test_op_imm.s with the text section based at address 0).
The main function must have an int result (void is not permitted by Vitis_HLS).
In the fetching_ip project, the test_op_imm program is considered as an arbitrary
set of RISC-V instructions ended by a RET instruction.
When the simulation is successful, the synthesis can be done (Run C Synthesis).
Figure 5.5 shows the synthesizer report. The BRAM column gives the number of
BRAM blocks used in the FPGA (128 are used; 128 × 36 Kb or 4 KB blocks out
of 140, to map the fetching_ip top function arguments among which the 256 KB
code_ram). The FF column gives the number of Flip-Flops used by the implemented
logic (225 out of 106,400). The LUT column gives the number of LUTs used (272
out of 53,200).
Figure 5.6 shows the schedule. The VITIS_LOOP_20_1 loop (i.e. the do ... while loop, so named by the synthesizer because it starts at line 20 in the source code) has a latency of three FPGA cycles. With 10 ns FPGA cycles, the processor cycle is 30 ns (33 MHz).
The schedule shows that the functions fetch and execute are run in parallel (cycles
1 and 2 for fetch, cycle 2 for execute).
In the upper left part of the window, select the Module Hierarchy tab. Expand
VITIS_LOOP_20_1. Click on fetch to display the fetch function schedule and click
on the code_ram_addr(getelementptr) line.
In the fetch function schedule (Fig. 5.7), the code memory is loaded (code_ram_load(read)) during cycles 0 and 1. After one and a half FPGA cycles, the instruction is obtained (cycle 1).
If you click on the Properties tab in the bottom central frame, you get information on the signal you are viewing, as shown in Fig. 5.8.
The code_ram_addr(getelementptr) bus is 16 bits wide (Bit Width field in the Properties tab).
Click on code_ram_addr(getelementptr) to select it. Then right-click on it and validate Go to source.
The Code Source tab opens and displays the code of the fetch function in the
fetch.cpp file (see Fig. 5.9). Line 13 is highlighted which indicates that it is the source
of code_ram_addr(getelementptr), i.e. "*instruction = code_ram[pc]".
In the execute function schedule (Fig. 5.10), the pc argument is read at cycle 0,
i.e. the second cycle of the do ... while loop iteration (pc_read(read) in Fig. 5.10)
and incremented in the same cycle (add_ln232(+)).
Fig. 5.9 The code_ram_addr variable: the matching source code line
The Schedule Viewer also indicates the dependencies between the different phases
of the computation. In Fig. 5.8, there is an incoming purple arrow on the left of the
code_ram_load(read) line, coming from the code_ram_addr(getelementptr) line
(the memory read depends on the addressing pc).
In the Vivado tool, a block design can be built, as shown in Fig. 5.11.
You can use the prebuilt z1_fetching_ip.xpr file in the fetching_ip folder (in Vi-
vado, open project, navigate to the fetching_ip folder and open the z1_fetching_ip.xpr
file, then Open Block Design).
The report after the pre-generated bitstream is shown in Fig. 5.12 (Reports tab,
scroll down to Implementation/Place Design/Utilization - Place Design; in the
Utilization - Place Design - impl_1 frame, scroll down to 1. Slice Logic). The
Pynq-Z1 design uses 1538 LUTs (2.89%) instead of 272.
Experimentation
To run the fetching_ip on the development board, plug in your board and switch
it on.
Launch Vitis IDE (in a terminal with the /opt/Xilinx/Vitis/2022.1 folder as the
current directory, run the "source settings64.sh" command, then run the "vitis"
command).
Name the workspace as workspace_fetching_ip.
In a terminal, run "sudo putty", select Serial, set the Serial line as /dev/ttyUSB1,
set the Speed as 115200 and click on Open.
In Vitis IDE, build an Application Project, Create a new platform from
hardware (XSA), browse to the goossens-book-ip-projects/2022.1/fetching_ip
folder, select the design_1_wrapper.xsa file and click on Open, then on Next.
Name the Application project as z1_00, click on Next twice, then on Finish.
In a terminal with goossens-book-ip-projects/2022.1/fetching_ip as the current
directory, run the update_helloworld.sh shell script.
In Vitis IDE, click on the Launch Target Connection button (refer back to Fig.
2.88). In the Target Connections Dialog Box, expand Hardware Server. Double-
click on Local [default]. In the Target Connection Details pop-up window, click
on OK. Close the Target Connections Dialog Box.
In Vitis IDE Explorer frame, expand z1_00_system/z1_00/src and double click
on helloworld.c to open the file. You can replace the default helloworld program
with the code in the goossens-book-ip-projects/2022.1/fetching_ip/helloworld.c
file.
Right-click on z1_00_system, Build Project. Wait until the Build Finished mes-
sage is printed in the Console.
Right-click again on z1_00_system and Run As/Launch Hardware. The result
of the run (i.e. done) should be printed in the putty window.
The code in Listing 5.16 is the helloworld.c driver associated with the fetching IP (do not forget to adapt the path to the hex file to your environment with the update_helloworld.sh shell script). It mimics the main function in the testbench_fetching_ip.cpp file (i.e. it runs the IP and prints the done message at the end).
Listing 5.16 The helloworld.c file
#include <stdio.h>
#include "xfetching_ip.h"
#include "xparameters.h"
#define LOG_CODE_RAM_SIZE 16
// size in words
#define CODE_RAM_SIZE (1 << LOG_CODE_RAM_SIZE)
XFetching_ip_Config *cfg_ptr;
XFetching_ip ip;
word_type code_ram[CODE_RAM_SIZE] = {
#include "test_op_imm_0_text.hex"
};
int main() {
  cfg_ptr = XFetching_ip_LookupConfig(XPAR_XFETCHING_IP_0_DEVICE_ID);
  XFetching_ip_CfgInitialize(&ip, cfg_ptr);
  XFetching_ip_Set_start_pc(&ip, 0);
  XFetching_ip_Write_code_ram_Words(&ip, 0, code_ram, CODE_RAM_SIZE);
  XFetching_ip_Start(&ip);
  while (!XFetching_ip_IsDone(&ip));
  printf("done\n");
}
All the functions prefixed with XFetching_ip_ are defined in the xfetching_ip.h
file which is built by the Vitis_HLS tool.
A copy of xfetching_ip.h can be found in the Vitis_HLS project when exploring
solution1/impl/misc/drivers/fetching_ip_v1_0/src (it can also be viewed within the
Vitis IDE environment by double clicking on the xfetching_ip.h header file name in
the Outline frame on the right side of the Vitis IDE page).
The XFetching_ip_LookupConfig function builds a configuration structure associated with the IP given as the argument and defined as a constant in the xparameters.h file (XPAR_XFETCHING_IP_0_DEVICE_ID, which is the fetching_ip component in the Vivado design). It returns a pointer to the created structure.
The XFetching_ip_CfgInitialize function uses the configuration structure to initialize the IP structure passed as its first argument.
The XIp_name_LookupConfig and XIp_name_CfgInitialize functions should be used to create and handle any ip_name IP imported from the added repositories.
Once the fetching_ip has been created and can be addressed through the ip structure initialized by the XFetching_ip_CfgInitialize function, initial values should be sent.
The start_pc argument is initialized with a call to the XFetching_ip_Set_start_pc
function. There is one XFetching_ip_Set_ function for each scalar input argument
and one XFetching_ip_Get_ function per scalar output argument.
The code_ram array argument is initialized with a call to the XFetching_ip_Write_
code_ram_Words function which sends a succession of words on the AXI bus in
burst mode.
The fourth argument of the XFetching_ip_Write_code_ram_Words function de-
fines the number of words sent (CODE_RAM_SIZE). The second argument is the start
address of the write to the destination array (0).
Once the start pc and the code memory have been initialized, the fetching_ip can
be started with a call to the XFetching_ip_Start function. The Zynq7 Processing
System IP sends the start signal to the fetching_ip through the AXI interconnect.
The main function should then wait until the fetching_ip has finished its run, i.e.
has exited the do ... while loop of the fetching_ip top function. This is the role of the
while loop in the helloworld main function, controlled by the value returned by the
XFetching_ip_IsDone function.
You may feel a bit disappointed after the run on the FPGA, as the only output is
the done message. You know that your IP has done something but you do not see
what has been done as the debug messages showing the fetched code are not printed.
The prints will get more convincing in the next designs. However, you should keep in mind that a processor does not output anything by itself. Mainly, it computes in its internal registers, which are not visible from the external world, and stores to and loads from its memory. When the external world is given access to the memory, the result of a computation can be observed by dumping the data memory. Hence, the program run should store its results to memory rather than simply keeping them in registers.
Once an instruction has been fetched from memory, it must be decoded to prepare
its execution.
The instruction is a 32-bit word (in RV32I, 32 refers to the data width, not to the
instruction width: e.g. RV64I refers to 32-bit instructions operating on 64-bit data).
The instruction word is composed of fields as defined in Sect. 2.2 (Base Format)
of the RISC-V specification [1] and presented in Sect. 4.1.3 of the present book.
Decoding the instruction means decomposing a single word into the set of its
fields.
The components building the RISC-V instruction encoding are the major opcode,
the minor opcode func3, the source registers rs1 and rs2, the destination register rd,
an operation specifier func7, and an immediate value imm.
(Figure: the instruction word fields: on the right, opcode (bits 6-0), rd (11-7), func3 (14-12), rs1 (19-15), rs2 (24-20), and func7 (31-25); on the left, the immediate fragments inst_7, inst_11_8, inst_19_12, inst_20, inst_24_21, inst_30_25, and inst_31)
Table 5.1 Opcode and format association (opcodes with bits 4-2 between 000 and 011)

opcode[6:5][4:2]  000         001         010           011
00                LOAD        LOAD-FP     CUSTOM-0      MISC-MEM
                  I-TYPE      OTHER-TYPE  OTHER-TYPE    OTHER-TYPE
01                STORE       STORE-FP    CUSTOM-1      AMO
                  S-TYPE      OTHER-TYPE  OTHER-TYPE    OTHER-TYPE
10                MADD        MSUB        NMSUB         NMADD
                  OTHER-TYPE  OTHER-TYPE  OTHER-TYPE    OTHER-TYPE
11                BRANCH      JALR        RESERVED-1    JAL
                  B-TYPE      I-TYPE      OTHER-TYPE    J-TYPE
Table 5.2 Opcode and format association (opcodes with bits 4-2 between 100 and 111)

opcode[6:5][4:2]  100         101         110             111
00                OP-IMM      AUIPC       OP-IMM-32       RV48-0
                  I-TYPE      U-TYPE      OTHER-TYPE      OTHER-TYPE
01                OP          LUI         OP-32           RV64
                  R-TYPE      U-TYPE      OTHER-TYPE      OTHER-TYPE
10                OP-FP       RESERVED-0  CUSTOM2-RV128   RV48-1
                  OTHER-TYPE  OTHER-TYPE  OTHER-TYPE      OTHER-TYPE
11                SYSTEM      RESERVED-2  CUSTOM3-RV128   RV80
                  OTHER-TYPE  OTHER-TYPE  OTHER-TYPE      OTHER-TYPE
I have added two more format numbers: UNDEFINED-TYPE (format number 0) and
OTHER-TYPE (format number 7, used for all the instructions not part of the RV32I
set).
Tables 5.1 and 5.2 show the association between opcodes and formats according to the RISC-V specification. The association is not limited to the RV32I subset: it covers all the instructions of the full unprivileged RISC-V ISA.
Table 5.1 shows the opcodes whose bits [4:2] range from 0b000 to 0b011.
Table 5.2 shows the opcodes whose bits [4:2] range from 0b100 to 0b111.
Figure 5.14 shows the circuits to decode the instruction formats. It is a set of four multiplexers in parallel followed by a fifth one in series. The four multiplexers each select one among their eight 3-bit format number entries. The selected entry is the one addressed by the 3-bit selection code built from the instruction word bits 2-4 (named opcl in Fig. 5.14).
For example, when the opcl 3-bit code is 0b100, the four multiplexers output their entry number 4 (entries being numbered from 0 to 7, from the top to the bottom of the multiplexer), i.e. the I-TYPE encoding for the top first multiplexer (0b010, format number 2), the R-TYPE encoding for the second multiplexer (0b001), the OTHER-TYPE encoding for the third multiplexer (0b111), and the OTHER-TYPE encoding again for the fourth, bottom multiplexer.
(Fig. 5.14: the format decoding circuit: the instruction opcode is split into opch (bits 6-5) and opcl (bits 4-2); four mux 3*8 -> 3 multiplexers in parallel, addressed by the opcl 3-bit code, each output one 3-bit format number (i-type, u-type, s-type, r-type, b-type, j-type, or o-type entries); a final mux 3*4 -> 3, addressed by the opch 2-bit code, selects the 3-bit format among their outputs)
The rightmost multiplexer selects one of the outputs of the four leftmost multi-
plexers, according to its 2-bit opch code (instruction word bits 5 and 6).
For example, when the 2-bit code is 0b00, the rightmost multiplexer outputs the
I-TYPE encoding (0b010) coming from the uppermost left multiplexer. Hence, the
format associated to opcode 0b00100 (OP-IMM) is I-TYPE.
The synthesizer simplifies the circuits. For example, the third left multiplexer has eight times the same OTHER-TYPE input. It can be eliminated and replaced by driving the OTHER-TYPE code directly to the second entry of the right multiplexer (entries numbered from 0 to 3, from the top to the bottom of the multiplexer).
All the source files related to the fetching_decoding_ip can be found in the fetch-
ing_decoding_ip folder.
The fetching_decoding_ip function is defined in the fetching_decoding_ip.cpp
file (see Listing 5.8).
There are a few differences compared to the preceding fetching_ip top function
shown in Figs. 5.5 and 5.6.
I have added a third argument (nb_instruction) to the IP to provide the number
of instructions fetched and decoded. It helps to check that the correct path has been
followed.
The instruction count is computed in a local counter nbi which is copied to the
nb_instruction output argument at the end of the run. The nbi counter is updated
in the statistic_update function shown in Listing 5.18 and defined in the fetch-
ing_decoding_ip.cpp file.
Listing 5.18 The statistic_update function
static void statistic_update(
    unsigned int *nbi) {
#pragma HLS INLINE off
  *nbi = *nbi + 1;
}
As every iteration fetches and decodes one instruction, the number of instructions
fetched and decoded is also the number of IP cycles.
The nb_instruction argument has been associated with the s_axilite INTERFACE, like all the other arguments. But, as it is an output, it is passed as a pointer.
I have used the HLS PIPELINE pragma to keep the loop iteration duration within three FPGA cycles.
The HLS PIPELINE II=3 pragma sets an Initiation Interval (II) of 3. This interval is the number of cycles before the next iteration can start.
If a loop iteration has a duration of d cycles and the II is set as i (with i <= d), then an iteration starts every i cycles and each new iteration overlaps its predecessor by d-i cycles.
The lower the II, the faster the design. The II value gives the throughput of the
pipeline. With II=3, the pipeline outputs one instruction every three cycles. With
II=1, the output rate is one instruction per cycle, i.e. three times more.
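More generally, for a loop of n iterations of latency d FPGA cycles each, the pipelined run takes about d + (n - 1) ∗ i cycles instead of n ∗ d. For example, with hypothetical values n = 100 and d = 3, II=3 gives 3 + 99 ∗ 3 = 300 cycles, while II=1 would give 3 + 99 = 102 cycles.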
(Fig. 5.15: the FPGA cycles and IP cycles with HLS PIPELINE II=3 versus II=1)
Figure 5.15 shows the difference between HLS PIPELINE II=3 and HLS
PIPELINE II=1.
When II is set to i, it indicates that the next iteration should, if possible, start i
FPGA cycles after the current one.
In my design, an II of 1 would lead to two overlapping FPGA cycles for two
successive iterations. This is incompatible with the fact that the next pc is computed
during the third cycle (it takes three FPGA cycles to fetch, decode, and select the
incremented pc in the execute function), too late to be used by the next iteration if
it is scheduled one FPGA cycle after its predecessor. Even an II of 2 would fail (in
this case, the synthesizer detects an II violation).
The do ... while loop contains five function calls, with the addition of the calls to the decode function and to the statistic_update one.
Even though the fetching_decoding_ip has no register file yet, the instruction
encoding contains fields using register numbers (first source register rs1, second
source register rs2, and destination register rd).
Listing 5.21 shows the type definitions in the fetching_decoding_ip.h file.
Listing 5.21 The type definitions in the fetching_decoding_ip.h file
...
typedef unsigned int instruction_t;
typedef ap_uint<LOG_CODE_RAM_SIZE> code_address_t;
typedef ap_uint<3> type_t;
typedef ap_int<20> immediate_t;
typedef ap_int<12> i_immediate_t;
typedef ap_int<12> s_immediate_t;
typedef ap_int<12> b_immediate_t;
typedef ap_int<20> u_immediate_t;
typedef ap_int<20> j_immediate_t;
typedef ap_uint<5> opcode_t;
typedef ap_uint<5> reg_num_t;
The fetch function and the running_cond_update function are unchanged from the
fetching_ip.
As a general rule, the successive IPs you will build are incrementally designed.
A function is changed only if it is necessary in the new implementation. Otherwise,
what has been designed in an IP is kept in the successors.
The decode function is defined in the decode.cpp file (see Listing 5.24).
Listing 5.24 The decode function in the decode.cpp file
void decode(
    instruction_t instruction,
    decoded_instruction_t *d_i) {
#pragma HLS INLINE off
  decode_instruction(instruction, d_i);
  decode_immediate(instruction, d_i);
#ifndef __SYNTHESIS__
#ifdef DEBUG_DECODE
  print_decode(*d_i);
#endif
#endif
}
The type field (i.e. the instruction format) is set by the type function, according
to the instruction opcode.
The right shift applied to opcode to set opch in the "opch = opcode>>3" instruction is synthesized as a selection of the two upper bits of opcode.
The "opcl = opcode" instruction shrinks the opcode value to the three-bit size of the opcl destination variable.
The switch-case branch on the opch value corresponds to the rightmost multi-
plexer in Fig. 5.14.
Listing 5.26 The type function in the type.cpp file
type_t type(opcode_t opcode) {
#pragma HLS INLINE
  ap_uint<2> opch;
  ap_uint<3> opcl;
  opch = opcode >> 3;
  opcl = opcode;
  switch (opch) {
    case 0b00: return type_00(opcl);
    case 0b01: return type_01(opcl);
    case 0b10: return type_10(opcl);
    case 0b11: return type_11(opcl);
  }
  return UNDEFINED_TYPE;
}
Functions type_00 (LOAD, OP_IMM, and AUIPC), type_01 (STORE, OP, and LUI),
type_10 (instructions out of the RV32I ISA) and type_11 (BRANCH, JALR, and JAL)
are also defined in the type.cpp file.
They correspond to the four leftmost multiplexers in Fig. 5.14.
I only present the type_00 function (see Listing 5.27) as the other functions are
based on the same model.
Listing 5.27 The type_00 function in the type.cpp file
static type_t type_00(ap_uint<3> opcl) {
#pragma HLS INLINE
  switch (opcl) {
    case 0b000: return I_TYPE;     // LOAD
    case 0b001: return OTHER_TYPE; // LOAD-FP
    case 0b010: return OTHER_TYPE; // CUSTOM-0
    case 0b011: return OTHER_TYPE; // MISC-MEM
    case 0b100: return I_TYPE;     // OP-IMM
    case 0b101: return U_TYPE;     // AUIPC
    case 0b110: return OTHER_TYPE; // OP-IMM-32
    case 0b111: return OTHER_TYPE; // RV48-0
  }
  return UNDEFINED_TYPE;
}
The type functions will be left unchanged in all the successive designs you will
build.
The execute function is defined in the execute.cpp file. It is shown in Listing 5.30.
It computes the next pc according to the instruction format saved in the d_i
structure after decoding.
Listing 5.30 The execute function
void execute(
    code_address_t pc,
    decoded_instruction_t d_i,
    code_address_t *next_pc) {
#pragma HLS INLINE off
  *next_pc = compute_next_pc(pc, d_i);
}
The compute_next_pc function (see Listing 5.31) is also defined in the exe-
cute.cpp file. It is a slight extension of the version presented in Listing 5.12. It
takes care of immediate jump instructions (JAL opcode, e.g. "jal foo" to call the foo
function).
The JAL instructions belong to the J-TYPE. They contain an encoded constant
which is a displacement from the instruction position. The processor adds this dis-
placement to the current pc to set the next pc.
The displacement value is extracted from the instruction by the j_immediate
function and decoded in the d_i.imm field.
For all the other instructions (including BRANCH ones), the next pc is always set
to point to the next instruction (i.e. pc + 1). Conditional branches and indirect jumps
will be considered in the fde_ip design in Sect. 5.5.
The RISC-V specification states that the J-TYPE constant should be multiplied by 2, i.e. shifted one position to the left, before being added to the current pc to form the jump target address.
But, as the two lowest bits are dropped to address the code_ram word memory (pc is a word pointer), the decoded displacement is instead shifted one position to the right ("next_pc = pc + (d_i.imm>>1)").
Listing 5.31 The compute_next_pc function
code_address_t compute_next_pc(
    code_address_t pc,
    decoded_instruction_t d_i) {
#pragma HLS INLINE
  code_address_t next_pc;
  switch (d_i.type) {
    case R_TYPE:
      next_pc = pc + 1;
      break;
    ...
    case J_TYPE:
      next_pc = pc + (d_i.imm >> 1);
      break;
    default:
      next_pc = pc + 1;
      break;
  }
  return next_pc;
}
The d_i.imm field has type immediate_t (see Listing 5.22), which is 20 bits wide.
However, the next_pc destination of the computation involving the d_i.imm value
has type code_address_t, which is 16 bits wide (LOG_CODE_RAM_SIZE; see Listing
5.8).
The synthesizer shrinks the d_i.imm value to its 16 least significant bits and adds
this 16-bit value to pc.
This can be checked after synthesis on the Schedule Viewer by navigating down
to the execute graph. Then, you look at the Properties of the select_ln7_2(select)
line (i.e. line 7 in the execute.cpp file, i.e. the "switch(d_i.type)" instruction in the
compute_next_pc function).
The Bit Width property is 16 bits (see Fig. 5.16).
The next line (next_pc(+)) is the addition with pc. Again the Bit Width is 16.
Experimentation
To simulate the fetching_decoding_ip, operate as explained in Sect. 5.3.6, re-
placing fetching_ip with fetching_decoding_ip.
As a general rule, do not pay too much attention to the Timing Violation warnings,
except when they concern the end of the IP cycle, i.e. operations which do not fit in
the last FPGA cycle of the main loop.
You can try to use the synthesis in Vivado. If the IP works fine on the development
board, you can forget about the timing violation.
Figure 5.19 shows the fetching_decoding_ip schedule. The loop latency is three FPGA cycles (30 ns, 33 MHz), as expected from the HLS PIPELINE II=3 pragma (refer back to Listing 5.24).
Experimentation
To run the fetching_decoding_ip on the development board, proceed as explained
in Sect. 5.3.10, replacing fetching_ip with fetching_decoding_ip.
The code in the helloworld.c file is given in Listing 5.34 (do not forget to adapt
the path to the hex file to your environment with the update_helloworld.sh shell
script).
Listing 5.34 The helloworld.c file in the fetching_decoding_ip folder
#include <stdio.h>
#include "xfetching_decoding_ip.h"
#include "xparameters.h"
#define LOG_CODE_RAM_SIZE 16
// size in words
#define CODE_RAM_SIZE (1 << LOG_CODE_RAM_SIZE)
XFetching_decoding_ip_Config *cfg_ptr;
XFetching_decoding_ip ip;
word_type code_ram[CODE_RAM_SIZE] = {
#include "test_op_imm_0_text.hex"
};
int main() {
  cfg_ptr = XFetching_decoding_ip_LookupConfig(
      XPAR_XFETCHING_DECODING_IP_0_DEVICE_ID);
  XFetching_decoding_ip_CfgInitialize(&ip, cfg_ptr);
  XFetching_decoding_ip_Set_start_pc(&ip, 0);
  XFetching_decoding_ip_Write_code_ram_Words(&ip, 0, code_ram,
      CODE_RAM_SIZE);
  XFetching_decoding_ip_Start(&ip);
  while (!XFetching_decoding_ip_IsDone(&ip));
  printf("%d fetched and decoded instructions\n",
      (int)XFetching_decoding_ip_Get_nb_instruction(&ip));
}
Listing 5.35 shows what the run of the RISC-V code in the test_op_imm.s file
should print on the putty terminal.
Listing 5.35 The helloworld print when running the test_op_imm RISC-V code on the Pynq Z1
board
14 fetched and decoded instructions
5.5 Third Step: Filling the Execute Stage to Build the Register Path
All the source files related to the fde_ip design can be found in the fde_ip folder.
The Vitis_HLS fde_ip project (fetch, decode, and execute) adds a register file to the processor. Figure 5.22 shows the fde_ip component. Contrary to the code_ram memory block, the reg_file entity belongs to the IP and is not externally visible.
A register file is a multi-ported memory. The fde_ip register file groups 32 reg-
isters. Each register memorizes a 32-bit value. The register file can be addressed in
three simultaneous ways: reading from two sources and writing to one destination,
with three different ports.
For example, instruction "add a0, a1, a2" reads registers a1 and a2 and writes their
sum into register a0.
Listing 5.36 shows the prototype, local declarations, and initializations of the
fde_ip top function defined in the fde_ip.cpp file.
The IP clears all the registers before it starts running the code. This is my choice:
it is not part of the RISC-V specification.
The HLS ARRAY_PARTITION pragma is used to partition the reg_file variable. The chosen partitioning indicates to the synthesizer that each element of the array (i.e. each register) should be considered as individually accessible. Consequently, the one-dimensional array will be mapped on FPGA flip-flops instead of a BRAM block (i.e. a memory).
When an array is implemented with a BRAM block, it has at most two access
ports (i.e. you can access at most two entries of the array simultaneously). When it
is implemented with flip-flops (i.e. one flip-flop per memorized bit), all the entries
can be accessed simultaneously.
Small arrays should be implemented as flip-flops through an ARRAY_PARTITION pragma, if only to keep the BRAM blocks for big arrays (e.g. the code and data memories).
(Fig. 5.22: the fde_ip component: the pc register feeds the fetch, decode, and execute chain; fetch reads the instruction from code_ram, decode produces d_i, and execute reads and writes the IP-internal reg_file and computes next_pc from current_pc)
Listing 5.36 The fde_ip function prototype, local variable declarations, and initializations
void fde_ip(
    unsigned int start_pc,
    unsigned int code_ram[CODE_RAM_SIZE],
    unsigned int *nb_instruction) {
#pragma HLS INTERFACE s_axilite port=start_pc
#pragma HLS INTERFACE s_axilite port=code_ram
#pragma HLS INTERFACE s_axilite port=nb_instruction
#pragma HLS INTERFACE s_axilite port=return
  code_address_t pc;
  int reg_file[NB_REGISTER];
#pragma HLS ARRAY_PARTITION variable=reg_file dim=1 complete
  instruction_t instruction;
  bit_t is_running;
  unsigned int nbi;
  decoded_instruction_t d_i;
  for (int i = 0; i < NB_REGISTER; i++) reg_file[i] = 0;
  pc = start_pc;
  nbi = 0;
  ...
The HLS PIPELINE II=6 pragma in the do .. while loop of the fde_ip top function
(see Listing 5.37) bounds the IP cycle to six FPGA cycles (16.67 MHz). As the
complexity increases, the computations to be done in one processor cycle need more
time.
Listing 5.37 The fde_ip function do ... while loop
...
  do {
#pragma HLS PIPELINE II=6
    fetch(pc, code_ram, &instruction);
    decode(instruction, &d_i);
#ifndef __SYNTHESIS__
#ifdef DEBUG_DISASSEMBLE
    disassemble(pc, instruction, d_i);
#endif
#endif
    execute(pc, reg_file, d_i, &pc);
    statistic_update(&nbi);
    running_cond_update(instruction, pc, &is_running);
  } while (is_running);
  *nb_instruction = nbi;
#ifndef __SYNTHESIS__
#ifdef DEBUG_REG_FILE
  print_reg(reg_file);
#endif
#endif
}
The fde_ip.h file contains the constants associated to the register file size, as shown
in Listing 5.42.
Listing 5.42 The fde_ip.h file (register file size related constants)
...
#define LOG_REG_FILE_SIZE 5
#define NB_REGISTER (1 << LOG_REG_FILE_SIZE)
...
It also contains (see Listing 5.43) the constants defining the RISC-V com-
parison operators matching the different branch instructions (B-TYPE format, op-
code BRANCH). These constants are part of the RISC-V specification (refer to [1],
Chap. 24, p. 130, funct3 field).
Listing 5.43 The fde_ip.h file (comparison operator related constants)
...
#define BEQ  0
#define BNE  1
#define BLT  4
#define BGE  5
#define BLTU 6
#define BGEU 7
...
The arithmetic and logical operator related constants are defined (see Listing
5.44). They match the computation instructions between two register sources (R-
TYPE format, opcode OP). These constants are also part of the RISC-V specification
(refer to [1], Chap. 24, p. 130, funct3 field).
Listing 5.44 The fde_ip.h file (arithmetic and logical operator related constants for register-register
instructions)
...
#define ADD  0
#define SUB  0
#define SLL  1
#define SLT  2
#define SLTU 3
#define XOR  4
#define SRL  5
#define SRA  5
#define OR   6
#define AND  7
...
The immediate version of the same operators related constants are defined (see
Listing 5.45). They match the computation instructions between one register source
and one constant (I-TYPE format, opcode OP-IMM). These constants are part of the
RISC-V specification too (refer to [1], Chap. 24, p. 130, funct3 field).
Listing 5.45 The fde_ip.h file (arithmetic and logical operator related constants for register-
constant instructions)
...
#define ADDI  0
#define SLLI  1
#define SLTI  2
#define SLTIU 3
#define XORI  4
#define SRLI  5
#define SRAI  5
#define ORI   6
#define ANDI  7
...
The fde_ip.h file also contains the definition of the register file related types
reg_num_t and reg_num_p1_t (p1 stands for plus 1), as shown in Listing 5.46.
Listing 5.46 The fde_ip.h file (reg_num_t and reg_num_p1_t types)
...
typedef ap_uint<LOG_REG_FILE_SIZE+1> reg_num_p1_t;
typedef ap_uint<LOG_REG_FILE_SIZE> reg_num_t;
...
Type names with the p1_t suffix indicate that the variables of the type use one
more bit. Such a plus 1 type is necessary when a variable is used as a loop control.
For example, in "for (i=0; i<16; i++)", variable i should be five bits wide to be
compared to constant 16, i.e. 0b10000 in binary, even though in the loop i ranges
between 0 and 15, i.e. requires only four bits. Hence, I would write what is shown
in Listing 5.47.
Listing 5.47 Plus 1 type example
typedef ap_uint<4> loop_counter_t;
typedef ap_uint<5> loop_counter_p1_t;
loop_counter_p1_t i1;
loop_counter_t i;
for (i1 = 0; i1 < 16; i1++) {
  i = i1; // "i1" is shrunk to 4 bits to fit in "i"
  ...     // iteration body using "i"
}
The synthesizer produces 4-bit values each time variable i is used throughout the loop body and, only for the loop control, generates a 5-bit value to increment and test variable i1.
Other constants and types in the fde_ip.h file are unchanged from the fetch-
ing_decoding_ip project.
As in the fetching_decoding_ip project, the main loop still contains the five function
calls: fetch, decode, execute, statistic_update, and running_cond_update.
The fetch function is unchanged (fetch.cpp file).
The decode function (see Listing 5.48; decode.cpp file) is unchanged except that
the debugging print of the decoding has been removed.
(Fig. 5.24: the register path in the execute function: read_reg reads rv1 and rv2 from the reg_file entries d_i.rs1 and d_i.rs2, compute_result produces result, and write_reg writes it back to the reg_file)
The emulate function is a debugging feature to print the updates to the register
file (if the instruction writes to a destination register) and to the pc (if the instruction
is a jump or a taken branch).
The emulate function is an equivalent of the spike simulator (the difference
between an emulator and a simulator is not significant enough to deserve more
explanation; you can just take the two terms as synonyms in our context).
I do not present the function. You can find its full code in the emulate.cpp file.
Figure 5.24 shows the read and write accesses from and to the register file. The
write_enable signal enables the writing access to the register file when the signal is
set (i.e. when the destination register is not register zero, as specified in the RISC-V
ISA, and when the instruction is not a conditional branch; notice that if the instruction
is a JAL or a JALR, there is a destination: they write a link address to destination rd).
The code of the read_reg and write_reg functions is shown in Listing 5.50.
Listing 5.50 The read_reg and write_reg functions in the execute.cpp file
static void read_reg(
    int *reg_file,
    reg_num_t rs1,
    reg_num_t rs2,
    int *rv1,
    int *rv2) {
#pragma HLS INLINE
  *rv1 = reg_file[rs1];
  *rv2 = reg_file[rs2];
}
static void write_reg(
    int *reg_file,
    decoded_instruction_t d_i,
    int result) {
#pragma HLS INLINE
  if (d_i.rd != 0 &&
      d_i.opcode != BRANCH &&
      d_i.opcode != STORE)
    reg_file[d_i.rd] = result;
}
(Fig. 5.25: the per-format result computation: a mux 2*32 -> 32 controlled by d_i.opcode == LUI builds the U-TYPE result from d_i.imm, with pc added for AUIPC; the J-TYPE result is the link address pc4 + 4; a final multiplexer controlled by d_i.type selects the result)
5.5.7 Computing
Figure 5.25 shows how the different results according to the instruction format are
computed and how the final result is selected.
The compute_result function code is shown in Listing 5.51.
The rightmost multiplexer of Fig. 5.25 is implemented as a switch on the d_i.type variable (in the figure, the multiplexer is named mux 8*32 -> 32, meaning that it selects one 32-bit word out of eight).
Listing 5.51 The compute_result function in the execute.cpp file
static int compute_result(
    int rv1,
    int rv2,
    decoded_instruction_t d_i,
    code_address_t pc) {
#pragma HLS INLINE
  int imm12 = ((int)d_i.imm) << 12;
  code_address_t pc4 = pc << 2;
  code_address_t npc4 = pc4 + 4;
  int result;
  switch (d_i.type) {
    case R_TYPE:
      ...
(Figure: the OP/OP-IMM computation unit: a mux 2*5 -> 5 controlled by d_i.type == R_TYPE selects the shift amount between rv2 and instruction bits 24-20; the add/sub (controlled by r_type && f7_6, with f7_6 = d_i.func7>>5), sll, slt, sltu, xor, srl/sra, or, and and results are computed in parallel, and a mux 8*32 -> 32 controlled by d_i.func3 selects the op result)
It should be understood that the unit computes all the possible results in parallel
and the func3 field is used to select the instruction result among them (remember the
general programming concepts for HLS in Sect. 5.1: computing more can improve
the critical path).
The code to implement the compute_op_result function is shown in Listing 5.52.
It is defined in the execute.cpp file.
Listing 5.52 The compute_op_result function in the execute.cpp file
static int compute_op_result(
    int rv1,
    int rv2,
    decoded_instruction_t d_i) {
#pragma HLS INLINE
  bit_t f7_6 = d_i.func7 >> 5;
  bit_t r_type = d_i.type == R_TYPE;
  ap_uint<5> shift;
  int result;
  if (r_type)
    shift = rv2;
  else // I_TYPE
    shift = d_i.rs2;
  switch (d_i.func3) {
    case ADD:  if (r_type && f7_6)
                 result = rv1 - rv2; // SUB
               else
                 result = rv1 + rv2;
               break;
    case SLL:  result = rv1 << shift;
               break;
    case SLT:  result = rv1 < rv2;
               break;
    case SLTU: result = (unsigned int)rv1 < (unsigned int)rv2;
               break;
    case XOR:  result = rv1 ^ rv2;
               break;
    case SRL:  if (f7_6)
                 result = rv1 >> shift; // SRA
               else
                 result = (unsigned int)rv1 >> shift;
               break;
    case OR:   result = rv1 | rv2;
               break;
    case AND:  result = rv1 & rv2;
               break;
  }
  return result;
}
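Note how a single func3 value can serve two instructions: ADD and SUB share func3 = 0, and SRL and SRA share func3 = 5 (see Listing 5.44). In both cases, bit 5 of func7 (the f7_6 bit: func7 is 0b0000000 for ADD and SRL, 0b0100000 for SUB and SRA in the RISC-V specification) selects which of the two results is produced.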
Experimentation
To simulate the fde_ip, operate as explained in Sect. 5.3.6, replacing fetching_ip
with fde_ip.
You can play with the simulator, replacing the included test_mem_0_text.hex
file with any other .hex file you find in the same folder.
I have added six new test programs: test_branch.s to test the BRANCH instructions, test_jal_jalr.s to test the JAL and JALR instructions, test_lui_auipc.s to test the LUI and AUIPC instructions, test_op.s to test the OP instructions, and test_sum.s to sum the first 10 natural numbers. These test programs will be used to test all the processor IPs (more test programs will be added in the next chapter for the data memory manipulations).
The testbench_fde_ip.cpp file shown in Listing 5.55 is applied to the
test_op_imm_0_text.hex hexadecimal code file (obtained from the test_op_imm.s
source).
To run another test code, you just need to replace the test_op_imm_0_text.hex
name in the code_ram array declaration. To build any .hex file, use the build.sh script:
"./build.sh test_branch" builds test_branch_0_text.hex (do not pay attention to the
warning message).
Anyway, the fde_ip folder contains prebuilt hex files for all the proposed test
codes.
Listing 5.55 The testbench_fde_ip.cpp file
#include <stdio.h>
#include "fde_ip.h"
unsigned int code_ram[CODE_RAM_SIZE] = {
#include "test_sum_0_text.hex"
};
int main() {
  unsigned int nbi;
  fde_ip(0, code_ram, &nbi);
  printf("%d fetched, decoded and executed instructions\n",
      nbi);
  return 0;
}
The test_branch.s file contains code to test the branch instructions (see Listing 5.56).
Listing 5.56 The test_branch.s file
.globl main
main:
  li   a0,-8        /* a0=-8 */
  li   a1,5         /* a1=5 */
  beq  a0,a1,.L1    /* if (a0==a1) goto .L1 */
  li   a2,1         /* a2=1 */
.L1:
  bne  a0,a1,.L2    /* if (a0!=a1) goto .L2 */
  li   a2,2         /* a2=2 */
.L2:
  blt  a0,a1,.L3    /* if (a0<a1) goto .L3 */
  li   a3,1         /* a3=1 */
.L3:
  bge  a0,a1,.L4    /* if (a0>=a1) goto .L4 */
  li   a3,2         /* a3=2 */
.L4:
  bltu a0,a1,.L5    /* if (a0<a1) goto .L5 (unsigned) */
  li   a4,1         /* a4=1 */
.L5:
  bgeu a0,a1,.L6    /* if (a0>=a1) goto .L6 (unsigned) */
  li   a4,2         /* a4=2 */
.L6:
  ret
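In C terms, the test amounts to the following sketch (its final state matches the run output shown below):
/* C equivalent of test_branch.s, with a0 = -8 and a1 = 5 */
#include <stdio.h>
int main() {
  int a0 = -8, a1 = 5, a2 = 0, a3 = 0, a4 = 0;
  if (!(a0 == a1)) a2 = 1;                     /* beq not taken: li a2,1 runs */
  if (!(a0 != a1)) a2 = 2;                     /* bne taken: li a2,2 skipped */
  if (!(a0 <  a1)) a3 = 1;                     /* blt taken: li a3,1 skipped */
  if (!(a0 >= a1)) a3 = 2;                     /* bge not taken: li a3,2 runs */
  if (!((unsigned)a0 <  (unsigned)a1)) a4 = 1; /* bltu not taken: -8 is a huge unsigned value */
  if (!((unsigned)a0 >= (unsigned)a1)) a4 = 2; /* bgeu taken: li a4,2 skipped */
  printf("a2=%d a3=%d a4=%d\n", a2, a3, a4);   /* prints a2=1 a3=2 a4=1 */
  return 0;
}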
Its run produces the output shown in Listing 5.57 (SYMB_REG defined in the
print.h file; registers containing a null value are not shown).
Listing 5.57 The test_branch.s code output
0000: ff800513   li   a0, -8
        a0 = -8 (fffffff8)
0004: 00500593   li   a1, 5
        a1 = 5 (5)
0008: 00b50463   beq  a0, a1, 16
        pc = 12 (c)
0012: 00100613   li   a2, 1
        a2 = 1 (1)
0016: 00b51463   bne  a0, a1, 24
        pc = 24 (18)
0024: 00b54463   blt  a0, a1, 32
        pc = 32 (20)
0032: 00b55463   bge  a0, a1, 40
        pc = 36 (24)
0036: 00200693   li   a3, 2
        a3 = 2 (2)
0040: 00b56463   bltu a0, a1, 48
        pc = 44 (2c)
0044: 00100713   li   a4, 1
        a4 = 1 (1)
0048: 00b57463   bgeu a0, a1, 56
        pc = 56 (38)
0056: 00008067   ret
        pc = 0 (0)
...
a0 = -8 (fffffff8)
a1 = 5 (5)
a2 = 1 (1)
a3 = 2 (2)
a4 = 1 (1)
...
12 fetched and decoded instructions
The test_jal_jalr.s file (see Listing 5.58) contains code to test the jump and link instructions JAL and JALR.
Listing 5.58 The test_jal_jalr.s file
.globl main
main:
  mv    t0,ra       /* t0 = ra (save return address) */
here0:
  auipc a0,0        /* a0 = pc+0 (a0=4) */
here1:
  auipc a1,0        /* a1 = pc+0 (a1=8) */
  li    a2,0        /* a2=0 */
  li    a4,0        /* a4=0 */
  j     .L1         /* goto .L1 */
.L1:
  addi  a2,a2,1     /* a2++ */
  jal   f           /* f() (call f) */
  li    a3,3        /* a3=3 */
  jalr  52(a1)      /* (*(a1+52))() (call f) */
  jr    44(a0)      /* goto *(a0+44) (goto there) */
  addi  a4,a4,1     /* a4++ */
there:
  addi  a4,a4,1     /* a4++ */
  mv    ra,t0       /* ra = t0 (restore return address) */
  ret
f:
  addi  a2,a2,1     /* a2++ */
  ret
It produces the output shown in Listing 5.59 (SYMB_REG defined; registers con-
taining a null value are not shown).
Listing 5.59 The test_jal_jalr.s code output
0000: 00008293   addi  t0, ra, 0
        t0 = 0 (0)
0004: 00000517   auipc a0, 0
        a0 = 4 (4)
0008: 00000597   auipc a1, 0
        a1 = 8 (8)
0012: 00000613   li    a2, 0
        a2 = 0 (0)
0016: 00000713   li    a4, 0
        a4 = 0 (0)
0020: 0040006f   j     24
        pc = 24 (18)
0024: 00160613   addi  a2, a2, 1
        a2 = 1 (1)
0028: 020000ef   jal   ra, 60
        pc = 60 (3c)
        ra = 32 (20)
0060: 00160613   addi  a2, a2, 1
        a2 = 2 (2)
0064: 00008067   ret
        pc = 32 (20)
0032: 00300693   li    a3, 3
        a3 = 3 (3)
0036: 034580e7   jalr  52(a1)
        pc = 60 (3c)
        ra = 40 (28)
0060: 00160613   addi  a2, a2, 1
        a2 = 3 (3)
0064: 00008067   ret
        pc = 40 (28)
0040: 02c50067   jr    44(a0)
        pc = 48 (30)
0048: 00170713   addi  a4, a4, 1
        a4 = 1 (1)
0052: 00028093   addi  ra, t0, 0
        ra = 0 (0)
0056: 00008067   ret
        pc = 0 (0)
...
a0 = 4 (4)
a1 = 8 (8)
a2 = 3 (3)
a3 = 3 (3)
a4 = 1 (1)
...
18 fetched and decoded instructions
The test_lui_auipc.s file (see Listing 5.60) contains code to test the upper instructions LUI and AUIPC.
Listing 5.60 The test_lui_auipc.s file
.globl main
main:
  lui   a1,0x1      /* a1 = (1<<12) (4096) */
  auipc a2,0x1      /* a2 = pc+(1<<12) (pc+4096) */
  sub   a2,a2,a1    /* a2 -= a1 */
  addi  a2,a2,20    /* a2 += 20 */
  jr    a2          /* goto a2 (.L1) */
  li    a1,3        /* a1=3 */
.L1:
  li    a3,100      /* a3=100 */
  ret
It produces the output shown in Listing 5.61 (SYMB_REG defined; registers con-
taining a null value are not shown).
Listing 5.61 The test_lui_auipc.s code output
0000: 000015b7   lui   a1, 4096
        a1 = 4096 (1000)
0004: 00001617   auipc a2, 4096
        a2 = 4100 (1004)
0008: 40b60633   sub   a2, a2, a1
        a2 = 4 (4)
0012: 01460613   addi  a2, a2, 20
        a2 = 24 (18)
0016: 00060067   jr    a2
        pc = 24 (18)
0024: 06400693   li    a3, 100
        a3 = 100 (64)
0028: 00008067   ret
        pc = 0 (0)
...
a1 = 4096 (1000)
a2 = 24 (18)
a3 = 100 (64)
...
7 fetched and decoded instructions
The test_op.s file (see Listing 5.62) contains code to test the OP instructions (register-register operations, i.e. with two register sources and no immediate value).
Listing 5.62 The test_op.s file
.globl main
main:
  li   a0,13        /* a0=13 */
  li   a4,12        /* a4=12 */
  li   a1,7         /* a1=7 */
  li   t0,28        /* t0=28 */
  li   t6,-10       /* t6=-10 */
  li   s2,2022      /* s2=2022 */
  add  a2,a1,zero   /* a2 = a1 */
  and  a3,a2,a0     /* a3 = a2 & a0 */
  or   a5,a3,a4     /* a5 = a3 | a4 */
  xor  a6,a5,t0     /* a6 = a5 ^ t0 */
  sub  a6,a6,a1     /* a6 -= a1 */
  sltu a7,a6,a0     /* a7 = a6 < a0 (unsigned) */
  sll  t1,a6,t0     /* t1 = a6 << t0 */
  slt  t2,t1,t6     /* t2 = t1 < t6 (signed) */
  sltu t3,t1,s2     /* t3 = t1 < s2 (unsigned) */
  srl  t4,t1,t0     /* t4 = t1 >> t0 (unsigned) */
  sra  t5,t1,t0     /* t5 = t1 >> t0 (signed) */
  ret
It produces the output shown in Listing 5.63 (SYMB_REG defined; registers con-
taining a null value are not shown).
Listing 5.63 The test_op.s code output
0000: 00d00513   li   a0, 13
        a0 = 13 (d)
0004: 00c00713   li   a4, 12
        a4 = 12 (c)
0008: 00700593   li   a1, 7
        a1 = 7 (7)
0012: 01c00293   li   t0, 28
        t0 = 28 (1c)
0016: ff600f93   li   t6, -10
        t6 = -10 (fffffff6)
0020: 7e600913   li   s2, 2022
        s2 = 2022 (7e6)
0024: 00058633   add  a2, a1, zero
        a2 = 7 (7)
0028: 00a676b3   and  a3, a2, a0
        a3 = 5 (5)
0032: 00e6e7b3   or   a5, a3, a4
        a5 = 13 (d)
0036: 0057c833   xor  a6, a5, t0
        a6 = 17 (11)
0040: 40b80833   sub  a6, a6, a1
        a6 = 10 (a)
0044: 00a838b3   sltu a7, a6, a0
        a7 = 1 (1)
0048: 00581333   sll  t1, a6, t0
        t1 = -1610612736 (a0000000)
0052: 01f323b3   slt  t2, t1, t6
        t2 = 1 (1)
0056: 01233e33   sltu t3, t1, s2
        t3 = 0 (0)
0060: 00535eb3   srl  t4, t1, t0
        t4 = 10 (a)
0064: 40535f33   sra  t5, t1, t0
        t5 = -6 (fffffffa)
0068: 00008067   ret
        pc = 0 (0)
...
t0 = 28 (1c)
t1 = -1610612736 (a0000000)
t2 = 1 (1)
...
a0 = 13 (d)
a1 = 7 (7)
a2 = 7 (7)
a3 = 5 (5)
a4 = 12 (c)
a5 = 13 (d)
a6 = 10 (a)
a7 = 1 (1)
s2 = 2022 (7e6)
...
t4 = 10 (a)
t5 = -6 (fffffffa)
t6 = -10 (fffffff6)
18 fetched and decoded instructions
The test_op_imm.s file has already been presented (refer back to Listing 3.13).
It produces the output shown in Listing 5.64 (SYMB_REG defined; registers con-
taining a null value are not shown).
Listing 5.64 The test_op_imm.s code output
0000: 00500593   li    a1, 5
        a1 = 5 (5)
0004: 00158613   addi  a2, a1, 1
        a2 = 6 (6)
0008: 00c67693   andi  a3, a2, 12
        a3 = 4 (4)
0012: fff68713   addi  a4, a3, -1
        a4 = 3 (3)
0016: 00576793   ori   a5, a4, 5
        a5 = 7 (7)
0020: 00c7c813   xori  a6, a5, 12
        a6 = 11 (b)
0024: 00d83893   sltiu a7, a6, 13
        a7 = 1 (1)
0028: 00b83293   sltiu t0, a6, 11
        t0 = 0 (0)
0032: 01c81313   slli  t1, a6, 28
        t1 = -1342177280 (b0000000)
0036: ff632393   slti  t2, t1, -10
        t2 = 1 (1)
0040: 7e633e13   sltiu t3, t1, 2022
        t3 = 0 (0)
0044: 01c35e93   srli  t4, t1, 28
        t4 = 11 (b)
0048: 41c35f13   srai  t5, t1, 28
        t5 = -5 (fffffffb)
0052: 00008067   ret
        pc = 0 (0)
...
t1 = -1342177280 (b0000000)
t2 = 1 (1)
...
a1 = 5 (5)
a2 = 6 (6)
a3 = 4 (4)
a4 = 3 (3)
a5 = 7 (7)
a6 = 11 (b)
a7 = 1 (1)
...
t4 = 11 (b)
t5 = -5 (fffffffb)
...
14 fetched and decoded instructions
The test_sum.s file (see Listing 5.65) contains code to sum the first 10 integers into register a0 (x10).
Listing 5.65 The test_sum.s file
.globl main
main:
  li   a0,0         /* a0=0 */
  li   a1,0         /* a1=0 */
  li   a2,10        /* a2=10 */
.L1:
  addi a1,a1,1      /* a1++ */
  add  a0,a0,a1     /* a0 += a1 */
  bne  a1,a2,.L1    /* if (a1!=a2) goto .L1 */
  ret
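In C terms, the test amounts to this sketch:
/* C equivalent of test_sum.s */
int main() {
  int a0 = 0, a1 = 0, a2 = 10;
  do {
    a1 = a1 + 1;      /* addi a1,a1,1 */
    a0 = a0 + a1;     /* add  a0,a0,a1 */
  } while (a1 != a2); /* bne  a1,a2,.L1 */
  return a0;          /* a0 == 55 */
}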
It produces the output shown in Listing 5.66 (SYMB_REG defined; registers containing a null value are not shown; intermediate iterations are not shown).
Listing 5.66 The test_sum.s code output
0000: 00000513   li   a0, 0
        a0 = 0 (0)
0004: 00000593   li   a1, 0
        a1 = 0 (0)
0008: 00a00613   li   a2, 10
        a2 = 10 (a)
0012: 00158593   addi a1, a1, 1
        a1 = 1 (1)
0016: 00b50533   add  a0, a0, a1
        a0 = 1 (1)
0020: fec59ce3   bne  a1, a2, 12
        pc = 12 (c)
...
0012: 00158593   addi a1, a1, 1
        a1 = 10 (a)
0016: 00b50533   add  a0, a0, a1
        a0 = 55 (37)
0020: fec59ce3   bne  a1, a2, 12
        pc = 24 (18)
0024: 00008067   ret
        pc = 0 (0)
...
a0 = 55 (37)
a1 = 10 (a)
a2 = 10 (a)
...
34 fetched and decoded instructions
For the synthesis, all the functions are inlined (pragma HLS INLINE) except the highest-level ones: fetch, decode, execute, statistic_update, and running_cond_update.
Figure 5.27 shows the synthesis report.
The Schedule Viewer shows that the IP cycle corresponds to six FPGA cycles
(see Fig. 5.28) as requested by the HLS PIPELINE II=6 pragma.
The synthesis report shows a Timing Violation warning. This is not critical as it concerns operations inside the processor cycle and does not impact the six-cycle schedule. The synthesizer could not fit the end of the fetch function, the decode function, and the beginning of the execute function within loop cycle 2 (see Fig. 5.29).
The z1_fde_ip Vivado project defines the design shown in Fig. 5.30.
The bitstream generation shows a resource utilization of 3313 LUTs, 6.23% of
the available LUTs on the FPGA (see Fig. 5.31).
Experimentation
To run the fde_ip on the development board, proceed as explained in Sect. 5.3.10,
replacing fetching_ip with fde_ip.
You can play with your IP, replacing the included test_mem_0_text.hex file with
any other .hex file you find in the same folder.
The code in the helloworld.c file is shown in Listing 5.67 (do not forget to adapt
the path to the hex file to your environment with the update_helloworld.sh shell
script).
Listing 5.67 The helloworld.c file
#include <stdio.h>
#include "xfde_ip.h"
#include "xparameters.h"
#define LOG_CODE_RAM_SIZE 16
// size in words
#define CODE_RAM_SIZE (1 << LOG_CODE_RAM_SIZE)
XFde_ip_Config *cfg_ptr;
XFde_ip         ip;
word_type code_ram[CODE_RAM_SIZE]={
#include "test_op_imm_0_text.hex"
};
int main() {
  cfg_ptr = XFde_ip_LookupConfig(XPAR_XFDE_IP_0_DEVICE_ID);
  XFde_ip_CfgInitialize(&ip, cfg_ptr);
  XFde_ip_Set_start_pc(&ip, 0);
  XFde_ip_Write_code_ram_Words(&ip, 0, code_ram, CODE_RAM_SIZE);
  XFde_ip_Start(&ip);
  while (!XFde_ip_IsDone(&ip));
  printf("%d fetched, decoded and executed instructions\n",
         (int)XFde_ip_Get_nb_instruction(&ip));
}
If you run the RISC-V code in the test_op_imm_0_text.hex file, your putty
terminal should print what is shown in Listing 5.68.
Listing 5.68 The FPGA execution print on the putty terminal for the run of the
test_op_imm_0_text.hex RISC-V code
14 fetched, decoded and executed instructions
Reference
1. https://riscv.org/specifications/isa-spec-pdf/
Building a RISC-V Processor
6
Abstract
This chapter makes you build your first RISC-V processor. The implemented
microarchitecture proposed in this first version is not pipelined. The IP cycle
encompasses the fetch, the decoding, and the execution of an instruction.
The beginning of the rv32i_npp_ip top function (defined in the rv32i_npp_ip.cpp file) sets the interfaces and the local declarations:
...
#pragma HLS INTERFACE s_axilite port=data_ram
#pragma HLS INTERFACE s_axilite port=nb_instruction
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INLINE recursive
  code_address_t pc;
  int            reg_file[NB_REGISTER];
#pragma HLS ARRAY_PARTITION variable=reg_file dim=1 complete
  instruction_t  instruction;
  bit_t          is_running;
  unsigned int   nbi;
  decoded_instruction_t d_i;
  for (int i=0; i<NB_REGISTER; i++) reg_file[i] = 0;
  pc  = start_pc;
  nbi = 0;
...
A legitimate question is: why two memory arrays (code_ram and data_ram)?
Why do I separate the data from the code? In a processor, there is usually one
memory, shared by the code and the data. When the compiler builds the executable
file, it maps both the code and the data on the same memory space.
We need two separate spaces because we need two accesses at the same time.
If the processor has a single memory array with a single port, the access for the
code (i.e. the instruction fetch) and the data access (i.e. the execution of a LOAD
or a STORE instruction) must be done at different moments, i.e. should start at two
different FPGA cycles.
The rv32i_npp_ip processor is not pipelined. Hence, the processing of each in-
struction goes through multiple successive steps: fetch, decode, execution with mem-
ory access for loads/stores, and writeback. It is easy to separate fetch from execution
in time and have the fetch access start at a different FPGA cycle than the load/store
access. So, for the rv32i_npp_ip implementation, I could have used a single memory.
But Chap. 8 will introduce a pipelined design in which the processor starts the fetch phase and the load/store phase in the same FPGA cycle.
However, even though both accesses are simultaneous, there is a way to avoid separating the data and the code memory on Xilinx FPGAs. Each BRAM (Block RAM) has two access ports, so one port can serve the fetch and the other the data memory access. But I will use the second port for another purpose in the second
part of the book, when implementing multicore processors. The second port is used
to provide a remote access (a core i accesses a memory word inside a core j data
memory).
So, to avoid having different memory models across the various implementations,
I decided to have separate code and data memories. This is not much different from
a classic processor which has separate code and data first level caches.
This will impact the way the executable files are built but I will explain how to
adapt to that later (see Sect. 7.3.1).
The DATA_RAM_SIZE constant in the rv32i_npp_ip.h file defines the data memory size as 2^16 words (2^18 bytes, 256 KB) (if the RISC-V processor implementations are to be tested on a Basys3 development board, this size should be reduced to 64 KB because the XC7A35T FPGA has only 200 KB of RAM: set LOG_DATA_RAM_SIZE to 14, i.e. 16K words, in all the designs).
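In the rv32i_npp_ip.h file, these constants might look like the following sketch (the same pattern appears in the helloworld.c driver of Listing 6.24):
#define LOG_DATA_RAM_SIZE 16                    // 14 for a Basys3 board
// size in words
#define DATA_RAM_SIZE (1 << LOG_DATA_RAM_SIZE)  // 2^16 words, i.e. 256 KB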
The data memory is externally accessed through the s_axilite interface protocol.
Hence, I will use the AXI connection to externally view the content of the memory
after the run.
To optimize the iteration time, inlining has been systematically enabled with the
pragma HLS INLINE recursive added after the HLS INTERFACE pragmas.
The recursive option for the INLINE pragma implies inlining for all the functions
called by the top function.
The rv32i_npp_ip top function do ... while loop code is shown in Listing 6.2.
Listing 6.2 The rv32i_npp_ip top function do ... while loop
...
do {
#pragma HLS PIPELINE II=7
  fetch(pc, code_ram, &instruction);
  decode(instruction, &d_i);
#ifndef __SYNTHESIS__
#ifdef DEBUG_DISASSEMBLE
  disassemble(pc, instruction, d_i);
#endif
#endif
  execute(pc, reg_file, data_ram, d_i, &pc);
  statistic_update(&nbi);
  running_cond_update(instruction, pc, &is_running);
} while (is_running);
*nb_instruction = nbi;
#ifndef __SYNTHESIS__
#ifdef DEBUG_REG_FILE
  print_reg(reg_file);
#endif
#endif
}
The processor cycle is set as seven FPGA cycles (pragma HLS PIPELINE II=7).
The fetch function is defined in the fetch.cpp file. It is unchanged (see Sect. 5.3.3).
I have added some precomputed Boolean fields as the result of the decoding of
the opcode and added them to the decoded_instruction_t type (e.g. "d_i.opcode
== LOAD" sets bit_t variable is_load). It is less expensive to drive a single bit than
to compare 5-bit values multiple times.
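For example, the decode function can set such a bit once, as in this sketch (the full field list appears in Listing 6.3):
// in the decode function: one 5-bit comparison each, reused everywhere
d_i->is_load   = (d_i->opcode == LOAD);
d_i->is_store  = (d_i->opcode == STORE);
d_i->is_branch = (d_i->opcode == BRANCH);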
I have added the necessary decodings to replace all the opcode comparisons found
in the execute function and its dependencies. There are eight new bit fields in the
definition of the decoded_instruction_t type in the rv32i_npp_ip.h file. The new
definition is shown in Listing 6.3.
Similarly, a word is aligned if the two least significant bits of its address are both
0, as shown in Fig. 6.2.
When a word is stored to memory, its bytes can be written in two opposite orders.
A little endian processor writes the bytes starting with the least significant one
(i.e. the least significant byte is written to the byte with the lowest address).
A big endian processor writes the bytes starting with the most significant one (i.e.
the most significant byte is written to the byte with the lowest address).
Figure 6.3 shows the difference between little and big endian stores.
A little endian processor loads the byte at the lowest address (i.e. byte 0xc3 in the left part of Fig. 6.3) and writes it to the least significant byte position in the destination register (0x12F62BC3 is loaded into the destination register).
A big endian processor loads the byte at the lowest address (i.e. byte 0x12 in the right part of Fig. 6.3) and writes it to the most significant byte position (0x12F62BC3 is loaded into the destination register).
The RISC-V specification does not specify the endianness of the implementation.
From the C code used to define the load and store RISC-V operations (in Sect. 6.4),
the Vitis HLS synthesizer builds a little endian byte ordering.
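You can observe the little endian convention on most host machines with a few lines of C (a sketch reproducing the example of Fig. 6.3):
#include <stdio.h>
int main() {
  unsigned int w = 0x12F62BC3;
  unsigned char *p = (unsigned char *)&w;
  /* on a little endian machine, the lowest address holds the least
     significant byte: prints c3 2b f6 12 */
  printf("%02x %02x %02x %02x\n", p[0], p[1], p[2], p[3]);
  return 0;
}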
The execute function (see Listing 6.5; defined in the execute.cpp file) has the data
memory pointer as a new argument (data_ram). It also includes the execution of the
LOAD and STORE instructions (calls to the mem_load and mem_store functions).
Listing 6.5 The execute function
void execute(
  code_address_t        pc,
  int                  *reg_file,
  int                  *data_ram,
  decoded_instruction_t d_i,
  code_address_t       *next_pc) {
  int rv1, rv2, result;
  b_data_address_t address;
  read_reg(reg_file, d_i.rs1, d_i.rs2, &rv1, &rv2);
  result  = compute_result(rv1, rv2, d_i, pc);
  address = result;
  if (d_i.is_store)
    mem_store(data_ram, address, rv2, (ap_uint<2>)d_i.func3);
  if (d_i.is_load)
    result = mem_load(data_ram, address, d_i.func3);
  write_reg(reg_file, d_i, result);
  *next_pc = compute_next_pc(pc, rv1, d_i, (bit_t)result);
#ifndef __SYNTHESIS__
#ifdef DEBUG_EMULATE
  emulate(reg_file, d_i, *next_pc);
#endif
#endif
}
The compute_result function (see Listing 6.6; defined in the execute.cpp file) is
completed to take care of the S-TYPE instructions (STORE opcode) and the LOAD
variant of the I-TYPE format.
In both cases, the computed result is the accessed address, obtained from the sum
of rv1 and d_i.imm.
The mem_store function (see Listing 6.7; defined in the execute.cpp file) is organized as a switch on the access width msize, i.e. the two least significant bits of the func3 field.
The value to be stored is either the lower byte rv2_0 (SB or store byte), the lower
half word rv2_01 (SH or store half word) or the full rv2 value (SW or store word).
The write address is either a byte pointer (char *), a half word pointer (short *),
or a word pointer (int *).
The synthesizer sets the byte enables to restrict the writes into the addressed word to the concerned bytes (for SB, a single byte anywhere in the addressed word is enabled; for SH, a pair of aligned adjacent bytes, either at the start or at the end of the addressed word; for SW, all the bytes of the addressed word).
The RISC-V specification leaves the implementation free to decide what to do
with misaligned accesses.
In my implementation the two least significant bits of the address are discarded
(two bits right shift) before a SW or LW access, forcing the 4-byte boundary alignment.
For SH, LH, and LHU, the least significant bit of the address is discarded (one bit
right shift), forcing the 2-byte boundary alignment.
So, even though an access address is misaligned, the processor performs an aligned
access (e.g. if the address is 4a + 3, the accessed word is the aligned word at address
4a, composed of bytes 4a, 4a + 1, 4a + 2 and 4a + 3).
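In C++ terms, the alignment forcing looks like this sketch (using the data_ram array and the address types of the listings below):
b_data_address_t address = 7;          // 4*1 + 3: a misaligned byte address
w_data_address_t a2 = (address >> 2);  // a2 == 1: the aligned word at byte address 4
int w = data_ram[a2];                  // bytes 4, 5, 6 and 7 are accessed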
Listing 6.7 The mem_store function
static void mem_store(
  int              *data_ram,
  b_data_address_t  address,
  int               rv2,
  ap_uint<2>        msize) {
  h_data_address_t a1 = (address >> 1);
  w_data_address_t a2 = (address >> 2);
  char  rv2_0;
  short rv2_01;
  rv2_0  = rv2;
  rv2_01 = rv2;
  switch (msize) {
    case SB:
      *((char  *)(data_ram) + address) = rv2_0;
      break;
    case SH:
      *((short *)(data_ram) + a1) = rv2_01;
      break;
    case SW:
      data_ram[a2] = rv2;
      break;
    case 3:
      break;
  }
}
In other words, the programmer should align his/her multibyte data. For example,
a structure should be padded to avoid misalignment. Listing 6.8 shows an example
of such a padded structure. Two padding fields of one byte each (pad1 and pad2) are
added to ensure the alignment of the j field in both s1 and s2 variables.
Listing 6.8 A structure with aligned fields
// it is assumed that the start of the s1 structure is word aligned
struct s_s {
  int   i;    // word aligned
  short s;    // half word aligned
  char  pad1; // byte aligned
  char  pad2; // byte aligned
  int   j;    // word aligned, thanks to the pad1 and pad2 fields
} s1, s2;
The addresses for LOAD and STORE instructions are either word addresses with
type w_data_address_t, or h_data_address_t with one more bit for half word ad-
dresses, or b_data_address_t with two more bits for byte addresses.
These types are defined in the rv32i_npp_ip.h file (see Listing 6.9).
Listing 6.9 The data memory type definitions in the rv32i_npp_ip.h file
...
typedef ap_uint<LOG_DATA_RAM_SIZE>   w_data_address_t;
typedef ap_uint<LOG_DATA_RAM_SIZE+1> h_data_address_t;
typedef ap_uint<LOG_DATA_RAM_SIZE+2> b_data_address_t;
...
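For example, with LOG_DATA_RAM_SIZE set to 16, the three types are 16, 17, and 18 bits wide, as in this sketch:
b_data_address_t ba = 0x2FFFF;  // byte address, 18 bits
h_data_address_t ha = ba >> 1;  // half word address, 17 bits (0x17FFF)
w_data_address_t wa = ba >> 2;  // word address, 16 bits (0xBFFF)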
The mem_load function (see Listing 6.10; defined in the execute.cpp file) works
differently from the store function.
The store function writes the value to the addressed bytes and only them are
accessed, thanks to the byte write enable bits.
The implementation of the load function accesses a full aligned word, from which
the bytes requested by the LW, LH, LHU, LB, or LBU instruction are extracted.
The load function accesses four bytes aligned on the addressed word boundary
(the address argument is a byte address, from which a word address a2 is derived
with a two bits right shift).
Listing 6.10 The mem_load function: load the addressed word
static int mem_load(
  int              *data_ram,
  b_data_address_t  address,
  func3_t           msize) {
  ap_uint<2>       a01 = address;
  bit_t            a1  = (address >> 1);
  w_data_address_t a2  = (address >> 2);
  int   result;
  char  b, b0, b1, b2, b3;
  unsigned char  ub, ub0, ub1, ub2, ub3;
  short h, h0, h1;
  unsigned short uh, uh0, uh1;
  int   w, ib, ih;
  unsigned int   iub, iuh;
  w   = data_ram[a2];
  b0  = w;
  ub0 = b0;
  b1  = w >> 8;
  ub1 = b1;
  h0  = ((ap_uint<16>)ub1 << 8) | (ap_uint<16>)ub0;
  uh0 = h0;
  b2  = w >> 16;
  ub2 = b2;
  b3  = w >> 24;
  ub3 = b3;
  h1  = ((ap_uint<16>)ub3 << 8) | (ap_uint<16>)ub2;
  uh1 = h1;
...
The loaded word is used to build the requested data according to the size (byte,
half-word, or word) and to the address least significant bits (a01 for a byte access,
a1 for a half word access).
For LB and LH, the built data is sign extended to the size of the destination register
(i.e. left padding the word data with copies of the loaded value sign).
For LBU and LHU, the loaded value is 0-extended (i.e. left padding the word data
with zeros).
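The elided end of Listing 6.10 selects and extends the loaded value; it might look like the following sketch (assuming LB, LH, LW, LBU, and LHU name the func3 encodings defined in the project headers):
  b  = (a01 == 0) ? b0 : (a01 == 1) ? b1 : (a01 == 2) ? b2 : b3;
  ub = b;
  h  = a1 ? h1 : h0;
  uh = h;
  switch (msize) {
    case LB:  result = b;  break; // char -> int cast: sign extended
    case LH:  result = h;  break; // short -> int cast: sign extended
    case LW:  result = w;  break;
    case LBU: result = ub; break; // unsigned char -> int: zero extended
    case LHU: result = uh; break; // unsigned short -> int: zero extended
  }
  return result;
}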
The write_reg function (see Listing 6.12; defined in the execute.cpp file) writes
the result back into the destination register, except if the instruction is a STORE, a
BRANCH, or if the destination is register zero.
Listing 6.12 The write_reg function
static void write_reg(
  int                  *reg_file,
  decoded_instruction_t d_i,
  int                   result) {
  if (d_i.rd != 0    &&
      !d_i.is_branch &&
      !d_i.is_store)
    reg_file[d_i.rd] = result;
}
6.5 Simulating the Rv32i_npp_ip With the Testbench
Experimentation
To simulate the rv32i_npp_ip, operate as explained in Sect. 5.3.6, replacing fetch-
ing_ip with rv32i_npp_ip.
You can play with the simulator, replacing the included test_mem_0_text.hex
file with any other .hex file you find in the same folder.
The code_ram array initialization includes the RISC-V code to be run. This code
is obtained from the .text section of the ELF file.
Two new test programs have been added: test_load_store.s (see Listing 6.14) to test the various sizes of loads and stores defined in the RV32I ISA and test_mem.s (see Listing 6.18) which sets an array of the first 10 natural numbers and sums them.
Their hex translations are available in the rv32i_npp_ip folder.
Listing 6.14 The test_load_store.s file
main:
  li   t0,1       /* t0=1 */
  li   t1,2       /* t1=2 */
  li   t2,-3      /* t2=-3 */
  li   t3,-4      /* t3=-4 */
  li   a0,0       /* a0=0 */
  sw   t0,0(a0)   /* t[a0]=t0 (word access) */
  addi a0,a0,4    /* a0+=4 */
  sh   t1,0(a0)   /* t[a0]=t1 (half word access) */
  sh   t0,2(a0)   /* t[a0+2]=t0 (half word access) */
  addi a0,a0,4    /* a0+=4 */
  sb   t3,0(a0)   /* t[a0]=t3 (byte access) */
  sb   t2,1(a0)   /* t[a0+1]=t2 (byte access) */
  sb   t1,2(a0)   /* t[a0+2]=t1 (byte access) */
  sb   t0,3(a0)   /* t[a0+3]=t0 (byte access) */
  lb   a1,0(a0)   /* a1=t[a0] (byte access) */
  lb   a2,1(a0)   /* a2=t[a0+1] (byte access) */
  lb   a3,2(a0)   /* a3=t[a0+2] (byte access) */
  lb   a4,3(a0)   /* a4=t[a0+3] (byte access) */
  lbu  a5,0(a0)   /* a5=t[a0] (unsigned byte access) */
  lbu  a6,1(a0)   /* a6=t[a0+1] (unsigned byte access) */
  lbu  a7,2(a0)   /* a7=t[a0+2] (unsigned byte access) */
  addi a0,a0,-4   /* a0-=4 */
  lh   s0,2(a0)   /* s0=t[a0+2] (half word access) */
  lh   s1,0(a0)   /* s1=t[a0] (half word access) */
  lhu  s2,4(a0)   /* s2=t[a0+4] (unsigned h.w. access) */
  lhu  s3,6(a0)   /* s3=t[a0+6] (unsigned h.w. access) */
  addi a0,a0,-4   /* a0-=4 */
  lw   s4,8(a0)   /* s4=t[a0+8] (word access) */
  ret
The store instructions forming the first part of the run print what is shown in
Listing 6.15.
Listing 6.15 The test_load_store.s print: store instructions
0000: 00100293   li   t0, 1
        t0 = 1 (1)
0004: 00200313   li   t1, 2
        t1 = 2 (2)
0008: ffd00393   li   t2, -3
        t2 = -3 (fffffffd)
0012: ffc00e13   li   t3, -4
        t3 = -4 (fffffffc)
0016: 00000513   li   a0, 0
        a0 = 0 (0)
0020: 00552023   sw   t0, 0(a0)
        m[ 0] = 1 (1)
0024: 00450513   addi a0, a0, 4
        a0 = 4 (4)
0028: 00651023   sh   t1, 0(a0)
        m[ 4] = 2 (2)
0032: 00551123   sh   t0, 2(a0)
        m[ 6] = 1 (1)
0036: 00450513   addi a0, a0, 4
        a0 = 8 (8)
0040: 01c50023   sb   t3, 0(a0)
        m[ 8] = -4 (fffffffc)
0044: 007500a3   sb   t2, 1(a0)
        m[ 9] = -3 (fffffffd)
0048: 00650123   sb   t1, 2(a0)
        m[ a] = 2 (2)
0052: 005501a3   sb   t0, 3(a0)
        m[ b] = 1 (1)
The load instructions forming the last part of the run print what is shown in Listing
6.16.
After the run, the processor prints the register file and the testbench program
prints the number of instructions run and the non null memory words, as shown in
Listing 6.17.
Listing 6.17 The test_load_store.s print: register file and memory dump
ra = 0 (0)
...
tp = 0 (0)
t0 = 1 (1)
t1 = 2 (2)
t2 = -3 (fffffffd)
s0 = 1 (1)
s1 = 2 (2)
a0 = 0 (0)
a1 = -4 (fffffffc)
a2 = -3 (fffffffd)
a3 = 2 (2)
a4 = 1 (1)
a5 = 252 (fc)
a6 = 253 (fd)
a7 = 2 (2)
s2 = 65020 (fdfc)
s3 = 258 (102)
s4 = 16973308 (102fdfc)
s5 = 0 (0)
...
s11 = 0 (0)
t3 = -4 (fffffffc)
t4 = 0 (0)
...
29 fetched and decoded instructions
data memory dump (non null words)
m[ 0] = 1 (1)
m[ 4] = 65538 (10002)
m[ 8] = 16973308 (102fdfc)
The second test program is test_mem.s, shown in Listing 6.18 (which computes the sum of the 10 elements of an array initialized with the first 10 natural numbers; in the comments, I give the C equivalent semantic of each RISC-V assembly instruction).
Listing 6.18 The test_mem.s file
.globl main
main:
  li   a0,0       /* a0=0 */
  li   a1,0       /* a1=0 */
  li   a2,0       /* a2=0 */
  addi a3,a2,40   /* a3=40 */
.L1:
  addi a1,a1,1    /* a1++ */
  sw   a1,0(a2)   /* t[a2]=a1 */
  addi a2,a2,4    /* a2+=4 */
  bne  a2,a3,.L1  /* if (a2!=a3) goto .L1 */
  li   a1,0       /* a1=0 */
  li   a2,0       /* a2=0 */
.L2:
  lw   a4,0(a2)   /* a4=t[a2] */
  add  a0,a0,a4   /* a0+=a4 */
  addi a2,a2,4    /* a2+=4 */
  bne  a2,a3,.L2  /* if (a2!=a3) goto .L2 */
  sw   a0,4(a2)   /* t[a2+4]=a0 */
  ret
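In C terms, the test amounts to the following sketch (the t array models the data memory):
/* C equivalent of test_mem.s */
int main() {
  int t[12] = {0};
  int a0 = 0;
  for (int a1 = 1; a1 <= 10; a1++) t[a1 - 1] = a1; /* write loop */
  for (int i = 0; i < 10; i++)     a0 += t[i];     /* read loop: a0 = 55 */
  t[11] = a0;  /* sw a0,4(a2) with a2 == 40: m[0x2c] = 55 */
  return a0;
}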
The first part of the run of the test_mem.s file prints what is shown in Listing
6.19.
Listing 6.19 The test_mem.s print: first iteration of the write array loop
0000: 00000513   li   a0, 0
        a0 = 0 (0)
0004: 00000593   li   a1, 0
        a1 = 0 (0)
0008: 00000613   li   a2, 0
        a2 = 0 (0)
0012: 02860693   addi a3, a2, 40
        a3 = 40 (28)
0016: 00158593   addi a1, a1, 1
        a1 = 1 (1)
0020: 00b62023   sw   a1, 0(a2)
        m[ 0] = 1 (1)
0024: 00460613   addi a2, a2, 4
        a2 = 4 (4)
0028: fed61ae3   bne  a2, a3, 16
        pc = 16 (10)
0016: 00158593   addi a1, a1, 1
        a1 = 2 (2)
0020: 00b62023   sw   a1, 0(a2)
        m[ 4] = 2 (2)
...
After the array has been initialized, it is read to accumulate its values in the a0
register. The second part of the run prints what is shown in Listing 6.20.
Listing 6.20 The test_mem.s print: out of the write array loop and in the read array loop
...
0024: 00460613   addi a2, a2, 4
        a2 = 40 (28)
0028: fed61ae3   bne  a2, a3, 16
        pc = 32 (20)
0032: 00000593   li   a1, 0
        a1 = 0 (0)
0036: 00000613   li   a2, 0
        a2 = 0 (0)
0040: 00062703   lw   a4, 0(a2)
        a4 = 1 (1) (m[ 0])
0044: 00e50533   add  a0, a0, a4
        a0 = 1 (1)
0048: 00460613   addi a2, a2, 4
        a2 = 4 (4)
0052: fed61ae3   bne  a2, a3, 40
        pc = 40 (28)
0040: 00062703   lw   a4, 0(a2)
        a4 = 2 (2) (m[ 4])
...
When all the elements of the array have been added, the final sum is saved to
memory. The third part of the run prints what is shown in Listing 6.21.
Listing 6.21 The test_mem.s print: out of the read array loop and store the result
...
0044: 00e50533   add  a0, a0, a4
        a0 = 55 (37)
0048: 00460613   addi a2, a2, 4
        a2 = 40 (28)
0052: fed61ae3   bne  a2, a3, 40
        pc = 56 (38)
0056: 00a62223   sw   a0, 4(a2)
        m[ 2c] = 55 (37)
0060: 00008067   ret
        pc = 0 (0)
The processor prints the final state of its register file, with the sum in a0 as shown
in Listing 6.22.
Listing 6.22 The test_mem.s print: the register file
...
a0 = 55 (37)
a1 = 0 (0)
a2 = 40 (28)
a3 = 40 (28)
a4 = 10 (a)
...
The testbench program prints the number of executed instructions and the non
null memory words, i.e. the array and the sum, as shown in Listing 6.23.
Listing 6.23 The test_mem.s print: the memory
88 fetched and decoded instructions
data memory dump (non null words)
m[ 0] = 1 (1)
m[ 4] = 2 (2)
m[ 8] = 3 (3)
m[ c] = 4 (4)
m[10] = 5 (5)
m[14] = 6 (6)
m[18] = 7 (7)
m[1c] = 8 (8)
m[20] = 9 (9)
m[24] = 10 (a)
m[2c] = 55 (37)
The synthesis report shows that the IP cycle corresponds to seven FPGA cycles (see
Fig. 6.4).
The Schedule Viewer confirms the seven cycles (see Fig. 6.5).
Experimentation
To run the rv32i_npp_ip on the development board, proceed as explained in
Sect. 5.3.10, replacing fetching_ip with rv32i_npp_ip.
You can play with your IP, replacing the included test_mem_0_text.hex file with
any other .hex file you find in the same folder.
The code in the helloworld.c file is shown in Listing 6.24 (do not forget to adapt
the path to the hex file to your environment with the update_helloworld.sh shell
script; to run another test program, update the #include line in the initialization of
the code_ram array).
Listing 6.24 The helloworld.c program to run the test_mem.s code on the Pynq-Z1 board
#include <stdio.h>
#include "xrv32i_npp_ip.h"
#include "xparameters.h"
#define LOG_CODE_RAM_SIZE 16
// size in words
#define CODE_RAM_SIZE (1 << LOG_CODE_RAM_SIZE)
#define LOG_DATA_RAM_SIZE 16
// size in words
#define DATA_RAM_SIZE (1 << LOG_DATA_RAM_SIZE)
XRv32i_npp_ip_Config *cfg_ptr;
XRv32i_npp_ip         ip;
word_type code_ram[CODE_RAM_SIZE]={
#include "test_mem_0_text.hex"
};
int main() {
  word_type w;
  cfg_ptr = XRv32i_npp_ip_LookupConfig(
    XPAR_XRV32I_NPP_IP_0_DEVICE_ID);
  XRv32i_npp_ip_CfgInitialize(&ip, cfg_ptr);
  XRv32i_npp_ip_Set_start_pc(&ip, 0);
  XRv32i_npp_ip_Write_code_ram_Words(&ip, 0, code_ram,
    CODE_RAM_SIZE);
  XRv32i_npp_ip_Start(&ip);
  while (!XRv32i_npp_ip_IsDone(&ip));
  printf("%d fetched and decoded instructions\n",
         (int)XRv32i_npp_ip_Get_nb_instruction(&ip));
  printf("data memory dump (non null words)\n\r");
  for (int i=0; i<DATA_RAM_SIZE; i++) {
    XRv32i_npp_ip_Read_data_ram_Words(&ip, i, &w, 1);
    if (w != 0)
      printf("m[%5x] = %16d (%8x)\n", 4*i, (int)w,
             (unsigned int)w);
  }
}
The run of the RISC-V code in the test_mem.s file prints the output shown in
Listing 6.25 on the putty terminal.
Listing 6.25 The helloworld.c program output on the putty terminal
88 fetched and decoded instructions
data memory dump (non null words)
m[ 0] = 1 (1)
m[ 4] = 2 (2)
m[ 8] = 3 (3)
m[ c] = 4 (4)
m[10] = 5 (5)
m[14] = 6 (6)
m[18] = 7 (7)
m[1c] = 8 (8)
m[20] = 9 (9)
m[24] = 10 (a)
m[2c] = 55 (37)
Testing Your RISC-V Processor
7
Abstract
This chapter lets you test your first RISC-V processor in three steps: test all the instructions in their most frequent usage (my six test programs), pass the official riscv-tests, and run benchmark programs from the mibench suite and from the official riscv-tests.
I have already presented six RISC-V test programs: test_branch.s to test BRANCH
instructions, test_jal_jalr.s to test JAL and JALR instructions, test_lui_auipc.s to test
LUI and AUIPC instructions, test_load_store.s to test LOAD and STORE instructions,
test_op.s to test OP instructions and test_op_imm.s to test OP_IMM instructions.
They are enough to make sure that the decoder recognizes all the instructions in
the RV32I ISA and that the execution unit can run them.
However, there are many special situations which are not checked by these six
programs. For example, is register zero preserved from any writeback? As a source,
does it provide the value 0? Are the constants in every format properly decoded
(there are a lot of cases to test because the decoded immediate value is composed of
many fields from different bits in the instruction word, assembled in different ways
according to the instruction format)?
The RISC-V organization provides a set of programs to test all the instructions
more exhaustively than what I do in my own codes. However, you should first run
my codes before running the official riscv-tests programs, for two reasons.
First, the riscv-tests codes are embedded and the embedding code is itself made
of RISC-V instructions. So, if your processor is buggy, you might not be able to run
the part of the code to launch the tests.
Second, because you will learn that debugging hardware is not as simple as de-
bugging software. On the FPGA, you do not have a debugger at hand (you do, but
just for the helloworld driver code run on the ARM processor within the Zynq SoC,
not for the RISC-V code run on the IP implemented on the FPGA). When your
FPGA does not send anything to the putty terminal, the only way to debug your IP
is to think about your code and evaluate every instruction, step by step, again and
again. My simple RISC-V programs are less likely to bug because of the processor
implementation than the complex riscv-tests ones.
Once your processor has successfully run my six test codes (not only on the
simulator but also on the FPGA), you can try to pass the riscv-tests.
All the source files related to the riscv-tests can be found in the riscv-tests folder.
The pk interpreter is not used. The simulated code (e.g. rv32ui-p-add) contains its own OS proxy and is loaded by spike itself.
As the spike simulator has no known bug, all the error files are empty after the run.
Before I describe the port of the riscv-tests to the Vitis_HLS environment, I must
present their structure in the original code.
RVTEST_RV64U
RVTEST_CODE_BEGIN
#-------------------------------------------------------------
# Arithmetic tests
#-------------------------------------------------------------
A macro is defined in three parts: the #define keyword, the name of the macro with its possible arguments (e.g. TEST_RR_OP(testnum, inst, result, val1, val2)), and its body (the rest of the line, possibly extended with \ characters).
RVTEST_RV64U
RVTEST_CODE_BEGIN
#-------------------------------------------------------------
# Arithmetic tests
#-------------------------------------------------------------
TEST_CASE( 2, x14, 0x00000000, \
  li x1, MASK_XLEN(0x00000000); \
  li x2, MASK_XLEN(0x00000000); \
  add x14, x1, x2;              \
);
TEST_CASE( 3, x14, 0x00000002, \
  li x1, MASK_XLEN(0x00000001); \
  li x2, MASK_XLEN(0x00000001); \
  add x14, x1, x2;              \
)
Listing 7.5 The MASK_XLEN and TEST_CASE macro definitions in the test_macros.h file
#define MASK_XLEN(x) ((x) & ((1 << (__riscv_xlen - 1) << 1) - 1))
#define TEST_CASE( testnum, testreg, correctval, code... ) \
test_ ## testnum:             \
  code;                       \
  li x7, MASK_XLEN(correctval); \
  li TESTNUM, testnum;        \
  bne testreg, x7, fail;
The substitution process continues until all the macros have been replaced. You
can check these substitutions by running the compiler only for the preprocessor job
with the -E option (see Listing 7.6).
  ecall
pass: fence
  li gp, 1
  li a7, 93
  li a0, 0
  ecall
The last test adds 16 to 30 into register x0. Register x0, alias zero, should never
be written. Hence, register x0 is compared to a cleared x7. If they do not match, the
run branches to fail. Otherwise, it branches to pass.
Whatever the result of the equality comparison between x0 and x7, the run calls system call 93 (in a spike simulated RISC-V machine, the ecall instruction calls the system call whose number is in the a7 register, here 93).
The called system call belongs to spike. It receives arguments through the a0 to
a5 registers (a0 is cleared if pass and set to 2*test number + 1 if fail; gp is set to 1
if pass and to the test number if fail; this is probably why the first test number is 2
instead of 1).
It is not very clear (because undocumented) how this system call operates. It
certainly writes a fail message to the standard error stream. The spike command
launched by make run redirects this standard error stream to an error file.
The test programs run on spike must be adapted to be run on the rv32i_npp_ip
simulator and on the FPGA. For example, the ebreak instruction and the system
calls included in spike are not present in the rv32i_npp_ip implementation.
I have adapted the riscv_test.h file included by each x.S test program to the Vi-
tis_HLS environment (I mostly adapted the changes for Vitis_HLS from a proposition
I found on Peter Gu’s blog, thanks to him (https://www.ustcpetergu.com/MyBlog/
experience/2021/07/09/about-riscv-testing.html), himself being inspired by the Pi-
coRV32 implementation at https://github.com/YosysHQ/picorv32).
The new version is my_riscv_test.h shown in Listing 7.9. It is located in the
riscv-tests/my_env/p folder.
For each tested instruction, the result of the test is available in the a0 register (null
if passed) and saved in the result_zone memory (starting at byte address 0x2000 or
word address 0x800).
The RVTEST_CODE_BEGIN macro only contains the TEST_FUNC_NAME label
definition (see the "#define RVTEST_CODE_BEGIN" line in Listing 7.9).
TEST_FUNC_NAME is a macro to be defined as the name of the tested instruction
(e.g. addi).
This macro is not defined in the code. It is to be defined directly in the compilation
command through the -D option (the shell script including the compilation command
and the -D option is presented in 7.2.6).
The RVTEST_FAIL and RVTEST_PASS macros are defined to save their a0 result
into the result zone (see the "#define RVTEST_FAIL" and "#define RVTEST_PASS"
lines in Listing 7.9).
7.2 More Testing with the Official Riscv-Tests 207
The tests are all glued together in a single program named _start.S, shown in Listing
7.10. The _start.S file is in the riscv-tests/my_isa/my_rv32ui folder.
The _start.S file runs the tests of the 37 instructions consecutively (plus the sim-
ple.S test) and saves their result in the result_zone array.
The _start.S file defines the TEST(n) macro.
The TEST(n) macro jumps to label n ("jal zero, n"; e.g. TEST(addi) jumps to label
addi defined in RVTEST_CODE_BEGIN, which starts the run of addi.S).
The macro also defines label n_ret (e.g. TEST(addi) defines label addi_ret which
is the jump address after RVTEST_PASS or RVTEST_FAIL).
To summarize the construction, let me detail the addi instruction test example.
TEST(addi) in _start.S jumps to label addi ("jal zero, addi" after the TEST macro
expansion). The addi label is defined in the addi.S file by the expansion of the
RVTEST_CODE_BEGIN macro.
The addi.S tests are run (the succession of macros in addi.S, starting with
TEST_IMM_OP). The addi.S code ends with the TEST_PASSFAIL macro which is
defined in the my_test_macros.h file and expanded as a branch to the fail or pass
labels.
At the pass label, the RVTEST_PASS macro expands to the code given in the
my_riscv_test.h file, which clears the current result_zone word (meaning the test is
passed) and jumps to addi_ret.
The addi_ret label is at the end of the TEST(addi) macro expansion. It is followed
by the TEST(add) macro which starts the add instruction test.
Listing 7.10 The _start.S file
#include "../../my_env/p/my_riscv_test.h"
.text
.globl _start
.equ result_zone, 0x2000
_start:
  lui  a0,%hi(result_zone)
  addi a0,a0,%lo(result_zone)
  addi a1,a0,4
  sw   a1,0(a0)
  INIT_XREG
#define TEST(n)      \
  .global n;         \
  jal zero, n;       \
  .global n ## _ret; \
n ## _ret:
  TEST(addi)
  TEST(add)
  ...
  TEST(xori)
  TEST(xor)
  li ra,0
  ret
A new testbench code is written (see Listing 7.11) to read the result_zone and print
the result of each test (testbench_riscv_tests_rv32i_npp_ip.cpp file in the riscv-
tests/my_isa/my_rv32ui folder).
The code run by the rv32i_npp_ip processor is defined as the test_0_text.hex
file (to be built with a shell script presented in 7.2.6). This file contains the RISC-V
instruction codes provided by the compiler from the compilation of the _start.S file.
Concerning the test_0_data.hex file, the LOAD instructions in the tests read from
memory words which must be provided by the data hex file.
The data these LOAD instructions access are defined in the .S files (lw.S, lh.S, lhu.S,
lb.S, and lbu.S).
They are incorporated to the test_0_data.hex file which serves to initialize the data_ram array (this file is built by the same shell script; remember that the code and data memories are separate, hence two initialization files: test_0_data.hex to initialize the data RAM and test_0_text.hex to initialize the code RAM).
When the rv32i_npp_ip function is called, the code in the code_ram array is
run, i.e. the _start.S file. The results of the tests are saved in the result_zone in the
data_ram array (at word address 0x801 is the result of the first instruction tested,
i.e. addi). When the memory word is null, the test is passed. Otherwise, the value is
the identification number of the first failing test.
Listing 7.11 The testbench_riscv_tests_rv32i_npp_ip.cpp file
#include <stdio.h>
#include "../../../rv32i_npp_ip/rv32i_npp_ip.h"
unsigned int data_ram[DATA_RAM_SIZE]={
#include "test_0_data.hex"
};
unsigned int code_ram[CODE_RAM_SIZE]={
#include "test_0_text.hex"
};
char *name[38] = {
  "addi", "add", "andi", "and", "auipc",
  "beq", "bge", "bgeu", "blt", "bltu", "bne",
  "jalr", "jal",
  "lb", "lbu", "lh", "lhu", "lui", "lw",
  "ori", "or",
  "sb", "sh", "simple",
  "slli", "sll", "slti", "sltiu", "slt", "sltu",
  "srai", "sra", "srli", "srl", "sub", "sw",
  "xori", "xor"
};
int main() {
  unsigned int nbi;
  int w;
  rv32i_npp_ip(0, (instruction_t *)code_ram, (int *)data_ram, &nbi);
  for (int i=0; i<38; i++) {
    printf("%s: ", name[i]);
    if (data_ram[0x801 + i]==0)
      printf("all tests passed\n");
    else
      printf("test %d failed\n", data_ram[0x801 + i]);
  }
  return 0;
}
Experimentation
To simulate the riscv-tests on the rv32i_npp_ip, in a terminal with riscv-tests/my_isa/my_rv32ui as the current directory, run ./my_build_all.sh.
Proceed as explained in 5.3.6, replacing fetching_ip with rv32i_npp_ip and testbench_fetching_ip.cpp with riscv-tests/my_isa/my_rv32ui/testbench_riscv_tests_rv32i_npp_ip.cpp.
The simulation result in the rv32i_npp_ip_csim.log tab should show that all the
tests have been passed.
The my_build_all.sh script (shown in Listing 7.13) compiles each .S file, defining
the TEST_FUNC_NAME and TEST_FUNC_RET variables (for example
"-DTEST_FUNC_NAME=addi" and "-DTEST_FUNC_RET=addi_ret" for the compi-
lation of addi.S).
Then it compiles the _start.S file.
When all the source files have been compiled, the script links them to build the
test.elf file. Usually, the linker builds a memory with the concatenation of the code
and the data (the .text section followed by the .data section).
The rv32i_npp_ip processor has separate code and data memory banks. For this
reason, the script orders the linker to base the code and the data at the same address
0 (-Ttext 0 and -Tdata 0). The linker complains when two sections are overlapping.
To avoid this, the script uses the -no-check-sections option. This option is not for
the gcc compiler but for the ld linker (-Wl,-no-check-sections).
Moreover, the _start label defined in _start.S should be the entry point of the run. The script uses the -nostartfiles option to prevent the linker from adding some OS related start code. Similarly, no library should be added (-nostdlib) to keep the code as short as possible in the code_ram array.
The test.elf file built by the linker is structured in sections. There may be many
sections and among them, a .text section for the code, a .rodata section for read-
only data, a .data section for the initialized global data and a .bss section for the
uninitialized global data.
Then, the run prints the disassembling and emulation of the addi first test (from
0300 to 0316) as shown in Listing 7.15.
Listing 7.15 The print of the rv32i_npp_ip function
...
0300: 00000093   li   ra, 0
        ra = 0 (0)
0304: 00008713   addi a4, ra, 0
        a4 = 0 (0)
0308: 00000393   li   t2, 0
        t2 = 0 (0)
0312: 00200193   li   gp, 2
        gp = 2 (2)
0316: 26771c63   bne  a4, t2, 948
        pc = 320 (140)
...
Then (see Listing 7.16), it prints the disassembling and emulation of the other tests run for addi until test 25 (addi test number 25 from 0924 to 0940). If all the tests for addi have been successful, the run continues at address 0980, i.e. RVTEST_PASS (0948 otherwise, i.e. RVTEST_FAIL). When the success has been reported in the result zone, the run continues at address 0144 (addi_ret).
Listing 7.16 The print of the rv32i_npp_ip function
...
0924: 02100093   li   ra, 33
        ra = 33 (21)
0928: 03208013   addi zero, ra, 50
0932: 00000393   li   t2, 0
        t2 = 0 (0)
0936: 01900193   li   gp, 25
        gp = 25 (19)
0940: 00701463   bne  zero, t2, 948
        pc = 944 (3b0)
0944: 02301263   bne  zero, gp, 980
        pc = 980 (3d4)
0980: 00000513   li   a0, 0
        a0 = 0 (0)
0984: 000022b7   lui  t0, 8192
        t0 = 8192 (2000)
0988: 00028293   addi t0, t0, 0
        t0 = 8192 (2000)
0992: 0002a303   lw   t1, 0(t0)
        t1 = 8196 (2004) (m[2000])
0996: 00a32023   sw   a0, 0(t1)
        m[2004] = 0 (0)
1000: 00430313   addi t1, t1, 4
        t1 = 8200 (2008)
1004: 0062a023   sw   t1, 0(t0)
        m[2000] = 8200 (2008)
1008: ca1ff06f   j    144
        pc = 144 (90)
...
All the instructions are successively tested until xor (test 2 from 30288 to 30320
in Listing 7.17).
Listing 7.17 The print of the rv32i_npp_ip function
...
30288: ff0100b7   lui  ra, 65536
        ra = -16711680 (ff010000)
30292: f0008093   addi ra, ra, -256
        ra = -16711936 (ff00ff00)
30296: 0f0f1137   lui  sp, -61440
        sp = 252645376 (f0f1000)
30300: f0f10113   addi sp, sp, -241
        sp = 252645135 (f0f0f0f)
30304: 0020c733   xor  a4, ra, sp
        a4 = -267390961 (f00ff00f)
30308: f00ff3b7   lui  t2, -4096
        t2 = -267390976 (f00ff000)
30312: 00f38393   addi t2, t2, 15
        t2 = -267390961 (f00ff00f)
30316: 00200193   li   gp, 2
        gp = 2 (2)
30320: 4a771063   bne  a4, t2, 31504
        pc = 30324 (7674)
...
The last test for xor is test number 27 (from 31468 to 31496). The last test report
is saved to the result zone (from 31536 to 31564; see Listing 7.18). The run ends
with the ret to 0.
Listing 7.18 The print of the rv32i_npp_ip function
...
31468: 111110b7   lui  ra, 69632
        ra = 286330880 (11111000)
31472: 11108093   addi ra, ra, 273
        ra = 286331153 (11111111)
31476: 22222137   lui  sp, 139264
        sp = 572661760 (22222000)
31480: 22210113   addi sp, sp, 546
        sp = 572662306 (22222222)
31484: 0020c033   xor  zero, ra, sp
31488: 00000393   li   t2, 0
        t2 = 0 (0)
31492: 01b00193   li   gp, 27
        gp = 27 (1b)
31496: 00701463   bne  zero, t2, 31504
        pc = 31500 (7b0c)
31500: 02301263   bne  zero, gp, 31536
        pc = 31536 (7b30)
31536: 00000513   li   a0, 0
        a0 = 0 (0)
31540: 000022b7   lui  t0, 8192
        t0 = 8192 (2000)
31544: 00028293   addi t0, t0, 0
        t0 = 8192 (2000)
31548: 0002a303   lw   t1, 0(t0)
        t1 = 8344 (2098) (m[2000])
31552: 00a32023   sw   a0, 0(t1)
        m[2098] = 0 (0)
31556: 00430313   addi t1, t1, 4
        t1 = 8348 (209c)
31560: 0062a023   sw   t1, 0(t0)
        m[2000] = 8348 (209c)
31564: dd8f806f   j    292
        pc = 292 (124)
0292: 00000093   li   ra, 0
        ra = 0 (0)
0296: 00008067   ret
        pc = 0 (0)
...
The registers are printed (prints from the rv32i_npp_ip function, after the do ...
while loop exit in Listing 7.19).
Listing 7.19 The print of the rv32i_npp_ip function
...
ra = 0 (0)
sp = 572662306 (22222222)
gp = 27 (1b)
tp = 2 (2)
t0 = 8192 (2000)
t1 = 8348 (209c)
t2 = 0 (0)
...
a1 = 96 (60)
a2 = 0 (0)
a3 = 9476 (2504)
a4 = 267390960 (ff00ff0)
a5 = 0 (0)
...
Eventually, the testbench main function prints the result of the tests as shown in
Listing 7.20.
Listing 7.20 The print of the main function in the testbench_riscv_tests_rv32i_npp_ip.cpp file
...
addi: all tests passed
add: all tests passed
andi: all tests passed
and: all tests passed
...
sub: all tests passed
sw: all tests passed
xori: all tests passed
xor: all tests passed
If a test failed, the message would be for example "addi : test 8 failed".
Experimentation
To run the riscv-tests on the development board, proceed as explained in 5.3.10,
replacing fetching_ip with rv32i_npp_ip.
The helloworld.c driver is the riscv-tests/my_isa/my_rv32ui/helloworld_rv32i_npp_ip.c file.
As for the simulation, the run on the board should print in the putty window that
all the tests passed.
  XRv32i_npp_ip_CfgInitialize(&ip, cfg_ptr);
  XRv32i_npp_ip_Set_start_pc(&ip, 0);
  XRv32i_npp_ip_Write_code_ram_Words(&ip, 0, code_ram,
    CODE_RAM_SIZE);
  XRv32i_npp_ip_Write_data_ram_Words(&ip, 0, data_ram,
    DATA_RAM_SIZE);
  XRv32i_npp_ip_Start(&ip);
  while (!XRv32i_npp_ip_IsDone(&ip));
  for (int i=0; i<38; i++) {
    printf("%s: ", name[i]);
    XRv32i_npp_ip_Read_data_ram_Words(&ip, 0x801+i, &d, 1);
    if (d == 0)
      printf("all tests passed\n");
    else
      printf("test %d failed\n", (int)d);
  }
}
All the source files related to the mibench benchmark suite can be found in the
mibench/my_mibench folder.
To further test your processor, you should run real programs and compare their
results with the ones computed with spike.
A benchmark suite is a set of programs used to compare processors. In your case, the benchmark suite has two goals: testing your processor and comparing different designs (as you will implement other versions of the RISC-V processor in the next chapters).
I have selected the mibench suite. It is made of applications devoted to embedded computing, which is a more realistic target for a home-made processor than high-performance computing applications.
There are more recent suites but the main problem with a processor on an FPGA is
the size of the programmable part of the SoC. The Zynq XC7Z020 offers 140 blocks
of BRAM, each block representing 4KB of data. Hence, the code and the data of the
RISC-V application to be run on a Pynq-Z1/Pynq-Z2 development board should not
be bigger than 560KB.
The mibench suite contains some oversized programs, which I had to remove.
Another constraint comes from the absence of an OS. All the benchmarks are
composed of three parts: input, computation, output. The I/O parts use the I/O func-
tions of the standard library, like scanf and printf. It is rather easy to substitute these
I/O functions with memory access operations. The scanf inputs are replaced by a set
of data initializing the processor data_ram array. The printf outputs are replaced by
a set of produced data saved to the data_ram array.
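As an illustration, here is a minimal sketch of the printf-to-store substitution (the result array, its size, and the output_int helper are illustrative names, not the actual benchmark code):

/* Sketch of the printf-to-store substitution (illustrative names). */
/* The result array ends up in the data sections, hence in the      */
/* data_ram array of the processor.                                 */
#define RESULT_SIZE 1024
int result[RESULT_SIZE];
unsigned int result_idx = 0;

void output_int(int x) {
  /* instead of: printf("%d\n", x); */
  result[result_idx++] = x; /* the testbench or driver reads it back */
}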
However, some applications use other OS related functions. The most frequently
used is malloc. The malloc function allocates memory handled by the OS, not by
the compiler. To use malloc on a bare-metal platform, the user needs to mimic the OS, implementing the management of a memory space outside the one addressed by the compiled code (the heap memory).
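For example, a minimal replacement for malloc on such a bare-metal target is a bump allocator carving blocks out of a static array. The sketch below is illustrative only and assumes the benchmark never frees memory:

/* Sketch: a bump allocator standing in for malloc (illustrative;  */
/* assumes allocations are never freed). The heap array lives in   */
/* .bss, hence in the data_ram array of the processor.             */
#define HEAP_SIZE (64*1024)
static unsigned char heap[HEAP_SIZE];
static unsigned int  heap_top = 0;

void *my_malloc(unsigned int size) {
  size = (size + 3) & ~3u;                /* keep 4-byte alignment */
  if (heap_top + size > HEAP_SIZE) return 0;
  void *p = &heap[heap_top];
  heap_top += size;
  return p;
}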
I decided to simplify the benchmark suite, and to keep only the applications which
were easy to port to the no-OS environment.
However, a benchmark suite should normally not be modified. It would not be fair to compare the performance of your RISC-V processor running the modified suite to the performance of an existing processor measured on the original suite.
But when you use a benchmark for your own comparisons (e.g. to compare the non-pipelined rv32i_npp_ip processor to the rv32i_pp_ip pipelined implementation presented in the next chapter), you can organize your benchmarking as you wish, as long as the adaptations of the benchmark are the same for every implementation of the processor. The modified benchmark should also be representative of the particularities of your designs.
Representative means, for example, that if you implement caches, you need to incorporate into your benchmark suite some programs with different memory access patterns, to measure the impact of cache misses on performance. If, on the contrary, the memory has a single access time, the benchmark suite can be simplified.
Experimentation
To simulate the mibench benchmarks, for example the basicmath_small benchmark, run the build.sh shell script in a terminal with current directory mibench/my_mibench/my_automotive/basicmath.
In Vitis_HLS, open the rv32i_npp_ip project and set the testbench file to the testbench_basicmath_rv32i_npp_ip.cpp file in the my_mibench/my_automotive/basicmath folder.
In the debug_rv32i_npp_ip.h file, comment out all the debugging constant definitions (you can open the file from the rv32i_npp_ip.cpp one, in the Outline frame).
Then, Run C Simulation.
You can work the same way with the other benchmarks proposed in the mibench/
my_mibench and riscv-tests/benchmarks folders.
In the original mibench suite, for each benchmark, two data sets are provided, a
small one and a large one. I have discarded the large data sets as either the text or
the data is too big to fit in the data_ram or the code_ram array (the main problem
is the size, not the time of the large run). Moreover, for some benchmarks (e.g. basicmath_small), I have reduced the computation to make it runnable on the FPGA (i.e. to stay within the limit of 140 BRAM blocks available on the Zynq XC7Z020 FPGA).
The basicmath_small code is big because it contains floating-point computations
and the compiler targets an ISA with no floating-point instructions (RV32I ISA).
Hence, the floating-point operations are computed with library functions using integer instructions. These functions of the mathematical library are linked with the
basicmath_small program.
The code in the basicmath_small.c file uses the printf function to display the
results. The spike simulator includes a display driver to print on your computer
screen. But the rv32i_npp_ip processor does not.
So, if you compile calls to printf, spike runs the compiled code, which prints. But
the same code run on the rv32i_npp_ip processor does not produce any output.
To have an output despite the lack of a driver, the rv32i_npp_ip processor saves
its results in the data_ram array. The testbench program or the FPGA driver reads
the data_ram array and reconstitutes the output.
The build.sh shell script (see Listing 7.25) compiles the sources, extracts the two binary files with objcopy, and translates them to hexadecimal with hexdump.
Listing 7.25 The build.sh file
$ cat build.sh
riscv32-unknown-elf-gcc -static -O3 -nostartfiles -o basicmath_small_no_print.elf -T linker.lds -Wl,-no-check-sections basicmath_small_no_print.c rad2deg.c cubic.c isqrt.c -lm
riscv32-unknown-elf-objcopy -O binary --only-section=.text basicmath_small_no_print.elf basicmath_small_no_print_0_text.bin
riscv32-unknown-elf-objcopy -O binary --only-section=.data basicmath_small_no_print.elf basicmath_small_no_print_0_data.bin
hexdump -v -e '"0x" /4 "%08x" ",\n"' basicmath_small_no_print_0_text.bin > basicmath_small_no_print_0_text.hex
hexdump -v -e '"0x" /4 "%08x" ",\n"' basicmath_small_no_print_0_data.bin > basicmath_small_no_print_0_data.hex
$
The linker.lds file describes three output sections: the text section named .text,
the initialized data section named .data and the uninitialized data section named .bss
(they are the sections named out of the curly braces).
These output sections are built from the concatenation of input sections taken
from an input ELF file (the input sections are named in the curly braces).
The .text output section is composed of all the input sections named .text.main, followed by all the input sections prefixed by .text. The .text output section starts at address 0 in the code RAM (". = 0" sets the current address to 0).
The .data output section also starts at address 0, in the data RAM. It is built from the concatenation of all the data sections in the input files, including the read-only data (the read-only data sections usually have the word rodata in their names).
The .bss output section starts after the .data section in the data RAM. It is built from the concatenation of all the bss sections in the input files.
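A linker script consistent with this description could look like the following sketch (reconstructed from the text above, not the actual linker.lds file; note that resetting the current address to 0 for .data is only accepted because build.sh passes -Wl,-no-check-sections to the linker):

SECTIONS
{
  . = 0;                              /* .text starts at 0 in the code RAM  */
  .text : { *(.text.main) *(.text*) }
  . = 0;                              /* .data starts at 0 in the data RAM  */
  .data : { *(.rodata*) *(.data*) }
  .bss  : { *(.bss*) }                /* .bss follows .data in the data RAM */
}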
For example, the basicmath_small_no_print.elf is built from the compilation
of four files: basicmath_small_no_print.c, rad2deg.c, cubic.c, and isqrt.c. Each
compilation builds a .text section. Thus, the .text output section in the basicmath_small_no_print.elf file is the concatenation of the four input .text sections
from the intermediate files produced by the compilation of the four ".c" files.
The place of the result global array in memory (0xb18) can be obtained with the objdump tool (see Listing 7.29), after the basicmath_small_no_print.elf file has been built by the build.sh script (the script also builds the basicmath_small_no_print_0_text.hex and basicmath_small_no_print_0_data.hex files).
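Listing 7.29 is not reproduced in this excerpt; one way to locate the symbol is to list the symbol table with objdump's -t option (the grep filter is just a convenience):

riscv32-unknown-elf-objdump -t basicmath_small_no_print.elf | grep result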
Experimentation
To run the basicmath_small benchmark on the development board, proceed as explained in Sect. 5.3.10, replacing fetching_ip with rv32i_npp_ip.
The helloworld.c driver is the mibench/my_mibench/my_automotive/
basicmath/helloworld_rv32i_npp_ip.c file.
The printed results should be identical to the simulation ones and to the spike run
ones.
The prints of the helloworld (run on the FPGA from Vitis IDE) and the testbench
(simulated in Vitis HLS) are identical.
To run the other benchmarks of the adapted mibench suite, you proceed the same
way.
For a benchmark named bench.c in the mibench/my_mibench/my_dir folder, first run make (the Makefile builds the bench executable; as bench.c is compiled with the riscv32-unknown-elf-gcc compiler and the standard linking, this executable is to be run only by spike, not by the rv32i_npp_ip processor).
Then, run spike on the bench executable. This produces the reference output file.
The second step is to build the bench_no_print_0_text.hex and bench_no_print_
0_data.hex files which are used by the testbench program (you can find pre-
built hex files in each benchmark folder). You run build.sh which compiles the
bench_no_print.c version of bench.c with no printing. The data and text sections
of the ELF files are extracted by objcopy and the hex files are built with hexdump.
The third step is to run the rv32i_npp_ip processor. You add the testbench_
bench_rv32i_npp_ip.cpp file found in the mibench/my_mibench/my_dir folder
and you start the Vitis_HLS simulation.
After the simulation, you can compare its output to the reference one. They should
be identical.
Then, you have to run the helloworld_rv32i_npp_ip.c on the FPGA (z1_rv32i_
npp_ip Vivado project).
Before building the Vitis IDE project from the helloworld_rv32i_npp_ip.c program, make sure to adapt the path to the hex files to your own environment. For this purpose, run the update_helloworld.sh shell script.
Notice that some mibench benchmarks are not suited to the Basys3 (XC7A35T)
or any board based on the XC7Z010 FPGA because of the number of BRAM blocks
available. On the XC7Z020 chip, there are 140 blocks, i.e. 560KB of memory. On
the XC7Z010 chip, there are only 60 blocks, i.e. 240KB and on the XC7A35T there
are only 50 blocks, i.e. 200KB.
I have added seven benchmarks which are included in the riscv-tests (in the riscv-
tests/benchmarks folder). Their testing procedure is identical to the mibench one
(each benchmark folder includes a build.sh shell script, an update_helloworld.sh
script and a Makefile to build the executable for spike).
Table 7.1 shows the execution time (in seconds) of the benchmarks on the FPGA implementation of the rv32i_npp_ip processor, computed with equation 5.1 (nmi * cpi * c, where c = 70 ns). The CPI value is 1 (each instruction is fully processed in a single loop iteration, i.e. in a single processor cycle).
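As a check with the first row of Table 7.1: basicmath runs nmi = 30,897,739 instructions with cpi = 1 and c = 70 ns, so its execution time is 30,897,739 * 1 * 70 ns = 2.162841730 s, the value reported in the table.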
These execution times are the run times on the FPGA (not considering the reconstitution of the output). The run times on the Vitis_HLS simulator are significantly
higher (for example, the mm benchmark is run in 11 s on the FPGA and 13 min on
the simulator).
Table 7.1 Number of machine instructions run (nmi) and execution times of the mibench and riscv-tests suites on the rv32i_npp_ip processor

Benchmark      NMI          Time (s)
basicmath      30,897,739   2.162841730
bitcount       32,653,239   2.285726730
qsort          6,683,571    0.467849970
stringsearch   549,163      0.038441410
rawcaudio      633,158      0.044321060
rawdaudio      468,299      0.032780930
crc32          300,014      0.021000980
fft            31,365,408   2.195578560
fft_inv        31,920,319   2.234422330
median         27,892       0.001952440
mm             157,561,374  11.029296180
multiply       417,897      0.029252790
qsort          271,673      0.019017110
spmv           1,246,152    0.087230640
towers         403,808      0.028266560
vvadd          16,010       0.001120700
7.4 Proposed Exercises: The RISC-V M and F Instruction Extensions

This section will make you modify the rv32i_npp_ip processor to add the M RISC-V extension.
I do not provide the resulting IP. You must design it yourself until it works on your
development board. I will simply list the different steps you will have to go through
to achieve the exercise.
First, you should check the RISC-V specification document to learn about the M
extension (Chap. 7 of [1]). In the same document, you will find the coding definition
of these new instructions in Chap. 24 (page 131: RV32M standard extension).
Second, you should open a new Vitis_HLS project named rv32im_npp_ip with a
top function having the same name. You can copy all the code from the rv32i_npp_ip
folder to the rv32im_npp_ip one as a starting point.
Third, you should update the rv32im_npp_ip.h file to add the new constants to
define the M extension (opcode, func3 and func7 fields in the decoding).
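As a sketch, the added constants could look as follows (the names are illustrative; the values are the RV32M encodings from the specification, where all eight instructions share a func7 of 0000001):

/* Sketch of M extension decoding constants (illustrative names). */
#define FUNC7_MULDIV 0b0000001 /* func7 shared by all RV32M instructions */
#define FUNC3_MUL    0b000     /* low 32 bits of the product             */
#define FUNC3_MULH   0b001     /* high 32 bits, signed x signed          */
#define FUNC3_MULHSU 0b010     /* high 32 bits, signed x unsigned        */
#define FUNC3_MULHU  0b011     /* high 32 bits, unsigned x unsigned      */
#define FUNC3_DIV    0b100
#define FUNC3_DIVU   0b101
#define FUNC3_REM    0b110
#define FUNC3_REMU   0b111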
Fourth, you should update the execute function of the rv32im_npp_ip code. The
fetch and the decode functions are not impacted (fetching a multiplication is not
different from fetching an addition, decoding, i.e. decomposing into pieces, is not
impacted either as the format of multiplication and division instructions is R-TYPE).
In the execute function, you will find the compute_result function which com-
putes a result according to the format of the instruction. The R-TYPE computation
calls the more specialized function compute_op_result which computes arithmetic
operations according to the func3 and func7 fields of the instruction encoding.
You must update the switch in the function to take care of the different new
encodings proposed in the M extension. For example, to multiply rv1 by rv2, just
write rv1 * rv2 in C and the synthesizer will do the rest, implementing a hardware
multiplier. For each RISC-V operation, you only need to find the matching C operator
(e.g. remember that the % operator in C is the remainder of the division).
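A sketch of the added cases is shown below (this is not the book's code; rv1 and rv2 are the two source values, INT_MIN comes from <limits.h>, and the explicit tests implement the results the RISC-V specification mandates for division by zero and overflow, which the C / and % operators do not give by themselves):

/* Sketch of M extension cases in compute_op_result (illustrative). */
/* RISC-V mandates: x/0 = -1, x%0 = x, INT_MIN/-1 = INT_MIN,        */
/* INT_MIN%-1 = 0.                                                  */
if (d_i.func7 == FUNC7_MULDIV) {
  switch (d_i.func3) {
    case FUNC3_MUL:  return (int)((unsigned)rv1 * (unsigned)rv2);
    case FUNC3_MULH: return (int)(((long long)rv1 * (long long)rv2) >> 32);
    case FUNC3_DIV:
      if (rv2 == 0) return -1;
      if (rv1 == INT_MIN && rv2 == -1) return INT_MIN; /* overflow */
      return rv1 / rv2;
    case FUNC3_REM:
      if (rv2 == 0) return rv1;
      if (rv1 == INT_MIN && rv2 == -1) return 0;       /* overflow */
      return rv1 % rv2;
    /* MULHSU, MULHU, DIVU, REMU are handled along the same lines */
  }
}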
Once your compute_op_result function has been adapted, you can simulate your
IP. You will need to write a RISC-V assembly code involving each of the newly
implemented instructions.
It is probably faster to write the tests directly in RISC-V assembly language than to compile them from C. Use the test_op.h code as a model.
If you want to produce assembly code from a C source, use the march option of
the compiler to get multiplication/division RISC-V instructions: "riscv32-unknown-
elf-gcc source.c -march=rv32im -mabi=ilp32 -o executable".
You keep the same testbench program and you make it run with your multiplication
and division test code.
When your IP is correctly simulated, you can start the synthesis. You will have
to adapt the processor cycle duration to the multiplication and division duration (the
processor cycle should be long enough to compute a division). This means changing
the HLS PIPELINE II value.
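For instance, the pragma in the main loop can be relaxed as sketched below (the value 8 is purely illustrative; pick the smallest II for which synthesis meets timing with the divider in place):

do {
#pragma HLS PIPELINE II=8 /* illustrative: long enough for the divider */
  ...
} while (is_running);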
It is quite inefficient to align the execution time of every instruction to that of the division. However, this inefficiency comes from the non-pipelined organization. In Chap. 9, I present the multicycle pipeline organization, which is suited to operators with different execution times.
Once the synthesis is done, you can export it.
In Vivado, you can build your new design involving your new rv32im_npp_ip.
Then, you create the HDL wrapper, you generate the bitstream and you export the
hardware, including the bitstream.
You switch your board on and you start a putty terminal.
In Vitis IDE, you create your application project with the exported bitstream, you generate an initial helloworld.c file, and you replace it with a driver producing the same results as your testbench program.
You open a connection with the board.
You program the FPGA with the bitstream and you run the driver on the develop-
ment board.
The whole process should take anywhere from a single day to a full week.
In the simulation phase, such code works fine. But it assumes that the u.i and u.f fields of the union share the same memory location, which is not the case for the synthesized version of the union.
In Vitis_HLS, I recommend using unions with care. They are safe if the use of the union is confined to a single function (e.g. set the union at the beginning of the function and use its fields in the function body).
This is what the Vitis documentation says about unions:
“Unlike C/C++ compilation, synthesis does not guarantee using the same memory
(in the case of synthesis, registers) for all fields in the union. Vitis HLS perform the
optimization that provides the most optimal hardware.”
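A sketch of a safe use, confined to a single function as recommended above (illustrative code, assuming a u.i/u.f union like the one mentioned in the text):

/* Sketch: a float/int union confined to one function (HLS-safe use). */
typedef union {
  unsigned int i;
  float        f;
} word_t;

static float bits_to_float(unsigned int bits) {
  word_t u;    /* set at the beginning of the function ...   */
  u.i = bits;
  return u.f;  /* ... and used only within the function body */
}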
7.5 Debugging Hints
If you are used to debugging your programs with fancy debuggers, you may find
debugging hardware rather spartan. However, there are a few tricks to make it possible
to debug an IP without going down to VHDL and chronograms.
The main hint comes from the remark that synthesis is not simulation. During synthesis, the code is not run. Hence, the code to be synthesized does not need to be runnable. You can easily deactivate some parts of your code.
As in all debugging work, the main technique is to isolate the faulty part (with the faulty code included there is an error; without it, there is none). In the same manner, when your constraints cannot be satisfied by the synthesizer, try to reduce your code.
You eliminate the error by deactivating functions or by commenting out some lines.
Do not be shy in doing so: remember that the synthesis does not run the code.
However, beware that the synthesizer will eliminate unused parts of the code and the
resulting synthesis might become empty.
When the constraints are satisfied, start adding back the discarded parts of the
code, step by step, until the problem arises again.
When the origin of the problem is detected, you can work on the code to reorganize
it in order to obtain a synthesizable version.
7.5.2 Infinite Simulation: Replace the "Do ... While" Loop by a "For" Loop
A second hint concerns the simulation of the processors. The do ... while loop may
sometimes run infinitely.
A way around this is to turn it into a for loop, running enough iterations to end
the RISC-V test program (all the test programs I have given terminate in a few tens
of cycles).
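A sketch of the transformation (the 1000 bound is illustrative; it merely has to exceed the number of cycles the test needs):

/* Sketch: bounding the simulation instead of do { ... } while (is_running). */
for (unsigned int cycle = 0; cycle < 1000; cycle++) {
  /* ... same body as the original do ... while loop ... */
  if (!is_running) break; /* normal termination still exits early */
}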
A third hint concerns frozen IPs on the FPGA. Sometimes, your synthesis seems fine
but when you run the bitstream on the board, nothing prints. This is because your IP
did not reach the end of the do ... while loop. Hence, no IsDone signal is sent from the IP to the Zynq and your helloworld.c driver stays stuck, waiting forever in the "while (!X...IsDone(&ip));" loop.
This might come from a missing bit in an ap_uint or ap_int variable.
For example, the code shown in Listing 7.36 illustrates a badly typed loop counter.
230 7 Testing Your RISC-V Processor
The simulation will be correct because the compiler replaces the loop_counter_t type with a char type (the smallest C-defined type large enough to hold the loop_counter_t values). But the synthesizer will produce an RTL with exactly four bits for i. The i++ operation is computed on four bits, i.e. modulo 16, so i never reaches the value 16. Hence, on the FPGA, the loop never ends.
Please refer back to the end of Sect. 5.5.4 to find the correct way to declare i.
An IP run may also block because some necessary computation is within #ifndef
__SYNTHESIS__ and #endif. In this case, the simulation may be correct and the
FPGA run incorrect.
When an IP is not exiting, you can try to replace the while loop waiting for the
Done signal by an empty for loop in the helloworld driver (i.e. replace "while
(!X..._ip_IsDone(...));" by "for (int i=0; i<1000000; i++);").
The empty for loop does not wait for the IP to finish. The remainder of the driver should print the state of the memory, giving some indication of how far the run went before the IP got stuck.
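For instance, with the rv32i_npp_ip driver, using the Read_data_ram_Words access function already shown above (the bound and the dump range are illustrative):

/* Sketch: do not block on IsDone; dump the data memory instead. */
for (int i = 0; i < 1000000; i++);  /* bounded wait, illustrative */
XRv32i_npp_ip_Read_data_ram_Words(&ip, 0, data_ram, DATA_RAM_SIZE);
for (int w = 0; w < DATA_RAM_SIZE; w++) /* print non null words */
  if (data_ram[w] != 0)
    printf("m[%8x] = %d\n", 4*w, data_ram[w]);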
Another hint is related to the fact that your IP is a processor running RISC-V code. As the IP is built to run any RISC-V code, you can eliminate some parts of the RISC-V code directly in the hex file included by the helloworld driver (e.g. comment out the hexadecimal codes you want to remove).
For example, you can limit the RISC-V code to its last RET instruction and see if your IP runs properly. By reducing the RISC-V code, you can isolate (i.e. eliminate) the faulty instructions and know which part of your HLS code should be corrected.
If multiple runs of the same RISC-V code produce different results on the FPGA,
it probably means that some variable in the IP implementation code has not been
initialized. There are no implicit initializations on the FPGA as there are in C. Hence, the simulation might be correct while the run on the FPGA is faulty.
Even though there is no debugger for the IP itself (Vitis IDE has a debugger but
it concerns only the helloworld driver, not the IP of course), you can make your
processor IP give you some information about the progression of the run.
Instead of printing a message, which the RISC-V code cannot do, you can add
some STORE instructions to the RISC-V code (of course, your IP must at least be
able to run STORE instructions properly). In your driver program running in Vitis
IDE, you can check the stores, e.g. read the respective memory addresses, like you
would check successive prints.
Reference
1. https://riscv.org/specifications/isa-spec-pdf/
8 Building a Pipelined RISC-V Processor
Abstract
This chapter will make you build your second RISC-V processor. The imple-
mented microarchitecture proposed in this second version is pipelined. Within a
single processor cycle, the updated processor fetches and decodes instruction i,
executes instruction i-1, accesses memory for instruction i-2 and writes a result
back for instruction i-3.
8.1 First Step: Control a Pipeline

All the source files related to the simple_pipeline_ip can be found in the simple_pipeline_ip folder.
Figure 8.1 shows the difference between a non-pipelined microarchitecture (as the
rv32i_npp_ip built in Chap. 6) and a pipelined one. The main difference is in the
way the iteration of the do ... while loop is considered.
In the non-pipelined design, one iteration of the main loop (rectangular red box)
contains all four of the main steps of an instruction processing: fetch, decode, execute
(including memory access), and writeback. This was done in seven FPGA cycles on
the rv32i_npp_ip design. One loop iteration matches the full processing of one
instruction (blue box) from fetch to writeback. In the non-pipelined design, the blue
box (which surrounds the instruction execution) and the red box (which surrounds
the iteration) are identical.
In a pipelined design with multiple pipeline stages (two stages in Fig. 8.1), one iteration (vertical red box) contains the execute and writeback ending steps of an instruction and the fetch and decode starting steps of the next one.
With these two techniques (permutation and duplication), the loop contains inde-
pendent functions which the synthesizer schedules in parallel.
I have added structured variables encapsulating what each of the two stages computes
for the next iteration.
• f_to_f and f_to_e are variables written by the fetch_decode stage at the end of the iteration. At iteration i, the fetch_decode stage saves in them what is to be used at iteration i+1, by the fetch_decode stage and by the execute_wb stage respectively. f_to_f stands for a communication link between the fetch_decode stage and itself. f_to_e stands for a communication link between the fetch_decode stage and the execute_wb stage.
• e_to_f and e_to_e are similar variables written by the execute_wb stage.
I have also added matching variables following the naming scheme y_from_x
and x_to_y for stages x and y, e.g. e_from_f and f_to_e. Variable y_from_x serves
as a duplication of variable x_to_y. Variable x_to_y is copied into y_from_x at the
beginning of the iteration (like pc_ip1 was copied into pci).
I added the other _from_ variables to make all the function calls fully permutable. It turns out that when the calls are ordered with fetch_decode before execute_wb, the Vivado implementation uses fewer resources than when they are in the reverse order. I have no explanation for this.
The simple_pipeline_ip.h file (see Listing 8.2) contains the type definitions for
the _from_ and _to_ variable declarations:
Listing 8.2 The inter-stage transmission types definitions in the simple_pipeline_ip.h file
typedef struct from_f_to_f_s {
  code_address_t next_pc;
} from_f_to_f_t;
typedef struct from_f_to_e_s {
  code_address_t        pc;
  decoded_instruction_t d_i;
#ifndef __SYNTHESIS__
#ifdef DEBUG_DISASSEMBLE
  instruction_t instruction;
#endif
#endif
} from_f_to_e_t;
typedef struct from_e_to_f_s {
  code_address_t target_pc;
  bit_t          set_pc;
} from_e_to_f_t;
typedef struct from_e_to_e_s {
  bit_t cancel;
} from_e_to_e_t;
The fetch_decode stage sends its computed next pc to itself (for non branching/jumping instructions) with f_to_f.next_pc. It sends the fetching pc and the decoded instruction d_i to the next iteration's execute_wb stage with f_to_e.pc and f_to_e.d_i respectively.
The execute_wb stage sends its computed target_pc and the set_pc bit to the next iteration's fetch_decode stage with e_to_f.target_pc and e_to_f.set_pc respectively. The set_pc bit indicates whether the executed instruction is a branching/jumping one.
The execute_wb stage sends a cancel bit to itself with e_to_e.cancel. This cancel
bit is set if the executed instruction is a branching/jumping one (hence, set_pc sent to
the fetch_decode stage and cancel sent to execute_wb stage have the same value).
The cancel bit indicates that the next instruction execution should be cancelled (this
is explained in Sect. 8.1.4).
The f_to_f, f_to_e, e_to_f, e_to_e, and the matching _from_ variables are de-
clared in the simple_pipeline_ip function. They are initialized to initiate the pipeline
as if the previously executed instruction were a branching/jumping one, with start_pc as the target (e_to_f.set_pc is set). The execute_wb stage should not do any work during the first iteration (e_to_e.cancel is set).
The main loop (see Listing 8.4) contains the current fetch (instruction i, fetch_
decode function) and the execution of the instruction fetched and decoded in the
previous iteration (instruction i-1, execute_wb function).
The Initiation Interval is set to 5 (II = 5). This leads to a processor cycle of five FPGA cycles (5 * 10 ns = 50 ns, i.e. 20 MHz).
Listing 8.4 The simple_pipeline_ip function: main loop
...
do {
#pragma HLS PIPELINE II=5
  f_from_f = f_to_f; e_from_f = f_to_e;
  f_from_e = e_to_f; e_from_e = e_to_e;
  fetch_decode(f_from_f, f_from_e, code_ram, &f_to_f, &f_to_e);
  execute_wb(e_from_f, e_from_e, reg_file, data_ram, &e_to_f, &e_to_e);
  statistic_update(e_from_e.cancel, &nbi);
  running_cond_update(e_from_e.cancel, e_from_f.d_i.is_ret,
                      e_to_f.target_pc, &is_running);
} while (is_running);
...
}
When the decoded instruction is a control flow one (i.e. either a jump or a taken
conditional branch), the next fetch address (the computed control flow target address)
is given by the execute_wb function of the next iteration.
Hence, the next iteration fetch_decode function fetches at a wrong address (pc
+ 1 instead of the computed target address).
To avoid running wrong instructions, we can cancel an execution by blocking its
updates (at least, destination register writes and memory stores).
This is a general pattern: an instruction i has an impact on a later one i+k (e.g.
i cancels i+1). A value computed by i is forwarded to i+k (e.g. a cancel bit is
forwarded).
In the control flow instruction case, there are three forwarded values: the set_pc
bit, the computed target address and the cancel bit. The set_pc bit is set by the
execute_wb function if the instruction is a JAL, a JALR, or a taken branch. For any
other instruction, the set_pc bit is cleared. The cancel bit is a copy of the set_pc bit.
The set_pc bit is used by the fetch_decode function called in the next iteration.
It is transmitted by the e_to_f and f_from_e variables and used to decide which pc
is valid for the next fetch: the next_pc computed in the fetch_decode stage (set_pc
bit is 0) or the target_pc forwarded by the execute_wb stage (set_pc bit is 1).
Figure 8.3 illustrates this double transmission (in magenta) between the red rectangle (iteration i) and the green adjacent one (next iteration i+1).
The cancellation bit is forwarded by the execute_wb stage to itself (see the blue
arrow in Fig. 8.4). When the cancel bit is set, it cancels the execution in the next
iteration.
[Figures 8.3 and 8.4 (not reproduced here): at iteration i+1, the fetch_decode stage uses f_from_f.next_pc as pc if set_pc is 0 and f_from_e.target_pc if set_pc is 1; the execute_wb stage of iteration i+1 is cancelled if e_to_e.cancel is 1.]
The fetch_decode function code (in the fetch_decode.cpp file) is shown in Listing
8.5.
The fetch pc is either the next pc computed by the fetch_decode function or
the target pc computed by the execute_wb function at the preceding iteration. The
selection bit is set_pc, computed by the execute_wb function at the preceding iteration. The fetch_decode function sends pc to the execute_wb function through
f_to_e->pc.
Listing 8.5 The fetch_decode function
void fetch_decode(
  from_f_to_f_t  f_from_f,
  from_e_to_f_t  f_from_e,
  unsigned int  *code_ram,
  from_f_to_f_t *f_to_f,
  from_f_to_e_t *f_to_e){
  code_address_t pc;
  instruction_t  instruction;
  pc = (f_from_e.set_pc) ?
        f_from_e.target_pc : f_from_f.next_pc;
  fetch(pc, code_ram, &(f_to_f->next_pc), &instruction);
  decode(instruction, &(f_to_e->d_i));
  f_to_e->pc = pc;
#ifndef __SYNTHESIS__
#ifdef DEBUG_DISASSEMBLE
  f_to_e->instruction = instruction;
#endif
#endif
}
The fetch function code (in the fetch.cpp file) is shown in Listing 8.6.
Listing 8.6 The fetch function
void fetch(
  code_address_t  pc,
  instruction_t  *code_ram,
  code_address_t *next_pc,
  instruction_t  *instruction){
  *next_pc     = (code_address_t)(pc + 1);
  *instruction = code_ram[pc];
}
The decode function (in the decode.cpp file) is unchanged, except for the addition of an is_jal bit (a new field in the decoded_instruction_t type, set in the decode_instruction function). The is_jal bit is used in the execute function.
The execute_wb function (see Listings 8.7 to 8.10) is located in the execute_wb.cpp
file.
It receives e_from_f and e_from_e and sends e_to_f and e_to_e.
Listing 8.7 The execute_wb function prototype and local declarations
void execute_wb(
  from_f_to_e_t  e_from_f,
  from_e_to_e_t  e_from_e,
  int           *reg_file,
  int           *data_ram,
  from_e_to_f_t *e_to_f,
  from_e_to_e_t *e_to_e){
  int   rv1, rv2, rs, op_result, result;
  bit_t bcond, taken_branch;
  ...
The execute_wb stage stays idle when the e_from_e.cancel bit is set. In the same
iteration, the fetch_decode function fetches the target_pc instruction (see iteration
i in Fig. 8.5).
When the e_from_e.cancel bit is set, the execute_wb function does not compute
anything (see Listing 8.8). It only clears the e_to_f.set_pc and the e_to_e.cancel
bits.
Listing 8.8 The execute_wb function computation: cancellation
...
if (e_from_e.cancel) {
  e_to_f->set_pc = 0;
  e_to_e->cancel = 0;
}
...
Figure 8.6 shows what happens on the i+1 iteration after cancellation, due to
cleared set_pc and cancel bits.
Figure 8.7 shows the pipeline if the target instruction is also a taken jump.
When the e_from_e.cancel bit is not set, the execute_wb stage executes the instruction fetched in the preceding iteration and computes the values of the e_to_f.set_pc and e_to_e.cancel bits.
Listing 8.9 The execute_wb function computation: no cancellation
...
else {
  read_reg(reg_file, e_from_f.d_i.rs1, e_from_f.d_i.rs2, &rv1, &rv2);
  bcond        = compute_branch_result(rv1, rv2, e_from_f.d_i.func3);
  taken_branch = e_from_f.d_i.is_branch && bcond;
  rs = (e_from_f.d_i.is_r_type) ?
        rv2 : (int)e_from_f.d_i.imm;
  op_result = compute_op_result(e_from_f.d_i, rv1, rs);
The debugging prints are all placed in the execute_wb function. They are organized to produce the same outputs as the rv32i_npp_ip design.
Listing 8.10 The execute_wb function computation: debugging prints
...
#ifndef __SYNTHESIS__
#ifdef DEBUG_FETCH
  printf("%04d: %08x ",
         (int)(e_from_f.pc<<2), e_from_f.instruction);
#ifndef DEBUG_DISASSEMBLE
  printf("\n");
#endif
#endif
#ifdef DEBUG_DISASSEMBLE
  disassemble(e_from_f.pc, e_from_f.instruction,
              e_from_f.d_i);
#endif
#ifdef DEBUG_EMULATE
  emulate(reg_file, e_from_f.d_i, e_to_f->target_pc);
#endif
#endif
}
}
The computation, memory access, register read, and write functions are mostly
unchanged (they are all located in the execute.cpp file).
Experimentation
To simulate the simple_pipeline_ip, operate as explained in Sect. 5.3.6, replacing
fetching_ip with simple_pipeline_ip.
You can play with the simulator, replacing the included test_mem_0_text.hex
file with any other .hex file you find in the same folder.
All the functions are inlined (pragma HLS INLINE recursive) to optimize the size
and speed.
The testbench code (in the testbench_simple_pipeline_ip.cpp file) is unchanged and the eight test files remain the same, with an identical output.
Figure 8.8 shows the synthesis report. The iteration latency is five FPGA cycles (Fig. 8.9).
There is a timing violation in cycle 2 of the iteration scheduling; it can be ignored.
Experimentation
To run the simple_pipeline_ip on the development board, proceed as explained
in Sect. 5.3.10, replacing fetching_ip with simple_pipeline_ip.
You can play with your IP, replacing the included test_mem_0_text.hex file with
any other .hex file you find in the same folder.
The helloworld.c file is shown in Listing 8.11 (do not forget to adapt the path
to the hex file to your environment with the update_helloworld.sh shell script). To
run another test program with the simple_pipeline_ip, you just need to update the
"#include test_mem_0_text.hex" line.
Listing 8.11 The helloworld.c file to run test_mem_0_text.hex
#include <stdio.h>
#include "xsimple_pipeline_ip.h"
#include "xparameters.h"
#define LOG_CODE_RAM_SIZE 16
// size in words
#define CODE_RAM_SIZE (1<<LOG_CODE_RAM_SIZE)
#define LOG_DATA_RAM_SIZE 16
// size in words
#define DATA_RAM_SIZE (1<<LOG_DATA_RAM_SIZE)
XSimple_pipeline_ip_Config *cfg_ptr;
XSimple_pipeline_ip ip;
The run on the FPGA prints on the putty terminal what is shown in Listing 8.12.
Listing 8.12 The helloworld output
88 fetched and decoded instructions
data memory dump (non null words)
m[ 0] = 1 ( 1)
m[ 4] = 2 ( 2)
m[ 8] = 3 ( 3)
m[ c] = 4 ( 4)
m[ 10] = 5 ( 5)
m[ 14] = 6 ( 6)
m[ 18] = 7 ( 7)
m[ 1c] = 8 ( 8)
m[ 20] = 9 ( 9)
m[ 24] = 10 ( a)
m[ 2c] = 55 ( 37)
It is wise to pass the riscv-tests: some of the tests are specific to pipelines and check that dependencies are respected.
To pass the riscv-tests on the Vitis_HLS simulator, you just need to use the testbench_riscv_tests_simple_pipeline_ip.cpp program in the riscv-tests/my_isa/my_rv32ui folder as the testbench.
To pass the riscv-tests on the FPGA, you must use the helloworld_simple_
pipeline_ip.c in the riscv-tests/my_isa/my_rv32ui folder.
Normally, since you already ran the update_helloworld.sh shell script for the rv32i_npp_ip processor, you should have adapted the helloworld_simple_pipeline_ip.c paths to your environment. If you did not, you must run the update_helloworld.sh shell script.
Table 8.1 Execution time of the benchmarks on the 2-stage pipelined simple_pipeline_ip processor

Suite        Benchmark     Cycles       CPI   Time (s)     Baseline time (s)  Improvement (%)
mibench      basicmath     37,611,825   1.22  1.880591250  2.162841730        13
mibench      bitcount      37,909,997   1.16  1.895499850  2.285726730        17
mibench      qsort         8,080,719    1.21  0.404035950  0.467849970        14
mibench      stringsearch  634,391      1.16  0.031719550  0.038441410        17
mibench      rawcaudio     747,169      1.18  0.037358450  0.044321060        16
mibench      rawdaudio     562,799      1.20  0.028139950  0.032780930        14
mibench      crc32         330,014      1.10  0.016500700  0.021000980        21
mibench      fft           38,221,438   1.22  1.911071900  2.195578560        13
mibench      fft_inv       38,896,511   1.22  1.944825550  2.234422330        13
riscv-tests  median        35,469       1.27  0.001773450  0.001952440        9
riscv-tests  mm            193,397,241  1.23  9.669862050  11.029296180       12
riscv-tests  multiply      540,022      1.29  0.027001100  0.029252790        8
riscv-tests  qsort         322,070      1.19  0.016103500  0.019017110        15
riscv-tests  spmv          1,497,330    1.20  0.074866500  0.087230640        14
riscv-tests  towers        418,189      1.04  0.020909450  0.028266560        26
riscv-tests  vvadd         18,010       1.12  0.000900500  0.001120700        20
8.2 Second Step: Slice a Pipeline into Stages
All the source files related to the rv32i_pp_ip can be found in the rv32i_pp_ip folder.
As I have already pointed out, the processing of a single instruction is done in several
steps: fetch, decode, execute (which computes a memory address if the instruction
is a memory access), memory access, and writeback.
The 2-stage pipeline organization can be further refined to divide the instruction
processing in four phases: fetch and decode, execute, access to memory, register
writeback (the fetch and decode steps are kept in the same pipeline stage as in the
simple_pipeline_ip design; in Chap. 9, I will design a pipeline with separate fetch
and decode stages).
If the instruction is not a memory access, nothing is done during the memory access stage except propagating the already computed result to the writeback stage.
Figure 8.12 shows a 4-stage pipeline: fetch and decode (f+d), execute, memory
access (mem), and writeback (wb). The green horizontal rectangle is the 4-step
processing of a single instruction. The red vertical rectangle is what should be inside
the main loop of the top function.
[Figure 8.13 (not reproduced here): the execute stage "e" receives e_from_f and e_from_e (which carries the cancel bit) and sends e_to_f, e_to_e, and e_to_m; the memory access stage "m" receives m_from_e and sends m_to_w; the writeback stage "w" receives w_from_m.]
The transmission from the m stage to the w stage (see Listing 8.14) sends the
computed result (either the result received from the e stage or the one loaded in the m
stage) and propagates the destination rd and the decoded bits is_ret and has_no_dest.
It propagates the cancel bit received from the e stage.
The rv32i_pp_ip top function (rv32i_pp_ip.cpp file) is shown in Listings 8.15 and
8.16.
The rv32i_pp_ip top function adds a new nb_cycle argument. The cancellations imply that the pipeline does not output one instruction per cycle; hence, the number of instructions run is not equal to the number of cycles of the run.
From the nb_cycle number of cycles and the nb_instruction number of instruc-
tions, I compute the IPC (nb_instruction/nb_cycle) in the testbench result (the IPC
is the inverse of the CPI).
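For example, in the test_mem run shown in Listing 8.29 below, 88 instructions are fetched and decoded in 119 cycles, giving IPC = 88/119 ≈ 0.74 (i.e. CPI ≈ 1.35).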
The cancel and set_pc bits are set to initialize the pipeline: all the stages are
cancelled except the fetch and decode one (e_to_e.cancel set to cancel stage e,
e_to_m.cancel set to cancel stage m and m_to_w.cancel set to cancel stage w).
Listing 8.15 The rv32i_pp_ip function prototype, local declarations, and initializations
void rv32i_pp_ip(
  unsigned int  start_pc,
  unsigned int  code_ram[CODE_RAM_SIZE],
  int           data_ram[DATA_RAM_SIZE],
  unsigned int *nb_instruction,
  unsigned int *nb_cycle){
#pragma HLS INTERFACE s_axilite port=start_pc
#pragma HLS INTERFACE s_axilite port=code_ram
#pragma HLS INTERFACE s_axilite port=data_ram
#pragma HLS INTERFACE s_axilite port=nb_instruction
#pragma HLS INTERFACE s_axilite port=nb_cycle
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INLINE recursive
  int reg_file[NB_REGISTER];
#pragma HLS ARRAY_PARTITION variable=reg_file dim=1 complete
  from_f_to_f_t f_to_f, f_from_f;
  from_f_to_e_t f_to_e, e_from_f;
  from_e_to_f_t e_to_f, f_from_e;
  from_e_to_e_t e_to_e, e_from_e;
  from_e_to_m_t e_to_m, m_from_e;
  from_m_to_w_t m_to_w, w_from_m;
  bit_t         is_running;
  unsigned int  nbi;
  unsigned int  nbc;
  for (reg_num_p1_t i=0; i<NB_REGISTER; i++) reg_file[i] = 0;
  e_to_f.target_pc = start_pc;
  e_to_f.set_pc    = 1;
  e_to_e.cancel    = 1;
  e_to_m.cancel    = 1;
  m_to_w.cancel    = 1;
  nbi = 0;
  nbc = 0;
  ...
The rv32i_pp_ip main loop (see Listing 8.16) reflects the 4-stage pipeline orga-
nization with six parallel calls: fetch_decode to fetch and decode instruction i+3,
execute to execute instruction i+2, mem_access to access memory for instruction
i+1, wb to writeback the result of instruction i, statistic_update to compute the num-
ber of instructions and the number of cycles of the run, and running_cond_update
to update the running condition.
When the loop starts, the pipeline is empty. All of the stages except the
fetch_decode one receive a cancel input bit set by the copy from the _to_ vari-
ables to the _from_ ones.
The HLS PIPELINE pragma sets the Initiation Interval (II) to 3. Hence, the processor cycle is three FPGA cycles (30 ns, 33 MHz).
Listing 8.16 The rv32i_pp_ip do ... while loop and return
...
do {
#pragma HLS PIPELINE II=3
  f_from_f = f_to_f; f_from_e = e_to_f; e_from_f = f_to_e;
  e_from_e = e_to_e; m_from_e = e_to_m; w_from_m = m_to_w;
  fetch_decode(f_from_f, f_from_e, code_ram, &f_to_f, &f_to_e);
  execute(f_to_e, e_from_f, e_from_e.cancel,
          m_from_e.cancel, m_from_e.d_i.has_no_dest,
          m_from_e.d_i.rd, m_from_e.result,
          w_from_m.cancel, w_from_m.has_no_dest,
          w_from_m.rd, w_from_m.result, reg_file,
          &e_to_f, &e_to_e, &e_to_m);
  mem_access(m_from_e, data_ram, &m_to_w);
  wb(w_from_m, reg_file);
  statistic_update(w_from_m.cancel, &nbi, &nbc);
  running_cond_update(w_from_m.cancel, w_from_m.is_ret,
                      w_from_m.result, &is_running);
} while (is_running);
*nb_cycle       = nbc;
*nb_instruction = nbi;
#ifndef __SYNTHESIS__
#ifdef DEBUG_REG_FILE
  print_reg(reg_file);
#endif
#endif
}
In the execute stage, the sources are read from the register file. However, two prior instructions are still in progress in the pipeline: they have computed their result but have not yet written it back to the register file. If one of the executed instruction's sources is a register to be updated by one of the two preceding instructions, that result should bypass the register source read, as illustrated in Fig. 8.14.
The blue result is the one computed in the m stage and the red result is the one to
be written back in the w stage.
The mux 3->1 boxes are 3-to-1 multiplexers. Each multiplexer outputs its upper
input if the m stage is not cancelled, if the instruction has a destination, and if the
destination is the same as the e stage source (rs1 for the upper multiplexer, rs2 for
the lower one). If not, the multiplexer outputs its middle input if the w stage is not
cancelled, if the instruction has a destination, and if the destination is the same as
the e stage source. If not, the multiplexer outputs its lower input (i.e. the value read
from the register file).
Hence, the register sources are taken in priority order from the m stage value,
from the w stage value, or from the register file read value.
In any pipeline, the priority for the bypass mechanism orders the pipeline stages
from the pipeline stage next to the stage reading the register file (in this 4-stage
pipeline, the execute stage reads the register file) up to the writeback stage.
Unfortunately, the bypass mechanism degrades the critical path as it inserts a
3-to-1 multiplexer between the register file read and the ALU.
[Figure 8.14 (not reproduced here): the rs1 and rs2 values read from the register file each go through a 3-to-1 multiplexer ("mux 3->1") that selects among the result from the m stage, the result from the w stage, and the register file value, before feeding the arithmetic and logic unit (ALU).]
Moreover, the ALU input depends on the values coming from the m and w stages.
The value transmitted by the w stage is not in the critical path because it is set at the
cycle start and not modified by the writeback (the register file access time is longer
than the value transmission from the w stage).
But for the m stage, the value to transmit is loaded from memory if the instruction
is a load. The memory access time is much longer than the register file access time.
So, for a load, the m stage value transmission is in the critical path.
To limit the increase of the critical path to a 3-to-1 multiplexer, what I transfer from the m stage to the multiplexers' upper input is not the value computed in the
m stage but only the one received by the m stage and propagated from the e stage
(hence, bypassing applies to the e stage ALU computed values, not to the m stage
memory loads; for LOAD instructions, there is a special processing explained in Sect.
8.2.7).
The execute function calls the get_source function, which calls the bypass function.
The bypass function in the execute.cpp file (see Listing 8.17) selects which
source is the most up-to-date between the mem_result and the wb_result. The
mem_result value sent by the m stage has priority over the wb_result sent by the
w stage (more precisely, the bypass function returns the mem_result value if the m
stage is bypassing, otherwise it returns the wb_result value).
Listing 8.17 The bypass function
static int bypass(
  bit_t m_bp,
  int   mem_result,
  int   wb_result){
  if (m_bp) return mem_result;
  else      return wb_result;
}
The get_source function in the execute.cpp file (see Listing 8.18) computes
bypass conditions for both rs1 and rs2 sources (bypass_rs1 and bypass_rs2).
Each condition is set when the bypass applies either for the m or w stage.
For example, the bypass applies for the rs1 source of the instruction i processed in stage e if the following four conditions are all true: stage m is not cancelled, the instruction j processed in stage m has a destination, the rs1 register source of instruction i is not register zero, and instruction j has register rs1 as its destination. In this case, the Boolean variable m_bp_1 is set.
The Boolean variable w_bp_1 is set if the same four conditions are true, replacing
the m stage with the w stage.
The Boolean variables m_bp_2 and w_bp_2 are set under the same conditions, replacing the rs1 register with the rs2 register.
When a bypass condition is set (bypass_rs1 or bypass_rs2), the bypass function
is called to choose between the mem_result and the wb_result values. Otherwise,
the source is set to the value read from the register file and received as an argument
(r1 for rs1 and r2 for rs2).
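Listing 8.18 is not reproduced in this excerpt; a sketch of its rs1 half, following the conditions just described (the variable names come from the text, the exact structure is illustrative):

/* Sketch of the rs1 bypass conditions in get_source (illustrative). */
bit_t m_bp_1 = !m_cancel && !m_has_no_dest &&
               (rs1 != 0) && (m_rd == rs1);
bit_t w_bp_1 = !w_cancel && !w_has_no_dest &&
               (rs1 != 0) && (w_rd == rs1);
bit_t bypass_rs1 = m_bp_1 || w_bp_1;
*rv1 = (bypass_rs1) ? bypass(m_bp_1, m_result, w_result) : r1;
/* rs2 is handled identically with m_bp_2, w_bp_2, bypass_rs2 and r2 */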
  from_e_to_e_t *e_to_e,
  from_e_to_m_t *e_to_m){
  int       r1, r2, rv1, rv2, rs;
  int       c_op_result, c_result, result;
  reg_num_t rs1, rs2;
  bit_t     bcond, taken_branch, load_delay, is_rs1_reg, is_rs2_reg;
  opcode_t  opcode;
  rs1 = e_from_f.d_i.rs1;
  rs2 = e_from_f.d_i.rs2;
  r1  = read_reg(reg_file, rs1);
  r2  = read_reg(reg_file, rs2);
  get_source(r1, r2,
             e_from_f,
             m_cancel,
             m_has_no_dest,
             m_rd,
             m_result,
             w_cancel,
             w_has_no_dest,
             w_rd,
             w_result,
             &rv1, &rv2);
  ...
Once the sources are known, the execute function computes (see Listing 8.20)
the branch condition (unchanged compute_branch_result function), the opera-
tion result (unchanged compute_op_result function), the instruction typed result
(compute_result function) and the next pc (unchanged compute_next_pc func-
tion).
Listing 8.20 The execute function to compute
...
  bcond        = compute_branch_result(rv1, rv2, e_from_f.d_i.func3);
  taken_branch = e_from_f.d_i.is_branch && bcond;
  rs = (e_from_f.d_i.is_r_type) ?
        rv2 : (int)e_from_f.d_i.imm;
  c_op_result = compute_op_result(e_from_f.d_i, rv1, rs);
  c_result    = compute_result(e_from_f.pc, e_from_f.d_i, rv1);
  result      = (e_from_f.d_i.is_r_type ||
                 e_from_f.d_i.is_op_imm) ?
                 c_op_result : c_result;
  e_to_f->target_pc =
    compute_next_pc(e_from_f.pc, e_from_f.d_i, rv1, bcond);
...
If the load_delay bit is set, the current fetch must be cancelled as explained in
Sect. 8.2.7. The target_pc is set as the current fetch pc (i.e. the instruction next to
the load should be refetched).
Listing 8.23 The execute function to set the load_delay bit
...
  opcode = f_to_e.d_i.opcode;
  is_rs1_reg = ((opcode != JAL) && (opcode != LUI) &&
                (opcode != AUIPC) && (f_to_e.d_i.rs1 != 0));
  is_rs2_reg = ((opcode != OP_IMM) && (opcode != LOAD) &&
                (opcode != JAL) && (opcode != JALR) &&
                (opcode != LUI) && (opcode != AUIPC) &&
                (f_to_e.d_i.rs2 != 0));
  load_delay = !e_cancel && e_from_f.d_i.is_load &&
               ((is_rs1_reg && (e_from_f.d_i.rd == f_to_e.d_i.rs1)) ||
                (is_rs2_reg && (e_from_f.d_i.rd == f_to_e.d_i.rs2)));
  if (load_delay) e_to_f->target_pc = e_from_f.pc + 1;
...
The set_pc bit is set (see Listing 8.24) if the instruction is a non cancelled jump,
a taken branch, or if the load_delay bit is set.
The execute stage sends the set_pc bit to the f stage.
The execute stage sends a cancel bit to itself (the cancel bit is a copy of the
set_pc one).
The cancel bit sent to itself is also sent to the m stage.
The execute stage sends its result to the m stage (the transmitted result is the
target_pc if the instruction is a RET, i.e. the return address).
The execute stage sends rv2 (the stored value if the instruction is a STORE; it is
the value after a potential bypass) and d_i (which contains rd to be propagated to the
w stage, func3 which is the memory access size, and other decoded bits used by the
m and w stages).
Other values (the instruction, its pc, and the computed target_pc) are sent for debugging purposes (they are not included in synthesis mode).
Listing 8.24 The execute function to set the transmitted values
...
  e_to_f->set_pc = (e_from_f.d_i.is_jalr ||
                    e_from_f.d_i.is_jal  ||
                    taken_branch         ||
                    load_delay)          &&
                   !e_cancel;
  e_to_e->cancel = e_to_f->set_pc;
  e_to_m->cancel = e_cancel;
  e_to_m->result = (e_from_f.d_i.is_ret) ?
                   (int)e_to_f->target_pc : result;
  e_to_m->rv2 = rv2;
  e_to_m->d_i = e_from_f.d_i;
#ifndef __SYNTHESIS__
#ifdef DEBUG_DISASSEMBLE
  e_to_m->pc          = e_from_f.pc;
  e_to_m->instruction = e_from_f.instruction;
#endif
#ifdef DEBUG_EMULATE
  e_to_m->next_pc = e_to_f->target_pc;
#endif
#endif
}
Figure 8.16 shows an example where s = 3. There are four instructions: W, I1, I2, and R. The W instruction has register r as its destination. The R instruction has r as its source. No bypass is necessary if W and R are separated by at least two instructions (I1 and I2) not accessing register r.
However, this lazy solution is usually inapplicable because of the compiler. The
Gnu RISC-V cross compiler does not provide any option to handle delays between
instructions (there is an option only for branch delays).
It is still possible to manipulate the assembly code produced by the compiler and
insert NOP instructions where they are needed (a NOP is a neutral No-OPeration
instruction, leaving the processor state unchanged).
This manipulation has to be done on the assembly code, before the production of
the hexadecimal code, to keep correct code references (the computed displacements
between label definitions and usages).
However, such a post-processing cannot be applied to the linked libraries, as they
are already compiled.
For loads, another solution is to extend cancellation to load-dependent computations.
On the left part of Fig. 8.15, the "addi a1,a1,1" instruction increments the value
loaded by the preceding "lw a1,0(a2)" instruction. However, when the addi instruction
is executed, the lw instruction is still loading. I could forward the value out of the
load to the execute stage but this would serialize the mem stage after the execute
stage. Such a serialization would ruin the benefits of pipelining.
On the right part of Fig. 8.15, the first fetch of the "addi a1,a1,1" instruction is
cancelled, introducing in the pipeline an equivalent of a NOP instruction. The same
addi instruction is refetched in iteration i (the set_pc bit is set if a use after load
dependency is detected; the target_pc is set to refetch the use after load instruction:
see the last line in Listing 8.23). In iteration i+1, the value in the wb stage (which is
the one out of the iteration i mem stage) bypasses the register file value in the execute
stage.
The mem stage (mem.cpp file) is composed of the mem_load, the mem_store,
and the mem_access functions. The mem_load and the mem_store functions are
unchanged.
The mem_access function in the mem.cpp file (see Listing 8.25) is slightly
modified to add the case of cancellations (m_from_e.cancel). If the instruction in
the mem stage is not a load, its result is not modified. Otherwise, it receives the
loaded value. The mem_access function fills the m_to_w fields to either transmit
the result of the load or propagate what was computed in the execute function.
#endif
#ifdef DEBUG_DISASSEMBLE
  disassemble(w_from_m.pc, w_from_m.instruction,
              w_from_m.d_i);
#endif
#ifdef DEBUG_EMULATE
  emulate(reg_file, w_from_m.d_i, w_from_m.next_pc);
#endif
#endif
}
}
Experimentation
To simulate the rv32i_pp_ip, operate as explained in Sect. 5.3.6, replacing fetching_ip with rv32i_pp_ip.
You can play with the simulator, replacing the included test_mem_0_text.hex
file with any other .hex file you find in the same folder.
Experimentation
To run the rv32i_pp_ip on the development board, proceed as explained in Sect.
5.3.10, replacing fetching_ip with rv32i_pp_ip.
You can play with your IP, replacing the included test_mem_0_text.hex file with
any other .hex file you find in the same folder.
The code in the helloworld.c file is shown in Listing 8.28 (do not forget to adapt
the path to the hex file to your environment with the update_helloworld.sh shell
script). The code_ram array initialization concerns the test_mem program.
If the RISC-V code in the test_mem.h file is run, it produces the output shown
in Listing 8.29.
Listing 8.29 The helloworld output
88 fetched and decoded instructions in 119 cycles (ipc = 0.74)
data memory dump ( non null words )
m[ 0] = 1 ( 1)
m[ 4] = 2 ( 2)
m[ 8] = 3 ( 3)
m[ c] = 4 ( 4)
m[ 10] = 5 ( 5)
m[ 14] = 6 ( 6)
m[ 18] = 7 ( 7)
m[ 1c] = 8 ( 8)
m[ 20] = 9 ( 9)
m[ 24] = 10 ( a)
m[ 2c] = 55 ( 37)
Table 8.2 Execution time of the benchmarks on the 4-stage pipelined rv32i_pp_ip processor
Suite        Benchmark     Cycles       CPI   Time (s)     2-stage time (s)  Improvement (%)
mibench      basicmath     37,643,157   1.22  1.129294710  1.880591250       40
mibench      bitcount      37,909,999   1.16  1.137299970  1.895499850       40
mibench      qsort          8,086,191   1.21  0.242585730  0.404035950       40
mibench      stringsearch     635,830   1.16  0.019074900  0.031719550       40
mibench      rawcaudio        758,171   1.20  0.022745130  0.037358450       39
mibench      rawdaudio        562,801   1.20  0.016884030  0.028139950       40
mibench      crc32            360,016   1.20  0.010800480  0.016500700       35
mibench      fft           38,233,594   1.22  1.147007820  1.911071900       40
mibench      fft_inv       38,908,668   1.22  1.167260040  1.944825550       40
riscv-tests  median            35,471   1.27  0.001064130  0.001773450       40
riscv-tests  mm           193,405,475   1.23  5.802164250  9.669862050       40
riscv-tests  multiply         540,024   1.29  0.016200720  0.027001100       40
riscv-tests  qsort            330,630   1.22  0.009918900  0.016103500       38
riscv-tests  spmv           1,497,940   1.20  0.044938200  0.074866500       40
riscv-tests  towers           426,385   1.06  0.012791550  0.020909450       39
riscv-tests  vvadd             18,012   1.13  0.000540360  0.000900500       40
To pass the riscv-tests on the Vitis_HLS simulator, you just need to use the test-
bench_riscv_tests_rv32i_pp_ip.cpp program in the riscv-tests/my_isa/my_rv32ui
folder as the testbench.
To pass the riscv-tests on the FPGA, you must use the helloworld_rv32i_pp_ip.c
in the riscv-tests/my_isa/my_rv32ui folder. Normally, since you already ran the up-
date_helloworld.sh shell script for the other processors, the helloworld_rv32i_
pp_ip.c file should have paths adapted to your environment. However if you did
not, you must run the update_helloworld.sh shell script.
8.3 The Comparison of the 2-Stage Pipeline with the 4-Stage One
To run a benchmark from the mibench suite, say my_dir/bench, you set the testbench
as the testbench_bench_rv32i_pp_ip.cpp file found in the mibench/my_mibench/
my_dir/bench folder. For example, to run basicmath, you set the testbench as test-
bench_basicmath_rv32i_pp_ip.cpp in mibench/my_mibench/my_automotive/
basicmath.
To run one of the official riscv-tests benchmarks, say bench, you set the testbench
as the testbench_bench_rv32i_pp_ip.cpp file found in the riscv-tests/benchmarks/
bench folder. For example, to run median, you set the testbench as testbench_
median_rv32i_pp_ip.cpp in riscv-tests/benchmarks/median.
Building a RISC-V Processor
with a Multicycle Pipeline 9
Abstract
This chapter will make you build your third RISC-V processor. The implemented
microarchitecture proposed in this third version takes care of dependencies by
blocking an instruction in the pipeline until the instructions it depends on are all
out of the pipeline. For this purpose, a new issue stage is added. Moreover, the
pipeline stages are organized to allow an instruction to stay multiple cycles in the
same stage. The instruction processing is divided into six steps in order to fur-
ther reduce the processor cycle to two FPGA cycles (i.e. 50 MHz): fetch, decode,
issue, execute, memory access, and writeback. This multicycle pipeline microar-
chitecture is useful when the operators have different latencies, like multicycle
arithmetic or memory accesses.
All the source files related to the multicycle_pipeline_ip can be found in the multi-
cycle_pipeline_ip folder.
Fig. 9.1 The i_wait signal sent by the issue stage when a source is not ready
Another consequence is that an instruction cannot wait for its sources before enter-
ing the execution stage: they must all be ready when the instruction starts execution.
I had to add a bypass mechanism to the pipeline to forward computed but not yet
written values. This bypass hardware impacts the critical path. With no bypass, the
execute stage can fit in the two FPGA cycle limit. With bypassing, the execute stage
requires three FPGA cycles.
In the multicycle pipeline organization, when an instruction must stay in the same
stage for multiple cycles, it sends a wait signal to the preceding stages as shown in
Fig. 9.1.
Each stage receives some input (e.g. the stage y receives the structure x_to_y from
the stage x), processes it and sends an output (e.g. the structure y_to_z is sent to the
stage z). However, if the stage has a wait input signal, it stays frozen, which means
that it does not change its output at all (it neither receives nor processes its input and
it keeps its output unchanged).
A stage waits until the wait signal it receives is cleared (e.g. the issue stage is able
to issue and sends a null wait signal to the fetch and decode stages).
When a waiting stage receives a cleared wait signal, it resumes and starts process-
ing its input.
The inputs and the outputs of a pipeline stage use the same structures as the ones
described in the preceding chapter (see Sect. 8.1.2).
However, to allow waits from multicycle stages, each stage input and output
structure contains a valid bit (see Fig. 9.2). While a multicycle stage is in progress,
its output is invalid. The valid bit is set once the final result is ready and the wait
condition is cleared.
Fig. 9.2 A waiting issue operation invalidates its output until the wait is cleared
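As an illustration, a link structure could be sketched as follows (the payload field
is a placeholder; each real link type in the book carries its own fields):
#include "ap_int.h"
typedef ap_uint<1> bit_t;   // as in the book's types
// an illustrative inter-stage link: every link carries is_valid
typedef struct from_x_to_y_s {
  bit_t       is_valid;     // set once the result is final
  ap_uint<32> payload;      // placeholder for the real fields
} from_x_to_y_t;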
To keep the processor cycle within the limit of two FPGA cycles, the fetch stage
cannot additionally decode the fetched instruction. The read operation from the code
memory, the filling of the decoded instruction structure and the computation of the
immediate value according to the instruction format are too much work for two FPGA
cycles, i.e. 20 ns. This is one of the two reasons why the rv32i_pp_ip processor cycle
had to be 30ns (the other reason is the impact of the bypass on the execute stage
delay).
Although the decoding can be moved to a dedicated decode stage, the computation
of the next pc cannot. If there is no valid pc output from the fetch stage to itself, the
next cycle is not able to fetch.
A sequential next pc can be computed in the fetch stage, as it can be done in
parallel with the fetch. But if the fetched instruction is a taken jump or branch, the
sequential pc is not correct.
Instead of cancelling an incorrect path as I did in the preceding chapter, the
multicycle pipeline does some decoding in the fetch stage to block the fetch operation
each time the next pc remains unknown (i.e. when the fetched instruction is a BRANCH,
a JAL or a JALR).
In case of a JAL, the next pc is known in the decode stage. In case of a BRANCH
or a JALR, the next pc remains unknown until the target has been computed in the
execute stage. The fetch stage stays idle until it receives a valid input next pc.
If none of the three cases (BRANCH, JAL, or JALR) occurs, the fetch stage sends
a valid next pc to itself. This next pc is pc + 1.
Figure 9.3 shows how the fetch stage input is set according to the incoming valid
bits of the structures produced by itself, by the decode stage, or by the execute stage.
The fetch stage remains idle until a valid bit is present on one of the three
inputs. Each time a BRANCH or a JALR instruction is fetched, the fetch stage stays
idle for three cycles (the time to move the control instruction to the execute stage
where the target pc is computed and sent back to the fetch stage with the valid bit
set). Each time a JAL is fetched, the fetch stage stays idle for one cycle.
Control instructions represent between 10% and 20% of the instructions fetched
during a run, and most of them are branches. Hence, the impact on the performance
is rather high and the microarchitecture built in the next chapter gives a way to fill
the idle cycles with useful work.
When a multicycle stage s takes more than one cycle to produce its output from its
stable input, it raises a wait signal and sends it to its predecessors. It also clears the
output valid bit sent to its successor.
In the same cycle, the prior stage computes, and at the end of the cycle sends a
valid output.
On the following cycle, stage s receives the valid input but it should not use it
while it is still computing on the preceding input.
Such multicycle stages have an internal safe structure (see Fig. 9.4). This struc-
ture serves to store the input. Instead of computing directly on the input, the stage
computes on the saved data.
When there is a valid input and the stage is no longer in a waiting situation, the
new input is saved in the stage safe.
When the stage finishes its processing, it empties its safe.
During a wait, a valid input is not saved. It stays in stand-by until the wait is over.
As the emitting stage is frozen, the input is preserved and stays stable until the stage
resumes.
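A minimal sketch of this capture, assuming the chapter's bit_t and from_d_to_i_t
types (the structure and helper names are hypothetical):
typedef struct safe_sketch_s {
  bit_t         is_full;  // set while a saved input is pending
  from_d_to_i_t input;    // the captured stable input
} safe_sketch_t;
static void save_input(from_d_to_i_t in, bit_t wait,
                       safe_sketch_t *safe){
  // a valid input is captured only outside of a wait; during a
  // wait the emitting stage is frozen, so the input stays
  // stable and can still be saved once the wait ends
  if (in.is_valid && !wait && !safe->is_full){
    safe->is_full = 1;
    safe->input   = in;
  }
}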
If the pipeline contains multiple multicycle stages (e.g. a multicycle execute stage
implementing the M or the F extensions of the RISC-V ISA and a multicycle memory
access stage to access a hierarchized memory), each multicycle stage has a safe and
emits a wait signal to its predecessors as shown in Fig. 9.5.
When a stage s' raises a wait signal while an earlier stage s has already signalled
its own wait (e.g. the e_wait signal is raised while the i_wait signal is set), the new
wait freezes the pipeline up to and including stage s. However, stage s can continue
its processing from its stable input in its safe.
If stage s finishes its processing before stage s', its output valid bit is set but the
stage stays frozen until stage s' clears its wait signal. Stages before s remain frozen
too.
Fig. 9.5 Wait signals with multiple multicycle stages (the red stages are frozen while the i_wait
signal is set; the red and green stages are frozen while the e_wait signal is set; the red, green,
and blue stages are frozen while the m_wait signal is set; if the i_wait signal is cleared while
either the e_wait or the m_wait signal is set, the red stages stay frozen; if the i_wait signal is
cleared while both the e_wait and the m_wait signals are clear, the red stages are resumed)
If stages s and s' finish their processing in the same cycle, their outputs are simul-
taneously valid and the whole pipeline gets active again.
If stage s finishes its processing after s', the pipeline between s and s' resumes
when the s' wait bit is cleared, but the pipeline up to s remains frozen while the s
wait bit stays set. Hence, the stage after s receives an invalid input until s ends its
processing. As the stage next to s has an invalid input, it does not process anything
and it sends an invalid output.
The initialization phase (see Listing 9.2) starts with a call to the init_reg_file
function which initializes the register file (all registers are cleared) and the register
locking bits in the is_reg_computed array (all registers are marked as unlocked).
Then, the connection links are all marked as invalid (is_valid bit is cleared) except
for the f_to_f one. The f_to_f.next_pc receives the start_pc address. The issue stage
is not in wait (i_wait cleared) and its safe is empty.
Listing 9.2 The multicycle_pipeline_ip function initializations
...
  init_reg_file(reg_file, is_reg_computed);
  f_to_f.is_valid = 1;
  f_to_f.next_pc  = start_pc;
  f_to_d.is_valid = 0;
  d_to_f.is_valid = 0;
  d_to_i.is_valid = 0;
  i_to_e.is_valid = 0;
  e_to_f.is_valid = 0;
  e_to_m.is_valid = 0;
  m_to_w.is_valid = 0;
  i_wait          = 0;
  i_safe.is_full  = 0;
  nbi             = 0;
  nbc             = 0;
...
The multicycle_pipeline_ip.cpp file contains the main do ... while loop shown in
Listing 9.3.
The function calls are placed in a reverse order to avoid RAW dependencies (i.e.
from writeback to fetch).
Listing 9.3 The do ... while loop in the multicycle_pipeline_ip function
...
  do {
#pragma HLS PIPELINE II=2
#pragma HLS LATENCY max=1
#ifndef __SYNTHESIS__
#ifdef DEBUG_PIPELINE
    printf("=============================================\n");
    printf("cycle %d\n", (int)nbc);
#endif
#endif
    statistic_update(i_to_e, &nbi, &nbc);
    running_cond_update(m_to_w, &is_running);
    write_back(m_to_w, reg_file, is_reg_computed);
    mem_access(e_to_m, data_ram, &m_to_w);
    execute(i_to_e,
#ifndef __SYNTHESIS__
#ifdef DEBUG_PIPELINE
            reg_file,
#endif
#endif
            &e_to_f, &e_to_m);
    issue(d_to_i, reg_file, is_reg_computed, &i_safe, &i_to_e,
          &i_wait);
    decode(f_to_d, i_wait, &d_to_f, &d_to_i);
    fetch(f_to_f, d_to_f, e_to_f, i_wait, code_ram, &f_to_f,
          &f_to_d);
  } while (is_running);
...
The "HLS LATENCY max=1" pragma instructs the synthesizer to try to use at most
one separation register to implement the succession of operations, i.e. it constrains
the iteration timing to at most two FPGA cycles.
As a general rule, you should define "HLS LATENCY max=n" if you want your
implementation to have a latency of at most n+1 cycles.
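For instance, a hypothetical block allowed to spread over at most three FPGA cycles
would be constrained as:
#pragma HLS LATENCY max=2
// at most two separation registers, i.e. a latency of at most
// three FPGA cycles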
If the "HLS LATENCY max=1" pragma is disactivated (e.g. the line is turned into
a comment), the synthesizer expands the iteration processing on three FPGA cycles
instead of two (however, it keeps the IP cycle as two FPGA cycles; hence there is a
one cycle overlapping for two successive iterations). With an active "HLS LATENCY
max=1" pragma, the iteration latency and the IP cycle coincide (II=2).
The fetch, decode, issue, execute, mem_access, and write_back functions im-
plement the six pipeline stages.
They are organized and placed to be independent of each other and to be run
in parallel.
There are some RAW dependencies though. This is the case for the retroaction
link between the decode stage and the fetch stage (d_to_f), which is written by the
decode function and read by the fetch function. It is the same situation for e_to_f
(the execute function writes to the e_to_f structure and the fetch function reads
from it) and also for i_wait (written by issue and read by decode and fetch).
These RAW dependencies do serialize some part of the computations. But it turns
out that the synthesizer can fit everything in the do ... while loop into the two FPGA
cycles limit, probably because these dependencies are not in the critical path.
The statistic_update and the running_cond_update functions play the same role
as in the rv32i_pp_ip design.
The DEBUG_PIPELINE definition can be used to switch between a cycle-by-cycle
dump of the processor work (at each cycle, the fetch stage prints which instruction
it fetches, the decode stage prints what it decodes, and so on for each of the six
pipeline stages) and the already implemented execution trace (i.e. the same output as
previously implemented).
The constant is to be defined in the debug_multicycle_pipeline_ip.h file to select
the cycle-by-cycle dump.
Listings 9.14 to 9.19 show what the print looks like when the DEBUG_PIPELINE
constant is defined.
The end part of the top function after the do...while loop is unchanged from the
rv32i_pp_ip one.
Each stage is built on the same pattern. A stage only works when a valid input is
present and no wait signal is received. The computation is done in a stage_job
function. The stage produces its output in a set_output_to_ function (one function
per output recipient).
While the stage is waiting, the output is unchanged. When there is no input, the
output is cleared.
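In pseudo-HLS form, this common skeleton can be sketched as follows (stage names,
types, and the result_t payload are illustrative; the book's real stages pass
several outputs and debug arguments):
// hypothetical stage y between stages x and z
void stage_y(from_x_to_y_t y_from_x, bit_t wait,
             from_y_to_z_t *y_to_z){
  if (!wait){                     // frozen: output kept as is
    if (y_from_x.is_valid){
      result_t r;
      stage_job(y_from_x, &r);    // the stage computation
      set_output_to_z(r, y_to_z); // fill the output fields
    }
    // with no valid input, the output valid bit is cleared
    y_to_z->is_valid = y_from_x.is_valid;
  }
}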
The code of the fetch function (in the fetch.cpp file) is shown in Listing 9.4.
The fetch function returns (i.e. the fetch stage stays frozen) if the i_wait condition
is set (i.e. the issue stage is stopped by a locked register).
The fetch function sets the has_input bit if any pc emitting stage has sent a valid
output at the end of the preceding cycle (the pc emitting stages are the fetch stage,
the decode stage, or the execute stage).
When a control instruction is fetched, the fetch stage does not output any valid
next_pc value to itself and the fetch is suspended. However, the fetch stage outputs
the fetched instruction to the decode stage.
If the control instruction is a JAL, the decode stage in the next cycle sends the
target_pc value back to the fetch stage, which resumes.
If the control instruction is a BRANCH or a JALR, the decode stage in the next
cycle is not able to send a valid target_pc value. Hence, the fetch stage remains
suspended. When the BRANCH or JALR target is computed, the execute stage sends
the target_pc value and the fetch stage resumes.
A consequence is that at most one of the three valid bits from the stages emitting
a pc can be set.
If there is no input (has_input is clear), the stage stays idle. It clears its output
valid bits.
If there is an input, the stage does its job in the stage_job function (fetch and
partial decoding).
The stage sets the outputs to itself (next sequential pc) and to the decode stage
(current pc and fetched instruction).
The stage sets the output valid bits (the output to itself is valid only if the instruction
is not a control one).
Listing 9.4 The fetch function
void fetch(
  from_f_to_f_t  f_from_f,
  from_d_to_f_t  f_from_d,
  from_e_to_f_t  f_from_e,
  bit_t          i_wait,
  instruction_t *code_ram,
  from_f_to_f_t *f_to_f,
  from_f_to_d_t *f_to_d){
  bit_t             has_input;
  instruction_t     instruction;
  decoded_control_t d_ctrl;
  bit_t             is_ctrl;
  code_address_t    pc;
  if (!i_wait){
    has_input = f_from_f.is_valid || f_from_d.is_valid ||
                f_from_e.is_valid;
    if (has_input){
      if (f_from_f.is_valid)
        pc = f_from_f.next_pc;
      else if (f_from_d.is_valid)
        pc = f_from_d.target_pc;
      else if (f_from_e.is_valid)
        pc = f_from_e.target_pc;
      stage_job(pc, code_ram, &instruction, &d_ctrl);
#ifndef __SYNTHESIS__
#ifdef DEBUG_PIPELINE
      printf("fetched ");
      printf("%04d: %08x\n",
             (int)(pc << 2), instruction);
#endif
#endif
      set_output_to_f(pc, f_to_f);
      set_output_to_d(pc, instruction, d_ctrl, f_to_d);
    }
    is_ctrl = d_ctrl.is_branch || d_ctrl.is_jalr ||
              d_ctrl.is_jal;
    f_to_f->is_valid = has_input && !is_ctrl;
    f_to_d->is_valid = has_input;
  }
}
The stage_job function (see Listing 9.5) fetches an instruction from the code_ram
addressed by the input pc.
It does not decode the fetched instruction. However, it checks if the instruction is
a control one (the decode_control function).
The stage_job function and the decode_control function are in the fetch.cpp
file.
Listing 9.5 The decode_control and stage_job functions
static void decode_control(
  instruction_t      instruction,
  decoded_control_t *d_ctrl){
  opcode_t opcode;
  opcode = (instruction >> 2);
  d_ctrl->is_branch = (opcode == BRANCH);
  d_ctrl->is_jalr   = (opcode == JALR);
  d_ctrl->is_jal    = (opcode == JAL);
}
static void stage_job(
  code_address_t     pc,
  unsigned int      *code_ram,
  instruction_t     *instruction,
  decoded_control_t *d_ctrl){
  *instruction = code_ram[pc];
  decode_control(*instruction, d_ctrl);
}
The two set_output functions (see Listing 9.6; fetch.cpp file) fill the fields of the
structures for the next decode stage and for the fetch stage itself. The output to the
decode stage includes the bits computed by the decode_control function to avoid
recomputing them.
Listing 9.6 The set_output_to_f and set_output_to_d functions
static void set_output_to_f(
  code_address_t  pc,
  from_f_to_f_t  *f_to_f){
  f_to_f->next_pc = pc + 1;
}
static void set_output_to_d(
  code_address_t     pc,
  instruction_t      instruction,
  decoded_control_t  d_ctrl,
  from_f_to_d_t     *f_to_d){
  f_to_d->pc          = pc;
  f_to_d->instruction = instruction;
  f_to_d->is_branch   = d_ctrl.is_branch;
  f_to_d->is_jalr     = d_ctrl.is_jalr;
  f_to_d->is_jal      = d_ctrl.is_jal;
}
The code to implement the decode stage has the same organization as the one im-
plementing the fetch stage.
Two bits have been added to the decoded_instruction_t type defined in the
multicycle_pipeline_ip.cpp file (see Listing 9.7): is_rs1_reg and is_rs2_reg. The
is_rs1_reg bit (respectively is_rs2_reg) is set if the decoded rs1 field (respectively
rs2) represents a register source.
Listing 9.7 The decoded_instruction_t type
typedef struct decoded_instruction_s {
  opcode_t    opcode;
  ...
  immediate_t imm;
  bit_t       is_rs1_reg;
  bit_t       is_rs2_reg;
  ...
  bit_t       is_r_type;
} decoded_instruction_t;
The decode_instruction function defined in the decode.cpp file (see Listings 9.8
and 9.9) is updated to take the already decoded bits concerning control instructions
into account (is_branch, is_jalr, and is_jal).
The computation has also been reorganized to minimize the redundancy in the
expressions, inserting many local bits (e.g. is_lui or is_not_auipc).
Listing 9.8 The decode_instruction function: compute local bits
static void decode_instruction(
  instruction_t          instruction,
  bit_t                  is_branch,
  bit_t                  is_jalr,
  bit_t                  is_jal,
  decoded_instruction_t *d_i){
  opcode_t opcode;
  bit_t    is_lui;
  bit_t    is_load;
  bit_t    is_store;
  bit_t    is_op_imm;
  bit_t    is_not_auipc;
  bit_t    is_not_jal;
  opcode       = (instruction >> 2);
  is_lui       = (opcode == LUI);
  is_load      = (opcode == LOAD);
  is_store     = (opcode == STORE);
  is_op_imm    = (opcode == OP_IMM);
  is_not_auipc = (opcode != AUIPC);
  is_not_jal   = !is_jal;
...
The decode_instruction function is also updated to fill the two new is_rs1_reg
and is_rs2_reg fields.
Listing 9.9 The decode_instruction function: fill the d_i fields
...
  d_i->opcode = opcode;
  ...
  d_i->is_rs1_reg = (is_not_jal && !is_lui &&
                     is_not_auipc && (d_i->rs1 != 0));
  d_i->is_rs2_reg = (!is_op_imm && !is_load &&
                     is_not_jal && !is_jalr &&
                     !is_lui    && is_not_auipc &&
                     (d_i->rs2 != 0));
  ...
  d_i->is_r_type = (d_i->type == R_TYPE);
}
#endif
                      d_to_i);
    }
    d_to_f->is_valid = d_from_f.is_valid && d_i.is_jal;
    d_to_i->is_valid = d_from_f.is_valid;
  }
}
The stage_job function in the decode.cpp file (see Listing 9.11) decodes the
instruction, decodes the immediate, and computes the JAL target_pc.
Listing 9.11 The decode stage_job function
static void stage_job(
  code_address_t         pc,
  instruction_t          instruction,
  bit_t                  is_branch,
  bit_t                  is_jalr,
  bit_t                  is_jal,
  decoded_instruction_t *d_i,
  code_address_t        *target_pc){
  decode_instruction(instruction, is_branch, is_jalr, is_jal,
                     d_i);
  decode_immediate(instruction, d_i);
  if (d_i->is_jal)
    *target_pc = pc + (code_address_t)(d_i->imm >> 1);
}
The job of the issue stage is to read the register sources from the register file and
send the values to the execute stage. The stage issues the instruction to the next stage
only when no register source of the current instruction is being computed at the same
time in the following stages of the pipeline (a source register r is being computed if
is_reg_computed[r] is set; the set bit locks the register).
The schedule shown in Listings 9.14 to 9.16 (printed from the run of the RISC-V
code in the multicycle_pipeline_ip with the DEBUG_PIPELINE constant set) shows
how the three instructions move in the pipeline when the destination is not checked
before issue.
At cycle 2 (see Listing 9.14), register a0 is locked by instruction (0000) (the
destination register is locked when the instruction is issued).
Listing 9.14 The schedule when the destination is not checked before issue: cycles 0 to 2
=============================================
cycle 0
fetched 0000: 01200513
=============================================
cycle 1
decoded 0000: li a0, 18
fetched 0004: 01300513
=============================================
cycle 2
issued 0000
decoded 0004: li a0, 19
fetched 0008: 00150513
Listing 9.15 The schedule when the destination is not checked before issue: cycles 3 to 5
=============================================
cycle 3
execute 0000
issued 0004
decoded 0008: addi a0, a0, 1
fetched 0012: 00008067
=============================================
cycle 4
mem 0000
execute 0004
=============================================
cycle 5
wb 0000
a0 = 18 ( 12)
mem 0004
issued 0008
decoded 0012: ret
At cycle 8 (see Listing 9.16), instruction (0008) writes value 19 into register a0
(i.e. 18 + 1), and this is the final printed register value, instead of the expected 20:
instruction (0008) was issued at cycle 5, just after instruction (0000)'s writeback
unlocked a0, and thus read the stale value 18 before instruction (0004) could write 19.
Listing 9.16 The schedule when the destination is not checked before issue: cycles 6 to 9
=============================================
cycle 6
wb 0004
a0 = 19 ( 13)
execute 0008
issued 0012
=============================================
cycle 7
mem 0008
execute 0012
pc = 0 ( 0)
=============================================
cycle 8
wb 0008
a0 = 19 ( 13)
mem 0012
=============================================
cycle 9
wb 0012
=============================================
Listing 9.17 The schedule when the destination is checked before issue: cycles 3 to 5
=============================================
cycle 3
execute 0000
=============================================
cycle 4
mem 0000
=============================================
cycle 5
wb 0000
a0 = 18 ( 12)
issued 0004
decoded 0008: addi a0, a0, 1
fetched 0012: 00008067
Instruction (0008) is issued at cycle 8 (see Listing 9.18), after instruction (0004)'s
writeback in the same cycle. Hence, instruction (0008) has read the value 19 from
register a0.
Instruction (0008) computes value 20 in the execute stage at cycle 9.
Listing 9.18 The schedule when the destination is checked before issue: cycles 6 to 9
=============================================
cycle 6
execute 0004
=============================================
cycle 7
mem 0004
=============================================
cycle 8
wb 0004
a0 = 19 ( 13)
issued 0008
decoded 0012: ret
=============================================
cycle 9
execute 0008
issued 0012
    is_locked_1 =
      i_safe->d_i.is_rs1_reg &&
      is_reg_computed[i_safe->d_i.rs1];
    is_locked_2 =
      i_safe->d_i.is_rs2_reg &&
      is_reg_computed[i_safe->d_i.rs2];
    *i_wait = is_locked_1 || is_locked_2;
    if (!(*i_wait)){
      stage_job(i_safe->d_i, reg_file, &rv1, &rv2);
#ifndef __SYNTHESIS__
#ifdef DEBUG_PIPELINE
      printf("issued ");
      printf("%04d\n", (int)(i_safe->pc << 2));
#endif
#endif
      set_output_to_e(i_safe->pc, i_safe->d_i, rv1, rv2,
#ifndef __SYNTHESIS__
                      i_safe->instruction,
                      i_safe->target_pc,
#endif
                      i_to_e);
      if (!i_safe->d_i.has_no_dest)
        is_reg_computed[i_safe->d_i.rd] = 1;
    }
  }
  i_to_e->is_valid = i_safe->is_full && !(*i_wait);
  i_safe->is_full  = (*i_wait);
}
However, if you want to apply strict scheduling, the issue function should be
updated by adding a third check bit is_locked_d related to the instruction destination
i_safe->d_i.rd, as shown in Listing 9.21.
Listing 9.21 The issue function
void issue(
  from_d_to_i_t  i_from_d,
  int           *reg_file,
  bit_t         *is_reg_computed,
  i_safe_t      *i_safe,
  from_i_to_e_t *i_to_e,
  bit_t         *i_wait){
  bit_t is_locked_1;
  bit_t is_locked_2;
  bit_t is_locked_d;
  ...
  if (i_safe->is_full){
    is_locked_1 =
      i_safe->d_i.is_rs1_reg &&
      is_reg_computed[i_safe->d_i.rs1];
    is_locked_2 =
      i_safe->d_i.is_rs2_reg &&
      is_reg_computed[i_safe->d_i.rs2];
    is_locked_d =
      !i_safe->d_i.has_no_dest &&
      is_reg_computed[i_safe->d_i.rd];
    *i_wait = is_locked_1 || is_locked_2 || is_locked_d;
  ...
}
The execute stage is not concerned by the i_wait signal (like all the stages after the
issue one).
When its input is valid, the execute stage first computes the instruction result and
the next pc (the compute function). Then, it sets the target_pc to be sent to the fetch
stage (the stage_job function).
After that, it fills the output structure fields in the two set_output functions.
Finally, the execute stage sets the valid bits for the outputs.
If no input is valid, the outputs to the fetch stage and to the memory access stage
are invalid too. Otherwise, the output to the memory access stage is valid. The output
to the fetch stage is valid if the executed instruction is a BRANCH or a JALR.
If the instruction is a RET (a RET is a JALR with register RA as the source and
register zero as the destination) and if register RA is null, it is the return from the
main function, hence the last instruction run. In this case, the output to the fetch
stage is invalid and the return address is not sent.
Listing 9.24 The execute function
void execute(
  from_i_to_e_t  e_from_i,
#ifndef __SYNTHESIS__
#ifdef DEBUG_PIPELINE
  int           *reg_file,
#endif
#endif
  from_e_to_f_t *e_to_f,
  from_e_to_m_t *e_to_m){
  bit_t          bcond;
  int            result1;
  int            result2;
  code_address_t target_pc;
  code_address_t next_pc;
  if (e_from_i.is_valid){
    compute(e_from_i.pc, e_from_i.d_i, e_from_i.rv1,
            e_from_i.rv2, &bcond, &result1, &result2,
            &next_pc);
    stage_job(e_from_i.pc, e_from_i.d_i, bcond, next_pc,
              &target_pc);
#ifndef __SYNTHESIS__
#ifdef DEBUG_PIPELINE
    printf("execute ");
    printf("%04d\n", (int)(e_from_i.pc << 2));
    if (e_from_i.d_i.is_branch || e_from_i.d_i.is_jalr)
      emulate(reg_file, e_from_i.d_i, next_pc);
#endif
#endif
    set_output_to_f(target_pc, e_to_f);
    set_output_to_m(e_from_i.d_i, result1, result2, next_pc,
                    e_from_i.rv2, target_pc,
#ifndef __SYNTHESIS__
                    e_from_i.pc, e_from_i.instruction,
#endif
                    e_to_m);
  }
  // block fetch after last RET
  // (i.e. RET with 0 return address)
  e_to_f->is_valid =
    e_from_i.is_valid &&
    (e_from_i.d_i.is_branch ||
     (e_from_i.d_i.is_jalr &&
      (!e_from_i.d_i.is_ret || (next_pc != 0))));
  e_to_m->is_valid = e_from_i.is_valid;
}
                    next_pc;
#ifndef __SYNTHESIS__
  e_to_m->pc          = pc;
  e_to_m->instruction = instruction;
  e_to_m->d_i         = d_i;
#endif
}
#ifdef DEBUG_PIPELINE
    printf("mem ");
    printf("%04d\n", (int)(m_from_e.pc << 2));
#endif
#endif
    set_output_to_w(m_from_e.rd, m_from_e.has_no_dest,
                    m_from_e.is_ret, value,
#ifndef __SYNTHESIS__
                    m_from_e.pc, m_from_e.instruction,
                    m_from_e.d_i, m_from_e.target_pc,
#endif
                    m_to_w);
  }
  m_to_w->is_valid = m_from_e.is_valid;
}
The stage_job function in the mem_access.cpp file (see Listing 9.31) either loads
from or stores to memory. If the instruction is neither a load nor a store, the function
just returns.
The mem_load and mem_store functions in the mem.cpp file are unchanged.
Listing 9.31 The stage_job function
static void stage_job(
  bit_t             is_load,
  bit_t             is_store,
  b_data_address_t  address,
  func3_t           func3,
  int              *data_ram,
  int              *value){
  if (is_load)
    *value = mem_load(data_ram, address, func3);
  else if (is_store)
    mem_store(data_ram, address, *value, (ap_uint<2>)func3);
}
#endif
  from_m_to_w_t *m_to_w){
  m_to_w->rd          = rd;
  m_to_w->has_no_dest = has_no_dest;
  m_to_w->is_ret      = is_ret;
  m_to_w->value       = value;
#ifndef __SYNTHESIS__
  m_to_w->pc          = pc;
  m_to_w->instruction = instruction;
  m_to_w->d_i         = d_i;
  m_to_w->target_pc   = target_pc;
#endif
}
The writeback stage_job function in the wb.cpp file (see Listing 9.34) writes the
value into the destination register if the instruction has a destination.
Listing 9.34 The writeback stage_job function
static void stage_job(
  bit_t      has_no_dest,
  reg_num_t  rd,
  int        value,
  int       *reg_file){
  if (!has_no_dest) reg_file[rd] = value;
}
Experimentation
To simulate the multicycle_pipeline_ip, operate as explained in Sect. 5.3.6, re-
placing fetching_ip with multicycle_pipeline_ip.
You can play with the simulator, replacing the included test_mem_0_text.hex
file with any other .hex file you find in the same folder.
Experimentation
To run the multicycle_pipeline_ip on the development board, proceed as explained
in Sect. 5.3.10, replacing fetching_ip with multicycle_pipeline_ip.
You can play with your IP, replacing the included test_mem_0_text.hex file with
any other .hex file you find in the same folder.
The code in the helloworld.c file is shown in Listing 9.35 (do not forget to adapt
the path to the hex file to your environment with the update_helloworld.sh shell
script).
Listing 9.35 The helloworld.c file to run the test_mem.s RISC-V program
#include <stdio.h>
#include "xmulticycle_pipeline_ip.h"
#include "xparameters.h"
#define LOG_CODE_RAM_SIZE 16
// size in words
#define CODE_RAM_SIZE (1 << LOG_CODE_RAM_SIZE)
#define LOG_DATA_RAM_SIZE 16
// size in words
#define DATA_RAM_SIZE (1 << LOG_DATA_RAM_SIZE)
XMulticycle_pipeline_ip_Config *cfg_ptr;
XMulticycle_pipeline_ip         ip;
word_type code_ram[CODE_RAM_SIZE] = {
#include "test_mem_0_text.hex"
};
int main(){
  unsigned int nbi, nbc;
  word_type    w;
  cfg_ptr = XMulticycle_pipeline_ip_LookupConfig(
              XPAR_XMULTICYCLE_PIPELINE_IP_0_DEVICE_ID);
  XMulticycle_pipeline_ip_CfgInitialize(&ip, cfg_ptr);
  XMulticycle_pipeline_ip_Set_start_pc(&ip, 0);
  XMulticycle_pipeline_ip_Write_code_ram_Words(&ip, 0, code_ram,
    CODE_RAM_SIZE);
  XMulticycle_pipeline_ip_Start(&ip);
  while (!XMulticycle_pipeline_ip_IsDone(&ip));
  nbi = XMulticycle_pipeline_ip_Get_nb_instruction(&ip);
  nbc = XMulticycle_pipeline_ip_Get_nb_cycle(&ip);
  printf("%d fetched and decoded instructions \
in %d cycles (ipc = %2.2f)\n", nbi, nbc, ((float)nbi)/nbc);
  printf("data memory dump (non null words)\n");
  for (int i = 0; i < DATA_RAM_SIZE; i++){
    XMulticycle_pipeline_ip_Read_data_ram_Words
      (&ip, i, &w, 1);
    if (w != 0)
      printf("m[%4x] = %16d (%8x)\n", 4*i, (int)w,
             (unsigned int)w);
  }
}
If the RISC-V code in the test_mem.h file is run, it produces the output shown
in Listing 9.36 (217 cycles versus 119 cycles on the rv32i_pp_ip; the locking mech-
anism is less efficient than the bypass version from Chap. 8; the IPC is low, which
means that the pipeline is far from filled):
Listing 9.36 The helloworld output
88 fetched and decoded instructions in 217 cycles (ipc = 0.41)
data memory dump ( non null words )
m[ 0] = 1 ( 1)
m[ 4] = 2 ( 2)
m[ 8] = 3 ( 3)
m[ c] = 4 ( 4)
m[ 10] = 5 ( 5)
m[ 14] = 6 ( 6)
m[ 18] = 7 ( 7)
m[ 1c] = 8 ( 8)
m[ 20] = 9 ( 9)
m[ 24] = 10 ( a)
m[ 2c] = 55 ( 37)
To pass the riscv-tests on the Vitis_HLS simulator, you just need to use the test-
bench_riscv_tests_multicycle_pipeline_ip.cpp program in the riscv-tests/my_isa/
my_rv32ui folder as the testbench.
To pass the riscv-tests on the FPGA, you must use the helloworld_multicycle_
pipeline_ip.c in the riscv-tests/my_isa/my_rv32ui folder. Normally, since you al-
ready ran the update_helloworld.sh shell script for the other processors, the hel-
loworld_multicycle_pipeline_ip.c file should have paths adapted to your environ-
ment. However if you did not, you must run ./update_helloworld.sh.
To run a benchmark from the mibench suite, say my_dir/bench, you set the
testbench as the testbench_bench_multicycle_pipeline_ip.cpp file found in the
mibench/my_mibench/my_dir/bench folder. For example, to run basicmath, you
set the testbench as testbench_basicmath_multicycle_pipeline_ip.cpp in the
mibench/my_mibench/my_automotive/basicmath folder.
To run one of the official riscv-tests benchmarks, say bench, you set the test-
bench as the testbench_bench_multicycle_pipeline_ip.cpp file found in the riscv-
tests/benchmarks/bench folder. For example, to run median, you set the testbench
Table 9.1 Execution time of the benchmarks on the 6-stage pipelined multicycle_pipeline_ip
processor
Suite        Benchmark     Cycles        CPI   Time (s)     4-stage time (s)  Improvement (%)
mibench      basicmath      62,723,992   2.03  1.254479840  1.129294710       −11
mibench      bitcount       57,962,065   1.78  1.159241300  1.137299970       −2
mibench      qsort          12,845,805   1.92  0.256916100  0.242585730       −6
mibench      stringsearch    1,240,390   2.26  0.024807800  0.019074900       −30
mibench      rawcaudio       1,363,673   2.15  0.027273460  0.022745130       −20
mibench      rawdaudio         942,834   2.01  0.018856680  0.016884030       −12
mibench      crc32             660,028   2.20  0.013200560  0.010800480       −22
mibench      fft            64,979,537   2.07  1.299590740  1.147007820       −13
mibench      fft_inv        66,054,232   2.07  1.321084640  1.167260040       −13
riscv-tests  median             53,141   1.91  0.001062820  0.001064130       0
riscv-tests  mm            328,860,252   2.09  6.577205040  5.802164250       −13
riscv-tests  multiply          745,904   1.78  0.014918080  0.016200720       8
riscv-tests  qsort             491,648   1.81  0.009832960  0.009918900       1
riscv-tests  spmv            2,426,687   1.95  0.048533740  0.044938200       −8
riscv-tests  towers            510,511   1.26  0.010210220  0.012791550       20
riscv-tests  vvadd              24,016   1.50  0.000480320  0.000540360       11
A way to improve the multicycle pipeline is to fill it more efficiently. The compiler
can help by rearranging the instructions to minimize the number of waiting cycles
(but you have to modify the compiler).
A major improvement would be to avoid the unused cycles due to the next pc
computation latency. This can be achieved with a branch predictor, as shown in
Fig. 9.10.
The branch predictor predicts the next pc from the current pc (the current pc is
itself most of the time the result of a previous prediction) and a set of caches of the
targets of the past control flow instructions (if the current pc is in one of the caches,
it means it is the address of a control flow instruction and the cache gives a target
address, which is the prediction; otherwise, the prediction is pc + 1).
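As an illustration, the target caches can be sketched as a small direct-mapped
table (a BTB-like structure); every name, type, and size below is an assumption
for the sketch, not the book's code:
#include "ap_int.h"
typedef ap_uint< 1> bit_t;
typedef ap_uint<16> code_address_t; // word address (assumption)
#define LOG_BTB_SIZE 6
#define BTB_SIZE     (1 << LOG_BTB_SIZE)
typedef struct btb_entry_s {
  bit_t          valid;
  code_address_t tag;    // pc of a control flow instruction
  code_address_t target; // its last seen target
} btb_entry_t;
static code_address_t predict_next_pc(
  code_address_t pc, btb_entry_t *btb){
  btb_entry_t e = btb[pc & (BTB_SIZE - 1)];
  // a hit means pc is a known control flow instruction:
  // predict its cached target; otherwise predict pc + 1
  return (e.valid && e.tag == pc) ?
         e.target : (code_address_t)(pc + 1);
}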
The predicted next pc is forwarded all along the pipeline up to the writeback
stage, where it is compared to the computed next pc. If they match, the run continues
(the correction order bit is cleared, letting pc receive the lower input of the
multiplexer shown on the left side of Fig. 9.10). Otherwise (the correction order bit
is set and pc receives the upper input of the multiplexer), the computed next pc is
sent back to the fetch stage to correct the instruction path.
In that case, the instructions in the pipeline are all cancelled. Whatever the correctness of the
prediction, the caches of the targets in the branch predictor are updated.
Because the branch predictor is able to produce a prediction in a single processor
cycle, the fetch stage receives a new predicted pc every cycle, even when a con-
trol instruction is fetched. Only when the prediction is wrong are the already fetched
instructions on the wrong path discarded, and the corresponding cycles lost.
The best predictor [1] reaches a rate of 8 branch Mispredictions Per Kilo Instructions
(MPKI), i.e. on the average, one miss every 125 instructions.
Another way to improve the performance of the design and decrease the CPI is
to provide other instruction sources through multithreading, which will be your next
design.
9.6 Proposed Exercise: Reduce II to 1
With the multicycle pipeline, the processor frequency has increased to 50 MHz. It is
possible to double the speed though, with an initiation interval II = 1.
The exercise is to implement such an II = 1 design. Notice that the next pc
computation in the fetch stage cannot benefit from any decoding at all because the fetch
latency is two cycles (BRAM block latency; however, the BRAM access throughput
is one access per cycle, i.e. you can start a new access every cycle).
As the fetch duration is two cycles, while the processor is fetching the instruction,
a new fetch should start from a predicted address.
The simplest prediction is to systematically set next pc as pc + 1. The MPKI for
this static predictor is the rate of control instructions (on average, 15% of the
executed instructions are JAL, JALR, or taken branches), i.e. MPKI = 150 (one miss
every six or seven instructions).
Fig. 9.10 Fetching with a branch predictor (a mux drives pc, selecting the predicted next pc or,
on a correction order, the corrected pc; the predicted next pc accompanies the instruction and is
compared to the computed next pc)
References
1. A. Seznec, P. Michaud, A case for (partially) TAgged GEometric history length branch predic-
tion. J. Instruction Level Parallelism (2006)
2. T.-Y. Yeh, Y.N. Patt, Two-level adaptive training branch prediction, in MICRO 24: Proceedings
of the 24th Annual International Symposium on Microarchitecture, pp. 51–61 (1991)
3. S. McFarling, Combining Branch Predictors. Digital Western Research Laboratory Technical
Note 36 (1993)
Building a RISC-V Processor
with a Multiple Hart Pipeline 10
Abstract
This chapter will make you build your fourth RISC-V processor. The implemented
microarchitecture proposed in this fourth version improves the CPI by filling the
pipeline with multiple instruction flows. At the OS level, a flow of control is
a thread. The processor can be designed to host multiple threads and run them
simultaneously (Simultaneous MultiThreading or SMT, as named by Tullsen in
[1]). Such thread dedicated slots in the processor are called harts (for HARdware
Threads). The multihart design presented in this chapter can host up to eight harts.
The pipeline has six stages. The processor cycle is two FPGA cycles (i.e. 50 MHz).
In order to fill the pipeline and decrease the CPI as close to 1 as possible, there are
multiple techniques. The lost cycles are due to waiting conditions, i.e. dependencies
between instructions. One way to recover these cycles is to eliminate the depen-
dencies with prediction. The control flow dependencies can be eliminated through
branch prediction, as was presented in Sect. 9.5. The data dependencies can be eliminated
through value prediction [2,3]. However, predictors are complex pieces of hardware
and I will not give any implementation in this book.
Another possibility is to use independent instructions to fill empty pipeline slots.
While an instruction is waiting for a prior result, the pipeline continues to fetch and
process following instructions and make them bypass the stopped one. This is called
out-of-order, or OoO computation [4]. OoO implementation is tricky and requires
costly additional logic like a register renaming unit [5]. The reward is really only
worth the complexity for speculative superscalar designs [6] in which the pipeline
stages may process multiple instructions simultaneously and the addresses for the
fetch unit are provided through a branch predictor. Again, I will avoid implementing
an OoO design. I will stick to a scalar design (i.e. which processes instructions one
by one), with an optimal CPI of 1.
A third option to fill the pipeline is to interleave instructions from multiple flows
(i.e. by running multiple threads), which is called multithreading.
It should be pointed out though that multithreading is a way to make the pipeline
compute more but not a way to accelerate the run of a thread of computation. The
pipeline is shared by the threads and each thread runs at the speed corresponding to
its share (e.g. half the processor speed if there are two threads, each filling half of
the pipeline slots).
To handle multiple threads, a processor must be provided with multiple pc and
register files (in a multithreaded OoO design implementing register renaming like
SMT, the register file is shared). The pipeline stages and the computing units are
shared.
The inter-stage connections must specify which thread is emitting its result to the
next stage (e.g. d_to_i.hart to send something from the decode stage to the issue
stage).
Running an instruction of a thread means reading sources from and writing results
to the dedicated register file. If the instruction is a control flow one, it updates the pc
of the respective thread.
The multihart_ip hardware is designed as shown in Fig. 10.1 (the figure depicts
a 4-hart design). The figure shows the six stages of the pipeline, from left to right in
the upper half and continued from right to left in the lower half.
In the figure, the green rectangles represent hart slots.
Each pipeline stage has four slots (e.g. green rectangles named i0 to i3 for the
issue stage). Each slot may host one instruction. In the fetch stage, the slots are named
pc0 to pc3. Each can host the code memory address of the instruction of one running
thread.
Each stage processes a single instruction per cycle. Hence, the hart to be processed
is selected among the four slots in each pipeline stage. This selection is done at the
same time in all the stages. In the same cycle, the stages may select different harts:
for example, hart 0 for the fetch stage, hart 3 for the decode stage, hart 2 for the issue
stage and so on.
The selection is done in two successive steps.
The first selection step is represented as a magenta vertical line. It selects one
of the hosted threads in the four hart slots. The second step is represented as a red
vertical line. It either selects the chosen thread from the magenta first step if there is
any, or the incoming instruction from the previous stage output.
Each selection vertical line represents a multiplexer to choose one of its inputs.
A thread can be chosen if the instruction held in the stage or at the stage input
is ready, i.e. fulfills some stage related condition. For example, an instruction in the
issue stage is ready if its register sources are not locked.
The selection process follows a fixed priority order. The first step has priority over
the second one. In the first step, harts are increasingly ordered (i.e. hart 0 has the
highest priority and hart 3 has the lowest one). In the fetch stage, the pc incoming
from the decode stage has priority over the one incoming from the execute stage.
Fig. 10.1 The 4-hart pipeline (hart slots pc0–pc3 in the fetch stage, d0–d3 in the decode stage,
i0–i3 in the issue stage, e0–e3 in the execute stage, m0–m3 in the memory access stage, and
w0–w3 in the writeback stage; the code ram, the decoder, the register files rf0–rf3, the alu, and
the data ram are shared)
Hence, in the figure, the topmost input on any magenta line has the highest priority
and the lowest input on any red line has the lowest priority.
The hart selection adds some delay in the critical path. I reorganized some of the
computations to fit every path into the two FPGA cycles limit.
In an OS based processor, the memory model is imposed by the OS. The memory is
physically an array of bytes but the OS manages the space by paging it and allocating
and freeing pages on demand of the processes or threads. As a result, a thread cannot
directly access the physical memory. Its memory is made of non contiguous pages,
assembled through tables of page references.
I will not detail further the OS organization of the memory.
In a no-OS or bare-metal processor, we are freer to organize the memory as
we wish. Up to this point in the book, I have considered a processor running a single
program. In this case, the program has full access to the processor memory, which
is an array of bytes.
However, I have separated (see Chap. 6) the code and the data memories (which
has some drawbacks like prohibiting JIT (just-in-time) compilation, e.g. building
bit-banging instructions directly: the building program is unable to transfer the built
instructions from the data memory to the code memory; however, I separated the two
memories for reasons related to the number of access ports on the memory banks).
As I introduce multithreading, I need to reconsider how the memory is managed.
For the code memory, I can keep the same organization. There is a single array of
instructions, fully accessible to all of the threads through the fetch stage (remember
though that in a cycle only one thread is selected to fetch).
In each cycle, the fetch stage reads a single instruction from a single thread. The
threads can be placed anywhere in the code memory. Multiple threads can even share
their code.
For the data memory, the single byte array model should be reconsidered.
First, contrary to the code memory, the data memory is writable. It is necessary
to protect the memory space of one thread from the writes of the other threads.
However, the model must stay flexible: on the one hand, it should protect a thread's
space from unwanted concurrent accesses; on the other hand, it should allow threads
to share memory spaces.
Secondly, each thread may use a stack space, which should be fully private.
So, I adapt the data memory as follows: it is now an array of sub-memories, one
per hart. This partitioning ensures protection. A hart accesses its own partition. A
partition is divided into two parts: the static data part at one end and the stack at the
other end. The two parts grow toward one another. The programmer should take care
that the stack never overlaps the static data part (as there is no OS to ensure this
condition).
However, to allow memory sharing, a hart may access the other harts' memory
partitions. This should be done with care by the programmer. Of
course, the stack parts of the other harts' memory partitions should not be accessed,
only their static data parts.
By allowing memory sharing, we allow parallelism. A computation may be di-
vided into threads and the computed data may be partitioned and distributed into the
hart memories.
Of course, as the processor runs a single hart per cycle, the threads are not really run
in parallel. They are interleaved. However, as I will show in this chapter, the efficiency
of the pipeline is improved by a multihart organization and so, a multithreaded
computation is run slightly faster than its sequential version.
The processor is designed to take care of the partitioned memory model. A load
or store instruction computes an access address to the data memory which is viewed
by the running hart as relative to its own partition. This is illustrated in Fig. 10.2.
In the upper part of the figure, each hart accesses the first memory word of its own
partition. For example, hart 1 loads from address 0, which is turned into an access
to the red square, i.e. memory location 0 in hart 1's partition. Hart 2 also loads from
address 0, but this address, relative to its own data memory partition, is turned into an
access to the green square.
In the lower part of the figure, hart 1 successively accesses words in different
partitions of the memory. The HART_DATA_RAM_SIZE constant is the size (in words)
of the hart memory partition.
Fig. 10.2 How a hart addresses the partitioned memory (upper part: local accesses, lower part:
accesses to the whole memory)
The first load instruction (in blue) accesses the first word of the first partition, at
the relative word address -HART_DATA_RAM_SIZE, i.e. the absolute word address 0.
The second load (in red) accesses the first word of the second partition, at the
relative word address 0, i.e. the absolute word address HART_DATA_RAM_SIZE.
The third load (in green) accesses the first word of the third partition, at
the relative word address HART_DATA_RAM_SIZE, i.e. the absolute word address
2*HART_DATA_RAM_SIZE.
The fourth load (in brown) accesses the first word of the last partition, at the
relative word address 2*HART_DATA_RAM_SIZE, i.e. the absolute word address
3*HART_DATA_RAM_SIZE.
In this four banks memory, the hart 0 relative word addresses range from 0 to
4*HART_DATA_RAM_SIZE-1.
For hart 1, the relative word addresses range from -HART_DATA_RAM_SIZE to
3*HART_DATA_RAM_SIZE-1.
For hart 2, the relative word addresses range from -2*HART_DATA_RAM_SIZE to
2*HART_DATA_RAM_SIZE-1.
For hart 3, the relative word addresses range from -3*HART_DATA_RAM_SIZE to
HART_DATA_RAM_SIZE-1.
For any hart, a negative relative address is an access to a partition of a preceding
hart (hart 0 relative addresses are never negative) and a positive relative address
is an access either to the local partition (if the relative word address is less than
HART_DATA_RAM_SIZE) or to a partition of a succeeding hart.
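In short, the absolute word address is the hart-relative address offset by the base
of the running hart's partition. A minimal sketch, where the hart_address helper
and the type widths are assumptions:
#include "ap_int.h"
#define LOG_NB_HART            2  // 4 harts, as in the figure
#define LOG_DATA_RAM_SIZE      16
#define LOG_HART_DATA_RAM_SIZE (LOG_DATA_RAM_SIZE - LOG_NB_HART)
#define HART_DATA_RAM_SIZE     (1 << LOG_HART_DATA_RAM_SIZE)
typedef ap_uint<LOG_NB_HART>           hart_num_t;
typedef ap_int <LOG_DATA_RAM_SIZE + 1> rel_address_t; // signed
typedef ap_uint<LOG_DATA_RAM_SIZE>     abs_address_t;
// absolute word address = hart partition base + relative address
static abs_address_t hart_address(hart_num_t hart,
                                  rel_address_t rel){
  return (abs_address_t)(hart * HART_DATA_RAM_SIZE + rel);
}
// e.g. hart_address(1, -HART_DATA_RAM_SIZE) == 0 (first word of
// hart 0's partition) and hart_address(2, 0) is the first word
// of hart 2's partition (2*HART_DATA_RAM_SIZE)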
The main advantage of the hart-partitioned memory model on bare metal is that
protection and sharing are directly handled by the hardware (with a careful program-
mer: he/she should ensure that a thread which writes to a datum and another thread
which reads from it are properly synchronized, with the reading instruction
execution following the writing one; there is no hardware in the multihart design
to ensure such a synchronization; in Sect. 10.4.2, I present a parallel program with
multiple threads properly self-synchronized).
All the source files related to the multihart_ip can be found in the multihart_ip
folder.
The pipeline has the same six stages as the multicycle one presented in the pre-
ceding chapter.
The fetch stage has been slightly modified (see Fig. 10.3). As the pipeline is
intended to run multiple threads, it is not mandatory to provide a new pc every cycle.
If a new pc is available every other cycle, a 2-hart pipeline can alternate threads, the
first one using the even cycles and the second one using the odd cycles. A drawback
is that if only one thread is running, half of the cycles are lost.
The fetch stage does not decode instructions at all. It cannot distinguish control
flow ones. It does not compute any next pc. The next pc computation is done in the
decode stage (if the instruction is neither a BRANCH nor a JALR). This design gives
better performance for runs with at least two threads than the multicycle pipeline
organization where the fetch stage produces a next pc for itself. Moreover, this
simplification saves LUTs.
In the figure, the fetch stage forwards the fetch pc to the decode stage, which
computes the next pc. This next pc is sent to the fetch stage where it is received two
cycles after the initial fetch. Hence, the pipeline can be filled only when at least two
threads are running (red thread in hart 0 and green thread in hart 1).
Fig. 10.3 Interleaving two harts in the fetch and decode stages (while hart 1 fetches at pc1,
hart 0 decodes and computes pc0+1, which comes back to the fetch stage two cycles after
hart 0's initial fetch)
The number of harts is defined in the multihart_ip.h file (see Listing 10.1). To change
it, you need to change the definition of the LOG_NB_HART constant (1 for two harts, 2
for four harts and 3 for eight harts). The number of harts cannot be 1 (LOG_NB_HART
should not be 0) and cannot be greater than eight.
The code memory is organized as a single bank, shared by all the harts. The codes
they run are all mixed in the same memory.
The data memory is partitioned. Each hart has a private partition of HART_DATA_
RAM_SIZE words. The total data memory size is DATA_RAM_SIZE, i.e. 2^16 words
(256 KB), whatever the number of harts. Hence, when the number of harts increases,
the hart memory partition size decreases. For example, with two harts, each hart has
a 2^15 words partition (128 KB). With four harts, the size of the partitions is 2^14 words
(64 KB). With eight harts, the size is 2^13 words (32 KB).
Listing 10.1 The multihart_ip.h file (partial)
#ifndef __MULTIHART_IP
#define __MULTIHART_IP
#include "ap_int.h"
#include "debug_multihart_ip.h"
#define LOG_NB_HART            1
#define NB_HART                (1 << LOG_NB_HART)
#define LOG_CODE_RAM_SIZE      16
#define CODE_RAM_SIZE          (1 << LOG_CODE_RAM_SIZE)
#define LOG_DATA_RAM_SIZE      16
#define DATA_RAM_SIZE          (1 << LOG_DATA_RAM_SIZE)
#define LOG_HART_DATA_RAM_SIZE (LOG_DATA_RAM_SIZE - LOG_NB_HART)
#define HART_DATA_RAM_SIZE     (1 << LOG_HART_DATA_RAM_SIZE)
#define LOG_REG_FILE_SIZE      5
#define NB_REGISTER            (1 << LOG_REG_FILE_SIZE)
...
An instruction stays in a pipeline stage until it is selected. Each pipeline stage has
an internal array to hold its waiting instructions.
The array is named from the pipeline stage initial letter and the state suffix (e.g.
e_state for the execute stage array).
The array has one entry per hart (hence, a pipeline stage may not hold more than
one waiting instruction per running thread).
A state array entry is a structure gathering all the input and output fields of the
pipeline stage.
For example, the f_state array entry has two fields: fetch_pc and instruction. The
fetch_pc field is used to hold the incoming pc sent by the decode stage on the d_to_f
link or by the execute stage on the e_to_f link. The instruction field is used to hold
the fetched instruction to be sent to the decode stage on the f_to_d link.
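As an illustration, the f_state entry just described can be pictured as the following structure (a hedged sketch; the exact type definition is part of Listing 10.2):
typedef struct f_state_s {
  code_address_t fetch_pc;    // pc received on the d_to_f or e_to_f link
  instruction_t  instruction; // fetched word, sent on the f_to_d link
} f_state_t;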
Figure 10.4 shows the state arrays and the inter-stage links (red arrows).
Fig. 10.4 The six pipeline stages, their state arrays, and their inter-stage links
The types of the state arrays and the types of the inter-stage links for the six pipeline
stages are shown in Listings 10.2 to 10.7, which are part of the multihart_ip.h file.
  bit_t            is_store;
  func3_t          func3;
  bit_t            is_ret;
  b_data_address_t address;
  int              value;
#ifndef __SYNTHESIS__
  code_address_t        fetch_pc;
  instruction_t         instruction;
  decoded_instruction_t d_i;
  code_address_t        target_pc;
#endif
} from_e_to_m_t;
The instructions in the multihart pipeline move unevenly along the stages. An
instruction may stay in the issue stage because its source registers are locked. It may
stay in any stage because a higher priority hart is selected.
In each stage, the processing starts with a hart selection to choose which instruction
is to be processed. To be selected, a hart must have its state array entry filled with an
instruction.
Moreover, selecting a hart h is possible only if the next stage is able to host the
output of the processing, i.e. if the hart h state array entry in the next stage is empty.
This selection procedure is not optimal though, because a full entry in stage s+1
blocks a selection in stage s even if this entry is to be processed in the same cycle.
However, an optimal selection algorithm would rely on a serialization of the stage
selections, each stage selection depending on what hart is selected in the next stage.
Anyway, the non-optimal selection algorithm used in the design is efficient, as will
be shown by the performance measures at the end of the chapter.
So, the selection algorithm requires keeping track of the occupation of
the different state array entries. Stage s should be aware of which state array entries
are empty in stage s+1.
For this purpose, for each stage an array of occupation bits is transmitted to
its predecessor (see Fig. 10.5, in which each green arrow represents an array of
NB_HART bits).
For example, the d_state_is_full array links the decode stage to the fetch stage.
When hart h array entry is occupied in the decode stage, d_state_is_full[h] is set. In
this case, hart h cannot be selected for fetch in the fetch stage.
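For the fetch stage, the per-hart condition computed in the first step of select_hart can be sketched as follows (a hedged reconstruction following the init_file coding pattern; the second step is shown in Listing 10.21):
bit_t c[NB_HART];
for (hart_num_p1_t h1 = 0; h1 < NB_HART; h1++) {
#pragma HLS UNROLL
  hart_num_t h = h1;
  // hart h is selectable if it holds a waiting instruction and its
  // entry in the next (decode) stage is free
  c[h] = f_state_is_full[h] && !d_state_is_full[h];
}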
Fig. 10.5 Inter-stage retro links (in green) to show which state array entries are occupied
In the previous designs, there were only one-dimensional arrays
(i.e. vectors) to partition, so dim was systematically 1. In the multihart design, the
arrays are matrices. The partitioning should apply to all the dimensions (i.e. each
element should have its individual access port); hence the dim option is set as dim=0
complete (e.g. NB_HART*NB_REGISTER ports for reg_file).
Listing 10.9 The multihart_ip function local declarations (1)
...
int   reg_file[NB_HART][NB_REGISTER];
#pragma HLS ARRAY_PARTITION variable=reg_file dim=0 complete
bit_t is_reg_computed[NB_HART][NB_REGISTER];
#pragma HLS ARRAY_PARTITION variable=is_reg_computed dim=0 complete
...
from_i_to_e_t i_to_e;
from_i_to_e_t e_from_i;
bit_t         e_state_is_full[NB_HART];
#pragma HLS ARRAY_PARTITION variable=e_state_is_full dim=1 complete
e_state_t     e_state[NB_HART];
#pragma HLS ARRAY_PARTITION variable=e_state dim=1 complete
from_e_to_f_t e_to_f;
from_e_to_m_t e_to_m;
from_e_to_m_t m_from_e;
bit_t         m_state_is_full[NB_HART];
#pragma HLS ARRAY_PARTITION variable=m_state_is_full dim=1 complete
m_state_t     m_state[NB_HART];
#pragma HLS ARRAY_PARTITION variable=m_state dim=1 complete
from_m_to_w_t m_to_w;
from_m_to_w_t w_from_m;
bit_t         w_state_is_full[NB_HART];
#pragma HLS ARRAY_PARTITION variable=w_state_is_full dim=1 complete
w_state_t     w_state[NB_HART];
#pragma HLS ARRAY_PARTITION variable=w_state dim=1 complete
...
A figure here illustrates the lock_unlock_update function: it sets is_reg_computed[ih][id] to 1 (locking the issuing hart's destination register) and clears is_reg_computed[wh][wd] to 0 (unlocking the writing-back hart's destination register), the is_reg_computed array being partitioned into is_reg_computed0 to is_reg_computed3.
The stack pointer register sp points to the first word after the hart data memory
partition. As the stack grows downward, the first word pushed onto the stack is placed at
the end of the hart partition. As a result, each hart has its own local stack.
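For example, with two harts, LOG_HART_DATA_RAM_SIZE is 15 and init_file (see Listing 10.16) sets sp to 1<<(15+2) = 0x20000, i.e. the byte address just after the 128 KB local partition.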
Listing 10.16 The init_file function
// a1/x11 is set to the hart number
static void init_file(
  int   reg_file       [][NB_REGISTER],
  bit_t is_reg_computed[][NB_REGISTER]) {
  hart_num_p1_t h1;
  hart_num_t    h;
  reg_num_p1_t  r1;
  reg_num_t     r;
  for (h1=0; h1<NB_HART; h1++) {
#pragma HLS UNROLL
    h = h1;
    for (r1=0; r1<NB_REGISTER; r1++) {
#pragma HLS UNROLL
      r = r1;
      is_reg_computed[h][r] = 0;
      if (r==11)
        reg_file[h][r] = h;
      else if (r==SP)
        reg_file[h][r] = (1<<(LOG_HART_DATA_RAM_SIZE+2));
      else
        reg_file[h][r] = 0;
    }
  }
}
When the number of harts is four or eight, the main loop iteration duration is three
FPGA cycles (to check this, you must synthesize and have a look at the Schedule
Viewer; the synthesis time is rather long for eight harts: half an hour on my laptop).
However, the II value remains 2 and the multihart IP cycle stays at two FPGA
cycles. This implies an overlapping of one cycle between two successive iterations,
as shown in Fig. 10.7.
The overlapping does not cause any problem if everything used in the next iteration
is set by the current iteration before the end of its second FPGA cycle. The synthesizer
ensures that this condition is met (otherwise, it would raise an II violation).
The new_cycle function in the new_cycle.cpp file copies the _to_ variables into the
_from_ ones (see Listing 10.18).
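A minimal sketch of this copy, assuming one _to_/_from_ pair per inter-stage link (the actual prototype is in Listing 10.18):
static void new_cycle(
  from_f_to_d_t  f_to_d,   from_d_to_f_t  d_to_f, /* ...other links... */
  from_f_to_d_t *d_from_f, from_d_to_f_t *f_from_d /* ... */) {
  *d_from_f = f_to_d; // what fetch sent is what decode receives
  *f_from_d = d_to_f; // what decode sent is what fetch receives
  /* ...the same copy for the remaining links... */
}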
If no selectable hart is found in the state array, is_fetching may still be set if there
is a valid input. Then, the fetching_hart is set as the inputting hart (f_from_d.hart if
f_from_d.is_valid is set, f_from_e.hart otherwise; the input from the decode stage
has priority over the input of the execute stage).
A set is_fetching bit indicates that a fetch is done in the current cycle. The
fetching_hart value is the number of the fetching hart. The matching entry in
the f_state_is_full array is cleared.
The fetching hart fetches in the stage_job function. The fetched instruction is
transmitted to the decode stage (set_output_to_d function and f_to_d->is_valid).
Listing 10.19 The fetch function
void fetch(
  from_d_to_f_t  f_from_d,
  from_e_to_f_t  f_from_e,
  bit_t         *d_state_is_full,
  instruction_t *code_ram,
  f_state_t     *f_state,
  from_f_to_d_t *f_to_d,
  bit_t         *f_state_is_full) {
  bit_t      is_selected;
  hart_num_t selected_hart;
  bit_t      is_fetching;
  hart_num_t fetching_hart;
  select_hart(f_state_is_full, d_state_is_full,
             &is_selected, &selected_hart);
  if (f_from_d.is_valid) {
    f_state_is_full[f_from_d.hart] = 1;
    save_input_from_d(f_from_d, f_state);
  }
  if (f_from_e.is_valid) {
    f_state_is_full[f_from_e.hart] = 1;
    save_input_from_e(f_from_e, f_state);
  }
  is_fetching =
    is_selected ||
    (f_from_d.is_valid && !d_state_is_full[f_from_d.hart]) ||
    (f_from_e.is_valid && !d_state_is_full[f_from_e.hart]);
  fetching_hart =
    (is_selected) ? selected_hart :
    (f_from_d.is_valid && !d_state_is_full[f_from_d.hart]) ?
     f_from_d.hart : f_from_e.hart;
  if (is_fetching) {
    f_state_is_full[fetching_hart] = 0;
    stage_job(fetching_hart, f_state, code_ram);
#ifndef __SYNTHESIS__
#ifdef DEBUG_PIPELINE
    printf("hart %d: fetched", (int)fetching_hart);
    printf(" %04d: %08x\n",
           (int)(f_state[fetching_hart].fetch_pc<<2),
           f_state[fetching_hart].instruction);
#endif
#endif
    set_output_to_d(fetching_hart, f_state, f_to_d);
  }
  f_to_d->is_valid = is_fetching;
}
In a second step (see Listing 10.21), the c[h] Boolean values are ORed in a binary
tree to form the is_selected value (i.e. if at least one hart h has its c[h] condition set,
is_selected is set).
The selected_hart is set as the first one having its c[h] condition set. Hence, harts
are priority ordered (hart 0 has the highest priority).
Listing 10.21 The fetch stage select_hart function: selected_hart and is_selected
...
#if (NB_HART<2)
  *selected_hart = 0;
  *is_selected   = c[0];
#elif (NB_HART<3)
  *selected_hart = (c[0])?0:1;
  *is_selected   = (c[0] || c[1]);
#elif (NB_HART<5)
  hart_num_t h01, h23;
  bit_t      c01, c23;
  h01 = (c[0])?0:1;
  c01 = (c[0] || c[1]);
  h23 = (c[2])?2:3;
  c23 = (c[2] || c[3]);
  *selected_hart = (c01)?h01:h23;
  *is_selected   = (c01 || c23);
#elif (NB_HART<9)
  hart_num_t h01, h23, h45, h67, h03, h47;
  bit_t      c01, c23, c45, c67, c03, c47;
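For eight harts, the tree has one more level. A hedged sketch of how the computation completes, following the same binary-tree pattern (an assumption, not the book's verbatim code):
  h01 = (c[0])?0:1;    c01 = (c[0] || c[1]);
  h23 = (c[2])?2:3;    c23 = (c[2] || c[3]);
  h45 = (c[4])?4:5;    c45 = (c[4] || c[5]);
  h67 = (c[6])?6:7;    c67 = (c[6] || c[7]);
  h03 = (c01)?h01:h23; c03 = (c01 || c23);
  h47 = (c45)?h45:h67; c47 = (c45 || c67);
  *selected_hart = (c03)?h03:h47;
  *is_selected   = (c03 || c47);
#endif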
The decode function in the decode.cpp file (see Listing 10.22) works the same way
as the fetch function. It starts with a hart selection (a call to the select_hart function;
the selection is similar to the fetch stage one).
In parallel, the input from the fetch stage is saved in the state array (a call to the
save_input_from_f function).
When a decoding hart has been selected, its instruction is decoded (in the
stage_job function).
The fields of the output structures to be sent to the fetch stage and to the issue
stage are filled (set_output_to_f and set_output_to_i functions).
The output to the fetch stage is valid if an instruction has been decoded and if it
is neither a BRANCH nor a JALR.
The output to the issue stage is valid if an instruction has been decoded.
Listing 10.22 The decode function
void decode(
  from_f_to_d_t  d_from_f,
  bit_t         *i_state_is_full,
  d_state_t     *d_state,
  from_d_to_f_t *d_to_f,
  from_d_to_i_t *d_to_i,
  bit_t         *d_state_is_full) {
  bit_t      is_selected;
  hart_num_t selected_hart;
  bit_t      is_decoding;
  hart_num_t decoding_hart;
  select_hart(d_state_is_full, i_state_is_full,
             &is_selected, &selected_hart);
  if (d_from_f.is_valid) {
    d_state_is_full[d_from_f.hart] = 1;
    save_input_from_f(d_from_f, d_state);
  }
  is_decoding =
    is_selected ||
    (d_from_f.is_valid && !i_state_is_full[d_from_f.hart]);
  decoding_hart =
    (is_selected) ? selected_hart : d_from_f.hart;
  if (is_decoding) {
    d_state_is_full[decoding_hart] = 0;
    stage_job(decoding_hart, d_state);
#ifndef __SYNTHESIS__
#ifdef DEBUG_PIPELINE
    printf("hart %d: decoded %04d: ",
           (int)decoding_hart,
           (int)(d_state[decoding_hart].fetch_pc<<2));
    disassemble(d_state[decoding_hart].fetch_pc,
                d_state[decoding_hart].instruction,
                d_state[decoding_hart].d_i);
    if (d_state[decoding_hart].d_i.is_jal)
      printf("pc = %16d (%8x)\n",
        (int)(d_state[decoding_hart].relative_pc<<2),
        (unsigned int)(d_state[decoding_hart].relative_pc<<2));
#endif
#endif
    set_output_to_f(decoding_hart, d_state, d_to_f);
    set_output_to_i(decoding_hart, d_state, d_to_i);
  }
  d_to_f->is_valid =
    is_decoding &&
    !d_state[decoding_hart].d_i.is_branch &&
    !d_state[decoding_hart].d_i.is_jalr;
  d_to_i->is_valid = is_decoding;
}
if (i_from_d.is_valid) {
  i_state_is_full[i_from_d.hart] = 1;
  save_input_from_d(i_from_d, is_reg_computed, i_state);
}
is_issuing =
  is_selected ||
  (i_from_d.is_valid && !e_state_is_full[i_from_d.hart] &&
  !i_state[i_from_d.hart].wait_12);
issuing_hart =
  (is_selected) ? selected_hart : i_from_d.hart;
...
If there are two harts, a second c[1] condition is computed the same way for hart
1.
If there are four harts, three more c[1] to c[3] conditions are computed the same
way for harts 1 to 3.
If there are eight harts, seven more c[1] to c[7] conditions are computed the same
way for harts 1 to 7.
The last part of the select_hart function in the issue stage computes the final
is_selected and selected_hart values. The code is the same as the one in the se-
lect_hart function in the fetch stage (refer back to Listing 10.21).
int            result2;
code_address_t computed_pc;
select_hart(e_state_is_full, m_state_is_full,
           &is_selected, &selected_hart);
if (e_from_i.is_valid) {
  e_state_is_full[e_from_i.hart] = 1;
  save_input_from_i(e_from_i, e_state);
}
is_executing =
  is_selected ||
  (e_from_i.is_valid && !m_state_is_full[e_from_i.hart]);
executing_hart =
  (is_selected) ? selected_hart : e_from_i.hart;
if (is_executing) {
  e_state_is_full[executing_hart] = 0;
  compute(executing_hart, e_state, &bcond, &result1,
         &result2, &computed_pc);
  stage_job(executing_hart, e_state, bcond, computed_pc);
#ifndef __SYNTHESIS__
#ifdef DEBUG_PIPELINE
  printf("hart %d: execute", (int)executing_hart);
  printf(" %04d\n", (int)(e_state[executing_hart].fetch_pc<<2));
  if (e_state[executing_hart].d_i.is_branch ||
      e_state[executing_hart].d_i.is_jalr)
    emulate(executing_hart, reg_file,
            e_state[executing_hart].d_i,
            e_state[executing_hart].target_pc);
#endif
#endif
  set_output_to_f(executing_hart, e_state, e_to_f);
  set_output_to_m(executing_hart, result1, result2,
                  computed_pc, e_state, e_to_m);
}
// block fetch after last RET
// (i.e. RET with 0 return address)
e_to_f->is_valid =
  is_executing && e_state[executing_hart].is_target;
e_to_m->is_valid = is_executing;
}
The mem_access function in the mem_access.cpp file, which implements the memory access stage, is shown in Listings 10.29 and 10.31.
Listing 10.29 The mem_access function: computing the is_accessing and accessing_hart values
void mem_access(
  from_e_to_m_t  m_from_e,
  bit_t         *w_state_is_full,
  int            data_ram[][HART_DATA_RAM_SIZE],
  m_state_t     *m_state,
  from_m_to_w_t *m_to_w,
  bit_t         *m_state_is_full) {
  bit_t      is_selected;
  hart_num_t selected_hart;
  bit_t      is_accessing;
  hart_num_t accessing_hart;
  select_hart(m_state_is_full, w_state_is_full,
             &is_selected, &selected_hart);
  if (m_from_e.is_valid) {
    m_state_is_full[m_from_e.hart] = 1;
    save_input_from_e(m_from_e, m_state);
  }
  is_accessing =
    is_selected ||
    (m_from_e.is_valid && !w_state_is_full[m_from_e.hart]);
  accessing_hart =
    (is_selected) ? selected_hart : m_from_e.hart;
  ...
The mem_load function selects the requested bytes out of the read word (unchanged from rv32i_npp_ip; see Listing 6.11).
#endif
#endif
  }
}
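As a reminder, the byte case of this selection can be sketched as follows (illustrative only; the complete function handling bytes, halfwords, and words is Listing 6.11):
// select one byte out of the 32-bit word read from memory and
// sign- or zero-extend it (b is the byte position within the word)
static int load_byte(int word, unsigned int b, int is_unsigned) {
  int v = (word >> (8 * b)) & 0xff;
  return (is_unsigned || !(v & 0x80)) ? v : (v | 0xffffff00);
}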
Experimentation
To simulate the multihart_ip, operate as explained in Sect. 5.3.6, replacing fetch-
ing_ip with multihart_ip. There are two testbench programs: testbench_seq_
multihart_ip.cpp to run independent codes (one per hart) and testbench_par_
multihart_ip.cpp to run a parallel sum of the elements of an array.
With testbench_seq_multihart_ip.cpp you can play with the simulator, replacing
the included test_mem_0_text.hex file with any other .hex file you find in the
same folder. You can also vary the number of harts.
Two different testbench files are provided. One is to run the set of test codes (from
test_branch.s to test_sum.s). The other is to run a distributed parallel sum of the
elements of an array.
The first testbench is in the testbench_seq_multihart_ip.cpp file. Each hart runs
a copy of the selected test code. All the harts are set as active at the start of the run
and they have the same start address (start_pc[h]=0).
The test codes can be built with the build_seq.sh shell script shown in Listing
10.41. The script needs the file name of the test code to build (argument $1; for
example "./build_seq.sh test_mem").
After each hart has computed its local sum, the first hart reads the partial sums (remote memory accesses) and computes their overall sum (reduction operation).
The parallelized version of test_mem.s is named test_mem_par_2h.s for two
harts, test_mem_par_4h.s for four harts, and test_mem_par_8h.s for eight harts.
When you switch from x harts to y harts, you must not forget to do two updates: the
LOG_NB_HART value in the multihart_ip.h file and the name of the included hex
file in the testbench_par_multihart_ip.cpp file.
Then, the first hart sums the elements in a second loop (.L2; see Listing 10.47).
The local sum is saved at the local address 40.
Listing 10.47 The test_mem_par_2h.s file (RISC-V source code): summing the local partition
...
        li   a1,0          /* a1 = 0 */
        li   a2,0          /* a2 = 0 */
.L2:
        lw   a4,0(a2)      /* a4 = t[a2] */
        addi a2,a2,4       /* a2 += 4 */
        add  a0,a0,a4      /* a0 += a4 */
        bne  a2,a3,.L2     /* if (a3 != a2) goto .L2 */
        li   a2,40         /* a2 = 40 */
        sw   a0,0(a2)      /* t[a2] = a0 */
...
Eventually, the first hart adds the other harts' partial sums to compute the total (see
Listing 10.48). It uses accesses to other partitions (positive addresses greater than
the partition size).
The synchronization between the writers (the other harts writing their partial sums
in their local memories) and the reader (the first hart reading these partial sums through
accesses to the external memory) is done in the inner .L3 loop. The first hart keeps
reading while the memory word is null (the inner loop contains two instructions: lw
and beq). As soon as the other hart has written its sum, the addressed word is no longer
null and the first hart exits from the inner loop and accumulates the value
loaded into register a3.
Listing 10.48 The test_mem_par_2h.s file (RISC-V source code): accumulating the other partitions' local sums
...
        li   a1,1                   /* a1 = 1 */
        li   a2,NB_HART             /* a2 = NB_HART */
        li   a4,HART_DATA_RAM_SIZE
        mv   a5,a4                  /* a5 = a4 */
.L3:    lw   a3,40(a4)              /* a3 = t[a4+40] */
        beq  a3,zero,.L3            /* if (a3 == 0) goto .L3 */
        add  a0,a0,a3               /* a0 += a3 */
        add  a4,a4,a5               /* a4 += a5 */
        addi a1,a1,1                /* a1++ */
        bne  a1,a2,.L3              /* if (a1 != a2) goto .L3 */
        li   a2,44                  /* a2 = 44 */
        sw   a0,0(a2)               /* t[a2] = a0 */
        ret
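For readability, the same reduction can be sketched in C as follows (hedged: t stands for the int view of the whole data memory, HART_DATA_RAM_SIZE is in words, and 40 is the byte offset where each hart stores its partial sum):
int total = local_sum;                    // hart 0 local sum (a0)
for (int h = 1; h < NB_HART; h++) {
  volatile int *p = &t[h * HART_DATA_RAM_SIZE + 40/4];
  while (*p == 0) ;                       // spin: the lw + beq of .L3
  total += *p;                            // accumulate hart h's sum
}
t[44/4] = total;                          // total at local byte address 44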
The OTHER_HART_START constant definition sets the starting point of the code to
be run by the harts other than 0 (OTHER_HART_START=0x74/4, i.e. instruction 29).
m[    4] =   2 (  2)
m[    8] =   3 (  3)
m[    c] =   4 (  4)
m[   10] =   5 (  5)
m[   14] =   6 (  6)
m[   18] =   7 (  7)
m[   1c] =   8 (  8)
m[   20] =   9 (  9)
m[   24] =  10 (  a)
m[   28] =  55 ( 37)
m[   2c] = 210 ( d2)
hart 1: data memory dump (non null words)
m[20000] =  11 (  b)
m[20004] =  12 (  c)
m[20008] =  13 (  d)
m[2000c] =  14 (  e)
m[20010] =  15 (  f)
m[20014] =  16 ( 10)
m[20018] =  17 ( 11)
m[2001c] =  18 ( 12)
m[20020] =  19 ( 13)
m[20024] =  20 ( 14)
m[20028] = 155 ( 9b)
Figure 10.8 shows the synthesis report for two harts. The IP cycle is two FPGA cycles
(20 ns, 50 MHz). This is also the case for four and eight harts.
Figure 10.9 shows that the main loop iteration takes two cycles, hence the multihart
IP cycle is 20 ns (for four and eight harts, the iteration takes three cycles but the II
interval is two cycles).
Figure 10.10 shows the Vivado implementation cost of the multihart_ip design for
two harts (4697 LUTs, 8.83%).
Figure 10.11 shows the Vivado implementation cost of the multihart_ip design
for four harts (7537 LUTs, 14.17%).
Figure 10.12 shows the Vivado implementation cost of the multihart_ip design
for eight harts (12,866 LUTs, 24.18%).
Fig. 10.10 The multihart_ip Vivado implementation report for two harts
Fig. 10.11 The multihart_ip Vivado implementation report for four harts
Fig. 10.12 The multihart_ip Vivado implementation report for eight harts
Experimentation
To run the multihart_ip on the development board, proceed as explained in Sect.
5.3.10, replacing fetching_ip with multihart_ip.
There are two drivers: helloworld_seq_hart.c, with which you can run independent
programs on each hart, and helloworld_par_hart.c, with which you can run a parallel
sum of the elements of an array.
The code to drive the FPGA to run the test_mem.s program on two harts is shown in
Listing 10.51 (do not forget to adapt the path to the hex file to your environment with
the update_helloworld.sh shell script). To run another program, e.g. test_branch,
test_jal_jalr, test_load_store, test_lui_auipc, test_op, test_op_imm, or test_sum,
update the code_ram array initialization #include line. To increase the number of
harts, change the LOG_NB_HART value.
Listing 10.51 The helloworld_seq_hart.c file driving the 2-hart multihart_ip
#include <stdio.h>
#include "xmultihart_ip.h"
#include "xparameters.h"
#define LOG_NB_HART            1
#define NB_HART                (1 << LOG_NB_HART)
#define LOG_CODE_RAM_SIZE      16
// size in words
#define CODE_RAM_SIZE          (1 << LOG_CODE_RAM_SIZE)
#define LOG_DATA_RAM_SIZE      16
// size in words
#define DATA_RAM_SIZE          (1 << LOG_DATA_RAM_SIZE)
#define LOG_HART_DATA_RAM_SIZE (LOG_DATA_RAM_SIZE - LOG_NB_HART)
#define HART_DATA_RAM_SIZE     (1 << LOG_HART_DATA_RAM_SIZE)
XMultihart_ip_Config *cfg_ptr;
XMultihart_ip         ip;
word_type code_ram[CODE_RAM_SIZE] = {
#include "test_mem_text.hex"
};
word_type start_pc[NB_HART] = {0};
int main() {
  unsigned int nbi, nbc;
  word_type    w;
  cfg_ptr = XMultihart_ip_LookupConfig(
    XPAR_XMULTIHART_IP_0_DEVICE_ID);
  XMultihart_ip_CfgInitialize(&ip, cfg_ptr);
  XMultihart_ip_Set_running_hart_set(&ip, (1<<NB_HART)-1);
  XMultihart_ip_Write_start_pc_Words(&ip, 0, start_pc, NB_HART);
  XMultihart_ip_Write_code_ram_Words(&ip, 0, code_ram,
    CODE_RAM_SIZE);
  XMultihart_ip_Start(&ip);
  while (!XMultihart_ip_IsDone(&ip));
  nbi = XMultihart_ip_Get_nb_instruction(&ip);
  nbc = XMultihart_ip_Get_nb_cycle(&ip);
  printf("%d fetched and decoded instructions \
in %d cycles (ipc = %2.2f)\n", nbi, nbc, ((float)nbi)/nbc);
  for (int h=0; h<NB_HART; h++) {
    printf("hart %d data memory dump (non null words)\n", h);
    for (int i=0; i<HART_DATA_RAM_SIZE; i++) {
      XMultihart_ip_Read_data_ram_Words
        (&ip, i+(((int)h)<<LOG_HART_DATA_RAM_SIZE), &w, 1);
      if (w != 0)
        printf("m[%5x] = %16d (%8x)\n",
          4*(i+(((int)h)<<LOG_HART_DATA_RAM_SIZE)),
          (int)w, (unsigned int)w);
    }
  }
}
The run of the RISC-V code in the test_mem.s file on the FPGA produces the
print shown in Listing 10.52 to the putty window (run for two harts).
Listing 10.52 The helloworld.c print on the putty window
176 fetched and decoded instructions in 306 cycles (ipc = 0.58)
hart 0 data memory dump (non null words)
m[    0] =  1 (  1)
m[    4] =  2 (  2)
m[    8] =  3 (  3)
m[    c] =  4 (  4)
m[   10] =  5 (  5)
m[   14] =  6 (  6)
m[   18] =  7 (  7)
m[   1c] =  8 (  8)
m[   20] =  9 (  9)
m[   24] = 10 (  a)
m[   2c] = 55 ( 37)
hart 1 data memory dump (non null words)
m[20000] =  1 (  1)
m[20004] =  2 (  2)
m[20008] =  3 (  3)
m[2000c] =  4 (  4)
m[20010] =  5 (  5)
m[20014] =  6 (  6)
m[20018] =  7 (  7)
m[2001c] =  8 (  8)
m[20020] =  9 (  9)
m[20024] = 10 (  a)
m[2002c] = 55 ( 37)
The code to drive the FPGA to run the test_mem_par_2h.s code on two harts is
shown in Listing 10.53 (do not forget to adapt the path to the hex file to your
environment with the update_helloworld.sh shell script). To increase the number
of harts, change the LOG_NB_HART value and change the included hex file.
Listing 10.53 The helloworld_par_hart.c file driving the multihart_ip for the
test_mem_par_2h.s code
#include <stdio.h>
#include "xmultihart_ip.h"
#include "xparameters.h"
#define LOG_NB_HART            1
#define NB_HART                (1 << LOG_NB_HART)
#define LOG_CODE_RAM_SIZE      16
// size in words
#define CODE_RAM_SIZE          (1 << LOG_CODE_RAM_SIZE)
#define LOG_DATA_RAM_SIZE      16
// size in words
#define DATA_RAM_SIZE          (1 << LOG_DATA_RAM_SIZE)
#define LOG_HART_DATA_RAM_SIZE (LOG_DATA_RAM_SIZE - LOG_NB_HART)
#define HART_DATA_RAM_SIZE     (1 << LOG_HART_DATA_RAM_SIZE)
#define OTHER_HART_START       0x74/4
XMultihart_ip_Config *cfg_ptr;
XMultihart_ip         ip;
unsigned int code_ram[CODE_RAM_SIZE] = {
#include "test_mem_par_2h_text.hex"
};
word_type start_pc[NB_HART];
int main() {
  unsigned int nbi;
  unsigned int nbc;
  word_type    w;
  cfg_ptr = XMultihart_ip_LookupConfig(
    XPAR_XMULTIHART_IP_0_DEVICE_ID);
  XMultihart_ip_CfgInitialize(&ip, cfg_ptr);
  XMultihart_ip_Set_running_hart_set(&ip, (1<<NB_HART)-1);
  for (int h=1; h<NB_HART; h++)
    start_pc[h] = OTHER_HART_START;
  start_pc[0] = 0;
  XMultihart_ip_Write_start_pc_Words(&ip, 0, start_pc, NB_HART);
  XMultihart_ip_Write_code_ram_Words(&ip, 0, code_ram,
    CODE_RAM_SIZE);
  XMultihart_ip_Start(&ip);
  while (!XMultihart_ip_IsDone(&ip));
  nbi = XMultihart_ip_Get_nb_instruction(&ip);
  nbc = XMultihart_ip_Get_nb_cycle(&ip);
  printf("%d fetched and decoded instructions \
in %d cycles (ipc = %2.2f)\n", nbi, nbc, ((float)nbi)/nbc);
  for (int h=0; h<NB_HART; h++) {
    printf("hart %d data memory dump (non null words)\n", h);
    for (int i=0; i<HART_DATA_RAM_SIZE; i++) {
      XMultihart_ip_Read_data_ram_Words
        (&ip, i+(((int)h)<<LOG_HART_DATA_RAM_SIZE), &w, 1);
      if (w != 0)
        printf("m[%5x] = %16d (%8x)\n",
          4*(i+(((int)h)<<LOG_HART_DATA_RAM_SIZE)),
          (int)w, (unsigned int)w);
    }
  }
}
The run of the RISC-V code in the test_mem_par_2h.s file on the FPGA produces
the same print as the simulation (run for two harts; the total 210 appears in the hart 0 dump and the partial sum 155 in the hart 1 dump).
To pass the riscv-tests on the Vitis_HLS simulator, you just need to use the test-
bench_riscv_tests_multihart_ip.cpp program in the riscv-tests/my_isa/my_rv32ui
folder as the testbench.
To pass the riscv-tests on the FPGA, you must use the helloworld_multihart_2h_
ip.c, helloworld_multihart_4h_ip.c, or helloworld_multihart_8h_ip.c in the riscv-
tests/my_isa/my_rv32ui folder. Normally, since you already ran the update_
helloworld.sh shell script for the other processors, the helloworld_multihart_xh_ip.c
file (xh stands for 2h, 4h, or 8h) should have paths adapted to your environment.
However, if you did not, you must run the update_helloworld.sh shell script.
To run a benchmark from the mibench suite, say my_dir/bench, you set the testbench
as the testbench_bench_multihart_ip.cpp file found in the mibench/my_mibench/
my_dir/bench folder. For example, to run basicmath, you set the testbench as
testbench_basicmath_multihart_ip.cpp in the mibench/my_mibench/my_automotive/basicmath folder.
To run one of the official riscv-tests benchmarks, say bench, you set the testbench
as the testbench_bench_multihart_ip.cpp file found in the riscv-tests/benchmarks/bench folder. For example, to run median, you set the testbench as testbench_
median_multihart_ip.cpp in the riscv-tests/benchmarks/median folder.
To run the same benchmarks on the FPGA, select helloworld_multihart_2h_ip.c
to run on the z1_multihart_2h_ip Vivado project, helloworld_multihart_4h_ip.c to
run on the z1_multihart_4h_ip Vivado project, or helloworld_multihart_8h_ip.c to
run on the z1_multihart_8h_ip Vivado project.
Table 10.1 shows the execution time of the different benchmarks run on a 2-hart
design as computed with equation 5.1 (nmi * cpi * c, where c = 20 ns). The baseline
time reference refers to two successive executions of the same test program on the
rv32i_pp_ip design (the fastest design up to now).
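For example, for basicmath: 61,795,478 instructions x 1.44 CPI x 20 ns, i.e. 88,958,398 cycles x 20 ns = 1.779 s, the value reported in the Time column.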
Two harts are not enough to fill the pipeline. The CPI remains higher than on the
4-stage single hart pipeline (1.41 on the average versus 1.20; see Sect. 8.3). But the
cycle reduction compensates for the CPI degradation, despite the increased length
of the six-stage pipeline.
On the benchmark suite, the 2-hart multihart_ip is 20% faster than the
rv32i_pp_ip.
Table 10.1 Execution time of the benchmarks on the multihart_ip processor (two active harts running the same program)

suite        benchmark     Cycles        nmi           cpi   Time (s)      4-stage time (s)  Improve (%)
mibench      basicmath     88,958,398    61,795,478    1.44  1.779167960   2.258589420       21
mibench      bitcount      88,133,658    65,306,478    1.35  1.762673160   2.274599940       23
mibench      qsort         18,756,398    13,367,142    1.40  0.375127960   0.485171460       23
mibench      stringsearch  1,720,638     1,098,326     1.57  0.034412760   0.038149800       10
mibench      rawcaudio     1,980,166     1,266,316     1.56  0.039603320   0.045490260       13
mibench      rawdaudio     1,383,490     936,598       1.48  0.027669800   0.033768060       18
mibench      crc32         960,042       600,028       1.60  0.019200840   0.021600960       11
mibench      fft           91,511,156    62,730,816    1.46  1.830223120   2.294015640       20
mibench      fft_inv       93,053,088    63,840,638    1.46  1.861061760   2.334520080       20
riscv-tests  median        75,636        55,784        1.36  0.001512720   0.002128260       29
riscv-tests  mm            461,940,400   315,122,748   1.47  9.238808000   11.604328500      20
riscv-tests  multiply      1,099,802     835,794       1.32  0.021996040   0.032401440       32
riscv-tests  qsort         736,402       543,346       1.36  0.014728040   0.019837800       26
riscv-tests  spmv          3,502,988     2,492,304     1.41  0.070059760   0.089876400       22
riscv-tests  towers        906,100       807,616       1.12  0.018122000   0.025583100       29
riscv-tests  vvadd         40,026        32,020        1.25  0.000800520   0.001080720       26
Table 10.2 Execution time of the program examples on the multihart_ip processor (four active harts running the same program)

suite        benchmark     Cycles        nmi           cpi   Time (s)      4-stage time (s)  Improve (%)
mibench      basicmath     136,057,928   123,590,956   1.10  2.721158560   4.517178840       40
mibench      bitcount      143,803,793   130,612,956   1.10  2.876075860   4.549199880       37
mibench      qsort         29,151,626    26,734,284    1.09  0.583032520   0.970342920       40
mibench      stringsearch  2,384,057     2,196,652     1.09  0.047681140   0.076299600       38
mibench      rawcaudio     2,873,992     2,532,632     1.13  0.057479840   0.090980520       37
mibench      rawdaudio     2,074,964     1,873,196     1.11  0.041499280   0.067536120       39
mibench      crc32         1,230,088     1,200,056     1.03  0.024601760   0.043201920       43
mibench      fft           138,090,940   125,461,632   1.10  2.761818800   4.588031280       40
mibench      fft_inv       140,447,968   127,681,276   1.10  2.808959360   4.669040160       40
riscv-tests  median        123,662       111,568       1.11  0.002231360   0.004256520       48
riscv-tests  mm            694,489,981   630,245,496   1.10  13.889799620  23.208657000      40
riscv-tests  multiply      1,894,335     1,671,588     1.13  0.037886700   0.064802880       42
riscv-tests  qsort         1,221,567     1,086,692     1.12  0.024431340   0.039675600       38
riscv-tests  spmv          5,430,195     4,984,608     1.09  0.108603900   0.179752800       40
riscv-tests  towers        1,615,420     1,615,232     1.00  0.032308400   0.051166200       37
riscv-tests  vvadd         66,076        64,040        1.03  0.001321520   0.002161440       39
Table 10.2 shows the execution time of the different program examples run on a
4-hart design as computed with equation 5.1 (nmi ∗ cpi ∗ c, where c = 20 ns). The
baseline time reference refers to four successive executions of the same test program
on the rv32i_pp_ip design.
With four harts, the CPI is lower than the rv32i_pp_ip CPI (1.09 vs. 1.20 on aver-
age). With the 20 ns cycle, the multihart IP is 1.66 times faster than the rv32i_pp_ip
on the average, when the four harts are running. However, the design uses 1.7 times
more LUTs (7,537 LUTs versus 4,334 LUTs).
Table 10.3 shows the execution time of the different program examples run on an
8-hart design as computed with equation 5.1 (nmi * cpi * c, where c = 20 ns). The
baseline time reference refers to eight successive executions of the same test program
on the rv32i_pp_ip design.
The average CPI is 1.09, no better than with four harts. Running eight interleaved
harts on the multihart IP is on the average 1.65 times faster than running the same
eight programs successively on the rv32i_pp_ip. However, the 8-hart design uses
12,866 LUTs, i.e. 3 times the LUTs used to build the rv32i_pp_ip.
Table 10.3 Execution time of the program examples on the multihart_ip processor (eight active harts running the same program)

suite        benchmark     Cycles          nmi             cpi   Time (s)      4-stage time (s)  Improve (%)
mibench      basicmath     273,657,613     247,181,912     1.11  5.473152260   9.034357680       39
mibench      bitcount      285,828,811     261,225,912     1.09  5.716576220   9.098399760       37
mibench      qsort         58,457,350      53,468,568      1.09  1.169147000   1.940685840       40
mibench      stringsearch  4,782,512       4,393,304       1.09  0.095650240   0.152599200       37
mibench      rawcaudio     5,677,516       5,065,264       1.12  0.113550320   0.181961040       38
mibench      rawdaudio     4,138,785       3,746,392       1.10  0.082775700   0.135072240       39
mibench      crc32         2,479,691       2,400,112       1.03  0.049593820   0.086403840       43
mibench      fft           277,969,508     250,923,264     1.11  5.559390160   9.176062560       39
mibench      fft_inv       282,804,386     255,362,552     1.11  5.656087720   9.338080320       39
riscv-tests  median        247,080         223,136         1.11  0.004941600   0.008513040       42
riscv-tests  mm            1,399,160,540   1,260,490,992   1.11  27.983210800  46.417314000      40
riscv-tests  multiply      3,563,258       3,343,176       1.07  0.071265160   0.129605760       45
riscv-tests  qsort         2,391,276       2,173,384       1.10  0.047825520   0.079351200       40
riscv-tests  spmv          10,900,106      9,969,216       1.09  0.218002120   0.359505600       39
riscv-tests  towers        3,261,569       3,230,464       1.01  0.065231380   0.102332400       36
riscv-tests  vvadd         132,852         128,080         1.04  0.002657040   0.004322880       39
Part II
Multiple Core Processors
In this second part, I present multicore systems (the replicated cores are based on
the six-stage multicycle pipeline presented in Chap. 9, either single or multithreaded).
The cores are interconnected through the AXI standard interconnection
system. On a Pynq-Z1/Pynq-Z2, up to eight harts can be implemented (e.g., eight
single hart cores, two cores with four harts each, or four cores with two harts each).
Connecting IPs
11
Abstract
This chapter presents the AXI interconnection system. You will build two multi-IP
components. The different IPs are connected via the AXI interconnect IP provided
by the Vivado component library. The first design connects a rv32i_npp_ip pro-
cessor (presented in Chap. 6) to two block memories, one for code and the other
for data. This design is intended to show how the AXI interconnection system
works. The second design connects two IPs sharing two data memory banks. It is
intended to show how multiple memory blocks are shared by multiple IPs, using
the AXI interconnection to exchange data.
The Advanced eXtensible Interface (AXI), part of the ARM Advanced Microcontroller Bus
Architecture 3 (AXI3) and 4 (AXI4) specifications, is a parallel, high-performance, synchronous, high-frequency, multi-master, multi-slave communication interface, mainly designed for on-chip communication. (Wikipedia)
The Vivado design suite provides ready-to-use IPs to build an AXI-based System-
On-Chip (SOC). The central component is the AXI interconnect IP (see Fig. 11.1).
It is used to interconnect multiple masters to multiple slaves, which identify themselves
through memory mapping. In a memory-mapped system, each IP is identified by its
memory address in a virtual memory address space.
A master is an IP which can initiate a transaction on the AXI bus. A slave is an IP
which responds to a master’s request. A transaction is a round-trip data transmission
from a master to a slave (the request) and from the slave back to the master (the
response). The request can be a single or multiple word read, or a single or multiple
word write. The response is either an acknowledgement (response to a write request,
meaning that the write is done) or a single or multiple data words (response to a read
request).
Fig. 11.2 Connecting the Zynq7 master to a CPU IP and a BRAM IP slaves
Fig. 11.3 The Zynq7 master connected, through the AXI interconnect, to the CPU IP and to two AXI BRAM controller/Block Memory Generator pairs (one for code, one for data)
All the source files related to the rv32i_npp_bram_ip can be found in the rv32i_npp_
bram_ip folder.
Experimentation
To simulate the rv32i_npp_bram_ip, operate as explained in Sect. 5.3.6, replacing
fetching_ip with rv32i_npp_bram_ip.
You can play with the simulator, replacing the included test_mem_0_text.hex
file with any other .hex file you find in the same folder.
The s_axilite ports provide the slave AXI interface with which the Zynq7 can
send the start run order and arguments to the rv32i_npp_ip (i.e. the start pc and the
nb_instruction counter).
The code_ram and data_ram bram ports are the CPU IP private accesses to
the Block Memory Generator IPs. They are used by the LOAD/STORE RISC-V
instructions run by the rv32i_npp_ip processor.
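A hedged sketch of what the corresponding interface pragmas look like in the top function (the argument list is illustrative; the full function is in the rv32i_npp_bram_ip folder):
void rv32i_npp_ip(
  unsigned int   start_pc,
  unsigned int  *nb_instruction,
  instruction_t  code_ram[CODE_RAM_SIZE],
  int            data_ram[DATA_RAM_SIZE]) {
#pragma HLS INTERFACE s_axilite port=start_pc
#pragma HLS INTERFACE s_axilite port=nb_instruction
#pragma HLS INTERFACE bram      port=code_ram // private BRAM access
#pragma HLS INTERFACE bram      port=data_ram // private BRAM access
#pragma HLS INTERFACE s_axilite port=return
  /* ...unchanged rv32i_npp_ip body (see Sect. 6.1)... */
}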
The remainder of the rv32i_npp_ip top function is unchanged (refer back to
Sect. 6.1).
The other files are unchanged, including the testbench and the RISC-V test pro-
grams.
Figure 11.4 is the synthesis report showing that the IP cycle is unchanged (seven
FPGA cycles, like in the original rv32i_npp_ip). The reported Timing Violation is not
important: the design will be finely routed by Vivado.
The Diagram frame now contains what is shown in Fig. 11.6 (notice the three M0x_AXI pins on the right edge of the AXI Interconnect IP).
The master and slave namings may be confusing because an IP may be seen as a
master and, at the same time, as a slave. In the design shown in Fig. 11.3, the Zynq7
IP is a master. The AXI interconnect IP is its slave. It is also the master of the three
other IPs (the CPU and the two AXI BRAM controllers). As a slave of the Zynq7,
it serves its transactions and, as a master of the CPU and BRAMs, it propagates to
them the transactions initiated by the Zynq7.
The third step is to add the two AXI BRAM Controller IPs (still no click on the
proposed Run Connection Automation). They should be customized too (select the
IP, right-click and select Customize Block). In the Re-customize IP dialog box, set
the Number of BRAM interfaces to 1. Also set the AXI Protocol to AXI4LITE. The
Diagram frame now contains what is shown in Fig. 11.7.
The fourth step is to add two Block Memory Generator IPs and customize them
(the Memory Type should be set to True Dual Port RAM). You may also rename
the IPs as code_ram and data_ram (after selecting a memory block IP, edit the
Block Properties/Name entry). The Diagram frame now contains what is shown
in Fig. 11.8.
The fifth step is to add the rv32i_npp_bram_ip CPU built in Sect. 11.2.1. The
rv32i_npp_bram_ip folder in which the IP has been defined, synthesized, and ex-
ported should be added to the list of visible IPs (on the main window menu, tab
Tools, select Settings, then expand IP in the Project Settings frame and select
Repository; in the IP repositories frame, click on “+” and navigate to the folder
containing your IP). Back on the Vivado main window, Diagram frame, you can
add your rv32i_npp_bram_ip component. The Diagram frame now contains what
is shown in Fig. 11.9.
The sixth step is to connect all the IPs together. You can use the tool to draw
hand-made connections (the connections are a bit too complex to have the automatic
connection system find the ones you need). To draw a connection, move
the mouse over a pin and you should see a pen appear, suggesting that you can pull
a line when you click. Pull the line until you reach the pin to be connected (if you
pulled a line from a wrong starting point, you can escape and cancel the line drawing
with the return key).
Experimentation
To run the rv32i_npp_bram_ip on the development board, proceed as explained
in Sect. 5.3.10, replacing fetching_ip with rv32i_npp_bram_ip.
You can play with your IP, replacing the included test_mem_0_text.hex file with
any other .hex file you find in the same folder.
The code to be run on the Zynq7 master is shown in Listing 11.2 (do not forget
to adapt the path to the hex file to your environment with the update_helloworld.sh
shell script). The external code_ram and data_ram arrays are mapped as defined
by Vivado (i.e. 0x4000_0000 and 0x4004_0000, to be checked with the Address
Editor in Vivado).
As the arrays are external to the rv32i_npp_bram_ip, they are not accessed
through XRv32i_npp_ip_Write_code_ram_Words and XRv32i_npp_ip_Read_
data_ram_Words functions but directly using the AXI addresses assigned to the
code_ram and data_ram pointers. These addresses are used by the AXI intercon-
nect IP to route the Zynq7 read and write requests to the concerned AXI BRAM
controller IP.
Listing 11.2 The helloworld.c file
#include <stdio.h>
#include "xrv32i_npp_ip.h"
#include "xparameters.h"
#define LOG_CODE_RAM_SIZE 16
// size in words
#define CODE_RAM_SIZE     (1 << LOG_CODE_RAM_SIZE)
#define LOG_DATA_RAM_SIZE 16
// size in words
#define DATA_RAM_SIZE     (1 << LOG_DATA_RAM_SIZE)
XRv32i_npp_ip_Config *cfg_ptr;
XRv32i_npp_ip         ip;
int *code_ram = (int *)(0x40000000);
int *data_ram = (int *)(0x40040000);
word_type input_code_ram[CODE_RAM_SIZE] = {
#include "test_mem_0_text.hex"
};
int main() {
  word_type w;
  cfg_ptr = XRv32i_npp_ip_LookupConfig(
    XPAR_XRV32I_NPP_IP_0_DEVICE_ID);
  XRv32i_npp_ip_CfgInitialize(&ip, cfg_ptr);
  XRv32i_npp_ip_Set_start_pc(&ip, 0);
  for (int i=0; i<CODE_RAM_SIZE; i++)
    code_ram[i] = input_code_ram[i];
  XRv32i_npp_ip_Start(&ip);
  while (!XRv32i_npp_ip_IsDone(&ip));
  printf("%d fetched and decoded instructions\n",
    (int)XRv32i_npp_ip_Get_nb_instruction(&ip));
  printf("data memory dump (non null words)\n");
  for (int i=0; i<DATA_RAM_SIZE; i++) {
    w = data_ram[i];
    if (w != 0)
      printf("m[%5x] = %16d (%8x)\n", 4*i, (int)w,
        (unsigned int)w);
  }
  return 0;
}
If the RISC-V code in the included test_mem_0_text.hex file is run, the helloworld
driver outputs what is shown in Listing 11.3.
Listing 11.3 The helloworld run output
88 fetched and decoded instructions
data memory dump (non null words)
m[ 0] = 1 ( 1)
m[ 4] = 2 ( 2)
m[ 8] = 3 ( 3)
m[ c] = 4 ( 4)
m[ 10] = 5 ( 5)
m[ 14] = 6 ( 6)
m[ 18] = 7 ( 7)
m[ 1c] = 8 ( 8)
m[ 20] = 9 ( 9)
m[ 24] = 10 ( a)
m[ 2c] = 55 ( 37)
All the source files related to the multi_core_multi_ram_ip can be found in the
multi_core_multi_ram_ip folder.
Fig. 11.13 Two CPUs and two RAMs interconnected with an AXI Interconnect IP
The second design connects a set of CPUs and a set of RAM blocks. They com-
municate through the AXI Interconnect IP. Each CPU accesses its local memory
bank (direct connection) and also any of the other memory banks (through the AXI
interconnect).
Figure 11.13 shows the Vivado design. The design shown includes two CPUs but
can be extended up to the limits of the AXI Interconnect IP (i.e. up to 16 slaves and
16 masters on a single AXI Interconnect IP).
Each CPU is an AXI slave to receive its input data from the Zynq IP. It is also an
AXI master to access non local memory banks. So, the AXI Interconnect IP has four
master ports and three slave ports (2n master ports and n + 1 slave ports for n CPUs
and n memory banks). The master ports are on the right side of the AXI interconnect
IP and the slave ports are on the upper part of the left side.
The code in the multi_core_multi_ram_ip.cpp file which defines the CPUs is shown
in Listing 11.4.
The top function receives its identity (ip_num) through the axilite interface. It
has an access to its local part of the shared memory (local_ram). It also has an AXI
master port to access the full shared memory (data_ram with an m_axi interface).
The Initiation Interval has been set to 10 to avoid overlapping ("#pragma HLS
PIPELINE II=10").
Every even CPU cycle, the top function writes to the local memory bank
("local_ram[local_address] = local_value") and to the memory of the CPU next
to it ("data_ram [global_address] = global_value").
Every odd cycle, the CPU regenerates local_value and global_value from what
was written in the preceding even cycle (which proves the write is effective).
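A minimal sketch of this alternation, under the assumption of a simple cycle loop (names like NB_TEST_CYCLE, local_address, and global_address are illustrative, not the book's code):
for (unsigned int cycle = 0; cycle < NB_TEST_CYCLE; cycle++) {
#pragma HLS PIPELINE II=10
  if ((cycle & 1) == 0) {
    // even cycle: write locally and to the neighbour CPU's bank
    local_ram[local_address]  = local_value;
    data_ram [global_address] = global_value; // through the AXI interconnect
  } else {
    // odd cycle: read back what was written, proving the writes landed
    local_value  = local_ram[local_address];
    global_value = data_ram [global_address];
  }
}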
The header file contains the definitions of the constants. It is shown in Listing 11.5.
Listing 11.5 The multi_core_multi_ram_ip.h file
# include " ap_int .h"
# define LOG_NB_RAM 1 // 2^ L O G _ N B _ R A M ram b l o c k s
# define LOG_NB_IP LOG_NB_RAM
When the design is reduced to two CPUs (i.e. LOG_NB_IP is set to 1), the first
one writes to bank 0 (local access) and to bank 1 (AXI access) and the second one
accesses bank 1 (local access) and bank 0 (AXI access), as shown on the testbench
code in Listing 11.6.
Listing 11.6 The testbench_multi_core_multi_ram_ip.cpp file
# include " multi_core_multi_ram_ip .h"
int ram [ R A M _ S I Z E ];
int * ram0 = ram ;
int * ram1 = & ram [ L O C A L _ R A M _ S I Z E ];
void m u l t i _ c o r e _ m u l t i _ r a m _ i p (
int ip_num ,
int l o c a l _ r a m [ L O C A L _ R A M _ S I Z E ] ,
int d a t a _ r a m [ R A M _ S I Z E ]
);
int main () {
m u l t i _ c o r e _ m u l t i _ r a m _ i p (0 , ram , ram ) ;
m u l t i _ c o r e _ m u l t i _ r a m _ i p (1 , & ram [ L O C A L _ R A M _ S I Z E ] , ram ) ;
p r i n t f ( " ram0 dump \ n " ) ;
for ( int i =0; i < L O C A L _ R A M _ S I Z E ; i ++) {
if ( ram0 [ i ]!=0)
p r i n t f ( " ram0 [%4 d ] = %2 d \ n " , 4* i , ram0 [ i ]) ;
}
p r i n t f ( " ram1 dump \ n " ) ;
for ( int i =0; i < L O C A L _ R A M _ S I Z E ; i ++) {
if ( ram1 [ i ]!=0)
p r i n t f ( " ram1 [%4 d ] = %2 d \ n " , 4* i , ram1 [ i ]) ;
}
r e t u r n 0;
}
Experimentation
To simulate the multi_core_multi_ram_ip, operate as explained in Sect. 5.3.6,
replacing fetching_ip with multi_core_multi_ram_ip.
The testbench_multi_core_multi_ram_ip.cpp program runs a test to check the
possibility for the two cores to access the two memory blocks.
11.4.1 Simulation
The simulation does not behave like the run on the FPGA. On the FPGA, the two
CPUs are run in parallel. Global accesses from one CPU are done while the other
CPU is running. In the simulation, the first CPU is fully run before the second CPU
starts its own run. In the multi_core_multi_ram_ip example, there is no difference
in the output though.
The testbench prints what is shown in Listing 11.7.
Listing 11.7 The testbench print
ram0 dump
ram0[   0] = 18
ram0[   4] = 18
ram0[   8] = 18
ram0[  12] = 18
ram0[  16] = 18
ram0[  20] = 18
ram0[  24] = 18
ram0[  28] = 18
ram0[  32] = 19
ram0[  36] = 19
ram0[  40] = 19
ram0[  44] = 19
ram0[  48] = 19
ram0[  52] = 19
ram0[  56] = 19
ram0[  60] = 19
ram1 dump
ram1[   0] = 18
ram1[   4] = 18
ram1[   8] = 18
ram1[  12] = 18
ram1[  16] = 18
ram1[  20] = 18
ram1[  24] = 18
ram1[  28] = 18
ram1[  32] = 19
ram1[  36] = 19
ram1[  40] = 19
ram1[  44] = 19
ram1[  48] = 19
ram1[  52] = 19
ram1[  56] = 19
ram1[  60] = 19
11.4.2 Synthesis
The synthesis report in Fig. 11.14 shows that the II interval is 10 and the iteration
latency is 10 FPGA cycles, because of the external memory access duration through
the AXI interconnect.
The Schedule Viewer (see Fig. 11.15) shows that the local memory write takes one
FPGA cycle (local_ram_addr(getelementptr) and local_ram_addr_write_ln30(write)
at cycle 1).
The local memory read takes two cycles (local_value_1(read)).
Fig. 11.14 Synthesis analysis for the multi core multi RAM design
To build the design in Vivado, place on the Diagram frame the Zynq7 Processing
System IP, Run Block Automation, add the AXI Interconnect IP and Run Connec-
tion Automation to automatically add and connect the Processor System Reset IP.
You obtain the Diagram shown in Fig. 11.16.
Then, you must add to the Diagram frame two multi_core_multi_ram IPs, two
AXI BRAM Controller IPs and two Block Memory Generator IPs, as shown in
Fig. 11.17.
The AXI Interconnect IP must be customized to offer four master ports and three
slave ports (select the IP and right click, then customize block). Set the number of
slave interfaces to 3 and the number of master interfaces to 4.
The two Block Memory Generator IPs should also be customized. Set the Mem-
ory Type to True Dual Port RAM.
The two AXI BRAM Controller IPs are also customized. Set the AXI Protocol to
AXI4LITE and the Number of BRAM Interfaces to 1.
The updated Diagram frame is shown in Fig. 11.18.
The next step is to connect the slave AXI Interconnect links as shown in Fig. 11.19.
The next step is to connect the master AXI Interconnect links as shown in
Fig. 11.20.
The next step is to connect the Block Memory Generator IPs links as shown in
Fig. 11.21.
Lastly, you can let the automatic system finish the wiring (Run Connection
Automation) to obtain the design shown in Fig. 11.13.
The components connected to the AXI interconnect must be placed in the memory
mapped address space. Each component is assigned to a base address. You do this
by opening the Address Editor frame as shown in Fig. 11.22.
Select all the unassigned lines and proceed to their default assignment (right-click
on the line and select Assign; you can select all the lines and assign them globally).
You obtain the default assignment as shown in Fig. 11.23.
You may exclude the unused address spaces.
For example, the multi_core_multi_ram_0 IP does not need an external (AXI)
access to its own internal code memory, nor to the multi_core_multi_ram_1 IP
code memory.
In the multi_core_multi_ram_0 IP address space, right-click on the
multi_core_multi_ram_0 and multi_core_multi_ram_1 lines and select Exclude;
do the same in the multi_core_multi_ram_1 IP address space.
Experimentation
To run the multi_core_multi_ram_ip on the development board, proceed as ex-
plained in Sect. 5.3.10, replacing fetching_ip with multi_core_multi_ram_ip.
The execution on the FPGA prints the same lines as in the simulation.
12 A Multicore RISC-V Processor
Abstract
This chapter will make you build your first multicore RISC-V CPU. The proces-
sor is built from multiple IPs, each being a copy of the multicycle_pipeline_ip
presented in Chap. 9. Each core has its own code and data memories. The data
memory banks are interconnected with an AXI interconnect IP. An example of a
parallelized matrix multiplication is used to measure the speedup when increasing
the number of cores from one to eight.
All the source files related to the multicore_multicycle_ip can be found in the
multicore_multicycle_ip folder.
Figure 12.1 presents the design of a 4-core IP. Each core has a local internal code
memory, filled with RISC-V code by the Zynq through the AXI interconnect.
The cores implement the multicycle pipeline design presented in Chap. 9.
Each core has a direct access to an external data memory bank (four cores, four
data memory banks). It also has an indirect access to the other data banks through
the AXI interconnect.
The Zynq also has access to the data memory banks, either to initialize arguments
before run or to dump results after.
The memory model is the one described in Sect. 10.2, with cores instead of harts.
The codes presented in the chapter describe the implementation of any one of the
multiple cores composing the processor, not the whole processor.
Listing 12.1 The transformed multicycle_pipeline_ip top function for a multicore design
void multicycle_pipeline_ip(
  unsigned int ip_num,
  unsigned int start_pc,
  unsigned int ip_code_ram[IP_CODE_RAM_SIZE],
  int          ip_data_ram[IP_DATA_RAM_SIZE],
  int          data_ram[NB_IP][IP_DATA_RAM_SIZE],
  unsigned int *nb_instruction,
  unsigned int *nb_cycle) {
#pragma HLS INTERFACE s_axilite port=ip_num
#pragma HLS INTERFACE s_axilite port=start_pc
#pragma HLS INTERFACE s_axilite port=ip_code_ram
#pragma HLS INTERFACE bram      port=ip_data_ram
#pragma HLS INTERFACE m_axi     port=data_ram offset=slave
#pragma HLS INTERFACE s_axilite port=nb_instruction
#pragma HLS INTERFACE s_axilite port=nb_cycle
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INLINE recursive
  ...
For the meaning of the m_axi and bram interfaces, please refer back to Sect. 11.2.1.
The memory arrays have sizes depending on the number of IPs (see Listing 12.2),
to keep the total memory within a 512 KB limit (to avoid going beyond the
540 KB available on the FPGA).
For a 2-core IP, each core has a 128 KB (32K instructions) code memory and a
128 KB (32K words) data memory. The sizes are 64 KB + 64 KB for a 4-core IP and
32 KB + 32 KB for an 8-core IP.
Listing 12.2 The size of the code and data memories defined in the multicycle_pipeline_ip.h file
...
#define LOG_NB_IP            1
#define NB_IP                (1 << LOG_NB_IP)
#define LOG_CODE_RAM_SIZE    16
#define CODE_RAM_SIZE        (1 << LOG_CODE_RAM_SIZE)
#define LOG_DATA_RAM_SIZE    16
#define DATA_RAM_SIZE        (1 << LOG_DATA_RAM_SIZE)
#define LOG_IP_CODE_RAM_SIZE (LOG_CODE_RAM_SIZE - LOG_NB_IP) // in words
#define IP_CODE_RAM_SIZE     (1 << LOG_IP_CODE_RAM_SIZE)
#define LOG_IP_DATA_RAM_SIZE (LOG_DATA_RAM_SIZE - LOG_NB_IP) // in words
#define IP_DATA_RAM_SIZE     (1 << LOG_IP_DATA_RAM_SIZE)
...
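This arithmetic can be checked with a quick standalone program (a sketch, not part of the IP sources): doubling the number of cores halves each per-core bank while the total stays at 512 KB.

#include <stdio.h>
/* checks the size macros: per-core words = 1 << (16 - LOG_NB_IP) */
int main(void) {
  for (int log_nb_ip = 1; log_nb_ip <= 3; log_nb_ip++) {
    int nb_ip    = 1 << log_nb_ip;
    int ip_words = 1 << (16 - log_nb_ip); /* words per core (code or data) */
    int ip_kb    = ip_words * 4 / 1024;   /* kilobytes per core            */
    printf("%d cores: %d KB code + %d KB data per core, %d KB in total\n",
           nb_ip, ip_kb, ip_kb, nb_ip * 2 * ip_kb);
  }
  return 0;
}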
Like in the multihart_ip design, the IP top function local declarations (see Listing
12.3) add _from_ variables to those of the multicycle_pipeline_ip.
For example, the f_to_d variable has a matching d_from_f variable.
Listing 12.3 The top function declarations of _from_ and _to_ variables
...
  int reg_file[NB_REGISTER];
#pragma HLS ARRAY_PARTITION variable=reg_file dim=1 complete
  bit_t is_reg_computed[NB_REGISTER];
#pragma HLS ARRAY_PARTITION variable=is_reg_computed dim=1 complete
  from_f_to_f_t f_from_f;
  from_d_to_f_t f_from_d;
  from_e_to_f_t f_from_e;
  from_f_to_f_t f_to_f;
  from_f_to_d_t f_to_d;
  from_f_to_d_t d_from_f;
  from_d_to_f_t d_to_f;
  from_d_to_i_t d_to_i;
  from_d_to_i_t i_from_d;
  bit_t         i_wait;
  i_safe_t      i_safe;
  from_i_to_e_t i_to_e;
  from_i_to_e_t e_from_i;
  from_e_to_f_t e_to_f;
  from_e_to_m_t e_to_m;
  from_e_to_m_t m_from_e;
  from_m_to_w_t m_to_w;
  from_m_to_w_t w_from_m;
  bit_t         is_running;
  counter_t     nbi;
  counter_t     nbc;
...
Listing 12.4 shows the top function initializations. The inter-stage link valid bits
are all cleared except the f_to_f one. The main loop starts as if the fetch stage had
sent the start_pc to itself.
Listing 12.4 The top function initializations
...
  init_reg_file(ip_num, reg_file, is_reg_computed);
  f_to_f.is_valid = 1;
  f_to_f.next_pc  = start_pc;
  f_to_d.is_valid = 0;
  d_to_f.is_valid = 0;
  d_to_i.is_valid = 0;
  i_to_e.is_valid = 0;
  e_to_f.is_valid = 0;
  e_to_m.is_valid = 0;
  m_to_w.is_valid = 0;
  i_wait          = 0;
  i_safe.is_full  = 0;
  nbi             = 0;
  nbc             = 0;
...
The do ... while loop (see Listing 12.5) is slightly modified to pass the
new ip_num, ip_code_ram, ip_data_ram, and data_ram arguments to the fetch,
mem_access, and write_back function calls.
As in the multihart_ip design, the calls are ordered from the fetch stage to the
writeback stage. This ordering ensures that the synthesis is able to meet the II=2
constraint.
Listing 12.5 The do ... while loop
...
  do {
#pragma HLS PIPELINE II=2
#ifndef __SYNTHESIS__
#ifdef DEBUG_PIPELINE
    printf("=============================================\n");
    printf("cycle %d\n", (int)nbc);
#endif
#endif
    new_cycle(f_to_f, d_to_f, e_to_f, f_to_d, d_to_i, i_to_e,
              e_to_m, m_to_w, &f_from_f, &f_from_d, &f_from_e,
              &d_from_f, &i_from_d, &e_from_i, &m_from_e,
              &w_from_m);
    fetch(f_from_f, f_from_d, f_from_e, i_wait, ip_code_ram,
          &f_to_f, &f_to_d);
    decode(d_from_f, i_wait, &d_to_f, &d_to_i);
    issue(i_from_d, reg_file, is_reg_computed, &i_safe,
          &i_to_e, &i_wait);
    execute(
#ifndef __SYNTHESIS__
#ifdef DEBUG_PIPELINE
      ip_num,
#endif
#endif
      e_from_i,
#ifndef __SYNTHESIS__
#ifdef DEBUG_PIPELINE
      reg_file,
#endif
#endif
      &e_to_f, &e_to_m);
    mem_access(ip_num, m_from_e, ip_data_ram, data_ram, &m_to_w);
    write_back(
#ifndef __SYNTHESIS__
      ip_num,
#endif
      w_from_m, reg_file, is_reg_computed);
    statistic_update(w_from_m, &nbi, &nbc);
    running_cond_update(w_from_m, &is_running);
  } while (is_running);
...
The register file initialization (the init_reg_file function defined in the
multicycle_pipeline_ip.cpp file; see Listing 12.6) sets register a0 (x10) to the core
identification number received as the first argument.
Moreover, it initializes the sp register of each core with the address of the first
word of the next core's memory bank. When places are allocated on the stack by
decrementing the sp register, the locations fall within the core's own memory bank. As a
result, each core has its own local stack.
Listing 12.6 The init_reg_file function
// a0/x10 is set with the IP number
static void init_reg_file(
  ip_num_t ip_num,
  int      *reg_file,
  bit_t    *is_reg_computed) {
  reg_num_p1_t r;
  for (r = 0; r < NB_REGISTER; r++) {
#pragma HLS UNROLL
    is_reg_computed[r] = 0;
    if (r == 10)
      reg_file[r] = ip_num;
    else if (r == SP)
      // the next core's bank base, so that pushes stay in this core's bank
      reg_file[r] = ((int)(ip_num + 1)) << (LOG_IP_DATA_RAM_SIZE + 2);
    else
      reg_file[r] = 0;
  }
}
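To check the initial stack pointers against the register dumps shown in Sect. 12.2, a small standalone program can reproduce them (a sketch, assuming a 2-core IP, i.e. LOG_IP_DATA_RAM_SIZE = 16 - 1 = 15):

#include <stdio.h>
/* prints the per-core initial sp: one word past the core's own bank */
int main(void) {
  int log_ip_data_ram_size = 15; /* 2-core IP */
  for (int ip_num = 0; ip_num < 2; ip_num++)
    printf("core %d: sp = 0x%x\n", ip_num,
           (ip_num + 1) << (log_ip_data_ram_size + 2));
  return 0; /* prints 0x20000 and 0x40000, matching the dumps */
}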
The fetch (in the fetch.cpp file), decode (in the decode.cpp file), issue (in the
issue.cpp file), execute (in the execute.cpp file), and write_back (in the wb.cpp file)
functions are mostly unchanged from the multicycle_pipeline_ip implementation.
The mem_access function (in the mem_access.cpp file) is shown in Listing 12.7.
As the memory is partitioned, the mem_access function determines, from the
access address, whether the access is local (is_local) and, if not, which partition is
accessed (accessed_ip).
The mem_access function calls the stage_job function and the set_output_to_w
function (both defined in the same file).
Listing 12.7 The mem_access function
void mem_access(
  ip_num_t      ip_num,
  from_e_to_m_t m_from_e,
  int           *ip_data_ram,
  int           data_ram[][IP_DATA_RAM_SIZE],
  from_m_to_w_t *m_to_w) {
  int      value;
  ip_num_t accessed_ip;
  bit_t    is_local;
  if (m_from_e.is_valid) {
    value       = m_from_e.value;
    accessed_ip =
      (m_from_e.address >> (LOG_IP_DATA_RAM_SIZE + 2)) + ip_num;
    is_local    = (ip_num == accessed_ip);
    stage_job(accessed_ip, is_local, m_from_e.is_load,
              m_from_e.is_store, m_from_e.address,
              m_from_e.func3, ip_data_ram, data_ram, &value);
#ifndef __SYNTHESIS__
#ifdef DEBUG_PIPELINE
    printf("mem ");
    printf("%04d\n", (int)(m_from_e.pc << 2));
#endif
#endif
    set_output_to_w(m_from_e.rd, m_from_e.has_no_dest,
                    m_from_e.is_load, m_from_e.is_ret,
                    m_from_e.value, value,
#ifndef __SYNTHESIS__
                    m_from_e.pc, m_from_e.instruction,
                    m_from_e.d_i, m_from_e.target_pc,
#endif
                    m_to_w);
  }
  m_to_w->is_valid = m_from_e.is_valid;
}
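A worked example of the accessed_ip computation (a sketch under the assumption of a 2-core IP, where the ip_num_t type wraps modulo NB_IP):

#include <stdio.h>
/* partition arithmetic for a 2-core IP: bank size = 1 << (15 + 2) bytes */
int main(void) {
  unsigned shift = 15 + 2;
  /* core 0 storing to 0x2c: (0x2c >> 17) + 0 = 0 -> local access */
  printf("0x2c from core 0    -> ip %u\n", ((0x2cu    >> shift) + 0) % 2);
  /* core 0 loading 0x20028: (0x20028 >> 17) + 0 = 1 -> remote access */
  printf("0x20028 from core 0 -> ip %u\n", ((0x20028u >> shift) + 0) % 2);
  return 0;
}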
The stage_job function (see Listing 12.8) calls either mem_load or mem_store
according to the memory access type.
Listing 12.8 The stage_job function in the mem_access.cpp file
static void stage_job(
  ip_num_t         accessed_ip,
  bit_t            is_local,
  bit_t            is_load,
  bit_t            is_store,
  b_data_address_t address,
  func3_t          func3,
  int              *ip_data_ram,
  int              data_ram[][IP_DATA_RAM_SIZE],
  int              *value) {
  if (is_load)
    *value =
      mem_load(accessed_ip, is_local,
               ip_data_ram, data_ram, address, func3);
  else if (is_store)
    mem_store(accessed_ip, is_local,
              ip_data_ram, data_ram, address, *value,
              (ap_uint<2>)func3);
}
The memory load and store functions (defined in the mem.cpp file) are unchanged
except for an if statement on the is_local condition that selects either the local memory
(ip_data_ram) or the global one (data_ram).
In the mem_load function (see Listing 12.9), the addressed word is fully read
from ip_data_ram (local access) or from data_ram (global access). After the access,
the mem_load function selects the accessed bytes. The code is similar to the one
shown in Listing 12.11.
The local access takes one processor cycle and the global access takes five. Hence,
the variable w is filled after a variable latency.
To serialize a load after a store, the pipeline would have to be frozen with a wait
condition sent from the memory access stage, similar to the wait condition
sent by the issue stage when a locked source is detected.
However, this would degrade the CPI for every remote store just to accommodate
unoptimized code (a load right after a store to the same address is useless).
Nevertheless, such a store/load succession at the same address does occur,
for example when compiling at optimization level 0 ("-O0"). The compiler
does produce such unoptimized code, but the address accessed by the store and load
pair is then local (i.e. on the stack), not remote.
I decided to apply the same relaxed scheduling policy as for the issue stage
(refer back to Sect. 9.3.3). The programmer is warned not to place a load
back-to-back with a store to the same external address (there should be at least
four instructions in between to ensure correct scheduling).
If strict scheduling is wished, the memory access stage should be equipped with
safe and wait signals, like the issue stage. The wait should be raised when a
remote store is processed and maintained during four processor cycles (use
a counter to control this). The memory stage wait condition should be added to all
the preceding stages (fetch, decode, issue, and execute).
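A minimal sketch of such a counter, assuming hypothetical m_wait and m_wait_cnt variables added to the top function locals (this is not the book's code; it only illustrates the mechanism):

/* to be declared with the other top function locals (hypothetical names):
   bit_t      m_wait;       // memory stage wait signal
   ap_uint<3> m_wait_cnt;   // processor cycles left before release     */
if (m_from_e.is_valid && m_from_e.is_store && !is_local) {
  m_wait     = 1;           /* freeze fetch, decode, issue, and execute */
  m_wait_cnt = 4;           /* remote store duration, processor cycles  */
} else if (m_wait) {
  m_wait_cnt--;
  if (m_wait_cnt == 0)
    m_wait = 0;             /* release the pipeline */
}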
In the mem_store function (see Listing 12.11), the access is done according to
the size encoded in the memory access instruction. The data_ram and ip_data_ram
word pointers are cast into either char (i.e. byte) or short (i.e. half word) pointers.
The access address (i.e. an offset in the selected ram) is likewise turned into a char (a),
short (a1), or word (a2) displacement, with an added IP offset if the access is not
local.
Listing 12.11 The mem_store function
void mem_store(
  ip_num_t         ip,
  bit_t            is_local,
  int              *ip_data_ram,
  int              data_ram[][IP_DATA_RAM_SIZE],
  b_data_address_t address,
  int              rv2,
  ap_uint<2>       msize) {
  b_ip_data_address_t a  = address;
  h_ip_data_address_t a1 = address >> 1;
  w_ip_data_address_t a2 = address >> 2;
  char  rv2_0  = rv2;
  short rv2_01 = rv2;
  switch (msize) {
    case SB:
      if (is_local)
        *((char *)(ip_data_ram) + a) = rv2_0;
      else
        *((char *)(data_ram) +
          (((b_data_address_t)ip) <<
           (LOG_IP_DATA_RAM_SIZE + 2)) + a) = rv2_0;
      break;
    case SH:
      if (is_local)
        *((short *)(ip_data_ram) + a1) = rv2_01;
      else
        *((short *)(data_ram) +
          (((h_data_address_t)ip) <<
           (LOG_IP_DATA_RAM_SIZE + 1)) + a1) = rv2_01;
      break;
    case SW:
      if (is_local)
        ip_data_ram[a2] = rv2;
      else
        data_ram[ip][a2] = rv2;
      break;
    case 3:
      break;
  }
}
Experimentation
To simulate the multicore_multicycle_ip, operate as explained in Sect. 5.3.6,
replacing fetching_ip with multicore_multicycle_ip. There are two testbench
programs: testbench_seq_multicore_multicycle_ip.cpp to run independent codes
(one per core) and testbench_par_multicore_multicycle_ip.cpp to run a parallel
sum of the elements of an array.
With testbench_seq_multicore_multicycle_ip.cpp you can play with the simu-
lator, replacing the included test_mem_0_text.hex file with any other .hex file
you find in the same folder. You can also vary the number of cores.
Like in the multihart_ip project, two testbench files are provided, one to run fully
independent codes (testbench_seq_multicore_multicycle_ip.cpp) and the other to
run codes sharing a distributed array (testbench_par_multicore_multicycle_ip.cpp).
Once the hex files have been built, they can be used to initialize the code_ram
arrays. In the testbench example shown in Listing 12.13, the code_ram array is
initialized by including the test_mem_0_text.hex file.
To run the test_mem.s code on the Vitis_HLS simulation of two cores, you must
first set LOG_NB_IP to 1 in the multicycle_pipeline_ip.h file. Then, you must build
the .hex files by running "./build_seq.sh test_mem". Eventually, you can start the
simulation.
For the run of two copies of test_mem.s, the output is composed of the list of
the instructions run on the first core followed by its final register file state, and the list
of the instructions run on the second core with its final register file state (see Listing
12.14).
Listing 12.14 The output of the main function of the testbench_seq_multicore_multicycle_ip.cpp file: the code run and the final register file state
0000: 00000513 li a0, 0
a0 = 0 (0)
0004: 00000593 li a1, 0
a1 = 0 (0)
...
0056: 00a62223 sw a0, 4(a2)
m[   2c] = 55 (37)
0060: 00008067 ret
pc = 0 (0)
...
sp = 131072 (20000)
...
a0 = 55 (37)
...
a2 = 40 (28)
a3 = 40 (28)
a4 = 10 (a)
...
0000: 00000513 li a0, 0
a0 = 0 (0)
0004: 00000593 li a1, 0
a1 = 0 (0)
...
0056: 00a62223 sw a0, 4(a2)
m[2002c] = 55 (37)
0060: 00008067 ret
pc = 0 (0)
...
sp = 262144 (40000)
...
a0 = 55 (37)
...
a2 = 40 (28)
a3 = 40 (28)
a4 = 10 (a)
...
After that, the run outputs for each core the number of instructions run, the number
of cycles of the run, and the dump of the data memory bank (non null values) (see
Listing 12.15).
Listing 12.15 The output of the main function of the testbench_seq_multicore_multicycle_ip.cpp file: the memory dump
core 0: 88 fetched and decoded instructions in 279 cycles (ipc = 0.32)
data memory dump (non null words)
m[    0] = 1 (1)
m[    4] = 2 (2)
m[    8] = 3 (3)
m[    c] = 4 (4)
m[   10] = 5 (5)
m[   14] = 6 (6)
m[   18] = 7 (7)
m[   1c] = 8 (8)
m[   20] = 9 (9)
m[   24] = 10 (a)
m[   2c] = 55 (37)
core 1: 88 fetched and decoded instructions in 279 cycles (ipc = 0.32)
data memory dump (non null words)
m[20000] = 1 (1)
m[20004] = 2 (2)
m[20008] = 3 (3)
m[2000c] = 4 (4)
m[20010] = 5 (5)
m[20014] = 6 (6)
m[20018] = 7 (7)
m[2001c] = 8 (8)
m[20020] = 9 (9)
m[20024] = 10 (a)
m[2002c] = 55 (37)
As I already mentioned in the last chapter, the Vitis_HLS simulation does not
work like the run on the FPGA. The simulation runs the core IPs sequentially, i.e.
core 0 is fully run before core 1 starts. If the code run on core 0 reads a
memory word written by the code run on core 1, the read misses the written value
in the simulation (the write is done after the read), but not necessarily on the
FPGA (the write may in fact be done before the read when the two cores run
simultaneously).
This is a general limitation of the simulation of multiple IPs in the Vitis_HLS tool.
To keep the simulation identical to the run on the FPGA, I have organized the
testbench to run the writing IPs before the reading one. All the cores except core 0
are simulated first. At the end of each simulation, the result is written to the shared
data_ram array.
Then, core 0 is simulated. During its RISC-V program run, remote accesses
read the data_ram values written by the other cores.
It should be noticed that this solution does not work when there are multiple RAW
dependencies, some from core 0 to core 1 and others from core 1 to core 0. In this
case, there is no general solution: you have to blindly try your SoC IP directly on
the FPGA without any preliminary simulation check.
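The resulting call order is sketched below, under the assumption that the Chap. 12 parallel testbench mirrors the Chap. 13 one shown later (the names code_ram_0 and code_ram hold core 0's code and the other cores' code, respectively; the actual file may differ in details):

for (int i = 1; i < NB_IP; i++)            /* writer cores first */
  multicycle_pipeline_ip(i, 0, code_ram, data_ram[i],
                         data_ram, &nbi[i], &nbc[i]);
multicycle_pipeline_ip(0, 0, code_ram_0,   /* reader core 0 last */
                       data_ram[0], data_ram, &nbi[0], &nbc[0]);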
The RISC-V code to be run on core 0 successively fills a 10-element array (loop
.L1 in Listing 10.46), sums the elements (loop .L2 in Listing 10.47), saves the sum
to memory, gathers and accumulates the sums computed by the other cores (loop .L3
in Listing 12.18), and saves the final sum to memory.
The lw a3,40(a4) load at label .L3 is a remote access (see Listing 12.8). The beq
branch after the load is a safeguard to ensure that core 0 waits until the other cores
have dumped their local sums to their local memories (the loaded memory word
is then non null).
Listing 12.18 The .L3 loop
...
.L3: lw   a3, 40(a4)    /* a3 = t[a4 + 40]        */
     beq  a3, zero, .L3 /* if (a3 == 0) goto .L3  */
     add  a0, a0, a3    /* a0 += a3               */
     add  a4, a4, a5    /* a4 += a5               */
     addi a1, a1, 1     /* a1++                   */
     bne  a1, a2, .L3   /* if (a1 != a2) goto .L3 */
...
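A C rendering of this gather loop may help read the assembly (an illustrative sketch, not the book's source; t points to the data memory seen as an int array and the variable names mirror the registers):

/* gather the other cores' partial sums, spinning until each is non null */
static int gather(volatile int *t, int a0, int a1, int a2, int a4, int a5) {
  do {
    int a3;
    do {                        /* lw + beq: spin until the remote */
      a3 = t[a4 / 4 + 10];      /* partial sum becomes non null    */
    } while (a3 == 0);
    a0 += a3;                   /* add the remote partial sum      */
    a4 += a5;                   /* step to the next core's bank    */
    a1++;
  } while (a1 != a2);
  return a0;
}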
...
sp = 131072 (20000)
...
a0 = 210 (d2)
a1 = 2 (2)
a2 = 44 (2c)
a3 = 155 (9b)
a4 = 262144 (40000)
a5 = 131072 (20000)
...
Then (see Listing 12.20) the run dumps the memory banks (the sum of the 20 first
integers is 210).
Listing 12.20 The output of the main function of the testbench_par_multicore_multicycle_ip.cpp file: the memory dump
...
core 0: 101 fetched and decoded instructions in 273 cycles (ipc = 0.37)
data memory dump (non null words)
m[    0] = 1 (1)
m[    4] = 2 (2)
m[    8] = 3 (3)
m[    c] = 4 (4)
m[   10] = 5 (5)
m[   14] = 6 (6)
m[   18] = 7 (7)
m[   1c] = 8 (8)
m[   20] = 9 (9)
m[   24] = 10 (a)
m[   28] = 55 (37)
m[   2c] = 210 (d2)
core 1: 90 fetched and decoded instructions in 243 cycles (ipc = 0.37)
data memory dump (non null words)
m[20000] = 11 (b)
m[20004] = 12 (c)
m[20008] = 13 (d)
m[2000c] = 14 (e)
m[20010] = 15 (f)
m[20014] = 16 (10)
m[20018] = 17 (11)
m[2001c] = 18 (12)
m[20020] = 19 (13)
m[20024] = 20 (14)
m[20028] = 155 (9b)
Figure 12.2 shows that the II=2 constraint is satisfied. The iteration latency, set by
the global memory access, is 13 FPGA cycles. The number of IPs declared in
the multicycle_pipeline_ip.h file is two (notice that the reported resource usage is
for one IP, not for two).
For two cores, the s_axi_control range is 256 KB and the addresses are 0x4004_0000
and 0x4008_0000. For four cores, the range is 128 KB and the addresses are
0x4004_0000, 0x4006_0000, 0x4008_0000, and 0x400A_0000. For eight cores,
the range is 64 KB and the addresses are 0x4004_0000, 0x4005_0000, ..., and
0x400B_0000.
The Vivado bitstream generation for a 2-core processor produces the implemen-
tation report in Fig. 12.5, showing that it uses 11,962 LUTs (22.48%; the 4-core
processor uses 22,155 LUTs; the 8-core processor uses 43,731 LUTs).
Experimentation
To run the multicore_multicycle_ip on the development board, proceed as explained
in Sect. 5.3.10, replacing fetching_ip with multicore_multicycle_ip.
There are two drivers: helloworld_seq.c, with which you can run independent
programs on each core, and helloworld_par.c, with which you can run a parallel sum of
the elements of an array.
The code to drive the FPGA to run the eight RISC-V test programs (test_branch,
test_jal_jalr, test_load_store, test_lui_auipc, test_mem, test_op, test_op_imm, and
test_sum) is shown in Listing 12.21. You must update the paths of the included
code files; use the update_helloworld.sh shell script provided in the same folder.
Listing 12.21 The helloworld_seq.c file driving the multicore_multicycle_ip
#include <stdio.h>
#include "xmulticycle_pipeline_ip.h"
#include "xparameters.h"
#define LOG_NB_IP            1
#define NB_IP                (1 << LOG_NB_IP)
#define LOG_IP_CODE_RAM_SIZE (16 - LOG_NB_IP) // in words
#define IP_CODE_RAM_SIZE     (1 << LOG_IP_CODE_RAM_SIZE)
#define LOG_IP_DATA_RAM_SIZE (16 - LOG_NB_IP) // in words
#define IP_DATA_RAM_SIZE     (1 << LOG_IP_DATA_RAM_SIZE)
#define DATA_RAM             0x40000000
int *data_ram = (int *)DATA_RAM;
XMulticycle_pipeline_ip_Config *cfg_ptr[NB_IP];
XMulticycle_pipeline_ip        ip[NB_IP];
word_type code_ram[IP_CODE_RAM_SIZE] = {
#include "test_mem_text.hex"
};
int main() {
  unsigned int nbi[NB_IP];
  unsigned int nbc[NB_IP];
  int          w;
  for (int i = 0; i < NB_IP; i++) {
    cfg_ptr[i] = XMulticycle_pipeline_ip_LookupConfig(i);
    XMulticycle_pipeline_ip_CfgInitialize(&ip[i], cfg_ptr[i]);
    XMulticycle_pipeline_ip_Set_ip_num(&ip[i], i);
    XMulticycle_pipeline_ip_Set_start_pc(&ip[i], 0);
    XMulticycle_pipeline_ip_Write_ip_code_ram_Words(&ip[i], 0,
      code_ram, IP_CODE_RAM_SIZE);
    XMulticycle_pipeline_ip_Set_data_ram(&ip[i], DATA_RAM);
  }
For LOG_NB_IP = 1 (i.e. two cores), the run of the RISC-V code in the test_mem.h
file should print on the putty window what is shown in Listing 12.22.
Listing 12.22 The helloworld_seq.c prints
core 0: 88 fetched and decoded instructions in 279 cycles (ipc = 0.32)
data memory dump (non null words)
m[    0] = 1 (1)
m[    4] = 2 (2)
m[    8] = 3 (3)
m[    c] = 4 (4)
m[   10] = 5 (5)
m[   14] = 6 (6)
m[   18] = 7 (7)
m[   1c] = 8 (8)
m[   20] = 9 (9)
m[   24] = 10 (a)
m[   2c] = 55 (37)
core 1: 88 fetched and decoded instructions in 279 cycles (ipc = 0.32)
data memory dump (non null words)
m[20000] = 1 (1)
m[20004] = 2 (2)
m[20008] = 3 (3)
m[2000c] = 4 (4)
m[20010] = 5 (5)
m[20014] = 6 (6)
m[20018] = 7 (7)
m[2001c] = 8 (8)
m[20020] = 9 (9)
m[20024] = 10 (a)
m[2002c] = 55 (37)
The code to drive the FPGA to run the distributed sum is shown in Listing 12.23
(you must update the paths of the included code files; use the update_helloworld.sh
shell script provided in the same folder).
Listing 12.23 The helloworld_par.c file driving the multicore_multicycle_ip
#include <stdio.h>
#include "xmulticycle_pipeline_ip.h"
#include "xparameters.h"
#define LOG_NB_IP            1
#define NB_IP                (1 << LOG_NB_IP)
#define LOG_IP_CODE_RAM_SIZE (16 - LOG_NB_IP) // in words
#define IP_CODE_RAM_SIZE     (1 << LOG_IP_CODE_RAM_SIZE)
#define LOG_IP_DATA_RAM_SIZE (16 - LOG_NB_IP) // in words
#define IP_DATA_RAM_SIZE     (1 << LOG_IP_DATA_RAM_SIZE)
#define LOG_DATA_RAM_SIZE    16
#define DATA_RAM_SIZE        (1 << LOG_DATA_RAM_SIZE)
#define DATA_RAM             0x40000000
int *data_ram = (int *)DATA_RAM;
XMulticycle_pipeline_ip_Config *cfg_ptr[NB_IP];
XMulticycle_pipeline_ip        ip[NB_IP];
word_type code_ram_0[IP_CODE_RAM_SIZE] = {
#include "test_mem_par_ip0_text.hex"
};
word_type code_ram[IP_CODE_RAM_SIZE] = {
#include "test_mem_par_otherip_text.hex"
};
int main() {
  unsigned int nbi[NB_IP];
  unsigned int nbc[NB_IP];
  int          w;
  for (int i = 0; i < NB_IP; i++) {
    cfg_ptr[i] = XMulticycle_pipeline_ip_LookupConfig(i);
    XMulticycle_pipeline_ip_CfgInitialize(&ip[i], cfg_ptr[i]);
    XMulticycle_pipeline_ip_Set_ip_num(&ip[i], i);
    XMulticycle_pipeline_ip_Set_start_pc(&ip[i], 0);
    XMulticycle_pipeline_ip_Set_data_ram(&ip[i], DATA_RAM);
  }
  for (int i = 1; i < NB_IP; i++)
    XMulticycle_pipeline_ip_Write_ip_code_ram_Words(&ip[i], 0,
      code_ram, IP_CODE_RAM_SIZE);
  XMulticycle_pipeline_ip_Write_ip_code_ram_Words(&ip[0], 0,
    code_ram_0, IP_CODE_RAM_SIZE);
  for (int i = 1; i < NB_IP; i++)
    XMulticycle_pipeline_ip_Start(&ip[i]);
  XMulticycle_pipeline_ip_Start(&ip[0]);
  for (int i = NB_IP - 1; i >= 0; i--)
    while (!XMulticycle_pipeline_ip_IsDone(&ip[i]));
  for (int i = 0; i < NB_IP; i++) {
    nbc[i] = (int)XMulticycle_pipeline_ip_Get_nb_cycle(&ip[i]);
    nbi[i] = (int)XMulticycle_pipeline_ip_Get_nb_instruction(&ip[i]);
    printf("core %d: %d fetched and decoded instructions \
in %d cycles (ipc = %2.2f)\n", i, nbi[i], nbc[i],
           ((float)nbi[i]) / nbc[i]);
    printf("data memory dump (non null words)\n");
    for (int j = 0; j < IP_DATA_RAM_SIZE; j++) {
      w = data_ram[i * IP_DATA_RAM_SIZE + j];
      if (w != 0)
        printf("m[%5x] = %16d (%8x)\n",
               (i * IP_DATA_RAM_SIZE + j) * 4, w, (unsigned int)w);
    }
  }
  return 0;
}
The run should print on the putty window (for a 2-core IP) what is shown in
Listing 12.24.
Listing 12.24 The helloworld prints
core 0: 101 fetched and decoded instructions in 273 cycles (ipc = 0.37)
data memory dump (non null words)
m[    0] = 1 (1)
m[    4] = 2 (2)
m[    8] = 3 (3)
m[    c] = 4 (4)
m[   10] = 5 (5)
m[   14] = 6 (6)
m[   18] = 7 (7)
m[   1c] = 8 (8)
m[   20] = 9 (9)
m[   24] = 10 (a)
m[   28] = 55 (37)
m[   2c] = 210 (d2)
core 1: 90 fetched and decoded instructions in 243 cycles (ipc = 0.37)
data memory dump (non null words)
m[20000] = 11 (b)
m[20004] = 12 (c)
m[20008] = 13 (d)
m[2000c] = 14 (e)
m[20010] = 15 (f)
m[20014] = 16 (10)
m[20018] = 17 (11)
m[2001c] = 18 (12)
m[20020] = 19 (13)
m[20024] = 20 (14)
m[20028] = 155 (9b)
Instead of comparing the multicore design to the 4-stage pipeline baseline, I prefer
to compare the different versions of the multicore (i.e. 2-core, 4-core, and 8-core).
To measure the efficiency of the parallelization, I compare the times to run a
distributed matrix multiplication on an increasing number of cores.
The matrix multiplication is a computation example which tests the capability of
the shared memory interconnect to route intense traffic. Each core uses a large
amount of external memory data (the more cores, the higher the proportion of
external data accesses).
The code of the matrix multiplication can be found in the mulmat.c file in the
multicore_multicycle_ip folder. The code contains a definition of the LOG_NB_IP
constant, which should be kept consistent with the LOG_NB_IP constant defined in the
multicore_multicycle_ip.h file.
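One plausible work distribution is sketched below (an assumption: the actual mulmat.c may partition the computation differently); each core computes a block of N/NB_IP result rows, so the b matrix is mostly read from remote banks:

/* hypothetical row-block partitioning of the result matrix across cores */
#define N     32
#define NB_IP 4
void mulmat_core(int ip_num, const int a[N][N], const int b[N][N], int c[N][N]) {
  int rows = N / NB_IP;
  for (int i = ip_num * rows; i < (ip_num + 1) * rows; i++)
    for (int j = 0; j < N; j++) {
      int s = 0;
      for (int k = 0; k < N; k++)
        s += a[i][k] * b[k][j]; /* b[k][j] mostly lives in remote banks */
      c[i][j] = s;
    }
}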
Table 12.1 Execution time of the parallelized matrix multiplication on the multicore_multicycle_ip processor and speedup from the sequential run

Number of cores   Cycles      nmi         CPI    Time (s)      Speedup
1                 6,236,761   3,858,900   1.62   0.124735220   –
2                 5,162,662   4,545,244   2.27   0.103253240   1.21
4                 2,618,162   4,601,284   2.28   0.052363240   2.38
8                 1,351,309   4,728,936   2.29   0.027026180   4.62
13 A Multicore RISC-V Processor with Multihart Cores
Abstract
This chapter will make you build your second multicore RISC-V CPU. The
processor is built from multiple IPs, each being a copy of the multihart_ip presented
in Chap. 10. Each core runs multiple harts. Each core has its own code and data
memories. The code memory is common to all the harts of the core. The data
memory of the core is partitioned between the implemented harts. Hence, a processor
with c cores of h harts each has h*c data memory partitions embedded in c memory IPs.
The data memory banks are interconnected with an AXI interconnect IP. Any hart
has private access to its own data memory partition and to any other partition of the
same core, and remote access to any partition of any other core. An example of
a parallelized matrix multiplication is used to measure the speedup when moving
the number of cores from one to four and the number of harts from one to eight,
with a maximum of 16 harts in the whole IP for simulation and a maximum of
eight implementable harts on the FPGA.
All the source files related to the multicore_multihart_ip can be found in the
multicore_multihart_ip folder.
Like for the multicore_multicycle_ip design presented in Chap. 12, the codes
presented in this chapter describe the implementation of any one of the multiple cores
composing the processor, not the whole processor. The concerned core is identified by
the ip_num argument sent to the IP when it is started.
The multihart_ip top function defining the CPUs is located in the multihart_ip.cpp
file.
Its prototype (see Listing 13.1) is a mix of the multihart_ip one (refer back to
Sect. 10.3.4) and the multicore_multicycle_ip one (refer back to Sect. 12.1.1). The
ip_num argument is the IP number. The running_hart_set argument is the set of
running harts in the core. The start_pc argument is the array used to set the starting
pc of the running harts in the core code memory.
Listing 13.1 The multihart_ip top function for a multicore and multihart design
void multihart_ip(
  unsigned int ip_num,
  unsigned int running_hart_set,
  unsigned int start_pc[NB_HART],
  unsigned int ip_code_ram[IP_CODE_RAM_SIZE],
  int          ip_data_ram[NB_HART][HART_DATA_RAM_SIZE],
  int          data_ram[NB_IP][NB_HART][HART_DATA_RAM_SIZE],
  unsigned int *nb_instruction,
  unsigned int *nb_cycle) {
#pragma HLS INTERFACE s_axilite port=ip_num
#pragma HLS INTERFACE s_axilite port=running_hart_set
#pragma HLS INTERFACE s_axilite port=start_pc
#pragma HLS INTERFACE s_axilite port=ip_code_ram
#pragma HLS INTERFACE bram      port=ip_data_ram storage_type=ram_1p
#pragma HLS INTERFACE m_axi     port=data_ram offset=slave
#pragma HLS INTERFACE s_axilite port=nb_instruction
#pragma HLS INTERFACE s_axilite port=nb_cycle
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INLINE recursive
  ...
The local declarations (see Listing 13.2) are vectorized like in the Chap. 10
multihart_ip top function.
Listing 13.2 The multihart_ip top function declarations
...
  int reg_file[NB_HART][NB_REGISTER];
#pragma HLS ARRAY_PARTITION variable=reg_file dim=0 complete
  bit_t is_reg_computed[NB_HART][NB_REGISTER];
#pragma HLS ARRAY_PARTITION variable=is_reg_computed dim=0 complete
  from_d_to_f_t f_from_d;
  from_e_to_f_t f_from_e;
  bit_t f_state_is_full[NB_HART];
#pragma HLS ARRAY_PARTITION variable=f_state_is_full dim=1 complete
  f_state_t f_state[NB_HART];
#pragma HLS ARRAY_PARTITION variable=f_state dim=1 complete
  from_f_to_d_t f_to_d;
...
In Listing 13.3, the pragma orders the synthesizer to eliminate any RAW depen-
dency on the data_ram variable between successive iterations of the loop (type=inter
option).
In other words, the data_ram variable read in the mem_load function to perform
a remote load is not serialized after the data_ram variable write in the mem_store
function to perform a remote store (the RAW dependency between the two accesses
to the data_ram variable in successive iterations of the do ... while loop is eliminated
by the synthesizer). Hence, a back-to-back remote load starts while the just preceding
remote store is still in progress.
A consequence of this choice is that a remote store (write to the data_ram variable)
followed by a load at the same address (read from the data_ram variable) would not
behave correctly (refer back to Listing 12.10).
However, the elimination of inter-iteration RAW dependencies on the data_ram
variable is necessary to keep the processor cycle at two FPGA cycles.
Listing 13.3 The do ... while loop
...
  do {
#pragma HLS DEPENDENCE dependent=false direction=RAW type=inter variable=data_ram
#pragma HLS PIPELINE II=2
#ifndef __SYNTHESIS__
#ifdef DEBUG_PIPELINE
    printf("=============================================\n");
    printf("cycle %d\n", (unsigned int)nbc);
#endif
#endif
    new_cycle(f_to_d, d_to_f, d_to_i, i_to_e, e_to_f, e_to_m,
              m_to_w, &f_from_d, &f_from_e, &d_from_f,
              &i_from_d, &e_from_i, &m_from_e, &w_from_m);
    statistic_update(e_from_i, &nbi, &nbc);
    running_cond_update(has_exited, &is_running);
    fetch(f_from_d, f_from_e, d_state_is_full,
          ip_code_ram, f_state, &f_to_d, f_state_is_full);
    decode(d_from_f, i_state_is_full, d_state, &d_to_f,
           &d_to_i, d_state_is_full);
    issue(i_from_d, e_state_is_full, reg_file,
          is_reg_computed, i_state, &i_to_e, i_state_is_full,
          &is_lock, &i_hart, &i_destination);
    execute(
#ifndef __SYNTHESIS__
      ip_num,
#endif
      e_from_i, m_state_is_full,
#ifndef __SYNTHESIS__
      reg_file,
#endif
      e_state, &e_to_f, &e_to_m, e_state_is_full);
    mem_access(ip_num, m_from_e, w_state_is_full,
               ip_data_ram, data_ram, m_state, &m_to_w,
               m_state_is_full);
    write_back(
#ifndef __SYNTHESIS__
      ip_num,
#endif
      w_from_m, reg_file, w_state, w_state_is_full,
      ...
The fetch (in the fetch.cpp file), decode (in the decode.cpp file), issue (in the
issue.cpp file), execute (in the execute.cpp file), and write_back (in the wb.cpp
file) functions have the same code as in Chap. 10. The lock_unlock_update
function (in the multihart_ip.cpp file) is unchanged too.
The init_file function in the multihart_ip.cpp file (see Listing 13.4) sets registers
a0, a1, and sp (the core identification number, the hart identification number, and the
hart stack pointer, respectively). All the harts of a core share the same initial sp: for
a 2-core IP, every hart of core 0 starts with sp = 0x20000 and every hart of core 1
with sp = 0x40000, as the register dumps in Sect. 13.2 show.
Listing 13.4 The init_file function
// a0/x10 is set with the IP number
// a1/x11 is set with the hart number
static void init_file(
  ip_num_t ip_num,
  int      reg_file[][NB_REGISTER],
  bit_t    is_reg_computed[][NB_REGISTER]) {
  hart_num_p1_t h1;
  hart_num_t    h;
  reg_num_p1_t  r1;
  reg_num_t     r;
  for (h1 = 0; h1 < NB_HART; h1++) {
#pragma HLS UNROLL
    h = h1;
    for (r1 = 0; r1 < NB_REGISTER; r1++) {
#pragma HLS UNROLL
      r = r1;
      is_reg_computed[h][r] = 0;
      if (r == 10)
        reg_file[h][r] = ip_num;
      else if (r == 11)
        reg_file[h][r] = h;
      else if (r == SP)
        reg_file[h][r] = ((int)(ip_num + 1)) << (LOG_IP_DATA_RAM_SIZE + 2);
      else
        reg_file[h][r] = 0;
    }
  }
}
The mem_access function code implementing the memory access stage is presented
in Listing 13.5.
A hart is selected in two steps. The select_hart function returns the highest priority
ready hart number. In parallel, the input from the execute stage is saved in the
m_state array (save_input_from_e function). The selection process then keeps the
select_hart function selection or, if no ready hart was found, selects the hart just
input.
The access is done in the stage_job function. The accessed_ip and accessed_h
numbers are chosen according to the is_local_ip bit computed in the
save_input_from_e function when an instruction is input from the execute stage.
These three values are pre-computed at instruction input to shorten the critical path
of the memory access.
The mem_access function in the mem_access.cpp file fills the output structure
sent to the writeback stage (set_output_to_w function).
Listing 13.5 The mem_access function
void mem_access(
  ip_num_t      ip_num,
  from_e_to_m_t m_from_e,
  bit_t         *w_state_is_full,
  int           ip_data_ram[][HART_DATA_RAM_SIZE],
  int           data_ram[][NB_HART][HART_DATA_RAM_SIZE],
  m_state_t     *m_state,
  from_m_to_w_t *m_to_w,
  bit_t         *m_state_is_full) {
  bit_t      is_selected;
  hart_num_t selected_hart;
  bit_t      is_accessing;
  hart_num_t accessing_hart;
  bit_t      input_is_selectable;
  input_is_selectable =
    m_from_e.is_valid && !w_state_is_full[m_from_e.hart];
  select_hart(m_state_is_full, w_state_is_full,
              &is_selected, &selected_hart);
  if (m_from_e.is_valid) {
    m_state_is_full[m_from_e.hart] = 1;
    save_input_from_e(ip_num, m_from_e, m_state);
  }
  is_accessing   = is_selected || input_is_selectable;
  accessing_hart = (is_selected) ? selected_hart : m_from_e.hart;
  if (is_accessing) {
    m_state_is_full[accessing_hart] = 0;
    stage_job(m_state[accessing_hart].accessed_ip,
              m_state[accessing_hart].accessed_h,
              m_state[accessing_hart].is_local_ip,
              m_state[accessing_hart].is_load,
              m_state[accessing_hart].is_store,
              m_state[accessing_hart].address,
              m_state[accessing_hart].func3,
              ip_data_ram, data_ram,
              &m_state[accessing_hart].value);
#ifndef __SYNTHESIS__
#ifdef DEBUG_PIPELINE
    printf("hart %d: mem ", (int)accessing_hart);
    printf("%04d\n",
           (int)(m_state[accessing_hart].fetch_pc << 2));
#endif
#endif
    set_output_to_w(accessing_hart, m_state, m_to_w);
  }
  m_to_w->is_valid = is_accessing;
}
Listing 13.6 shows the code of the save_input_from_e function, located in the
mem_access.cpp file.
The accessed absolute_hart number is computed from the address (which gives
the accessed hart number relative to the accessing IP), from the ip_num of the accessing
IP, and from the accessing hart.
The accessed IP (m_state[hart].accessed_ip) is the IP number of the accessed
memory partition; it is the upper part of the absolute_hart number. The accessed
hart (m_state[hart].accessed_h) is the hart number, within the accessed IP, of the
accessed hart partition; it is the lower part of the absolute_hart number.
The m_state[hart].is_local_ip bit is set if the accessed IP is the accessing IP.
Listing 13.6 The save_input_from_e function
static void save_input_from_e(
  ip_num_t      ip_num,
  from_e_to_m_t m_from_e,
  m_state_t     *m_state) {
  hart_num_t hart;
  ap_uint<LOG_NB_IP + LOG_NB_HART> absolute_hart;
  hart = m_from_e.hart;
  m_state[hart].rd          = m_from_e.rd;
  m_state[hart].has_no_dest = m_from_e.has_no_dest;
  m_state[hart].is_load     = m_from_e.is_load;
  m_state[hart].is_store    = m_from_e.is_store;
  m_state[hart].func3       = m_from_e.func3;
  m_state[hart].is_ret      = m_from_e.is_ret;
  m_state[hart].address     = m_from_e.address;
  m_state[hart].value       = m_from_e.value;
  m_state[hart].result      = m_from_e.value;
  absolute_hart =
    (m_from_e.address >> (LOG_HART_DATA_RAM_SIZE + 2)) +
    (((ap_uint<LOG_NB_IP + LOG_NB_HART>)ip_num) << LOG_NB_HART) +
    hart;
  m_state[hart].accessed_ip = absolute_hart >> LOG_NB_HART;
  m_state[hart].accessed_h  = absolute_hart;
  m_state[hart].is_local_ip = (m_state[hart].accessed_ip == ip_num);
#ifndef __SYNTHESIS__
  m_state[hart].fetch_pc    = m_from_e.fetch_pc;
  m_state[hart].instruction = m_from_e.instruction;
  m_state[hart].d_i         = m_from_e.d_i;
  m_state[hart].target_pc   = m_from_e.target_pc;
#endif
}
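A worked example of this arithmetic (a sketch; it assumes a 2-core, 2-hart IP, i.e. LOG_NB_IP = 1, LOG_NB_HART = 1, and LOG_HART_DATA_RAM_SIZE = 14, matching the dumps of Sect. 13.2):

#include <stdio.h>
/* core 0, hart 0 loads address 0x20028 (core 1 hart 0's partial sum) */
int main(void) {
  unsigned ip_num = 0, hart = 0, address = 0x20028;
  unsigned absolute_hart =
    ((address >> (14 + 2)) + (ip_num << 1) + hart) & 3; /* 2-bit value: 2 */
  printf("accessed_ip = %u, accessed_h = %u\n",
         absolute_hart >> 1, absolute_hart & 1);        /* core 1, hart 0 */
  return 0;
}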
Listing 13.7 shows the code of the stage_job function, located in the
mem_access.cpp file.
Here, the ip_num argument refers to the accessed IP, not to the accessing one, and
the hart argument refers to the accessed hart partition.
If the instruction is neither a load nor a store, the stage_job function does nothing
(the instruction just transits through the memory access stage).
A load accesses the accessed IP and accessed hart memory partition in the
mem_load function. A store accesses the memory in the mem_store function.
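The function is essentially the Chap. 12 stage_job extended with the accessed hart number. A minimal sketch, assuming it mirrors Listing 12.8 (the actual source may differ in details):

static void stage_job(
  ip_num_t         accessed_ip,
  hart_num_t       accessed_h,
  bit_t            is_local_ip,
  bit_t            is_load,
  bit_t            is_store,
  b_data_address_t address,
  func3_t          func3,
  int              ip_data_ram[][HART_DATA_RAM_SIZE],
  int              data_ram[][NB_HART][HART_DATA_RAM_SIZE],
  int              *value) {
  if (is_load)        /* read the accessed hart partition  */
    *value = mem_load(accessed_ip, is_local_ip, accessed_h,
                      ip_data_ram, data_ram, address, func3);
  else if (is_store)  /* write the accessed hart partition */
    mem_store(accessed_ip, is_local_ip, accessed_h,
              ip_data_ram, data_ram, address, *value,
              (ap_uint<2>)func3);
}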
Listing 13.8 shows the start of the code of the mem_load function, located in the
mem.cpp file.
A local load reads from the local IP data ram (ip_data_ram). A remote load reads
from the hart partition of the ip core within the data_ram. After a full word has been
loaded, the addressed bytes are selected and returned as the result of the load (this
part of mem_load is not shown, as it is unchanged from the preceding designs; refer
back to Listing 6.11).
Listing 13.8 The mem_load function
int mem_load(
  ip_num_t         ip,
  bit_t            is_local,
  hart_num_t       hart,
  int              ip_data_ram[][HART_DATA_RAM_SIZE],
  int              data_ram[][NB_HART][HART_DATA_RAM_SIZE],
  b_data_address_t address,
  func3_t          msize) {
  ap_uint<2>            a01 = address;
  bit_t                 a1  = (address >> 1);
  w_hart_data_address_t a2  = (address >> 2);
  int            result;
  char           b, b0, b1, b2, b3;
  unsigned char  ub, ub0, ub1, ub2, ub3;
  short          h, h0, h1;
  unsigned short uh, uh0, uh1;
  int            w, ib, ih;
  unsigned int   iub, iuh;
  if (is_local)
    w = ip_data_ram[hart][a2];
  else
    w = data_ram[ip][hart][a2];
  ...
Listing 13.9 shows the code of the mem_store function, located in the mem.cpp
file.
A local store writes to the local IP ram bank (i.e. to the hart partition of the
ip_data_ram array).
A remote store writes to the accessed IP memory bank (i.e. to the hart
partition of the ip core in the data_ram array) through the AXI interconnect.
Listing 13.9 The mem_store function
void mem_store(
  ip_num_t         ip,
  bit_t            is_local,
  hart_num_t       hart,
  int              ip_data_ram[][HART_DATA_RAM_SIZE],
  int              data_ram[][NB_HART][HART_DATA_RAM_SIZE],
  b_data_address_t address,
  int              rv2,
  ap_uint<2>       msize) {
  b_hart_data_address_t a  = address;
  h_hart_data_address_t a1 = address >> 1;
  w_hart_data_address_t a2 = address >> 2;
  char  rv2_0  = rv2;
  short rv2_01 = rv2;
  switch (msize) {
    case SB:
      if (is_local)
        *((char *)(ip_data_ram) +
          ((((b_ip_data_address_t)hart) <<
            (LOG_HART_DATA_RAM_SIZE + 2)) | a)) = rv2_0;
      else
        *((char *)(data_ram) +
          ((((b_data_address_t)ip) <<
            (LOG_IP_DATA_RAM_SIZE + 2)) |
           (((b_ip_data_address_t)hart) <<
            (LOG_HART_DATA_RAM_SIZE + 2)) | a)) = rv2_0;
      break;
    case SH:
      if (is_local)
        *((short *)(ip_data_ram) +
          ((((h_ip_data_address_t)hart) <<
            (LOG_HART_DATA_RAM_SIZE + 1)) | a1)) = rv2_01;
      else
        *((short *)(data_ram) +
          ((((h_data_address_t)ip) <<
            (LOG_IP_DATA_RAM_SIZE + 1)) |
           (((h_ip_data_address_t)hart) <<
            (LOG_HART_DATA_RAM_SIZE + 1)) | a1)) = rv2_01;
      break;
    case SW:
      if (is_local)
        ip_data_ram[hart][a2] = rv2;
      else
        data_ram[ip][hart][a2] = rv2;
      break;
    case 3:
      break;
  }
}
Experimentation
To simulate the multicore_multihart_ip, operate as explained in Sect. 5.3.6,
replacing fetching_ip with multicore_multihart_ip. There are two testbench
programs: testbench_seq_multihart_ip.cpp to run independent codes (one per hart
in each core) and testbench_par_multihart_ip.cpp to run a parallel sum of the
elements of an array.
With testbench_seq_multihart_ip.cpp you can play with the simulator, replacing
the included test_mem_0_text.hex file with any other .hex file you find in the
same folder. You can also vary the number of cores and the number of harts (two
cores and two, four, or eight harts per core; four cores and two or four harts per
core; eight cores and two harts per core; in any case, no more than 16 harts per
processor).
  }
  for (int i = 0; i < NB_IP; i++) {
    printf("core %d: %d fetched and decoded instructions \
in %d cycles (ipc = %2.2f)\n", i, nbi[i], nbc[i],
           ((float)nbi[i]) / nbc[i]);
    for (int h = 0; h < NB_HART; h++) {
      printf("hart: %d data memory dump (non null words)\n", h);
      for (int j = 0; j < HART_DATA_RAM_SIZE; j++) {
        w = data_ram[i][h][j];
        if (w != 0)
          printf("m[%5x] = %16d (%8x)\n",
                 4 * ((i << LOG_IP_DATA_RAM_SIZE) +
                      (h << LOG_HART_DATA_RAM_SIZE) + j),
                 w, (unsigned int)w);
      }
    }
  }
  return 0;
}
For the run of four copies (two cores of two harts) of test_mem.h, the output is
shown in Listings 13.11 to 13.13.
Listing 13.11 The output of the main function of the testbench_seq_multicore_multihart_ip.cpp
file: core 0
hart 0: 0000: 00000513 li a0, 0
hart 0: a0 = 0 (0)
hart 1: 0000: 00000513 li a0, 0
hart 1: a0 = 0 (0)
...
hart 0: 0056: 00a62223 sw a0, 4(a2)
hart 0: m[   2c] = 55 (37)
hart 1: 0056: 00a62223 sw a0, 4(a2)
hart 1: m[1002c] = 55 (37)
hart 0: 0060: 00008067 ret
hart 0: pc = 0 (0)
hart 1: 0060: 00008067 ret
hart 1: pc = 0 (0)
register file for hart 0
...
sp = 131072 (20000)
...
a0 = 55 (37)
...
a2 = 40 (28)
a3 = 40 (28)
a4 = 10 (a)
...
register file for hart 1
...
sp = 131072 (20000)
...
a0 = 55 (37)
...
a2 = 40 (28)
a3 = 40 (28)
a4 = 10 (a)
...
m[10020] = 9 (9)
m[10024] = 10 (a)
m[1002c] = 55 (37)
core 1: 176 fetched and decoded instructions in 306 cycles (ipc = 0.58)
hart: 0 data memory dump (non null words)
m[20000] = 1 (1)
m[20004] = 2 (2)
m[20008] = 3 (3)
m[2000c] = 4 (4)
m[20010] = 5 (5)
m[20014] = 6 (6)
m[20018] = 7 (7)
m[2001c] = 8 (8)
m[20020] = 9 (9)
m[20024] = 10 (a)
m[2002c] = 55 (37)
hart: 1 data memory dump (non null words)
m[30000] = 1 (1)
m[30004] = 2 (2)
m[30008] = 3 (3)
m[3000c] = 4 (4)
m[30010] = 5 (5)
m[30014] = 6 (6)
m[30018] = 7 (7)
m[3001c] = 8 (8)
m[30020] = 9 (9)
m[30024] = 10 (a)
m[3002c] = 55 (37)
};
int data_ram[NB_IP][NB_HART][HART_DATA_RAM_SIZE];
unsigned int start_pc[NB_HART] = {0};
unsigned int start_pc_0[NB_HART];
int main() {
  unsigned int nbi[NB_IP];
  unsigned int nbc[NB_IP];
  int          w;
  start_pc_0[0] = 0;
  for (int i = 1; i < NB_HART; i++)
    start_pc_0[i] = OTHER_HART_START;
  for (int i = 1; i < NB_IP; i++)
    multihart_ip(i, (1 << NB_HART) - 1, start_pc, code_ram,
                 &data_ram[i][0], data_ram, &nbi[i], &nbc[i]);
  multihart_ip(0, (1 << NB_HART) - 1, start_pc_0, code_ram_0,
               &data_ram[0][0], data_ram, &nbi[0], &nbc[0]);
  for (int i = 0; i < NB_IP; i++) {
    printf("core %d: %d fetched and decoded instructions \
in %d cycles (ipc = %2.2f)\n", i, nbi[i], nbc[i],
           ((float)nbi[i]) / nbc[i]);
    for (int h = 0; h < NB_HART; h++) {
      printf("hart %d: data memory dump (non null words)\n", h);
      for (int j = 0; j < HART_DATA_RAM_SIZE; j++) {
        w = data_ram[i][h][j];
        if (w != 0)
          printf("m[%5x] = %16d (%8x)\n",
                 4 * ((i << LOG_IP_DATA_RAM_SIZE) +
                      (h << LOG_HART_DATA_RAM_SIZE) + j),
                 w, (unsigned int)w);
      }
    }
  }
  return 0;
}
The code run is the one already presented in Sect. 12.2.2. All the cores except
the first one run the test_mem_par_otherip.s code. The first core runs the
test_mem_par_ip0.s code.
In the testbench code, the call related to the first core is placed in the last position
to ensure correct simulation.
For the run on two cores of two harts, the output is shown in Listings 13.15 to
13.17 (the final sum of the 40 first integers is 820).
Listing 13.15 The output of the main function of the testbench_par_multicore_multihart_ip.cpp
file: core 1
hart 0: 0000: 00359293 slli t0, a1, 3
hart 0: t0 = 0 (0)
hart 1: 0000: 00359293 slli t0, a1, 3
hart 1: t0 = 8 (8)
...
hart 0: 0088: 00a62023 sw a0, 0(a2)
hart 0: m[20028] = 255 (ff)
hart 1: 0088: 00a62023 sw a0, 0(a2)
hart 1: m[30028] = 355 (163)
hart 0: 0092: 00008067 ret
hart 0: pc = 0 (0)
hart 1: 0092: 00008067 ret
hart 1: pc = 0 (0)
register file for hart 0
...
sp = 262144 (40000)
...
t0 = 8 (8)
t1 = 2 (2)
...
a0 = 255 (ff)
...
a2 = 40 (28)
a3 = 40 (28)
a4 = 30 (1e)
...
register file for hart 1
...
sp = 262144 (40000)
...
t0 = 8 (8)
t1 = 2 (2)
...
a0 = 355 (163)
...
a2 = 40 (28)
a3 = 40 (28)
a4 = 40 (28)
...
Figure 13.1 shows that the II = 2 constraint is satisfied for a 2-core, two-hart-per-core
processor (the II = 2 constraint is also satisfied for a 2-core, four-hart-per-core
design and for a 4-core, two-hart-per-core design). The iteration latency,
imposed by the global memory access, is 13 FPGA cycles.
Fig. 13.3 Vivado implementation report for a 2-core, two-hart processor
For two cores, the s_axi_control range is 256 KB and the addresses are 0x4004_0000
and 0x4008_0000. For four cores, the range is 128 KB and the addresses are
0x4004_0000, 0x4006_0000, 0x4008_0000, and 0x400A_0000.
The Vivado bitstream generation produces the implementation report in Fig. 13.3,
showing that the 2-core IP with two harts uses 20,756 LUTs (39.02%). The 2-core
4-hart version uses 36,204 LUTs (68.05%). The 4-core 2-hart IP uses 37,520 LUTs
(70.53%).
Experimentation
To run the multicore_multihart_ip on the development board, proceed as explained
in Sect. 5.3.10, replacing fetching_ip with multicore_multihart_ip.
There are two drivers: helloworld_seq.c, with which you can run independent
programs on each hart of each core, and helloworld_par.c, with which you can run a
parallel sum of the elements of an array.
The code to drive the FPGA to run the eight copies of the test_mem.h program is
shown in Listing 13.18 (the same code can be used to run the other test programs).
                     (h << LOG_HART_DATA_RAM_SIZE) + j];
        if (w != 0)
          printf("m[%5x] = %16d (%8x)\n",
                 4 * ((i << LOG_IP_DATA_RAM_SIZE) +
                      (h << LOG_HART_DATA_RAM_SIZE) + j),
                 (int)w, (unsigned int)w);
      }
    }
  }
}
The run should print on the putty window what is shown in Listing 13.19.
Listing 13.19 The helloworld_seq.c prints
core 0: 176 fetched and decoded instructions in 306 cycles (ipc = 0.58)
hart 0 data memory dump (non null words)
m[    0] = 1 (1)
m[    4] = 2 (2)
m[    8] = 3 (3)
m[    c] = 4 (4)
m[   10] = 5 (5)
m[   14] = 6 (6)
m[   18] = 7 (7)
m[   1c] = 8 (8)
m[   20] = 9 (9)
m[   24] = 10 (a)
m[   2c] = 55 (37)
hart 1 data memory dump (non null words)
m[10000] = 1 (1)
m[10004] = 2 (2)
m[10008] = 3 (3)
m[1000c] = 4 (4)
m[10010] = 5 (5)
m[10014] = 6 (6)
m[10018] = 7 (7)
m[1001c] = 8 (8)
m[10020] = 9 (9)
m[10024] = 10 (a)
m[1002c] = 55 (37)
core 1: 176 fetched and decoded instructions in 306 cycles (ipc = 0.58)
hart 0 data memory dump (non null words)
m[20000] = 1 (1)
m[20004] = 2 (2)
m[20008] = 3 (3)
m[2000c] = 4 (4)
m[20010] = 5 (5)
m[20014] = 6 (6)
m[20018] = 7 (7)
m[2001c] = 8 (8)
m[20020] = 9 (9)
m[20024] = 10 (a)
m[2002c] = 55 (37)
hart 1 data memory dump (non null words)
m[30000] = 1 (1)
m[30004] = 2 (2)
m[30008] = 3 (3)
m[3000c] = 4 (4)
m[30010] = 5 (5)
m[30014] = 6 (6)
m[30018] = 7 (7)
m[3001c] = 8 (8)
m[30020] = 9 (9)
m[30024] = 10 (a)
m[3002c] = 55 (37)
The end of the helloworld_par.c driver is reproduced here:
    XMultihart_ip_Write_ip_code_ram_Words(&ip[i], 0, code_ram,
      IP_CODE_RAM_SIZE);
  }
  for (int h=1; h<NB_HART; h++)
    start_pc[h] = OTHER_HART_START;
  XMultihart_ip_Write_start_pc_Words(&ip[0], 0, start_pc, NB_HART);
  XMultihart_ip_Write_ip_code_ram_Words(&ip[0], 0, code_ram_0,
    IP_CODE_RAM_SIZE);
  for (int i=0; i<NB_IP; i++) XMultihart_ip_Start(&ip[i]);
  for (int i=NB_IP-1; i>=0; i--)
    while (!XMultihart_ip_IsDone(&ip[i]));
  for (int i=0; i<NB_IP; i++) {
    nbc[i] = (int)XMultihart_ip_Get_nb_cycle(&ip[i]);
    nbi[i] = (int)XMultihart_ip_Get_nb_instruction(&ip[i]);
  }
  for (int i=0; i<NB_IP; i++) {
    printf("core %d: %d fetched and decoded instructions \
in %d cycles (ipc = %2.2f)\n", i, nbi[i], nbc[i], ((float)nbi[i])/nbc[i]);
    for (int h=0; h<NB_HART; h++) {
      printf("hart %d data memory dump (non null words)\n", h);
      for (int j=0; j<HART_DATA_RAM_SIZE; j++) {
        w = data_ram[(i << LOG_IP_DATA_RAM_SIZE) +
                     (h << LOG_HART_DATA_RAM_SIZE) + j];
        if (w != 0)
          printf("m[%5x] = %16d (%8x)\n",
            4*((i << LOG_IP_DATA_RAM_SIZE) +
               (h << LOG_HART_DATA_RAM_SIZE) + j), (int)w, (unsigned int)w);
      }
    }
  }
}
The run should print in the putty window what is shown in Listing 13.21.
Listing 13.21 The helloworld_par.c prints
core 0: 212 fetched and decoded instructions in 357 cycles (ipc = 0.59)
hart 0 data memory dump (non null words)
m[    0] =   1 (  1)
m[    4] =   2 (  2)
m[    8] =   3 (  3)
m[    c] =   4 (  4)
m[   10] =   5 (  5)
m[   14] =   6 (  6)
m[   18] =   7 (  7)
m[   1c] =   8 (  8)
m[   20] =   9 (  9)
m[   24] =  10 (  a)
m[   28] =  55 ( 37)
m[   2c] = 820 (334)
hart 1 data memory dump (non null words)
m[10000] =  11 (  b)
m[10004] =  12 (  c)
m[10008] =  13 (  d)
m[1000c] =  14 (  e)
m[10010] =  15 (  f)
m[10014] =  16 ( 10)
m[10018] =  17 ( 11)
m[1001c] =  18 ( 12)
m[10020] =  19 ( 13)
m[10024] =  20 ( 14)
m[10028] = 155 ( 9b)
core 1: 192 fetched and decoded instructions in 290 cycles (ipc = 0.66)
hart 0 data memory dump (non null words)
m[20000] =  21 ( 15)
m[20004] =  22 ( 16)
m[20008] =  23 ( 17)
m[2000c] =  24 ( 18)
m[20010] =  25 ( 19)
m[20014] =  26 ( 1a)
m[20018] =  27 ( 1b)
m[2001c] =  28 ( 1c)
m[20020] =  29 ( 1d)
m[20024] =  30 ( 1e)
m[20028] = 255 ( ff)
hart 1 data memory dump (non null words)
m[30000] =  31 ( 1f)
m[30004] =  32 ( 20)
m[30008] =  33 ( 21)
m[3000c] =  34 ( 22)
m[30010] =  35 ( 23)
m[30014] =  36 ( 24)
m[30018] =  37 ( 25)
m[3001c] =  38 ( 26)
m[30020] =  39 ( 27)
m[30024] =  40 ( 28)
m[30028] = 355 (163)
Table 13.1 shows the execution time of the matrix multiplication on multicore and multihart designs. The times are compared to the baseline time of the sequential version run on the multicycle pipeline design, and the speedup is given. The conditions of the run are the same as the ones presented in Sect. 12.6.
The code of the matrix multiplication can be found in the mulmat_xc_yh.c files in the multicore_multihart_ip folder (xc ranges from 2c to 8c, i.e. from two cores to eight cores, and yh ranges from 2h to 8h, i.e. from two harts to eight harts; the total number of harts should not exceed 16).
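For example, mulmat_4c_2h.c is the version parallelized for four cores with two harts per core, i.e. eight harts in total.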
Then, the mulmat_xc_yh_text.hex files are to be built with the build_mulmat_xc_yh.sh script. The run can be simulated in Vitis_HLS with the testbench_mulmat_par_multihart_ip.cpp testbench. The matrix multiplication can be run on the FPGA with the helloworld_mulmat_xc_yh.c driver (xc is 2c or 4c and yh is 2h or 4h, with a total number of harts of no more than eight).
The lines starting with a star (*) in Table 13.1 correspond to simulation-only results, as the matching designs cannot be implemented on the XC7Z020 FPGA.
Table 13.1 Execution time of the parallelized matrix multiplication on the multicore_multihart_ip processor and speedup from the sequential run

Number of cores  Number of harts  Cycles     nmi        cpi   Time (s)     Speedup
 1               1                6,236,761  3,858,900  1.62  0.124735220  –
 2               2                3,072,198  4,601,284  1.34  0.061443960  2.03
 2               4                2,466,429  4,728,936  1.04  0.049328580  2.53
*2               8                2,694,867  5,057,744  1.07  0.053897340  2.31
 4               2                1,581,360  4,728,936  1.34  0.031627200  3.94
*4               4                1,326,897  5,057,744  1.05  0.026537940  4.70
*8               2                  849,410  5,057,744  1.34  0.016988200  7.34

The fastest design to run the matrix multiplication is the 8-core 2-hart one (speedup of 7.34). The fastest implementable design is the 8-core 1-hart evaluated in the preceding chapter (4.62 times faster than a single-core single-hart processor; however, that design uses 43,731 LUTs, to be compared to the 4,111 LUTs used by the single-core single-hart IP, i.e. nearly 11 times more).
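These numbers can be cross-checked from the table itself: the 2-core 2-hart speedup is the cycle ratio 6,236,761 / 3,072,198 ≈ 2.03, and the times correspond to a 20 ns FPGA clock cycle (e.g. 6,236,761 × 20 ns = 0.124735220 s). The cpi values are consistent with aggregating the per-core cycles (e.g. 2 × 3,072,198 / 4,601,284 ≈ 1.34 for the 2-core 2-hart run).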
The experiments show that with two harts per core, the remote memory access latency is hidden by the multithreading mechanism, as the speedup is super-optimal (2.03 for two cores) or close to optimal (3.94 for four cores and 7.34 for eight cores) (the speedup can be more than optimal because more threads than cores are run; the optimal speedup related to the number of threads run on a 2-core 2-hart processor is 4).
With four harts per core, the speedup related to the number of cores is super-optimal (2.53 for two cores and 4.70 for four cores).
14 Conclusion: Playing with the Pynq-Z1/Z2 Development Board Leds and Push Buttons
Abstract
This chapter makes you play with the leds and push buttons of the development board. As a first step, an experiment is built from a driver run on the Zynq Processing System and directly interacting with the board buttons and leds. Then, the driver is modified to interact with a multicore_multicycle_ip processor presented in Chap. 12. The processor runs a RISC-V program which accesses the board buttons and leds. From the general organization of the multicore_multicycle_ip processor design shown in this chapter, you can develop any RISC-V application to access the resources on the development board (switches, buttons and leds, DDR3 DRAM, SD card), including the expansion connectors (USB, HDMI, Ethernet RJ45, Pmods and Arduino shield).
All the source files related to the button/led IP can be found in the pynq_io folder.
All the development boards include a set of buttons and leds. These resources can be accessed from the FPGA, either directly from its PS part (the Zynq Processing System) or from its PL part (the Programmable Logic implementing your RISC-V processors).
As a first step, you will build a design containing a Zynq Processing System and two GPIO IPs (GPIO stands for General Purpose I/O), interconnected with the AXI interconnect IP. One of the two GPIO IPs will be connected to the four FPGA button pads, which are wired to the push buttons on the board. The other GPIO IP will be connected to the four led pads.
Fig. 14.1 Buttons and leds access design based on the Zynq Processing System
Figure 14.1 shows the GPIO design to be built in Vivado. You must add three IPs: the Zynq7 Processing System and two AXI_GPIO IPs. The AXI SmartConnect IP is automatically added when you run the connection automation. The two GPIO IPs have been renamed buttons and leds. In the connection dialog box, right-click on the buttons GPIO and, in the Options/Select Board Part Interface box, select btns_4bits (4 Buttons). For the leds GPIO, select leds_4bits (4 leds).
With the address editor, you can check the addresses allocated by the AXI inter-
connection system to the two GPIO IPs (0x41200000 for buttons and 0x41210000
for leds).
Once the bitstream has been generated and the hardware has been exported, in Vitis IDE you should run the helloworld_button_led.c driver shown in Listing 14.1 (all the software resources are located in the pynq_io folder).
Listing 14.1 The GPIO buttons and leds driver
#include <stdio.h>
#include "xparameters.h"
#include "xgpio.h"
#define BTN_CHANNEL 1 // AXI_GPIO can be configured with 1 or 2 channels
#define LED_CHANNEL 1 // the Vivado project configuration is for one channel
int main() {
  XGpio_Config *cfg_ptr;
  XGpio         leds_device, buttons_device;
  u32           data;
  cfg_ptr = XGpio_LookupConfig(XPAR_LEDS_DEVICE_ID);
  XGpio_CfgInitialize(&leds_device, cfg_ptr, cfg_ptr->BaseAddress);
  cfg_ptr = XGpio_LookupConfig(XPAR_BUTTONS_DEVICE_ID);
  XGpio_CfgInitialize(&buttons_device, cfg_ptr, cfg_ptr->BaseAddress);
  // unpressed button = 1; pressed button = 0; init as unpressed
  // (a 1 bit in the direction mask configures the pin as an input)
  XGpio_SetDataDirection(&buttons_device, BTN_CHANNEL, 0xf);
  // off led = 0; on led = 1; init as off
  // (a 0 bit in the direction mask configures the pin as an output)
  XGpio_SetDataDirection(&leds_device, LED_CHANNEL, 0);
  while (1) {
    // data is the bitmap of the four buttons with 0/pressed, 1/unpressed
    data = XGpio_DiscreteRead(&buttons_device, BTN_CHANNEL);
    XGpio_DiscreteWrite(&leds_device, LED_CHANNEL, data);
  }
}
When the driver is running, you light led LDx by pressing button BTNx (with x ranging from 0 to 3).
The RISC-V processor can access the board resources through the AXI interconnection.
It accesses the external resources through memory addresses outside its internal space (i.e. with loads and stores to addresses beyond the IP_DATA_RAM_SIZE limit).
Hence, the GPIO address spaces must be mapped after the RISC-V processor data space, and the code run on the RISC-V processor must address these external spaces to access the buttons and leds.
Figure 14.2 shows the design mixing the RISC-V processor and the GPIO IPs.
Figure 14.3 shows the memory mapping of the AXI interconnection. The RISC-V processor data RAM is 128 KB in size and ranges from address 0x40000000 to 0x4001ffff. The buttons GPIO space starts at address 0x40020000 and the leds GPIO space starts at address 0x40030000.
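To make the mapping concrete, here is a minimal sketch of a RISC-V-side access, not taken from the book's listings: it assumes the core sees its data RAM at byte offsets 0x0 to 0x1ffff and forwards loads and stores beyond the IP_DATA_RAM_SIZE limit to the AXI interconnect, so that the buttons GPIO data register would appear at offset 0x20000 and the leds one at offset 0x30000 (hypothetical offsets mirroring the 0x40020000 and 0x40030000 mapping; the AXI_GPIO data register sits at offset 0 of each GPIO space).

// Minimal sketch with assumed RISC-V-side offsets (see lead-in above).
#define BUTTONS ((volatile unsigned int *)0x20000) // assumed buttons GPIO offset
#define LEDS    ((volatile unsigned int *)0x30000) // assumed leds GPIO offset

void copy_buttons_to_leds(void) {
  while (1)
    *LEDS = *BUTTONS; // load the 4-bit button bitmap and store it to the leds
}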
The helloworld_button_led_multicore_multicycle.cpp Vitis IDE driver shown in Listing 14.2 runs the multicycle_pipeline_ip, which runs the RISC-V code to access the leds and buttons.
It is the same code as the one driving the multicore_multicycle_ip design presented in Sect. 12.5.1 (do not forget to update the .hex file paths to your environment with the update_helloworld shell script). As the code run by the RISC-V processor is a forever loop, the wait loop on XMulticycle_pipeline_ip_IsDone never exits.
Listing 14.2 The RISC-V GPIO buttons and leds driver
#include <stdio.h>
#include "xmulticycle_pipeline_ip.h"
#include "xparameters.h"
#define LOG_NB_IP            1
#define NB_IP                (1 << LOG_NB_IP)
#define LOG_IP_CODE_RAM_SIZE (16 - LOG_NB_IP) // in words
#define IP_CODE_RAM_SIZE     (1 << LOG_IP_CODE_RAM_SIZE)
#define LOG_IP_DATA_RAM_SIZE (16 - LOG_NB_IP) // in words
#define IP_DATA_RAM_SIZE     (1 << LOG_IP_DATA_RAM_SIZE)
#define DATA_RAM             0x40000000
int *data_ram = (int *)DATA_RAM;
XMulticycle_pipeline_ip_Config *cfg_ptr;
XMulticycle_pipeline_ip ip;
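Since the RISC-V program never terminates, the driver must start the IP without waiting for completion. Here is a minimal sketch of that start sequence, continuing the cfg_ptr and ip globals of Listing 14.2 and using the usual Vitis-HLS-generated driver names (the XPAR_ device identifier below is an assumption tied to your hardware export):

// Start the IP and do not wait for completion: the RISC-V program is an
// infinite loop, so IsDone would stay false forever.
cfg_ptr = XMulticycle_pipeline_ip_LookupConfig(
            XPAR_XMULTICYCLE_PIPELINE_IP_0_DEVICE_ID); // assumed identifier
XMulticycle_pipeline_ip_CfgInitialize(&ip, cfg_ptr);
XMulticycle_pipeline_ip_Start(&ip);
// no "while (!XMulticycle_pipeline_ip_IsDone(&ip));" here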
Fig. 14.2 Buttons and leds access design based on the RISC-V multicore_multicycle_ip
However, the Vivado/Vitis IDE XGpio_ functions must be adapted to the RISC-V processor.
This adaptation only consists in gathering the header files necessary to compile the driver. A few source files are also needed. The original files have been slightly modified to comment out some unnecessary inclusions and limit the number of imported files.
The gpio_utils folder contains the C code and header files which are these adapta-
tions of the XGpio_ functions to the RISC-V processor. These files are the modified
versions of the original files which can be found in the /opt/Xilinx/Vitis/2022.1/data/
embeddedsw/XilinxProcessorIPLib/drivers/gpio_v4_9/src folder.
To build the RISC-V code, you can use the build.sh shell script (it builds the .hex text and data files from the GPIO sources and the driver shown in Listing 14.3).
14.3 Conclusion
The technique employed to interface the development board leds and buttons with the RISC-V processor can be applied to all the other devices. In the Xilinx resources, you can find driver examples for the connectors of the various available boards (at https://github.com/Xilinx/embeddedsw/tree/master/XilinxProcessorIPLib/drivers; e.g. the GPIO files can be found at https://github.com/Xilinx/embeddedsw/tree/master/XilinxProcessorIPLib/drivers/gpio/src; documentation is available at https://xilinx.github.io/embeddedsw.github.io/gpio/doc/html/api/files.html).
The processor designs presented in this book are basic, unoptimized hardware designs. They can be improved either by expanding their capabilities (i.e. adding RISC-V ISA extensions), or by optimizing the low-level VHDL or Verilog code which can be derived from HLS when exporting RTL.
Their target is bare metal, but OS-based targets like Linux can also be reached by implementing the RISC-V privileged ISA.
A full machine, with DRAM, keyboard, mouse, and screen, could be developed around a development board, using the USB, HDMI, and Ethernet connectors. Only a SATA interface for hard drives is missing, but there is an SD card which can serve as permanent storage.
All these developments would deserve a new volume including experimentations to build a full computer. It could be combined with a renewed version of the Douglas Comer Xinu implementation of Unix, oriented to Linux kernels.
Index

A
Adder, 3, 24
Aligned memory access, 133, 186
ALU, 3
ARM, 14
Array initializer, 95
Artix-7, 14, 57
AXI, 129, 355
AXI based SoC, 356
AXI interconnected CPU and RAMs, 356

B
Benchmark, 216
Binary addition, 4
Bitstream, 11, 12
Board
  Basys3, 13
  board_file
    other than pynq, 16
    pynq-z1, 16
    pynq-z2, 16
  Jumper Configuration, 68
  Nexys4, 13
  Pynq-Z1, 13
  Pynq-Z2, 13
  Zedboard, 13
  Zybo, 13
Boolean function, 5
  generate, 7
  propagate, 7
BRAM two access ports, 184
Branch prediction, 297

C
Carry propagation chain, 9
CLB, 8
Combinational component, 23
Comer, Douglas, v
CPI, vii, 126

D
Data_ram
  external access, 185
  size, 184
Debugging hint
  from do ... while to for loop, 229
  frozen IP on FPGA
    ap_uint or ap_int variable, 229
    computation within #ifndef __SYNTHESIS__ and #endif, 230
    reduce the RISC-V code run, 230
    replacing while (..._ip_IsDone(...)); by for (int i=0; i<1000000; i++);, 230
  non deterministic behaviour on FPGA
    check initializations, 231
    save progression to IP data memory, 231
  synthesis is not simulation, 229

E
ELF based 0, 94
Endianness, 187
Execution time, 126
Exercise
  F ISA extension, 228
  M ISA extension, 226
Experimentation
  helloworld_button_led_multicore_multicycle.c, 427
  riscv-tests
    _start.S, 208
    helloworld_rv32i_npp_ip.c, 215
    my_riscv_test.h, 206
    my_test_macros.h, 207
    testbench_riscv_tests_rv32i_npp_ip.cpp, 209
    text and data hex files, 210
  rv32i_npp_bram_ip
    helloworld.c, 363
  rv32i_npp_ip
    helloworld.c, 199
    testbench_rv32i_npp_ip.cpp, 193
  rv32i_pp_ip
    helloworld.c, 262
    testbench_test_mem_rv32i_pp_ip.cpp, 260
  simple_pipeline_ip
    helloworld.c, 244
    simple_pipeline_ip.h, 236

F
Flip-flop, 9
FPGA, v, 9
Full adder, 7
Function
  fde_ip
    compute_branch_result, 170
    compute_next_pc, 171
    compute_op_result, 170
    compute_result, 168
    decode, 165
    execute, 166
    fde_ip, 160, 161
    print_reg, 162
    read_reg, 167
    running_cond_update, 163
    write_reg, 167
  fetching_decoding_ip
    compute_next_pc, 154
    decode_immediate, 152
    decode_instruction, 151
    execute, 154
    fetching_decoding_ip, 146
    i_immediate, 153
    type, 152
    type_00, 152
  fetching_ip
    compute_next_pc, 135
    fetch, 132
    fetching_ip, 128
  multicore_multicycle_ip
    init_reg_file, 382
    mem_access, 382
    mem_load, 384
    mem_store, 385
    multicycle_pipeline_ip top function, 378, 379, 381
    stage_job for the memory access stage, 383
  multicore_multihart_ip
    init_file, 404
    mem_access, 405
    mem_load, 407
    mem_store, 408
    multihart_ip top function, 402
    save_input_from_e in the memory access stage, 406
    stage_job in the memory access stage, 406
  multicycle_pipeline_ip
    compute, 287
    decode, 278
    decode stage set_output_to_f, 279
    decode stage set_output_to_i, 279
    decode stage stage_job, 279
    decode_instruction, 277, 278
    execute, 286
    execute stage set_output_to_f, 289
    execute stage set_output_to_m, 289
    execute stage stage_job, 287
    fetch, 275
    fetch stage set_output_to_d, 276
    fetch stage set_output_to_f, 276
    fetch stage stage_job, 276
    issue, 283
    issue stage set_output_to_e, 285
    issue stage stage_job, 285
    mem_access stage, 289
    memory access stage set_output_to_w, 290
    memory access stage stage_job, 290
    multicycle_pipeline_ip top function, 273
    write_back, 291
    writeback stage stage_job, 292
  multihart_ip
    decode, 323
    execute, 327
    fetch, 320
    init_exit, 316
    init_f_state, 317
    init_file, 317
    issue, 324, 325