Risc V Assembly Language Programming Using Esp32 c3 and Qemu
Risc V Assembly Language Programming Using Esp32 c3 and Qemu
RISC-V
Assembly Language
Programming
Using the ESP32-C3 and QEMU
Warren Gay
RISC-V Assembly
Language Programming
Using ESP32-C3 and QEMU
Warren Gay
● All rights reserved. No part of this book may be reproduced in any material form, including photocopying, or
storing in any medium by electronic means and whether or not transiently or incidentally to some other use of this
publication, without the written permission of the copyright holder except in accordance with the provisions of the
Copyright Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licencing Agency
Ltd., 90 Tottenham Court Road, London, England W1P 9HE. Applications for the copyright holder's permission to
reproduce any part of the publication should be addressed to the publishers.
● Declaration
The Author and Publisher have used their best efforts in ensuring the correctness of the information contained in
this book. They do not assume, and hereby disclaim, any liability to any party for any loss or damage caused by
errors or omissions in this book, whether such errors or omissions result from negligence, accident, or any other
cause.
All the programs given in the book are Copyright of the Author and Elektor International Media. These programs
may only be used for educational purposes. Written permission from the Author or Elektor must be obtained before
any of these programs can be used for commercial purposes.
Elektor is part of EIM, the world's leading source of essential technical information and electronics products for pro
engineers, electronics designers, and the companies seeking to engage them. Each day, our international team develops
and delivers high-quality content - via a variety of media channels (including magazines, video, digital media, and social
media) in several languages - relating to electronics design and DIY electronics. www.elektormagazine.com
●4
Contents
Chapter 1 • Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.8. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2. Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.7. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Chapter 4 • Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
●5
4.2. Endianness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5.3. Register x1 / ra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5.4. Register x2 / sp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5.6. Register x4 / tp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5.8. Register x8 / s0 / fp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5.9. Register x9 / s1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.11. Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
●6
5.4.3. Objdump . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.6. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.7. Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.11. Pseudo-Op mv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.15. Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.17. Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
●7
●8
●9
● 10
● 11
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
● 12
Introducing.... RISC-V!
With the availability of free and open-source C/C++ compilers today, you might wonder
why someone would be interested in assembler language. What is so compelling about the
RISC-V Instruction Set Architecture (ISA)? How does RISC-V differ from existing architec-
tures? And most importantly, how do we gain experience with the RISC-V without a major
investment? Is there affordable hobbyist hardware available?
The availability of the Espressif ESP32-C3 chip provides a student with one affordable way
to get hands-on experience with RISC-V. The open-sourced QEMU emulator adds a 64-bit
experience in RISC-V under Linux. These are just two ways for the student and enthusiast
alike to explore RISC-V in this book.
I fear that today, we've lost some of that passion for exploiting the machine. Our CPU
tower might possess only a power-on button and LED. But if we're lucky, it might provide
a disk activity LED as well. Embedded systems are better, sporting multiple LEDs that can
be utilized. Whether desktop or embedded device, we still program in high-level languages
like C/C++. While there is still joy in that, the thrill of the hunt may be lacking for those
thirsting for more. Getting that algorithm to execute even faster or more efficiently is not
just a badge of honor, but a matter of pride.
● 13
One deterrent to assembly language has been the extremely cluttered instruction sets of
today's architectures. The Intel and AMD instructions sets, for example, have been horri-
bly extended in the name of compatibility. RISC-V allows the enthusiast to sweep all that
clutter aside and start over with a clean slate. This makes things so much easier because
there is less to be learned.
Getting back to assembler language will bring back that thrill of the hunt that you may be
pining for. At the assembly level, the programmer directs every step of the machine. You
decide how the registers are allocated and apply every binary trick in your toolbox to make
that process slick. While we don't have hardware debug consoles anymore, we do have
powerful debugging tools like the GNU gdb debugger. There's never been a better time to
do assembly language than today.
One attractive area is writing custom floating-point algorithms in assembly language. For
each calculation, you can carefully evaluate which rounding method to use and check for
special exception cases at strategic places in the code. In C/C++, the tendency is to simply
code the formula.
Even if you decide that you have no need to write assembly language code, being able to
read it can be extremely helpful when debugging. Compilers, often at higher optimization
levels, can generate incorrect code (this happens more often than you think). Being able to
verify that the generated code is correct can save you from a great deal of guessing when
debugging. With a working knowledge, you can step through the code one instruction at a
time in a debugger and pinpoint the problem. Once the problem is revealed, you can then
decide on a work-around for the compiler or replace the errant code directly with some
assembler language code.
The RISC-V instruction set is also designed to be clean, unlike many existing architectures.
Today's Intel ISA is a bewildering mess of adapted and extended instructions. To be fair,
this was done to maintain code compatibility, but what a bewildering mess it has become.
● 14
Now that security is more important than ever, there is a pressing need for a simpler de-
sign.
Reduced Instruction Set Computer (RISC) architecture had its beginnings in research pro-
jects conducted by John L. Hennessy at Stanford University between 1981 and 1984. MIPS
(Microprocessor without Interlocked Pipeline Stages) was developed with the IBM 801 and
the Berkeley RISC projects. The Berkeley RISC project was led by David Patterson who
coined the term RISC. The thrust of these efforts was to develop a design that was simpler
and potentially faster than the CISC (Complex Instruction Set Computers) of that time.
Today, simplicity benefits the chip manufacturer in lowering the cost of chip development
and verification. Simplicity means fewer transistors, which leads to lower power require-
ments. A simple instruction set also reduces the complexity that the software developers
must face. Finally, the RISC-V ISA has provision to include vendor extensions, without
requiring any special approval. This can lead to surprising innovations.
This book has a different focus, but it is still fun! The projects in this book are boiled down
to the barest essentials to keep the assembly language concepts clear and simple. In this
manner, you will have "aha!" moments rather than puzzling about something difficult. The
focus of this book is on learning how to write RISC-V assembly language code without get-
ting bogged down. As you work your way through this tutorial, you'll build up small demon-
stration programs to be run and tested. Often the result is some simple printed messages
to prove a concept. Once you've mastered these basic concepts, you will be well equipped
to apply assembly language in larger projects.
● 15
A suitably modern desktop computer is required to program the ESP32-C3 and to run the
QEMU emulator, running Fedora Linux. If you lack 20 GB of free disk space, then that can
be remedied by plugging in a USB hard drive to add some disk space.
This book was developed specifically with the Espressif ESP32-C3 device (emphasis on the
"C3").
When purchasing, don't make the mistake of just buying just a chip. Be sure to purchase a
"devkit" that consists of a PCB with the GPIO breakouts, USB interface and the ESP32-C3
chip soldered onto it. You might also find it listed as "ESP-C3", but be careful. Make sure
this refers to the ESP32-C3, and not some other ESP32 variants (using the Xtensa CPU).
You'll also need an appropriate USB cable to flash and communicate with the devkit. The
projects in this book can run off of the power from the USB cable.
Figure 1.1 illustrates an early ESP32-C3 dev board purchased from AliExpress. This ver-
sion 1 dev board cannot support JTAG. Notice that these use a USB to serial interface chip
(CH340C), which can be seen in the photo. If you aren't concerned about JTAG support,
then these are otherwise suitable to use.
● 16
As time went on, Espressif came out with ESP32-C3 devices with JTAG support (version 3 or
later). Since these connect directly from the ESP32-C3 chip to the USB port, they don't use
a serial to USB converter chip. This makes it easy to identify the boards that support JTAG.
With a direct connection to the USB bus, the ESP32-C3 can perform JTAG functions over
the USB bus, as well as the usual serial communications. These are the preferred ESP32-C3
devkits to obtain. Figure 1.2 illustrates a JTAG-capable ESP32-C3 dev board sitting on a
breadboard.
● 17
The reader is expected to be familiar with basic file system navigation: changing directo-
ries, creating directories and copying files. In some cases, the editing of a script file may
be necessary. The Windows user should also be familiar with the Linux file system in this
regard when using QEMU (emulating Fedora Linux).
Knowledge of the C language is assumed. Code examples will consist of a C language main
program calling an assembly language function.
The QEMU examples use a RISC-V version of downloadable Fedora Linux. Consequent-
ly, some familiarity with the Linux command line is an asset even for Windows users.
The ESP32-C3 examples will use your native desktop environment for ESP development,
whether Linux, MacOS or Windows. In all cases, simple commands are issued to build and
run the test programs.
Finally, the reader is expected to do some software downloading and installation. Unfor-
tunately, there is a fair amount of this initially, but the good news is that it is all free and
open software. Once installed, the reader can then focus on the RISC-V concepts presented
in this book.
1.8. Summary
The next pair of chapters deal with installing your ESP-IDF (ESP Integrated Development
Framework) and the QEMU emulation software. This bit of work is necessary to let the good
times roll. So gather your disk space and enable your internet access and begin!
Bibliography
[1] The_RISC-V_Reader_An_Open_Architecture_Atlas_by_David_Patterson_Andrew_
Waterman.pdf
● 18
This chapter illustrates the steps necessary to get you up and running using the ESP32-C3
device. The first part of this chapter focuses on Linux and MacOS software installations.
This is what Espressif calls "Manual Installation". Espressif also supports IDE installations
for Eclipse or VSCode if you prefer. See their online documentation for that.
Windows users will want to skip to the later part of this chapter starting with the section
heading "Windows Install". This will guide you through the use of the downloaded Espressif
windows installer.
For either installation, you will need to plan for ample disk space and be connected to your
internet. The ESP-IDF on the Mac (for device ESP32-C3 only) requires about 1.5 GB of
disk. But the compilers and other tools will also require additional disk space as they are
installed. I recommend that you allocate a minimum of 10 GB of free disk space before you
proceed. The downloads are rather large and will take some time to complete. Choose a
time to install where you are not rushed.
Dev kits may include a USB to serial chip (USB-UART bridge) like the CP2102. My devices
used the CHG340 chip. Either bridge chip is ok, since we are only interested in the RISC-V
CPU in this book. Now you can purchase dev kits with JTAG support. These devices with
revision 3 or later use the USB facilities directly. Figure 2.1 illustrates two early examples
of revision 1 dev boards. The CH340C USB to serial converter chip is very conspicuous.
● 19
Figure 2.1: Two ESP32-C3 devices with revision 1 PCBs (using USB to serial converters).
https://docs.espressif.com/projects/esp-idf/en/latest/esp32c3/get-started/index.html
Once you arrive at that page, don't forget to choose the product type ESP32-C3 in the
upper left. Espressif supports multiple device types, and we are primarily interested in the
ESP32-C3 with RISC-V support. It should look something like Figure 2.2.
Scroll down the page until you see the link "Linux and MacOS". Click on the link to open a
new page of instructions.
● 20
1. Install Prerequisites.
2. Get ESP-IDF.
3. Setup the tools.
4. Set up the environment variables.
5. First Steps for ESP-IDF.
Note: If the instructions differ at all from this text, do follow the instructions found on
the website instead. Espressif may have made changes to the installation procedure by
the time you read this.
Install Prerequisites
The prerequisites will vary somewhat with the Linux distribution you're using. Check the
Espressif website for the latest updates by distro.
$ sudo apt install git wget flex bison gperf python3 python3-pip \
python3-setuptools cmake ninja-build ccache libffi-dev libssl-dev \
dfu-util libusb-1.0-0
You might wish to split these up into smaller steps, like the following:
CentOS 7 & 8
CentOS uses the yum installer, and the dependencies are listed as follows:
$ sudo yum -y update && sudo yum install git wget flex bison gperf \
python3 python3-pip python3-setuptools cmake ninja-build ccache \
dfu-util libusbx
● 21
Arch
Espressif has documented the following dependencies for the Arch distro:
$ sudo pacman -S --needed gcc git make flex bison gperf python-pip \
cmake ninja ccache dfu-util libusb
Espressif notes that CMake version 3.5 or newer is required. If you're running an older
distribution, you may need to update your system packages. For other Linux distributions
not listed here, use the above as a hint to the package names that you may need to add
or update.
MacOS
MacOS users usually install HomeBrew (recommended) or the MacPorts open-source pro-
jects to add functionality to their Mac. If you've not done that yet, you need to do that now.
Refer to the following sites for more information about this:
https://brew.sh/ (HomeBrew)
https://www.macports.org/install.php (MacPorts)
Espressif recommends that you also install ccache for faster build times. For HomeBrew,
use:
Note: Espressif indicates that if during the installation you encounter an error like the
following:
xcrun: error: invalid active developer path (/Library/Developer/Command-
LineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
Then you will need to install the Apple XCode command line tools in order to continue.
Install these with:
● 22
$ xcode-select --install
MacOS Python 3
Check the version of python you have installed with:
$ python --version
If it is a version older than 3, or is not present, then check the following (notice that the
command name is python3 this time):
$ python3 --version
For MacPorts:
Get ESP-IDF
At this point, you need to decide where to place your ESP-IDF software (not as root). I'm
going to assume in this book, that the directory will be named ~/espc3, where the tilda
(~) represents your home directory. You can choose a different directory name, by simply
substituting ~/espc3 for the name you prefer.
First, create the subdirectory to house your files in (starting from your home directory):
$ mkdir -p ~/espc3
$ cd ~/espc3
This git operation downloads several files and will take some time to complete. It is also a
good opportunity to take a break for your favourite beverage.
When the git operation completes, you will have the Espressif software downloaded into the
subdirectory ~/espc3/esp-idf.
● 23
$ cd ~/espc3/esp-idf
$ ./install.sh esp32c3
Performing this step will result in more files being downloaded and installed. This step
progresses much faster than the GitHub clone operation but can still require some time.
Perhaps your beverage needs a refill?
WARNING: You are using pip version 21.2.1; however, version 22.0.3 is available.
At the end of the installation, your session should have completed with the message:
$ . ./export.sh
At this point, you should run this script and watch for any error messages. Some errors you
might see may include:
• xtensa-esp32-elf
• xtensa-esp32s2-elf
• xtensa-esp32s3-elf
If you're only concerned about our ESP32-C3 device, which was not listed in error, you can
ignore these messages. If you also want to support these other devices (regular ESP32,
● 24
ESP32-S2 and ESP32-S3), you can follow the Espressif advice and run the installation
scripts suggested.
$ . ~/espc3/esp-idf/export.sh
Assuming that no critical error messages are reported, this sets up your terminal session to
build your ESP projects. Some users may wish to create a shorter way to do this. Espressif
recommends creating an alias like get_idf as follows:
$ get_idf
to establish your build environment. Depending upon the shell you use for your terminal
session, you might want to set the get_idf alias up in your ~/.profile or ~/.bashrc file, so
that it is automatically defined each time you log in.
Once that export.sh script has run, your environment will have the variable IDF_PATH set.
In this chapter, it would have the value "~/espc3/esp-idf". This allows you to use the shell
value $IDF_PATH in commands and scripts if you like.
$ cd $IDF_PATH/examples/get-started/hello_world
(or)
$ cd ~/espc3/esp-idf/examples/get-started/hello_world
● 25
$ idf.py build
The first time this runs for a given project, it will compile a lot, but don't be concerned.
Subsequent builds will not take so long. When it succeeds, the process should end with a
message:
$ ls /dev/cu*
/dev/cu.Bluetooth-Incoming-Port /dev/cu.usbserial-0001
This will list some that are already present. Don't be concerned if no devices show up. Now
plug in your RISC-V device's USB cable and list the files again:
$ ls /dev/cu*
/dev/cu.Bluetooth-Incoming-Port /dev/cu.usbserial-0001 /dev/cu.usbserial-1430
In this example, the device /dev/cu.usbserial-1430 was added. This is the device name we
need for flashing our device. Define it in the shell variable named PORT, so that you won't
have to type it each time:
$ PORT=/dev/cu.usbserial-1430
$ export PORT
To flash the example program to your device, try it now (the example output has been
abbreviated somewhat):
$ idf.py flash
Executing action: flash
Serial port /dev/cu.usbserial-1430
Connecting....
Detecting chip type... ESP32-C3
Running ninja in directory /Users/joe/espc3/esp-idf/examples/get-started/
hello_world/build
Executing "ninja flash"...
...
Chip is ESP32-C3 (revision 3)
Features: Wi-Fi
Crystal is 40MHz
● 26
MAC: 7c:df:a1:b4:44:94
Uploading stub...
Running stub...
Stub running...
Changing baud rate to 460800
Changed.
Configuring flash size...
Flash will be erased from 0x00000000 to 0x00004fff...
Flash will be erased from 0x00010000 to 0x00034fff...
Flash will be erased from 0x00008000 to 0x00008fff...
Compressed 19984 bytes to 12116...
Writing at 0x00000000... (100 %)
Wrote 19984 bytes (12116 compressed) at 0x00000000 in 0.7 seconds (effective
221.1 kbit/s)...
Hash of data verified.
Compressed 151072 bytes to 81515...
Writing at 0x00010000... (20 %)
Writing at 0x00019a2e... (40 %)
Writing at 0x00020360... (60 %)
Writing at 0x00027602... (80 %)
Writing at 0x0002dada... (100 %)
Wrote 151072 bytes (81515 compressed) at 0x00010000 in 2.7 seconds (effective
443.9 kbit/s)...
Hash of data verified.
Compressed 3072 bytes to 103...
Writing at 0x00008000... (100 %)
Wrote 3072 bytes (103 compressed) at 0x00008000 in 0.1 seconds (effective 262.5
kbit/s)...
Hash of data verified.
Leaving...
Hard resetting via RTS pin...
Done
If your session appeared similar to this, then congratulations are in order. You have flashed
your first RISC-V program to the device.
Running hello_world
To see the hello_world program run, we monitor it as follows:
$ idf.py monitor
$ idf.py monitor
Executing action: monitor
Serial port /dev/cu.usbserial-1430
Connecting....
Detecting chip type... ESP32-C3
● 27
● 28
After the device boots up, you will see the message "Starting scheduler". The next message
shown is the "Hello world!" that we were waiting for. Then the program counts down and
will reboot again until you terminate it. To stop monitoring, type Control-] (control plus the
right square bracket) and you should be returned to your shell. Congratulations, you ran
your first RISC-V program!
https://docs.espressif.com/projects/esp-idf/en/latest/esp32c3/get-started/
windows-setup.html
Limitations
Espressif lists the following limitations:
• The installation path of ESP-IDF and ESP-IDF Tools must not be longer than 90
characters.
[Installation paths that are too long] might result in a failed build. The installation
path of Python or ESP-IDF must not contain white spaces or parentheses. The in-
stallation path of Python or ESP-IDF should not contain special characters (non-AS-
CII) unless the operating system is configured with "Unicode UTF-8" support.
System Administrator can enable the support via Control Panel. Change date, time,
or number formats - Administrative tab - Change system locale - check the option
"Beta: Use Unicode UTF-8 for worldwide language support" - Ok and reboot the
computer.
● 29
https://dl.espressif.com/dl/esp-idf/?idf=4.4
They provide several different installers, which at the time of writing include:
In this guide, I've chosen the Universal Online Installer. Once the installer has been down-
loaded and launched, you'll be prompted to "Select Setup Language", which will be "En-
glish" by default.
Next is the "License Agreement", where you want to click "I accept the agreement" and
then click the "Next" button.
The installer does a "Pre-installation system check". If there were any warnings, click the
"Apply Fixes" button. Likely it will complain that you need Administrator access, which the
"Apply Fixes" button will correct. Once all is good, click on the "Next" button.
The next dialog "Download or use ESP-IDF" asks you to choose one of:
• Download ESP-IDF.
• Use an existing ESP-IDF directory.
In the "Version of ESP-IDF" dialog, choose the version of the software to download. At the
time of writing, the most current version was v4.4 (release version). It is probably best to
choose the latest version available. Click the "Next" button.
In the "Select Destination Location" dialog, you are asked where you want the software to
be installed. By default, the text "C:\Espressif" is used. If you need it on another drive or a
different directory, this is where you can choose it. Click the "Next" button.
In the "Select Components" dialog, you are presented with a tree of checkboxes. The one
area that you might want to deliberate is the "Chip Targets", where it lists the following
choices:
● 30
• ESP32 (optional)
• ESP32-C3 (make sure this remains checked for RISC-V)
• ESP32-S Series:
- ESP32-S2 (optional)
- ESP32-S3 (optional)
If you need to save disk space, then unselect the optional items listed above. Make certain
however that ESP32-C3 remains checked. Click the "Next" button.
At this point, the "Ready to Install" dialog box appears with a summary of your choices so
far. If the summary appears okay, then click the "Install" button.
If any prompt presents a "Do you want to allow this app to make changes to your device?"
question, then answer with a click on the "Yes" button.
You will also get prompts from "Windows Security" to the effect "Would you like to install
this device software?". Click on the "Install" button.
The software will then download and install for a considerable amount of time. Later on,
the "Installing ESP-IDF tools" dialog will appear towards the end. If you're feeling a craving
for coffee or tea, this would be a great time to make it. The download and installation may
take about 45 minutes depending upon your system and internet.
After the above is completed, you are presented with a "Completing the ESP-IDF Tools Set-
up Wizard" dialog. I suggest you leave all the options checked and click the "Finish" button.
Click on "Yes" if you are prompted with "Do you want to allow this app to make changes to
your device?".
Once all that has been completed, you should have two more start menu options:
Figure 2.3: The created ESP-IDF startup links in the start menu.
● 31
PS C:\Espressif\frameworks\esp-idf-v4.4>
In the text that follows, I will just show "PS C:>" as the prompt. Now change to the hel-
lo_world subdirectory:
PS C:> cd examples
PS C:> cd get-started
PS C:> cd hello_world
This configures the build for this project to compile for your RISC-V device. Once that com-
pletes, you can build your hello_world software:
The first time you build your project a lot of compiling of dependencies will occur. On suc-
cessive builds however, the process only rebuilds what is needed and is much faster. If all
went well, the command should complete with the message:
Now we must discover the name of your USB device when the ESP32-C3 is plugged in.
Open the Device Manager and expand the "Ports (COM & LPT)" after plugging in your
ESP32-C3 device. See Figure 2.4 for an example:
● 32
Figure 2.4: Locating the COM port for the ESP32-C3 plugged into the serial port.
In this example, we see that Windows has registered the ESP32-C3 as the device name
COM3. Your device may use a different name depending on the device hardware. My device
used a CH340 USB serial interface chip, and so the "CH340" shows up in the name.
Once you know the device name, you can flash it:
If all went well, you will see that it recognizes the device and uploads the compiled software
to it. In order to see hello_world run, we monitor it with:
Several message lines will appear, but eventually you will see some output similar to the
following:
● 33
load:0x403ce000,len:0x930
load:0x403d0000,len:0x2d40
entry 0x403ce000
I (30) boot: ESP-IDF v4.4 2nd stage bootloader
I (30) boot: compile time 15:55:20
I (30) boot: chip revision: 3
I (32) boot.esp32c3: SPI Speed : 80MHz
I (36) boot.esp32c3: SPI Mode : DIO
I (41) boot.esp32c3: SPI Flash Size : 2MB
I (46) boot: Enabling RNG early entropy source...
...
I (228) spi_flash: detected chip: generic
I (232) spi_flash: flash io: dio
I (236) sleep: Configure to isolate all GPIO pins in sleep state
I (243) sleep: Enable automatic switching of GPIO sleep configuration
I (250) cpu_start: Starting scheduler.
Hello world!
This is esp32c3 chip with 1 CPU core(s), WiFi/BLE, silicon revision 3, 2MB
external flash
Minimum free heap size: 329676 bytes
Restarting in 10 seconds...
Restarting in 9 seconds...
Restarting in 8 seconds...
Restarting in 7 seconds...
Restarting in 6 seconds...
After booting up, it will eventually reboot and display "Hello world!" as often as you allow it
to continue. To exit monitor mode, type Control-] (control key plus the right square brack-
et). Congratulations, you have successfully compiled, flashed and ran your first RISC-V
program!
2.4. Summary
This may have been a tedious chapter for getting your Espressif software ready. The good
news is that the difficult part is over and that your ESP32-C3 is now at your beckon call.
● 34
In this chapter, instructions are provided for installing QEMU for the RISC-V emulator. Run-
ning the QEMU emulator allows us to explore the 64-bit RISC-V machine without having
actual hardware for it. Additionally, it provides us debugger access to the machine in a
friendly Linux environment.
When there is no QEMU binary install package available for your Linux desktop, it can be
compiled from source code. The procedure for that is provided in this chapter. Otherwise,
Windows and MacOS users can download and use prebuilt binary installs instead. Windows
users should skip down to the section "Installing QEMU on Windows" near the end of this
chapter.
$ cp ~/riscv/repo/boot.sh ~/riscv/boot.sh
This provides a copy of the checked-out file at the top of your ~/riscv tree.
● 35
3.2. Windows
Windows makes the process a little more difficult, so follow these recommended steps:
1. Double-click your desktop ESP-IDF 4.4 CMD icon that was installed when you in-
stalled the ESP-IDF framework. Using this environment will give you instant access
to the ESP-IDF installed git command.
2. Change to the C:\riscv after you create the directory.
3. Type the following git command using forward slashes to clone the repository.
After this is done, there should be files populated in your C:\riscv\repo subdirectory.
Basic Steps
There are two basic steps required to get our RISC-V system environment setup:
In order to complete both of these steps, you need to determine if you have sufficient disk
space. You will need at least 10 GB of space for the Fedora Linux system image file. But
since it downloads as a compressed file, you need additional space to uncompress it. Plan
on about 20 GB (some of this space is only temporarily needed). You will also need about
500 MB for the installed QEMU software. If you plan to build QEMU from source code on Li-
nux, you will need about another 1 GB of disk space. These are rough estimates that should
provide you with minimum guidelines.
https://www.qemu.org/download/#macos
These installations use the Homebrew or the MacPorts collections. If you don't use either of
these yet, then you will need to install Homebrew or MacPorts first. I prefer the HomeBrew
(https://brew.sh) collection myself.
● 36
This installs all QEMU emulators, but check that the riscv64 version is the one installed
with:
$ qemu-system-riscv64 --version
QEMU emulator version 5.2.0 (Debian 1:5.2+dfsg-11+deb11u1)
Copyright (c) 2003-2020 Fabrice Bellard and the QEMU Project developers
At the time of writing, for example, an old Devuan 32-bit Linux system supported only the
following QEMU packages:
In this example platform, there is no qemu-system-riscv64 listed. If your Linux also lacks a
package for riscv64, then don't despair. It can be compiled and installed from source code.
If you now have QEMU support for riscv64 installed, then skip the build instructions and
resume at the section "Setup of Fedora Linux".
On the command lines shown, any text following the "#" character indicates a comment,
which is ignored by the shell (from the point of the "#" character to the end of the line).
These are comments about why the command is necessary etc. Don't type those in.
● 37
The entire process is relatively painless but may require considerable time on slower plat-
forms. Let your computer perform most of the work!
$ mkdir ~/work
$ cd ~/work
Depending upon your internet download speed, this might take a while. Once it has been
completed, change to the checked-out source code directory:
$ cd ~/work/qemu
Open-source projects often experience significant changes. So that you can follow this text
with fewer speed bumps, I strongly recommend that you use the same version as used in
the book. To do this, change the version of the checked-out source code as shown:
This will modify the source files to build the exact same version of QEMU that was used by
the author here.
Configuration
The configuration and build are performed in a subdirectory named "build" that you must
create. Create the subdirectory and then change to it:
$ mkdir -p ~/work/qemu/build
$ cd ~/work/qemu/build
● 38
If you ever need to start over because of problems, remove the build directory and recreate
it.
Next, configure the build using the supplied configure script. Be sure to supply the "--target-
list" option as shown because we only want the riscv64 version of the emulator. Otherwise,
you will end up building many emulators, which can take a very long time.
$ cd ~/work/qemu/build
$ ../configure --target-list=riscv64-softmmu
...
libdaxctl support: NO
libudev: NO
FUSE lseek: NO
Subprojects
libvhost-user: YES
If you don't have ninja installed, you may need to do so now (as root). For Devuan Linux
for example, the following is installed:
The package names often vary by Linux distro. Always choose a "dev" (or devel) version
of the package when it is available since it includes the header files needed for compiling.
If these package names don't work for you, then copy the error message to your browser
and find out what others used to solve the dependency.
After installing the missing dependencies, simply repeat the configure script as shown
above. If all goes well, you should get to the end of the configure script successfully.
Build QEMU
After the configure step ends successfully, QEMU can be compiled from the build directory
that you created.
$ cd ~/work/qemu/build
$ make
● 39
This build process can take a long time depending upon your Linux system. The compile can
proceed without further user involvement so take a moment to make a coffee or tea or take
your faithful dog for a walk in the park. At the end of the build, you should see messages
similar to these:
Install QEMU
Once QEMU has been successfully compiled, you can install its components as follows:
$ cd ~/work/qemu/build
$ sudo make install
Congratulations! You have installed QEMU! Now verify that the emulator is available from
the command line:
$ qemu-system-riscv64 --version
QEMU emulator version 5.2.95 (v6.0.0-rc5)
Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
$
If that fails, check your PATH variable. Because you built QEMU from a specific version of
the software, you should see the version number "v6.0.0-rc5".
● 40
$ mkdir ~/riscv
$ cd ~/riscv
https://dl.fedoraproject.org/pub/alt/risc-v/repo/virt-builder-images/images/
Fedora-Developer-Rawhide-20200108.n.0-fw_payload-uboot-qemu-virt-smode.elf
2020-01-13 16:48 592K
Fedora-Developer-Rawhide-20200108.n.0-fw_payload-uboot-qemu-virt-smode.elf.
CHECKSUM 2020-01-13 16:48 151
Fedora-Developer-Rawhide-20200108.n.0-sda.raw.xz
2020-01-13 16:55 1.4G
Fedora-Developer-Rawhide-20200108.n.0-sda.raw.xz.CHECKSUM 2020-01-13 16:58 125
If these particular files are no longer available when you read this, then choose the newer
versions of these files. Keep in mind that disk space requirements may differ slightly.
Download Files
Download the .elf file required as shown below. On MacOS or Linux use the convenient wget
command while in the ~/riscv directory:
● 41
$ wget 'https://dl.fedoraproject.org/pub/alt/risc-v/repo/virt-builder-images/
images/Fedora-Developer-Rawhide-20200108.n.0-fw_payload-uboot-qemu-virt-smode.
elf'
You should also download the CHECKSUM file for verification, to make sure it has not been
tampered with:
$ wget 'https://dl.fedoraproject.org/pub/alt/risc-v/repo/virt-builder-images/
images/Fedora-Developer-Rawhide-20200108.n.0-fw_payload-uboot-qemu-virt-smode.
elf.CHECKSUM'
Now verify the downloads with the sha256sum command. Note the checksums reported,
shown underlined here:
$ sha256sum 'Fedora-Developer-Rawhide-20200108.n.0-fw_payload-uboot-qemu-virt-
smode.elf'
5ebc762df148511e2485b99c0bcf8728768c951680bc97bc959cae4c1ad8b053 Fedora-
Developer-Rawhide-20200108.n.0-fw_payload-uboot-qemu-virt-smode.elf
$ cat 'Fedora-Developer-Rawhide-20200108.n.0-fw_payload-uboot-qemu-virt-smode.
elf.CHECKSUM'
SHA256 (Fedora-Developer-Rawhide-20200108.n.0-fw_payload-uboot-qemu-virt-smode.
elf) = 5ebc762df148511e2485b99c0bcf8728768c951680bc97bc959cae4c1ad8b053
Because the checksums match, we have confidence that the file has not been tampered
with. You can now discard the CHECKSUM file if you wish.
We want the "developer" images of Fedora Linux to save us the trouble of installing compil-
ers and other build tools. It also gives us an image with enough disk space to work in. So,
download the following Fedora disk image file:
$ wget 'https://dl.fedoraproject.org/pub/alt/risc-v/repo/virt-builder-images/
images/Fedora-Developer-Rawhide-20200108.n.0-sda.raw.xz'
and its matching CHECKSUM file. Once again, verify the checksum on the compressed file.
When the checksums agree, decompress the file system image:
$ unxz Fedora-Developer-Rawhide-20200108.n.0-sda.raw.xz
$ ls -l Fedora-Developer-Rawhide-20200108.n.0-sda.raw
With the .elf and .raw files available, it is now time to boot into Fedora Linux using QEMU.
● 42
$ cp ~/riscv/repo/boot.sh ~/riscv/boot.sh
The local copy of this script is available for your own customization. Review the section
Book Source Code earlier, if necessary, if you've not yet checked out the source code.
#!/bin/bash
exec qemu-system-riscv64 \
-nographic \
-machine virt \
-smp 2 \
-m 2G \
-kernel Fedora-Developer-Rawhide-20200108.n.0-fw_payload-uboot-qemu-virt-
smode.elf \
-bios none \
-object rng-random,filename=/dev/urandom,id=rng0 \
-device virtio-rng-device,rng=rng0 \
-device virtio-blk-device,drive=hd0 \
-drive file=Fedora-Developer-Rawhide-20200108.n.0-sda.raw,format=raw,id=hd0 \
-device virtio-net-device,netdev=usernet \
-netdev user,id=usernet,hostfwd=tcp::10000-:22
If your system fails to find the qemu-system-riscv64 command, then its directory needs to
be added to your PATH environment variable. MacOS for example, may have the emulator
installed in /usr/local/bin when using HomeBrew. If this directory is not in your PATH, then
add it now:
$ PATH="/usr/local/bin:$PATH"
$ cd ~/riscv
$ ./boot.sh
● 43
But be patient and with time, it will progress beyond that point. The entire process can
be quite lengthy on a little 32-bit Linux with one CPU. Modern systems should manage it
better. Once Fedora comes up, you should be greeted with the following information:
Fedora/RISC-V
-------------
Koji: http://fedora.riscv.rocks/koji/
SCM: http://fedora.riscv.rocks:3000/
Distribution rep.: http://fedora.riscv.rocks/repos-dist/
Koji internal rep.: http://fedora.riscv.rocks/repos/
fedora-riscv login:
After the boot process is completed, you can log in using riscv as login and fedora_
rocks! as the password.
You can log in at the console, but you may prefer to login using ssh instead (note that the
account name is riscv):
SSH is enabled so you can ssh into the VM through port 10000 using the following
command.
● 44
You may find that a simple ssh command as follows is good enough:
Root Access
You will need root access to administer your Fedora Linux system from time to time. To
change to root from the riscv account, use sudo using the pasword fedora_rocks!:
We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:
[riscv@fedora-riscv ~]$ df -k .
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/vda4 9539456 5345064 4080428 57% /
[riscv@fedora-riscv ~]$ bc -l <<<'4080428/1024/1024'
3.89139938354492187500
From the example shown, there are nearly 4 GB of free space. When you boot up a new
instance of Linux there is a strong urge to perform system updates. Resist that temptation
because by doing so, you could end up running out of space once the updates are applied.
SSH Authentication
Unfortunately, Fedora Linux requires a long password, which is very annoying in a hosted
environment like this. If you want easier SSH access, follow the optional instructions found
at https://kb.iu.edu/d/aews to set up public key authentication. This will allow you to log
in without a password. Otherwise, simply choose a password of the required length that is
easy enough for you to use.
● 45
This creates ~/riscv/riscv (or /home/riscv/riscv). Then check out the book source code
using (don't forget the -b master option to check out the main branch):
This places the book's source code in the directory ~/riscv/repo (which is /home/riscv/
riscv/repo in Fedora Linux). If you check it out exactly like this, the text references in the
book will match.
Fedora Shutdown
Now that your RISC-V instance of Fedora Linux is running, let's review how to shut it down.
If things go terribly wrong, like hanging at boot time, it is usually possible to ^C (Control-C)
out of it in the session where you launched ~/riscv/boot.sh. If that fails, you can kill the
process using the kill command.
A normal shutdown is performed in Fedora Linux from the root in the usual way:
https://www.qemu.org/download/#windows
First click on "Windows" and then you will see the text below:
Stefan Weil provides binaries and installers for both 32-bit and 64-bit.
Click on the appropriate link for your version of Windows. Navigate through the directory of
offered versions and then download and run the installer chosen. The download is approx-
imately 192 MB in size.
● 46
QEMU Install
Once you have downloaded the installer, launched it, and answer Yes to the prompt "Do you
want to allow this app from an unknown publisher to make changes to your device?" Select
your language (English by default) and click OK.
In the "Welcome to QEMU Setup" dialog, click "Next". Agree to the license by clicking "I
Agree". In the "Choose Components" dialog, leave everything checked that is checked
and click "Next". Next, you will choose a folder into which you place the installed files. By
default, it will be "C:\Program Files\qemu" (a recommended choice for this book). Change
it or leave it as you wish and click "Install". Then click "Finish" when it completes the in-
stallation. Figure 3.1 illustrates the contrast in physical sizes of the two systems: ESP32-C3
sitting on top to the Dell Windows 10 tower below.
Figure 3.1: Size contrast of the ESP32-C3 dev board with a typical Windows-10 tower.
● 47
Now you can download the .elf file needed and the .raw image files:
After the .elf file is downloaded, download the raw image file.
Note: If you don't have the wget command, use your browser to download the image
files and move them into the C:\riscv directory.
If you want to verify the checksums, see the Linux section about using sha256sum. You'll
likely also need to install the sha256sum command under Windows. This step is optional.
The image file needs to be decompressed. There are freely downloadable apps like the
ones at https://tukaani.org/xz/, or if you have PKZIP already installed, that will be able to
uncompress it. After uncompressing the file, you should have the following two files ready
for use:
C:\riscv\Fedora-Developer-Rawhide-20200108.n.0-fw_payload-uboot-qemu-virt-smode.elf
C:\riscv\Fedora-Developer-Rawhide-20200108.n.0-sda.raw
With the .elf and .raw files ready, it is time to boot into Fedora Linux using QEMU.
@REM boot.bat
PATH="C:\Program Files\qemu:%PATH%"
cd C:\Program Files\qemu
qemu-system-riscv64 -nographic -machine virt -smp 2 -m 2G ^
-kernel \riscv\Fedora-Developer-Rawhide-20200108.n.0-fw_payload-uboot-qemu-virt-
smode.elf ^
-bios none ^
-device virtio-blk-device,drive=hd0 ^
-drive file=\riscv\Fedora-Developer-Rawhide-20200108.n.0-sda.raw,format=raw,id=hd0
^
-device virtio-net-device,netdev=usernet ^
● 48
-netdev user,id=usernet,hostfwd=tcp::10000-:22
The very long command line is broken into segments using the caret (^) character. Edit this
file if you have different drive letters, directory, or file names. Once your .bat file is ready,
launch QEMU to boot up Fedora Linux:
PC C:> cd \riscv
PC C:> .\boot.bat
If this fails to launch, check for spelling errors in the file names in the boot.bat file and cor-
rect them. The boot process may seem to hang when coming up but don't despair:
With time, it will progress beyond that point. Once it comes up, you should be greeted with
the following information:
Fedora/RISC-V
-------------
Koji: http://fedora.riscv.rocks/koji/
SCM: http://fedora.riscv.rocks:3000/
Distribution rep.: http://fedora.riscv.rocks/repos-dist/
Koji internal rep.: http://fedora.riscv.rocks/repos/
fedora-riscv login:
● 49
See the prior section "Linux/MacOS Boot Test". For more information about logging into
Fedora Linux:
• Root Access
• Check Disk Space
• SSH Authentication
• Source Code Checkout for Fedora
• Fedora Shutdown
Related to this, I found that the Windows version of QEMU may sometimes stall at shut-
down. Pressing RETURN at the console got it going again. Note also that the console may
not be the best place to login since the terminal support may not support ANSI escape se-
quences, affecting its display. Use SSH for login, perhaps from PuTTY (https://www.putty.
org) or from a Linux or MacOS machine. Use the ipconfig command in the PowerShell to list
your network address, if necessary. Once you know that, you can log in with SSH from any
network connected machine using ssh:
Make sure that you specify the port as ten thousand and not as one thousand, as the author
is prone to do. If you still run into trouble, try the verbose version of the command below:
3.7. Summary
This chapter has been an intense software setup chapter. But having this out of the way
and the QEMU emulator running RISC-V in 64-bit mode under Linux, MacOS or Windows
opens many doors for our exploration in the remainder of the book. Fasten your seat belt.
● 50
Chapter 4 • Architecture
Before a programmer can program in assembly language, he/she needs to know something
about the machine's architecture. What registers are available? What flag bits are evaluat-
ed? How is memory accessed? What are the available opcodes? In short, we need to know
how the machine is organized.
In the following sections, the RISC-V value of XLEN is the number of bits for the corre-
sponding architecture. For the ESP32-C3 for example, XLEN is 32 (for 32 bits). For the
QEMU emulated RISC-V 64-bit CPU, XLEN is 64 bits. So let's start with the program counter
register.
4.2. Endianness
The RISC-V memory architecture is little endian based and is byte addressable. Little endi-
an means that a word (of 4 bytes) orders the bytes with least significant bytes first (lowest
addressable unit). The last (highest addressed) byte of the word is the most significant
byte. Figure 4.1 illustrates two 32-bit words with byte addresses and their word values at
right.
● 51
Byte 78563412
Adress
12 34 56 78 ➔
0 1 2 3
Byte 9A BC DE F0 ➔ F0DEBC9A
Adress
4 5 6 7
Figure 4.1: RISC-V Byte Order (Little Endian). All numbers expressed in hexadecimal.
Given that the word length is 4 bytes, a normal instruction word fetch will increment the
program counter by 4. For compressed instructions consisting of 2-byte parcels an instruc-
tion fetch will increment the program counter by some multiple of 2, depending upon the
instruction's length.
The width of the register is defined by constant XLEN in the given architecture. So, reg-
isters are 32-bits wide for the ESP32-C3 where XLEN=32. Registers are 64-bits wide for
the QEMU emulated CPU where XLEN=64. For most architectural subsets (discussed in the
next section) have 32 general purpose registers available. These are known as registers
X0 through X31. We shall see that they can also be referred to by different names and that
register X0 has a special talent.
The most basic architectural subset is RV32I. This defines the most basic opcodes, register
set and memory operations as a "base architecture". The "32" indicates that the architec-
tural width, is 32 bits (XLEN=32). So, in RV32I we know that the register size is 32 bits
wide.
● 52
A minimal RISC-V 64-bit implementation is defined as the RV64I subset. This expands upon
RV32I by widening the registers to 64 bits (XLEN=64). A few additional operations are add-
ed to allow for loading and storing of 64-bit register values.
There also exists the RV32E subset for use in embedded processors where minimum cost
and low power are the focus. In this unusual subset, with the number of registers reduced
to 16 but keeping with the register width of 32 bits (XLEN=32). The reasoning for this sub-
set is explained in [1]:
We have found that in the small RV32I core designs, the upper 16 registers consume
around one quarter of the total area of the core excluding memories, thus their removal
saves around 25% core area with a corresponding core power reduction.
There are other basic subsets such as the RV128I, which widens XLEN to 128. But we'll
focus mainly on RV32I and RV64I in this book. It is easiest to learn when the abstract is
replaced with the concrete.
There are other capabilities described by RISC-V such as RV32M for example, which defines
the operations necessary for integer multiply and divide. RV32C describes the provision for
compressed instruction opcodes, allowing for greater code density. The ESP32-C3 device
implements RV32IMC, which indicates that the basic subset RV32I, multiply and divide
RV32M and compressed opcodes RV32C subsets are supported (XLEN=32). We'll review
the features of the QEMU 64-bit CPU later in this book.
● 53
x3 gp Global Pointer -
x4 tp Thread Pointer -
The register x0 (zero) is special to RISC-V in that it always returns the value zero when
used as a source register or discards a value when it is used as a destination. This permits
some creativity in the opcode's effect. It is specified as x0 or simply as zero.
The remainder of registers, from x1 to x32 are each XLEN bits wide and can be used as
source or destination. Each register has an ABI name assigned to make it easier to write
code that uses the registers in a consistent way. For example, there are temporary registers
that are named t0 through t6.
The rightmost column of Table 4.1 indicates who saves the register when saving is required,
according to the GNU calling convention. Either the calling code (caller) or the called code
(callee) must save the register, if it is modified. Registers marked with "-" like registers gp
or tp, have no requirement for being preserved. The description field indicates how these
registers are typically used.
You might wonder how to handle multi-precision unsigned integers without a Carry flag bit,
for example. This and similar problems will be solved later in this book.
● 54
4.5.3. Register x1 / ra
The x1 or ra register is traditionally used as the return address register. This is by GNU
calling convention and is not hard-wired. When a function is called, traditionally the return
address is put into register ra. There are opcodes used by the calling code that arrange this.
4.5.4. Register x2 / sp
According to the GNU calling convention, the x2 or sp register is used as the stack pointer.
This is by convention only. The code that is called (callee) is responsible for establishing
this value. The called code normally adds a value to the current stack pointer to adjust the
stack.
4.5.5. Register x3 / gp
According to the GNU calling convention for RISC-V, this register serves as a global pointer.
This is by convention only. The use of this value may vary according to the platform.
4.5.6. Register x4 / tp
By GNU convention, this register is used for thread local storage access. The establishment
of this value may vary by the platform supported.
4.5.8. Register x8 / s0 / fp
The x8 register, has two basic purposes: s0 is the first saved register value, while fp rep-
resents a stack frame pointer. This value must be saved and restored by the called (callee)
code if the register value is modified.
4.5.9. Register x9 / s1
Register x9 or s1, is similar to register s0. It is a second saved register value. The called
(callee) code must save and restore this value when it is modified.
● 55
The Espressif ESP32-C3 device, for example supports the RV32IMC ISA. This indicates
that the base integer subset "I", and the "C" compressed extensions apply. The "M" in the
RC32IMC indicates that the "M" extension is also supported, providing hardware multiply
and divide instructions. The "32" by way of review means that the registers are 32-bits
wide (XLEN=32).
There is another subset designated as the RV32E, intended for small embedded systems.
This is a reduced version of RV32I so that instead of 32 registers, only the first 16 registers
are available. This reduces the number of transistors required and also reduces the power
requirements. In chapter 3 of "The RISC-V Instruction Set Manual", it is stated that:
● 56
RV32E was designed to provide an even smaller base core for embedded microcontrol-
lers. ... However, given the demand for the smallest possible 32-bit microcontroller, and
in the interests of preempting fragmentation in this space, we have now defined RV32E
as a fourth standard base ISA in addition to RV32I, RV64I, and RV128I. The E variant is
only standardized for the 32-bit address space width.
This change requires a different calling convention and ABI. In particular, RV32E is only
used with a soft-float calling convention. Systems with hardware floating-point must use
an I base.
Table 4.2 provides a list of most of the RISC-V base subsets and extensions. This list is
incomplete because some standards are still evolving.
Base
RV32I Base Integer Instruction Set, 32-bit
RV32E Base Integer Instruction Set (embedded), 32-bit, 16 registers
RV64I Base Integer Instruction Set, 64-bit
RV128I Base Integer Instruction Set, 128-bit
Extension
M Standard Extension for Integer Multiplication and Division
A Standard Extension for Atomic Instructions
F Standard Extension for Single-Precision Floating-Point
D Standard Extension for Double-Precision Floating-Point
Shorthand for the IMAFDZicsr Zifencei base and extensions, intended
G
to represent a standard general-purpose ISA
Q Standard Extension for Quad-Precision Floating-Point
L Standard Extension for Decimal Floating-Point
C Standard Extension for Compressed Instructions
B Standard Extension for Bit Manipulation
J Standard Extension for Dynamically Translated Languages
T Standard Extension for Transactional Memory
P Standard Extension for Packed-SIMD Instructions
V Standard Extension for Vector Operations
K Standard Extension for Scalar Cryptography
N Standard Extension for User-Level Interrupts
H Standard Extension for Hypervisor
S Standard Extension for Supervisor-level Instructions
Table 4.2: Partial list of base subsets and extensions of the RISC-V ISA.
● 57
processor : 1
hart : 1
isa : rv64imafdcsu
mmu : sv48
processor : 2
hart : 2
isa : rv64imafdcsu
mmu : sv48
processor : 3
hart : 3
isa : rv64imafdcsu
mmu : sv48
From this list, it is apparent that the following RISC-V base and extensions are supported:
● 58
Finally, the "u" that is listed just means that "user mode" (vs "supervisor mode") is sup-
ported. The "S" extension is necessary for Unix/Linux/*BSD systems that protect inde-
pendent processes from corrupting each other's memory.
The "m", "s" and "u" modes together will be used by hardware (and emulators) that sup-
port Unix/Linux/*BSD type systems. The supervisor mode separates the execution of spe-
cialized kernel code (mode "s") from the unprivileged user code ("u" mode). This protects
one process from another and provides strict access to shared resources like memory and
peripherals.
The column labeled MISA in Table 4.3 refers to a special register with bits that define the
ISA and extensions supported.
ID Mode MISA
m machine
d debug
What does all of this mean for you? On the ESP32-C3, you will strictly operate in machine
mode. Whatever is supported by that hardware will be open to you to use and abuse. Thus
for the ESP32-C3, for example, you will be able to inquire of the MISA special register to
determine the level of support available, among other things.
Under Fedora Linux using the QEMU emulator, however, you will be running code in the
"u" user mode. This means that your code will be restricted in what it can do, including
the reading of the MISA register. If you try to read the MISA register, your code will fail.
To determine the level of support available, you must rely on the Linux kernel's special file
/proc/cpuinfo.
● 59
4.11. Summary
By this point in this chapter, I hope you are chomping at the bit to do some assembly-level
code. With the basics of RISC-V out of the way, you are ready to write some code. So, let's
get started!
Bibliography
[1] The RISC-V Instruction Set Manual.
https://riscv.org/wp-content/uploads/2019/06/riscv-spec.pdf.
● 60
You've been introduced to the RISC-V ISA. You've installed the software. Now it is time to
play! So let's get started with assembly language programming. To do that, we'll embark
on a very small but digestible project. This will introduce several concepts without too much
detail. When you complete this chapter, you will be ready to take on more advanced topics
in assembly language programming.
For the remainder of this book, I will simply use QEMU to refer to the qemu-system-riscv64
emulator. Now let's examine how the memory models differ between these two machine
architectures in C language terms.
sizeof(int) = 4
sizeof(long) = 4
sizeof(long long) = 8
sizeof(void*) = 4
These sizes are in bytes. We could summarize that the int, long integers and pointers
share the same bit width of 32-bits (4 bytes). This fits the RV32I architecture where the
registers are 32-bits in width. C programmers might be more familiar with the equivalent
ILP32 model, which originates from the Solaris C language data model (Integers, Long and
Pointers are 32-bits). While we didn't report it above, the sizeof(short) is 16-bits and the
● 61
character (byte) is 8-bits in size. The C/C++ compiler will also support the long long data
type but multi-precision arithmetic is used on 32-bit platforms to achieve it.
sizeof(int) = 4
sizeof(long) = 8
sizeof(long long) = 8
sizeof(void*) = 8
In other words, long integers and pointers are 64 bits wide (8 bytes), while sizeof(int)
remains at 32 bits. This is equivalent to the LP64 Solaris model (Long and Pointers are 64-
bit, while Integers are assumed to be 32 bits). Again, the sizeof(short) is 16 bits and the
character 8 bits as before. The long long integer data type is also 64 bits, as it was for RV32
(or ILP32). However, the RV64 platform can naturally compute in 64 bits. Clint Eastwood
in the movie Dirty Harry says that "a man's gotta know his limitations". Now that we are
familiar with the two memory models, we know our data type limitations.
Table 5.1 summarizes the data type characteristics for the two different RISC-V profiles
that we'll use.
char 8 8
short 16 16
int 32 32
long 32 64
long long 64 64
pointer 32 64
While the register widths differ in these two models, the instruction set remains largely
the same. When a 32-bit value is loaded into a 64-bit register, the value is automatical-
ly sign-extended into the 64-bit register. When the same register is stored into a 32-bit
memory location, only the low order 32-bits are saved. In other words, the main impact of
different register sizes will occur in the loading from and storing to memory. To accomplish
● 62
64-bit loads and stores, additional opcodes were added to RV64I. For now, just be aware of
this and keep it in the back of your mind.
The main program is written as if it were calling another C function. The C compiler doesn't
even know that the function add3() in this example is written in assembler language. The
main program for this example is illustrated in Listing 5.1 and will be run from Fedora Linux
under QEMU. Throughout this book, I'll use line numbers at the left of each listing for ease
of reference. They are not part of the source file.
1 // main.c
2
3 #include <stdio.h>
4
5 extern int add3(int arg1,int arg2,int arg3);
6
7 int
8 main(int argc,char **argv) {
9 int a=23, b=24, c=25;
10 int r = 0;
11
12 r = add3(a,b,c);
13 printf("r = %d\n",r);
14 return 0;
15 }
● 63
• Line 3 is used to bring in support for the printf() function used in line 13.
- The extern keyword informs the C compiler that the function add3 is
external to the current source file.
• Line 9 declares three integers a, b and c and initializes them with values.
• The function add3 is invoked in line 12, with the result assigned to the variable
r (declared in line 10).
• Line 13 reports the sum returned in r. And finally, line 14 just returns an exit
code that Linux expects from the main program.
Keep in mind that the data type int is 32 bits in size. This applies to variables a, b, c and r.
1 .global add3
2 .text
3 add3: add a0,a0,a1 # a0 = a0 + a1
4 add a0,a0,a2 # a0 = a0 + a2
5 ret # return value in a0
Now that you've seen the assembly language source file, let's discuss the general format
used.
● 64
Labels are optional but when used, start in the first column, followed by a colon. The label
"add3:" is an example of a label found in line 3 of Listing 5.2.
The second field is the operation code (opcode), which can be an actual instruction opcode
like "add" (see lines 3 and 4) or a pseudo operation like .global or .text. Pseudo opcodes
influence the working of the assembler as it progresses from the start of the source file to
its end.
Many opcodes require operands. These are listed in the third field and have subfields sep-
arated by commas.
Optionally, a fourth field starting with '#' marks the start of a comment on the line. A '#'
in the first column indicates that the entire line is a comment. These lines are ignored by
the assembler.
While the above represents the general convention for an assembler source line, GNU's
assembler is somewhat forgiving. For example, you could have spaces preceding the label
or spaces between the label and the colon character. I have used tab characters to separate
the fields, but you can use spaces instead.
If the assembler source file was to call a function like printf() for example, the symbol printf
should also be listed as a .global (there will be examples of this later in the book). Refer-
ences and defined symbols can be listed in the .global operands field. Multiple symbols can
be given on one line, for example:
By default, all undefined symbols are assumed by the assembler to be external. However, it
is best practice to list all global symbols defined or referenced in the program. As a result,
it is easier to distinguish between an error of omission and an actual external symbol. This
becomes critical in larger projects.
● 65
.section .text
The .text section is assumed by default at the start of the assembly, but it is best to be
specific. If nothing else, it helps the reader of your code know where the instructions are
going to be assembled.
The label "add3" is defined as a global symbol because it was listed as a .global reference.
The RISC-V instruction set architecture consists mainly of register-to-register and register
load/store operations. The "add" instruction is a register-to-register operation that has
three operands. Two registers are source registers (shown underlined) and one destination
register. With the shown add instruction, register a0 is added to a1 with the result replacing
the contents of register a0.
In this RV64I code, the int variables are still 32-bits in size (review Table 5.1). However,
as these values are loaded into 64-bit registers, the values are sign extended to 64 bits.
The C language compiler will expect the 32-bit integer result to be returned in register a0
(x10).
Note: Throughout this book, the friendly (ABI) register names are used. For example,
we will refer to the register a0 instead of x10. Either is legal in the source code but the
friendly names are easier to work with. It is also kinder to the reader.
In line 3 (Listing 5.2), the value of the first argument arrives in register a0, with the second
argument in a1. These two registers are added together, and the result replaces the value
in a0. In C language terms, it amounts to:
a0 += a1;
Line 4 repeats the "add" instruction but this time referencing the third argument in register
a2:
● 66
add a0,a0,a2
This time the result of a0 + a2 replaces a0. In C language terms, the pair of instructions
can be summarized as:
a0 += a1;
a0 += a2;
The calculated result is left in register a0 prior to the return to main. The opcode "ret" in line
5 causes the assembler routine to return to the caller. This works because the caller's re-
turn address has been placed in register ra (or x1 – think "return address") when our add3
routine was called. The ret instruction will be examined in more detail later in the book.
$ cd ~/riscv/repo/05/add3/qemu64
$ gcc -O0 -g main.c add3.S
$ ls -l a.out
-rwxrwxr-x. 1 riscv riscv 13560 Apr 19 21:13 a.out
The compiled result is in file a.out, which can now be executed. Option -O0 (dash capital oh
zero) should be used to prevent any compiler optimization. Compilers today are very good
at optimizing and might precompute the result without even calling the assembler routine
at all. So we disable optimization to prevent that. The -g option is optional here, but I en-
courage you to use it in case you want to step through the code using a debugger like gdb.
We'll make use of the debugger later.
The C compiler will compile main.c as a C program and add3.S as an assembler source
module. After those successfully produce object modules, they are linked into a final exe-
cutable named a.out. To execute our program under Linux use:
$ ./a.out
r = 72
● 67
Note: It is also possible to place assembly language into file add3.s (lowercase 's') but
this causes the assembly to proceed without the benefit of the C preprocessor. Later in
this book, we will want to use the macro capabilities of the preprocessor to make the
assembly code portable. It is, therefore, recommended that you get used to using the
capitalized suffix .S instead.
At this point, you can shut down your Fedora Linux instance (QEMU). We're now going to
exercise the same code on the ESP32-C3 device.
Note: Linux and MacOS users will want to initialize their ESP-IDF in a new terminal win-
dow with the alias established in chapter 2, or to do so manually as follows:
$ . ~/espc3/esp-idf/export.sh
Windows users can start a session just by double-clicking their ESP-IDF 4.4 CMD icon.
The ESP32 main program is somewhat different than what is used on Linux. Listing 5.3
illustrates:
1 #include <stdio.h>
2
3 extern int add3(int one,int two,int three);
4
5 void
6 app_main(void) {
7 int a=23, b=24, c=25;
8 int r = 0;
9
10 r = add3(a,b,c);
11 printf("r = %d\n",r);
12 }
The main difference for ESP32 is that the main program is named app_main, taking no
arguments and returning no value. The assembler file ~/riscv/repo/05/add3/main/add3.S
is otherwise identical to the file used for QEMU before.
● 68
It's always a good idea to start with a clean slate, so let's clean the project directory using
the ESP-IDF:
$ cd ~/riscv/repo/05/add3
$ idf.py fullclean
Executing action: fullclean
Done
This guarantees that any partially built objects from experiments and failed compiles are
removed so that your build will be done from scratch.
Now build your ESP32-C3 version of the add3 program. The first time you build after a
fullclean, it can take a long time. But later builds will be swifter due to cached compiles:
$ idf.py build
Executing action: all (aliases: build)
Running cmake in directory /Users/ve3wwg/riscv/repo/05/add3/build
Executing "cmake -G Ninja -DPYTHON_DEPS_CHECKED=1 -DESP_PLATFORM=1 -DIDF_
TARGET=esp32c3 -DCCACHE_ENABLE=0 /Users/ve3wwg/riscv/repo/05/add3"...
...
Executing "ninja all"...
[10/955] Generating ../../partition_table/partition-table.bin
Partition table binary generated. Contents:
*******************************************************************************
# ESP-IDF Partition Table
# Name, Type, SubType, Offset, Size, Flags
nvs,data,nvs,0x9000,24K,
phy_init,data,phy,0xf000,4K,
factory,app,factory,0x10000,1M,
*******************************************************************************
[502/955] Performing configure step for 'bootloader'
...
Bootloader binary size 0x4500 bytes. 0x3b00 bytes (86%) free.
[954/955] Generating binary image from built executable
esptool.py v3.2-dev
Merged 1 ELF section
...
● 69
Now flash the device and monitor its output. In the example below, your path for the device
will likely differ from the path /dev/cu.usbserial-1430 shown (underlined). Windows users
will specify a COM port instead.
Leaving...
Hard resetting via RTS pin...
Executing action: monitor
Running idf_monitor in directory /Users/ve3wwg/riscv/repo/05/add3
...
ESP-ROM:esp32c3-api1-20210207
Build:Feb 7 2021
rst:0x1 (POWERON),boot:0xc (SPI_FAST_FLASH_BOOT)
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fcd6100,len:0x15cc
load:0x403ce000,len:0x8ec
load:0x403d0000,len:0x25e8
entry 0x403ce000
I (30) boot: ESP-IDF v4.4-dev-2359-g58022f8599 2nd stage bootloader
I (30) boot: compile time 16:00:41
I (30) boot: chip revision: 3
...
I (256) cpu_start: Starting scheduler.
R = 72
When the program exits app_main, it just stalls within the ESP32 framework. Press Con-
trol-] (Control right square bracket) to exit the monitor process. Notice that the sum is
reported correctly as 72. Congratulations, you have executed your first RISC-V instructions
on the ESP32-C3!
Figures 5.1 and 5.2 illustrate a typical dev board PCB offering that includes a tiny OLED
display purchased from AliExpress.
● 70
● 71
$ cd ~/riscv/repo/05/add3
$ ~/riscv/repo/listesp main/add3.S
GAS LISTING /var/folders/jp/_ktfnf412kvbdznr4m9769tr0000gn/T//cc8FupVj.s
page 1
1 # 1 "main/add3.S"
1 .global add3
0
0
2 .text
3 0000 2E95 add3: add a0,a0,a1 # a0 = a0 + a1
4 0002 3295 add a0,a0,a2 # a0 = a0 + a2
5 0004 8280 ret # return value in a0
DEFINED SYMBOLS
main/add3.S:3 .text:0000000000000000 add3
NO UNDEFINED SYMBOLS
The invoked script merely invokes gcc to produce an assembler listing using:
gcc -c -Wa,-a,-ad $*
The option -Wa indicates the further options are to be passed to the assembler, providing
assembler options -a (high-level listing) and -ad (drop debug information from the listing).
The $* is replaced with the name of the assembler source file.
● 72
The source line number is reported at the extreme left of the listing. These are useful ref-
erences for error and informational messages. Lines starting with zero have no source file
line.
The first listing line shows "# 1 "main/add3.S", indicating the source file that was assem-
bled. Opcodes and pseudo-ops are shown in the center of each line. Line 3 shows the first
assembled line of code:
The four hex digits after the line number indicate the relative address of the assembled
instruction. Notice how line 4 shows the relative address 0002, indicating that the add in-
struction in line 3 was only two bytes in length. Recall that the ESP32-C3 supports RV32IMC,
with the "C" indicating support for compressed instructions of 16 bits instead of 32. We'll
revisit this idea shortly.
The next four hex digits show the assembled instruction. For example, line 4 shows the in-
struction as 3295 in hexadecimal. In some cases, as we will see, the instruction length may
be longer. There will also be times that the assembler will truncate what is shown there,
because of the limits of the line length. In some cases, the assembler will show a temporary
instruction code because of relocation performed by the linker.
The assembler listing will also show defined symbols. The following shows that line 3 of
main/add3.S defines a symbol named add3, which is located in the .text section of mem-
ory. The value of the symbol will be a relative value in the listing since this is adjusted at
link time by the linker.
DEFINED SYMBOLS
main/add3.S:3 .text:0000000000000000 add3
The next section will list any undefined symbols, if there are any. It's a good practice to
quickly scan this looking for symbols that should not be undefined.
NO UNDEFINED SYMBOLS
● 73
Listing 5.5 shows the result of adding -march=rv32im to the listesp script. The option
-march=rv32im is passed to gcc, which informs the assembler to support only the RV32IM
base and extension, omitting the "C" extension.
Note: For Fedora Linux, change to ~/riscv/repo/05/add3/qemu64, and then use the
script at ~/riscv/repo/list.
$ cd ~/riscv/repo/05/add3
$ ~/riscv/repo/listesp -march=rv32im main/add3.S
GAS LISTING /var/folders/jp/_ktfnf412kvbdznr4m9769tr0000gn/T//cct6UCZb.s
page 1
1 # 1 "main/add3.S"
1 .global add3
0
0
2 .text
3 0000 3305B500 add3: add a0,a0,a1 # a0 = a0 + a1
4 0004 3305C500 add a0,a0,a2 # a0 = a0 + a2
5 0008 67800000 ret # return value in a0
DEFINED SYMBOLS
main/add3.S:3 .text:0000000000000000 add3
NO UNDEFINED SYMBOLS
Here we see that both add instructions are now 4 bytes in length. The address of the second
add instruction is now 0004, rather than 0002, since the first instruction is now 3305B500
in hex. So, no compressed instructions were generated. Even the ret instruction (return) is
now 4 bytes in length.
● 74
Be sure to place the -march=rv32im option in quotes for the batch file processor.
5.4.3. Objdump
Sometimes as a developer, you want to disassemble the contents of a compiled or assem-
bled object file. Other times, you may not have the source file for the object. The objdump
takes many options, but in the ESP32-C3 environment, we can dump out the generated
add3.o (this file was generated as part of producing the listing file) as follows:
$ riscv32-esp-elf-objdump -d add3.o
00000000 <add3>:
0: 00b50533 add a0,a0,a1
4: 00c50533 add a0,a0,a2
8: 00008067 ret
Here we see that the object file is reverse engineered by the objdump command, in a
format similar to the listing file (this one used -march=rv32im). We are reminded by the
output, that the format is little endian by the line:
This is an important thing to remember since RISC-V machines are byte-addressable, and
the word format is little endian. The objdump command simply lists the bytes 00b50533,
as the bytes increase in address. The assembler listing displays the word value 3305B500
instead. But with little endian addressing, byte 00 would be the lowest addressed byte, and
byte 33 would be placed at the highest of the 4-byte sequence.
Note: For Fedora Linux, use objdump as the command name. In Windows you would use
the riscv32-esp-elf-objdump command.
5.6. Summary
You've succeeded in assembling RISC-V code for the RV64 and RV32 platforms. You've
linked your assembly code with a C main program and run it successfully. Knowing the
organization of the assembly language source file and the optional listing report puts power
into your hands. Finally, you've experienced the first instance of how the C program calls
your assembly language code. In the next chapter, we'll examine the instructions needed
to load from and store to memory.
● 75
In the last chapter, we applied the "add" instruction to sum two registers and place the
result into a third register. This is a register-to-register operation. But how do we get val-
ues into a register and how do we save them back to memory? This chapter examines the
RISC-V operations used for loading and storing.
Because RV32 is a 32-bit (XLEN-32) platform with 32-bit registers, the ESP32-C3 does
not have instructions for loading or storing double words. The RV64 ISA (XLEN=64) does,
however.
From this we can list the following basic assembler instructions for loading values:
● 76
$ cd ~/riscv/repo/06/loads
C:> cd \riscv\repo\06\loads
The example is illustrated in Listing 6.1, defining four assembler routines: loadb(), loadh(),
loadw() and loadd(). This source file demonstrates how to provide multiple functions in
one assembly file. Each of the symbols loadb, loadh, loadw and loadd can be thought of as
multiple "entry points".
1 .global loadb,loadh,loadw,loadd
2
3 .text
4 loadb: lb a0,byte # Load a byte
5 ret
6 loadh: lh a0,hword # Load half word
7 ret
8 loadw: lw a0,word # Load a word
9 ret
10 loadd: lw a0,dword # Load lower word of dword
11 lw a1,dword+4 # Load upper word of dword
12 ret # return value in a0
13
14 .data
15 byte: .byte 1
16 hword: .half 0xF509
17 word: .word 0x0708090A
18 dword: .dword 0xCCBBAA99887766
The functions are defined in C language terms in Listing 6.2. There we see that function
loadb(), loadh() and loadw() all return a 32-bit integer value. The last function, loadd() will
return a 64-bit long long integer type, even on the RV32 platform.
#include <stdio.h>
void
app_main(void) {
● 77
printf("loadb() = %08X\n",loadb());
printf("loadh() = %08X\n",loadh());
printf("loadw() = %08X\n",loadw());
printf("loadd() = %016llX\n",loadd());
}
The RISC-V 32-bit integer is always returned in register a0. So line 4 of Listing 6.1 loads a
byte into the 32-bit register a0 and returns to the caller in line 5. The byte value loaded is
1 (defined in line 15). The half-word value is defined as the value 0xF509 in line 16. Think
about what you expect that return value will be in the C program (and hold that thought).
Finally, the entry point loadw loads the 32-bit integer value defined in line 17.
Notice that in the entry point loadd we had to perform two-word loads. This is because the
registers are only 32 bits in size for the ESP32-C3, so that the upper half of the double word
has to be returned in register a1 instead for XLEN=32 platforms.
Build, flash and execute this program on the ESP32-C3. Specify your device port (under-
lined) after the -p option according to the device port that it appears on (or Windows COM
port).
$ idf.py build
...
$ idf.py -p <<<your-port>>> flash monitor
...
--- idf_monitor on /dev/cu.usbserial-146410 115200 ---
--- Quit: Ctrl+] | Menu: Ctrl+T | Help: Ctrl+T followed by Ctrl+H ---
...
I (256) cpu_start: Starting scheduler.
loadb() = 00000001
loadh() = FFFFF509
loadw() = 0708090A
loadd() = 00CCBBAA99887766
(Type Control-] to exit)
What did you observe by running the program? The loadb() function reported as 00000001
as expected. So we know that the byte loaded into the 32-bit register as expected. The
loadh() function however reported FFFFF509 rather than 0000F509 as you might have
expected. If this surprises you, recall that the values are sign extended to the width of the
register. Since the high order bit of the half-word was a 1-bit (line 16), that sign bit was
extended to the full width of the register. Figure 6.1 illustrates the sign extension process.
Finally function loadw() loaded and returned a 32-bit value just fine.
● 78
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 1 1 1 0 1 0 1 0 0 0 0 1 0 0 1
sign extend
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 0 0 0 1 0 0 1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
The loadd() function was designed to return a long long integer which is 64-bits in length.
Since the registers are only 32-bits in size, the value was returned in the a0 and a1 register
pair. The low order word in a0, and the high order word in a1.
Let's now examine the assembler listing for the program you just ran:
$ ~/riscv/repo/listesp main/loads.S
1 # 1 "main/loads.S"
1 .global loadb,loadh,loadw,loadd
0
0
2
3 .text
4 0000 17050000 loadb: lb a0,byte # Load a byte
4 03050500
5 0008 8280 ret
6 000a 17050000 loadh: lh a0,hword # Load half word
6 03150500
7 0012 8280 ret
8 0014 17050000 loadw: lw a0,word # Load a word
8 03250500
9 001c 8280 ret
10 001e 17050000 loadd: lw a0,dword # Load lower word of dword
10 03250500
11 0026 97050000 lw a1,dword+4 # Load upper word of dword
11 83A50500
12 002e 8280 ret # return value in a0
13
14 .data
15 0000 01 byte: .byte 1
● 79
DEFINED SYMBOLS
main/loads.S:4 .text:0000000000000000 loadb
main/loads.S:6 .text:000000000000000a loadh
main/loads.S:8 .text:0000000000000014 loadw
main/loads.S:10 .text:000000000000001e loadd
main/loads.S:15 .data:0000000000000000 byte
main/loads.S:16 .data:0000000000000001 hword
main/loads.S:17 .data:0000000000000003 word
main/loads.S:18 .data:0000000000000007 dword
main/loads.S:4 .text:0000000000000000 .L0
main/loads.S:6 .text:000000000000000a .L0
main/loads.S:8 .text:0000000000000014 .L0
main/loads.S:10 .text:000000000000001e .L0
main/loads.S:11 .text:0000000000000026 .L0
NO UNDEFINED SYMBOLS
Notice the addresses shown left of the .byte, .half etc. data definitions. They increase start-
ing from zero because they assemble in their own .data section (not .text, which is normally
reserved for code). The defined symbols reported at the bottom also reflect this:
Notice how the assembler has indicated that these are defined in the .data section.
● 80
All of these values would wind up in SRAM on the ESP32-C3. Under Fedora Linux (QEMU),
they are collected into .data virtual memory pages, that are read/write capable. The GNU
C compiler prefers to gather these into the .sdata section, rather than .data. But the effect
is the same.
We could have defined these particular values in .text since these values are never modi-
fied. These values would be protected as read-only. In C/C++ terms, defining those values
in .text is almost equivalent to:
The ESP32-C3 device keeps the .text values in flash. Fedora Linux (QEMU) places .text into
virtual memory pages that permit only read and execute permissions. The C/C++ state-
ments shown above actually get placed into the section named ".srodata". The name of
this section suggests Static Read-Only Data. This is preferred over the section .text since it
indicates that no execute permission should be provided.
Note: The GNU compiler places C/C++ static const values into the section named
.srodata for RV32I and RV64I. Non-const static C/C++ values likewise go into the .sda-
ta section instead.
If you comment out the line containing the .data pseudo-op, the values will be defined in
the .text section instead. Let's try it for fun. After you comment the .data line out, repeat
the build, flash and run of the program. Does it still work? Now examine the listing for it,
shown in Listing 6.4.
1 # 1 "main/loads.S"
1 .global loadb,loadh,loadw,loadd
0
0
2
3 .text
4 0000 17050000 loadb: lb a0,byte # Load a byte
4 03050500
5 0008 8280 ret
6 000a 17050000 loadh: lh a0,hword # Load half word
6 03150500
7 0012 8280 ret
8 0014 17050000 loadw: lw a0,word # Load a word
8 03250500
9 001c 8280 ret
10 001e 17050000 loadd: lw a0,dword # Load lower word of dword
10 03250500
● 81
DEFINED SYMBOLS
main/loads.S:4 .text:0000000000000000 loadb
main/loads.S:6 .text:000000000000000a loadh
main/loads.S:8 .text:0000000000000014 loadw
main/loads.S:10 .text:000000000000001e loadd
main/loads.S:15 .text:0000000000000030 byte
main/loads.S:16 .text:0000000000000031 hword
main/loads.S:17 .text:0000000000000033 word
main/loads.S:18 .text:0000000000000037 dword
main/loads.S:4 .text:0000000000000000 .L0
main/loads.S:6 .text:000000000000000a .L0
main/loads.S:8 .text:0000000000000014 .L0
main/loads.S:10 .text:000000000000001e .L0
main/loads.S:11 .text:0000000000000026 .L0
NO UNDEFINED SYMBOLS
Listing 6.4: the loads.S listing with the .data pseudo-op commented out.
Now notice the relative addresses of the data values. They all have addresses after your
code and in the .text section. The last ret instruction in line 12 has a reported address of
002E, so the byte that follows in .text has an address of 0030.
● 82
Figure 6.2 illustrates the operation of an unsigned load of a half-word into a 32-bit register.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
zeros 1 1 1 1 0 1 0 1 0 0 0 0 1 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 1 0 1 0 0 0 0 1 0 0 1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Figure 6.2: the process of loading an unsigned half-word into a 32-bit register.
Notice that RV32I does not have a "lwu" instruction because there is no need for it. The
register is only 32-bits in size so there is no sign extension involved. But for RV64I, there
is indeed a need because its registers are 64 bits wide. Loading a 32-bit value can be sign
extended (lw) or not (lwu) for RV64I.
There is also an RV128I specification where registers are 128-bits wide, which needs the
"ldu" instruction but we won't be concerned with RV128 in this book. If you learn RV32I and
RV64I, then you will be well prepared for larger architectures.
For best performance, the effective address for all loads and stores should be nat-
urally aligned for each data type (i.e., on a four-byte boundary for 32-bit accesses,
and a two-byte boundary for 16-bit accesses). The base ISA supports misaligned
accesses, but these might run extremely slowly depending on the implementation.
Furthermore, naturally aligned loads and stores are guaranteed to execute atomi-
cally, whereas misaligned loads and stores might not, and hence require additional
synchronization to ensure atomicity
From this, we can conclude that if performance isn't an issue, misaligned data loads and
stores are supported. But if you are concerned about the best possible performance, then
your data should be aligned. Further, if you are performing advanced programming where
atomic values are needed, then you must align your data to guarantee that the operation
is indeed atomic.
● 83
where n is 1, 2, 4 or 8 (or more). The optional fill parameter specifies what to fill each byte
with and defaults to zero. The third optional parameter specifies the maximum number of
bytes to pad with when specified. The location is padded with fill bytes until it is aligned
to the value of n, or the maximum (when given) is reached. For example, to align a 32-bit
value, use n=4 (4-byte entity). To align a half-word, use 2, and for a 64-bit double-word
use 8. When using .balign in the .text section, the fill parameter defaults to the noop in-
struction.
14 .data
15 byte: .byte 1
16 .balign 2
17 hword: .half 0xF509
18 .balign 4
19 word: .word 0x0708090A
20 .balign 8
21 dword: .dword 0xCCBBAA99887766
Let's check the assembler listing now. The changes to the listing shown below:
14 .data
15 0000 01 byte: .byte 1
16 0001 00 .balign 2
17 0002 09F5 hword: .half 0xF509
18 .balign 4
19 0004 0A090807 word: .word 0x0708090A
20 .balign 8
21 0008 66778899 dword: .dword 0xCCBBAA99887766
21 AABBCC00
From this, it is evident that the alignment directives had their desired effect. For example,
the location was 0001 after the declaration of "byte". But after the ".balign 2" directive,
a pad byte was added to bring the location of "hword" to 0002. Similar alignments were
unnecessary for the others, since they already had locations suitably aligned.
We can also see the effect in objdump, as shown below (after using the listesp script to
generate a listing and object file):
● 84
Notice that, before the F509 (in little endian format) is defined, the byte value 01 is fol-
lowed by a 00 byte, to align it on a half word.
6.7. Experiment
Return and edit the program in ~/riscv/repo/06/main/loads.S, so that instead of the "lh"
instruction, it uses the "lhu" instruction instead. The affected loadh code should look like
this after the edit:
Then rebuild, flash, and run the program again. You should get the following results:
loadb() = 00000001
loadh() = 0000F509
loadw() = 0708090A
loadd() = 00CCBBAA99887766
This time, the loadh() value was reported as 0000F509, without the sign extension. This is
the special talent of the "lhu" opcode.
Imagine that at some point in your algorithm, all you need to do is to increment the value
of register a1 by 1. To do so without immediate data requires that you code something like
this:
lb t1,one # t1 = 1
add a1,t1,zero # a1 += t1
...
.section .srodata
one: .byte 1
● 85
li t1,1 # t1 = 1
add a1,t1,zero # a1 += t1 + zero
The assembler listing might look a little confused for pseudo-ops like "li". For example,
assembling just the above code, produces a listing:
1 # 1 "t.S"
1 li t1,1 # t1 = 1
0
0
2 0002 B3050300 add a1,t1,zero # a1 += t1
But the assembler listing did not report any code for line 1. Code was however, reserved by
the evidence of the address of line 2 (the address is 0002, indicating that the prior instruc-
tion was two bytes in length). If we ask objdump about what was produced in the object
file we get:
00000000 <.text>:
0: 4305 li t1,1
2: 000305b3 add a1,t1,zero
There is positive proof that the first instruction was 2 bytes, followed by a 4-byte instruc-
tion. You might be wondering why all the smoke and mirrors for the "li" opcode? It turns
out that the assembler has to jump some hurdles for certain constants. For small constants
like the number 1, the constant can be embedded into a compressed instruction (the keen
student is encouraged to examine the RISC-V instruction formats in [1]). For larger con-
stants, the pseudo-op is expanded by the assembler into multiple instructions as neces-
sary. Consider the following code:
li t1,0xFEEDBEEF # t1 = 0xFEEDBEEF
add a1,t1,zero # a1 += t1
The assembler in this case expands the "li" pseudo-op into two RISC-V opcodes:
00000000 <.text>:
0: feedc337 lui t1,0xfeedc
4: eef30313 addi t1,t1,-273 # 0xfeedbeef
8: 000305b3 add a1,t1,zero
The first is generated as a "lui" opcode, with the upper portion of the constant (0xFEEDC).
This new instruction is the "Load Upper Immediate" opcode. This immediate data is placed
into the upper 20 bits of the destination register t1 so that the value becomes 0xFEEDC000
● 86
(the lower 12 bits are zeroed). The second generated instruction uses an "addi" (add im-
mediate) instruction to add the signed value -273 to t1 (effectively subtracting 273). When
that instruction completes, t1 will hold the value 0xFEEDBEEF as intended.
Let's check this in the gdb debugger. Tool gdb is a very useful for developers. It not only
debugs but allows you to perform hexadecimal calculations. Try typing the underlined text
into gdb as follows and type q to quit:
$ gdb
GNU gdb (GDB) 11.2
Copyright (C) 2022 Free Software Foundation, Inc.
...
(gdb) p /x 0xFEEDC000 - 273
$1 = 0xfeedbeef
(gdb) q
Note: Windows users must use the command name “riscv32-esp-elf-gdb” instead of
“gdb”.
Here we use the gdb "p" (print) command, "/x" to print the result in hexadecimal, subtract-
ing 273 (decimal) from FEEDBEEF (in hexadecimal). The result is displayed as 0xfeedbeef.
The printed result confirms the computation is indeed correct. This also proves that gdb is
indeed your friend. We'll see more of gdb later in this book.
Notice that the "addi" instruction is an add instruction with some "immediate data" capa-
bility. Many of the basic RV32I and RV64I instructions have an immediate data version to
allow a signed constant to be applied instead of a second source register.
addi a1,a1,1 # a1 += 1
This is far more convenient and efficient than the earlier attempts. What is assembled,
however, is going to depend upon the size of the constant used. In some cases, the assem-
bler may report an error for cases that cannot be directly encoded.
6.11. Pseudo-Op mv
Register values sometimes need to be copied or moved from one register to another. The
general form of this opcode is this:
mv rd, rs
● 87
The source register rs is copied to the destination register rd, where the destination reg-
ister is not x0 (zero). When, however, the destination register is x0 (zero), that operation
performs as a "noop" (no operation).
The "mv" is a pseudo-op because, with a small trick, it is possible to do this without creating
a new instruction. The pseudo-op also permits the assembler to optimize the move request.
The following two instructions are equivalent, except for their size:
The effect of moving register t1 to t0 can be achieved by adding t1 to zero (x0) and placing
the result in t0, in the first example. However, if you use the "mv" pseudo-op, the assem-
bler can substitute a compressed opcode (16 bits) in place of a 32-bit opcode, when it is
permitted to do so.
$ cd ~/riscv/repo/06/loads/qemu64
$ gcc -O0 -g loads.S main.c
$ ./a.out
loadb() = 00000001
loadh() = FFFFF509
loadw() = 0708090A
loadd() = FFFFFFFF99887766
The loads.S program for RV64I is illustrated in Listing 6.4. This program is identical to the
ESP32-C3 version except that line 10 uses the 64-bit opcode "ld" to load the double word
into the 64-bit register a0, where the value is returned.
1 .global loadb,loadh,loadw,loadd
2
3 .text
4 loadb: lb a0,byte # Load a byte
5 ret
6 loadh: lh a0,hword # Load half word
7 ret
8 loadw: lw a0,word # Load a word
9 ret
10 loadd: ld a0,dword # Load lower word of dword
11 ret # return value in a0
12
13 .data
● 88
14 byte: .byte 1
15 hword: .half 0xF509
16 word: .word 0x0708090A
17 dword: .dword 0xCCBBAA99887766
$ cd ~/riscv/repo/06/aligned/qemu64
$ gcc -O0 -g aligned.S main.c
$ ./a.out
In Fedora Linux, produce a listing with the ~/riscv/repo/list script. Use command name
"objdump" to dump an object file.
The name, as we've seen before is a section name like .data, .sdata or .srodata etc. Pre-
defined sections with these names have default attributes associated with them. The flags
argument, when present, must be enclosed in double-quotes and consist of one or more of
the following flag characters:
The allocatable attribute (a) means that space can be reserved but is not otherwise initial-
ized. This means that the region is not necessarily zeroed or otherwise initialized (some
platforms may zero this before calling the main entry point however). The writable attribute
(w) is used for data areas that may be updated, like static data variables in a C program.
Finally, the execute attribute (x) is used to indicate executable code. Under Linux, this
permits the execution of code in the region and the lack of this permission prevents the
execution of data.
Once a section has its flags defined (with some very special exceptions), they cannot be
changed. It is thus very important that sections be defined consistently.
● 89
The @type argument, when present, may be one of the following values:
We can create our own section like .srodata, and name it .precious as follows:
This declares a memory section that can be allocated, but lacks write and execute permis-
sion. Because of the "@progbits" type, there can be initialized data within it, resulting in a
section of read-only data.
Table 6.1 contains the most commonly encountered predefined section names. There are
other specialized names used by C++ that are not shown for use in constructors and de-
structors, for example. You as the application developer are free to create section names
of your own.
Notice that the ".bss" section just allocates space for data, but does not define or initialize
any data values for it. Some platforms may zero these sections prior to invoking the main
function.
● 90
Since global addresses can be large, a store is usually performed in reference to a base reg-
ister and a fixed offset. To help us in this regard, the assembler (and indirectly the linker)
provides the use two special functions:
Consequently, for global memory store operations, the following pattern is often used:
The "lui" operation loads the high order 20 bits of the address symbol into t0 in this ex-
ample, setting the lower 12 bits to zero. Then the "sw" opcode stores register a0 into
the address computed by t0 plus the 12-bit offset returned from the assembler function
%lo(symbol).
$ cd ~/riscv/repo/06/celcuius
Listing 6.5 illustrates the main program that is going to call upon our Fahrenheit to Cel-
sius assembler conversion using integer arithmetic. In this example, we are using global
integers to pass and return the values to illustrate the load and store operations in the
assembler. The calculations are performed in an integer, so the input value temp_f10 (line
5) is the temperature in Fahrenheit multiplied by 10. The resulting value temp_c10 (line 6)
is likewise the integer result times ten. The printf() statement (lines 13 to 17) reports the
values with the necessary format adjustments.
1 #include <stdio.h>
2
3 extern void convtemp();
4
5 int temp_f10 = 400; // Fahrenheit degrees * 10
6 int temp_c10 = 0; // Computed result: Celsius * 10
7
8 void
9 app_main(void) {
10
11 convtemp(); // Convert temp_f10 to temp_c10
12
● 91
1 .global convtemp,temp_f10,temp_c10
2
3 .text
4 convtemp:
5 lw t0,temp_f10 # t0 = F * 10
6 addi t0,t0,-320 # t0 = (F * 10) - (32 * 10)
7 li t1,10 # t1 = 10
8 mul t0,t0,t1 # t0 *= 10
9 li t1,18 # t1 = 1.8 * 10
10 div t0,t0,t1 # t0 = F * 100 / 1.8 * 10
11 lui t1,%hi(temp_c10)
12 sw t0,%lo(temp_c10)(t1) # t0 = Celsius * 10
13 ret
This function does not return a value, so our calculation is performed in temporary register
t0. Line 5 loads the global int value temp_f10. The signed immediate value -320 is added
to t0 before it is multiplied by 10 in lines 7 and 8. We've not looked at "mul" and "div" yet,
but they take two source operands and produce a result in the destination register. These
are both signed integer computations.
Once the temp result in t0 is multiplied by 10, the value is divided by 18 (1.8 times 10), to
produce a result in t0 (line 10). This will be in degrees Celsius times 10. Then temporary t1
is loaded with the high order 20 bits of the global integer temp_c10 (line 11). Finally, in line
12, we can store the register t0 into the global variable using the offset of %lo(temp_c10)
added to t1. After the routine returns in line 13, the C program reports the answer:
Now let's re-examine the "lw" opcode in line 5 of Listing 6.6. The objdump utility will report
something like the following:
● 92
The value is not known until the link step is performed, so the immediate constant shown
is simple 0x0, which is fixed up later by the linker. By dumping out build/celsius.elf for the
ESP32-C3 device, we see that the instruction gets converted to the following instruction:
In other words, the ESP32-C3 linker script has established a global pointer in register gp,
and the word is loaded using an offset of -1996 from it. If your brain is feeling a little bit of
hurt right now, don't dispair. We can use a more friendly bit of code to do the same thing.
Listing 6.7 contains the source code for celsius2.S.
1 .global convtemp,temp_f10,temp_c10
2
3 .text
4 convtemp:
5 lui t0,%hi(temp_f10)
6 lw t0,%lo(temp_f10)(t0) # t0 = F * 10
7 addi t0,t0,-320 # t0 = (F * 10) - (32 * 10)
8 li t1,10 # t1 = 10
9 mul t0,t0,t1 # t0 *= 10
10 li t1,18 # t1 = 1.8 * 10
11 div t0,t0,t1 # t0 = F * 100 / 1.8 * 10
12 lui t1,%hi(temp_c10)
13 sw t0,%lo(temp_c10)(t1) # t0 = Celsius * 10
14 ret
The cryptic load is now replaced with the more familiar "lui" and "lw" instructions in lines
5 and 6. Line 5 loads the high order 20 bits of the absolute address for global temp_f10
into t0. Then the low order 12 bits are added to that to form an address of the word to be
loaded. That word replaces t0 with the contents of the memory word that we were after.
If you build and flash that project, then running it on your ESP32-C3 will confirm that this
still works.
RV64I Run
Start up your Fedora Linux instance in QEMU, and perform the following:
$ cd ~/riscv/repo/06/celsius2/qemu64
$ gcc -O0 -g celsius2.S main.c
$ ./a.out
40.0 F -> 4.4 C
● 93
Your result should match what you got for the ESP32-C3 run.
6.15. Review
We've covered a lot of ground in this chapter, wading into the various options for loading
from and storing to global memory, immediate values, moving register values, memory
alignment and section attributes. Let's summarize this and review the instructions that
you've learned in this chapter.
To load values from memory, you can use the following opcodes using a memory offset and
base register rx, to load into the destination register rd:
Likewise, the following opcodes may be used to store register rs into memory using a mem-
ory offset and base register rx:
The assembler also supports the load immediate form, for loading constants:
li rd, immediate
mv rd, rs
We also saw the use of the special "lui" and "auipc" instructions:
● 94
Note that some of these instructions like "li" and "mv" for example, are adjusted according
to the assembler based upon the size of the address/immediate data involved. Finally, we
briefly saw and used the multiply and divide instructions, which will be explored later in
the book.
It might at first appear that the load and store opcodes are rather clumsy because of the
need to prepare a base register and then use an offset from that. This is true perhaps for
global memory locations. Most memory accesses today, however, are relative to a stack
frame, for which an offset and a stack frame pointer are ideal. We'll explore stacks and
stack frames later in this book.
There are two more useful functions for the programmer. They are:
These functions are used to provide a relative address constant. When we examine
branches, we will see more relative addresses from the current instruction counter (PC).
6.17. Summary
This chapter has introduced you to the critical operations of loading from and storing to
memory. It was shown that immediate data can provide for more optimal code. Mastering
these approaches is an essential start to successful RISC-V assembly language program-
ming. The next chapter will take a detailed look at the essentials of the GNU calling con-
vention.
Bibliography
[1] The RISC-V instruction set manual volume I: User-level Isa (n.d.). Retrieved April
23, 2022, from https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf, 2.2
Base Instruction Formats
● 95
In order for compilers, assemblers and linkers to produce a final working executable, an
agreement must exist between them on the use of the stack and the registers used. Since
we're using the GNU compiler collection in this book, let's discuss the GNU calling conven-
tion.
x3 gp Global Pointer -
x4 tp Thread Pointer -
● 96
In chapter 4, Architecture, the function of these registers has already been described. In
this chapter, we'll focus on the call and return.
1. Copies the current PC + n, which is the address following the current instruction,
to the destination register rd.
2. Adds the relative offset to the address PC + n to begin execution at the called
location.
Normally the destination register is ra (x1), but any other register can be used, including
x0. When x0 is specified, the return address is discarded, and the operation becomes a
jump instead. When rd is not x0, the return address is saved there, and execution resumes
at the call address.
jal rd,symbol
The destination register rd can be omitted, if you're using the standard ra (x1) register for
this call:
jal symbol
In both cases, the symbol is converted by the assembler into a half-word offset from the
current instruction address. This offset is used to jump to the new routine address as part
of the call.
jr rs
● 97
The operation of this instruction sets the PC to the value in register rs. In effect, it is a jump
through register operation. This causes execution to resume after the point of the call. Why
is "jr" a pseudo-instruction? The assembler expands this to the instruction:
jalr x0,rs,0
Effectively this instruction sets the PC to the value of rs + 0 and, because no return value
is saved, it simply becomes a jump.
ret
But how does that differ from the operations we've just reviewed? The "ret" is a pseudo-op-
code that is expanded by the assembler into:
jalr x0,ra,0
By the GNU calling convention, the return address is saved in register ra (x1). This allows
us to return to the caller by jumping to the address in ra (x1).
1. The caller performs a "jal" or "jalr", causing the return address to be placed in rd,
which is normally ra (x1).
2. The PC jumps to the offset + rs to start executing the called code.
3. The called code, returns by "ret", or "jr ra", or "jalr x0,ra,0".
Unless you're using a different register than ra (x1), the "ret" pseudo-opcode is recom-
mended since this is easily understood and clear. We'll see later that there are sometimes
reasons to use a different register.
This pseudo-op is expanded into a pair of instructions "auipc" and "jalr" to accomplish the
far call. The expansion of the normal case of "call foo", would be:
● 98
1: auipc ra,%pcrel_hi(foo)
jalr ra,ra,%pcrel_lo(1b)
Recall that the assembler function %pcrel_hi(symbol) returns the high order 20 bits of the
symbol and %pcrel_low(symbol) returns the lower 12 bits. The "auipc" step loads the high
order 20 bits of the foo address into ra (x1) temporarily, which is the word offset from the
current address in PC. The "jalr" instruction then branches to the temporary value in x1
plus the low order 12 bits of foo's address provided by %pcrel_low(1b). This arrives at the
PC for the far away subroutine. Upon completion of the "jalr" opcode, the return address
(of PC) is saved in ra (x1).
Numeric Labels
Before we can fully explain the "%pcrel_lo(1b)" part of this code, we need to explain the
numeric label of "1" and the reference to it as "1b". GNU assembler permits numeric labels
like "1" and references to these are either "1b" or "1f", indicating the first "1" back, or the
first "1" forward. This clever system permits labels to be reused without having to invent
unique names. This is especially useful in assembler macros. These numeric labels never
register as external symbols.
Returning to the "auipc" instruction, we can see that it loads the high order 20 bits of foo's
address into register ra (x1), with the PC added to it. The x1 register is used temporarily
for this. The "jalr" instruction which follows then adds the low order 12 bits of foo's address
to ra (x1) to compute the target subroutine address. The assembler function %pcrel_lo(1b)
must reference the address of the "auipc" instruction where %pcrel_hi(foo) was used. Oth-
erwise, there is a chance that the calculation might be incorrect.
This might seem like a lot to digest right now. The important concept here is to understand
that a "call" is expanded into a pair of instructions to make a relative far call possible, when
necessary.
1. The caller performs a "call", causing the return address to be placed in rd, which
is normally ra (x1).
2. The PC jumps to the subroutine address to start executing the called code.
3. The called code, eventually returns by "ret" (or "jr rs", when a different register
is used).
● 99
The current address of the "auipc" instruction 0x42005e16 is added to the constant
0x7EA000 (which is derived from 0x7EA placed in the upper 20 bits) and the result is tem-
porarily stored into ra (x1). Then the "jalr" instruction adds the constant 490 (decimal) to
ra (x1), to arrive at 0x427F0000 for foo(). As part of the "jalr" instruction, the return ad-
dress of 42005E1A + 4 is now stored into ra (x1), while the PC register is set to foo's entry
point address of 0x427F0000.
The good news is that the programmer doesn't need to worry about this much. The assem-
bler and the linker do all the dirty work to make a long or short call as necessary.
Part of the calling convention is that the return address is saved into register ra (x1). If,
however, you need to call some mini-routine(s), the convention is that t0 (x5) is used. This
avoids having to save and restore ra (x1) from memory.
1 .global callme
2
3 .text
4 callme: li a0,1 # a0 = 1
5 call t0,intern # Call internal
6 ret
7
8 intern: addi a0,a0,2 # a0 += 2
9 jr t0 # Return to retn
● 100
Listing 7.2 illustrates the ESP32-C3 main program used for this experiment. When the
function callme() is called the register a0 is loaded with the value of 1 (line 4). After the
internal function intern() is called (at line 8) the value of 2 is added resulting in a0 holding
the value of 3.
1 #include <stdio.h>
2
3 extern int callme();
4
5 void
6 app_main(void) {
7
8 printf("callme() returned %d\n",callme());
9 }
When you build and flash this code, the run output should report the following (substitute
the appropriate path for your USB device):
$ idf.py build
...
$ idf.py idf.py -p <<<your-port>>> flash monitor
...
callme() returned 3
The same program exists for QEMU. You can build and run it as follows:
$ cd ~/riscv/repo/07/jal/qemu64
$ gcc -O0 -g jal.S main.c
$ ./a.out
callme() returned 3
$ gdb ./a.out
GNU gdb (GDB) Fedora 9.0.50.20191119-2.0.riscv64.fc32
Copyright (C) 2019 Free Software Foundation, Inc.
...
Reading symbols from ./a.out...
(gdb)
● 101
This prepares to debug the program executable a.out and pauses for your next command.
As long as you compiled the code with the -g option, you should be able to list your code
with the "list" command:
(gdb) list
1 #include <stdio.h>
2
3 extern int callme();
4
5 int
6 main(int argc,char **argv) {
7
8 printf("callme() returned %d\n",callme());
9 return 0;
10 }
(gdb)
(gdb) b callme
Breakpoint 1 at 0x1048e: file jal.S, line 4.
(gdb)
Now when the program is running, it will stop when callme() is invoked. Since the program
is not running yet, let's start it:
(gdb) r
Starting program: /home/riscv/riscv/repo/07/jal/qemu64/a.out
glibc-2.30.9000-29.fc32.riscv64
Breakpoint 1, callme () at jal.S:4
4 callme: li a0,1 # a0 = 1
(gdb)
Because we have the breakpoint set, the execution paused as soon as it entered the as-
sembler routine callme(). So far, so good. Now we can trace one instruction at a time. Let's
"step" one instruction:
(gdb) s
5 call t0,intern # Call internal
(gdb)
We see the execution has performed the assembler statement of line 4. But let's examine
the register contents of a0:
● 102
If we leave the register name out, all registers would be reported. We see that the register
a0 has been loaded with the value 1, as coded. Let's step one more instruction:
(gdb) s
intern () at jal.S:8
8 intern: addi a0,a0,2 # a0 += 2
(gdb)
Now we've called the intern() function by use of t0. Step once again and display registers:
(gdb) s
9 jr t0 # Return to retn
(gdb) info reg t0 a0
t0 0x10494 66708
a0 0x3 3
(gdb)
The register a0 now has the value 3, and the register t0 has the return address from the
call in line 5. Let's step from line 9:
(gdb) s
callme () at jal.S:6
6 ret
(gdb)
From this, we see the control has returned to after the point of the call in line 6. Stepping
once again should get us back to the main() program, which called us:
(gdb) s
callme() returned 3
main (argc=1, argv=0x3ffffff2a8) at main.c:9
9 return 0;
(gdb)
From this session, we see that the printf() call was also completed as part of the return, so
that gdb could return a complete statement executed, returning to the statement following
in line 9. Since we don't care about tracing the rest of the program you can just "continue"
it as follows:
● 103
(gdb) c
Continuing.
[Inferior 1 (process 843) exited normally]
(gdb) q
The program then exits in the normal way. You could just "quit" at that point also.
What did we learn? Using the gdb debugger (under QEMU), we were able to trace the code
from main(), to callme(), and step through each assembler instruction until we returned
to the main program. Along the way we, were able to report register contents. This is de-
bugging in luxury!
When a value like a 64-bit integer must be passed or returned on an XLEN=32-bit platform,
the low-order XLEN bits are loaded into a0 (or even numbered argument register). The
high-order bits are then passed into the next odd-numbered register depending. Figure 7.1
illustrates how a long long int is passed or returned for the ESP32-C3.
9 A B C D E F 0 1 2 3 4 5 6 7 8
XLEN=32
Figure 7.1: How a long long int is loaded into two 32-bit registers on ESP32-C3.
When an int followed by a long long is passed, the int goes into register a0 (the int fits the
register). However, the long long argument has its low order word in a2 (the even-num-
bered register), and the high order word is passed in a3. Very large arguments (more than
twice the size of a pointer) are passed by reference!
This convention is used for all arguments until all eight of the argument registers are used
up. When registers a0 through a7 are allocated, what happens to the remaining argu-
ments? These are placed on the stack. After the prologue executes, the stack pointer sp
(x2) points to the first argument that was not passed in a register.
● 104
The stack pointer value is found in register sp (x2) and is set to the high address of the
stack and grows downward. The stack pointer is maintained at an alignment boundary,
which for ESP32-C3 (where XLEN=32) is a double word address (modulo 8 bytes). For
XLEN=64 platforms (QEMU), this is modulo 16 bytes.
7.4.1. Prologue
At the start of the function, the prologue serves to perform any functions necessary to
preserve the integrity of the call. This involves adjusting the stack pointer and saving to the
stack as needed. The general procedure is:
1. Decrement sp by the stack frame size for register saves, including any local vari-
able space (the size is round up to mod 8 for XLEN=32, or mod 16 for XLEN=64).
2. Store registers to be saved in the allocated stack frame (this optionally includes
the s0/fp (x8) register).
3. Save register ra (x1) when the called routine calls or reuses register ra.
The fp/s0 (x8) register is often setup by the C compiler so that negative offsets refer to
values saved on the stack. By doing that, the stack can grow or shrink without losing track
of local variable addresses or saved register values. For example:
Then, to save one byte to a local stack byte variable, it might use:
Note: It is important to adjust the sp (x2) as the first step. This keeps your stack frame
from being corrupted by intervening interrupts should they occur.
When the function is first called, the sp (x2) points to the first overflow argument (the
calling program arranges this). After the function prologue completes, register s0/fp (x8)
points to that first overflow argument instead. Positive offsets from s0/fp point to excess
calling arguments. Since the caller arranges these arguments, the called function never
needs to worry about releasing them.
● 105
1. Allocate space on the stack by subtracting from the current sp (x2) and maintain
stack alignment (this keeps the stack frame interrupt safe).
2. Optionally save the current s0/fp (x8) and other registers as necessary.
3. Optionally save register ra (x1), when necessary.
4. Optionally setup stack frame register s0/fp (x8) when required. This is normally
required when all of the arguments did not fit into the available registers.
While the C compiler might use negative offsets from register fp/s0 to access local varia-
bles, it does not have to be done that way. The same access to local variables can be had
with positive offsets from the stack pointer register sp.
7.4.2. Epilogue
When it is time to return to the caller, the stack changes must be undone, and the neces-
sary registers restored. This procedure consists of:
Note that the called function does not concern itself with freeing excess arguments that
were passed on the stack. The calling program takes care of that for us.
When the floating-point data type is handled by software, those values are passed in the
usual integer registers instead.
The pragma was added to optimize the C code to make examining its main program listing
less confusing. Sometimes in unoptimized code, values are copied from register to register
in a confusing and unnecessary manner.
● 106
1 #include <stdio.h>
2 #include <stdint.h>
3
4 #pragma GCC optimize("-O3")
5
6 extern int bigcall(
7 int32_t one,
8 int64_t two,
9 int32_t three,
10 int64_t four,
11 int32_t five,
12 int64_t six,
13 int32_t seven,
14 int64_t eight,
15 int32_t nine
16 );
17
18 void
19 app_main(void) {
20 int rc = bigcall(101,102,103,104,105,106,107,108,109);
21
22 printf("bigcall() returned %d\n",rc);
23 }
The code of interest is found in Listing 7.4, for the assembler routine bigcall().
1 .global bigcall
2
3 .struct 0
4 svfp: .space 4 # Save register fp/s0
5 svra: .space 4 # Save register ra
6 var1: .space 4 # Sample stack variable
7 .balign 8 # Make sz mod 8
8 sz = . - svfp
9
10 .struct 0
11 sixpt2: .space 4 # High order arg 6 (int64)
12 seven: .space 4 # Arg 7 (int32)
13 eight: .space 8 # Arg 8 (int64)
14 nine: .space 4 # Arg 9 (int32)
15
16 .text
17 bigcall:
18 addi sp,sp,-sz # Set sp for stack frame
● 107
This source file introduces a few new concepts. We could hard code the stack offsets for the
overflow arguments and the saved register storage. But this is tedious and error-prone. So
we make use of the ".struct" pseudo-op in lines 3 and 10. Focusing on the group starting on
line 3, this starts an absolute definition, starting at address 0. This is different than defining
a memory section because it is not actually allocated to any memory. These definitions ap-
pear in the listing as a section named "*ABS*", which is not an actual section. For example,
a listing of bigcall.S would report:
DEFINED SYMBOLS
main/bigcall.S:17 .text:0000000000000000 bigcall
*ABS*:0000000000000000 svfp
*ABS*:0000000000000004 svra
*ABS*:0000000000000008 var1
main/bigcall.S:3 *ABS*:0000000000000010 sz
*ABS*:0000000000000000 sixpt2
*ABS*:0000000000000004 seven
*ABS*:0000000000000008 eight
*ABS*:0000000000000010 nine
● 108
Unlike the symbol bigcall, which is the name of our global entry point in the .text section,
the symbol svfp is registered to an absolute area starting at absolute offset 0. References to
memory section symbols get relocated by the linker, while these absolute symbols do not.
This is important since we want the offsets to be computed but not relocated.
3 .struct 0
4 svfp: .space 4 # Save register fp/s0
5 svra: .space 4 # Save register ra
6 var1: .space 4 # Sample stack variable
7 .balign 8 # Make sz mod 8
8 sz = . - svfp
we see that the symbol svfp is assigned an address (offset) of 0. The ".space" pseudo-op in
line 4 tells the assembler to reserve 4 more bytes before looking at line 5. This effectively
moves the current location without defining any content.
Line 5 defines the symbol svra as offset 4, while line 6 defines var1 as offset 12. Before we
compute the size of the stack frame in line 8, we apply the ".balign" pseudo-op to make
the stack frame size aligned to an 8-byte boundary (for XLEN=32). For RV64, we would
use 16 instead. Finally, sz is computed in line 8, which is the current location minus the
start (svfp).
Figure 7.2 illustrates the layout of the saved registers and the excess arguments that are
passed in the call after the function prologue has completed.
stack
14 nine (hi)
12 nine (lo)
8 eight
4 seven
fp
➔
0 six (hi)
12 filler
8 var1
4 svra
0 svfp sp
➔
grows
Figure 7.2: Diagram of stack layout for bigcall.S.
● 109
The saved word svfp is at offset 0 from the stack pointer (address 0(sp)). The next word
for saving ra at offset svra is at address 4(sp). Our example stack variable var1 is at 8(sp),
which we will later just zero to demonstrate accessing local variables. The saved calling
arguments are available as offset from fp/s0. For example, argument seven is available as
4(fp) (or 4(s0)).
With these save offsets symbolically defined by the assembler, we can use them in the
following prologue code:
Line 18 adjusts the stack pointer by the size of the stack frame (16 bytes in this case). This
maintains the alignment of the stack pointer. Lines 19 and 20 save registers fp/s0 and ra
to the stack. In this example, I am assuming that ra is going to be modified (perhaps by
another call) so it must be saved. We modify the value of fp/s0 in line 21, so it too must be
saved. We don't have to initialize var1, so that isn't done in this prologue.
The first group of arguments that fit into the registers a0 to a7, can be accessed directly:
This code sums those arguments into a0 (our return register) for arguments one through
six. It's a little confusing with a7 representing argument six but keep in mind that some of
the arguments were 64 bits in size and required the use of a register pair.
The remaining arguments were placed on the stack by the calling C program. Review Figure
7.2 again. We use the frame pointer (fp/s0) register to access these arguments according
to the offset symbols we defined in the structure starting in line 10. Since temporary reg-
isters like t0 don't need to be preserved, we use it temporarily to load and sum the extra
argument values:
● 110
The argument seven word is loaded into temporary register t0 (x5) and then added to reg-
ister a0. The same is done for arguments eight and nine.
Just for fun, we zero our stack variable var1 in line 38. This example demonstrates how you
might allocate and use stack variables when required.
After the sum is computed and var1 is zeroed, we are ready to return to the calling pro-
gram. This is where the epilogue is applied:
40 lw ra,svra(sp) # Restore ra
41 lw fp,svfp(sp) # Restore fp/s0
42 addi sp,sp,+sz # Restore sp
43 ret
The offsets svra and svfp are relative to the current stack pointer (sp). These offsets are
used with the stack pointer to restore values for register ra and fp. After that, we must
also restore the sp itself by adding to it the same offset that was subtracted from it upon
entry (line 18). The last step in this epilogue is to "ret", which returns control to the caller
(register ra (x1) is assumed by the pseudo-op by default).
When the program is flashed and executed, you should see the sum printed:
$ idf.py build
…
$ idf.py -p <<<your-port>>> flash monitor
...
bigcall() returned 945
Success!
This may seem like a lot of effort to orchestrate this function call and indeed it was. But do
keep in mind that most functions do not have so many arguments, which force the overflow
onto the stack. When the C code is calling, you only have to worry about where to find the
arguments. They are placed on the stack for you by the C compiler before the call is made
and automatically released after the return.
● 111
1 #include <stdio.h>
2 #include <stdint.h>
3
4 #pragma GCC optimize("-O3")
5
6 extern void bigcall2(
7 int32_t one,
8 int64_t two,
9 int32_t three,
10 int64_t four,
11 int32_t five,
12 int64_t six,
13 int32_t seven,
14 int64_t eight,
15 int32_t nine
16 );
17
18 void
19 app_main(void) {
20
21 bigcall2(101,102,103,104,105,106,107,108,109);
22 }
The program illustrated in Listing 7.6 is much the same as before with the following chang-
es:
1. Line 36 moves the sum to register a1, to act as a second int argument to the
printf() call.
2. Line 40 uses the pseudo-op "la" to load an address into a0. This is the pointer to
the format string to be passed to printf().
3. Line 42 calls printf(), with the arguments in registers a0 and a1. Recall that the call
(by default) clobbers register ra (x1) with the printf() return address, which is why
the register was saved in the function prologue.
4. Line 43 handles the value returned by printf(), and saves it in our stack variable
var1. We don't actually use it here but it demonstrates the use of a stack-based
variable.
5. Lines 52 and 53 define the printf() format string in a read-only section used for
literals.
● 112
● 113
52 .section .rodata
53 fmt: .string "bigcall2() computed a sum of %d\n"
The "la" opcode in line 40 is a pseudo-op that results in the destination register receiving
the desired address (not its value). Like the "call" pseudo-op, "la" can result in one or two
actual opcodes to accomplish this. The printf() call needs a pointer to a format string in a0,
so the "la" opcode accomplishes this for us. If you were to objdump the build/bigcall2.elf
executable that was linked, you would find that it does indeed turn into two opcodes.
When you flash and run the program, you should get the message:
indicating success. The format of the message was deliberately changed so that you can be
assured that it was the assembler program printing the message this time.
7.6. Summary
I hope you are feeling confident now about one of the more difficult aspects of RISC-V
assembly language. It is vital that the register convention be understood for saving and
restoring registers. You witnessed how the register passing works including those that were
passed on the stack itself. The presented example programs illustrate the function prologue
and epilogue.
Using the assembler ".struct" pseudo-op, you learned how to define symbolic offsets to
stack frame components. This is an important tool because it saves you from having to use
the brittle hard coding of offsets. Allow the assembler to do the mental arithmetic for you.
It is also more amenable to changes later on.
The new opcode "la" for load address was introduced without much fanfare. But its useful-
ness for loading a pointer value into a register will become more relevant in the chapters
ahead.
There will sometimes exist a temptation to violate the call convention in the name of effi-
ciency. Some advocate "just don't".[1] Certainly recognize that there is a risk when you do.
Accept that it is probably ill-advised for medical or safety-critical systems. Be aware that
interrupts will also use the stack while your code executes so your abnormal convention
must be able to tolerate that. If you are stuck debugging something weird, this is probably
one of the first things to be checked.
● 114
Bibliography
[1] RISC-V reference - Simon Fraser University. (n.d.). Retrieved May 11, 2022, from
https://www.cs.sfu.ca/~ashriram/Courses/CS295/assets/notebooks/RISCV/RISCV_
CARD.pdf
● 115
To this point we haven't explored branching other than calling and returning from a sub-
routine call. This chapter examines some fun programs to exercise branching instructions.
Branching is a fundamental part of a loop, which allows repeated operations.
• Unconditional transfers
• Conditional branches
The following are variations of the unconditional transfer. For easy-to-read code, the "j" or
"jr" pseudo-opcodes are recommended:
When coding these transfers, there is no need to compute word or byte offsets. The as-
sembler knows which offset to provide when you provide a branch label. However, if you
provide numeric offsets, you will need to use the correct form.
● 116
Depending upon the result of the comparison between registers rs1 and rs2, the branch is
made by modifying register pc, or allowing the execution to resume at the next instruction.
There are no status flags saved from the comparison in RISC-V.
Since the comparison to zero comes up frequently, the following pseudo-ops are also avail-
able:
These are just encodings of the previous opcodes. For example, "bnez" is the same as:
If you need unsigned comparisons, then the following additional opcodes are available:
Branch instructions make a CPU smarter than a simple calculator, by giving it the ability to
make decisions.
● 117
In all of the above, the result goes into the destination register rd. The value that is shifted
originates in register rs1. The shift count originates from either source register rs2 or from
an immediate constant. Finally, the opcode determines the type of shift – its direction and
whether it is logical or an arithmetic shift. The following is an example:
In the C language, you often don't consider whether it involves a logical or arithmetic shift
operation. That is because it is determined by the data type. When unsigned, it requires
a logical shift. When signed, it requires an arithmetic shift, when shifting right. The arith-
metic right shift preserves the sign while the remaining bits are shifted right. All shift-left
operations are logical shifts, which is why you don't see a "sla" opcode. Logical shifts always
populate the shifted-out bit position with a zero.
These shift operations take the lower 32-bit value in rs1, perform the shift and then sign
extend the 32-bit result into the destination register for 64 bits. For example, if register a0
holds the value:
a0 0x7ffffffffffffffc
a0 0xfffffffffffffffe
In other words, the sign was taken from bit 31 after the shift was performed and then ex-
tended out to the full 64 bits in the destination register. This permits an easy interchange
of signed 32-bit values into the 64-bit environment.
● 118
1 #include <stdio.h>
2 #include <stdint.h>
3
4 #pragma GCC optimize("-O3")
5
6 extern int ones(int bits); // Assembler routine
7
8 static int c_ones(int bits) {
9 int count = 0;
10
11 for ( ; bits != 0; bits <<= 1 ) {
12 if ( bits < 0 )
13 ++count;
14 }
15 return count;
16 }
17
18 void
19 app_main(void) {
20 static int tests[] = { 0, 1, 2, 3, 5, 7, 9, 15,
21 31, 63, 64, 127, 1023, 1024, 2047, 2048,
22 9999 };
23 int bcount, ccount, bits;
24
25 for ( unsigned ux=0; tests[ux] < 9999; ++ux ) {
26 bits = tests[ux];
27 bcount = ones(bits);
28 printf("ones(%4d) (0x%04X) returned %d\n",
29 bits, bits,
● 119
30 bcount);
31 ccount = c_ones(bits);
32 if ( ccount != bcount )
33 printf("c_ones(%4d) did not agree with %d!\n",
34 ccount,bcount);
35 }
36 printf("Done.\n");
37 }
When the value a0 is non-zero, we continue in line 6, testing if the sign bit is positive using
"bge". If it is positive, we skip over the instruction in line 7. Otherwise, we continue execu-
tion in line 7 and increment the count in t0 (line 7). Execution continues in line 8 where the
value of a0 is locally shifted left 1 bit. At the end of the loop in line 9, we unconditionally
branch back to the top of the loop at line 5.
When the branch is finally taken to the label done (line 11), the count in register t0 is
moved to a0 in order to return the bit count. Finally, we return to the caller in line 12.
1 .global ones
2 .text
3 ones: mv t0,zero # t0 = 0
4
5 loop: beq a0,zero,done # Branch if a0 == 0
6 bge a0,zero,shift # Skip next if sign is positive
7 addi t0,t0,1 # t0 += 1
8 shift: slli a0,a0,1 # a0 <<= 1
9 j loop # Repeat loop
10
11 done: mv a0,t0 # a0 = count in t0
12 ret
● 120
Run it
Build, flash and monitor the result from the ESP32-C3 device:
$ cd ~/riscv/repo/08/ones
$ idf.py build
...
idf.py -p <<<your-port>>> flash monitor
...
ones( 0) (0x0000) returned 0
ones( 1) (0x0001) returned 1
ones( 2) (0x0002) returned 1
ones( 3) (0x0003) returned 2
ones( 5) (0x0005) returned 2
ones( 7) (0x0007) returned 3
ones( 9) (0x0009) returned 2
ones( 15) (0x000F) returned 4
ones( 31) (0x001F) returned 5
ones( 63) (0x003F) returned 6
ones( 64) (0x0040) returned 1
ones( 127) (0x007F) returned 7
ones(1023) (0x03FF) returned 10
ones(1024) (0x0400) returned 1
ones(2047) (0x07FF) returned
ones(2048) (0x0800) returned 1
Done.
How did we do? No complaints were issued, so this confirms that both the assembly lan-
guage function and the C language functions agreed. Examination of the return values
indicates that the results are correct.
Routines Compared
Was our assembler routine better than the C language one? Let's first examine the C lan-
guage listing in assembly language:
$ ~/riscv/repo/listesp main/main.c
The extract of the portion of the listing for the c_ones() function is shown in Listing 8.3.
6 c_ones:
7 0000 AA87 mv a5,a0
8 0002 0145 li a0,0
9 0004 99C7 beqz a5,.L5
10 .L4:
11 0006 13A70700 slti a4,a5,0
12 000a 8607 slli a5,a5,1
13 000c 3A95 add a0,a0,a4
● 121
It's clear from this listing that the C compiler chose to generate this function a little differ-
ently than our own assembler routine. Let's break it down and compare.
Line 7 moves the argument value "bits" to a5 as its first step. Then it sets a0 to zero in line
8. If the value of a5 is already zero, control passes to .L5 (line 16) where the routine returns
to the caller (with zero held in a0).
Otherwise, the top of the loop in lines 10 and 11 is entered. Line 11 uses an opcode "slti"
that we've not yet covered. When the value in a5 is less than 0 (the immediate value), then
the value placed into a4 is the value 1. Otherwise, a4 receives the value zero (line 11).
After that, it shifts the bits in a5 left 1 bit (line 12). Line 13 adds the value of a4 to a0. We
know that a4 will be the value 0 or 1 from line 11. Finally, if a5 is not equal to zero in line
14, execution resumes at the top of the loop at .L4. Otherwise, the execution falls through
to line 15, returning the bit count in a0.
It is readily apparent that both functions are the same number of bytes of object code (20
bytes). So which routine is faster? The c_ones() routine must execute the "addi" instruction
even when the register a4 is zero:
In this routine's favour however, there is one less branch involved in each loop. The ones.S
program has a "bge" branch in line 6, whereas the c_ones() function uses the "slti" in line
11 instead. The practical difference is going to boil down to the silicon used.
● 122
seqz rd,rs # rd = rs == 0 ? 1 : 0
snez rd,rs # rd = rs != 0 ? 1 : 0
sltz rd,rs # rd = rs < 0 ? 1 : 0
sgtz rd,rs # rd = rs > 0 ? 1 : 0
It should be clear that these operations have direct application for C/C++. Consider the
C++ code:
bool zflag;
int x;
…
zflag = !x; // Set zflag true if x == 0 else false
In this example, if register t0 holds the value of zflag and a0 holds the value of x, then the
C compiler could simply emit:
● 123
1 .global odd_parity
2 .text
3 odd_parity:
4 li t0,0 # t0 = 0
5
6 loop: beq a0,zero,done # Branch if a0 == 0
7 sltz t1,a0 # t1 = a0 < 0 ? 1 : 0
8 xor t0,t0,t1 # t0 = t0 ^ t1
9 slli a0,a0,1 # a0 <<= 1
10 j loop # Repeat loop
11
12 done: mv a0,t0 # a0 = count in t0
13 ret
In this program, we chose to use "li" in line 4 this time to initialize t0 to zero just for fun.
The execution continues to the top of the loop in line 6, where a0 is tested for the value of
zero. If it is zero, control moves to line 12, where the parity value is placed in a0, and then
returned to the caller (line 13).
Otherwise, when line 6 fails to branch, it means we still have 1 bits to test for parity. Line 7
sets the temporary register t1 to a 1 if the value is less than zero (indicating that the sign
bit is true), else t1 is cleared to zero. The "xor" opcode in line 8 applies an exclusive-or be-
tween the value in t0 and t1. Since either register only has the low order bit set (or not), the
end result is a 1 or a 0, indicating the current parity value. Recall that an exclusive-or oper-
ation flips bits when one or the other bit is a 1-bit, but not if both are the same. So, every
time through the loop, as we encounter more 1-bits, the value of t0 (parity) is inverted.
At the end of the loop in lines 9 and 10, we shift the value in a0 logically left by the 1-bit
position, populating a new bit in the sign bit (bit 31 for the ESP32-C3) and repeat the loop.
Listing 8.6 illustrates the main program for driving the test, which is much the same as
before. The function odd_parity() is called with various values from the array tests[] and
reported in the printf() call in lines 18 and 19.
1 #include <stdio.h>
2 #include <stdint.h>
3
4 #pragma GCC optimize("-O3")
5
6 extern int odd_parity(int bits); // Assembler routine
7
8 void
9 app_main(void) {
10 static int tests[] = { 0, 1, 2, 3, 5, 7, 9, 15,
● 124
$ cd ~/riscv/repo/08/parity
$ idf.py build
...
$ idf.py -p <yourport> flash monitor
...
odd_parity( 0) (0x0000) returned 0
odd_parity( 1) (0x0001) returned 1
odd_parity( 2) (0x0002) returned 1
odd_parity( 3) (0x0003) returned 0
odd_parity( 5) (0x0005) returned 0
odd_parity( 7) (0x0007) returned 1
odd_parity( 9) (0x0009) returned 0
odd_parity( 15) (0x000F) returned 0
odd_parity( 31) (0x001F) returned 1
odd_parity( 63) (0x003F) returned 0
odd_parity( 64) (0x0040) returned 1
odd_parity( 127) (0x007F) returned 1
odd_parity(1023) (0x03FF) returned 0
odd_parity(1024) (0x0400) returned 1
odd_parity(2047) (0x07FF) returned 1
odd_parity(2048) (0x0800) returned 1
Checking the results, you can verify that the odd parity was computed for each test value.
● 125
1 #include <stdio.h>
2 #include <stdint.h>
3
4 #pragma GCC optimize("-O3")
5
6 extern int odd_parity(int64_t bits); // Assembler routine
7
8 int
9 main(int argc,char **argv) {
10 static int64_t tests[] = { 0, 1, 2, 3, 5, 7, 9, 15,
11 31, 63, 64, 127, 1023, 1024, 2047, 2048,
12 65535, 65536, 0xF0F0900000000001ll, 999999 };
13 int64_t bits;
14 int oddpar;
15
16 for ( unsigned ux=0; tests[ux] != 999999ll; ++ux ) {
17 bits = tests[ux];
18 oddpar = odd_parity(bits);
19 printf("odd_parity(%20lld) (0x%016llX) returned %d\n",
20 bits, bits, oddpar);
21 }
22 printf("Done.\n");
23 }
The parity.S assembly language program shown in Listing 8.8 is exactly the same program
that we used on the ESP32-C3. The difference in this instance, however, is that the regis-
ters are 64 bits in width (XLEN=64). So, the 64-bit argument is received in register a0 (line
6). The 64 bits are shifted left (logically) in line 9 (vs 32 bits as it was on the ESP32-C3).
1 .global odd_parity
2 .text
3 odd_parity:
4 li t0,0 # t0 = 0
5
6 loop: beq a0,zero,done # Branch if a0 == 0
7 sltz t1,a0 # t1 = a0 < 0 ? 1 : 0
8 xor t0,t0,t1 # t0 = t0 ^ t1
9 slli a0,a0,1 # a0 <<= 1
10 j loop # Repeat loop
11
12 done: mv a0,t0 # a0 = count in t0
13 ret
● 126
$ cd ~/riscv/repo/08/parity/qemu64
$ gcc -g parity.S main.c
$ ./a.out
odd_parity( 0) (0x000000) returned 0
odd_parity( 1) (0x000001) returned 1
odd_parity( 2) (0x000002) returned 1
odd_parity( 3) (0x000003) returned 0
odd_parity( 5) (0x000005) returned 0
odd_parity( 7) (0x000007) returned 1
odd_parity( 9) (0x000009) returned 0
odd_parity( 15) (0x00000F) returned 0
odd_parity( 31) (0x00001F) returned 1
odd_parity( 63) (0x00003F) returned 0
odd_parity( 64) (0x000040) returned 1
odd_parity( 127) (0x00007F) returned 1
odd_parity( 1023) (0x0003FF) returned 0
odd_parity( 1024) (0x000400) returned 1
odd_parity( 2047) (0x0007FF) returned 1
odd_parity( 2048) (0x000800) returned 1
odd_parity(65535) (0x00FFFF) returned 0
odd_parity(65536) (0x010000) returned 1
Done.
This parity function worked on 64 bits, just as it did with 32 for the ESP device. But there
is a new efficiency concern. The loop iterations may run up to 64 times before returning to
this platform. For XLEN=32-bit platforms, the maximum loop count was half of that. If you
were computing parity on byte values that were loaded into an int64_t argument, the loop
would be idle for 87.5% of the bits processed. Perhaps this can be improved upon.
Listing 8.9 shows a somewhat improved version of odd_parity() in file parity2.S. What is
new is that we preload register t2 with a mask value of 32 1-bits in the upper half of the
register. Line 5 places all 1 bits into t2, while line 6 shifts it left logically 32 bits. Recall that
the immediate constant is sign-extended when loaded. This mask value permits us to check
if the upper 32-bits of a0 are all zeros or not (lines 11 and 12). If they are zero, we branch
to line 9 to simply shift the a0 register left by 32 bits, to avoid iterating 32 more times to
get to that point. Note that no parity adjustment is required for this.
1 .global odd_parity
2 .text
3 odd_parity:
4 li t0,0 # t0 = 0
5 li t2,-1 # t2 = 0xFFFFFFFFFFFFFFFF
● 127
There is more that could be done to improve the efficiency, but each addition has its trade-
offs. I'll leave that to you to experiment further.
Tip: To load a register with all 1 bits, take advantage of the sign extension of the imme-
diate data. For example "li t2,-1" sign extends the 12-bit value of -1 (0xFFF) to the full
XLEN width of the register.
This can also apply to other special mask values. For example, to create a mask for all
but the low order byte, the instruction "li t2,-256" can be used. This sets all bits to 1 in
t2, except for the low order byte because -256 (0xF00) is sign extended.
Normally we write code to execute at one fixed address. But what happens if the code
needs to be copied to a different address and executed there? Traditionally, this requires
fixing up addresses of referenced routines that didn't move. Under Fedora Linux, for ex-
ample, shared libraries need to be able to run in the shared memory segments, which are
dynamically allocated. This creates two problems:
1. Labels of instructions that move with the code, need to be relative to the pc. In
this way, moving the code does not upset branch references because they remain
relative to the current address.
2. Fixed labels of routines or global data that do not move, require some special
treatment in order to remain valid after the move.
● 128
Recall from the last chapter when we examined the far call to a function foo() at address
0x427F0000:
That code relied on the fact that the "auipc" opcode does a calculation relative to the pc.
If we were to move this code to a new location and keep foo where it is, the value computed
for the call to foo() would now be incorrect. This is due to the relative "auipc" calculation.
To fix this as an absolute reference to foo, we can substitute the "auipc" opcode with "lui"
(load upper immediate) instead. Instead of computing a relative result, the absolute value
of %hi(foo) is placed in ra instead. Then when we call lo%(foo) plus ra, we arrive at the
fixed address of foo (0x427F0000).
For beginners, this kind of stuff might bother the brain. The good news is that, for the av-
erage routine, you can glibly use the "call" and otherwise ignore the issue. But those who
write code that must be position independent will have to assume more responsibility.
8.8. Summary
This chapter covered unconditional jumps and conditional branches. You witnessed how
RISC-V is able to function without condition flags. You've worked with the shift instructions
and the compare and set operations. We compared an optimized C function against our
own assembly language routine and found that it is not always trivial to beat the compiler.
We even snuck in the use of the "xor" opcode. This and other opcodes are the subjects of
the next chapter.
● 129
Up to this point in this tutorial, things have been kept simple to avoid overwhelming the
reader with details. But with the framework out of the way, you're at the point now where
more functionality would be welcomed. This chapter introduces a number of additional
opcodes, which are the essential building blocks for most RISC-V assembly language pro-
grams.
9.1.2. lui
The "lui" opcode provides a way to load 20 bits of unsigned data, into the high order 32
bits of a 32-bit register. We've discussed this one before. When registers are larger than 32
bits, the same applies except that bit 31 is sign-extended to the full XLEN bits for RV64 and
larger. For example, the instruction:
lui a0,0xEEEEE
● 130
9.1.3. auipc
We've also seen this opcode before when it was used to do a relative address calculation. It
adds the current location (register pc before it is incremented) to the unsigned immediate
data (shifted up 20 bits). Like the "lui" opcode, bit 31 is sign-extended to the full register
width for RV64 and larger platforms.
These all behave as if the register result were only 32 bits in size. But once the result is
computed, bit 31 is then sign-extended to the full width of the register. You might think of
these as working on a word (32-bits) and then sign-extended.
The operations of "and", "or" and "xor" should be familiar to the C/C++ language program-
mer. These boolean operations are summarized in Table 9.1. Remember that the immedi-
ate data constants are sign-extended to the full register width, even though the immediate
constant itself is only 12-bits wide within the opcode.
Tip: To load a register with all 1-bits, load an immediate data value of -1, as in "li t0,-1".
In this example, t0 is set to all 1-bits due to the sign extension of the constant -1.
● 131
1 .global rotate_left
2 .text
3
4 rotate_left:
5 sltz t0,a0 # t0=1 if bit 32=1
6 slli a0,a0,1 # a0 <<= 1
7 or a0,a0,t0 # Or in shifted out bit
8 ret
The 32-bit unsigned value is passed in register a0 (line 5). The "set less than zero" opcode
then sets register t0 to the value of 1 or 0 based upon the sign bit (bit 31) of a0. Line 6 then
shifts that value in a0 left one bit (this fills bit 0 with a zero). Finally, line 7 does a logical
"or" of the bit in t0, which adds a 1-bit or leaves the register a0 as it is, if t0 happens to be
zero. Finally, the value is returned in register a0 (line 8).
The main driver program is shown in Listing 9.2. It simply iterates 32 times, rotating the
value of 0xBEEF, initialized in line 10. After iterating the full 32 times, the value should
return to 0xBEEF.
1 #include <stdio.h>
2 #include <stdint.h>
3
4 #pragma GCC optimize("-O3")
5
6 extern uint32_t rotate_left(uint32_t bits); // Assembler routine
7
8 void
9 app_main(void) {
10 uint32_t bits = 0xBEEF, rol;
11
12 for ( unsigned ux=0; ux<32; ++ux, bits = rol ) {
13 rol = rotate_left(bits);
14 printf("rotate_left(0x%08X) returned 0x%08X\n",
15 bits,rol);
16 }
17 printf("Done.\n");
18 }
● 132
$ cd ~/riscv/repo/09/rol
$ idf.py build
...
$ idf.py -p <<<yourport>>> flash monitor
...
I (257) cpu_start: Starting scheduler.
rotate_left(0x0000BEEF) returned 0x00017DDE
rotate_left(0x00017DDE) returned 0x0002FBBC
rotate_left(0x0002FBBC) returned 0x0005F778
rotate_left(0x0005F778) returned 0x000BEEF0
rotate_left(0x000BEEF0) returned 0x0017DDE0
rotate_left(0x0017DDE0) returned 0x002FBBC0
rotate_left(0x002FBBC0) returned 0x005F7780
rotate_left(0x005F7780) returned 0x00BEEF00
rotate_left(0x00BEEF00) returned 0x017DDE00
rotate_left(0x017DDE00) returned 0x02FBBC00
rotate_left(0x02FBBC00) returned 0x05F77800
rotate_left(0x05F77800) returned 0x0BEEF000
rotate_left(0x0BEEF000) returned 0x17DDE000
rotate_left(0x17DDE000) returned 0x2FBBC000
rotate_left(0x2FBBC000) returned 0x5F778000
rotate_left(0x5F778000) returned 0xBEEF0000
rotate_left(0xBEEF0000) returned 0x7DDE0001
rotate_left(0x7DDE0001) returned 0xFBBC0002
rotate_left(0xFBBC0002) returned 0xF7780005
rotate_left(0xF7780005) returned 0xEEF0000B
rotate_left(0xEEF0000B) returned 0xDDE00017
rotate_left(0xDDE00017) returned 0xBBC0002F
rotate_left(0xBBC0002F) returned 0x7780005F
rotate_left(0x7780005F) returned 0xEF0000BE
rotate_left(0xEF0000BE) returned 0xDE00017D
rotate_left(0xDE00017D) returned 0xBC0002FB
rotate_left(0xBC0002FB) returned 0x780005F7
rotate_left(0x780005F7) returned 0xF0000BEE
rotate_left(0xF0000BEE) returned 0xE00017DD
rotate_left(0xE00017DD) returned 0xC0002FBB
rotate_left(0xC0002FBB) returned 0x80005F77
rotate_left(0x80005F77) returned 0x0000BEEF
Done.
At the end of the test run, the returned value is indeed 0xBEEF, which is the value that it
started with. Success!
● 133
1 .global rotate_left
2 .text
3
4 rotate_left:
5 sltz t0,a0 # t0=1 if bit 63=1
6 slli a0,a0,1 # a0 <<= 1
7 or a0,a0,t0 # Or in shifted out bit
8 ret
The main program differs slightly in Listing 9.4. This time the test iterates 64 times, to
arrive at the original value, due to the increase in register width.
1 #include <stdio.h>
2 #include <stdint.h>
3
4 #pragma GCC optimize("-O3")
5
6 extern uint64_t rotate_left(uint64_t bits); // Assembler routine
7
8 int
9 main(int argc,char **argv) {
10 uint64_t bits = 0xBEEF, rol;
11
12 for ( unsigned ux=0; ux<64; ++ux, bits = rol ) {
13 rol = rotate_left(bits);
14 printf("rotate_left(0x%016llX) returned 0x%016llX\n",
15 bits,rol);
16 }
17 return 0;
18 }
● 134
● 135
The result shows proof positive that the rotate left was correct because the original value
returned after 64 iterations.
1 .global rotate_right
2 .text
3
4 rotate_right:
5 andi a1,a0,1 # a1=1 if bit 0 of a0 is true
6 slli a1,a1,31 # Set bit 31 of a1 if bit=1
7 srli a0,a0,1 # a0 >>= 1 (logical)
8 or a0,a0,a1 # or in sign bit
9 ret
● 136
The main program is illustrated in Listing 9.6 and is otherwise much the same as before.
1 #include <stdio.h>
2 #include <stdint.h>
3
4 #pragma GCC optimize("-O3")
5
6 extern uint32_t rotate_right(uint32_t bits); // Assembler routine
7
8 void
9 app_main(void) {
10 uint32_t bits = 0xBEEF, ror;
11
12 for ( unsigned ux=0; ux<32; ++ux, bits = ror ) {
13 ror = rotate_right(bits);
14 printf("rotate_right(0x%08X) returned 0x%08X\n",
15 bits,ror);
16 }
17 printf("Done.\n");
18 }
$ cd /riscv/repo/09/ror
$ idf.py build
...
$ idf.py -p $PORT flash monitor
...
I (257) cpu_start: Starting scheduler.
rotate_right(0x0000BEEF) returned 0x80005F77
rotate_right(0x80005F77) returned 0xC0002FBB
rotate_right(0xC0002FBB) returned 0xE00017DD
rotate_right(0xE00017DD) returned 0xF0000BEE
rotate_right(0xF0000BEE) returned 0x780005F7
rotate_right(0x780005F7) returned 0xBC0002FB
rotate_right(0xBC0002FB) returned 0xDE00017D
rotate_right(0xDE00017D) returned 0xEF0000BE
rotate_right(0xEF0000BE) returned 0x7780005F
rotate_right(0x7780005F) returned 0xBBC0002F
rotate_right(0xBBC0002F) returned 0xDDE00017
rotate_right(0xDDE00017) returned 0xEEF0000B
rotate_right(0xEEF0000B) returned 0xF7780005
rotate_right(0xF7780005) returned 0xFBBC0002
rotate_right(0xFBBC0002) returned 0x7DDE0001
● 137
As an exercise, create a RV64 version of the same to be run under QEMU. There is one
minor change needed for ror.S. Can you spot it?
The "not" opcode is the same as the familiar one's complement (~) operator in C. The "neg"
opcode produces a two's complement value by subtracting rs from zero. The "negw" is for
the RV64 and larger platforms to treat a 64-bit value initially as a 32-bit value and then
sign extend the result.
For unsigned data types, this proves to be straightforward. Listing 9.7 demonstrates how
it is done. The low order words are added together in line 11, with the result replacing a0,
where the low order 32 bit result is returned. Line 12 then tests if the unsigned result in a0
is less than a2 (operand 2), and if so, places a 1 in a7, else zero. This value is essentially the
● 138
carry bit. In other words, if the unsigned sum of the low order words is less than either of
the low order operands, then there is a carry of 1. The program continues in line 13 to add
the high order 32-bit words, replacing a1 with the result. Finally, the carry (if any) is added
to the high order word in line 14. The 64-bit result is then returned in a0 and a1 at line 15.
1 .global add64
2 .text
3
4 # ARGUMENTS:
5 # a0, a1 uint64_t operand 1
6 # a2, a3 uint64_t operand 2
7 #
8 # RETURNS:
9 # a0, a1 uint64_t sum
10 #
11 add64: add a0,a0,a2 # Add low order 32 bits
12 sltu a7,a0,a2 # a7 = a0 < a2 ? 1 : 0
13 add a1,a1,a3 # Add high order 32 bits
14 add a1,a1,a7 # Add carry
15 ret
The main driver program for this test is shown in Listing 9.8. The variables x, y and z are
marked volatile in this program to prevent the optimizing C compiler from pre-calculating
the result.
1 #include <stdio.h>
2 #include <stdint.h>
3
4 #pragma GCC optimize("-O3")
5
6 extern uint64_t add64(uint64_t op1,uint64_t op2);
7
8 void
9 app_main(void) {
10 volatile uint64_t x=0x7FFFFFFFF, y=0x3000011115, z;
11
12 z = add64(x,y);
13 printf("0x%016llX + 0x%016llX = 0x%016llX\n",x,y,z);
14 }
● 139
$ cd ~/riscv/repo/09/multipu
$ idf.py build
...
$ idf.py -p <<<yourport>>> flash monitor
...
I (257) cpu_start: Starting scheduler.
0x00000007FFFFFFFF + 0x0000003000011115 = 0x0000003800011114
If you plug those values into gdb, you can verify the result:
$ gdb
GNU gdb (GDB) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
...
(gdb) p /x 0x7FFFFFFFFll + 0x3000011115ll
$1 = 0x3800011114
(gdb)
Unsigned Overflow
Testing for unsigned overflow is equally simple, since all that is required is to see if a carry
occurred out of the high order word.
1 .global s32ovf
2 .text
3
4 # ARGUMENTS:
5 # a0 int32_t operand 1
6 # a1 int32_t operand 2
7 #
8 # RETURNS:
9 # a0 flag:
10 # 0 = no overflow
11 # 1 = overflow
12 #
● 140
1. The input values to be summed are passed into the function in registers a0 and a1
according to the GNU calling convention.
2. Line 13 sets t1 to 1 if the first operand is negative, else zero.
3. Line 14 sets t2 to 1 if the second operand is negative, else zero.
4. A branch to noovfl is taken in line 15 if the signs differ (no overflow is possible
when the signs differ).
5. A trial sum is made of the two operands into t0 to determine the sign of the result
(line 19).
6. Register t0 is set to 1 if the trial sum is negative, else zero (line 20).
7. If t0 from step 6 does not match t1 from step 2, then the result's sign has changed,
indicating an overflow. In the overflow case, a branch is made to label ovfl (line
21).
8. When there is no overflow from step 7, fall through and return zero (line 23).
In this example, we only return the overflow status of the sum. An improved approach
would be to return both the sum and the status together, but that would be getting ahead
of ourselves.
The main driver program is provided in Listing 9.10. Four tests are performed by calling
the static function report() (lines 8 to 15). Each test is invoked with different test values in
lines 20 to 23. The assembly language function is invoked from line 12, and the results are
reported in lines 13 and 14.
1 #include <stdio.h>
2 #include <stdint.h>
3
● 141
$ cd ~/riscv/repo/09/ovflow
$ idf.py build
...
$ idf.py -p <<<yourport>>> flash monitor
...
I (258) cpu_start: Starting scheduler.
0x7FFFFFFE (2147483646) + 0x00000001 (1) = 0x7FFFFFFF (2147483647): no overflow
0x7FFFFFFE (2147483646) + 0x0000002E (46) = 0x8000002C (-2147483604): overflow
0xFFFFFFFD (-3) + 0x0000002E (46) = 0x0000002B (43): no overflow
0xFFFFFFFD (-3) + 0x80000000 (-2147483648) = 0x7FFFFFFD (2147483645): overflow
In the output, the first example retains the original sign after the sum (result 0x7FFFFFFF).
However, in the second sum, we see that the sign changed (result 0x8000002C) and the
function s32ovf() correctly identifies it as an overflow. The third test had inputs of different
signs and the result remains ok. The final test results in an overflow, which is correctly
identified.
● 142
neg64() since negating a value can also lead to overflow. For example, a 16-bit signed in-
teger has a maximum range of -32768 to +32767. When the value -32768 is negated, the
+32768 value cannot be represented. This counts as an overflow.
In these versions of the functions, we return an overflow indication by use of the pointer
argument pbool. When the operation for addi64() or neg64() results in an overflow, the
bool value pointed to by the pbool pointer argument is set to true. When no overflow oc-
curs, the value is set to false.
1. Register t5 is initialized to zero in line 15. This register is assuming the role of the
overflow indicator.
2. Register t2 is set to 1 if the first argument (in a0) is negative (line 16).
3. Register t3 is set to 1 if the second argument (in a1) is negative (line 17).
4. Register a0 is set to the result of adding the arguments (a0 + a1, in line 18).
5. If the signs of the arguments differ in line 19, the branch to "doret" is taken, since
no overflow is possible when the signs differ.
6. Otherwise, control continues to line 23 where t3 is set to the sign of the result (in
a0). Register t3 assumes the value 1, if the result is negative else is zero.
7. If the sign of the result (t3) matches the sign of the first argument (originally in
a0), then no overflow has occurred, and the branch to "doret" is taken.
8. Otherwise, control passes to line 28 where t5 is set to 1 (true).
9. Control then passes to the label of "doret", where the current boolean value in
register t5 is then stored to the caller's bool, by means of the pointer in register
a2 (argument 3).
10. Finally, control returns to the caller in line 30.
An important feature of this calling convention is that it is thread-safe. The caller supplies
its own bool variable to be updated, which is local to the calling thread. This allows several
threads to be simultaneously calling addi64() and yet each caller receives its own private
flag value for overflow.
1 .global addi64,neg64
2 .text
3
4 # ARGUMENTS:
5 # a0 int64_t operand 1
6 # a1 int64_t operand 2
7 # a2 ptr to bool_t
8 # RETURNS:
● 143
9 # a0 int64_t sum
10 # bool_t return values:
11 # 0 no overflow
12 # 1 overflow
13
14 addi64:
15 li t5,0 # Flag = false
16 slti t2,a0,0 # t2 = a0 < 0 ? 1 : 0 (sign 1)
17 slti t3,a1,0 # t3 = a1 < 0 ? 1 : 0 (sign 2)
18 add a0,a0,a1 # Sum opr 1 + opr 2
19 bne t2,t3,doret # No overflow possible: signs differ
20
21 # Signs equal: test result sign
22
23 slti t3,a0,0 # t3 = sum < 0 ? 1 : 0
24 beq t3,t2,doret # Signs equal: no overflow
25
26 # Overflowed
27
28 li t5,1 # flag = true
29 doret: sb t5,0(a2) # Update bool value by ptr
30 ret
31
32 # ARGUMENTS:
33 # a0 int64_t operand 1
34 # a1 ptr to bool_t
35 # RETURNS:
36 # a0 int64_t sum
37 # bool_t return values:
38 # 0 no overflow
39 # 1 overflow
40
41 neg64: li t5,0 # Flag = false
42 li t4,1 # t4 = 1
43 slli t3,t4,63 # t3 = max negative value
44 bne a0,t3,noprob # Branch if arg is not maximally negative
45 sb t4,0(a1) # *ptr = true (overflow)
46 ret # a0 remains the same value
47
48 noprob: neg a0,a0 # Negate argument (a0)
49 sb t5,0(a1) # *ptr = false (no overflow)
50 ret
● 144
1. Register t5 assumes the role of the overflow flag and is initialized false (line 41).
2. Register t4 is loaded with the immediate constant of 1 (line 42).
3. Register t3 is then set to the most negative number possible by shifting the value
of 1 (in t4) up by 63 bits.
4. The argument (a0) is then compared to t3 and if not equal control transfers to "no-
prob". Argument values other than the most negative value can be safely negated
without an overflow (at step 6).
5. When the branch is not taken in step 4, the value of t4 (still holding the value 1/
True) is then saved by pointer (in a1) to the caller's bool variable, and control re-
turns to the caller (line 46).
6. Otherwise, at line 48 (label "noprob"), the argument is safely negated (in a0), and
the value 0/False (in t5) is stored by pointer argument (a1) in the caller's bool
variable, prior to returning in line 50.
The main driver program is shown in Listing 9.12. Static functions report() and negtest()
test out the assembly language routines addi64() and neg64() respectively. Lines 34 and
35 test out the addi64() routine, which uses the thread-safe calling format. The addi64()
call is made in line 13, passing in the two operands and a pointer to the bool variable
declared in line 10. It is best practice to initialize values like "overflow", but the function
addi64() was designed to set the variable regardless of the overflow result. The overflow
value is reported by printf() in lines 14 and 15.
Likewise, the neg64() function is invoked at line 23, returning the overflow status declared
at line 20, by a pointer in the call. Lines 24 and 25 report the result of the call.
1 #include <stdio.h>
2 #include <stdint.h>
3 #include <stdbool.h>
4
5 extern int64_t addi64(int64_t op1,int64_t op2,bool *pbool);
6 extern int64_t neg64(int64_t op1,bool *pbool);
7
8 static void
9 report(int64_t op1,int64_t op2) {
10 bool overflow;
11 int64_t r;
12
13 r = addi64(op1,op2,&overflow);
14 printf("0x%016llX + 0x%016llX = 0x%016llX, overflow=%d\n",
15 op1,op2,r,overflow);
16 }
17
18 void
19 negtest(int64_t op) {
● 145
20 bool overflow;
21 int64_t r;
22
23 r = neg64(op,&overflow);
24 printf("- 0x%016llX = 0x%016llX, overflow=%d\n",
25 op,r,overflow);
26
27 }
28
29 int
30 main(int argc,char **argv) {
31 int64_t x, y;
32 int64_t r;
33
34 report(0x4EEEEEEE99999999ll,0x2AAAAAAAABBBBBBBll);
35 report(0x4EEEEEEE99999999ll,0x7AAAAAAAABBBBBBBll);
36
37 negtest(-45609);
38 negtest(45609);
39 negtest((1ll << 63) + 1);
40 negtest(1ll << 63);
41
42 return 0;
43 }
$ cd ~/riscv/repo/09/ovf64/qemu64
$ gcc -g addi64.S main.c
$ ./a.out
0x4EEEEEEE99999999 + 0x2AAAAAAAABBBBBBB = 0x7999999945555554, overflow=0
0x4EEEEEEE99999999 + 0x7AAAAAAAABBBBBBB = 0xC999999945555554, overflow=1
- 0xFFFFFFFFFFFF4DD7 = 0x000000000000B229, overflow=0
- 0x000000000000B229 = 0xFFFFFFFFFFFF4DD7, overflow=0
- 0x8000000000000001 = 0x7FFFFFFFFFFFFFFF, overflow=0
- 0x8000000000000000 = 0x8000000000000000, overflow=1
$
We see that the addi64() function performs the 64-bit addition and returns the correct
overflow status for the two tests (first two lines of output). The last four lines of output are
the result of testing neg64(). From the output shown, it is evident that the results of these
tests are correct and that the one overflow case was identified.
● 146
Questions emerge when designing functions like addi64() and neg64(). When the overflow
occurs for neg64() in this program, we chose to return the input argument unchanged.
Some applications might prefer to return the value 0x7FFFFFFFFFFFFFFF instead, which is
only off by one. On the other hand, this is still not a correct result.
For addi64(), the incomplete sum was returned instead when an overflow occurred. If the
application can tolerate the overflow (as wraparound), then perhaps this is acceptable.
Here, it is assumed that the caller will take the appropriate action when the overflow indi-
cator is set to true.
9.9. Summary
At this point in the book, you should feel good about your progress in the RISC-V assembly
language. You have covered most of the RISC-V instruction set opcodes that you'll normally
use. There are some others that are privileged or used in special situations, which will be
left for the advanced students to study on their own.
The next chapter covers multiply and divide opcodes. These are available when extension
M for RISC-V is provided. This M extension is included for the ESP32C3 device (RV32) and
for QEMU (RV64I) used by our Fedora Linux. With the exception of floating point and mul-
tiply/divide our treatment of opcodes is complete. The beauty of RISC-V is not having to
memorize volumes of opcodes!
● 147
Integer multiplication and division are essential mathematic operations. Yet RISC-V defines
these as an extension M, which permits vendors to create CPUs without this capability to
save on cost. These operations can, of course, be performed in software but at the expense
of CPU time. Given the clear advantages of these hardware operations, devices like the
ESP32-C3 do, however, provide for extension M. These added opcodes boost the overall
performance of the CPU.
Signed Multiplication
Multiplication can be performed as unsigned or signed integer operations. Unsigned mul-
tiplication is often used for C language subscripting operations within an array or matrix,
where the first element is at subscript 0. Other languages such as Ada permit subscripts to
use negative numbers, and thus must use signed multiplication.
When multiplying signed numbers, the product's sign obeys the following relationships:
● 148
In addition to the above characteristics, one has to be careful about the semantics of a
divide by zero and the possibility of overflow.
Signed Division
When dividing signed numbers, the quotient's and remainder's sign obeys the following
rules:
It should be noted that the C language compiler "/" and "%" operators follow these con-
ventions.
The destination register receives the lower XLEN bits of the product. To obtain the upper
XLEN bits of the product, the "mulhs" (signed) or "mulhu" (unsigned) opcodes are available.
● 149
The destination register rdh receives the high order product word while rdl receives the
low order product word. Register rdh cannot be the same as rs1 or rs2 for the optimized
operation. For unsigned products use the following sequence instead:
For the multiply to function optimally, the following rules must be observed:
Multiplication is a relatively expensive operation. When the silicon is designed for it, the first
of the pair of opcodes can evaluate the XLEN-bit+XLEN-bit product internally but stores
only the high order half of the product in the destination register rdh. When the second
opcode of the pair is processed and the rules above are obeyed, then the CPU can store
the lower half of the precomputed product in the current destination register. In this opti-
mization, the only added overhead is the fetch and decode of the second instruction in the
sequence. A failure to observe the above rules results in two completely separate multipli-
cation operations being computed, along with added execution time. Figure 10.1 shows a
creature that simply loves to multiply.
● 150
1 .global ufact32
2 .text
3
4 # ARGUMENTS:
5 # a0 uint32_t x
6 #
7 # RETURNS:
8 # a0 uint32_t factorial x!
9 #
10
11 ufact32:li t1,1 # t1 = 1
12 mv t0,a0
13
14 1: addi t0,t0,-1 # t0 = a0 - 1
15 ble t0,t1,2f # Branch if t0 <= 1
16 mul a0,a0,t0 # a0 *= t0
17 j 1b # Loop until t1 <= 1
18
19 2: ret
1. Upon entry to the function, the constant 1 is loaded into temporary register 1, for
use in comparisons (line 11).
2. The argument x is copied to temporary register t0 (line 12).
3. At the top of the loop in line 14, the value in t0 is decremented.
4. A branch is taken from line 15 if the value of t0 is less than or equal to 1. The
branch when taken, passes control to the return instruction in line 19.
5. The value that's currently in a0 (initially x) is multiplied by register t0 (initially x-1)
at line 16. The result replaces a0.
6. The control then passes from line 17 to the top of the loop at line 14.
1 #include <stdio.h>
2 #include <stdint.h>
3
4 extern uint32_t ufact32(uint32_t x);
5
● 151
6 void
7 app_main(void) {
8
9 for ( unsigned ux=0; ux<=12; ++ux ) {
10 uint32_t f = ufact32(ux);
11
12 printf("%2u ! => %u 0x%08X\n",ux,f,f);
13 }
14 printf("Done.\n");
15 }
$ cd ~/riscv/repo/10/factorial
$ idf.py build
...
$ idf.py -p <<<yourport>>> flash monitor
...
I (257) cpu_start: Starting scheduler.
0 ! => 0 0x00000000
1 ! => 1 0x00000001
2 ! => 2 0x00000002
3 ! => 6 0x00000006
4 ! => 24 0x00000018
5 ! => 120 0x00000078
6 ! => 720 0x000002D0
7 ! => 5040 0x000013B0
8 ! => 40320 0x00009D80
9 ! => 362880 0x00058980
10 ! => 3628800 0x00375F00
11 ! => 39916800 0x02611500
12 ! => 479001600 0x1C8CFC00
Done.
The hexadecimal value was printed on the right so that you can visualize the limit of this
calculation (eight hexadecimal digits). If you were to continue further, the result would
overflow and be incorrect. Under Linux/Mac, you can check this using the Linux bc com-
mand:
$ bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
● 152
39916800 * 12
479001600
479001600 * 13
6227020800
The 12! result is indeed 479001600, but 13! overflows 32-bits to the value 6227020800
(0x17328CC00).
Notice that even though a product can be XLEN+XLEN bits in size, there is no provision
to divide an integer of that size. Only XLEN-bits can be hardware divided by extension M.
Opcode rem/remu
In addition to division, a remainder is often required. The "rem" and "remu" opcodes pro-
vide the remainder after dividing an XLEN-bit dividend value by an XLEN-bit divisor, for
signed and unsigned numbers respectively:
● 153
For the division operation to function optimally, the following rules must be observed:
• x÷0→-1
• x% 0→0
• x÷0→2XLEN-1
• x% 0→x
If there is a possibility of dividing by zero, it is best to test the divisor prior to performing
the division or remainder.
• 2XLEN/-1→2XLEN-1
• 2XLEN-1mod-1→0
In other words, dividing the most negative number by a negative one produces a quotient
that cannot be represented as a positive number, with a remainder of zero. If this is a
possibility, an extra test should be included when dividing it by a negative one. Then if the
dividend is the most negative number, you know that you have an overflow on your hands.
● 154
1 .global safediv
2 .text
3
4 # ARGUMENTS:
5 # a0 signed dividend
6 # a1 signed divisor
7 # a2 pointer to remainder
8 # a3 pointer to bool (div by zero)
9 # a4 pointer to bool (overflow)
10 #
11 # RETURNS:
12 # a0 quotient
13 # remainder by pointer
14 # div by zero flag by pointer
15 # overflow flag by pointer
16 #
17
18 safediv:
19 mv t3,a0 # t3 = dividend
20 mv t4,a1 # t4 = divisor
21
22 li t5,0 # t5 = true when overflow
23 li t6,0 # t6 = true if divisor zero
24 bnez t4,nzero
25 li t6,1 # t6 = true (div by zero)
26
27 nzero: div a0,t3,t4 # a0 = dividend / divisor
28 rem a1,t3,t4 # a1 = dividend % divisor
29 sw a1,0(a2) # Return remainder
30
31 li t2,-1
32 bne t4,t2,noovf # Divisor != -1 => no overflow
33
34 slli t2,t2,31 # t2 now maximally -ve
35 bne t2,t3,noovf # Branch if dividend not max -ve
36 li t5,1 # Else set overflow t5 = true
37
38 noovf: sb t5,0(a4) # Return overflow flag
39 sb t6,0(a3) # Return div by zero flag
40 ret
● 155
1. The dividend and divisor arguments are copied to temporary registers t3 and t4
respectively (lines 19 and 20).
2. The temporary register t5 is initialized false as the flag for overflow (line 22).
3. Temporary register t6 is initialized false as the flag for divide-by-zero (line 23).
4. The divisor is tested for zero in line 24. If it is not zero, a branch is taken to nzero
(step 6).
5. Otherwise, the flag in t6 is set to true (line 25) to indicate divide-by-zero.
6. The division is performed in line 27, placing the result into register a0, which is the
return value (line 27). The values divided are in temporaries t3 and t4.
7. The remainder, which is computed at step 6, is placed into register a1 at line 28.
This is an optimized operation since it obeys the optimized operation rules.
8. The remainder is stored by pointer back to the caller's variable in line 29 (the
pointer is in register a2).
9. Temporary register t2 is loaded with a value of -1 (line 31).
10. In line 32, if the divisor (in t4) is not equal to -1 (in t2), then a branch is made to
label "noovf", since no overflow is possible.
11. Otherwise, in line 34, the value in t2 is shifted left logical for 31 bits, creating the
most negative 32-bit value in t2.
12. If the dividend (in t3) is not equal to the most negative number (in t2), a branch
is made to label "noovf" (line 35).
13. Otherwise, we set temporary register t5, holding the overflow flag to true (line 36).
14. At the label "noovf", we store both flags – register t5 to the caller's overflow flag
(pointer in a4), and register t6 to the caller's divide-by-zero flag (pointer in a3),
in lines 38 and 39.
15. Return to the caller in line 40.
The driver program main.c, is provided in Listing 10.4. The loop in lines 30 to 45 call the
function safediv() with six test cases, while reporting the results.
1 #include <stdio.h>
2 #include <stdint.h>
3 #include <stdbool.h>
4
5 extern int32_t safediv(
6 int32_t divident,
7 int32_t divisor,
8 int32_t *remainder,
9 bool *divbyzero,
10 bool *overflow
11 );
12
13 struct s_div {
14 int32_t dividend;
15 int32_t divisor;
● 156
16 };
17
18 void
19 app_main(void) {
20 static struct s_div const tests[] = {
21 { 23, 3 },
22 { -23, 3 },
23 { 46, 0 },
24 { 0x80000000, -2 },
25 { 0x80000000, -1 },
26 { 0x80000000, 15 },
27 { 0, 0 }
28 };
29
30 for ( unsigned ux=0;
31 tests[ux].dividend || tests[ux].divisor;
32 ++ux ) {
33 int32_t dividend = tests[ux].dividend;
34 int32_t divisor = tests[ux].divisor;
35 int32_t quotient, remainder;
36 bool divbyzero, overflow;
37
38 quotient = safediv(dividend,divisor,
39 &remainder,&divbyzero,&overflow);
40
41 printf("%d / %d => %d remainder %d; "
42 "divbyzero=%d, overflow=%d\n",
43 dividend, divisor, quotient, remainder,
44 divbyzero, overflow);
45 }
46 printf("Done.\n");
47 }
$ cd ~/riscv/repo/10/safediv
$ idf.py build
...
$ idf.py -p <<<yourport>>> flash monitor
...
I (257) cpu_start: Starting scheduler.
23 / 3 => 7 remainder 2; divbyzero=0, overflow=0
-23 / 3 => -7 remainder -2; divbyzero=0, overflow=0
46 / 0 => -1 remainder 46; divbyzero=1, overflow=0
● 157
In the resulting output, the first two lines report valid quotient and remainders with no
divide-by-zero or overflow flags set. The third line does, however, report a divide-by-zero
failure (and also confirms the quotient result of -1 and a remainder of 46, which was the
dividend).
The second to last line reported an overflow. This is correct because a maximally negative
32-bit number (-2147483648) divided by -1, should be the positive value +2147483648,
which cannot be represented by a signed 32-bit number.
1 .global gcd64
2 .text
3
4 # ARGUMENTS:
5 # a0 Number a
6 # a1 Number b
7 #
8 # RETURNS:
9 # a0 Returned GCD(a,b)
10
11 gcd64: bge a0,a1,1f # Branch if a >= b
12
13 # swap a0 and a1
14
15 mv t0,a0
16 mv a0,a1
17 mv a1,t0
18
● 158
Now let's see how that mapped out to RISC-V in the program gcd64.S:
The main program for this test harness is provided in Listing 10.6.
1 #include <stdio.h>
2 #include <stdint.h>
3
4 extern int64_t gcd64(int64_t a,int64_t b);
5
6 struct s_test {
7 int64_t a;
8 int64_t b;
9 };
10
● 159
11 int
12 main(int argc,char **argv) {
13 static struct s_test const tests[] = {
14 { 12, 10 },
15 { 51, 21 },
16 { 31 * 3, 31 * 7 },
17 { 211*5, 211 },
18 { 0, 0 }
19 };
20
21 for ( unsigned ux=0; tests[ux].a && tests[ux].b; ++ux ) {
22 int64_t g = gcd64(tests[ux].a,tests[ux].b);
23 printf("gcd(%d,%d) => %d\n",
24 tests[ux].a,tests[ux].b,g);
25 }
26 return 0;
27 }
The loop on lines 21 to 25 uses the function with different test values. The result is printed
on lines 23 and 24. Compile and run this program under Fedora Linux as follows:
$ cd ~/riscv/repo/10/gcd/qemu64
$ gcc -g gcd64.S main.c
$ ./a.out
gcd(12,10) => 2
gcd(51,21) => 3
gcd(93,217) => 31
gcd(1055,211) => 211
10.13. Combinations
To capitalize on the many things we've learned so far in this book, let's try our hand on
an RV64 project that is a little more involved. Let's compute a combination C(n,r) function
such that out of n objects, determine how many unique samples of size r, can be obtained.
This is calculated with the formula:
The listing for the assembly language nCr() function is provided in Listing 10.7.
● 160
4 # ARGUMENTS:
5 # a0 uint64_t n of C(n,r)
6 # a1 uint64_t r of C(n,r)
7 #
8 # RETURNS:
9 # a0 uint64_t C(n,r) = n! / (r! * (n - r)!)
10
11 nCr: mv t5,a0 # Save t5 = n
12 mv t4,a1 # Save t4 = r
13 sub t6,t5,t4 # Save t6 = (n - r)
14
15 jal t0,fact # n! (in a0)
16 mv t5,a0 # t5 = n!
17
18 mv a0,t4 # a0 = r
19 jal t0,fact # r!
20 mv t4,a0 # t4 = r!
21
22 mv a0,t6 # a0 = (n - r)
23 jal t0,fact # (n - r)!
24 mv t6,a0 # t6 = (n - r)!
25
26 mul t3,t4,t6 # t3 = r! * (n-r)!
27 div a0,t5,t3 # a0 = n! / t3
28 ret
29
30 #
31 # Internal factorial routine
32 #
33 fact: li t1,1 # t1 = 1
34 mv t2,a0 # t2 = n
35
36 1: addi t2,t2,-1 # t2 = a0 - 1
37 ble t2,t1,2f # Branch if t2 <= 1
38 mul a0,a0,t2 # a0 *= t2
39 j 1b # Loop until t1 <= 1
40
41 2: jr t0 # Internal return via t0
From the formula, it is seen that the factorial is needed in three places. Rather than in-
voking the overhead of saving and restoring to/from the stack, the internal function fact(),
starting in line 33, uses temporary register t0 as the linkage register (the return occurs
in line 41, through t0). Otherwise, this internal routine is much the same as the factorial
function that we've seen before.
● 161
It is possible to optimize this further, to reduce the number of registers needed. But, unless
you can reduce the number of steps involved, it may not be worth it. I'll leave that exercise
for the reader.
The main driver for the test is provided in Listing 10.8. It is like many test harness pro-
grams that we've seen before, testing the nCr() function with different values.
1 #include <stdio.h>
2 #include <stdint.h>
3
4 extern uint64_t nCr(int64_t n,int64_t r);
5
6 struct s_test {
7 uint64_t n;
8 uint64_t r;
9 };
10
11 int
12 main(int argc,char **argv) {
13 static struct s_test const tests[] = {
14 { 3, 2 }, // 3
15 { 4, 3 }, // 4
16 { 9, 3 }, // 84
17 { 13, 7 }, // 1716
18 { 0, 0 }
19 };
20
21 for ( unsigned ux=0; tests[ux].n != 0; ++ux ) {
22 uint64_t ncr = nCr(tests[ux].n,tests[ux].r);
23 printf("C(n=%lu,r=%lu) => %lu\n",
24 tests[ux].n,
25 tests[ux].r,
26 ncr);
● 162
27 }
28
29 return 0;
30 }
There are online combination calculators available where you can verify these results.[1]
What was interesting about this assignment was that an internal function call was made
through the temporary register t0 to calculate the factorials. Since we didn't need to save
any registers or use any stack-based variables, the entire calculation was register-based
for the ultimate efficiency. With the use of the RISC-V M extension, the entire computation
was performed in hardware. The number of registers available in a RISC CPU often permits
optimal execution.
10.14. Summary
This chapter has shown the utility and use of the multiply and division operations provided
by the RISC-V extension M. When this extension is not provided, execution time suffers
considerably because these operations must be performed in software.
The next chapter will change gears somewhat and use what you've learned and apply it to
addressing, indexing and array subscripting.
Bibliography
[1] CalculatorSoup, L. L. C. (n.d.). Combinations calculator (NCR). CalculatorSoup.
Retrieved May 31, 2022, from https://www.calculatorsoup.com/calculators/
discretemathematics/combinations.php
● 163
Many times, you'll need to access an array or matrix from within an assembler routine to
achieve some performance goal. This chapter introduces some of the finer points of work-
ing with pointers and subscripts within the assembler function. Within C/C++, you can
increment a pointer without giving it much thought. But in assembler language, there are
some traps to watch out for. Additionally in this chapter, we'll examine some string-related
examples.
Testing the address for zero will branch when the pointer is NULL/nullptr.
● 164
Row-major order has the column number incrementing first, followed by the row number.
A 3x3 matrix is implemented in memory storage as a linear array of 9 elements. The very
first matrix element [0][0] is the first member of the array at index 0. The second row
[1][0] starts at index 1*3 + 0, or index 3. The same matrix is shown in Figure 11.2 with
linear array index numbers.
There is one more important aspect to this matrix business to remember. Each element
may be composed of multiple bytes, such as a matrix of "int" types. This must be taken into
account when computing the byte address of a particular element.
x=i*c+j
For example, if you have a 5×7 matrix, then the index to the m[3][2] element, is computed as:
23 = 3 * 7 + 2
Since a 5×7 matrix occupies the same storage as a 35-element linear array a, the index of
m[3][2] is equivalent to accessing a[23].
This time let's examine the main program first so that we're clear about the program ele-
ments being operated upon. Listing 11.1 illustrates the C main program that defines two
● 165
matrices and initializes each of them according to their size using our assembly language
routine identm().
1 #include <stdio.h>
2 #include <stdint.h>
3
4 extern void identm(void *matrix,unsigned n);
5
6 void
7 app_main(void) {
8 {
9 unsigned const n=3;
10 int matrix[n][n];
11
12 identm(&matrix,n);
13
14 printf("%u x %u identity matrix:\n",n,n);
15 for ( unsigned ux=0; ux<n; ++ux ) {
16 putchar('[');
17 for ( unsigned uy=0; uy<n; ++uy ) {
18 printf("%3d ",matrix[ux][uy]);
19 }
20 puts("]");
21 }
22 }
23 {
24 unsigned const n=6;
25 int matrix[n][n];
26
27 identm(&matrix,n);
28
29 printf("%u x %u identity matrix:\n",n,n);
30 for ( unsigned ux=0; ux<n; ++ux ) {
31 putchar('[');
32 for ( unsigned uy=0; uy<n; ++uy ) {
33 printf("%3d ",matrix[ux][uy]);
34 }
35 puts("]");
36 }
37 }
38
39 puts("Done");
40 fflush(stdout);
41 }
● 166
This program defines a 3×3 matrix in line 10 and a 6×6 matrix in line 25. The matrix ele-
ments are of type "int", which are 32 bits for the ESP32-C3. As created these matrices are
uninitialized. The calls to identm() in lines 12 and 27 will, however, initialize them according
to their dimensions.
Line 4 declares the function prototype for our assembly language routine:
The first argument is declared as a void pointer to avoid limitations of the C language. We
might be tempted to write:
but the compiler will only accept "[]" for the last dimension of the matrix. To avoid this
problem, we define that argument to be a byte address of the matrix, without any type
checking by the compiler. The second parameter declares the dimension of the square
matrix n.
The assembler language routine is illustrated in Listing 11.2. Let's break down its operation:
1. The pointer argument arrives in register a0, as a pointer to the first byte of the
matrix.
2. The dimension n arrives in register a1.
3. Line 10 multiplies n times n to arrive at the total number of elements of the matrix.
This is placed in temporary register t1.
4. Temporary register t0 is loaded with the constant 4 (line 11). This corresponds to
the sizeof(int).
5. Finally, t1 is multiplied by t0 (with the sizeof(int)) so that t1 becomes the total
number of bytes that the matrix occupies (line 12).
6. The pointer to the matrix (in a0) is added to t1 (total # of bytes) with the result
placed into temporary register t2 (line 13). This is the pointer to the last byte after
the end of the matrix storage. This simplifies the end of the loop handling later.
7. Temporary register t5 is loaded with the constant 1, to be stored on the diagonal
elements (line 14).
8. At the top of the outer loop at label "put1", we store the value of 1 (in t5) at the
current pointer in a0 (line 16). Note that we are storing a "word" or 4 bytes.
9. Now the pointer in a0 is incremented by 4 (sizeof(int)) in line 17.
10. The temporary register t3 is loaded with n (in a1) at line 19. This will control the
number of zeros that will be stored.
11. The loop in lines 20 to 24 will then store n zeros. There are n zeros between each
1 value on the diagonal.
12. A test is made at the top of the inner loop at line 20 to see if we have reached the
end of the matrix. If so, the branch is taken to label "end" to return to the caller.
13. Otherwise, a zero is stored from line 21.
14. The pointer is incremented by sizeof(int) at line 22.
● 167
15. The counter in t3 is decremented (line 23) and branches back to "loop" (line 20) if
the counter is non-zero (from line 24).
16. Otherwise, control falls through to line 25. As long as the pointer is still within the
matrix we loop back to the outer loop label "put1" at line 16.
17. Eventually, line 20 will branch to "end" and return to the caller.
1 .global identm
2 .text
3
4 # extern void identm(void *matrix,unsigned n)
5 #
6 # ARGUMENTS:
7 # a0 Pointer to int matrix[n][n]
8 # a1 unsigned n
9
10 identm: mul t1,a1,a1 # t1 = total elements
11 li t0,4 # sizeof(int) = 4
12 mul t1,t1,t0 # t1 *= sizeof(int)
13 add t2,a0,t1 # Ptr of end of matrix
14 li t5,1 # t5 = 1
15
16 put1: sw t5,0(a0) # *ptr = 1
17 addi a0,a0,4 # ptr += 4
18
19 mv t3,a1 # t3 = n
20 loop: bge a0,t2,end # At end of matrix?
21 sw x0,0(a0) # *matrix + ptr = 0
22 addi a0,a0,4 # ptr += 4
23 addi t3,t3,-1 # --t3
24 bnez t3,loop
25 blt a0,t2,put1 # Loop again if not at end
26
27 end: ret
The main lesson in this example is the need to be aware of the size of the matrix element.
When I initially wrote this routine it failed because I was incrementing the pointer (in a0)
by 1, rather than by the sizeof(int), which is 4. This is a very easy mistake to make, so
please take note.
$ cd ~/riscv/repo/11/identm
$ idf.py build
...
● 168
From the displayed output, it is verified that the initialization was correctly done.
Tip: The observant reader will notice that the RISC-V opcodes for compare and set only
includes "slt", "sltu" and the immediate constant forms. But what if you wanted to test
for greater-than-or-equal-to for example? This can be done by reversing the operands of
the comparison. If you want to test a>=b, then use:
1. The pointer to the first byte of the string is passed as an argument in register a0.
This pointer is copied to temporary register t0 at line 13, since we need a0 to re-
turn the result.
2. The register a0 is zeroed at line 14. This will be the byte count.
3. The loop starts at line 16, loading into register t1 the byte at the pointer (in t0).
Be aware that this is a signed character load (with sign extended) but that doesn't
affect this algorithm.
4. A branch is taken from line 17 to label "end" (step 7), if the byte we just loaded
from memory was zero (i.e. a null byte).
5. If control continues to line 18, we then increment a0, which holds the current
string length.
● 169
6. The pointer value in t0 is incremented in line 19, before looping back to line 16.
7. The register a0 already has the accumulated string length to return, so the control
returns to the caller in line 22.
1 .global strlen32
2 .text
3
4 # extern int strlen(char const *s)
5 #
6 # ARGUMENTS:
7 # a0 Pointer to string
8 #
9 # RETURNS:
10 # a0 (int) string length
11
12 strlen32:
13 mv t0,a0 # t0 = char const *ptr
14 li a0,0 # Zero strlen
15
16 loop: lb t1,0(t0) # t1 = *ptr
17 beqz t1,end # Branch if Null byte
18 addi a0,a0,1 # Increment strlen
19 addi t0,t0,1 # ++ptr
20 j loop
21
22 end: ret
The main program is shown in Listing 11.4, which calls strlen32() with a few different
strings and prints the results.
1 #include <stdio.h>
2 #include <stdint.h>
3
4 extern int strlen32(char const *s);
5
6 void
7 app_main(void) {
8 static char const *tests[] = {
9 "One",
10 "Three",
11 "Ten four",
12 "",
13 "This is the end!",
14 NULL
● 170
15 };
16
17 for ( unsigned ux=0; tests[ux] != NULL; ++ux ) {
18 printf("strlen32('%s') => %d\n",
19 tests[ux],strlen32(tests[ux]));
20 }
21
22 puts("Done");
23 fflush(stdout);
24 } .
$ cd ~/riscv/repo/11/strlen
$ idf.py build
...
$ idf.py -p <<<yourport>>> flash monitor
I (257) cpu_start: Starting scheduler.
strlen32('One') => 3
strlen32('Three') => 5
strlen32('Ten four') => 8
strlen32('') => 0
strlen32('This is the end!') => 16
Done
From these results, we can see that the function() produced the correct results.
This function copies characters from pointer src to pointer dest, until a null byte is encoun-
tered in the source string or n characters have been copied, whichever occurs first. When
the src string is less than n characters, null bytes are appended until n characters have
been placed into the destination buffer. Note that if the src string is greater or equal to n
in length, there is no null byte placed into the destination buffer. This is often the source of
many a C language program bug!
My complaint about this function is about efficiency. If you have a large receiving buffer, say
1024 bytes in length, and you happen to have a short string to copy into it, then the call is
wasteful. For example, assume the following:
● 171
char buf[1024];
...
strncpy(buf,"abc",sizeof buf);
The strncpy() function will copy the three characters "abc" and then fill the remaining 1021
bytes of buf with null bytes. This results in a nice and clean buf array but is wasteful of CPU
cycles. Cost-wise, it would be best if strncpy() copied the three characters "abc" and only
one null byte. So let's implement our own strncpy32() function to do exactly that.
Listing 11.5 illustrates our more efficient strncpy() function, named strncpy32() to avoid
affecting other functions that may depend upon the original behaviour. Let's explain its
operation:
1. Registers a0, a1 and a2 receive the calling arguments. Note that we return the
argument dest, so it remains left alone in register a0 during this call.
2. Register t6 has the dest pointer copied into it in line 15. We will be incrementing
this pointer as the execution proceeds.
3. Register t5 receives the calculated end of the buffer pointer in line 16. This is the
original dest argument plus the value n.
4. The top of the loop begins in line 18. The branch to label "end" is taken if our work-
ing dest pointer in t6 has gone past the end of the buffer.
5. Otherwise, we do an unsigned byte load in line 19, from the src pointer.
6. The src pointer is incremented by 1 (line 20).
7. If the byte loaded into t4 at line 19 is a null byte, then branch to label "nul" (line
21).
8. Otherwise, store the byte in t4 at the destination buffer using the working pointer
in t6 (line 22).
9. The dest pointer is incremented in line 23, and the loop repeats starting again at
line 18.
10. If a null byte is encountered at line 21, we arrive at line 26. Here one null byte is
stuffed into the dest buffer using working pointer t6.
11. Arriving at label "end" from line 26 or line 18, causes execution to return to the
caller. The unmodified dest pointer in a0 is returned.
1 .global strncpy32
2 .text
3
4 # extern char *strncpy32(char *dest,char const *src,size_t n);
5 #
6 # ARGUMENTS:
7 # a0 char *dest (also returned)
8 # a1 char const *src
9 # a2 size_t n
10 #
11 # RETURNS:
12 # a0 char *dest
● 172
13
14 strncpy32:
15 mv t6,a0 # t6 = dest ptr
16 add t5,a0,a2 # Points past end of dest buf
17
18 loop: bge t6,t5,end # Branch if we passed dest buf end
19 lbu t4,0(a1) # Load byte from source
20 addi a1,a1,1 # ++src
21 beqz t4,nul # Branch if null byte
22 sb t4,0(t6) # Copy byte to dest
23 addi t6,t6,1 # ++dest
24 j loop
25
26 nul: sb x0,0(t6) # Store null byte
27 end: ret
The main program to test our strncpy32() function is provided in Listing 11.6. The dest
buffer is declared in line 16 to hold 8 characters. Various tests from the for loop in line 19.
To validate our test, we fill the array buf with 8 character 'X' bytes in line 20. Then we call
our assembler routine strncpy32() at line 21, taking note of the returned pointer in variable
rp. This is checked in the assertion at line 23. Lines 24 to 32 then report the results of what
the array buf contains.
1 #include <stdio.h>
2 #include <string.h>
3 #include <assert.h>
4
5 extern char *strncpy32(char *dest,char const *src,size_t n);
6
7 void
8 app_main(void) {
9 static char const *tests[] = {
10 "abc",
11 "Main",
12 "",
13 "1234567890",
14 NULL
15 };
16 char buf[8], *rp;
17 char const *src;
18
19 for ( unsigned ux=0; (src = tests[ux]) != NULL; ++ux ) {
20 memset(buf,'X',sizeof buf);
● 173
21 rp = strncpy32(buf,src,sizeof buf);
22
23 assert(rp == buf);
24 printf("src='%s', n=%u, buf[] = ",src,sizeof buf);
25 for ( unsigned u=0; u<sizeof buf; ++u ) {
26 if ( buf[u] )
27 printf("'%c'%c",
28 buf[u],
29 u+1<sizeof buf?',':'\n');
30 else printf("NUL%c",
31 u+1<sizeof buf?',':'\n');
32 }
33 }
34
35 puts("Done");
36 fflush(stdout);
37 }
$ cd ~/riscv/repo/11/strncpy32
$ idf.py build
...
idf.py -p <<<yourport>>> flash monitor
...
I (257) cpu_start: Starting scheduler.
src='abc', n=8, buf[] = 'a','b','c',NUL,'X','X','X','X'
src='Main', n=8, buf[] = 'M','a','i','n',NUL,'X','X','X'
src='', n=8, buf[] = NUL,'X','X','X','X','X','X','X'
src='1234567890', n=8, buf[] = '1','2','3','4','5','6','7','8'
Done
How did we do? When the short string "abc" was copied, the destination buffer (buf) re-
ceived only the letters 'a', 'b', 'c' and the one null byte at the end, as expected. The library
strncpy() routine would have filled the entire buffer with null bytes. In the third test, an
empty source string was tested, but passed with just one null byte. In the last case, we
supplied a longer source string than the destination buffer could hold. This too passed, but
be careful in this case, since there is no terminating null byte.
● 174
unsigned int value. This function will also return a boolean indicating whether or not the
conversion was successful.
Listing 11.7 illustrates the function struint(). This function will skip all space characters but
will fail if any other non-digit character is encountered.
1 .global struint
2 .text
3
4 # extern unsigned struint(char const *text,bool *ok)
5 #
6 # ARGUMENTS:
7 # a0 char const *text (text to convert)
8 # a1 pointer to bool
9 #
10 # RETURNS:
11 # a0 unsigned value (when ok is true)
12 # ok:
13 # true, conversion successful
14 # false, conversion failed
15
16 struint:
17 mv t6,a0 # t6 = ptr to test
18 li a0,0 # Accumulator for uint
19 li t2,0 # Digit count
20 li t4,'0'
21 li t3,'9'
22 li t1,' ‚
23 li t0,10
24
25 loop: lbu t5,0(t6) # Load text char
26 beqz t5,nulbyt # Branch if null byte
27 beq t5,t1,skip # Skip white space
28 bgt t5,t3,fail # char > '9'?
29 blt t5,t4,fail # char < '0'?
30 andi t5,t5,0x0F # Mask out 0x00 to 0x09
31 mul a0,a0,t0 # a0 *= 10
32 add a0,a0,t5 # a0 += t5
33 addi t2,t2,1 # Bump digit count
34 skip: addi t6,t6,1 # ++text ptr
35 j loop
36
37 nulbyt: beqz t2,fail # Fail if no digits
38 li t0,1
39 sb t0,0(a1) # ok = true
40 ret
● 175
41
42 fail: li t0,0
43 exit: sb t0,0(a1) # ok = false
44 ret
1. The pointer to the input text arrives in register a0, and a1 is a pointer to the bool
that will receive a pass (1) or fail (0) status.
2. Register t6 receives a working copy of the input text address from a0 (line 17).
3. Register a0 is the value to be returned. It is initialized to zero in line 18.
4. The digit count in t2 is initialized to zero (line 19).
5. Registers t4 and t3 are loaded with the ASCII characters '0' and '9' respectively, to
be used for comparison purposes (lines 20, 21).
6. Register t1 loads a space character, for comparison purposes (line 22).
7. Temporary register t0 loads the value 10, to be used for multiplication (line 23).
8. The loop begins at line 25 by loading a text character into t5.
9. A branch is taken to "nulbyt" if the loaded character is 0x00 (line 26).
10. A branch to "skip" is taken if the character loaded was a space (line 27).
11. Lines 28 and 29 test if the character is a digit. If not, a branch is made to "fail".
12. The last 4 bits of the digit are masked out in line 30.
13. Line 31 multiplies the accumulated unsigned value in register a0, by 10.
14. Then the isolated digit from step 12 is added to a0 (line 32).
15. Line 33 increments the digit count (for validating results).
16. Line 34 increments the text byte pointer, before returning to the top of the loop
(at step 8).
17. If execution lands at "nulbyt", the digit count in t2 is checked. If there are no digits,
the branch to "fail" is taken (line 37).
18. Otherwise, ok is set to true, and control returns to the caller.
19. If execution lands at "fail", the value of "ok" is set to false.
There is considerable room for improvement in the error checking in this routine, but it was
kept simple to highlight the important concepts. The main driver program is illustrated in
Listing 11.8, calling struint() with various strings and reporting the results.
1 #include <stdio.h>
2 #include <stdbool.h>
3
4 extern unsigned struint(char const *text,bool *ok);
5
6 void
7 app_main(void) {
8 static char const *tests[] = {
9 " 123",
● 176
10 "0234",
11 "0",
12 "905",
13 "9_05",
14 " 907, ",
15 NULL
16 };
17 bool ok;
18 unsigned v;
19
20 for ( unsigned ux=0; tests[ux] != NULL; ++ux ) {
21 ok = 0;
22 v = struint(tests[ux],&ok);
23
24 printf("struint('%s') => %u, ok=%d\n",
25 tests[ux],v,ok);
26 }
27
28 puts("Done");
29 fflush(stdout);
30 }
$ cd ~/riscv/repo/11/struint
$ idf.py build
...
$ idf.py -p <<<yourport>>> flash monitor
...
I (257) cpu_start: Starting scheduler.
struint(' 123') => 123, ok=1
struint('0234') => 234, ok=1
struint('0') => 0, ok=1
struint('905') => 905, ok=1
struint('9_05') => 9, ok=0
struint(' 907, ') => 907, ok=0
Done
From the test run output, all of the input texts converted ok except for the last two. The
failures were due to the underscore (_) and comma (,) characters that were encountered
in the input string.
● 177
sll a6,a0,3 # a6 = a0 * 8
sll a0,a0,1 # a0 *= 2
add a0,a0,a6 # a0 += a6
And retry the build, flash and run. If you made the change correctly, the execution results
will be identical. These three instructions perform the multiplication of a0 by 10 as follows:
1. Shift a0 left 3 bits, which effectively multiplies the value by 8 and save this in a6.
2. Shift a0 left 1 bit, which effectively multiples the value by 2.
3. Add these two and the result is the original value multiplied by 10.
Is this faster than using the hardware multiply? Without timing information or doing a
benchmark on the ESP32-C3, it is difficult to know.
Tip: When processing with a character pointer, it is often convenient to look at the next
or previous character without changing the pointer.
If the pointer variable in C is cp, then *cp or cp[0] returns the current character while
cp[1] and cp[-1] return the next and prior characters respectively. In RISC-V assembly,
if the pointer is held in register t3, then "0(t3)" references the current character, while
expressions "1(t3)" and "-1(t3)" access the next and prior characters respectively.
1 .global uintstr
2 .text
3
4 # extern char *uintstr(unsigned u,char const *buf,unsigned buflen)
5 #
6 # ARGUMENTS:
7 # a0 unsigned value to convert to text
8 # a1 pointer to buf
9 # a2 max length for buf
10 #
11 # RETURNS:
12 # a0 pointer to buf
13
14 uintstr:
● 178
The uintstr() function uses the quotient and remainder from a division, to convert an un-
signed value into a text string, one digit at a time. When the function is called register a0
contains the value to be converted, a1 points to the receiving character buffer, and a1 is the
maximum length for that buffer. The steps for conversion are as follows:
1. Register t5 is set to the pointer of the buffer + its maximum length. This points
one byte past the end of the passed buffer (line 15). This is done because we must
fill the buffer in reverse order. This is because each division will determine the low
order decimal digits first.
2. Line 16 tests for a zero-length buffer, and if so, just returns at line 30, returning
the buffer pointer.
3. A null byte is stored at the very end of the caller's buffer in line 17.
4. Then the pointer in t5 is decremented towards the start of the buffer by one (line
18).
5. The unsigned integer 10 is loaded into t1 (line 19). This will be the divisor used in
the loop.
6. The loop begins in line 21. Here the value of a0 (the unsigned integer) is divided
by 10 and the result is placed in t4.
7. Line 22 also fetches the remainder into register t3 (Line 24).
8. The ASCII value for '0' is added to the remainder in t3, to form a digit character,
and this replaces t3 (Line 24).
9. The pointer in t5 is tested against the one in a1, to see if we have gone past the
start of the buffer. If so, the branch is taken to "smlbuf" to exit safely (line 25).
10. Otherwise, pointer register t5 is decremented once more (line 26) and then the
character is stored in the buffer at line 27.
11. If the division did not end in a quotient of zero, we loop back to step 6.
● 179
The main driver program for this test is provided in Listing 11.10. Of particular note, notice
that the pointer returned by uintstr() is saved in pointer variable cp in line 16. This is done
because the start of the converted number is not necessarily going to be at the start of the
array buf when function uintstr() returns. This is due to the fact that function works from
the tail end of the buffer and works backwards.
1 #include <stdio.h>
2
3 extern char *uintstr(unsigned v,char *buf,unsigned buflen);
4
5 void
6 app_main(void) {
7 static unsigned const tests[] = {
8 1023, 32, 96001, 10045, 90999, 1770771,
9 0, 0xFFFFFF
10 };
11 char buf[7];
12 char const *cp;
13 unsigned v;
14
15 for ( unsigned ux=0; (v = tests[ux]) != 0xFFFFFF; ++ux ) {
16 cp = uintstr(v,buf,sizeof buf);
17
18 printf("uintstr(%u,buf,%u) => '%s' %s\n",
19 v,(unsigned)sizeof buf,cp,
20 cp <= buf ? "!!" : "");
21 }
22
23 puts("Done");
24 fflush(stdout);
25 }
$ cd ~/riscv/repo/11/uintstr
$ idf.py build
...
$ idf.py -p <<<yourport>>> flash monitor
...
I (257) cpu_start: Starting scheduler.
uintstr(1023,buf,7) => '1023'
uintstr(32,buf,7) => '32'
uintstr(96001,buf,7) => '96001'
● 180
All conversions worked except for the one identified with "!!". The buffer was not large
enough to support writing the text to buf, to contain the value 1770771. This was inten-
tionally done to test the safety of the function.
Lines 3 to 10 define an enumerated value for each function we want to execute (the excep-
tion is func_bad, which is used to test the handling of a bad index). The values start at zero
for func_add, with a value of 1 for func_sub, etc.
Our assembly language function cgoto() will execute an arithmetic function based upon the
first argument "fun" (line 12). Line 16 defines an array for six returned results that will be
reported once the testing is completed. Lines 18 through 23 test the function with different
arguments and requested functions.
1 #include <stdio.h>
2
3 typedef enum {
4 func_add=0,
5 func_sub,
6 func_mul,
7 func_div,
8 func_rem,
9 func_bad
10 } func_t;
11
12 extern int cgoto(func_t fun,int a,int b);
13
14 void
15 app_main(void) {
16 int r[6];
17
18 r[0] = cgoto(func_add,1,2);
19 r[1] = cgoto(func_sub,5,2);
20 r[2] = cgoto(func_mul,6,5);
21 r[3] = cgoto(func_div,35,6);
● 181
22 r[4] = cgoto(func_rem,13,5);
23 r[5] = cgoto(func_bad,9,99);
24
25 for ( unsigned ux=0; ux<6; ++ux ) {
26 printf("r[%u] = %d\n",ux,r[ux]);
27 }
28 puts("Done");
29 fflush(stdout);
30 }
Now let's examine Listing 11.12, which illustrates the assembly language code. When cgo-
to() is called, register a0 contains the function number, and registers a1 and a2 have the
integer operands to compute with.
1. Lines 15 and 16 do a range check on the function requested. If the value is out of
range, then control passes to label "null" (at line 34), which then returns the result
-1 in protest.
2. Line 17 multiplies the function index by 4 (by shifting left 2), to change the func-
tion index into a byte offset (we need to address the table entries (line 37) by byte
address. Since our table need not change, it is kept in .text and remains read-only.
3. The address of "table" is loaded into temporary register t5 (line 18).
4. The byte offset is added to the address in t5, to address the required word in "ta-
ble". The result of that is placed in register t4 (line 19).
5. The address of the starting code required replaces t4 by loading the word from the
table (line 20).
6. In line 21, we finally jump to the required code using register t4. Depending upon
the calculation, the code will jump to "add", "sub", "mul", "div" or "rem".
1 .global cgoto
2 .text
3
4 # Computed goto example:
5 # extern int cgoto(unsigned func,int a,int b);
6 #
7 # ARGUMENTS:
8 # a0 function
9 # a1 int a
10 # a2 int b
11 #
12 # RETURNS:
13 # a0 result
14
15 cgoto: li t6,4
● 182
$ cd ~/riscv/repo/11/cgoto
$ idf.py build
...
$ idf.py -p <<<yourport>>> flash monitor
...
I (257) cpu_start: Starting scheduler.
r[0] = 3
r[1] = 3
r[2] = 30
r[3] = 5
r[4] = 3
r[5] = -1
Done
● 183
Checking with the main program, the reported results are correct. The sixth result of -1 is
due to requesting an out-of-range function index, which the routine anticipated.
11.7. Summary
This chapter began with a focus on matrix and array subscripting. From there, a matrix
identity function and some string functions were developed. The last example demonstrat-
ed how to compute a branch based upon an index value for efficiently selecting the desired
code to run. At this point, you should be well equipped for programming RISC-V for inte-
ger-based data. In the next chapter, we will explore the hardware floating point.
● 184
A Floating Point
In this chapter, we'll explore the hardware support of floating-point for RISC-V. The
ESP32-C3 does not have this facility so it manages by using compiler-supplied software li-
braries. The QEMU emulated Fedora Linux, however, does emulate hardware floating-point,
giving us the opportunity to exercise the floating-point opcodes.
There is considerable theory surrounding floating-point data formats. Since this is a tu-
torial book, I will assume that you are familiar with that material or will research it when
necessary. So, without further delay, let's examine the RISC-V hardware registers and
instructions available.
When extension "D" is supported, the floating-point registers are capable of holding 32-bit
single-precision or 64-bit double-precision values. Most of the floating-point instructions
work upon the floating-point register file. There are, however, some that transfer values
to or from an integer register. Finally, there are also opcodes, which load or store float-
ing-point values from or to memory.
Unlike the integer register x0, the floating-point register f0 has no special function and may
be used in general calculations.
● 185
Table 12.1: Floating-point registers and their ABI names and savers.
31 87 5 4 3 2 1 0
Reserved Rounding Mode (frm) Accrued Exception (fflags)
NV DZ OF UF NX
24 3 1 1 1 1 1
The fcsr.frm field establishes a default rounding mode to be used or when the opcodes
specify "dyn". The accrued exceptions field fcsr.fflags, remain set until they are cleared. In
this manner, the programmer has the option of testing these flags after each opcode or at
the end of a calculation.
The values for fcsr can be loaded and modified using the following pseudo-ops:
frcsr rd # rd = fcsrr
fscsr rd,rs1 # rd = original fcsr, fscr = rs1
The "frcsr" simply reads fcsr into the destination register. The "fscsr" replaces the fcsr with
the value in integer register rs1, after loading the destination register with the original fcsr
value.
The reserved field is for use with other standard extensions. For example, extension L pro-
vides for decimal floating-point. The RISC-V standards document has this to say about the
reserved field:[1]
● 186
If these extensions are not present, implementations shall ignore writes to these bits
and supply a zero value when read. Standard software should preserve the contents of
these bits.
Note: The GNU assembler recognizes the mnemonics of Table 12.2 for the optional
rounding mode parameter of an opcode. However, if you want to load a rounding mode
as immediate data in fsrmi for example, then you must either specify the immediate
data as a numeric constant or declare a symbol with the correct value. For example, you
might declare a symbol .equ rmm,4, and then use fsrmi x0,rmm.
The following pseudo-opcodes are available for working with the fcsr.frm field directly.
frrm rd # rd = fcsr.frm
fsrm rd,rs1 # rd = fcsr.frm, fcsr.frm=rs1
fsrmi rd,imm # rd = fcsr.frm, fcsr.frm=imm
The opcode "frrm" simply copies the fcsr.frm into the destination integer register (bit posi-
tions left of the loaded value are set to zero). The "fsrm" opcode likewise loads the destina-
tion integer register with the original fcsr.frm and sets fcsr.frm from rs1. The "fsrmi" opcode
is similar, except that fcsr.frm is set from the immediate data.
The following is an instruction that loads register a2 with the current copy of fcsr.frm, while
setting the fcsr.frm from the current value of t3:
● 187
frflags rd # rd = fcsr.fflags
fsflags rd,rs1 # rd = fcsr.fflags, fcsr.fflags = rs1
fsflagsi rd,imm # rd = fcsr.fflags, fcsr.fflags = imm
The "frflags" pseudo-opcode conveniently loads the flags into the destination integer reg-
ister (bits left of the flags are all set to zero). Opcode "fsflags" likewise loads the flags into
the destination register, but also sets the flags from integer register rs1. Finally, "fsflagsi"
performs the same except that the fcsr.fflags are set from immediate data instead.
NV Invalid operation
DZ Divide by Zero
OF Overflow
UF Underflow
NX Inexact
Table 12.3: Floating-point accrued exception flag encoding.
00 S Single-precision 32 F
01 D Double-precision 64 D
10 H Half-precision 16 V
fopcode.{S|D|H|Q} rd,rs1,rs2[,rm]
fopcode.{S|D|H|Q} rd,rs1[,rm]
● 188
The braces show a choice for format (Table 12.4), while the square brackets show an op-
tional rounding mode (Table 12.2). For example:
When the rounding mode is unspecified or given as "dyn", the rounding mode used is de-
termined by fcsr.frm.
For example, if register a1 contains the pointer to a double-precision value, then register
fa2 can be loaded as follows:
For example, the following divides fa0 by ft0 using rounding mode rmm, placing the result
into fa2:
fdiv.d fa2,fa0,ft0,rmm
In addition to these, RISC-V provides "fused multiply-add" operations, which require a third
operand rs3 (an optional rounding mode may be added from Table 12.2):
● 189
Mnemonic Meaning
The general format of the "fcvt" opcode is as follows, where F is one of H, S, D, or Q (Table
12.4), and "int" is from Table 12.5. An optional rounding mode may also be added from
Table 12.2:
These two examples of producing a floating-point zero will never raise an exception.
● 190
The following is an example that loads fa0 with the double-precision value of |ft0| but using
the sign of ft1:
fsgnj.d fa0,ft0,ft1
The pair of pseudo-opcodes "fneg" and "fabs" take advantage of the sign injection opcodes,
where F is one of S, D or Q (Table 12.4):
● 191
Note that the "d" opcode versions are supported by extension RV64 (or larger). These
require an integer register width of 64 bits.
Note: The RISC-V specification notes that: "The FMV.W.X and FMV.X.W instructions were
previously called FMV.S.X and FMV.X.S. The use of W is more consistent with their se-
mantics as an instruction that moves 32 bits without interpreting them. This became
clearer after defining NaN-boxing. To avoid disturbing existing code, both the W and S
versions will be supported by tools."[1]
In the "flt" and "fle" opcodes, be aware that Invalid Operation (NV) is raised when either
input is NaN (since no proper comparison can be made). For "feq", only the signaling NaN
(sNaN) causes an Invalid Operation (NV) to be raised. All three opcodes return boolean
false (zero) when either operand is a NaN value.
Note: The purpose of a Signaling NaN (sNaN) is to cause an exception for debugging
(perhaps because of an uninitialized value). A sNaN is never produced as the result of
arithmetic. Arithmetic may, however, produce a quiet NaN value.
● 192
The destination register rd is an integer register, which receives the 8 bits illustrated in
Table 12.7.
Note: A subnormal number is a non-zero number that is smaller than the smallest num-
ber that can be represented in the given precision.
rd Bit Meaning
0 rs1 is -∞
3 rs1 is -0
4 rs1 is +0
7 rs1 is +∞
Instead of multiplying by 5 and then dividing by 9, we'll just divide by the constant 1.8
instead.
This will be a Fedora Linux project using QEMU where we have hardware floating-point sup-
port in extensions F and D. The listing for the assembler portion is provided in Listing 12.1.
● 193
11 # ARGUMENTS:
12 # fa0 temperature in Fahrenheit
13 # a0 pointer to int to return flags
14 #
15 # RETURNS:
16 # fa0 temperature in Celsius
17 # flags through ptr in a0
18
19 fconvtemp:
20 frcsr t2 # t2 = original fcsr
21 fsrmi x0,rmm # Set rnd mode to RMM
22 fsflagsi x0,0 # Clear exceptions
23 la t4,f18
24 fld ft0,0(t4) # ft0 = 1.8
25 addi t0,x0,32 # t0 = 32
26 fcvt.d.lu ft1,t0,rtz # ft1 = 32.0
27
28 conv: fsub.d fa0,fa0,ft1,rtz # fa0 -= 32.0
29 fdiv.d fa0,fa0,ft0,rmm # fa0 /= 1.8
30
31 frflags t0 # t0 = fcsr.flags
32 sw t0,0(a0) # Store fcsr.flags
33
34 fscsr x0,t2 # Restore fscr
35 ret
36
37 .section .rodata
38 f18: .double 1.8
2. Line 20 loads the current value of fcsr into integer register t2. We'll use this value
to restore the fcsr when the function returns later.
● 194
3. Just in case no rounding mode is specified on the instruction, we set the rounding
mode in line 21. We don't care about its prior value so x0 is the destination. We set
the default rounding mode to "round to nearest".
4. We are interested in the exception flags, so these are all cleared in line 22 by set-
ting the exceptions register to zero. Again, we don't care about the prior value so
the register x0 is used for the destination.
5. We need to access a double constant from line 38, so we establish an address in
t4 using the "load address" pseudo-op (line 23).
6. Line 24 loads the double-precision value 1.8 into the floating-point register ft0, for
use later on.
7. We establish the integer constant of 32 in temporary register t0 (line 25).
8. Then convert that integer (32) to floating-point in line 26, using "round to zero",
just for fun.
9. Line 28 finally gets started on the calculation. It subtracts 32.0 in ft1 from the
value passed in fa0, rounding towards zero. The result is returned to fa0.
10. Then fa0 is divided by 1.8 in ft0 to arrive at the temperature in Celsius, rounding to
the nearest. The result is returned to fa0, which will be the function's return value.
11. Line 31 then loads the fcsr.flags into temporary register t0. This value is returned
to the caller through the pointer passed in integer a0 (recall that the first argument
went into fa0 instead).
12. The flags in t0 are passed back to the caller in line 32 by storing a word through
the given pointer.
13. The fcsr register is restored to the way we found it, in case the caller (and C/C++)
needs it that way in line 34.
14. Finally, the function returns, with the return value passed back in floating-point
register fa0 at line 35.
For a simple calculation, that procedure may seem a little tedious. But this is an advantage
to consider. If this were a complex and serious scientific calculation, then we've guided
the rounding at every step of the way. In C/C++, a rounding mode is selected (likely by
default) and used throughout.
1 #include <stdio.h>
2
3 extern double fconvtemp(double f,unsigned *pflags);
4
5 int
6 main(int argc,char **argv) {
7 static double const tests[] = {
8 32.0, 0.0, -40.0, 18.5
9 };
10
11 for ( int ux=0; ux < 4; ++ux ) {
12 unsigned flags;
● 195
13
14 double celsius = fconvtemp(tests[ux],&flags);
15
16 printf("%.1lf F -> %.1lf C (flags = 0x%04X)\n",
17 tests[ux], celsius, flags);
18 }
19 return 0;
20 }
Listing 12.2: Main driver program for the Fahrenheit to Celsius conversion in
~/riscv/repo/12/celsius/qemu64/main.c.
The main program loops through trying every test Fahrenheit temperature in lines 11 to
18. The result of the conversion is reported along with the returned flags. Compile and run
it as follows:
$ cd ~/riscv/repo/12/celsius/qemu64
$ gcc -g celsius.S main.c
$ ./a.out
32.0 F -> 0.0 C (flags = 0x0000)
0.0 F -> -17.8 C (flags = 0x0001)
-40.0 F -> -40.0 C (flags = 0x0001)
18.5 F -> -7.5 C (flags = 0x0001)
Yes, -40 ºF is the same as -40ºC (it's one of a few favourite things I memorized). Note
that the flag's value for the first conversion was zero, indicating that there were no excep-
tions raised. The flag's value of 0x0001 for the remaining calculations indicates that the
NX exception flag was set. This simply means that the computed result was inexact. This
is not unusual in floating-point calculations. These exception flags can be very helpful for
checking difficult calculations.
12.14. Summary
Precise floating-point calculations can be tedious to get right. In C/C++, they are often
written in a careless fashion without any regard for rounding. In the RISC-V assembler
language, you are more likely to pay close attention to each step. This can be helpful for
correctly rounding critical formulas. Finally, the hardware operation of these floating calcu-
lations is much faster than performing those in software.
Bibliography
[1] Asanovi, K., & SiFive Inc. (n.d.). In A. Waterman (Ed.), The RISC-V Instruction Set
Manual Volume I: User-Level ISA (Version 2.2, Vol. I).
● 196
Chapter 13 • Portability
Portability Pioneers
Programs, particularly assembler language code can be tedious to write and test for cor-
rectness. Portable C/C++ code is relatively straightforward with the use of the CPP (C
Pre-Processor) macro capabilities. Sometimes, it is equally desirable to do the same for
assembly language code so that only one source module needs to be maintained. In this
chapter, let's see what is available to you using the GNU Compiler Collection (GCC).
The first question that follows is "what macros are available?" Because of the large number
of them, you'll find yourself referring to the source code rather than a specific document.
You might say that the macros are self-documenting. But these are inconveniently defined
in several files, so it is more convenient to ask the compiler to list them instead. On POSIX
systems like Linux/MacOS/*BSD, you can ask for this list from GCC directly as follows:
This dumps all predefined GCC macros. For example, the following is a dump of the first
few:
● 197
The compiler options and arguments used here are described in Table 13.1. The definitions
are directed to standard output, which is convenient for piping to grep or other filter com-
mands.
Option/Argument Description
Table 13.1: GCC Compiler Options for dumping out macro definitions.
Macros relevant to RISC-V are listed in Table 13.2. For example, when the macro __riscv is
defined, then you know that the compile is targeted to the RISC-V instruction set.
Some of the macros depend upon the compiler options used. The two options that are ref-
erenced in Table 13.2 are described in Table 13.3.
● 198
Option Meaning
1 .global foo
2 .text
3
4 #ifndef __RISCV
5 #error This is not a RISCV architecture!
6 #endif
7 foo: ret
In Listing 13.1 the CPP #ifndef is used with the macro __RISCV. When the macro value
for __RISCV is undefined, the #error message "This is not a RISC-V architecture!" will be
issued and the build will stop.
1 .global struint
2 .text
3
4 # extern unsigned struint(char const *text,bool *ok)
5 #
6 # ARGUMENTS:
7 # a0 char const *text (text to convert)
8 # a1 pointer to bool
9 #
● 199
10 # RETURNS:
11 # a0 unsigned value (when ok is true)
12 # ok:
13 # true, conversion successful
14 # false, conversion failed
15
16 struint:
17 mv t6,a0 # t6 = ptr to test
18 li a0,0 # Accumulator for uint
19 li t2,0 # Digit count
20 li t4,'0'
21 li t3,'9'
22 li t1,' '
23 li t0,10
24
25 loop: lbu t5,0(t6) # Load text char
26 beqz t5,nulbyt # Branch if null byte
27 beq t5,t1,skip # Skip white space
28 bgt t5,t3,fail # char > '9'?
29 blt t5,t4,fail # char < '0'?
30 andi t5,t5,0x0F # Mask out 0x00 to 0x09
31 #ifdef __riscv_mul
32 mul a0,a0,t0 # a0 *= 10
33 #else
34 sll a6,a0,3 # a6 = a0 * 8
35 sll a0,a0,1 # a0 *= 2
36 add a0,a0,a6 # a0 += a6
37 #endif
38 add a0,a0,t5 # a0 += t5
39 addi t2,t2,1 # Bump digit count
40 skip: addi t6,t6,1 # ++text ptr
41 j loop
42
43 nulbyt: beqz t2,fail # Fail if no digits
44 li t0,1
45 sb t0,0(a1) # ok = true
46 ret
47
48 fail: li t0,0
49 exit: sb t0,0(a1) # ok = false
50 ret
● 200
Line 31 uses the C Pre-Processor to test the macro name __riscv_mul. If the macro is
defined, then the source at line 32 invokes the hardware multiply opcode. Otherwise, the
assembly falls back to lines 38 to 36 to perform that multiplication by ten in three steps
instead. We can test this change by producing a compiler listing.
$ cd ~/riscv/repo/13/cpp
$ ~/riscv/repo/listesp struint.S # For ESP32-C3 environment
$ ~/riscv/repo/list struint.S # For QEMU in Fedora Linux
$ C:\riscv\repo\listesp.bat struint.S # For Windows ESP32-C3
Once the listing is produced, look at lines 25 to 41. Notice how there is assembled opcode
data for line 32 specifying the mul instruction, but no code for lines 33 to 36.
Now try removing extension support 'M'. Under Fedora, specify -march=rv64iac and for the
ESP32-C3 environment, use -march=rv32iac instead. This compiles without the 'M' exten-
sion. Listing 13.4 illustrates what the ESP32-C3 listing would look like after the assembly:
$ cd ~/riscv/repo/13/cpp
$ ~/riscv/repo/listesp -march=rv32iac struint.S # For ESP32-C3 environment
● 201
31 #ifdef __riscv_mul
32 mul a0,a0,t0 # a0 *= 10
33 #else
34 002c 13183500 sll a6,a0,3 # a6 = a0 * 8
35 0030 0605 sll a0,a0,1 # a0 *= 2
36 0032 4295 add a0,a0,a6 # a0 += a6
37 #endif
38 0034 7A95 add a0,a0,t5 # a0 += t5
39 0036 8503 addi t2,t2,1 # Bump digit count
40 0038 850F skip: addi t6,t6,1 # ++text ptr
41 003a E9BF j loop
Listing 13.4 Compiling with -march=rv32iac for struint.S for ESP32-C3 environment.
Notice in Listing 13.4 how there is no code assembled for line 32, but there is for lines 34 to
36. When we take away the extension 'M', the assembler substitutes the software solution
for us.
But when you need to return a 64-bit result in the RV32 environment, you have to split the
returned results into a0 and a1, with a0 holding the low order word.
But for RV64, the product can be fully returned in a0 because the register holds 64 bits:
#if __riscv_xlen == 32
mulh t2,a0,a1 # high order a0 * a1 into t2
mul a0,a0,a1 # low order a0 * a1
mv a1,t2 # a1 = high order a0 * a1
● 202
#elif __riscv_xlen == 64
mul a0,a0,a1 # return a0 * a1
#else
#error The budgie died: wrong RISC-V environment
#endif
Let's examine the listings for RV32 and RV64, to see if the assembler did the correct thing.
Listing 13.5 is the ESP32-C3 listing and Listing 13.6 is the QEMU Fedora Linux assembler
listing for the RV64 environment.
1 # 1 "port3264.S"
1 .global foo
0
0
2 .text
3
4 foo:
5 #if __riscv_xlen == 32
6 0000 B313B502 mulh t2,a0,a1 # high order a0 * a1 into t2
7 0004 3305B502 mul a0,a0,a1 # low order a0 * a1
8 0008 9E85 mv a1,t2 # a1 = high order a0 * a1
9 #elif __riscv_xlen == 64
10 mul a0,a0,a1 # return a0 * a1
11 #else
12 #error The budgie died: wrong RISC-V environment
13 #endif
14 000a 8280 ret
1 # 1 "port3264.S"
1 .global foo
0
0
1 /* Copyright (C) 1991-2020 Free Software Foundation, Inc.
2 .text
3
4 foo:
5 #if __riscv_xlen == 32
6 mulh t2,a0,a1 # high order a0 * a1 into t2
7 mul a0,a0,a1 # low order a0 * a1
8 mv a1,t2 # a1 = high order a0 * a1
9 #elif __riscv_xlen == 64
10 0000 3305B502 mul a0,a0,a1 # return a0 * a1
11 #else
12 #error The budgie died: wrong RISC-V environment
● 203
13 #endif
14 0004 8280 ret
Comparing the two listings, you can see that for the RV32 environment, lines 6 through 8
are assembled, but not line 10. In the RV64 listing, lines 6 through 8 are not assembled,
yet line 10 is. The preprocessor allowed the code to adapt to its environment.
where:
Let's take the last assignment from chapter 12 and define and use a macro named fpinit, to
save the current floating-point state and to initialize the environment for a new calculation.
The definition of the macro has been extracted for ease of reference:
8 #
9 # Macro to save current fscr status to register 'save',
10 # and set the default rounding mode to 'round' (round to
11 # nearest, by default), and clear exceptions:
12 #
13 .macro fpinit save=t2 round=0x4
14 #if __riscv_flen > 0
15 frcsr \save # save = original fcsr
16 fsrmi x0,\round # Set rnd mode to RMM
17 fsflagsi x0,0 # Clear exceptions
18 #else
19 #error No RISC-V hardware floating point support
20 #endif
21 .endm
● 204
Line 13 declares the start of the macro body, declaring its name to be fpinit and to take two
optional parameters save and round. Each of these parameters has default values of t2 for
save and 0x4 for round (which is the value for round to the nearest). Specifying defaults
for each parameter is optional.
Line 14 checks that the hardware-floating point exists, and if not, the error at line 19 is
highlighted. Otherwise, lines 15 to 17 are expanded, with the text for "\save" and "\round"
substituted according to the values supplied in the macro parameters.
Listing 13.7 illustrates the entire program for celsius.S where the macro is defined and
used. The definition of the macro must appear before its use.
● 205
We see that the macro is declared at the top of the source file on lines 13 to 21. The macro
is invoked in line 35, with the same parameter set to a7, and the round parameter set to
rmm. The excerpted assembler output listing of the main section of code is illustrated in
Listing 13.8.
34 fconvtemp:
35 0000 F3283000 fpinit save=a7 round=rmm
35 73502200
35 73501000
36 000c 970E0000 la t4,f18
36 938E0E00
37 0014 07B00E00 fld ft0,0(t4) # ft0 = 1.8
38 0018 93020002 addi t0,x0,32 # t0 = 32
39 001c D39032D2 fcvt.d.lu ft1,t0,rtz # ft1 = 32.0
40
41 0020 5315150A conv: fsub.d fa0,fa0,ft1,rtz # fa0 -= 32.0
42 0024 5345051A fdiv.d fa0,fa0,ft0,rmm # fa0 /= 1.8
43
44 0028 F3221000 frflags t0 # t0 = fcsr.flags
45 002c 23205500 sw t0,0(a0) # Store fcsr.
flags
46
47 0030 73903800 fscsr x0,a7 # Restore fscr
48 0034 8280 ret
● 206
The macro invocation occurs in line 35, from which you can see the assembled code at the
left in lines labeled as 35 (all three of them). Had we set the parameter save=t2, the code
shown would have been identical to the program in chapter 12, Floating-Point. Since the
macro permits us to use a different register, we chose unused register a7 this time. Note
that this change requires us to change the register referenced in line 47.
Line 35 supplied the value of round=rmm to the macro. The default for this parameter was
the value 0x4, which is the bit pattern for round to the nearest. Since we supplied "rmm"
to this parameter and it is used as immediate data in line 16, the value "rmm" must be a
defined symbol. In this program, that value comes from line 6 of the source program:
These are little details that sometimes mess us up. If we were lazy and didn't want to define
rmm, we could have invoked the macro in either of the following ways:
While this is a fairly simple macro with limited utility in this case, the idea was to introduce
to you the ability of the GNU assembler to apply macros. For completeness, Listing 13.9
illustrates the main driver program for this particular project.
1 #include <stdio.h>
2
3 extern double fconvtemp(double f,unsigned *pflags);
4
5 int
6 main(int argc,char **argv) {
7 static double const tests[] = {
8 32.0, 0.0, -40.0, 18.5
9 };
10
11 for ( int ux=0; ux < 4; ++ux ) {
12 unsigned flags;
13
14 double celsius = fconvtemp(tests[ux],&flags);
15
16 printf("%.1lf F -> %.1lf C (flags = 0x%04X)\n",
17 tests[ux], celsius, flags);
18 }
19 return 0;
20 }
● 207
Let's make sure that this change to use the fpinit macro produces the same results. Start
up Fedora Linux in QEMU, and perform the following:
$ cd ~/riscv/repo/13/cpp/qemu64
$ gcc -g celsius.S main.c
$ ./a.out
32.0 F -> 0.0 C (flags = 0x0000)
0.0 F -> -17.8 C (flags = 0x0001)
-40.0 F -> -40.0 C (flags = 0x0001)
18.5 F -> -7.5 C (flags = 0x0001)
13.6. Summary
Using the C preprocessor macros in assembler work the same as they do for C/C++ when
the source code is provided in a ".S" suffixed file. This permits the programmer to perform
clever work-arounds to adjust for environmental differences. Whether you use this feature
or not will depend upon the complexity involved. Sometimes it may be simpler to use sep-
arate source files for different environments because a source file with too many preproc-
essor directives can be difficult to read.
When developing a more involved function, it may be profitable to develop separately for
RV32 and RV64 at first. Once both are debugged and tested, you might try to merge the
two by careful use of preprocessing. If that still proves difficult the other option is to #if
out large unique sections of the code for each environment and share the other sections in
common. Do keep the source code pleasant for the reader.
Additionally, the GNU assembler provides macro capabilities that are especially helpful for
repetitive or error-prone code. The GCC preprocessor and macro processor offer another
pair of tools for the RISC-V developer to exploit.
● 208
Trusted Support
In the last chapter, C preprocessor macros helped shape the assembly of RISC-V code by
use of macros to direct it. For example, if it was known at build time that multiply support
was provided then the multiply opcode would be assembled. But what do you do when you
must assemble a binary that will run on platforms that may or may not have multiply sup-
port? If you can't determine the support level at compile time, then how do you determine
it at runtime? That is one exploration that lies before us in this chapter.
Another area that will be touched upon is the CPU's counters and timers. RISC-V defines
some standard ones that should exist. There may also be custom timers and counters pro-
vided, which are obviously defined by the vendor.
0 0b00 User/Application U
1 0b01 Supervisor S
2 0b10 Reserved -
3 0b11 Machine M
For a given hardware platform, the vendor may support a subset of these privilege levels.
The one mandatory level for all platforms is, however, the Machine Level. Level 2 in Table
14.1 may be defined as Hypervisor in some early documents. But some of these aspects of
RISC-V are still undergoing development.
● 209
"This register must be readable in any implementation, but a value of zero can be re-
turned to indicate the misa register has not been implemented, requiring that CPU capa-
bilities be determined through a separate non-standard mechanism."[1]
According to the RISC-V standard, the misa register is classed as "Write Any Values, Read
Legal Values" (WARL) type of register. This means that no exceptions are raised if you
write values in unsupported bits of the register. When reading the register, only the legal
supported bit values are returned. The layout of the misa CSR is illustrated in Figure 14.1.
Notice that the leftmost 2 bits encode the MXL field, while extensions are the rightmost 26
bits. This has an impact when working with 32-bit or 64-bit machines. The XLEN value is
encoded in the left 2 most significant bits according to Table 14.2.
● 210
MXL XLEN
1 32
2 64
3 128
Table 14.2: Encoding of the MXL field within MISA.
The bit encodings for the instruction sets and extensions are shown in Table 14.3. Some of
these may change in meaning as the standards are ratified over time. Most notable is bit
8 which indicates one of the RV32I, RV64I or RV128I base ISA, which combined with the
MXL value identifies your base platform. If bit 8 is not set, then check bit 4 for the special
RV32E base ISA, reserved for very small-embedded platforms. In addition, extension bits
will appear when they apply. For example, the ESP32-C3 will also indicate bit 2 (C) for
compressed instruction set support and bit 12 (M) for multiply support.
0 A Atomic
1 B Bit-Manipulation
2 C Compressed
3 D Double-precision floating-point
5 F Single-precision floating-point
6 G Reserved
7 H Hypervisor
10 K Reserved
11 L Reserved
12 M Integer Multiply
14 O Reserved
16 Q Quad-precision floating-point
17 R Reserved
19 T Reserved
21 V Vector
22 W Reserved
● 211
24 Y Reserved
25 Z Reserved
Table 14.3: Encoding for the Extensions field in MISA.
14.3. Opcodes
There are three basic opcodes to know about before we look at the pseudo-opcodes. Where
"CSR" is specified, supply the Control Status Register address that you want to work with.
For one example the address misa from Figure 14.1 can be supplied. The generalized op-
codes are:
csrrs rd, csr, rs1 # rd<-csr, csr<-rs1 Atomic Read and Set Bit in CSR
csrrw rd, csr, rs1 # rd<-csr, csr<-rs1 Atomic Read/Write CSR
csrrwi rd, csr, imm # rd<-csr, csr<-imm Atomic Read/Write CSR immediate
The csrrw and csrrwi atomically place the CSR contents into rd, while replacing the CSR
with the value in rs1 or the immediate data. The csrrs opcode differs by only setting bits in
the CSR for each bit that is a 1-bit in rs1, while zero bits have no effect.
For programmer convenience, there are three pseudo-instructions offered, with the equiv-
alent opcode shown at right in the comment field:
To read the misa register into register t3 without changing the value of the misa register,
you would code:
14.4. ESP32-C3
The ESP32-C3 device as programmed by the Espressif ESP-IDF operates in the Ma-
chine Level mode, allowing us to explore privileged opcodes like csrr. The ESP-IDF
incorporates a number of libraries, including FreeRTOS, so that preemptive mul-
ti-threading is possible. But all this runs at the "machine level" (M).
● 212
1 #include <stdio.h>
2 #include <stdbool.h>
3
4 extern unsigned extensions(char *buf,unsigned bufsiz,unsigned *bits);
5
6 void
7 app_main(void) {
8 char buf[32];
9 unsigned exten=0, bits=0;
10
11 exten = extensions(buf,sizeof buf,&bits);
12
13 printf("exten = 0x%06X, %u bits, RV%u%s\n",
14 exten, bits, bits, buf);
15 }
The assembler function is illustrated in Listing 14.2. It might look a little complicated, but it
is merely a bunch of little steps used to return an improved format to the caller.
1 .global extensions
2 .text
3
4 # extern unsigned extensions(
5 # char *buf,
6 # unsigned bufsiz,
7 # unsigned *bits);
8 #
9 # ARGUMENTS:
10 # a0 char const *buf (text to return)
11 # a1 unsigned buf size (bytes)
12 # a2 pointer to unsigned int 'bits'
13 #
14 # RETURNS:
15 # a0 unsigned value (extension bits)
16
17 extensions:
18 mv t6,a0 # t6 = buf ptr
19 add a6,t6,a1 # a6 = buf + buf_size
20 csrr t3,misa # t3 = misa register
21 li t2,1 # t2 = 1
22 sll t2,t2,8 # t2 <<= 8 (Mask for 'I')
23 and t4,t3,t2 # t4 = t3 & t2
24 beq x0,t4,3f # Branch if not 'I'
25 li t1,'I' # t1 = ascii 'I'
● 213
● 214
Let's first examine the inner function named putch() defined in lines 65 to 68.
1. The function is called using t0 as the "link register". The value in t0 will be used
for returning to the caller.
2. Register a6 has been initialized as the address past the end of the caller's buffer,
which is established in line 19. Register t6 is used as the working pointer into the
caller's buffer.
3. Upon entry to putch() in line 67, we first test if the buffer pointer (t6) is greater
than or equal to the end of the buffer. If it is, control passes to label 5 in line 68,
where we promptly return.
4. Otherwise, the byte held in temporary t1 is stored at the pointer 0(t6) placing that
character into the caller's buffer (line 66).
5. The buffer pointer is then incremented by 1 in line 67.
6. The function returns by the link register t0 in this case.
This use of the little internal function allows us to blindly call it, knowing that if the caller
ever supplies a buffer that is too small, no harm will be done. The bounds of the buffer will
always be checked.
The entry of the function is line 18, where the buffer pointer in a0 is copied to t6, and the
end pointer is computed in a6 (an unused argument register). Now let's trace the remaining
steps:
1. The misa CSR is read in line 20, with the value placed into t3. The misa is left
unchanged.
2. Register t2 is then loaded with 1, then shifted left 8 bits in lines 21 and 22.
3. That mask value is used in line 23 to set t4 (line 23).
4. If t4 is zero (in other words, the 'I' bit in misa is not set) the code branches to line
31 (label 3).
5. Otherwise, the misa 'I' bit was set, and then the ASCII value for 'I' is loaded into
t1 and then stored in the caller's buffer in lines 25 and 26.
6. To suppress the 'I' from being reported later, we mask out that bit in line 27, by
using the xor operation to flip the 1-bit to a zero.
7. Line 31 then sets t4 to the value 32 as a trial value for XLEN.
8. If the sign bit of t3 is set (negative), then branch to line 34 (label 7).
9. Otherwise, we jump to line 39 from line 33 (XLEN is 32).
10. If control passes to line 34 (from line 32), we then set t4 to 64 as the next trial
XLEN value.
11. Line 35 shifts t4 left one bit and tests the sign bit again, branching from line 36
if the bit was set. If the branch to line 38 is taken, the XLEN is determined to be
128 (label 9).
12. At line 39, the value for XLEN in t4 is saved to the caller's unsigned int argument,
returning the XLEN value to the caller.
13. Lines 43 to 46 then mask out the lower 26 bits of the misa register, since we want
to eliminate all but the lower 26 bits. These bits are left in register a0 to be the
return value containing all extension bits (except for 'I', which we turned off).
● 215
14. Line 50 initializes t1 with the ASCII value for 'A' and copies a0 to t2, for the exten-
sion bits to be tested.
15. The loop begins at line 52, where we test the low-order bit of t2, placing that bit in
t5. If that bit was zero, then the code skips ahead to line 55 (label 4).
16. If the bit tested is a 1 bit, then the ASCII character in t1 is saved to the caller at
line 54. The putch() function will also increment the buffer pointer in t6.
17. Line 55 increments the ASCII character to the next in sequence. The first time
around the 'A' will become a 'B'.
18. The extension bits in t2 are shifted right one bit in line 56, and if that shift results
in a non-zero value, then we loop back to line 52, from 57.
19. Otherwise, we fall through to line 58, where we load a null byte into t1.
20. Internal function putch() is called one more time from line 59 to place a null byte
at the end of the string.
21. The extensions() function returns at line 60.
$ cd ~/riscv/repo/14/extend
$ idf.py build
...
$ idf.py -p <<<yourport>>> flash monitor
...
I (257) cpu_start: Starting scheduler.
exten = 0x101004, 32 bits, RV32ICMU
From the last line of output (from our main program), we see that it reports an XLEN
of 32 bits, and that it is an RV32I platform with CMU extensions. These are all encoded
in the "exten = 0x101004" bits that were returned and reported. Note that the 'I' bit
was stripped out of this string in the code. With the information returned, the caller can
determine at runtime, whether multiply is supported for example.
$ ~/riscv/repo/14/extend/qemu64
$ gcc -g -Wall extend.S main.c
$ ./a.out
Illegal instruction (core dumped)
$
It appears that the program builds ok, but aborts when run under Fedora Linux. To see why,
let's enlist the help of gdb:
● 216
$ gdb ./a.out
GNU gdb (GDB) Fedora 9.0.50.20191119-2.0.riscv64.fc32
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "riscv64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
Normally we'd just analyze a "core file" in gdb after an abort but in recent years it seems
that every distribution is doing their best to do away with core files (this drives developers
nuts!) That problem is fixable, but each platform has its own rules for configuring core file
support. Because of this annoyance, I've cut to the chase and just run the executable a.out
directly from gdb instead.
Note: If gdb does not report the source line of the failure, it is because the executable
was not compiled and linked with debugging support (GCC option -g).
● 217
There is nothing wrong with the way the program (or the opcode) is written. What caused
the above signal is that Fedora Linux is running our programs in "User/Application mode"
(U). This mode of execution does not permit privileged instructions like csrr to be executed.
If you wanted to, you could write a kernel device module, and issue the opcode there. But
that is beyond the scope of this book.
$ cat /proc/cpuinfo
processor : 0
hart : 1
isa : rv64imafdcsu
mmu : sv48
processor : 1
hart : 0
isa : rv64imafdcsu
mmu : sv48
From this display, we see that there are two configured cores (processors). Each RV64I core
with MAFDCSU extensions supported. Table 14.4 lists the support available in our Fedora
Linux RISC-V system. A user program can open and parse the file /proc/cpuinfo to deter-
mine the level of support available.
0 A Atomic
2 C Compressed
3 D Double-precision floating-point
5 F Single-precision floating-point
12 M Integer Multiply
Table 14.4: Encoding for the Extensions field in MISA reported by QEMU.
● 218
14.7. Counters
The RISC-V standards define performance counters and timers, which are available to the
unprivileged mode of operation. These counters are 64 bits in width and are CSR read-only
values. For RV32, the upper half of the counter is read with an "h" version of the opcode.
For example, the upper word is read with rdcycleh instead of rdcycle.
The RISC-V standard for reading the cycles counter is the rdcycle pseudo-opcode:
For RV32I, there is the risk of the counter overflowing between the reading of the low order
32 bits and the high order bits. To work around that, use the following sequence:
loop: rdcycleh t2
rdcycle t1
rdcycleh t3
bne t2,t3,loop
This sequence checks to see if the upper 32-bit word changed between the start and end
of the sequence. If it did, the sequence is repeated one more time.
There are also two more timers and counters that often interest the programmer:
For RV32I, there are also rdtimeh and rdinstreth opcodes. The rdtime pseudo-opcode reads
the time CSR value into the destination register. This counter tracks the wall-clock real time
that has passed from some arbitrary start point (the units used may be implementation
specific). The CSR instret counts the number of instructions "retired" by this hardware
thread.
1 #include <stdio.h>
2 #include <stdint.h>
3
4 extern uint64_t measure(int mul);
5
6 int
7 main(int argc,char **argv) {
8 uint64_t cycles;
● 219
9
10 for ( int x=0; x<10; ++x ) {
11 cycles = measure(1);
12 printf("muliply cycles = %lu\n",cycles);
13 cycles = measure(0);
14 printf("mul10 cycles = %lu\n",cycles);
15 }
16 return 0;
17 }
When the assembler routine measure() is called with a true argument, it will measure the
cycle time of the multiply opcode. When called with zero (false), the multiply by 10 without
the multiply opcode is measured instead. The test is repeated ten times for the pair of calls
by the main program (lines 11 and 13).
The source program for the assembler measure() function is provided in Listing 14.4.
1 .global measure
2 .text
3
4 # extern unsigned measure(int mul)
5 #
6 # ARGUMENTS:
7 # a0 When true, measure the mul instruction,
8 # otherwise measure two shifts and one add.
9 # RETURNS:
10 # a0 unsigned count of cycles
11
12 measure:
13 beqz a0,1f # Branch if measuring by 10
14 li t5,10 # t5 = 10
15 rdcycle t1
16 mul a2,a0,t5 # a2 = a0 * 10
17 rdcycle t2
18 sub a0,t2,t1 # Difference in cycles
19 ret
20
21 1: rdcycle t1
22 sll a2,a0,3 # a2 = a0 * 8
23 sll a1,a0,1 # a1 = a0 * 2
24 add a2,a2,a1 # a2 = a0 * 10
25 rdcycle t2
26 sub a0,t2,t1 # Difference in cycles
27 ret
● 220
1. The 1 or 0 value is provided in the call through the register a0 and is tested in line
13. The argument was non-zero (true), then the execution falls through to line 14.
2. The constant 10 is loaded into temporary register t5 (line 14).
3. Line 15 takes a cycle counter snapshot into register t1.
4. Then the actual multiplication by the mul opcode occurs in line 16.
5. Followed by that is the second snapshot of the cycle counter into t2 at line 17.
6. The difference between the two counts is computed and returned in a0 (line 18).
7. Control returns to the caller in line 19.
8. When the argument is false, control passes to label "1" at line 21 from line 13. At
this point, a cycle snapshot is copied into register t1.
9. Lines 22 through 24 compute a multiply by ten using shifts and add.
10. The second cycle snapshot is captured into t2 at line 25.
11. Finally, the cycle difference is computed in line 26 and returned in register a0.
There is no provision for handling a counter rollover in the presented code. If the cycle
counter overflows after capturing the first snapshot, then the difference computed will be
huge. This is unlikely to happen depending on how long you have had QEMU running. If it
does happen, repeat the test.
● 221
Your results will likely differ, perhaps considerably so. The first three lines indicate a lot of
work being done by Fedora Linux, perhaps the result of loading and linking with shared
libraries. There is also the overhead of mapping the executable file into memory. So I would
throw away at least the first four results.
The remaining results are inconsistent. This is because the QEMU emulation is not a true re-
flection of what hardware would have returned. So, the best we can do with the emulation
run is say "we did it" and leave it at that. The essential lesson from this exercise is how we
used the rdcycle opcode to measure cycle time.
15 rdcycle t1
16 mul a2,a0,t5 # a2 = a0 * 10
17 rdcycle t2
Benchmarking is tricky. This exercise glosses over other issues like the fact that there is no
guarantee that the instructions in lines 15 to 17 run without interruption. So, keep these
factors in mind when designing benchmarks.
If you have access to real RV64 hardware, you should be able to run this exercise and
obtain good results. New affordable hardware announcements for RISC-V are frequently
being made these days.
.equ mpccr,0x7E2
The main program, which is similar to the QEMU version is presented in Listing 14.5.
1 #include <stdio.h>
2 #include <stdint.h>
3
4 extern uint32_t measure(int mul);
5
6 void
7 app_main(void) {
8 uint32_t cycles;
9
10 for ( int x=0; x<10; ++x ) {
11 cycles = measure(1);
12 printf("muliply cycles = %u\n",cycles);
● 222
13 cycles = measure(0);
14 printf("shift cycles = %u\n",cycles);
15 }
16 }
This program runs the measure() test multiple times, for reasons that will be explained
shortly. The assembler routine for the ESP32-C3 version is shown in Listing 14.6.
1 .global measure
2 .text
3
4 .equ mpccr,0x7E2
5
6 # extern unsigned measure(bool mul)
7 #
8 # ARGUMENTS:
9 # a0 true = use mul else shift
10 #
11 # RETURNS:
12 # a0 unsigned count of cycles
13
14 measure:
15 li a1,99 # Some number
16 beqz a0,1f # If mul is false, jump to 1
17 li a2,10 # ten
18 #
19 # Multiply by ten with mul opcode
20 #
21 csrr t1,mpccr
22 mul t0,a1,a2 # 99 * 10 -> 136 cycles
23 csrr t3,mpccr
24 j xit
25 #
26 # Multiply by ten without mul opcode
27 #
28 1: csrr t1,mpccr
29 sll a2,a1,3 # a2 = a1 * 8
30 sll a1,a1,1 # a1 *= 2
31 add a0,a1,a1 # a0 = a1 * 10 => 200 cycles
32 csrr t3,mpccr
33
34 xit: sub a0,t3,t1
35 ret
● 223
1. Line 4 defines a symbol mpccr with the Espressif custom address of 0x7E2. This
permits us to refer to it symbolically in lines 21 and 23 for example.
2. The function begins by defining a value for a1 in line 15, though this is not really
required (we can multiply any number).
3. Line 16 tests the calling argument to see if it is zero or non-zero. If non-zero, the
execution falls through to line 17.
4. The value of 10 is loaded into a2, to keep the code comparison fair, so that we are
multiplying by ten in line 22 (line 17).
5. The first mpccr value is captured into t1 (line 21).
6. The multiply opcode is used in line 22.
7. The second mpccr value is captured into t3 (line 23).
8. Then the code branches from line 24 to line 34, where the mpccr difference is
computed and returned in register a0.
9. If the argument was zero upon calling this routine, control would resume at line 28
where label "1" is defined. At this point, the first mpccr counter is captured into t1.
10. Then the multiply by ten the hard way is performed in lines 29 through 31.
11. The second mpccr value is captured into t3 at line 32.
12. Finally, the difference in mpccr values is computed and returned in register a0 (line
34) before returning to the caller (line 35).
Once again, note that this code does not handle counters that roll over after overflow.
Because your demonstration runs shortly after the CPU reset, you are not likely to see a
problem with that.
$ idf.py build
$ idf.py -p <<<yourport>>> flash monitor
...
I (258) cpu_start: Starting scheduler.
muliply cycles = 64
shift cycles = 4
muliply cycles = 2
shift cycles = 4
muliply cycles = 2
shift cycles = 4
muliply cycles = 2
shift cycles = 4
muliply cycles = 2
shift cycles = 4
muliply cycles = 2
shift cycles = 4
muliply cycles = 2
shift cycles = 4
● 224
muliply cycles = 2
shift cycles = 4
muliply cycles = 2
shift cycles = 4
muliply cycles = 2
shift cycles = 4
Except for the initial surprise, we have better results this time. It is easy to forget that the
ESP32C3 code resides in flash memory and must be loaded into the RAM cache before it
can execute. This is the reason for the initial high count.
Once the code has entered the cache, we see that the multiply instruction measures con-
sistently 2, while the shift and add return a count of 4. If we assume that this counter
represents instructions performed, then we can say that the multiply should be a count of
1, while the shift and add should be 3. Keep in mind an added instruction was needed to
perform the second counter capture within measure(). The Espressif documentation in the
"ESP32C3 Technical Reference Manual" simply describes the MPCCR counter as "Machine
Performance Counter Value". Espressif's counter address values are defined in the address
space reserved by the RISC-V standard for customization.
14.8. Summary
There is considerably more that could be said about the counters, timers, Control and Sta-
tus Registers and privilege levels than space would allow. This chapter serves the reader,
however, as an introduction to these concepts.
[1] Waterman, A., Krste Asanović, & Hauser, J. (Eds.). (n.d.). The RISC-V Instruction Set
Manual: Volume II: Privileged Architecture. Retrieved June 23, 2022.
● 225
Debugging can be difficult at the best of times. Programs today are often long and com-
plicated. When an MCU program suddenly faults, how do you determine the exact cause?
Knowing exactly where the fault occurs is a help. But knowing the state of the variables it
was working with can make the solution to the problem obvious.
When programming in assembly language the need for a debugger is often more urgent.
You can look at the code hundreds of times and still be convinced that it should work. But
the evidence demonstrates that it clearly doesn't. If only you were able to step through that
same code one instruction at a time looking at the registers and associated memory along
the way. That is when you experience that "aha!" moment. Seeing is believing.
JTAG is a standard that was developed to help with testing circuits without the use of the
traditional bed-of-nails approach. This required defining new electrical connections and a
protocol to drive it. In addition to chip-level testing, the JTAG protocol can be used to pro-
gram flash memory and debug your device. In this chapter, we will examine the Espressif
ESP32-C3 debugging capabilities made available over the USB link.
● 226
device is revision 3 or later and will support JTAG. Figure 15.1 illustrates one example of a
JTAG capable ESP32-C3 dev board.
Figure 15.1 illustrates the relationships between the major components. For debugging,
you will have the OpenOCD software running and communicating with the RISC-V aware
gdb process, both on your desktop PC. The gdb process will be operated through a PC win-
dow. OpenOCD in turn communicates with the ESP32-C3 device over USB, working through
a layer of device JTAG support. In this manner, the user on the PC can direct the execution
of code running on the ESP32-C3 device.
Optionally, you can have simultaneous USB CDC (Communication Device Class) support
so that your ESP32C3 code can send and receive serial data to another window on your
desktop. This allows the user to see what the code has "printed" for example. This support
is optional and requires some configuration to use.
● 227
PC ESP32-C3 device
debug
Serial
Window
Window
Figure 15.2: Major components of ESP32-C3 JTAG debugging.
If you don't need to see your program output, you can dispense with the second optional
(serial) window session and interact only with the gdb debugger.
While Figure 15.2 does not show a window attached to the OpenOCD software, there can be
a window used there also. The procedure that I am going to show you will have it run from
a terminal session window. If you get fancy and use an IDE like Eclipse or Visual Studio,
you can have OpenOCD run automatically for you in the background and in that case won't
need a terminal session. I'm going to stick with manual command line methods here, since
it leaves us in full control and makes it easier to identify and resolve problems.
The following subsections will expand on the details for OpenOCD and gdb.
● 228
$ openocd -f board/esp32c3-builtin.cfg
Open On-Chip Debugger v0.11.0-esp32-20211220 (2021-12-20-15:45)
Licensed under GNU GPL v2
For bug reports, read
http://openocd.org/doc/doxygen/bugs.html
Info : only one transport option; autoselect 'jtag'
Info : esp_usb_jtag: VID set to 0x303a and PID to 0x1001
Info : esp_usb_jtag: capabilities descriptor set to 0x2000
Warn : Transport "jtag" was already selected
Info : Listening on port 6666 for tcl connections
Info : Listening on port 4444 for telnet connections
Info : esp_usb_jtag: Device found. Base speed 40000KHz, div range 1 to 255
Info : clock speed 40000 kHz
Info : JTAG tap: esp32c3.cpu tap/device found: 0x00005c25 (mfg: 0x612 (Espressif
Systems), part: 0x0005, ver: 0x0)
Info : Examined RISC-V core; found 1 harts
Info : hart 0: XLEN=32, misa=0x40101104
Info : starting gdb server for esp32c3 on 3333
Info : Listening on port 3333 for gdb connections
Note that you don't need to specify a port because OpenOCD locates the device using USB
identifiers. If the device is not plugged in or not recognized, you will see the message:
since this tells you that OpenOCD is ready to work with a gdb session.
● 229
Normally it is bad practice to use kill -9, but when OpenOCD goes astray, I found it neces-
sary under MacOS. Once all instances of OpenOCD have been terminated, plug in the USB
cable again and restart OpenOCD.
Note: It is possible for the ESP32-C3 to get into a state that OpenOCD cannot work
with because the flashed code has done something bad. If you cannot get OpenOCD to
start successfully, try flashing the device with a different project. Test if the OpenOCD
and gdb can work with that project. If so, go back and recheck your code of the problem
project. Often adding a 10-second delay at the start of your app_main() will give time
for OpenOCD to connect. Use the FreeRTOS function vTaskDelay().
$ cd ~/riscv/repo/14/rdcycle
$ idf.py build
$ idf.py -p <<<yourport>>> flash
This creates an executable in file ./build/rdcycle.elf, that will be loaded by gdb to match the
code in the device flash. To start gdb, use the Espressif aid, idf.py as follows (it will know
where to find your *.elf file). You must be in the same directory that you used to build the
project:
$ idf.py gdb
● 230
The messages should resemble Figure 14.2. If you see error messages instead, it may be
because OpenOCD was not started, or that the process ran into problems. Recheck the
OpenOCD startup and try again.
$ idf.py gdb
Executing action: gdb
GNU gdb (crosstool-NG esp-2021r2-patch3) 9.2.90.20200913-git
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "--host=x86_64-host_apple-darwin12
--target=riscv32-esp-elf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
Notice that gdb has automatically set a breakpoint at app_main(), where the ESP32-C3
program starts. To quit debugging, you can type "quit" followed by return.
● 231
(gdb) quit
A debugging session is active.
Quit anyway? (y or n) y
Detaching from program: /Users/ve3wwg/riscv/repo/14/rdcycle/build/rdcycle.elf,
Remote target
Ending remote debugging.
[Inferior 1 (Remote target) detached]
$
If you got that far, then consider it a success! We'll dig into gdb in the next section, so that
you can do a whole lot more.
In addition to command line editing and history, the gdb debugger repeats the last execut-
ed command if you just press enter (return). This is extremely useful for some commands
like stepping through code one instruction (or statement) at a time.
15.5.1. Abbreviations
The gdb command goes further than editing to make entering commands easy. It allows
you to abbreviate commands as short as you want, as long as your abbreviation is unique.
For example, the "info" command can be shortened to "inf", "in" or even just "i" at the
moment. If GNU eventually adds another "i" command, then the abbreviation "i" will no
longer work.
(gdb) i r
● 232
While [4] is an older resource, you might consider it as a gentle introduction to using gdb.
In this chapter I am just going to show some highlights to gdb usage, to whet your appetite
and get you started. There is a hefty manual for gdb that can be downloaded and printed.
See [5] under the section "GDB User Manual (PDF)".
$ cd ~/riscv/repo/14/rdcycle
$ $ idf.py gdb
...
Thread 1 hit Temporary breakpoint 1, app_main () at /Users/ve3wwg/riscv/repo/14/
rdcycle/main/main.c:7
7 app_main(void) {
(gdb)
Note: Square brackets as in "i[nfo]" will be used in this chapter to indicate valid abbrevi-
ations. The square brackets are not typed and are not part of the command. A frequently
used command is seldom typed out in full.
Note: The sessions that are shown in this chapter have pathnames abbreviated to a path
that begins with a tilda (~). The actual path shown will include a full pathname when you
perform the same operations. For example, userid jackie on MacOS will display without
the tilda as in:
/Users/jackie/riscv/repo/14/rdcycle/main/main.c
Espressif automatically breakpoints at the start of app_main(), where the ESP32-C3 appli-
cation begins. One helpful command is the "l[ist]" command, since it shows us source lines
near the current breakpoint without having to refer to the program listing:
● 233
11 cycles = measure(1);
(gdb) <CR>
12 printf("muliply cycles = %u\n",cycles);
13 cycles = measure(0);
14 printf("shift cycles = %u\n",cycles);
15 }
16 fflush(stdout);
17 }
(gdb) <CR>
Line number 18 out of range; ~/riscv/repo/14/rdcycle/main/main.c has 17 lines.
(gdb)
Notice how the "list" command was repeated twice by just pressing return (shown as
"<CR>" in the sample session. After listing several lines of code, you might lose track of
where the program was stopped. Use the "f[rame]" command to display and refresh your
memory:
(gdb) frame
#0 app_main () at ~/riscv/repo/14/rdcycle/main/main.c:7
7 app_main(void) {
(gdb)
Since there is a stack involved, we can also do a "ba[cktrace]" (or "bt"). This is one com-
mand that is also abbreviated as "bt" since it is often used and is more mnemonic than
"ba":
#0 app_main () at ~/riscv/repo/14/rdcycle/main/main.c:7
#1 0x4201080c in main_task (args=<optimized out>) at ~/esp32c3/esp-idf/
components/freertos/port/port_common.c:129
#2 0x40385b32 in vPortSetInterruptMask () at ~/esp32c3/esp-idf/components/
freertos/port/riscv/port.c:306
Backtrace stopped: frame did not save the PC
(gdb)
In C/C++ code, a person often wants to step through the code one statement at a time.
This is done with the "n[ext]" command. Let's perform one "next" now:
● 234
(gdb) n
10 for ( int x=0; x<10; ++x ) {
(gdb)
This takes us to the first executable statement in main.c:10 (the convention used by gdb
is to report a file name followed by a colon and then a line number for ease of reference).
Let's step one more time:
(gdb) n
11 cycles = measure(1);
(gdb)
At this point, the "for" statement has begun, and we are now ready to call the assembler
routine measure(). What if you wanted to know what variable x was at this point? You
"p[rint]" it of course:
(gdb) p x
$2 = 0
(gdb)
This is what we expected the first time into the loop. Sometimes you may want to see all
local variables at once for convenience. This can be done with the "i[nfo] lo[cals" command:
(gdb) i lo
x = 0
cycles = <optimized out>
(gdb)
Unfortunately, the value for cycles has been "optimized out" by the compiler. This can be a
nuisance when debugging. In this particular example, the value is not technically defined
anyway, since it was not initialized. Anyway, that is another gdb command to keep in your
back pocket.
Let's assume that for this particular time, we don't actually want to examine all the assem-
bler steps involved in the measure() function. We can invoke the function and have it return
its result by use of the "n[ext]" command again:
(gdb) n
Note: automatically using hardware breakpoints for read-only addresses.
12 printf("muliply cycles = %u\n",cycles);
(gdb)
The JTAG system reports a message about using hardware breakpoints, and then we're
placed at the statement following the call to measure(). The value cycles should now be
defined with the returned value. Fortunately, we can report it this time (sometimes even
this is optimized out):
● 235
(gdb) i lo
x = 0
cycles = 2
(gdb)
From this, we can see that variable x is still zero, and that the variable cycles now have the
value 2. Let's continue with the next statement:
(gdb) n
13 cycles = measure(0);
(gdb)
Notice in this example, that no printed output is shown. This is because we don't have a
window open for the serial output. In this example, we don't need it since we've already
seen the first value of cycles returned.
Let's assume that we now want to see the assembler function's operation in greater detail.
To step into the function measure(), use the "s[tep]" command:
(gdb) s
measure () at /Users/ve3wwg/riscv/repo/14/rdcycle/main/measure.S:15
15 li a1,99 # Some number
(gdb)
This time, the execution has stopped at measure.S:15 at the beginning of the assembled
function measure(). Just for fun, issue the "ba[cktrace]" (or "bt") to report on the stack:
(gdb) bt
#0 measure () at /Users/ve3wwg/riscv/repo/14/rdcycle/main/measure.S:15
#1 0x42004c5c in app_main () at /Users/ve3wwg/riscv/repo/14/rdcycle/main/
main.c:13
#2 0x4201080c in main_task (args=<optimized out>) at /Users/ve3wwg/esp32c3/
esp-idf/components/freertos/port/port_common.c:129
#3 0x40385b32 in vPortSetInterruptMask () at /Users/ve3wwg/esp32c3/esp-idf/
components/freertos/port/riscv/port.c:306
Backtrace stopped: frame did not save the PC
(gdb)
From this, you can see that app_main() at main.c:13 has called measure() at measure.S:15.
Notice the frame numbers have changed, with app_main() at frame 1, and measure() is
now frame 0.
● 236
We know that our argument arrives in register a0. We can display that as follows:
(gdb) p $a0
$4 = 0
(gdb)
From this, we can report any register by name by prepending a dollar ($) to the register
name. We can also report all registers if you need to with the "i[nfo] r[egisters]" command:
(gdb) i r
ra 0x42004c5c 0x42004c5c <app_main+36>
sp 0x3fc8edf0 0x3fc8edf0
gp 0x3fc8a600 0x3fc8a600
tp 0x3fc8891c 0x3fc8891c
t0 0x3de 990
t1 0x3fc8ea4c 1070131788
t2 0x0 0
fp 0x0 0x0
s1 0x0 0
a0 0x0 0
a1 0x3fc8ea28 1070131752
a2 0x0 0
a3 0x1 1
a4 0x3fc8c000 1070120960
a5 0x0 0
a6 0x42001dc8 1107303880
a7 0x0 0
s2 0x0 0
s3 0x0 0
s4 0x0 0
s5 0x0 0
s6 0x0 0
s7 0x0 0
s8 0x0 0
s9 0x0 0
s10 0x0 0
s11 0x0 0
t3 0x6b197044 1796829252
t4 0x0 0
t5 0x0 0
t6 0x0 0
pc 0x42004c84 0x42004c84 <measure>
(gdb)
● 237
(gdb) fr
#0 measure () at /Users/ve3wwg/riscv/repo/14/rdcycle/main/measure.S:15
15 li a1,99 # Some number
(gdb) s
16 beqz a0,1f # If mul is false, jump to 1
(gdb) p $a1
$5 = 99
(gdb)
(gdb) s
28 1: csrr t1,mpccr
(gdb)
Stepping again:
(gdb)
29 sll a2,a1,3 # a2 = a1 * 8
(gdb) <CR>
(gdb) p $t1
$6 = 324356966
(gdb) p /x $t1
$7 = 0x13554b66
(gdb)
Since the last thing we did was a "s[tep]", pressing enter (shown as "<CR>"), repeats the
step once again. Printing register t1, shows us the value in decimal, which is inconvenient
here. To print in hexadecimal, use the "/x" argument after the "p[rint]" command name.
(gdb)
29 sll a2,a1,3 # a2 = a1 * 8
(gdb)
30 sll a1,a1,1 # a1 *= 2
(gdb)
31 add a0,a2,a1 # a0 = a1 * 10 => 200 cycles
(gdb) p $a2
$1 = 792
(gdb) p $a1
● 238
$2 = 198
(gdb) s
32 csrr t3,mpccr
(gdb) p $a0
$3 = 990
(gdb)
In this sequence, we stepped through the execution of lines 29 to 31, where the multiply
by ten was performed by two shifts and an add. Examining the result in $a0 confirms that
99 was indeed multiplied by 10.
As we step through the rest of the measure() function, the computed return value may
report a wildly large number. This is because the mpccr values returned in measure.S:28
and measure.S:32 include a number of cycles that are going on in the background as a
result of being debugged in single stepping mode. With that in mind, let's step until the
function returns:
(gdb) fr
#0 xit () at /Users/ve3wwg/riscv/repo/14/rdcycle/main/measure.S:34
34 xit: sub a0,t3,t1
(gdb) s
35 ret
(gdb)
app_main () at /Users/ve3wwg/riscv/repo/14/rdcycle/main/main.c:14
14 printf("shift cycles = %u\n",cycles);
(gdb) p cycles
$15 = 716845243
(gdb)
We see here that the value of cycles is much larger than it should be. But this is under-
standable since we single-stepped through the code while the cycles kept ticking.
Let's assume that you're now satisfied that the program is executing correctly and that you
want it to continue unhindered. We can "c[ontinue]" the program:
(gdb) c
Continuing.
At this point, gdb will seem to hang and not give you a prompt. Let's find out why. Press
Control-C to interrupt gdb:
^C
Thread 2 received signal SIGINT, Interrupt.
[Switching to Thread 1070133268]
0x42006342 in esp_vApplicationIdleHook () at /Users/ve3wwg/esp32c3/esp-idf/
● 239
components/esp_system/freertos_hooks.c:50
50 for (int n = 0; n < MAX_HOOKS; n++) {
(gdb) bt
#0 0x42006342 in esp_vApplicationIdleHook () at /Users/ve3wwg/esp32c3/esp-idf/
components/esp_system/freertos_hooks.c:50
#1 0x40384960 in prvIdleTask (pvParameters=<optimized out>) at /Users/ve3wwg/
esp32c3/esp-idf/components/freertos/tasks.c:3973
#2 0x40385b32 in vPortSetInterruptMask () at /Users/ve3wwg/esp32c3/esp-idf/
components/freertos/port/riscv/port.c:306
Backtrace stopped: frame did not save the PC
(gdb)
After interrupting gdb with Control-C, gdb reports to us that it is stuck in a FreeRTOS rou-
tine. This is what happens after app_main() returns, since control has now returned to the
environment that called our app_main().
While we're here, let's take note of the fact that gdb is FreeRTOS thread-aware. We can
report on the threads that are running using "i[nfo] "th[reads]":
(gdb) i th
Id Target Id Frame
* 2 Thread 1070133268 (Name: IDLE) 0x42006342 in esp_vApplicationIdleHook
()
at /Users/ve3wwg/esp32c3/esp-idf/components/esp_system/freertos_hooks.c:50
3 Thread 1070128416 (Name: esp_timer) 0x40385ba4 in vPortClearInterruptMask
(mask=1)
at /Users/ve3wwg/esp32c3/esp-idf/components/freertos/port/riscv/port.c:329
(gdb)
We see that we are now in the FreeRTOS IDLE thread, while another thread also exists as
the esp_timer thread. If our app_main() program was still running, it would have been
shown as running as the main thread. You can change threads if you want to examine
where it is executing by using "t[hread] <n>", where "<n>" is the thread number shown.
$ idf.py gdb
as before.
● 240
Configuration
This procedure assumes that the "Channel for console output" has been set to "USB Serial/
JTAG Controller". The default for ESP32-C3 projects is to use "UART0". To check this, per-
form the following in your build directory (and environment).
$ idf.py menuconfig
Then select the option "Component config " shown in Figure 15.3.
From Component config, select "ESP System Settings " shown in Figure 15.4.
● 241
Then look for the line "Channel for console output" in Figure 15.5.
If it already shows "USB Serial/JTAG Controller" then you are set. Otherwise, enter into that
menu and change the selection so that there is an "X" in the line USB Serial/JTAG Control-
ler" as shown in Figure 15.6.
Make certain you save your changes before quitting the menuconfig. After making changes
you will need to build and flash your project again.
Procedure
Using this procedure, the following general steps must be performed in order:
1. Configure your project to use the USB Serial console with menuconfig, build and
flash your ESP32-C3 device (as shown in the previous section).
2. Plug in your ESP32-C3 device USB cable to the PC (or unplug and replug).
● 242
3. In a new terminal session, establish your ESP environment (if necessary) and start
idf.py monitor.
4. In its own window (and environment), launch OpenOCD
5. In a third window (and environment) launch idf.py gdb.
When OpenOCD starts, it will again cause a reset of the device, so the monitor window may
show some output. Once the gdb window starts, you should again see the output:
$ idf.py gdb
Executing action: gdb
GNU gdb (crosstool-NG esp-2021r2-patch3) 9.2.90.20200913-git
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "--host=x86_64-host_apple-darwin12
--target=riscv32-esp-elf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
Now if you were to step or run through the program, the console output should appear in
your monitor window.
● 243
Problems
Problems can trip up the OpenOCD software, so it is sometimes necessary to recover from
them. If OpenOCD doesn't start correctly try the following:
For Linux/MacOS systems, check and kill any rogue OpenOCD processes still running:
Then repeat the procedure. Sometimes the ESP32-C3 device can be messed up by the
code flashed into it. If you suspect that, then include a vTaskDelay() call at the start of
app_main() to give you time to get the monitor started and the OpenOCD process. In that
manner, OpenOCD can reset the CPU before the monitor run can mess things up. Then
work through gdb to locate the source of your inflicted problem.
15.6. Miscellaneous
The gdb debugger is too large to fully cover its power in one chapter. However, a few more
things are worth mentioning:
1. When running gdb from Fedora Linux (without using JTAG), you need to start the
program running with the "r[un]" command. But before you do that, most users
set a breakpoint at main() with the use of "b[reakpoint] main" first. Then after you
start the program running, it will pause upon entry to the main() function.
2. Breakpoints are also useful when using JTAG. They permit you to skip the execu-
tion of a large body of code, until you get to the point where you are interested in
scrutinizing. But there may be restrictions based upon whether the code is in ROM,
flash, or RAM.
3. To delete a breakpoint, use the "d[elete] <n>" where "<n>" is the breakpoint
number.
4. You can list breakpoints with "i[nfo] b[reakpoints]".
5. The "p[rint]" command can also display the contents of structures and classes.
This is far better than instrumenting a program with oodles of printf() calls.
Don't be afraid to use the gdb help system and peruse the fine manual.
15.7. Summary
There is considerably more that could be explained about gdb's debugging capabilities. But
what you have seen in this chapter should get you started using gdb to locate bugs. JTAG
is wonderful in that it allows you to debug the actual hardware that you are using. Only do
be aware that there are a number of restrictions that the user should be aware of. Espressif
● 244
has documented these on their website.[5] The use of JTAG and gdb can save the user time
and frustration, by tracing assembly code one instruction at a time.
Bibliography
[1] https://docs.espressif.com/projects/esp-idf/en/latest/esp32c3/api-guides/jtag-
debugging/index.html
[2] https://www.gnu.org/software/bash/manual/html_node/Command-Line-Editing.html
[3] http://web.mit.edu/gnu/doc/html/features_7.html
[4] https://developer.apple.com/library/archive/documentation/DeveloperTools/gdb/gdb/
gdb_4.html
[5] https://www.sourceware.org/gdb/documentation/
[6] https://docs.espressif.com/projects/esp-idf/en/latest/esp32c3/api-guides/usb-serial-
jtag-console.html
● 245
With the utility and productivity of C/C++, there are likely to be times when you just want
to invoke a few assembly language instructions directly from your high-level code. The GCC
compiler makes this possible through the "asm" (or "__asm__") extension. This chapter
gets you started with the basics of this advanced facility.
This most basic form can be provided both inside and outside of functions (the extended
form can only be used inside functions).
Let's see this simple mechanism at work. I'll reuse the rdcycle project from chapter 14 but
use copies of the files for this chapter so that we can mess with it. Listing 16.1 is our first
demonstration of the code change to main.c. Take note of line 10, where a "nop" (no oper-
ation) instruction was added inline with the C code.
19 #include <stdio.h>
2 #include <stdint.h>
3
4 extern uint32_t measure(int mul);
5
6 void
7 app_main(void) {
8 uint32_t cycles;
● 246
9
10 asm volatile ( "nop" );
11
12 for ( int x=0; x<10; ++x ) {
13 cycles = measure(1);
14 printf("ultiply cycles = %u\n",cycles);
15 cycles = measure(0);
16 printf("shift cycles = %u\n",cycles);
17 }
18 fflush(stdout);
19 }
If you need convincing that everything still works, build, flash and monitor the program as
follows:
$ cd ~/riscv/repo/16/rdcycle
$ idf.py build
$ idf.py -p <<<yourport>>> flash monitor
Now let's examine the assembler listing for the main.c program in Listing 16.2, which has
been edited to reduce the page count. You can view the full listing by generating it as fol-
lows:
$ ~/riscv/repo/listesp main/main.c
...snip...
6 void
7 app_main(void) {
8 uint32_t cycles;
9
10 0008 0100 asm volatile ( "nop" );
11
22 nop
23 # 0 "" 2
24 #NO_APP
25 000a 232604FE sw zero,-20(s0)
26 000e 81A8 j .L2
...snip...
Listing 16.2: Edited assembler listing for main.c with nop added.
The important thing to note is how the compiler inserts the nop instruction at line 22 of
the listing. On the left side of the listing, you see at offset 0008 the assembled nop opcode
● 247
0100, in hexadecimal. Of course, this doesn't do much for the program but serves as an
illustration.
asm volatile (
"nop\n"
"\tnop\n"
);
Notice a few things here. We used the newline (\n) character to separate the two assembler
instructions. The tab character (\t) was put in front of the second opcode to indicate that
no label was present. A space works equally well. You might not need to do this for RISC-V,
but it makes the assembler listing easier to read. As you can see, lines 10 and 11 report
the two assembled nop instructions as we expected.
It is also possible to provide multiple opcodes on one line and in one string. Remember that
C/C++ will concatenate string constants if they are listed one after the other. So, in the last
example, the compiler would have compiled:
I believe that making code friendly for the reader is important. So listing opcodes on their
own lines is polite.
● 248
The "goto" keyword informs the compiler that there may be a jump to one of the labels
listed. Extended asm statements must be used inside a C/C++ function.
Token Description
%% Represents a single percent (%) character.
%{ Represents a left curly bracket ({).†
%| Represents a vertical bar (|).†
%} Represents a right curly bracket (}).†
%= Output a unique asm instance number
† While GCC documents these, they don't seem to be accepted by the RISC-V compiler for
the ESP32-C3.
The following example illustrates a listing extract where some of these special symbols are
used. Extracted listing lines 10 through 14 show the source lines that went into the asm
block, while listing lines 22 to 24 show what was actually written into the assembly file.
Notice that '%=' caused "10" to be substituted and confirmed in listing line 23. It is also
● 249
used as an immediate constant on line 24. The single '%' was written in place of the '%%',
confirmed in listing line 23.
The [asmSymbolicName] parameter is optional. When used, it gives the parameter a name
for references within the AssemblerTemplate like "%[name]". When omitted, use a ze-
ro-based parameter reference of the form '%0', '%1' etc. for parameters listed as output
operands. When the parameter is provided, you must put the square brackets around the
name.
16.3.2.1. Constraint
The constraint string normally begins with a '=' or '+' character for output parameters
(there are other possibilities, but they have limited use). Table 16.2 lists their functions.
Characters that follow this leading character, identify possibilities where the value resides
(there can be more than one). Table 16.3 lists the constraint characters discussed in this
chapter.
Character Meaning
'=' This operand is written to by this asm block. The previous value held
is discarded and replaced by new data.
'+' This operand is both read and written by this asm block.
Character Meaning
'm' A memory operand is permitted, with any kind of address that the
machine generally supports.
● 250
The GCC notes some special exceptions like the following. If the constraint used on an Out-
put Operand starts with a '+' (rather than '='), then that counts as two parameters (input
+ output). So, when specifying '%5', for example, make sure to take this into account. For
code reading sanity and correctness, it is far safer to use the '%[name]' form instead.
Note: In these first two Output Operand examples, I have left out the Clobbers clause so
that it can be explained later. Technically, these examples should list register t1 as being
clobbered (even though I was able to get away without it this time).
6 void
7 app_main(void) {
8 uint32_t cycles;
9 uint32_t ninety5;
10
11 0008 1303F005 asm volatile (
12 000c 232264FE " li t1,95\n"
13 0010 232404FE " sw t1,%[ninety5]\n"
14 " sw x0,%[cycles]\n"
15 : [ninety5] "=m" (ninety5), [cycles] "=m" (cycles)
25 li t1,95
26 sw t1,-28(s0)
27 sw x0,-24(s0)
Listing 16.3: The program app_main() initializing two variables from inline assembly.
1. This tells the compiler that we will refer to the variable in the assembler code as
"%[ninety5]".
2. The constraint character '=' indicates that the output value will be overwritten and
that any previous value of ninety5 is discarded.
3. The 'm' in the constraint indicates that the output variable is in memory.
4. Finally, the C expression '(ninety5)' indicates to the compiler where the value is
going.
● 251
1. This indicates that "%[cycles]" is how the variable for cycles is referenced in the
assembler code.
2. The constraint "=m" indicates that the output value will be overwritten and is in
memory.
3. The C expression "cycles" is used to provide the location for the operand output.
25 li t1,95
26 sw t1,-28(s0)
27 sw x0,-24(s0)
From this, we see how the compiler has inserted the target address for ninety5 as -28(s0),
and cycles as -24(s0). The compiler knows the offsets of these variables on the stack, rela-
tive to the save register s0. This frees the programmer from having to figure it out.
Note: It might seem that the asmSymbolicName is redundant ([cycles] vs (cycles)). But
what is specified in brackets (cexpression) can be a C/C++ expression, which need not
be a simple variable name.
27 #include <stdio.h>
2 #include <stdint.h>
3
4 extern uint32_t measure(int mul);
5
6 void
7 app_main(void) {
8 uint32_t cycles;
9 uint32_t ninety5;
10
11 0008 1303F005 asm volatile (
12 000c 232264FE " li t1,95\n"
13 0010 232404FE " sw t1,%0\n"
14 " sw x0,%1\n"
15 : "=m" (ninety5), "=m" (cycles)
25 li t1,95
26 sw t1,-28(s0)
27 sw x0,-24(s0)
In this example, the [asmSymbolicName] has been omitted from both output parameter
specifications. Because of this, the inline code uses "%0" to refer to the first parameter
ninety5, and "%1" to refer to cycles. An examination of listing lines 25 to 27 reveals that
● 252
the same code was generated. When there is a large number of output parameters, the use
of the [asmSymbolicName] is recommended for code clarity.
The optional [asmSymbolicName], the constraint and cexpression are used in a manner
similar to the output parameters, except that this time the data is coming from the C pro-
gram. A full C program example is illustrated in Listing 16.4.
1 #include <stdio.h>
2 #include <stdint.h>
3
4 void
5 app_main(void) {
6 uint32_t a=33, b=75, m;
7
8 asm volatile (
9 " lw a0,%1\n" // a0 = a
10 " lw a1,%2\n" // a1 = b
11 " mul t0,a0,a1\n" // t0 = a0 * a1
12 " sw t0,%0\n" // m = t0
13 : "=m" (m) // Outputs
14 : "m" (a), "m" (b) // Inputs
15 : "a0", "a1", "t0" // Clobbers
16 );
17
18 printf("%u * %u => %u\n",a,b,m);
19 fflush(stdout);
20 }
First build, flash and monitor it to convince yourself that it works. Here the variable a is
multiplied by the value in b and the product is stored in the variable m. The printf() call in
● 253
$ cd ~/riscv/repo/16/inmul
$ idf.py build
$ idf.py -p <<<yourport>>> flash monitor
...
I (257) cpu_start: Starting scheduler.
33 * 75 => 2475
The reported product is good. Now let's break down what was coded in Listing 16.1.
1. Line 9 loads the value from variable a, into register a0. Notice that the input value
for a is identified as "%1" because the expression (a) is the second parameter
within the entire asm block.
2. In the same manner, line 10 loads the value from variable b into register a1. The
value for b is coded as "%2" since it is the third parameter in the asm block.
3. Line 11 performs the multiplication, placing the product into temporary register t0.
4. Line 12 stores a word value from register t0 into the expression "%0", which is the
variable m (line 6).
5. Notice that the input parameter list uses a constraint of "m", whereas the output
constraint used "=m". Both reference a value in memory, but the output needs the
'=' to indicate how the value will be stored/updated. Inputs, on the other hand, can
simply be fetched from memory.
6. Line 15 identifies the registers that were clobbered by our assembly code.
16.3.3.1. Clobbers
For some platforms there can be side effects from executing certain opcodes that change
register values. Or we may simply assign registers to compute intermediate results. These
must be identified in the clobbers clause so that the compiler is informed. In Listing 16.4,
the registers a0, a1 and t0 were identified in the clobbers clause. This informs the compiler
that these registers were used ("clobbered").
The compiler must choose registers to use for input and output operands. Registers listed in
the clobbered list are not used by the compiler. Hence clobbered registers become available
for any use in your assembler code. Additionally, the stack pointer register must not appear
in the clobber list and must not be altered. The stack pointer must have the same value
upon exit as it had upon entry.
Note: Incorrect code can result when modified registers are not identified in the clobbers
clause. For example, imagine if the compiler placed an address in register t0, to be used
later on. But in the asm block that follows, you modified register t0. If register t0 is not
referenced in an input or output clause, then the compiler would be completely oblivious
to the fact that the address in t0 was lost.
There are two special clobber arguments listed in Table 16.4. The "cc" argument has no
value to RISC-V and can be ignored because it has no flags register. The "memory" argu-
● 254
ment may, however, be necessary if your code is reading/modifying memory outside of the
parameters provided for input/output.
Argument Meaning
cc Indicates that the flags register was modified. On platforms like RISC-V
that don't support a flags register, this argument is simply ignored.
memory Indicates that the code reads/modifies memory locations other than
those listed in the input/output operands. This may cause the compiler to
flush certain variables held in registers.
Table 16.4: Special Clobber Arguments.
1 #include <stdio.h>
2 #include <stdint.h>
3
4 void
5 app_main(void) {
6 uint32_t a=9300000, b=7500000;
7 union {
8 uint64_t m64;
9 uint32_t m32[2];
10 } u;
11
12 asm volatile (
13 " lw a0,%[a]\n" // a0 = a
14 " lw a1,%[b]\n" // a1 = b
15 " mulh t1,a0,a1\n" // t1 = high a0 * a1
16 " mul t0,a0,a1\n" // t0 = low a0 * a1
17 " sw t0,%[low]\n" // m = t0 (low word)
18 " sw t1,%[hi]\n" // m = t1 (high word)
19 : [low] "=m" (u.m32[0]), // Output: low
20 [hi] "=m" (u.m32[1]) // high
21 : [a] "m" (a), [b] "m" (b) // Inputs
22 : "a0", "a1", "t0", "t1" // Clobbers
23 );
24
25 printf("%u * %u => %llu\n",a,b,u.m64);
26 fflush(stdout);
27 }
● 255
The variables a and b have been assigned extra large constants to prove that our product
results in more than 32 bits, is valid. Due to the increased number of input and output
values, the [asmSymbolicName] form was used in this specification for inputs and outputs.
For example, "%[a]" refers to the input variable a (line 21).
A C language union was used in lines 7 through 10 to permit access of the 64-bit value
of u.m64 as a pair of 32-bit values u.m32[0] and u.m32[1]. Notice in the output clause,
how the C expressions in the refer to "%[low]" as the expression u.m32[0] and "%[hi]" as
m.32[1].
$ cd ~/riscv/repo/16/inmul64
$ idf.py build
$ idf.py -p <<<yoourport>>> flash monitor
I (257) cpu_start: Starting scheduler.
9300000 * 7500000 => 69750000000000
$ bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
9300000 * 7500000
69750000000000
^D
$
The results agree. To see the result in hexadecimal, try the following:
$ bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
obase=16
9300000 * 7500000
3F6FEFF91C00
^D
$
● 256
Given that the result is longer than 8 hexadecimal digits, we know the result size is greater
than 32 bits.
The first thing to take special note of is that there can be no output operands with this form
(note how the output operands clause is blank). If you try to specify outputs, you get the
compile error:
This limits the usefulness of this form, but this is what we have to work with. The GCC doc-
umentation indicates that the asm goto statement is always considered volatile. I suggest
that you always include it in case the compiler defaults change.
Note: The GCC document suggests that you can provide input/output operands in the
asm goto statement using the '+' constraint. However, I was not able to succeed in
this using the compiler version: riscv32-esp-elf-gcc (crosstool-NG esp-2021r2-patch3)
8.4.0. That functionality may vary with platform type.
The addition of the "goto" keyword permits the specification of C/C++ language labels that
can be referenced by the asm code. When referencing the labels within the assembler code,
use one of the following formats:
• "%l<n>", for example, "%l2" (note the lowercase 'L' after the percent
character).
• "%l[label]", for example, "%l[exception]" (note the lowercase 'L' after the
percent character).
Another restriction is that the total number of input + output + goto operands is limited
to 30.
● 257
1 #include <stdio.h>
2 #include <stdint.h>
3
4 void
5 app_main(void) {
6 uint32_t a=315, b=75, r=0;
7
8 asm volatile goto (
9 " lw a1,%[b]\n" // a1 = b
10 " beqz a1,%l[excep]\n" // Jump if b=zero
11 : /* no outputs allowed for goto */
12 : [b] "m" (b) // Inputs
13 : "a1" // Clobbers
14 : excep
15 );
16
17 asm volatile (
18 " lw a0,%[a]\n" // a0 = a
19 " lw a1,%[b]\n" // a1 = b
20 " div a0,a0,a1\n" // a0 /= a1
21 " sw a0,%[r]\n" // r = result
22 : [r] "=m" (r) // Outputs
23 : [a] "m" (a), [b] "m" (b) // Inputs
24 : "a0", "a1" // Clobbers
25 );
26
27 printf("%u / %u => %u\n",a,b,r);
28 fflush(stdout);
29 return;
30
31 excep:
32 printf("Division by zero!\n");
33 fflush(stdout);
34 }
Because the "goto" form does not permit the specification of output parameters, there were
two asm blocks defined in this program. The first block (lines 8 to 15) tests if variable b is
zero, and if so, branches to the C label "excep". Otherwise, it just returns. Of course, it is
silly to use asm to test for zero in this way, but it is justified for demonstration purposes.
● 258
The second block is a normal asm block so it can actually perform the division and return
the result (lines 17 to 25). Again, this simple asm example is justifiable in the name of
education.
Build, flash and monitor the program to see if the "happy path" works as expected:
$ cd ~/riscv/repo/16/except
$ cd idf.py build
$ idf.py -p <<<yourport>>> flash monitor
...
I (257) cpu_start: Starting scheduler.
315 / 75 => 4
The divide was indeed performed and is correct. Now edit the program main.c so that the
value of b is zero. Change line 6 to read:
In this particular run, the goto was in fact performed to report "Division by zero!". In this
example, the only way execution can proceed to the label "excep" is from our asm code.
One of the best applications for the asm goto form is perhaps the management of a state
machine. Where entry into the asm block moves the execution from one label to another,
depending upon the current state.
1 #include <stdio.h>
2 #include <stdint.h>
3
4 // #pragma GCC optimize ("-O3")
5
6 void
7 app_main(void) {
8 uint32_t a=33, b=75, m;
9
● 259
10 asm volatile (
11 " mul %[m],%[a],%[b]\n" // m = a * b
12 : [m] "=r" (m) // Outputs
13 : [a] "r" (a), [b] "r" (b) // Inputs
14 ); // No clobbers
15
16 printf("%u * %u => %u\n",a,b,m);
17 fflush(stdout);
18 }
Notice that in line 12, the constraint is "=r" indicating that the product for variable m is ex-
pected to be in a register ('=' indicates that the result will overwrite the original value). The
constraint "r" is used in line 13 for both variables a and b. This means that both of these
variables are expected to have values already in memory.
With the pragma optimize commented out (line 4), let's examine the assembler listing to
see what the compiler does with this.
$ cd ~/riscv/repo/16/inmulrr
$ ~/riscv/repo/listesp main/main.c
1 .file "main.c"
2 .option nopic
3 .text
4 .section .rodata
5 .align 2
6 .LC0:
7 0000 2575202A .string "%u * %u => %u\n"
7 20257520
7 3D3E2025
7 750A00
8 .text
9 .align 1
10 .globl app_main
12 app_main:
13 0000 0111 addi sp,sp,-32
14 0002 06CE sw ra,28(sp)
15 0004 22CC sw s0,24(sp)
16 0006 0010 addi s0,sp,32
17 0008 93071002 li a5,33
18 000c 2326F4FE sw a5,-20(s0)
19 0010 9307B004 li a5,75
20 0014 2324F4FE sw a5,-24(s0)
21 0018 8327C4FE lw a5,-20(s0)
● 260
DEFINED SYMBOLS
*ABS*:0000000000000000 main.c
/var/folders/jp/_ktfnf412kvbdznr4m9769tr0000gn/T//ccYqAtij.s:12
.text:0000000000000000 app_main
● 261
/var/folders/jp/_ktfnf412kvbdznr4m9769tr0000gn/T//ccYqAtij.s:6
.rodata:0000000000000000 .LC0
UNDEFINED SYMBOLS
printf
__getreent
fflush
$ cd ~/riscv/repo/16/inmulrr
$ cd idf.py build
$ idf.py -p <<<yourport>>> flash monitor
...
I (257) cpu_start: Starting scheduler.
33 * 75 => 2475
Indeed, it does! Now uncomment the #pragma in line 4 of main.c and repeat the build.
Then produce an assembler listing (as shown in Listing 16.9) to see what the optimized
code looks like:
$ cd ~/riscv/repo/16/inmulrr
$ cd idf.py build
$ ~/riscv/repo/listesp main/main.c
● 262
1 .file "main.c"
2 .option nopic
3 .text
4 .section .rodata.str1.4,"aMS",@progbits,1
5 .align 2
6 .LC0:
7 0000 2575202A .string "%u * %u => %u\n"
7 20257520
7 3D3E2025
7 750A00
8 .text
9 .align 1
10 .globl app_main
12 app_main:
13 0000 4111 addi sp,sp,-16
14 0002 06C6 sw ra,12(sp)
15 0004 93061002 li a3,33
16 0008 9307B004 li a5,75
17 #APP
18 # 10 "main/main.c" 1
1 #include <stdio.h>
2 #include <stdint.h>
3
4 #pragma GCC optimize ("-O3")
5
6 void
7 app_main(void) {
8 uint32_t a=33, b=75, m;
9
10 000c B386F602 asm volatile (
11 " mul %[m],%[a],%[b]\n" // m = a * b
12 : [m] "=r" (m) // Outputs
19 mul a3,a3,a5
20
21 # 0 "" 2
22 #NO_APP
23 0010 37050000 lui a0,%hi(.LC0)
24 0014 1306B004 li a2,75
25 0018 93051002 li a1,33
26 001c 13050500 addi a0,a0,%lo(.LC0)
27 0020 97000000 call printf
27 E7800000
28 0028 97000000 call __getreent
28 E7800000
29 0030 B240 lw ra,12(sp)
30 0032 0845 lw a0,8(a0)
● 263
DEFINED SYMBOLS
*ABS*:0000000000000000 main.c
/var/folders/jp/_ktfnf412kvbdznr4m9769tr0000gn/T//ccjVYvnV.s:12
.text:0000000000000000 app_main
/var/folders/jp/_ktfnf412kvbdznr4m9769tr0000gn/T//ccjVYvnV.s:6 .rodata.
str1.4:0000000000000000 .LC0
UNDEFINED SYMBOLS
printf
__getreent
fflush
1. Line 15 loads the value for a into the register a3. Notice that no store to memory
has been made yet to variable a.
2. Line 16 loads the value for b into the register a5. Likewise, the memory copy of
variable b is left uninitialized.
3. Lines 10 and 19 show that the "mul" instruction multiplies registers a3 and a5,
replacing a3 with the product.
4. The printf() format string address is loaded in line 23 in a0 (first argument).
5. The value 33 for a is simply loaded into a1 in line 25 (second argument).
6. The value 75 for b is simply loaded into a2 in line 24 (third argument).
7. The product is already in register a3, so this gets passed to printf() as the fourth
argument in the call to print in line 27.
From all of this, it should be clear that leaving the argument shuffle up to the C/C++ com-
piler works to your advantage. Use the "r" register constraint whenever it is practical to do
so.
It is tempting to supply "rm" for an operand constraint. This does not work for RISC-V be-
cause if the operand is in memory, you will have written a load or store instruction like "sw".
You cannot store words to a register, causing the compiler to complain (its operand must be
a memory reference). Likewise, you cannot supply a memory reference to an opcode like
"mul", for example. That opcode requires a register operand.
● 264
16.7. Summary
If you found the use of the asm statement tedious in this chapter, then don't be discour-
aged. It is somewhat cryptic but is useful when minor assembler help is required. It is
otherwise not the best form for larger assembly language code segments. However, it is a
good tool to keep in your bag of tricks. You'll also run across it in other people's code. All
the more reason to master it.
There are some additional asm tricks for advanced use that didn't make it into this chapter.
You can view those other options in detail at the gnu website.[1] I hope, however, that you
found this chapter a painless guide to getting started. Having mastered the basics, building
upon that is easier for those who need it.
Please accept my thanks for allowing me to be your guide in this grand adventure. Re-
ignite that passion for controlling your RISC-V machine at its most basic level: assembly
language!
Bibliography
[1] https://gcc.gnu.org/onlinedocs/gcc/Basic-Asm.html#Basic-Asm
● 265
Index
32-bit register 83
H
A hello_world 25
ABI register 53 HomeBrew 43
Ada 148
AMD instructions sets 14 I
Application Binary Interface 53 IDLE thread 240
apt 37 Install Prerequisites 21
Arch 22 Instruction Set Architecture 13
arithmetic shift 118
J
B JTAG support 227
boot.bat script 48
L
C License Agreement 30
C/C++ 81 Linux USB access 29
C compiler 62 LP64 Solaris model 62
CentOS 21
cexpression 253 M
CHG340 chip 19 Machine Performance Counter Value 225
CMU extensions 216 MacOS 22
COM port 33 Manual Installation 19
CSR 186 matrix organized 164
cycle snapshot 221 MIPS 15
MISA register 59
D MPCCR counter 222
Debian 21 MXL field 210
E N
embedded platforms 178 NaN-boxing 192
Environment Variables 25 NULL pointer 164
ESP-IDF 30
ESP-IDF PowerShell 32 O
OpenOCD 226
F OpenOCD session 230
Fedora Linux 18, 41 operating privilege mode 209
FLEN bits 185 overflow occurred 147
floating-point format 192
FreeRTOS 212 P
Pre-Processor 197
G pseudo-instructions 212
GCC 197 PuTTY 50
gdb debugger 87
● 266
Q
QEMU emulator 13
QEMU Install 47
S
SRAM 81
SSH Authentication 45
stack layout 109
Stanford University 15
subi 130
T
tilda 233
U
UART0 241
Ubuntu 21
Unicode UTF-8 29
USB Serial/JTAG Controller 242
USB-UART bridge 19
V
variable bcount 119
W
WARL 210
X
XLEN 51
XLEN-bit divisor 153
Xtensa 16
● 267
RISC-V Assembly
Language Programming
Using the ESP32-C3 and QEMU
With the availability of free and open source C/C++ compilers today, you Warren Gay is a datablocks.net
might wonder why someone would be interested in assembler language. senior software developer, writing
Linux internet servers in C++.
What is so compelling about the RISC-V Instruction Set Architecture
He got involved with electronics
(ISA)? How does RISC-V differ from existing architectures? And most at an early age, and since then he
importantly, how do we gain experience with the RISC-V without a major has built microcomputers and has
investment? Is there affordable hardware available? worked with MC68HC705, AVR,
STM32, ESP32 and ARM computers,
just to name a few.
The availability of the Espressif ESP32-C3 chip provides a way to get
hands-on experience with RISC-V. The open sourced QEMU emulator
adds a 64-bit experience in RISC-V under Linux. These are just two
ways for the student and enthusiast alike to explore RISC-V in this book.
The projects in this book are boiled down to the barest essentials to keep
the assembly language concepts clear and simple. In this manner you
will have “aha!” moments rather than puzzling about something difficult.
The focus in this book is about learning how to write RISC-V assembly
language code without getting bogged down. As you work your way
through this tutorial, you’ll build up small demonstration programs to be
run and tested. Often the result is some simple printed messages to prove
a concept. Once you’ve mastered these basic concepts, you will be well
equipped to apply assembly language in larger projects.