2 RISC V Performance ISA
Fall 2024
[Adapted from Computer Organization and Design, P&H, UCB; Computer Architecture, Jie Zhang
Guangyu Sun, PKU; Computer Architecture, Myoungsoo Jung, KAIST]
Performance
Peking University
Performance Metrics
Ø Purchasing perspective
• given a collection of machines, which has the
  • best performance?
  • least cost?
  • best cost/performance?
Ø Design perspective
• faced with design options, which has the
  • best performance improvement?
  • least cost?
  • best cost/performance?
Ø Both require
• basis for comparison
• metric for evaluation
Ø Our goal is to understand what factors in the architecture contribute to overall system
performance and the relative importance (and cost) of these factors
Defining (Speed) Performance
Ø Normally interested in reducing
• Response time (aka execution time) – the time between the start and the completion
of a task
• Important to individual users
• Thus, to maximize performance, need to minimize execution time
Machine Clock Rate
Ø Clock rate (MHz, GHz) is inverse of clock cycle time (clock period)
CC = 1 / CR
Performance
Ø Two common measures
• Latency (how long to do X)
• Also called response time and execution time
• Throughput (how often can it do X)
Ø Example of car assembly line
• Takes 6 hours to make a car (latency is 6 hours per car)
• A car leaves every 5 minutes (throughput is 12 cars per hour)
• Overlap results in Throughput > 1/Latency
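The assembly-line numbers can be checked with a few lines of arithmetic. This is only a sketch; the variable names are illustrative and not part of the original slides.

```python
# Assembly-line numbers from the example above.
latency_hours = 6.0           # time to build one car from start to finish
minutes_between_cars = 5.0    # a finished car leaves the line this often

throughput_per_hour = 60.0 / minutes_between_cars    # cars per hour with overlap
unpipelined_throughput = 1.0 / latency_hours         # cars per hour with no overlap

# Cars that must be in progress at the same time to sustain this rate.
cars_in_progress = throughput_per_hour * latency_hours

print(throughput_per_hour, unpipelined_throughput, cars_in_progress)
```

The overlap shows up as the gap between the two throughput figures: latency per car is unchanged, but many cars are being built at once.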
Clock Cycles per Instruction
Ø Not all instructions take the same amount of time to execute
• One way to think about execution time is that it equals the number of instructions
executed multiplied by the average time per instruction
• Clock cycles per instruction (CPI) – the average number of clock cycles each
instruction takes to execute
• A way to compare two different implementations of the same ISA
Effective CPI
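Ø Computing the overall effective CPI is done by looking at the different
types of instructions and their individual cycle counts and averaging,
weighted by how often each type occurs:
effective CPI = sum over i of (CPIi x ICi), where ICi is the fraction of
executed instructions that belong to class i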
CPU Performance Equation (1)
CPU time = CPU Clock Cycles x Clock cycle time
CPU time = Instruction Count x Cycles Per Instruction x Clock cycle time
Car Analogy
Ø Drive from SIT to Sinchon
• “Clock Speed” = 3500 RPM
• “CPI” = 5250 rotations/km or 0.19 m/rot
• “Insts” = 60 km
Time = 60 km x 5250 rotations/km / 3500 rotations/min = 90 minutes
CPU Version
Ø Program takes 33 billion instructions to run
Ø CPU processes instructions at 2 cycles per inst
Ø Clock speed of 3GHz
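Plugging these three numbers into the CPU performance equation gives the run time. The helper name `cpu_time_seconds` below is an illustrative assumption, not something from the slides.

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = Instruction Count x CPI x Clock cycle time (= 1 / clock rate)."""
    return instruction_count * cpi / clock_rate_hz

# 33 billion instructions, 2 cycles per instruction, 3 GHz clock.
t = cpu_time_seconds(33e9, 2, 3e9)
print(f"{t:.1f} s")   # prints "22.0 s"
```

This is the same structure as the car analogy: distance x rotations per km / rotations per minute.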
CPU Performance Equation (2)
CPU time = CPU Clock Cycles x Clock cycle time
Comparing Performance
Ø “X is n times faster than Y” means
Performance of X / Performance of Y = Execution time of Y / Execution time of X = n
A Simple Example
Op      Freq  CPIi  Freq x CPIi  (load = 2)  (branch = 1)  (2 ALU at once)
ALU     50%   1     .5           .5          .5            .25
Load    20%   5     1.0          .4          1.0           1.0
Store   10%   3     .3           .3          .3            .3
Branch  20%   2     .4           .4          .2            .4
Sum                 2.2          1.6         2.0           1.95
• How much faster would the machine be if a better data cache reduced the
average load time to 2 cycles?
CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster
• How does this compare with using branch prediction to shave a cycle off the
branch time?
CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster
• What if two ALU instructions could be executed at once?
CPU time new = 1.95 x IC x CC so 2.2/1.95 means 12.8% faster
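The three what-if answers above can be reproduced from the instruction mix. This is a sketch: the dictionary layout and the helper names `effective_cpi` and `with_cpi` are illustrative choices, not part of the slides.

```python
# Instruction mix from the table: op -> (frequency, cycles per instruction).
base_mix = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}

def effective_cpi(mix):
    """Frequency-weighted average of the per-class CPIs."""
    return sum(freq * cpi for freq, cpi in mix.values())

def with_cpi(mix, op, new_cpi):
    """Return a copy of the mix with one class's CPI replaced."""
    updated = dict(mix)
    updated[op] = (mix[op][0], new_cpi)
    return updated

base = effective_cpi(base_mix)
scenarios = {
    "load takes 2 cycles": with_cpi(base_mix, "Load", 2),
    "branch takes 1 cycle": with_cpi(base_mix, "Branch", 1),
    "two ALU ops at once": with_cpi(base_mix, "ALU", 0.5),
}
for name, mix in scenarios.items():
    cpi = effective_cpi(mix)
    # IC and CC are unchanged, so the speedup is just the ratio of CPIs.
    print(f"{name}: CPI = {cpi:.2f}, {100 * (base / cpi - 1):.1f}% faster")
```

Because instruction count and clock cycle time stay fixed in all three scenarios, comparing CPU times reduces to comparing effective CPIs.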
If Only it Were That Simple
Ø “X is n times faster than Y on A”
Ø But what about different applications (or even parts of the same
application)?
• X is 10 times faster than Y on A, and 1.5 times on B, but Y is 2 times faster
than X on C, and 3 times on D, and…
Which would you buy?
So does X have better performance than Y?
Summarizing Performance
Ø Arithmetic mean
• Average execution time
• Gives more weight to longer-running programs
Ø Weighted arithmetic mean
• More important programs can be emphasized
• But what do we use as weights?
• Different weights will make different machines look better
Normalizing & the Geometric Mean
Ø Speedup of arithmetic means != arithmetic mean of speedups
Ø Use geometric mean: (r1 x r2 x ... x rn)^(1/n) of the n normalized ratios ri
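A small numeric sketch of why the geometric mean is used for normalized results. The execution times below are made-up illustrative values, not measurements from the slides.

```python
import math

def geometric_mean(values):
    """n-th root of the product of n values."""
    return math.prod(values) ** (1.0 / len(values))

# Hypothetical execution times (seconds) of two programs on machines X and Y.
times_x = [1.0, 100.0]
times_y = [10.0, 50.0]

# Speedup of X over Y on each program: time on Y divided by time on X.
speedups = [y / x for x, y in zip(times_x, times_y)]

speedup_of_means = (sum(times_y) / 2) / (sum(times_x) / 2)
mean_of_speedups = sum(speedups) / len(speedups)

gm_of_speedups = geometric_mean(speedups)
ratio_of_gms = geometric_mean(times_y) / geometric_mean(times_x)

print(speedup_of_means, mean_of_speedups)   # the two arithmetic summaries disagree
print(gm_of_speedups, ratio_of_gms)         # the two geometric summaries agree
```

The geometric mean of the ratios equals the ratio of the geometric means, so the summary does not depend on which machine is chosen as the normalization reference.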
CPI/IPC
Ø Often when making comparisons in comp-arch studies
• Program (or set of) is the same for two CPUs
• The clock speed is the same for two CPUs
Ø So we can directly compare CPIs, and often we use IPCs (instructions per
cycle, the reciprocal of CPI)
Average CPI vs. “Average” IPC
Harmonic Mean
A.M.(CPI) vs. H.M.(IPC)
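A numeric sketch of the relationship named in this slide title, assuming every program executes the same number of instructions. The CPI values are hypothetical.

```python
# Hypothetical CPIs of three programs on one machine (equal instruction counts assumed).
cpis = [0.5, 1.0, 4.0]
ipcs = [1.0 / c for c in cpis]

am_cpi = sum(cpis) / len(cpis)                     # arithmetic mean of CPI
am_ipc = sum(ipcs) / len(ipcs)                     # naive "average" IPC
hm_ipc = len(ipcs) / sum(1.0 / i for i in ipcs)    # harmonic mean of IPC

print(am_cpi, 1.0 / hm_ipc)   # these two agree
print(am_cpi, 1.0 / am_ipc)   # these two do not
```

Averaging IPC arithmetically overstates performance; the harmonic mean of IPC is the reciprocal of the arithmetic mean of CPI, so the two summaries stay consistent.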
Amdahl’s Law (1)
Amdahl’s Law (2)
Ø Make the Common Case Fast
Amdahl’s Law (3)
Ø Diminishing Returns
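The slide formula did not survive extraction, so the sketch below uses the standard textbook statement of Amdahl's Law: overall speedup = 1 / ((1 - f) + f / s), where f is the fraction of time that is enhanced and s is the speedup of that fraction. The function name and the sample values are illustrative.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the execution time is sped up."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Make the common case fast: the same 10x enhancement helps far more
# when it applies to 90% of the time than to 10% of it.
print(amdahl_speedup(0.9, 10), amdahl_speedup(0.1, 10))

# Diminishing returns: with f = 0.5 the overall speedup can never reach 2,
# no matter how large s becomes.
for s in (2, 10, 100, 1e9):
    print(s, amdahl_speedup(0.5, s))
```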
Yet Another Car Analogy
Now Consider Price-Performance
Ø Without Turbo
• Car costs $8,000 to manufacture
• Selling price is $12,000 → $4K profit per car
• If we sell 10,000 cars, that’s $40M in profit
Ø With Turbo
• Car costs an extra $3,000 to manufacture ($11,000 in total)
• Selling price is $16,000 → $5K profit per car
• But only a few gear heads buy the car:
• We only sell 400 cars and make $2M in profit
CPU Design is Similar
Ø What does it cost me to add some performance enhancement?
Ø How much effective performance do I get out of it?
• 100% speedup for small fraction of time wasn’t a big win for the car example
Ø How much more do I have to charge for it?
• Extra development, testing, marketing costs
Ø How much more can I charge for it?
• Does the market even care?
Ø How does the price change affect volume?
Summary: Evaluating ISAs
Ø Design-time metrics:
• Can it be implemented, in how long, at what cost?
• Can it be programmed? Ease of compilation?
Ø Static Metrics:
• How many bytes does the program occupy in memory?
Ø Dynamic Metrics:
• How many instructions are executed? How many bytes does the processor fetch
to execute the program?
• How many clocks are required per instruction?
• How "lean" a clock is practical?
Best Metric: Time to execute the program!
It depends on the instruction set, the processor organization, and compilation techniques.
RISC-V ISA
Abstraction Hierarchy
Application Software
Definitions
Ø The words of a computer’s language are called instructions, and its
vocabulary is called an instruction set.
Ø The similarity of instruction sets occurs because all computers are
constructed from hardware technologies based on similar underlying
principles and because there are a few basic operations that all computers
must provide.
Ø Computer designers have a common goal: to find a language that makes it
easy to build the hardware and the compiler while maximizing performance and
minimizing cost and energy.
What is ISA (Instruction Set Architecture)?
(von Neumann) Processor Organization
Ø Datapath needs to have the
• Components – the functional units and storage (e.g., register file) needed
to execute instructions
• Interconnects – components connected so that the instructions can be
accomplished and so that data can be loaded from and stored to Memory
Ø Control needs to
1. Bring input instructions from Memory
2. Issue signals to control the information flow between the Datapath
components and to control what operations they perform
3. Manage instruction sequencing
[Figure: Memory, Control and Datapath; instruction cycle Fetch, Decode, Execute, (Store)]
History of ISA designs
Ø Long ago, resources were limited:
• Memory: very expensive and very small capacity
• Most programmers worked in assembly language
[Figure: IBM's first hard disk (5 MB in total)]
Ø CISC (Complex Instruction Set Computer)
• Advantage: dense instruction size (1~15 bytes)
• Advantage: programmer friendly
• Drawback: complexity, almost 1 new instruction per month
• Drawback: hardware unfriendly (compiler, registers, state machines)
e.g., a = a * b
CISC assembly instruction:
MULT 2:3, 5:2
RISC
ØRISC (Reduced Instruction Set Computer) philosophy
• fixed instruction lengths
• load-store instruction sets
• limited addressing modes
• limited operations
e.g., a = a * b
CISC assembly instruction:
MULT 2:3, 5:2
RISC assembly instructions:
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
ARM, RISC-V, MIPS, Sun SPARC, HP PA-RISC, IBM PowerPC, Intel (Compaq) Alpha, …
RISC: Past, Present, and Future
PC Era
Ø Hardware translates x86 instructions into internal RISC instructions
Ø Then use any RISC technique inside MPU
Ø > 350M / year!
Ø x86 ISA eventually dominates servers as well as desktops
PostPC Era: Client/Cloud
• IP in SoC vs. MPU
• Value die area, energy as much as performance
• > 20B total / year in 2017
• x86 in PCs peaks in 2011, now decline ~8% / year (2016 < 2007)
• x86 servers ⇒ Cloud ~10M servers total* (0.05% of 20B)
• 99% Processors today are RISC
Source: P&H Turing Award Talk; * “A Decade of Mobile Computing”, Vijay Reddi, 7/21/17, Computer Architecture Today
What is RISC‐V
• RISC-V (pronounced "risk-five") is an ISA standard
– An open instruction set architecture (ISA) based on reduced instruction set
computing (RISC) principles
– There were RISC-I, II, III, IV before
• Most ISAs: X86, ARM, Power, MIPS, SPARC
– Commercially protected by patents
– Preventing practical efforts to reproduce the computer systems.
• RISC-V is open
– Permitting any person or group to construct compatible computers
– Use associated software
• Originated in 2010 by researchers at UC Berkeley
– Krste Asanović, David Patterson and students
• In 2017, version 2 of the user-level ISA was fixed
– User-Level ISA Specification v2.2
– Draft Compressed ISA Specification v1.79
– Draft Privileged ISA Specification v1.10
Goals in Defining RISC‐V
• A completely open ISA that is freely available to academia and industry
• An ISA that avoids "over-architecting" for
– a particular microarchitecture style (e.g., microcoded, in-order, decoupled, out-of-
order) or
– implementation technology (e.g., full-custom, ASIC, FPGA), but which allows
efficient implementation in any of these
• RISC-V ISA includes
– A small base integer ISA, usable by itself as a base for customized accelerators or
for educational purposes, and
– Optional standard extensions, to support general-purpose software development
– Optional custom extensions
• Support for the revised 2008 IEEE-754 floating-point standard
RISC‐V ISA Principles
• Generally kept very simple and extendable
• Separated into multiple specifications
– User-Level ISA spec (compute instructions)
– Compressed ISA spec (16-bit instructions)
– Privileged ISA spec (supervisor-mode instructions)
– More …
User Level ISA
• Defines the normal instructions needed for computation
– A mandatory Base integer ISA
• I: Integer instructions:
– ALU
– Branches/jumps
– Loads/stores
– Standard Extensions
• M: Integer Multiplication and Division
• A: Atomic Instructions
• F: Single‐Precision Floating‐Point
• D: Double‐Precision Floating‐Point
• C: Compressed Instructions (16 bit)
• G = IMAFD: Integer base + four standard extensions
– Optional extensions
RISC‐V Instruction Set Architecture (ISA)
Ø Instruction Categories
• Arithmetic, Logical, Shift
• Data transfer
• Un-/Conditional branch
Ø Registers: x0 - x31
Question review
Ø“Wired” orders of immediate
https://five-embeddev.com/riscv-isa-manual/latest/a.html#
Issues to be Explored
ØInstruction Types
• R, I, S, SB, U and UJ
ØHow to identify/encode these instructions?
ØHow to process data?
• Data types supported
• Where to store data?
• Addressing methods
RISC‐V Arithmetic Instructions
ØRISC-V assembly language arithmetic statement
add x5, x6, x7
sub x5, x6, x7
Ø Each arithmetic instruction performs 1 operation
Ø Each arithmetic instruction fits in 32 bits and specifies
exactly 3 operands
Aside: RISC‐V Register Convention
Name       Register Number  Usage
x0         0                the constant value 0
x1 (ra)    1                return address (link register)
x2 (sp)    2                stack pointer
x3 (gp)    3                global pointer
x4 (tp)    4                thread pointer
x5 - x7    5-7              temporaries
x8 - x9    8-9              frame pointer/saved
x10 - x17  10-17            arguments/results
x18 - x27  18-27            saved
x28 - x31  28-31            temporaries
RISC‐V Register File
Register File
Ø Holds thirty-two 64-bit registers
• How many read ports? 2
• How many write ports? 1
[Figure: Register File with 32 locations of 64 bits; inputs src1 addr, src2 addr, dst addr, write data, write control; outputs src1 data, src2 data]
Ø Registers are
• Faster than other memory levels
  - But register files with more locations are slower (e.g., a 64-word file
    could be as much as 50% slower than a 32-word file)
  - Read/write port increase impacts speed quadratically
• Easier for a compiler to use
  - e.g., (A*B) – (C*D) – (E*F) can do multiplies in any order vs. stack
• Can hold variables so that
  - code density improves (since registers are named with fewer bits
    than a memory location)
Machine Language ‐ Add Instruction
Instructions, like words of data, are 32 bits long (the registers themselves are 64 bits wide)
Arithmetic Instruction Format (R format):
add x5, x6, x7
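The field diagram for this slide was lost in extraction. As a sketch, the standard R-format layout (funct7 | rs2 | rs1 | funct3 | rd | opcode, from most to least significant bit) can be packed like this; the helper name `encode_r_type` is an illustrative assumption.

```python
def encode_r_type(funct7, rs2, rs1, funct3, rd, opcode):
    """Pack the six R-format fields into one 32-bit instruction word."""
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
            (funct3 << 12) | (rd << 7) | opcode)

# add x5, x6, x7: rd = 5, rs1 = 6, rs2 = 7, funct3 = 000, funct7 = 0000000
add_x5_x6_x7 = encode_r_type(0b0000000, 7, 6, 0b000, 5, 0b0110011)
# sub shares the opcode and funct3; only funct7 differs (0100000).
sub_x5_x6_x7 = encode_r_type(0b0100000, 7, 6, 0b000, 5, 0b0110011)

print(f"{add_x5_x6_x7:#010x}")   # prints 0x007302b3
print(f"{sub_x5_x6_x7:#010x}")   # prints 0x407302b3
```

Each register specifier is 5 bits wide, which is exactly enough to name one of the 32 registers.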
RISC‐V Memory Access Instructions
• RISC-V has two basic data transfer instructions for accessing
memory
ld x5, 24(x6) #load doubleword from memory
sd x5, 24(x6) #store doubleword to memory
• The data is loaded into (ld) or stored from (sd) a register in the
register file – a 5 bit address
Machine Language ‐ Load & Store Instruction
Load Instruction Format (I format):
ld x5, 24(x6)
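The I-format field diagram is also missing here. A minimal sketch of the standard layout (imm[11:0] | rs1 | funct3 | rd | opcode) for the load above; `encode_i_type` is an illustrative helper name.

```python
def encode_i_type(imm, rs1, funct3, rd, opcode):
    """Pack the I-format fields; imm is a signed 12-bit value."""
    assert -2048 <= imm <= 2047, "immediate must fit in 12 signed bits"
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

# ld x5, 24(x6): rd = 5, rs1 = 6 (base register), imm = 24, funct3 = 011
ld_x5_24_x6 = encode_i_type(24, 6, 0b011, 5, 0b0000011)
# A negative offset is stored in two's complement in the same 12 bits.
ld_x5_m8_x6 = encode_i_type(-8, 6, 0b011, 5, 0b0000011)

print(f"{ld_x5_24_x6:#010x}")   # prints 0x01833283
print(f"{ld_x5_m8_x6:#010x}")   # prints 0xff833283
```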
Byte Addresses
, RISC-V
Loading and Storing Bytes
ØRISC-V provides special instructions to move bytes
lb x5, 40(x6) #load byte from memory
sb x5, 40(x6) #store byte to memory
Immediate Instructions
Ø Small constants are used often in typical code
Ø Possible approaches?
• put “typical constants” in memory and load them
• create hard-wired registers (like x0) for constants like 1
• have special instructions that contain constants!
How About Larger Constants?
We'd also like to be able to load a long constant into a register. For this we have a
"load upper immediate" instruction:
lui x5, 34534
lui loads a 20-bit constant into bits 12 through 31 of a register. The most significant 32
bits are filled with copies of bit 31, and the least significant 12 bits are filled with zeros.
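The effect of lui on a 64-bit register can be modeled in a few lines. This is a sketch of the rule just described, not an excerpt from any simulator; the function name `lui` is illustrative.

```python
def lui(imm20):
    """Return the 64-bit register value produced by 'lui rd, imm20'."""
    value = (imm20 & 0xFFFFF) << 12          # constant lands in bits 12..31, low 12 bits are 0
    if value & 0x8000_0000:                  # bit 31 set?
        value |= 0xFFFF_FFFF_0000_0000       # copy bit 31 into the upper 32 bits
    return value

print(f"{lui(34534):#018x}")      # the example above: prints 0x00000000086e6000
print(f"{lui(0x80000):#018x}")    # bit 31 set:        prints 0xffffffff80000000
```

A following addi can then fill in the low 12 bits to build a full 32-bit constant.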
RISC‐V Control Flow Instructions
ØRISC-V conditional branch instructions:
beq rs1, rs2, L1 #go to L1 if rs1==rs2
bne rs1, rs2, L1 #go to L1 if rs1!=rs2
Ex: if (i==j) h = i + j;
bne x22, x23, L1
add x19, x22, x23
L1: ...
Specifying Branch Destinations
Ø Use a register (like in ld and sd) added to the 12-bit offset
• which register? Instruction Address Register (PC)
• its use is automatically implied by instruction
• PC gets updated (PC+4) during the fetch cycle so that it holds the address
of the next instruction
• limits the branch distance to -2^10 to +2^10-1 words from the branch
instruction (in RISC-V the offset is added to the address of the branch
itself), but most branches are local anyway
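The branch range can be derived from the field width. The sketch below assumes the standard SB-format rule that the 12-bit immediate counts 2-byte units, so the byte offset has an implicit low bit of 0; the constant names are illustrative.

```python
IMM_BITS = 12          # width of the branch immediate field
BYTES_PER_UNIT = 2     # the immediate counts 2-byte units
WORD_BYTES = 4         # one 32-bit instruction

min_offset_bytes = -(1 << (IMM_BITS - 1)) * BYTES_PER_UNIT
max_offset_bytes = ((1 << (IMM_BITS - 1)) - 1) * BYTES_PER_UNIT

# Expressed in 4-byte instruction words relative to the branch itself.
min_offset_words = min_offset_bytes // WORD_BYTES
max_offset_words = max_offset_bytes // WORD_BYTES

print(min_offset_bytes, max_offset_bytes)   # byte range
print(min_offset_words, max_offset_words)   # word range
```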
Instructions for Accessing Procedures
RISC-V procedure call instruction:
jal x1, ProcedureAddress #jump and link
Saves PC+4 in register x1 to have a link to the next instruction for the procedure return
Machine format (UJ format):
immediate[20,10:1,11,19:12] rd opcode
Return instruction:
jalr x0, 0(x1) #jump and link register
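The scrambled immediate order immediate[20,10:1,11,19:12] in the UJ format is easier to follow in code. This is a sketch; `encode_jal` and `decode_jal_offset` are illustrative helper names, and the offset is a byte offset relative to the jal instruction.

```python
def encode_jal(rd, offset):
    """Encode 'jal rd, offset' with the UJ-format immediate bit order."""
    assert offset % 2 == 0 and -(1 << 20) <= offset < (1 << 20)
    imm = offset & 0x1FFFFF                       # 21-bit two's complement
    return (((imm >> 20) & 0x1) << 31 |           # imm[20]    -> inst[31]
            ((imm >> 1) & 0x3FF) << 21 |          # imm[10:1]  -> inst[30:21]
            ((imm >> 11) & 0x1) << 20 |           # imm[11]    -> inst[20]
            ((imm >> 12) & 0xFF) << 12 |          # imm[19:12] -> inst[19:12]
            rd << 7 | 0b1101111)                  # rd and the jal opcode

def decode_jal_offset(inst):
    """Recover the signed byte offset from an encoded jal."""
    imm = (((inst >> 31) & 0x1) << 20 | ((inst >> 21) & 0x3FF) << 1 |
           ((inst >> 20) & 0x1) << 11 | ((inst >> 12) & 0xFF) << 12)
    return imm - (1 << 21) if imm & (1 << 20) else imm

print(f"{encode_jal(1, 8):#010x}")    # jal x1, 8:  prints 0x008000ef
print(f"{encode_jal(0, -4):#010x}")   # jal x0, -4: prints 0xffdff06f
```

Bit 0 of the offset is never stored because jump targets are always even, which is why the field list starts at imm[1].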
Unconditional Branch
Use register x0 to help
jal x0, Label #unconditionally branch to Label
Instructions for Accessing Procedures
Recall how a function works in the programming language:
1. Parameters;
2. Reserve caller’s info;
3. Global variables;
Name       Register Number  Usage                           Saved by
x0         0                the constant value 0            n.a.
x1 (ra)    1                return address (link register)  caller
x2 (sp)    2                stack pointer                   callee
x3 (gp)    3                global pointer                  --
x4 (tp)    4                thread pointer                  --
x5 - x7    5-7              temporaries                     caller
x8 - x9    8-9              frame pointer/saved             callee
x10 - x17  10-17            arguments/results               caller
x18 - x27  18-27            saved                           callee
x28 - x31  28-31            temporaries                     caller
Spilling Registers
What if the callee needs more registers? What if the procedure is recursive?
• uses a stack – a last-in-first-out queue – in memory for passing additional
values or saving (recursive) return address(es)
[Figure: stack contents: argument registers, return address, saved registers]
Instructions for Synchronization
• Synchronization primitive:
• A simple lock: the value 0 is used to indicate that the lock is free
and 1 is used to indicate that the lock is unavailable
• Hardware: atomic exchange or atomic swap
• Recall: the compare-and-swap (C&S) primitive in CISC
Instructions for Synchronization (cont.)
C&S is a complex instruction, so RISC-V uses a pair of R-type instructions to replace it:
lr.d x5, (x6) # x5 = Memory[x6], and reserve the address
sc.d x7, x5, (x6) # Memory[x6] = x5 if the reservation holds; x7 = 0 (success) / nonzero (failure)
• lr.d: load-reserved doubleword;
• sc.d: store-conditional doubleword;
• Function: if the contents of the memory location specified by the load-reserved are
changed before the store-conditional to the same address occurs, then the store-
conditional fails and does not write the value to memory.
• Example of atomic exchange between x23 and (x20):
Instructions for Atomic Memory Updates
The pair of synchronization instructions is used to achieve atomic memory
updates without locking.
lr.d x5, (x6) # x5 = Memory[x6]
sc.d x7, x5, (x6) # Memory[x6] = x5; x7=0/1
In this example, lr.d (load reserved) will load the value stored at Memory[x6] into
register x5, then you can modify it however you like there.
sc.d (store conditional) will overwrite Memory[x6] with your modified value in x5, only
if Memory[x6] has not been altered while you were working on the copy in x5.
Question review
Ø Edge case study of instruction LR.d, SC.d
Ø Case 1: LR/SC addresses don’t match – can this succeed?
lr.w t0,(a0)
sc.w t1,a1,(a3)
Ø Case 2: unbalanced LR.d, SC.d
lr.w t0,(a0)
sc.w t1,a1,(a0)
addi a1,a1,1
sc.w t2,a1,(a0)
Ø Case 3: multiple LRs, SCs from one core
lr.w t0,(a0)
lr.w t1,(a2)
sc.w t2,a1,(a0)
sc.w t3,a1,(a2)
Note that:
• the SC.W succeeds only if the reservation is still valid and the
reservation set contains the bytes being written.
• Regardless of success or failure, executing an SC.W instruction
invalidates any reservation held by this hart.
https://five-embeddev.com/riscv-isa-manual/latest/a.html#
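The three edge cases can be traced with a toy model built from the two rules in the note. This is only a sketch under strong assumptions: a single hart, no interference from other harts, and a reservation set that covers exactly the address named by the most recent LR (real implementations may reserve a larger region). The class `Hart` and the addresses are hypothetical.

```python
class Hart:
    """Toy single-hart model of LR/SC with an exact-address reservation set."""

    def __init__(self, memory):
        self.memory = memory        # dict: address -> value
        self.reservation = None     # address reserved by the last LR, if any

    def lr(self, addr):
        self.reservation = addr     # a new LR replaces any earlier reservation
        return self.memory[addr]

    def sc(self, addr, value):
        ok = self.reservation == addr
        if ok:
            self.memory[addr] = value
        self.reservation = None     # SC always invalidates the reservation
        return 0 if ok else 1       # 0 = success, nonzero = failure

A0, A2, A3 = 0x100, 0x200, 0x300    # hypothetical addresses held in a0, a2, a3

# Case 1: LR and SC use different addresses.
h = Hart({A0: 0, A2: 0, A3: 0})
h.lr(A0)
case1 = h.sc(A3, 7)

# Case 2: one LR followed by two SCs to the same address.
h = Hart({A0: 0, A2: 0, A3: 0})
h.lr(A0)
case2 = (h.sc(A0, 1), h.sc(A0, 2))

# Case 3: two LRs, then two SCs.
h = Hart({A0: 0, A2: 0, A3: 0})
h.lr(A0)
h.lr(A2)
case3 = (h.sc(A0, 5), h.sc(A2, 5))

print(case1, case2, case3)
```

In this model, case 1 fails, the second SC of case 2 fails, and both SCs of case 3 fail, because each SC consumes the single reservation the hart can hold.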
RISC‐V ISA So Far
Category           Instr               Op Code  Example          Meaning
Arithmetic         add                 0110011  add x5, x6, x7   x5 = x6 + x7
(R & I format)     subtract            0110011  sub x5, x6, x7   x5 = x6 - x7
                   add immediate       0010011  addi x5, x6, 20  x5 = x6 + 20
                   or immediate        0010011  ori x5, x6, 20   x5 = x6 | 20
Data Transfer      load double word    0000011  ld x5, 40(x6)    x5 = Memory[x6 + 40]
(I, S & U format)  store double word   0100011  sd x5, 40(x6)    Memory[x6 + 40] = x5
                   load byte           0000011  lb x5, 40(x6)    x5 = sign-extended Memory[x6 + 40](7:0)
                   store byte          0100011  sb x5, 40(x6)    Memory[x6 + 40](7:0) = x5(7:0)
                   load upper imm      0110111  lui x5, 0x12345  x5 = 0x12345000
Cond. Branch       br on equal         1100011  beq x5, x6, 100  if (x5 == x6) go to PC+100
(SB format)        br on not equal     1100011  bne x5, x6, 100  if (x5 != x6) go to PC+100
Jump               jump and link       1101111  jal x1, imm      x1 = PC+4; PC = PC+{imm,1’b0}
(UJ and I format)  jump and link reg.  1100111  jalr x0, 0(x1)   PC = x1 + 0 (link discarded in x0)
Addressing modes
• Register addressing – operand is in a register
Addressing modes (cont.)
• Immediate addressing – operand is a 12-bit constant contained within the
instruction
RISC‐V Design Principles
ØSimplicity favors regularity
• fixed size instructions – 32-bits
• small number of instruction formats
ØGood design demands good compromises
• six instruction formats
ØSmaller is faster
• limited instruction set
• limited number of registers in register file
• limited number of addressing modes
ØMake the common case fast
• arithmetic operands from the register file (load-store machine)
• allow instructions to contain immediate operands