CSC 520 Chapter 1
• Performance improvements:
- Improvements in semiconductor technology
• Feature size, clock speed
- Improvements in computer architectures
• Enabled by HLL compilers, UNIX
• Led to RISC architectures
• “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing”
Paul Otellini, President, Intel (2004)
• The difference this time: all microprocessor companies are switching to multiprocessors (AMD, Intel, IBM, Sun; all new Apples have 2 CPUs)
- Procrastination is penalized: sequential performance improves only 2X every 5 years
- Biggest programming challenge: going from 1 to 2 CPUs
• Class of ISA
- General-purpose registers
- Register-memory vs load-store
• RISC-V registers
- 32 g.p., 32 f.p.

Register  Name       Use                  Saver
x0        zero       constant 0           n/a
x1        ra         return addr          caller
x2        sp         stack ptr            callee
x3        gp         gbl ptr              -
x4        tp         thread ptr           -
x5-x7     t0-t2      temporaries          caller
x8        s0/fp      saved/frame ptr      callee
x9        s1         saved                callee
x10-x17   a0-a7      arguments            caller
x18-x27   s2-s11     saved                callee
x28-x31   t3-t6      temporaries          caller
f0-f7     ft0-ft7    FP temps             caller
f8-f9     fs0-fs1    FP saved             callee
f10-f17   fa0-fa7    FP arguments         caller
f18-f27   fs2-fs11   FP saved             callee
f28-f31   ft8-ft11   FP temps             caller
• Memory addressing
- RISC-V: byte addressed, aligned accesses are faster
• Addressing modes
- RISC-V: Register, immediate, displacement (base+offset)
- Other examples: autoincrement, indexed, PC-relative
• Types and size of operands
- RISC-V: 8-bit, 32-bit, 64-bit
• Operations
- RISC-V: data transfer, arithmetic, logical, control, floating point
- See Fig. 1.5 in text
• Control flow instructions
- Use content of registers (RISC-V) vs. status bits (x86, ARMv7, ARMv8)
- Return address in register (RISC-V, ARMv7, ARMv8) vs. on stack (x86)
• Encoding
- Fixed (RISC-V, ARMv7/v8 except compact instruction set) vs. variable length (x86)
1/24/2024 Chapter 1 Fundamentals of Quantitative Design and Analysis 13
RISC-V Registers
• Bandwidth (throughput)
- Total work done in a given time
- 32,000-40,000X improvement for processors
- 300-1200X improvement for memory and disks
• Latency or response time
- Time between start and completion of an event
- 50-90X improvement for processors
- 6-8X improvement for memory and disks
• Interesting Observation
- Bandwidth hurts latency
- Latency helps bandwidth
Latency Lags Bandwidth (last ~20 years)
• CPU high, Memory low (“Memory Wall”)
• Bandwidth improves by more than the square of the improvement in latency
- Moore’s law
- Distance
- OS overhead
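
The square rule of thumb can be checked against the improvement factors quoted on the bandwidth and latency slide. The sketch below is only an illustrative check: the dictionary layout and function name are assumptions, and it makes the most conservative comparison (lowest bandwidth gain against the highest latency gain, squared).

```python
# Improvement factors quoted on the slide, as (low, high) ranges.
improvements = {
    "processors": {"bandwidth": (32_000, 40_000), "latency": (50, 90)},
    "memory and disks": {"bandwidth": (300, 1_200), "latency": (6, 8)},
}


def bandwidth_beats_latency_squared(entry):
    """True if the smallest bandwidth gain exceeds the largest latency gain squared."""
    bandwidth_low, _ = entry["bandwidth"]
    _, latency_high = entry["latency"]
    return bandwidth_low > latency_high ** 2


results = {
    name: bandwidth_beats_latency_squared(entry)
    for name, entry in improvements.items()
}

for name, holds in results.items():
    print(f"{name}: bandwidth gain > (latency gain)^2 is {holds}")
```

Even in the conservative case the rule holds for both rows: 32,000 exceeds 90 squared (8,100), and 300 exceeds 8 squared (64).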
• Feature size
- Minimum size of transistor or wire in x or y dimension
• For CMOS chips, traditional dominant energy consumption has been in switching
transistors, called dynamic power
$$\text{Power}_{dynamic} \propto \tfrac{1}{2} \times \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency switched}$$
• For a fixed task, slowing clock rate (frequency switched) reduces power, but not energy
• Capacitive load a function of number of transistors connected to output and
technology, which determines capacitance of wires and transistors
• Dropping voltage helps both power and energy, so supply voltages went from 5 V to 1 V
• To save dynamic power, most CPUs now turn off the clock of inactive modules (e.g., the floating-point unit)
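
The difference between power and energy in the bullets above can be made concrete with a small sketch. It treats the proportionality as an equality, and the capacitance, voltage, frequency, and cycle-count values are made-up illustrative numbers, not figures from the slides.

```python
import math


def dynamic_power(cap_load, voltage, freq):
    """Dynamic power, taking the proportionality as an equality (assumption)."""
    return 0.5 * cap_load * voltage ** 2 * freq


def dynamic_energy(cap_load, voltage, freq, cycles):
    """Energy for a fixed task: power multiplied by run time (cycles / freq)."""
    return dynamic_power(cap_load, voltage, freq) * (cycles / freq)


# Hypothetical values chosen only to make the arithmetic easy to follow.
CAP = 1e-9      # farads
CYCLES = 1e9    # clock cycles needed by the fixed task

base_power = dynamic_power(CAP, 1.0, 2e9)
base_energy = dynamic_energy(CAP, 1.0, 2e9, CYCLES)

# Halving the clock rate halves power, but the task runs twice as long.
slow_power = dynamic_power(CAP, 1.0, 1e9)
slow_energy = dynamic_energy(CAP, 1.0, 1e9, CYCLES)

# Lowering the voltage reduces both power and energy.
lowv_power = dynamic_power(CAP, 0.8, 2e9)
lowv_energy = dynamic_energy(CAP, 0.8, 2e9, CYCLES)

print(base_power, slow_power, lowv_power)
print(base_energy, slow_energy, lowv_energy)
```

The frequency term cancels in the energy calculation, which is why slowing the clock leaves energy for a fixed task unchanged while the squared voltage term affects both quantities.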
- Power
$$\frac{\text{Power}_{new}}{\text{Power}_{old}} = 0.72 \times \frac{\text{Frequency switched} \times 0.85}{\text{Frequency switched}} = 0.61$$
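
The 0.72 factor in the power ratio is consistent with the voltage also being scaled by 0.85, since dynamic energy and power both depend on the square of the voltage. Assuming that is the intended scenario (the slide text shows the 0.85 factor only on frequency), the steps would be:

```latex
\frac{\text{Energy}_{new}}{\text{Energy}_{old}}
  = \frac{(\text{Voltage} \times 0.85)^2}{\text{Voltage}^2}
  = 0.85^2 \approx 0.72

\frac{\text{Power}_{new}}{\text{Power}_{old}}
  = 0.72 \times \frac{\text{Frequency switched} \times 0.85}{\text{Frequency switched}}
  = 0.85^3 \approx 0.61
```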
• Heat must be dissipated from a 1.5 x 1.5 cm chip
MTTF =
• Elapsed time
- Total response time, including all aspects
• Processing, I/O, OS overhead, idle time
- Determines system performance
• CPU time
- Time spent processing a given job
• Discounts I/O time, other jobs’ shares
• Performance improved by
- Reducing number of clock cycles
- Increasing clock rate
- Hardware designer must often trade off clock rate against cycle count
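
The trade-off in the last bullet can be illustrated by computing CPU time as clock cycles divided by clock rate. This is a minimal sketch; the two designs and their numbers are hypothetical.

```python
import math


def cpu_time(clock_cycles, clock_rate_hz):
    """CPU time in seconds: clock cycles divided by clock rate."""
    return clock_cycles / clock_rate_hz


# Hypothetical design A: faster clock, but the program needs more cycles.
time_a = cpu_time(clock_cycles=2.0e9, clock_rate_hz=2.0e9)

# Hypothetical design B: slower clock, but fewer cycles for the same program.
time_b = cpu_time(clock_cycles=1.5e9, clock_rate_hz=1.6e9)

print(f"A: {time_a:.4f} s, B: {time_b:.4f} s")
```

With these numbers design B finishes first despite its lower clock rate, so neither cycle count nor clock rate can be judged in isolation.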
• “Instruction Frequency”
$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \cdot F_i \quad \text{where } F_i = \frac{\text{IC}_i}{\text{IC}}$$
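
As a quick illustration of the weighted-CPI formula, the sketch below computes overall CPI from per-class instruction counts. The instruction classes, counts, and per-class CPI values are invented for the example.

```python
import math


def overall_cpi(mix):
    """Weighted CPI: sum of CPI_i * F_i, where F_i = IC_i / IC.

    `mix` maps an instruction class to a pair (IC_i, CPI_i).
    """
    total_ic = sum(ic for ic, _ in mix.values())
    return sum(cpi * (ic / total_ic) for ic, cpi in mix.values())


# Hypothetical instruction mix: class -> (instruction count, CPI for that class).
example_mix = {
    "alu": (50, 1),
    "load": (20, 2),
    "store": (10, 2),
    "branch": (20, 2),
}

cpi = overall_cpi(example_mix)
print(f"overall CPI = {cpi:.2f}")
```

Here the frequencies are 0.5, 0.2, 0.1, and 0.2, so the overall CPI is 0.5 + 0.4 + 0.2 + 0.4 = 1.5.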
• Frequent case is often simpler and can be done faster than the infrequent case
- E.g., overflow is rare when adding 2 numbers, so improve performance by optimizing more
common case of no overflow
- May slow down overflow, but overall performance improved by optimizing for the normal case
• Which case is frequent, and how much does performance improve when that case is made faster? => Amdahl’s Law
Amdahl’s Law
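$$\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$$

$$\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$

$$\text{Speedup}_{maximum} = \frac{1}{1 - \text{Fraction}_{enhanced}}$$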
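
A short sketch of Amdahl's Law in code may help when experimenting with different fractions. The function names and the example values (40% of execution time enhanced, made 10 times faster) are illustrative assumptions, not figures from the slides.

```python
import math


def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of execution time is sped up."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def max_speedup(fraction_enhanced):
    """Upper bound on speedup as the enhanced part becomes infinitely fast."""
    if not 0.0 <= fraction_enhanced < 1.0:
        raise ValueError("fraction_enhanced must be in [0, 1)")
    return 1.0 / (1.0 - fraction_enhanced)


# Hypothetical enhancement: 40% of the time is affected and runs 10x faster.
overall = amdahl_speedup(0.4, 10.0)
limit = max_speedup(0.4)

print(f"overall speedup = {overall:.4f}, upper bound = {limit:.4f}")
```

With these inputs the overall speedup is 1 / 0.64 = 1.5625, already close to the bound of 1 / 0.6, which shows how the unenhanced fraction dominates.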
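$$1.25 = \frac{\text{SPECRatio}_A}{\text{SPECRatio}_B} = \frac{\dfrac{\text{ExecutionTime}_{reference}}{\text{ExecutionTime}_A}}{\dfrac{\text{ExecutionTime}_{reference}}{\text{ExecutionTime}_B}} = \frac{\text{ExecutionTime}_B}{\text{ExecutionTime}_A} = \frac{\text{Performance}_A}{\text{Performance}_B}$$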
$$\text{GeometricMean} = \sqrt[n]{\prod_{i=1}^{n} \text{SPECRatio}_i}$$
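
One reason to summarize SPECRatios with a geometric mean is that the comparison between two machines does not depend on the reference machine. The sketch below illustrates this with invented execution times; the helper names and all timing values are assumptions.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed via logarithms for stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def spec_ratios(reference_times, measured_times):
    """SPECRatio per benchmark: reference time divided by measured time."""
    return [ref / t for ref, t in zip(reference_times, measured_times)]


# Hypothetical execution times in seconds for three benchmarks.
reference = [100.0, 200.0, 400.0]
machine_a = [50.0, 100.0, 100.0]
machine_b = [25.0, 200.0, 200.0]

gm_a = geometric_mean(spec_ratios(reference, machine_a))
gm_b = geometric_mean(spec_ratios(reference, machine_b))

# Ratio of geometric means versus geometric mean of the per-benchmark
# time ratios, which never mentions the reference machine.
ratio_of_means = gm_a / gm_b
mean_of_ratios = geometric_mean([b / a for a, b in zip(machine_a, machine_b)])

print(ratio_of_means, mean_of_ratios)
```

The two printed values agree, so the reference times cancel out of the comparison, matching the SPECRatio derivation above.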
• Bose-Einstein formula:
- Defects per unit area = 0.016-0.057 defects per square cm (2010)
- N = process-complexity factor = 11.5-15.5 (40 nm, 2010)
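
The yield formula itself did not survive in these notes. The sketch below assumes the commonly cited Bose-Einstein form, die yield = wafer yield × 1 / (1 + defects per unit area × die area)^N, which should be checked against the text. The wafer yield of 100%, the defect density of 0.03 per square cm, and N = 12 are illustrative picks within the ranges quoted above; the 1.5 x 1.5 cm die is borrowed from the heat-dissipation bullet.

```python
def die_yield(defects_per_cm2, die_area_cm2, n, wafer_yield=1.0):
    """Assumed Bose-Einstein yield model (verify against the textbook):

    die yield = wafer yield * 1 / (1 + defects per unit area * die area) ** N
    """
    return wafer_yield / (1.0 + defects_per_cm2 * die_area_cm2) ** n


# Illustrative inputs only; not values taken from a specific process.
large_die = die_yield(defects_per_cm2=0.03, die_area_cm2=1.5 * 1.5, n=12)
small_die = die_yield(defects_per_cm2=0.03, die_area_cm2=1.0 * 1.0, n=12)

print(f"1.5 cm die: {large_die:.3f}, 1.0 cm die: {small_die:.3f}")
```

Under this model, yield falls quickly as die area grows, which is one reason die area matters so much for IC cost.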
Summary
• Architectural Trends
• Performance Metrics
• Geometric Means
• IC cost