Computer Architecture Note
Computer architecture
Chap 1: Intro
Performance evaluation: Response time
CPU performance: the time the CPU actually spends executing the user program.
Performance ratio: Performance = 1 / Execution time, so Performance_X / Performance_Y = Execution time_Y / Execution time_X
CPU time = CPU clock cycles × clock cycle time = CPU clock cycles / clock rate
→ Increase performance by reducing either the length of the clock cycle or the
number of clock cycles required for a program
Computer architecture 1
Clock rate (clock cycles per second in MHz or GHz)
Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute
Performance equation: CPU time = Instruction count × CPI × clock cycle time = Instruction count × CPI / clock rate
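The performance equation can be sanity-checked with a small Python sketch (the program size, CPI, and clock rate below are made-up illustration numbers, not from these notes):

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 1 billion instructions, average CPI of 2, 4 GHz clock
print(cpu_time(1_000_000_000, 2.0, 4e9))  # 0.5 seconds
```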
Improve performance
Shorter clock cycle = faster clock rate → latest CPU technology
CPU (Central Processing Unit)
Memory
Input/Output
Link
Computer functions
Executing Programs
Instruction Cycle: the processing required for a single instruction
execution
Interrupts
Interrupt Handling:
Temporarily pauses the current program.
Interrupt cycle
Sources of Interrupts:
Program exceptions (e.g., arithmetic overflow), timer, I/O devices, hardware failures.
Multiple Interrupts:
System Interconnection
CPU & I/O → CPU-controlled data transfer
I/O & Memory: data is transferred between memory and I/O under the
control of special controllers called DMA controllers (DMACs).
Chap 3: Instruction Set Architecture
Core Components of RISC-V ISA
RISC-V Operands
Types:
Registers: Fast storage inside the CPU (32 registers in RV32I, each
32-bit).
Memory: Slower but larger storage for variables, arrays, and data
structures.
Data Types: Byte (8 bits), Halfword (16 bits), Word (32 bits), and
Doubleword (64 bits). RV32 registers hold 4-byte words.
Registers
Each register has a unique 5-bit address.
Register Operations
Advantages
Faster than memory due to direct access.
Usage
Memory
Memory Operations
Memory operands are stored in main memory, which is 100 to 500 times slower than the register file.
High-level language programs use memory operands for variables, arrays and strings, and composite data structures.
Endianness:
RISC-V uses Little Endian (LSB at the smallest address).
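Little-endian order can be seen directly with Python's struct module (an illustrative one-liner, not part of the notes):

```python
import struct

value = 0x12345678
little = struct.pack("<I", value)  # little-endian: LSB at the lowest address
print(little.hex())  # 78563412 -> the least significant byte 0x78 is stored first
```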
Immediate operand
Does not need to be stored in the register file or memory. The value is encoded right in the
instruction → faster
Instruction Formats
6 formats: R, I, S, B, U, J. Why not only one format? Or 20 formats? → Good
design demands good compromises!
Wide Immediates:
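A wide (32-bit) constant does not fit the 12-bit I-type immediate field, so it is typically built from lui (upper 20 bits) plus addi (lower 12 bits). A sketch of the split, accounting for addi sign-extending its immediate (assumed helper, not from the notes):

```python
def split_imm32(value):
    """Split a 32-bit constant into a lui part (upper 20 bits) and an
    addi part (lower 12 bits). Because addi sign-extends its immediate,
    the upper part must absorb the carry when the low 12 bits are >= 0x800."""
    lo = value & 0xFFF
    if lo >= 0x800:          # addi immediate becomes negative after sign-extension
        lo -= 0x1000
    hi = (value - lo) >> 12  # adjusted upper 20 bits
    return hi & 0xFFFFF, lo

hi, lo = split_imm32(0xDEADBEEF)
print(hex(hi), lo)  # the pair recombines as (hi << 12) + lo
```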
Stack structure
A region of memory operating on LIFO
The bottom of the stack is at the highest address; the stack grows downward toward lower addresses
Passing control
Passing data
Use registers: input arguments (a0–a7), return values (a0–a1)
Memory Management and Stack
Stack
Procedure Calls:
Six steps:
1. Put parameters in a place where the procedure can access them (a0–a7).
2. Transfer control to the procedure (jal).
3. Acquire the storage resources needed for the procedure (stack).
4. Perform the desired task.
5. Put the result value in a place where the calling program can access it (a0–a1).
6. Return control to the point of origin (jalr ra / ret).
RISC-V memory configuration
Program text: stores the machine code of the program, declared with .text
Addressing
Immediate addressing: A mode of addressing where the operand is directly
specified within the instruction itself, rather than in a register or memory
location
→ I-type instructions
→ Useful in accessing array elements or variables within a data segment
Overflow
Integer Representation
Unsigned Binary Integers
Range: 0 to 2^{n}−1.
Signed integers (two's complement)
Range: −2^{n−1} to 2^{n−1}−1.
To negate: flip all bits and add 1 to the least significant bit (LSB).
Example:
+2 = 0000 0010 → flip: 1111 1101 → add 1: 1111 1110 = −2
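The flip-and-add-one rule can be checked mechanically (a minimal sketch, assuming an 8-bit width):

```python
def twos_complement_negate(x, bits=8):
    """Negate x via two's complement: flip all bits within the width, then add 1."""
    flipped = x ^ ((1 << bits) - 1)        # flip all bits
    return (flipped + 1) & ((1 << bits) - 1)

print(format(twos_complement_negate(0b00000010), "08b"))  # 11111110, i.e. -2 in 8 bits
```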
Signed/unsigned instruction pairs:
lb/lbu, lh/lhu
blt/bltu, bge/bgeu
slt/sltu, slti/sltiu
div/divu, rem/remu
Integer Arithmetic
Addition and Subtraction
Carryout:
Occurs when the result produces a carry beyond the maximum bit
width.
Overflow:
Occurs when the signed result falls outside the representable range.
When adding operands with different signs, or subtracting operands with the same sign, overflow can never occur.
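The overflow rule above can be sketched as a range check in Python (illustrative helper, assuming a 32-bit signed width):

```python
def add_overflows(a, b, bits=32):
    """Detect signed overflow for a + b in two's complement of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return not (lo <= a + b <= hi)

print(add_overflows(2**31 - 1, 1))  # True: positive + positive wraps negative
print(add_overflows(-5, 3))         # False: operands with different signs never overflow
```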
Multiplexer
Multiplication and Division
Floating point number: Sign, mantissa, and exponent
Ex: 2013.1228 = 2.0131228 * 10^3 = 2.0131228E+03
mantissa: 2.0131228
exponent: 3
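For binary floating point, the same sign/exponent/mantissa decomposition applies to the IEEE 754 single-precision format (used by RISC-V's F extension). A sketch of extracting the three fields (illustrative, not from the notes):

```python
import struct

def decompose_float(x):
    """Decompose a 32-bit IEEE 754 float into its sign, exponent, mantissa fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # biased by 127
    mantissa = bits & 0x7FFFFF       # fraction field; leading 1 is implicit
    return sign, exponent, mantissa

print(decompose_float(-2.0))  # (1, 128, 0): -1 * 1.0 * 2^(128-127)
```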
Chap 5: The Processor
CPU implementation (datapath, datapath with control, multiplexor)
Pipeline
Datapath
Def: the collection of functional units and registers within the CPU that are
responsible for the manipulation and movement of data. It handles the
processing of data during execution. Components of the datapath are:
Registers (register file)
ALU
Multiplexers
Memory units
Shifters
Control
directing the operation of the datapath components by generating the
appropriate control signals. Components:
Control signals
Instruction decoder
Program counter
⇒ The datapath handles the actual data processing (operations like arithmetic
or moving data between registers), while the control unit ensures the correct
sequencing and timing of operations.
Sending the fetched instruction’s opcode and function field bits to the
control unit
The control unit sends the appropriate control signals to other parts inside the
CPU to execute the operations corresponding to the instruction
What is the ALU? The Arithmetic Logic Unit performs the arithmetic/logical operation on the operands and
stores the result back into the register file (register rd)
Executing Load and store (Memory instructions)
Calculate address using 12-bit offset (Use ALU, but sign-extend offset)
store: read from the Register File, write to the Data Memory
load: read from the Data Memory, write to the Register File
Executing Branch instruction (beq)
Instruction times (critical paths)
Pipelining: start fetching and executing the next instruction before the current
one has completed. This is called overlapping execution
Laundry work
With n loads:
When n → ∞, T_non-pipelined → 4 × T_pipelined, i.e., the speedup approaches the number of stages (4 in the laundry analogy)
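The laundry-analogy speedup can be computed directly for an ideal pipeline (a sketch; the 4-stage, 0.5-hour numbers are the usual laundry illustration, assumed here):

```python
def pipeline_speedup(n_tasks, n_stages, stage_time):
    """Speedup of an ideal pipeline over sequential execution for n_tasks."""
    sequential = n_tasks * n_stages * stage_time
    pipelined = (n_stages + n_tasks - 1) * stage_time  # fill the pipe, then one result per cycle
    return sequential / pipelined

print(pipeline_speedup(4, 4, 0.5))      # 16/7, about 2.3x for only 4 loads
print(pipeline_speedup(10**6, 4, 0.5))  # approaches 4 (the stage count) as n grows
```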
RISC-V pipeline
Pipeline performance
Data hazards
Data hazards happen when an instruction depends on the result of an earlier instruction still in the pipeline (e.g., the dependent instruction pairs that la/li pseudo-instructions expand into)
Structure hazards
In RISC-V pipeline with a single memory
Example: a CPU has only one memory unit. Two instructions need to access
memory at the same time (1 load and 1 store)
Fix register file access hazard by doing reads in the second half of
the cycle and writes in the first half.
Use the result when it is computed. Don’t wait until it’s stored in the
register.
Forward from EX to EX
Code scheduling to avoid stalls
Control hazards
In RISC-V pipeline
Delayed branch
Branch prediction
Memory technologies:
1. SRAM
2. DRAM: 50 ns – 70 ns, $10 – $20 per GB
3. Flash memory
4. Magnetic disk memory
Memory Hierarchy
Reg File > Instr cache & Data cache > SRAM > DRAM > Secondary
Memory (Disk)
Locality principle
Temporal locality: if a memory location is referenced, then it will tend to be
referenced again soon → keep most recently accessed data
items closer to the processor
Miss
Adding set of tags fields into cache: each block in cache has a tag
Methods for mapping: Direct mapping, Fully associative mapping, N-way set associative mapping
Ex 1: a processor has a base CPI of 2, an instruction cache miss rate of 2%, a data cache miss rate of 4%, and a miss penalty of 100 cycles.
Determine how much faster that processor would run with a perfect
cache that never missed. Assume the frequency of all loads and
stores is 36%.
Memory-stall cycles per instruction = 2% × 100 + 36% × 4% × 100 = 2 + 1.44 = 3.44
CPU time with stalls / CPU time with perfect cache = CPI_stall / CPI_perfect = (2 + 3.44) / 2 = 2.72
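The stall arithmetic can be packaged as a small helper (a sketch following the same formula):

```python
def cpi_with_stalls(base_cpi, i_miss_rate, d_miss_rate, load_store_freq, miss_penalty):
    """Effective CPI including instruction- and data-cache miss stalls."""
    i_stalls = i_miss_rate * miss_penalty                    # every instruction is fetched
    d_stalls = load_store_freq * d_miss_rate * miss_penalty  # only loads/stores touch data memory
    return base_cpi + i_stalls + d_stalls

cpi = cpi_with_stalls(2, 0.02, 0.04, 0.36, 100)
print(cpi, cpi / 2)  # 5.44 effective CPI, 2.72x slower than a perfect cache
```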
Ex 2:
How many total bits are required for a direct-mapped cache with 16 KiB
of data and 1-word blocks, assuming a 32-bit address?
Blocks = 16 KiB / 4 B per word = 4096 = 2^12 → 12 index bits, 2 byte-offset bits, tag = 32 − 12 − 2 = 18 bits
Data bits = 4096 blocks × 32 bits/block = 131,072
Tag bits = 4096 blocks × 18 bits/block = 73,728
Valid bits = 4096 × 1 bit/block = 4,096
Total bits = Data bits + Tag bits + Valid bits = 208,896 bits ≈ 25.5 KiB
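The same cache-sizing arithmetic generalizes to a short function (a sketch, assuming one valid bit per block and power-of-two sizes):

```python
def direct_mapped_cache_bits(data_kib, block_words, addr_bits=32):
    """Total storage bits for a direct-mapped cache: data + tag + valid."""
    blocks = data_kib * 1024 // (block_words * 4)          # 4 bytes per word
    index_bits = blocks.bit_length() - 1                   # log2(#blocks)
    offset_bits = (block_words * 4).bit_length() - 1       # log2(bytes per block)
    tag_bits = addr_bits - index_bits - offset_bits
    return blocks * (block_words * 32 + tag_bits + 1)      # data + tag + valid per block

print(direct_mapped_cache_bits(16, 1))  # 208896 bits, matching the example above
```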
Capacity
Conflict
Larger size
Increases Conflict Misses: The larger the block size, the
more data is stored in each cache block, meaning that
different memory addresses are more likely to share the
same cache block. This results in more evictions of data
that could be useful, increasing the conflict miss rate.
Larger size also increases the miss penalty:
When you increase the block size, even though you might
reduce the number of compulsory misses, you increase the
miss penalty because more data needs to be fetched from
memory when a cache miss occurs.
→ Bigger blocks miss less often, but when they do miss, the miss penalty is higher
Direct mapped
Each memory block is mapped to exactly one block in the cache
The tag field: associated with each cache block that contains the
address information (the upper portion of the address) required to
identify the block
Step 1: The cache controller takes the memory address and applies the
modulo operation (block address mod number of cache blocks) to determine which cache block to use based on the
index.
Step 2: The cache block at that index is checked to see if the tag
matches the tag stored in the cache block. If they match, it's a cache
hit.
Step 3: If the tag does not match or the valid bit is 0, a cache miss
occurs, and the data is fetched from main memory.
Step 4: The data fetched from memory is stored in the cache block, and
the valid bit is set to 1.
Disadvantage
Cache conflicts: Since many memory blocks can map to the same
cache block (based on the modulo index), there can be frequent
cache misses if those memory blocks are used at the same time.
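The steps above, including the conflict behavior, can be modeled in a few lines (a minimal sketch with made-up sizes; real caches also store the data):

```python
class DirectMappedCache:
    """Minimal direct-mapped cache model: index = block address mod #blocks."""
    def __init__(self, num_blocks, block_size=4):
        self.num_blocks = num_blocks
        self.block_size = block_size       # bytes per block
        self.valid = [False] * num_blocks
        self.tags = [None] * num_blocks

    def access(self, addr):
        block_addr = addr // self.block_size
        index = block_addr % self.num_blocks        # Step 1: modulo selects the block
        tag = block_addr // self.num_blocks
        if self.valid[index] and self.tags[index] == tag:
            return "hit"                             # Step 2: tag match and valid
        self.valid[index], self.tags[index] = True, tag  # Steps 3-4: fetch and fill
        return "miss"

cache = DirectMappedCache(num_blocks=8)
# Addresses 0 and 32 map to the same index, so they evict each other (conflict)
print([cache.access(a) for a in (0, 0, 32, 0)])  # ['miss', 'hit', 'miss', 'miss']
```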
Set associative
Four-way set associative cache
→ Still 1K words
Block replacement
Design considerations for L1 and L2 caches are very different
Explain
For the L2 cache, hit time is less important than miss rate
Given a processor with a base CPI of 1.0 and clock rate of 4 GHz. Main
memory access time is 100 ns.
A new L2 is added
without L2: miss penalty = 100 ns × 4 GHz = 400 cycles → stall CPI = 2% × 400 = 8 → total CPI = 1 + 8 = 9
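The effect of adding an L2 can be sketched with the same kind of calculation. The L2 access time of 5 ns and global miss rate of 0.5% below are hypothetical numbers for illustration, not given in these notes:

```python
def effective_cpi(base_cpi, clock_ghz, l1_miss_rate, mem_ns, l2_ns=None, global_miss_rate=None):
    """Effective CPI with an optional L2 between the L1 and main memory."""
    mem_penalty = mem_ns * clock_ghz                 # ns * cycles-per-ns = cycles
    if l2_ns is None:
        return base_cpi + l1_miss_rate * mem_penalty
    l2_penalty = l2_ns * clock_ghz
    # L1 misses pay the L2 penalty; only global misses go all the way to memory
    return base_cpi + l1_miss_rate * l2_penalty + global_miss_rate * mem_penalty

print(effective_cpi(1.0, 4, 0.02, 100))            # 9.0 without L2, as above
print(effective_cpi(1.0, 4, 0.02, 100, 5, 0.005))  # 3.4 with the assumed L2
```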
Handling cache hits
Read hits (I$ and D$).
When there is a read hit, it means that the data or instruction that the
CPU needs is already available in the cache. There is no need to
access the next level of memory.
always write the data into both the cache block and the next level in
the memory hierarchy (write-through)
writes run at the speed of the next level in the memory hierarchy –
so slow! – or can use a write buffer and stall only if the write buffer
is full
write the data only into the cache block (write-back the cache
block to the next level in the memory hierarchy when that cache
block is “evicted” - replaced)
need a dirty bit for each data cache block to tell if it needs to be
written back to memory when it is evicted – can use a write buffer
to help “buffer” write-backs of dirty blocks.
Write-through
Write through: every time there is a write hit in the data cache (D$), the
data is written to both the cache and the next level of memory. This
ensures that the cache and memory are always consistent.
Write-back
The data is only written to the cache, and the cache block is marked as
"dirty."
The dirty bit is used to track whether a block of data in the cache has
been modified but not yet written back to the next level of memory.
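The dirty-bit mechanism can be modeled in a few lines (an illustrative sketch of one write-back block, not a full cache):

```python
class WriteBackBlock:
    """One write-back cache block: writes set the dirty bit; eviction
    writes the block to the next level only if it is dirty."""
    def __init__(self):
        self.dirty = False
        self.data = None

    def write(self, data):
        self.data = data
        self.dirty = True             # modified in cache, not yet in memory

    def evict(self, memory, addr):
        if self.dirty:
            memory[addr] = self.data  # write back only dirty blocks
        self.dirty = False

memory = {}
blk = WriteBackBlock()
blk.write(42)
blk.evict(memory, 0x100)
print(memory)  # {256: 42} -- the dirty block was written back on eviction
```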
→ How many I/O operations can we do per unit time?
Advantages
Disadvantages
The maximum bus speed is largely limited by:
Interrupt
Advantages:
The handler must indicate the I/O device causing the interrupt, save the
necessary information prior to servicing the interrupt, and
resume normal processing after servicing the interrupt
RISC-V interrupt
DMA
.text: 0x00400000
.data: 0x10010000