
💻 Computer architecture
Chap 1: Intro
Performance evaluation: Response time

CPU performance: the time the CPU actually spends executing the user program.

Performance ratio: Performance_X / Performance_Y = Execution time_Y / Execution time_X

CPU time = CPU clock cycles for a program × Clock cycle time

→ Increase performance by reducing either the length of the clock cycle or the number of clock cycles required for a program

Clock rate (clock cycles per second in MHz or GHz)

Clock cycle = duration of a cycle

💡 Clock cycle time = 1 / Clock rate

💡 CPU Time = Instruction Count × CPI × Clock Cycle Time = # CPU Clock Cycles / Clock Rate
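As a quick illustration, a minimal C sketch of this performance equation (all numbers are assumed, not from the note):

```c
#include <stdio.h>

/* CPU time = Instruction count x CPI / Clock rate */
int main(void) {
    double instruction_count = 2.0e9; /* assumed: 2 billion instructions */
    double cpi = 1.5;                 /* assumed: average clock cycles per instruction */
    double clock_rate_hz = 3.0e9;     /* assumed: 3 GHz clock */

    double cpu_time = instruction_count * cpi / clock_rate_hz;
    printf("CPU time = %.2f s\n", cpu_time); /* 2e9 * 1.5 / 3e9 = 1.00 s */
    return 0;
}
```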

Clock cycles per instruction (CPI)

The average number of clock cycles per instruction

CPU clock cycles for a program = # Instructions for a program × Average clock cycles per instruction

Performance equation

The clock rate: CPU specification

CPI: varies by instruction type and ISA implementation

Instruction count: measured using profilers/simulators

Dynamic instruction count

Each “for” consists of two instructions: increment index, check exit condition

Improve performance

Shorter clock cycle = faster clock rate → latest CPU technology

Smaller CPI → optimizing Instruction Set Architecture

Smaller instruction count → optimizing algorithm and compiler

XOR: same = 0, diff = 1


Chap 2: Computer System and
Interconnection
Computer Components
Detailed computer organization

CPU (Central Processing Unit)

Control Unit: Fetches and interprets instructions; controls other components to execute operations.

Datapath: Performs arithmetic and logic operations via the ALU (Arithmetic Logic Unit).

Registers: Small, high-speed storage for instructions and intermediate data during execution.

Memory

Stores program instructions and actively used data.

Organized as an array of memory cells, each with a unique address and holding one byte of data.

Data storage specifics:

8-bit integers = 1 cell, 32-bit = 4 cells

Array requires consecutive cells

Input/Output

Interfacing computer with physical world/environment

Input devices: Mouse, keyboard, webcam.

Output devices: Monitor, printer, speakers.

Storage: HDDs, SSDs, USB drives.

Communication: Wi-Fi, Ethernet, Bluetooth.

Link

The fabric that connects all components

Huge number of connections; requires very good design so that all components function properly

Computer functions
Executing Programs

Programs = set of instructions in binary format (Opcode + Operands).

Instruction Cycle: the processing required for a single instruction
execution

Fetch: Control unit fetches the instruction from memory, guided by the Program Counter (PC).

At the beginning of each instruction cycle, the processor fetches an instruction from memory.

The Program Counter (PC) holds the address of the instruction to be fetched.

The processor increments the PC after each instruction fetch so that the PC points to the next instruction in sequence.

The fetched instruction is loaded into the instruction register (IR).

Execution: Control unit decodes the instruction, then “tells” the datapath and other components to perform the required action.

The instruction (fetched and stored in the IR) is decoded to get the operation, the location of the input data (source operands), and the location to store the output data (destination operand).

Instruction format: 16 bits, 4 bits for the Opcode

Internal CPU registers:

PC: address of the instruction to be fetched

IR: instruction being executed

AC (Accumulator): temporary storage
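A minimal C sketch of this fetch–execute cycle for the hypothetical machine above (16-bit instructions, 4-bit opcode, 12-bit address field; the opcode values and memory size are illustrative assumptions, not a real ISA):

```c
#include <stdint.h>

#define MEM_SIZE 4096u              /* assumed: 12-bit address space */
static uint16_t memory[MEM_SIZE];

/* Hypothetical opcodes, for illustration only */
enum { OP_LOAD = 0x1, OP_STORE = 0x2, OP_ADD = 0x5, OP_HALT = 0xF };

static void run(void) {
    uint16_t pc = 0;                /* PC: address of the next instruction */
    uint16_t ir;                    /* IR: instruction being executed */
    uint16_t ac = 0;                /* AC: accumulator, temporary storage */

    for (;;) {
        ir = memory[pc & 0x0FFF];          /* fetch the instruction at PC */
        pc++;                              /* PC now points to the next instruction */
        uint16_t opcode  = ir >> 12;       /* upper 4 bits */
        uint16_t address = ir & 0x0FFF;    /* lower 12 bits */

        switch (opcode) {                  /* decode, then execute */
        case OP_LOAD:  ac = memory[address];  break;
        case OP_STORE: memory[address] = ac;  break;
        case OP_ADD:   ac += memory[address]; break;
        case OP_HALT:
        default:       return;             /* stop on halt or unknown opcode */
        }
    }
}

int main(void) {
    memory[0] = (uint16_t)(OP_HALT << 12); /* tiny program: just halt */
    run();
    return 0;
}
```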

Interrupts

Mechanism to handle tasks when external components need immediate CPU attention.

Servicing an interrupt: the processor temporarily switches from the current program to execute a different (rather short) program, before continuing the original program.

Interrupt Handling:

Temporarily pauses the current program.

Executes the interrupt service routine (ISR)

Determine the nature of the interrupt: its source and reason.

Perform corresponding operation.

Return control to the interrupted program.

Resumes the original program after handling the interrupt.

Interrupt cycle

CPU saves context of current program (current value of PC).

Address of interrupt handler is loaded to PC.

CPU continues with new instruction cycles, with new PC.

At the end of the interrupt handler, the context (including the PC value) is restored and the CPU returns to the interrupted program.

Sources of Interrupts:

Software (exceptions like division by zero).

Timer (for scheduling tasks).

I/O devices (e.g., keyboard input).

Hardware failures.

Multiple Interrupts:

Managed using priorities; nested interrupts are possible.

System Interconnection
CPU & I/O → CPU-controlled data transfer

I/O & Memory: data is transferred between memory and I/O under the control of special controllers called DMACs (Direct Memory Access Controllers).

Chap 3: Instruction Set Architecture
Core Components of RISC-V ISA
RISC-V Operands

Source operand: provides input data. Destination operand: stores the result of the operation.

Types:

Registers: Fast storage inside the CPU (32 registers in RV32I, each
32-bit).

Memory: Slower but larger storage for variables, arrays, and data
structures.

Immediate Values: Fixed constants encoded directly in instructions for efficiency.

Data Types: Byte (8 bits), Halfword (16 bits), Word (32 bits), and
Doubleword (64 bits). RV32 registers hold 4-byte words. Each register
has a unique 5-bit address.

Why only 32 registers, not more? → Smaller is faster!

Registers
Each register has a unique 5-bit address.

Data processing done on registers inside CPU

Register Operations

Advantages

Faster than memory due to direct access.

Fewer registers (32) ensure speed and simplicity.

Usage

Arithmetic and logic operations are performed on registers.

Temporary values and frequently used data are stored in registers.

Memory
Memory Operations

Memory operands are stored in main memory, which is 100 to 500 times slower than the register file.

High-level language programs use memory operands: variables, arrays and strings, composite data structures.

RISC-V memory organization

Endianness:

RISC-V uses Little Endian (LSB at the smallest address).
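A common C idiom to observe byte order directly (on a little-endian machine such as RV32, the least significant byte prints first):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t word = 0x11223344;
    uint8_t *bytes = (uint8_t *)&word;     /* view the word byte by byte */

    for (int i = 0; i < 4; i++)
        printf("address + %d: 0x%02X\n", i, bytes[i]);
    /* little endian prints 0x44, 0x33, 0x22, 0x11 (LSB at lowest address) */
    return 0;
}
```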

Immediate operand
Does not need to be stored in register file or memory. Value stored right in
instruction → faster

Instruction Formats
6 formats: R, I, S, B, U, J. Why not only one format? Or 20 formats? → Good design demands good compromises!

Wide Immediates:

Many operations need 32-bit immediates: loading 32-bit immediates to registers, loading addresses to registers → combine 2 instructions (in RV32I, lui supplies the upper 20 bits and addi the lower 12).

Stack structure
A region of memory operating on LIFO

The bottom of the stack is at the highest address

sp points to the top of the stack

Passing control

Passing data
Use registers: input argument (a0-a7), return value (a0)

more than 8 arguments → use stack

Caller pushes arguments onto the stack before calling the callee

Callee gets arguments from the stack

(Optional) Callee saves the return value to the stack

Memory Management and Stack
Stack

Procedure Calls:

Six steps: (1) place parameters where the procedure can access them, (2) transfer control to the procedure, (3) acquire the storage resources needed for the procedure, (4) perform the desired task, (5) place the result where the calling program can access it, (6) return control to the point of origin.

RISC-V memory configuration
Program text: stores the machine code of the program, declared with .text

Static data: data segment, declared with .data

Heap: for dynamic allocation

Stack: for local variables and dynamic allocation (LIFO)

Instruction Set Extensions


Additional operations and data types + Additional formats and customed
formats

Addressing
Immediate addressing: A mode of addressing where the operand is directly
specified within the instruction itself, rather than in a register or memory
location
→ “i” instructions (e.g., addi, slti)

Register addressing: An addressing mode where the operand is located in a processor register, and the instruction specifies the register directly.
→ Instructions that move data between registers or perform arithmetic operations using register contents

Base addressing: A mode of addressing where the effective address of the data is determined by adding a constant offset to the contents of a base register.
→ Useful in accessing array elements or variables within a data segment

PC-relative addressing: An addressing mode where the address of the data is determined by adding a constant value to the current value of the Program Counter (PC).
→ Commonly used in branch instructions

Chap 4: Computer Arithmetic

💡 Integer Representation (signed/unsigned)

Floating point number representation

Integer arithmetic operations (add, sub, mult, div)

Overflow

Integer Representation
Unsigned Binary Integers

Represent non-negative integers using n-bit binary numbers.

Range: 0 to 2^{n}−1.

Example: Using 32 bits, range = 0 to 4,294,967,295 = 0x0000 0000 to 0xFFFF FFFF

Signed Binary Integers

Represent integers (positive and negative) using n-bit binary numbers.

Range: −2^{n−1} to 2^{n−1}−1.

Example: Using 32 bits, range = −2,147,483,648 to 2,147,483,647 = 0x8000 0000 to 0x7FFF FFFF

Negation in Binary: calculate -x from x

Use 2's complement:

Flip all bits and add 1 to the least significant bit (LSB).

Example:

+2 = 0000 0010

−2 = 1111 1101 (1's complement) + 1 = 1111 1110

Sign Extension (8 bit to 16 bit)

Extending a signed integer to a larger bit size:

Replicate the sign bit to preserve the value.

Example:

–2: 1111 1110 ⇒ 1111 1111 1111 1110

2: 0000 0010 ⇒ 0000 0000 0000 0010
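Both ideas in a short C sketch (two's-complement negation, then sign extension via a widening cast):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    int8_t x = 2;

    /* Negation: flip all bits, then add 1 to the LSB */
    int8_t neg = (int8_t)(~x + 1);
    printf("-x = %d (0x%02X)\n", neg, (uint8_t)neg);          /* -2 (0xFE) */

    /* Sign extension 8 -> 16 bits: the cast replicates the sign bit */
    int16_t wide = neg;
    printf("extended = %d (0x%04X)\n", wide, (uint16_t)wide); /* -2 (0xFFFE) */
    return 0;
}
```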

Instruction to work with sign/unsigned

lb/lbu, lh/lhu

blt/bltu, bge/bgeu

slt/sltu, slti/sltiu

div/divu, rem/remu

Integer Arithmetic
Addition and Subtraction

Addition: Bit-by-bit operation with carry propagation.

Subtraction: Negate the second operand and add it to the first.

Carryout and Overflow

Carryout:

Occurs when the result produces a carry beyond the maximum bit
width.

Overflow:

Happens when the result of signed addition/subtraction exceeds the representable range. When adding operands with different signs, or when subtracting operands with the same sign, overflow can never occur.
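A C sketch of this detection rule (overflow in addition is only possible when the operands have the same sign and the sum's sign differs):

```c
#include <stdint.h>
#include <stdbool.h>

/* Signed-overflow check for a + b on 32-bit two's-complement integers. */
bool add_overflows(int32_t a, int32_t b) {
    uint32_t sum = (uint32_t)a + (uint32_t)b;   /* unsigned add has no UB */
    bool same_sign        = (a >= 0) == (b >= 0);
    bool sum_sign_differs = ((int32_t)sum >= 0) != (a >= 0);
    return same_sign && sum_sign_differs;       /* both must hold for overflow */
}
```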

Multiplexer

Making addition faster: infinite hardware

Multiply / Division

NOTE: counter = n (number of bits of the multiplicand or multiplier). Every shift right, counter -= 1. If counter = 0 → end.
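A C sketch of the sequential shift-and-add multiplier this note describes (the 64-bit product register holds the multiplier in its low half; one shift per iteration, n iterations in total):

```c
#include <stdint.h>

/* Sequential 32 x 32 -> 64-bit unsigned multiply. */
uint64_t shift_add_multiply(uint32_t multiplicand, uint32_t multiplier) {
    uint64_t product = multiplier;       /* low half starts as the multiplier */
    for (int counter = 32; counter > 0; counter--) {
        if (product & 1)                 /* test bit 0 of the multiplier */
            product += (uint64_t)multiplicand << 32;  /* add into high half */
        product >>= 1;                   /* shift product right; counter -= 1 */
    }
    return product;                      /* counter = 0 -> done */
}
```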

Floating point number: Sign, mantissa, and exponent
Ex: 2013.1228 = 2.0131228 * 10^3 = 2.0131228E+03

mantissa: 2.0131228

exponent: 03

In binary: X = ±1.xxxxx * 2^{yyyy}
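A C sketch that extracts the three fields from an IEEE 754 single-precision value (standard 1-bit sign, 8-bit biased exponent, 23-bit fraction layout):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float x = 2013.1228f;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);            /* reinterpret the float's bits */

    uint32_t sign     = bits >> 31;            /* 1 bit */
    uint32_t exponent = (bits >> 23) & 0xFF;   /* 8 bits, biased by 127 */
    uint32_t fraction = bits & 0x7FFFFF;       /* 23 fraction (mantissa) bits */

    printf("sign=%u exponent=%u (unbiased %d) fraction=0x%06X\n",
           sign, exponent, (int)exponent - 127, fraction);
    return 0;
}
```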

Chap 5: The Processor
CPU implementation (datapath, datapath with control, multiplexor)

Pipeline

Hazards & solving hazards

(except designing ALU & designing pipelined datapath)

Datapath
Def: the collection of functional units and registers within the CPU that are responsible for the manipulation and movement of data. It handles the processing of data during execution. Components of the datapath are:

register

ALU

Multiplexers

Memory units

Shifters

The datapath performs computations, moves data between registers, and interacts with memory for reading and writing data.

It executes the operations specified by instructions, like arithmetic calculations or data manipulation, based on the control signals from the control unit.

Control

Directs the operation of the datapath components by generating the appropriate control signals. Components:

Control signals

Instruction decoder

Program counter

⇒ The datapath handles the actual data processing (operations like arithmetic
or moving data between registers), while the control unit ensures the correct
sequencing and timing of operations.

Fetching instructions involves

Reading the instruction from the Instruction Memory

Updating the PC value to be the address of the next instruction in memory

Decoding instructions involves

Sending the fetched instruction’s opcode and function field bits to the control unit

The control unit sends appropriate control signals to other parts of the CPU to execute the operations corresponding to the instruction

What is ALU?

Arithmetic Logic Unit

critical component of CPU

performs arithmetic and logical operations

handles operations such as addition, subtraction, multiplication, and division, as well as logical operations like AND, OR, NOT, and XOR.

R-format (ALU instructions)

read reg operands rs1 and rs2

perform operation (opcode, funct7, and funct3) on values

store the result back into the register file (reg rd)

Executing Load and store (Memory instructions)

read register operands

Calculate address using 12-bit offset (Use ALU, but sign-extend offset)

store: read from the Register File, write to the Data Memory

load: read from the Data Memory, write to the Register File

Combining ALU and Memory instruction

Executing Branch instruction (beq)

read register operands

compare the operands (subtract, check zero ALU output)

compute the branch target address: add the sign-extended offset, shifted left 1 bit, to the PC

Instruction times (critical paths)

Instruction fetch and data access (200ps)

ALU operation and adders (200ps)

Register File access (reads or writes) (100ps)

Single cycle: disadvantages and advantages

The clock cycle must be timed to accommodate the slowest instruction → uses the clock cycle inefficiently.

Some functional units must be duplicated (they cannot be shared during a clock cycle) → wasteful of area

But it’s simple and easy to understand

Make the computer faster:

Divide instruction cycles into smaller cycles

Execute instructions in parallel

Pipelining: start fetching and executing the next instruction before the current one has completed. This is called overlapping execution.

Laundry work

assume 4 stages: washing, drying, ironing, folding (0.5 hours each)

With n loads:

T_norm = 2n hours

T_pipeline = (3 + n)/2 hours

When n → ∞, T_norm → 4 × T_pipeline (the speedup approaches the number of stages)
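A quick numerical check of that limit in C:

```c
#include <stdio.h>

int main(void) {
    for (int n = 4; n <= 4096; n *= 8) {
        double t_norm = 2.0 * n;           /* 4 stages x 0.5 h, done serially */
        double t_pipe = (3.0 + n) / 2.0;   /* 3 cycles to fill, then 1 load per 0.5 h */
        printf("n = %4d  speedup = %.2f\n", n, t_norm / t_pipe);
    }
    /* speedup approaches 4 (the number of stages) as n grows */
    return 0;
}
```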

RISC-V pipeline

Five stages, one step per stage

IF > ID > EX > MEM > WB

Instruction Fetch from Memory and Update PC

Instruction Decode and Register Read

Execute R-type or calculate memory address

Read/write the data from/to the Data Memory

Write the result data into the register file

Execution time for a single instruction is always 5 cycles (regardless of the operation)

Pipeline performance

All modern processors are pipelined

Under ideal conditions and a large number of instructions → a five-stage pipeline is nearly 5 times faster because the clock cycle is five times shorter

Improves throughput (total amount of work done in given time)

Instruction latency is NOT reduced

In reality, speedup is less because of imbalance and overhead

Single cycle vs. Pipeline???

Data hazards
Data hazards can happen with la/li pseudo-instructions (they may expand to multiple dependent instructions)

Pipeline can cause troubles

Hazards: situations that prevent starting the next instruction

Structural: attempt to use the same resource by 2 different instructions at the same time

Data: attempt to use data before it is ready
→ An instruction’s source operand(s) are produced by a prior instruction still in the pipeline

Control: make a decision about program control flow before the condition has been evaluated and the new PC target address calculated

Structural hazards

Conflict for use of a resource

In RISC-V pipeline with a single memory

Load/store requires data access

Example: a CPU has only memory unit. Two ins need to access
memory at the same time (1 load and 1 store)

However, a single memory unit cannot handle multiple operations simultaneously.

Hence, pipelined datapaths require separate instruction/data memories.

Fix register file access hazard by doing reads in the second half of
the cycle and writes in the first half.

Data hazards: CPU must wait until data becomes valid

Pipeline stalls introduce a delay in the pipeline, effectively pausing the flow of instructions for one or more cycles until the required resource or data becomes available. This leads to “bubble” instructions (empty slots that cause the pipeline to advance without performing any useful work).

Solve data hazards with forwarding

Use the result as soon as it is computed. Don’t wait until it’s stored in the register.

Forward from EX to EX

Solving Load-Use data hazard

Forward from MEM (output) to EX (input)

Can’t always avoid stalls by forwarding

Code scheduling to avoid stalls

Control hazards

Fetching next instruction depends on branch outcome

Pipeline can’t always fetch correct instruction

In RISC-V pipeline

Need to compare registers and compute target early in pipeline

Add hardware to do it in ID stage

Solving control hazards

Delayed branch

Compute target earlier

Branch prediction

Chap 6: Memory Hierarchy


Locality principle
Memory technology

1. Static RAM (SRAM)

0.5ns - 2.5 ns, $500 - $1000 per GB

2. Dynamic RAM (DRAM)

50ns – 70ns, $10 – $20 per GB

3. Flash memory

5,000ns – 50,000ns, $0.75 – $1 per GB

4. Magnetic memory

5,000,000ns – 20,000,000ns, $0.05 – $0.1 per GB

→ Large memories are slow
→ Fast memories are small and expensive

Memory Hierarchy

Reg File > Instr cache & Data cache > SRAM > DRAM > Secondary
Memory (Disk)

Locality principle

Example: data memory at the locations of temp and x is accessed multiple times; instruction memory holding the two for loops is used repeatedly.

```c
int x[1000], temp, i, j;
for (i = 0; i < 999; i++)
    for (j = i + 1; j < 1000; j++)
        if (x[i] < x[j]) {
            temp = x[i];
            x[i] = x[j];
            x[j] = temp;
        }
```

Temporal Locality (locality in time)

If a memory location is referenced then it will tend to be
referenced again soon → keep most recently accessed data
items closer to the processor

Spatial Locality (locality in space)

If a memory location is referenced, the locations with nearby addresses will tend to be referenced soon → move blocks consisting of contiguous words closer to the processor

Hierarchical memory access

Data are stored in multiple levels

Data are transferred in units of blocks (of multiple words) between levels, through the hierarchy.

Frequently used data are stored closer to processor

If accessed data is present in upper level:

Hit: access satisfied by upper level

Hit ratio: hits/accesses

If accessed data is absent

Miss

Time taken: miss penalty

Miss ratio: misses/accesses


Cache
CPU needs to access a data item in memory

How does CPU know if the data item is in the cache?

Adding a set of tag fields into the cache: each block in the cache has a tag

The tags contain address information to identify whether a word in the cache corresponds to the requested one in memory.

If it is, how does CPU find it?

Depends on how a block in memory is mapped into a block (line) in the cache

Methods for mapping: Direct mapping, Fully associative mapping, N-
way set associative mapping

Cache performance exercise

Tag bits = address size - log_2(# blocks) - log_2(block size in words * 4 bytes/word)

Total bits = data bits + tag bits + valid bits
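A small C helper for these bit-width calculations (the sizes are the ones used in Ex 2 below):

```c
#include <stdio.h>

/* Integer log2 for power-of-two inputs */
static int log2i(unsigned x) { int n = 0; while (x >>= 1) n++; return n; }

int main(void) {
    int address_bits     = 32;
    int num_blocks       = 4096;   /* 16 KiB of data / 4-byte blocks */
    int block_size_words = 1;

    int offset_bits = log2i(block_size_words * 4u); /* byte offset within a block */
    int index_bits  = log2i(num_blocks);
    int tag_bits    = address_bits - index_bits - offset_bits;

    printf("offset = %d, index = %d, tag = %d\n", offset_bits, index_bits, tag_bits);
    /* prints: offset = 2, index = 12, tag = 18 */
    return 0;
}
```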

Ex 1:

Given a RISC-V CPU running a program with

miss rate of instruction cache is 2%

miss rate of data cache is 4%

processor has CPI of 2 without any memory stalls

the miss penalty is 100 cycles for all misses

Determine how much faster that processor would run with a perfect
cache that never missed. Assume the frequency of all loads and
stores is 36%.

Given instruction count A

Instruction miss cycles = A * 2% * 100 = 2A (cycles)

Data miss cycles = A * 36% * 4% * 100 = 1.44A (cycles)

→ Total memory-stall cycles = 3.44A (cycles)

CPU time = IC * CPI * Clock cycle

CPU time with stalls / CPU time with perfect cache = CPI stall / CPI
perfect = (2 + 3.44) / 2 = 2.72

If the CPU has a faster CPI of 1 → ratio = (1 + 3.44) / 1 = 4.44
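The same arithmetic as a C check:

```c
#include <stdio.h>

int main(void) {
    double cpi_base     = 2.0;
    double miss_penalty = 100.0;
    double i_miss_rate  = 0.02, d_miss_rate = 0.04, mem_op_freq = 0.36;

    /* Memory-stall cycles per instruction */
    double stalls = i_miss_rate * miss_penalty                /* 2.00 */
                  + mem_op_freq * d_miss_rate * miss_penalty; /* 1.44 */

    double speedup = (cpi_base + stalls) / cpi_base;          /* 5.44 / 2 */
    printf("stalls/instruction = %.2f, speedup with perfect cache = %.2f\n",
           stalls, speedup);                                  /* 3.44, 2.72 */
    return 0;
}
```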

Ex 2:
How many total bits are required for a direct-mapped cache with 16 KiB
of data and 1-word blocks, assuming a 32-bit address?

Data bits = 4096 blocks * 4 bytes/block * 8 bits/byte = 131,072 bits
Tag bits = 4096 blocks * 18 bits/block = 73,728 bits
Valid bits = 4096 * 1 bit/block = 4,096 bits
Total bits = Data bits + Tag bits + Valid bits = 208,896 bits ≈ 25.5 KiB

Miss rate vs Block size vs Cache size

Source of Cache misses

Compulsory (cold start)

We cannot do much on this

Solution: increase block size (but this also increases miss penalty)

Capacity

Cache cannot contain all blocks accessed by the program

Solution: increase cache size (may increase access time)

Conflict

Multiple memory locations mapped to the same cache location

Solution 1: increase cache size

Solution 2: increase associativity (may increase access time)

Miss rate vs Block size

Larger size

Larger block sizes reduce compulsory misses by bringing in more data at once, which is beneficial when there is spatial locality (i.e., nearby data is likely to be accessed soon).

Increases Capacity Misses: when the block size increases, there is less cache space available for other data. If the program needs to access more data than the cache can hold, it will evict data prematurely, leading to increased capacity misses.

Increases Conflict Misses: The larger the block size, the
more data is stored in each cache block, meaning that
different memory addresses are more likely to share the
same cache block. This results in more evictions of data
that could be useful, increasing the conflict miss rate.

Smaller size

Increases Compulsory Misses: Smaller block sizes increase compulsory misses, as only a small amount of data is fetched with each miss.

Decreases Capacity Misses: With smaller blocks, more blocks can fit into the same-sized cache. This allows the cache to store more distinct pieces of data, reducing the chance of eviction and lowering capacity misses.

Reduces Conflict Misses: With smaller blocks, the cache has more blocks, which means there are more places to store data. This reduces the chances of two memory addresses colliding in the same cache block and helps in reducing conflict misses.

Miss penalty for big block size

When you increase the block size, even though you might
reduce the number of compulsory misses, you increase the
miss penalty because more data needs to be fetched from
memory when a cache miss occurs.

Fetching larger blocks takes longer (in terms of clock cycles) because you're fetching more data from memory for each cache miss. So, even if the miss rate decreases, the time it takes to bring the data into the cache (miss penalty) increases.

→ Big blocks miss less, but when they miss, the miss penalty is higher

Reducing Cache Miss Rates

→ Let the cache block hold more than one word (spatial locality)
→ Allow more flexible block placement

Direct mapped cache: a memory block maps to exactly one cache
block

Fully associative cache: allows a memory block to be mapped to any cache block

A compromise is to divide the cache into sets, each of which consists of n “ways” (n-way set associative).
→ A memory block maps to a unique set (specified by the index field) and can be placed in any way of that set (so there are n choices)

Direct mapped
Each memory block is mapped to exactly one block in the cache

lots of lower level blocks must share blocks in the cache

💡 Address mapping: (block address) modulo (# of blocks in the cache)

The tag field: associated with each cache block that contains the
address information (the upper portion of the address) required to
identify the block

The valid bit: indicates whether the block contains valid data

When a memory address is provided (e.g., 0xA34F25), the address is divided into:

Tag: used for comparison

Index: used to select the cache block

Block Offset: The lower portion, indicating the specific word/byte within the block

step 1: The cache controller takes the memory address and applies the
modulo operation to determine which cache block to use based on the
index.

step 2: The cache block at that index is checked to see if the tag
matches the tag stored in the cache block. If they match, it's a cache
hit.

step 3: If the tag does not match or the valid bit is 0 , a cache miss
occurs, and the data is fetched from main memory.

step 4: The data fetched from memory is stored in the cache block, and
the valid bit is set to 1 .
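A minimal C sketch of this four-step lookup for a direct-mapped cache (the sizes match Ex 2 above; the memory helper and data layout are illustrative assumptions):

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_BLOCKS  4096               /* assumed cache size (1-word blocks) */
#define OFFSET_BITS 2                  /* 4-byte blocks -> 2 byte-offset bits */
#define INDEX_BITS  12                 /* log2(NUM_BLOCKS) */

struct cache_line { bool valid; uint32_t tag; uint32_t data; };
static struct cache_line cache[NUM_BLOCKS];

/* Stand-in for a main-memory read (hypothetical helper) */
static uint32_t read_main_memory(uint32_t address) { return address * 2u; }

uint32_t cache_read(uint32_t address) {
    /* Step 1: index = (block address) modulo (# blocks), via the index field */
    uint32_t index = (address >> OFFSET_BITS) & (NUM_BLOCKS - 1);
    uint32_t tag   = address >> (OFFSET_BITS + INDEX_BITS);

    struct cache_line *line = &cache[index];
    if (line->valid && line->tag == tag)       /* Step 2: tag match + valid -> hit */
        return line->data;

    uint32_t word = read_main_memory(address); /* Step 3: miss -> fetch from memory */
    line->tag   = tag;                         /* Step 4: fill the block, set valid */
    line->data  = word;
    line->valid = true;
    return word;
}
```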

Disadvantage

Cache conflicts: Since many memory blocks can map to the same
cache block (based on the modulo index), there can be frequent
cache misses if those memory blocks are used at the same time.

Limited flexibility: The fixed mapping of memory blocks to cache blocks reduces the flexibility of the cache, and some blocks might evict useful data too often.

Set associative
Four-way set associative cache

→ Still 1K words

Range of associative caches

Benefits of Set Associative Caches

The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation

Block replacement

Associative cache: one of multiple blocks in the set must be selected

LRU scheme: the (least recently used) block that has been unused the longest time is selected for replacement.

A mechanism for tracking relative last-time-used is necessary.

Reducing the miss penalty


→ Use multiple levels of caches
→ Normally a unified L2 cache (holding both instructions and data, for each
core) and a unified L3 cache shared for all cores.

Multi-level cache design

Design considerations for L1 and L2 caches are very different

1. Primary cache should focus on minimizing hit time in support of a shorter clock cycle
→ Smaller with smaller block sizes

2. Secondary cache(s) should focus on reducing miss rate to reduce the penalty of long main memory access times
→ Larger with larger block sizes
→ Higher levels of associativity

Explain

The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache – so it can be smaller but have a higher miss rate

For the L2 cache, hit time is less important than miss rate

Multi-level cache design - example

Given a processor with a base CPI of 1.0 and clock rate of 4 GHz. Main
memory access time is 100 ns.

All data references are hit in primary cache (L1).

Instruction miss rate of 2% in primary cache (L1).

A new L2 is added

Access time from L1 to L2 is 5 ns.

Instruction miss rate (to main memory) reduced to 0.5%.

What is speed-up after adding the L2?

5ns = 5 * 10^-9 s; at 4 GHz that is 5 * 10^-9 * 4 * 10^9 = 20 cycles. 100ns = 400 cycles.

Without L2 → stall cycles per instruction = 2% * 400 = 8

L2: I_stall = I_stall_1 + I_stall_2 = 2% * 20 + 0.5% * 400 = 2.4

Speedup = (1 + 8) / (1 + 2.4) = 9 / 3.4 ≈ 2.6
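The same computation in C:

```c
#include <stdio.h>

int main(void) {
    double base_cpi = 1.0, clock_ghz = 4.0;

    double mem_cycles = 100.0 * clock_ghz;  /* 100 ns -> 400 cycles */
    double l2_cycles  = 5.0 * clock_ghz;    /* 5 ns   -> 20 cycles  */

    double cpi_no_l2   = base_cpi + 0.02 * mem_cycles;        /* 1 + 8   */
    double cpi_with_l2 = base_cpi + 0.02 * l2_cycles
                       + 0.005 * mem_cycles;                  /* 1 + 2.4 */

    printf("speedup = %.2f\n", cpi_no_l2 / cpi_with_l2);      /* 9 / 3.4 ~ 2.6 */
    return 0;
}
```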

Handling cache hits
Read hits (I$ and D$).

This is what we want.

When there is a read hit, it means that the data or instruction that the
CPU needs is already available in the cache. There is no need to
access the next level of memory.

Write hits (D$ only)

require the cache and memory to be consistent

always write the data into both the cache block and the next level in
the memory hierarchy (write-through)

writes run at the speed of the next level in the memory hierarchy –
so slow! – or can use a write buffer and stall only if the write buffer
is full

allow cache and memory to be inconsistent

write the data only into the cache block (write-back the cache
block to the next level in the memory hierarchy when that cache
block is “evicted” - replaced)

need a dirty bit for each data cache block to tell if it needs to be
written back to memory when it is evicted – can use a write buffer
to help “buffer” write-backs of dirty blocks.

Write-through

Write through: every time there is a write hit in the data cache (D$), the
data is written to both the cache and the next level of memory. This
ensures that the cache and memory are always consistent.

Write-through can be slow because it requires the data to be written to both the cache and the memory. We fix this by using a write buffer, allowing the CPU to continue executing instructions while the write to memory happens asynchronously in the background. However, if the write buffer is full, the CPU will have to stall, waiting for space to become available in the buffer.

Write-back

The data is only written to the cache, and the cache block is marked as “dirty.”

The dirty bit is used to track whether a block of data in the cache has
been modified but not yet written back to the next level of memory.

Write-back is faster because it reduces the number of writes to the memory hierarchy. The cache only needs to update memory when blocks are evicted, which occurs less frequently than write-through operations.

This also requires a write buffer.

Chap 7: I/O system


Characteristics of I/O system and devices
Computers need an interface to communicate with the outside world

Important metrics for an I/O system: performance, expandability, dependability, cost, size, weight, security, etc.

Typical I/O system

I/O devices: behavior, partner, and data rate

I/O performance measures


I/O bandwidth (throughput): amount of information that can be
input/output and communicated across an interconnect between the
processor/memory and I/O device per unit time
→ How much data can we move through the system in a certain time?

→ How many I/O operations can we do per unit time?

I/O response time (latency): the total elapsed time to accomplish an input or output operation

Hardware operates in 2 states:

1. Service accomplishment: the service is delivered as specified.

2. Service interruption: the delivered service is different from the specified service.

Change from (1) to (2) = failures, (2) to (1) = restorations

Permanent failure: service is stopped permanently

Intermittent failure: system oscillates between the two states

Reliability and Availability

Mean Time To Failure (MTTF): average time of normal operation between two consecutive failures.

Mean Time To Repair (MTTR): average time of service interruption when a failure occurs.

Reliability: measured by MTTF

Availability = MTTF / (MTTF + MTTR)
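A one-line computation in C (the MTTF/MTTR values are made-up examples):

```c
#include <stdio.h>

int main(void) {
    double mttf_hours = 10000.0;  /* assumed mean time to failure */
    double mttr_hours = 2.0;      /* assumed mean time to repair  */

    double availability = mttf_hours / (mttf_hours + mttr_hours);
    printf("availability = %.4f\n", availability);  /* 0.9998 */
    return 0;
}
```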


I/O system organization
A bus is a shared communication link (a single set of wires used to
connect multiple subsystems) that needs to support a range of devices
with widely varying latencies and data transfer rates

Advantages

Versatile – new devices can be added easily and can be moved between computer systems that use the same bus standard

Low cost – a single set of wires is shared in multiple ways

Disadvantages

Creates a communication bottleneck – bus bandwidth limits the maximum I/O throughput

The maximum bus speed is largely limited by:

The length of the bus

The number of devices on the bus

Methods for I/O operation and control


Polling

The processor periodically checks the status of an I/O device to determine its need for service. The processor is totally in control – but does all the work. Can waste a lot of processor time due to speed differences.

Interrupt

The I/O device issues an interrupt to indicate that it needs attention.

The processor detects and “serves” the interrupt by executing a handler (a.k.a. interrupt service routine).

Advantages:

Relieves the processor from having to continuously poll for an I/O event.

User program progress is only suspended during the actual transfer of I/O data to/from user memory space.

Disadvantage – special hardware is needed to indicate the I/O device causing the interrupt, to save the necessary information prior to servicing the interrupt, and to resume normal processing after servicing the interrupt.

RISC-V interrupt

DMA

.text: 0x00400000
.data: 0x10010000
