
Unit 3

1. Basic Non-Pipelined CPU Architecture


A non-pipelined CPU executes instructions sequentially. One instruction must complete all its stages before the next instruction can begin. Think of it like a single worker handling one task entirely before starting the next.

Core Components:

1. Control Unit (CU): The "brain" of the CPU. It fetches instructions from memory, decodes them, and generates control signals to direct the other components (ALU, registers, memory interfaces) on what to do and when.
2. Arithmetic Logic Unit (ALU): Performs arithmetic operations (addition, subtraction, multiplication, division) and logical operations (AND, OR, NOT, XOR, shifts). It takes operands from registers and returns the result to a register.
3. Registers: Small, extremely fast storage locations within the CPU. Used to hold data, instructions, addresses, and status information temporarily during execution. Key types include:
   ○ Program Counter (PC): Holds the memory address of the next instruction to be fetched.
   ○ Instruction Register (IR): Holds the instruction currently being decoded and executed.
   ○ Memory Address Register (MAR): Holds the address of the memory location to be accessed (read from or written to).
   ○ Memory Data Register (MDR) / Memory Buffer Register (MBR): Temporarily holds data being transferred to or from memory.
   ○ General Purpose Registers (GPRs): Used to hold operands and results for ALU operations. Accessible by the programmer (via assembly language).
   ○ Status Register / Flags Register: Holds status bits (flags) indicating results of operations (e.g., Zero flag, Carry flag, Overflow flag, Negative flag).
4. Internal Buses: Pathways connecting the different components (CU, ALU, Registers) within the CPU, allowing data and control signals to travel between them.

Operation: In a non-pipelined architecture, the CPU follows the Fetch-Decode-Execute cycle strictly sequentially for each instruction. If fetching takes 1 clock cycle, decoding 1, and executing 3, a single instruction takes 5 clock cycles. The next instruction only starts fetching after the previous one has fully completed execution. This leads to underutilization of CPU components (e.g., the fetch unit is idle during decode and execute).
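A minimal cycle-counting sketch of this sequential behaviour, using the stage latencies assumed in the example above (1 + 1 + 3 cycles):

```python
# Minimal sketch: cycle count for a strictly sequential (non-pipelined) CPU.
# Stage latencies are assumptions taken from the example in the text above.
FETCH_CYCLES = 1
DECODE_CYCLES = 1
EXECUTE_CYCLES = 3

def sequential_cycles(num_instructions: int) -> int:
    """Each instruction must finish all stages before the next one starts."""
    per_instruction = FETCH_CYCLES + DECODE_CYCLES + EXECUTE_CYCLES  # 5 cycles
    return num_instructions * per_instruction

print(sequential_cycles(1))    # 5 cycles for one instruction
print(sequential_cycles(100))  # 500 cycles for one hundred instructions
```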

2. Memory Hierarchy


CPUs operate much faster than main memory (RAM). Accessing RAM for every instruction and data operand would create a massive bottleneck. The Memory Hierarchy is a structure that uses multiple levels of memory with different speeds, sizes, and costs to bridge this gap.

Levels (Closest to CPU outwards):

1. Registers: Fastest, smallest, most expensive (part of the CPU). Hold currently active data/instructions. Access time ~1 CPU cycle.
2. Cache Memory (L1, L2, L3): Small, fast Static RAM (SRAM) located closer to the CPU (often on the same chip). Stores frequently accessed data and instructions from main memory.
   ○ L1 Cache: Smallest, fastest cache, often split into an instruction cache (L1i) and a data cache (L1d). Access time ~ a few CPU cycles.
   ○ L2 Cache: Larger and slower than L1. Access time ~ 10-20 cycles.
   ○ L3 Cache: Largest and slowest cache level, often shared by multiple CPU cores. Access time ~ 30-50 cycles.
3. Main Memory (RAM): Dynamic RAM (DRAM). Much larger than cache but significantly slower. Holds the currently running operating system and application programs/data. Access time ~ 100-200+ cycles.
4. Secondary Storage (Virtual Memory/Swap Space): Hard Disk Drives (HDDs) or Solid State Drives (SSDs). Largest capacity, slowest access time, cheapest per bit. Holds data and programs not currently in RAM. Used as an extension of RAM (virtual memory). Access time ~ milliseconds.
5. Tertiary Storage (Optional): Optical disks and magnetic tapes for backups and archival. Very slow.

Principle of Locality: The memory hierarchy works efficiently because programs tend to exhibit:

● Temporal Locality: If an item (instruction or data) is accessed, it is likely to be accessed again soon (loops, reuse of variables). Caching keeps recently used items close.
● Spatial Locality: If an item is accessed, items whose addresses are close by are likely to be accessed soon (sequential code execution, array processing). Caching fetches blocks of data around the requested item.

Goal: To provide the CPU with an average memory access time close to the cache speed, while offering the large capacity of main memory and secondary storage. The sketch below shows how this average is estimated.
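A minimal sketch of the standard average-memory-access-time (AMAT) estimate for a single cache level. The hit time, miss rate, and miss penalty used here are illustrative assumptions, not figures from the text:

```python
# Minimal sketch: average memory access time (AMAT) for one cache level.
# AMAT = hit_time + miss_rate * miss_penalty
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    return hit_time + miss_rate * miss_penalty

# Assumed values: cache hit in 2 cycles, 5% miss rate, 150-cycle penalty to reach DRAM.
print(amat(hit_time=2, miss_rate=0.05, miss_penalty=150))  # 9.5 cycles on average
```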

3. I/O Techniques


Input/Output (I/O) techniques manage the communication between the CPU/Memory system and external peripheral devices (keyboard, mouse, disk drives, network interfaces, printers, etc.).

1. Programmed I/O (PIO):
   ○ Mechanism: The CPU executes specific I/O instructions. It continuously checks (polls) the status register of the I/O device until it is ready for data transfer. The CPU is directly responsible for moving data between memory/registers and the I/O device buffer (a polling sketch appears after this list).
   ○ Pros: Simple to implement.
   ○ Cons: Very inefficient. The CPU wastes significant time waiting (polling) for the slow I/O device, unable to perform other tasks. Only suitable for very simple or slow devices.
2. Interrupt-Driven I/O:
   ○ Mechanism: The CPU initiates an I/O operation and then continues executing other tasks. When the I/O device is ready (e.g., data received, operation complete), it sends an interrupt signal to the CPU. The CPU suspends its current task, saves its state, executes an Interrupt Service Routine (ISR) to handle the data transfer, restores its state, and resumes the interrupted task.
   ○ Pros: Much more efficient than PIO, as the CPU does not wait idly.
   ○ Cons: Interrupt handling introduces overhead (saving/restoring state, context switching). Still involves the CPU in the actual data transfer.
3. Direct Memory Access (DMA):
   ○ Mechanism: A dedicated hardware controller, the DMA Controller (DMAC), manages the data transfer directly between the I/O device and main memory, without involving the CPU except at the beginning (to set up the transfer: source address, destination address, data count) and at the end (the DMAC sends an interrupt when done).
   ○ Pros: Most efficient for large data transfers. Frees the CPU almost entirely during the transfer and reduces CPU overhead significantly.
   ○ Cons: Requires a dedicated DMAC. Can lead to bus contention if the DMAC and CPU need the memory bus simultaneously (cycle stealing).
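A minimal sketch contrasting the busy-wait loop of programmed I/O with a DMA-style transfer as described above. The device-access helpers (status_register, read_data, dmac_setup) are hypothetical placeholders, not a real device API:

```python
# Minimal sketch: programmed I/O polling vs. a DMA-style transfer.
# status_register(), read_data(), and dmac_setup() are hypothetical placeholders.
READY = 1

def programmed_io_read(status_register, read_data, count):
    """CPU busy-waits on the device status bit and copies each word itself."""
    buffer = []
    for _ in range(count):
        while status_register() != READY:   # CPU is stuck polling here
            pass
        buffer.append(read_data())           # CPU moves the data itself
    return buffer

def dma_read(dmac_setup, count, dest_addr):
    """CPU only programs the DMA controller, then returns to other work.
    The DMAC raises an interrupt once all `count` words have been copied."""
    dmac_setup(source="device", destination=dest_addr, length=count)
    # ... CPU executes unrelated instructions until the completion interrupt ...
```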

4. CPU Architecture Types (Based on Operand Storage)


This classification refers to how the Instruction Set Architecture (ISA) specifies the operands for ALU instructions. A side-by-side example of the same statement in each style follows the list below.

1. Accumulator Architecture:
   ○ Concept: Uses a single special register called the "accumulator" as one implicit operand for most arithmetic/logic instructions. The other operand typically comes from memory. The result is stored back in the accumulator.
   ○ Example Instruction: ADD address (meaning ACC = ACC + Memory[address])
   ○ Characteristics: Simple hardware, short instructions (one explicit address). High memory traffic as operands frequently need loading/storing. Older architecture type (e.g., early microprocessors).
2. Stack Architecture:
   ○ Concept: Operands are implicitly on the top of a processor stack. ALU operations pop operands from the stack and push the result back onto it. Requires PUSH and POP instructions to move data between memory and the stack.
   ○ Example Instruction: ADD (pops the top two values, adds them, pushes the result)
   ○ Characteristics: Can lead to very compact code ("zero-address instructions"). Stack management can be complex. Efficient for evaluating complex expressions. Used in some systems such as the Java Virtual Machine (JVM).
3. General Purpose Register (GPR) Architecture:
   ○ Concept: Uses multiple general-purpose registers to hold operands and results. The dominant modern architecture.
   ○ Sub-types:
      ■ Register-Memory: Allows ALU instructions to have one operand in a register and another in memory. ADD R1, address (meaning R1 = R1 + Memory[address]).
      ■ Register-Register (Load/Store): ALU operations only work on registers. Separate LOAD and STORE instructions are required to move data between registers and memory: LOAD R1, address1; LOAD R2, address2; ADD R3, R1, R2; STORE address3, R3.
   ○ Characteristics: Reduces memory traffic compared to accumulator/stack architectures (registers are faster). Requires more complex instruction formats (specifying multiple registers). Load/Store is the basis for most RISC architectures (such as ARM, MIPS, RISC-V).
4. Memory-Memory Architecture (less common now):
   ○ Concept: Allows ALU instructions to operate directly on operands located in main memory, potentially storing the result back to memory.
   ○ Example Instruction: ADD address1, address2, address3 (meaning Memory[address1] = Memory[address2] + Memory[address3])
   ○ Characteristics: Very high flexibility, complex instructions. Very slow due to multiple memory accesses per instruction. Not common in modern general-purpose CPUs, though some complex instructions in CISC architectures might resemble this.
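To make the contrast concrete, here is a minimal sketch of how the single statement C = A + B might be encoded under each operand-storage style, followed by a tiny interpreter for the zero-address (stack) version. The register names, addresses, and exact mnemonics are illustrative assumptions:

```python
# Minimal sketch: C = A + B in each operand-storage style, plus a tiny stack machine.
# Mnemonics, register names, and addresses below are illustrative assumptions.
#
# Accumulator:        LOAD A ; ADD B ; STORE C          (ACC is the implicit operand)
# Register-Register:  LOAD R1, A ; LOAD R2, B ; ADD R3, R1, R2 ; STORE C, R3
# Memory-Memory:      ADD C, A, B                        (Memory[C] = Memory[A] + Memory[B])
# Stack (zero-address): PUSH A ; PUSH B ; ADD ; POP C

memory = {"A": 7, "B": 35, "C": 0}
program = [("PUSH", "A"), ("PUSH", "B"), ("ADD", None), ("POP", "C")]

stack = []
for opcode, operand in program:
    if opcode == "PUSH":            # move a value from memory onto the stack
        stack.append(memory[operand])
    elif opcode == "ADD":           # implicit operands: the top two stack entries
        b, a = stack.pop(), stack.pop()
        stack.append(a + b)
    elif opcode == "POP":           # store the top of the stack back to memory
        memory[operand] = stack.pop()

print(memory["C"])  # 42
```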

5. Detailed Datapath of a Typical Register-Based (Load/Store) CPU

The datapath shows the physical connections (buses) and functional units (ALU, registers, memory interfaces) through which data flows during instruction execution. Here is a simplified view for a load/store architecture:

(Visualize blocks connected by lines/arrows representing buses)

● Instruction Fetch:
   1. The content of the PC is sent to the MAR.
   2. A Read signal is sent to the Memory Interface.
   3. The PC is incremented (usually PC = PC + 4, assuming 32-bit instructions/addresses).
   4. Memory returns the instruction via the data bus to the MDR.
   5. The instruction moves from the MDR to the IR.
● Instruction Decode:
   1. The opcode part of the instruction in the IR is sent to the Control Unit.
   2. The Control Unit decodes the instruction and generates control signals for subsequent steps.
   3. Register operands specified in the IR (e.g., source registers Rs, Rt) are used to select registers from the Register File.
● Execute (Example: ADD R_dest, R_src1, R_src2)
   1. Data from R_src1 and R_src2 in the Register File are sent to the ALU inputs (Input A, Input B).
   2. The Control Unit sends an "ADD" signal to the ALU.
   3. The ALU performs the addition.
   4. The ALU result is routed back towards the Register File.
● Execute (Example: address calculation for LOAD/STORE R_t, offset(R_s))
   1. Data from R_s in the Register File is sent to ALU Input A.
   2. The offset value (part of the instruction, possibly sign-extended) is sent to ALU Input B.
   3. The Control Unit sends an "ADD" signal to the ALU.
   4. The ALU calculates the effective memory address (Base Address + Offset).
● Memory Access (Example: LOAD R_t, address)
   1. The calculated address (from the ALU) is sent to the MAR.
   2. The Control Unit sends a Read signal to the Memory Interface.
   3. Memory returns the data via the data bus to the MDR.
● Memory Access (Example: STORE R_t, address)
   1. The calculated address (from the ALU) is sent to the MAR.
   2. Data from register R_t (specified in the IR) is read from the Register File and sent to the MDR.
   3. The Control Unit sends a Write signal to the Memory Interface.
   4. Data from the MDR is written to memory at the specified address.
● Write Back (Example: ADD R_dest, ... or LOAD R_t, ...)
   1. For ADD: the result from the ALU output is written into R_dest in the Register File.
   2. For LOAD: the data fetched from memory (now in the MDR) is written into R_t in the Register File.
   3. The Control Unit provides the correct register address (R_dest or R_t) and the Write Enable signal to the Register File.

Key Datapath Elements:

● PC -> Adder -> Mux -> PC (for incrementing the PC)
● PC -> MAR
● Memory Interface <-> MAR, MDR
● MDR -> IR
● IR -> Control Unit
● IR (register fields) -> Register File (Read/Write addresses)
● Register File (Read Ports) -> ALU Inputs (possibly via Muxes)
● IR (immediate field) -> Sign Extender -> ALU Input B (via Mux)
● ALU Output -> MAR (for address calculation)
● ALU Output -> Register File (Write Port) (for R-type results)
● MDR -> Register File (Write Port) (for Load results)
● Register File (Read Port) -> MDR (for Store data)
● Control Unit -> control signals to Muxes, ALU, Register File (Write Enable), Memory Interface (Read/Write)

A register-transfer sketch of one ADD instruction flowing through this datapath follows.
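A minimal register-transfer sketch (assuming 4-byte instructions and a toy instruction encoding, not a real ISA) that walks one ADD R_dest, R_src1, R_src2 through the fetch, decode, execute, and write-back transfers listed above:

```python
# Minimal register-transfer sketch of one ADD instruction through the datapath.
# The instruction encoding and memory contents are toy assumptions, not a real ISA.
instruction_memory = {0: ("ADD", 3, 1, 2)}      # ADD R3, R1, R2 stored at address 0
register_file = {1: 10, 2: 32, 3: 0}            # R1=10, R2=32, R3 receives the result

PC = 0

# Fetch: PC -> MAR, memory read -> MDR -> IR, PC incremented
MAR = PC
MDR = instruction_memory[MAR]
IR = MDR
PC = PC + 4                                     # assuming 4-byte instructions

# Decode: opcode goes to the Control Unit, register fields select Register File ports
opcode, r_dest, r_src1, r_src2 = IR

# Execute: Register File read ports feed the ALU inputs
alu_input_a = register_file[r_src1]
alu_input_b = register_file[r_src2]
alu_output = alu_input_a + alu_input_b          # Control Unit asserts "ADD"

# Write Back: ALU output -> Register File write port (Write Enable asserted)
register_file[r_dest] = alu_output

print(register_file[3])  # 42
```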

6. Fetch-Decode-Execute Cycle (Typically 3 to 5 Stages)

This is the fundamental cycle performed by the CPU to execute instructions.

Basic 3-Stage Cycle (a minimal interpreter sketch follows the list below):

1. Fetch:
   ○ Get the address from the PC.
   ○ Load the instruction from memory at that address into the IR.
   ○ Increment the PC to point to the next instruction.
2. Decode:
   ○ Interpret the opcode in the IR.
   ○ Identify the operands needed.
   ○ Generate control signals for the execute stage.
   ○ Fetch operands from registers if needed.
3. Execute:
   ○ Perform the operation specified by the instruction (using the ALU, accessing memory, changing the PC for jumps/branches, writing results to registers).
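A minimal interpreter-style sketch of this 3-stage loop. The small "program" and its mnemonics (LOADI, ADD, HALT) are toy assumptions used only to show the cycle's structure:

```python
# Minimal sketch of the Fetch-Decode-Execute loop for a toy instruction set.
# The mnemonics (LOADI, ADD, HALT) and the program below are assumptions for illustration.
memory = [("LOADI", "R1", 40), ("LOADI", "R2", 2), ("ADD", "R3", "R1", "R2"), ("HALT",)]
registers = {"R1": 0, "R2": 0, "R3": 0}
PC = 0

while True:
    # Fetch: read the instruction addressed by PC into IR, then advance PC
    IR = memory[PC]
    PC += 1

    # Decode: split the opcode from its operands
    opcode, *operands = IR

    # Execute: perform the operation named by the opcode
    if opcode == "LOADI":                 # load an immediate value into a register
        registers[operands[0]] = operands[1]
    elif opcode == "ADD":                 # dest = src1 + src2
        dest, src1, src2 = operands
        registers[dest] = registers[src1] + registers[src2]
    elif opcode == "HALT":
        break

print(registers["R3"])  # 42
```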

Typical 5-Stage RISC Pipeline Cycle: (this breakdown is crucial for understanding pipelining)

1. IF (Instruction Fetch): Fetch the instruction from memory using the address in the PC, store it in the IR, and increment the PC.
2. ID (Instruction Decode & Register Fetch): Decode the instruction in the IR, identify the required registers, and read operand values from the Register File. Decode immediate values. Check for hazards.
3. EX (Execute / Address Calculation):
   ○ For ALU instructions: perform the operation using the ALU on the operands fetched in ID.
   ○ For Load/Store: calculate the effective memory address using the ALU (Base + Offset).
   ○ For Branches: calculate the branch target address and evaluate the branch condition.
4. MEM (Memory Access):
   ○ For Load: read data from memory using the address calculated in EX.
   ○ For Store: write data (fetched from a register in ID) to memory using the address calculated in EX.
   ○ Other instructions usually do nothing at this stage.
5. WB (Write Back): Write the result back into the Register File.
   ○ For ALU instructions: write the result from the EX stage.
   ○ For Load instructions: write the data fetched in the MEM stage.

In a non-pipelined CPU, one instruction goes through all 5 stages before the next one starts IF.

7. Microinstruction Sequencing & Implementation of the Control Unit

The Control Unit generates the signals that control the datapath. There are two main implementation approaches:

A. Hardwired Control Unit:

● Implementation: Uses fixed, dedicated combinational logic circuits (AND, OR, NOT gates, decoders) to generate control signals based on the instruction opcode, ALU flags, and timing signals (clock).
● Operation: The opcode bits directly feed into the logic gates. The outputs of these gates are the control signals.
● Microinstruction Sequencing: Not applicable in the same way. The "sequence" is determined by the flow through the fixed logic based on the current state and instruction.
● Pros: Very fast execution speed.
● Cons: Complex to design and debug. Inflexible; modifying the instruction set requires redesigning the hardware. Difficult to implement complex instruction sets. Typically used in RISC processors.

B. Microprogrammed Control Unit:

● Implementation: Control signals are stored as sequences of "microinstructions" in a special memory called the Control Store (or Control Memory, CM), typically ROM or fast RAM.
● Components:
   ○ Control Store (CS): Holds the microprogram(s).
   ○ Microinstruction Register (µIR): Holds the current microinstruction being executed.
   ○ Microprogram Counter (µPC): Holds the address of the next microinstruction in the CS to be fetched (analogous to the main PC).
   ○ Sequencing Logic: Determines the next value for the µPC.
● Operation (a sequencing sketch follows this section):
   1. The instruction opcode from the IR is mapped to a starting address in the Control Store.
   2. The µPC is loaded with this starting address.
   3. The microinstruction at the µPC address is fetched from the CS into the µIR.
   4. The bits in the µIR directly represent the control signals needed by the datapath for that micro-step.
   5. The Sequencing Logic uses information from the µIR (next-address field), the instruction opcode, and ALU flags to calculate the address of the next microinstruction (µPC update).
   6. Repeat steps 3-5 until the end of the micro-routine for the current machine instruction.
● Microinstruction Sequencing: How the next microinstruction address is determined:
   ○ Increment: µPC = µPC + 1 (default sequential execution).
   ○ Branching: Based on ALU flags (e.g., if the Zero flag is set, jump to microinstruction X, else continue).
   ○ Dispatching: Based on the opcode of the machine instruction (used to find the start of the correct micro-routine).
   ○ Explicit Next Address: The current microinstruction contains the address of the next one.
● Pros: Flexible (changing the instruction set means rewriting the microprogram in the CS, not redesigning hardware). Easier to implement complex instruction sets (CISC). Simpler design process.
● Cons: Slower than hardwired control due to the extra memory access time for fetching microinstructions from the CS.
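A minimal sketch of the microprogrammed control loop described above: dispatch on the machine opcode to a start address, then fetch microinstructions whose bits act as the control signals. The control-store contents, dispatch table, and signal names are illustrative assumptions:

```python
# Minimal sketch of microprogrammed control sequencing.
# The control-store contents, dispatch table, and signal names are assumptions.
dispatch_table = {"ADD": 10, "LOAD": 20}         # machine opcode -> micro-routine start address

control_store = {
    10: {"signals": ["RegRead", "ALU_ADD"], "next": 11},
    11: {"signals": ["RegWrite"],           "next": None},  # end of ADD micro-routine
    20: {"signals": ["ALU_ADD_ADDR"],       "next": 21},
    21: {"signals": ["MemRead"],            "next": 22},
    22: {"signals": ["RegWrite"],           "next": None},  # end of LOAD micro-routine
}

def run_microprogram(opcode: str):
    upc = dispatch_table[opcode]                 # dispatch: opcode selects the start address
    while upc is not None:
        uir = control_store[upc]                 # fetch the microinstruction into the uIR
        print(f"uPC={upc}: assert {uir['signals']}")   # uIR bits drive the datapath
        upc = uir["next"]                        # sequencing logic: explicit next address

run_microprogram("LOAD")
```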

8. Enhancing Performance with Pipelining


Pipelining is a technique used to improve CPU throughput by overlapping the execution stages of multiple instructions. It does not make a single instruction faster, but it increases the number of instructions completed per unit of time.
● Concept: Divide instruction processing into multiple stages (like the 5-stage IF, ID, EX, MEM, WB). Insert pipeline registers between stages to hold the intermediate results and control information for an instruction as it moves down the "assembly line".
● Operation: In an ideal pipeline, a new instruction enters the first stage (IF) in every clock cycle. While instruction 1 is in ID, instruction 2 is in IF. While instruction 1 is in EX, instruction 2 is in ID and instruction 3 is in IF, and so on.
● Benefit: If there are k stages, the ideal speedup compared to a non-pipelined CPU is k times (assuming balanced stage delays and no interruptions). In the 5-stage example, after the first instruction takes 5 cycles to complete, subsequent instructions complete at a rate of one per cycle (ideally). A cycle-count sketch follows this list.
● Challenges - Pipeline Hazards: Situations that prevent the next instruction in the pipeline from executing during its designated clock cycle.
   ○ Structural Hazards: Hardware resource conflict. Two different instructions in the pipeline need the same resource (e.g., memory access) at the same time. Solved by duplicating resources (e.g., separate instruction/data caches) or stalling.
   ○ Data Hazards: An instruction depends on the result of a previous instruction that is still in the pipeline and has not yet written its result.
      ■ Read After Write (RAW - true dependence): Instruction J tries to read before instruction I writes (e.g., ADD R1, R2, R3 followed by SUB R4, R1, R5). Solved by forwarding/bypassing (routing the result directly from the ALU output/MEM stage back to the ALU input for the next instruction) or by stalling (inserting bubbles/NOPs).
      ■ Write After Read (WAR - anti dependence): Instruction J tries to write before instruction I reads. Less common in simple pipelines; handled by register renaming in more advanced CPUs.
      ■ Write After Write (WAW - output dependence): Instruction J tries to write before instruction I writes (to the same register). Handled by ensuring writes happen in order or by register renaming.
   ○ Control Hazards (Branch Hazards): Occur with branch/jump instructions. The pipeline fetches sequential instructions assuming the branch is not taken, but if the branch is taken, the fetched instructions are wrong and must be flushed. Solved by:
      ■ Stalling: Wait until the branch outcome is known.
      ■ Branch Prediction: Guess the outcome (e.g., predict not taken, predict taken, use history). If wrong, flush and fetch the correct path.
      ■ Delayed Branch: Execute one or more instructions after the branch instruction, regardless of the outcome (the compiler tries to fill this "delay slot" with useful independent instructions).
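A minimal sketch of the ideal cycle counts behind that speedup claim: a non-pipelined k-stage CPU needs n*k cycles for n instructions, while an ideal k-stage pipeline needs k + (n - 1):

```python
# Minimal sketch: ideal cycle counts and speedup for a k-stage pipeline.
# Assumes perfectly balanced stages and no hazards (the "ideal" case in the text).
def nonpipelined_cycles(n: int, k: int) -> int:
    return n * k                      # each instruction uses all k stages alone

def pipelined_cycles(n: int, k: int) -> int:
    return k + (n - 1)                # first instruction fills the pipe, then one per cycle

n, k = 100, 5
print(nonpipelined_cycles(n, k))      # 500
print(pipelined_cycles(n, k))         # 104
print(nonpipelined_cycles(n, k) / pipelined_cycles(n, k))  # ~4.8, approaching k = 5
```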
The Need for a Memory Hierarchy

Modern computer systems use a memory hierarchy to balance speed, cost, and capacity. No single memory type can offer the ideal combination of very fast, very large, and very cheap. Thus, the hierarchy is designed to optimize performance while managing cost.

Why is a Memory Hierarchy Needed?

1. Processor vs. Memory Speed Gap

● Modern CPUs are extremely fast, capable of executing billions of instructions per second.
● Main memory (RAM) is slower than the processor.
● If the CPU had to wait every time for a RAM access, performance would drastically degrade.

➤ Solution: Use faster, smaller memory (caches) closer to the CPU.

2. Cost vs. Capacity Trade-off

● Fast memory (like the SRAM used in caches) is expensive.
● Slower memory (like DRAM or hard disks) is cheaper and provides more capacity.

➤ Solution: Use small amounts of fast memory and larger amounts of slow memory.

3. Locality of Reference Principle

Programs tend to access a small portion of memory repeatedly over short periods:

● Temporal locality: If a memory location is accessed, it is likely to be accessed again soon.
● Spatial locality: If one memory location is accessed, nearby locations are likely to be accessed soon.

➤ Caches exploit this by storing recently or nearby used data to speed up access.

Locality of Reference Principle

The locality of reference principle is a key concept in computer architecture that describes how programs tend to access a relatively small portion of memory at any given time. This principle can be broken down into two types:

1. Temporal Locality: This refers to the tendency of a program to access the same memory locations repeatedly within a short time frame. For example, if a program accesses a particular variable, it is likely to access it again soon.
2. Spatial Locality: This refers to the tendency of a program to access memory locations that are close to each other. For instance, if a program accesses a certain array element, it is likely to access nearby elements shortly thereafter.

The locality of reference principle is crucial for designing efficient memory systems because it allows for the implementation of faster, smaller memory types (like cache) that can store frequently accessed data, thereby reducing the average time to access memory.
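A minimal sketch of both kinds of locality in an ordinary summation loop (the array size is an arbitrary assumption): the running total is reused on every iteration (temporal locality), and the array elements are touched at consecutive addresses (spatial locality), which is exactly the access pattern caches reward:

```python
# Minimal sketch: temporal and spatial locality in a plain summation loop.
data = list(range(10_000))   # array size is an arbitrary assumption

total = 0                    # 'total' is re-read and re-written every iteration -> temporal locality
for i in range(len(data)):
    total += data[i]         # data[0], data[1], ... are adjacent in memory -> spatial locality

print(total)
```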

Memory Hierarchy in Practice

A memory hierarchy is a structured arrangement of different types of memory that vary in speed, size, and cost. The main levels of the memory hierarchy include:

1. Cache Memory: This is the fastest type of memory, located closest to the CPU. It is used to store frequently accessed data and instructions to speed up processing. Cache memory is typically divided into levels (L1, L2, L3), with L1 being the fastest and smallest.
2. Main Memory (RAM): This is the primary storage used by the CPU to hold data and instructions that are currently in use. It is slower than cache but larger in capacity. Main memory is volatile, meaning it loses its contents when power is turned off.
3. Secondary Memory: This includes storage devices like hard drives, SSDs, and optical disks. Secondary memory is non-volatile and is used for long-term data storage. It is much slower than both cache and main memory but offers much larger storage capacity at a lower cost.

Memory Parameters

● Access Time: The time between a memory request and the delivery of the data. Cache memory has the shortest access time, followed by main memory, and then secondary memory.
● Cycle Time: The time between successive accesses.
● Cost per Bit: A measure of how much it costs to store one bit of data. Cache memory is the most expensive per bit, followed by main memory, and then secondary memory, which is the least expensive.

Main Memory

Semiconductor RAM & ROM Organization

RAM (Random Access Memory): This is a type of volatile memory that allows data to be read and written. It is organized into cells, each with a unique address. RAM can be further categorized into:

● Static RAM (SRAM): Uses bistable latching circuitry to store each bit. It is faster and more expensive than DRAM and is used for cache memory due to its speed.
● Dynamic RAM (DRAM): Stores each bit in a capacitor, which must be refreshed periodically. It is slower and less expensive than SRAM and is used for main memory.

ROM (Read-Only Memory): This is non-volatile memory that is used to store firmware or software that does not change. It is organized similarly to RAM but is typically slower and cannot be easily modified.

Memory Expansion

Memory expansion refers to the ability to increase the amount of RAM in a system. This can be done by adding more RAM modules to the motherboard, allowing for improved performance and the ability to run more applications simultaneously.

Cache Memory

Associative & Direct Mapped Cache Organizations

Cache memory can be organized in different ways to optimize performance (an address-breakdown sketch for the direct-mapped case follows this list):

1. Direct Mapped Cache: Each block of main memory maps to exactly one cache line. This is simple and fast but can lead to cache misses if multiple memory blocks map to the same cache line (known as conflict misses).
2. Associative Cache: Any block of main memory can be stored in any cache line. This flexibility reduces conflict misses but requires more complex hardware to search for data, making it slower than direct-mapped caches.
3. Set-Associative Cache: This is a compromise between direct-mapped and fully associative caches. The cache is divided into sets, and each set can hold multiple blocks. A block of memory can be placed in any line within a specific set, balancing speed and complexity.
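A minimal sketch of how a direct-mapped cache splits an address into offset, index, and tag bits. The cache geometry (64 lines of 64-byte blocks) is an assumed example, not a figure from the text:

```python
# Minimal sketch: address breakdown for a direct-mapped cache.
# Assumed geometry: 64 lines of 64-byte blocks (a 4 KiB cache).
BLOCK_SIZE = 64          # bytes per block  -> 6 offset bits
NUM_LINES = 64           # cache lines      -> 6 index bits

def split_address(addr: int):
    offset = addr % BLOCK_SIZE
    index = (addr // BLOCK_SIZE) % NUM_LINES     # which cache line the block maps to
    tag = addr // (BLOCK_SIZE * NUM_LINES)       # identifies which block occupies that line
    return tag, index, offset

# Two addresses 4 KiB apart share the same index, so they conflict in a direct-mapped cache.
print(split_address(0x1234))   # (1, 8, 52)
print(split_address(0x2234))   # (2, 8, 52)  same index 8, different tag -> conflict miss
```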

Summary Table

Component     | Speed     | Cost/bit | Volatility   | Use case
--------------|-----------|----------|--------------|--------------------
Registers     | Fastest   | Highest  | Volatile     | CPU operations
Cache (SRAM)  | Very fast | High     | Volatile     | Speed up access
RAM (DRAM)    | Moderate  | Medium   | Volatile     | Main memory
ROM           | Slow      | Medium   | Non-volatile | Firmware/boot code
HDD/SSD       | Slowest   | Low      | Non-volatile | Long-term storage
