Computer Architecture
2. Pipelining
What are the key properties of hierarchical memory technology (inclusion, coherence,
locality)?
How do cache memory organizations work, and what are their types?
What are the techniques used for reducing cache misses?
What is the role of virtual memory in modern computer systems?
How is virtual memory organized, mapped, and managed?
What are the different memory replacement policies, and how do they work?
5. Multiprocessor Architecture
What are the different types of parallel architectures, and how are they classified?
What is centralized shared-memory architecture, and how is synchronization achieved?
How does memory consistency work in a centralized shared-memory system?
What are interconnection networks, and why are they important in multiprocessor
architectures?
What is distributed shared-memory architecture, and how is it different from centralized
shared-memory?
How do cluster computers work, and what are their advantages in parallel computing?
What are data flow computers, and how do they differ from traditional von Neumann
architectures?
What are reduction computer architectures, and how are they used in parallel processing?
What are systolic architectures, and how do they handle data processing efficiently?
These questions cover the essential aspects of each topic in computer architecture and should
help in understanding the key concepts and mechanisms discussed in the list.
COMPUTER ARCHITECTURE
MODULE 1:
Computer architecture refers to the design and organization of a computer system, defining
how the hardware components interact to execute software programs efficiently. The
architecture of a computer consists of several key components:
1. Central Processing Unit (CPU)
The CPU is the core of any computer system and is responsible for executing instructions. It
consists of:
Control Unit (CU): Directs operations within the CPU, fetching, decoding, and executing
instructions.
Arithmetic and Logic Unit (ALU): Performs arithmetic calculations (addition, subtraction,
etc.) and logical operations (AND, OR, NOT, etc.).
Registers: Small, high-speed storage units used for temporary data storage during
processing. Important registers include:
o Program Counter (PC): Holds the memory address of the next instruction.
2. Memory Hierarchy
Cache Memory: High-speed memory between the CPU and main memory, used to store
frequently accessed data.
Main Memory (RAM): Primary volatile memory used to store programs and data for
active processes.
Secondary Storage: Non-volatile storage like HDDs and SSDs for long-term data storage.
Virtual Memory: An extension of main memory using disk storage, managed by the OS
to handle larger programs.
3. Input/Output, Storage, and Communication Devices
Input devices (keyboard, mouse, scanner) allow users to interact with the computer.
Storage devices (HDD, SSD, USB drives) retain data for future use.
Communication devices (network cards, modems) enable data exchange over networks.
4. Bus System
Address Bus: Carries memory addresses from the CPU to memory or I/O devices.
Data Bus: Carries the actual data being transferred between the CPU, memory, and I/O devices.
Control Bus: Sends control signals from the CPU to other components.
5. Instruction Set Architecture (ISA)
Defines the set of instructions that a CPU can execute, classified into:
RISC (Reduced Instruction Set Computing): Uses simple instructions for faster execution.
CISC (Complex Instruction Set Computing): Uses complex instructions, requiring fewer
lines of code but more CPU cycles.
1. Amdahl’s Law
Amdahl’s Law is used to estimate the maximum theoretical speedup achievable by enhancing
a portion of a system.
Formula:
Speedup = 1 / ((1 − P) + P / S)
Where:
P = fraction of execution time that benefits from the enhancement.
S = speedup of the enhanced portion.
Insight:
Amdahl's Law shows that even if you significantly speed up one part of a system, the
overall gain is limited by the parts that remain unimproved.
Helps identify the point of diminishing returns—i.e., when further optimization yields
little benefit.
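A quick worked example using the formula above (with assumed numbers): if 90% of the runtime benefits from an enhancement (P = 0.9) and that portion is made 10× faster (S = 10), then Speedup = 1 / ((1 − 0.9) + 0.9/10) = 1 / (0.1 + 0.09) ≈ 5.26. Even an infinite speedup of that portion (S → ∞) is capped at 1 / 0.1 = 10.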
2. Little’s Law
Little's Law is essential for analyzing the performance of queuing systems such as CPUs,
memory controllers, and I/O subsystems.
Formula:
L = λ × W
Where:
L = average number of tasks in the system.
λ = average arrival rate of tasks.
W = average time a task spends in the system (waiting plus service).
Application:
Relates throughput, latency, and occupancy in pipelines, memory controllers, and I/O queues.
3. Power-Performance Tradeoff
Formula:
Dynamic Power ≈ C × V² × f
Where:
C = switched capacitance, V = supply voltage, f = clock frequency.
Key Concepts:
Reducing voltage (V) significantly lowers power consumption due to the squared
relationship.
Techniques to reduce power:
o Dynamic Voltage Scaling (DVS): Adjusts voltage and frequency based on
workload.
o Clock Gating: Turns off clock signals to idle circuits to save power.
o Power Gating: Shuts off power supply to inactive parts of the chip.
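A brief illustration with assumed values: because dynamic power is roughly proportional to C × V² × f, lowering the supply voltage from 1.2 V to 0.9 V at the same frequency cuts dynamic power to (0.9/1.2)² ≈ 0.56 of its original value, i.e., about a 44% reduction before any frequency scaling is applied.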
a. Pipelining: Overlaps the execution of multiple instructions in different stages of the processor.
b. Parallel Processing: Uses multiple processing units or cores to execute independent tasks simultaneously.
c. Caching:
Stores frequently accessed data in small, fast memory (cache).
Reduces average memory access time.
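The effect of caching is often summarized by the average memory access time (AMAT). A small worked example with assumed numbers: AMAT = Hit Time + Miss Rate × Miss Penalty; with a 1 ns hit time, a 5% miss rate, and a 60 ns miss penalty, AMAT = 1 + 0.05 × 60 = 4 ns, far better than paying the 60 ns main-memory latency on every reference.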
d. Branch Prediction: Guesses the outcome of branch instructions so the pipeline can keep fetching useful work.
1. Performance Metrics
Formula:
CPU Time = Instruction Count × CPI × Clock Cycle Time
Terms:
Instruction Count (IC): Number of instructions executed.
CPI: Average clock cycles per instruction.
Clock Cycle Time: Duration of one clock cycle.
2. Benchmarks
Standardized programs used to measure system performance under typical workloads.
Types:
SPEC Benchmarks: Evaluate general CPU performance using real-world application
workloads.
TPC Benchmarks: Focus on transaction processing and database systems.
Limitations:
b. Efficiency:
Throughput:
1. Fetch (IF - Instruction Fetch): Retrieves the next instruction from memory.
2. Decode (ID - Instruction Decode): Decodes the instruction and determines the required
operands.
3. Execute (EX - Execute): Performs the operation (arithmetic, logic, or data transfer).
4. Memory Access (MEM): Reads data from or writes data to memory, if the instruction requires it.
5. Write Back (WB - Write Back to Register): Stores the result back into the register file.
By overlapping these stages, a processor can achieve higher instruction throughput compared
to sequential execution.
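A small worked example (idealized, with assumed numbers): with a 5-stage pipeline, executing n = 100 instructions takes roughly k + (n − 1) = 5 + 99 = 104 cycles instead of k × n = 500 cycles sequentially, a speedup of about 4.8×, ignoring stalls and hazards.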
Instruction Pipeline
Arithmetic Pipeline
1. Fetch operands
2. Decode and align the operands (e.g., exponent comparison for floating-point values)
3. Perform the arithmetic operation
4. Store results
This is especially useful in high-performance processors and digital signal processing (DSP)
applications.
3. Hazards in Pipelining
Pipeline execution is not always smooth due to various hazards that may cause delays or
incorrect execution.
Data hazards occur when instructions depend on the results of previous instructions that have
not yet completed. Types of data hazards include:
RAW (Read After Write): Occurs when an instruction tries to read a value that has not
been written yet by a previous instruction.
WAR (Write After Read): Occurs when an instruction writes a value before a previous
instruction reads it.
WAW (Write After Write): Occurs when two instructions try to write to the same
register in an overlapping manner.
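A minimal C sketch labelling all three dependence types (the variable names are hypothetical, chosen only to mark the hazards):

void hazard_examples(int b, int c, int e, int f, int g) {
    int a, d;
    a = b + c;   // I1: writes a
    d = a + e;   // I2: reads a  -> RAW (true dependence): I2 needs I1's result
    a = f + g;   // I3: writes a -> WAR with I2 (I2 must read a before I3 overwrites it)
                 //                and WAW with I1 (both write a; I3's value must be the final one)
    (void)d;     // suppress unused-variable warning
}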
Pipeline Stalling (Bubble Insertion): The pipeline is stalled until the necessary data is
available.
Register Renaming: Used to eliminate WAW and WAR hazards by dynamically allocating
different registers for different instructions.
Control hazards occur when the pipeline does not know which instruction to fetch next due to a
branch or jump instruction.
Delayed Branching: Rearranges instructions so that useful work is done while the branch
decision is pending.
Branch Target Buffer (BTB): Stores branch outcomes to improve prediction accuracy.
Structural hazards occur when multiple instructions compete for the same hardware resource
(e.g., memory, ALU, registers) at the same time.
Resource Duplication: Adding more hardware resources (e.g., multiple execution units,
multiple memory ports).
Precise Interrupts: Ensuring that all instructions before the exception are completed,
and none after it are executed.
Reordering Buffers: Storing out-of-order execution results and committing them only
when it is safe.
Adding more pipeline stages can improve clock speeds but increases control complexity.
Example: Deep pipelines in modern CPUs (e.g., Intel’s Pentium 4 had a 20-stage
pipeline).
Superscalar Execution: Uses multiple pipelines to execute more than one instruction per cycle.
Example: Modern processors like Intel Core i7 and AMD Ryzen use superscalar
execution.
Out-of-Order Execution: Allows instructions to be executed as soon as their operands are ready rather than
strictly following the program order.
Loop Unrolling: Replicates the loop body to reduce loop-control overhead and expose more parallelism.
Example: Instead of executing a loop body 10 times one iteration at a time, the compiler may unroll it so that 5 iterations each execute two copies of the body.
Loop Invariant Code Motion: Moves constant computations outside of loops to reduce
redundant calculations.
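A short C sketch of loop-invariant code motion (the variable and array names are purely illustrative):

// Before: the product scale * factor is recomputed on every iteration.
void scale_before(double *y, const double *x, int n, double scale, double factor) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] * (scale * factor);
}

// After: the loop-invariant expression is hoisted out and computed once.
void scale_after(double *y, const double *x, int n, double scale, double factor) {
    double k = scale * factor;       // invariant value computed outside the loop
    for (int i = 0; i < n; i++)
        y[i] = x[i] * k;
}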
The inclusion property in hierarchical memory systems refers to the relationship between data
stored at various levels of the memory hierarchy. Specifically, it ensures that all data present in
a lower-level cache (e.g., L1) must also be present in the higher-level cache (e.g., L2 or L3).
Types of Inclusion:
1. Inclusive Cache:
o Higher-level caches contain all the data from lower levels.
o Advantage: Easier coherence tracking in multiprocessor systems.
o Disadvantage: Wasted space due to duplication of data.
2. Exclusive Cache:
o Data is uniquely stored at only one level of the cache hierarchy.
o Advantage: Maximizes effective cache capacity.
o Disadvantage: Slightly more complex management and cache coherence.
3. Non-Inclusive (or Partially Inclusive):
o No strict rule. A block may or may not be present in both levels.
o Offers a balance between capacity and complexity.
Importance:
1. Write Propagation: Changes in one cache must eventually propagate to all other caches
or to the main memory.
2. Transaction Serialization: All processors must observe writes in the same order (global
ordering).
Coherence Protocols:
1. Directory-Based Protocols:
o A centralized directory keeps track of which caches hold a copy of each block.
o Efficient for large-scale multiprocessors.
2. Snoopy Protocols:
o Caches monitor a common bus for memory access by others.
o Example: MESI (Modified, Exclusive, Shared, Invalid) protocol.
Challenges:
Locality of reference describes how programs tend to access memory locations in a predictable
pattern.
Types of Locality:
1. Temporal Locality:
o If a memory location is referenced once, it is likely to be referenced again soon.
o Example: Loop counters or recently used variables.
o Cache Implication: Frequently accessed data should be kept in faster memory.
2. Spatial Locality:
o If a memory location is accessed, nearby locations are likely to be accessed soon.
o Example: Accessing elements of an array in a loop.
o Cache Implication: Fetching contiguous memory blocks is beneficial.
3. Sequential Locality (Subset of Spatial):
o Memory is accessed in a sequential pattern (e.g., instruction fetching).
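A small C example showing both kinds of locality at once (array and function names are arbitrary):

// 'sum' and 'i' are reused on every iteration (temporal locality), while the
// consecutive elements a[0], a[1], ... share cache lines (spatial locality).
int sum_array(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}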
Cache memory organization refers to how data is stored, accessed, and managed within the
cache.
Key Elements:
1. Mapping Techniques: Determines how main memory blocks are placed in cache.
a. Direct Mapping: Each main memory block maps to exactly one cache line (line index = block address mod number of lines).
b. Associative Mapping: A block can be placed in any cache line; all tags are searched in parallel.
c. Set-Associative Mapping: The cache is divided into sets; a block maps to one set but may occupy any line within that set.
2. Write Policies:
a. Write-Through: Every write updates both the cache and main memory.
b. Write-Back: Writes update only the cache; modified (dirty) blocks are written to memory when they are evicted.
Design Considerations:
To enhance performance and reduce these misses, several hardware and software-level
techniques are employed:
1. Larger Cache Size
Larger caches can store more data, thus reducing capacity misses.
Impact:
Trade-offs:
2. Higher Associativity
Description:
Using set-associative or fully associative caches reduces conflict misses by allowing a memory
block to be stored in multiple places.
Types:
Trade-offs:
3. Better Replacement Policies
When the cache is full, a good replacement policy decides which block to evict to minimize future
misses.
Common Policies:
LRU (Least Recently Used): Evicts block not used for the longest time.
Random: Chooses a block randomly.
LFU (Least Frequently Used): Evicts block with fewest accesses.
Advanced Techniques:
4. Cache Prefetching
Description:
Prefetching predicts which data the CPU will need and fetches it into the cache before it’s
requested.
Types:
Hardware Prefetching:
o Uses dedicated logic to detect access patterns (like sequential or strided access).
o Fetches next block(s) automatically.
Software Prefetching:
o Compiler or programmer inserts prefetch instructions.
o Useful in loops and predictable access patterns.
Effectiveness: Works best for regular, predictable access patterns; inaccurate prefetches waste bandwidth and can pollute the cache.
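A minimal sketch of software prefetching in C, assuming a GCC/Clang-style compiler that provides the __builtin_prefetch intrinsic (the lookahead distance of 16 elements is an arbitrary illustrative choice):

long sum_with_prefetch(const long *a, long n) {
    long sum = 0;
    for (long i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]);  // hint: bring a future element into the cache early
        sum += a[i];
    }
    return sum;
}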
5. Blocking (Tiling)
Description:
A software optimization technique where large data is divided into blocks (tiles) that fit into
the cache.
Common in: matrix multiplication, image processing, and other loop-heavy numerical kernels.
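As a concrete illustration, a tiled (blocked) matrix multiplication sketch in C; the block size BLK is a hypothetical tuning parameter chosen so that three tiles fit in the cache, and C is assumed to be zero-initialized:

#define BLK 32   /* illustrative block size */

void matmul_tiled(int n, const double *A, const double *B, double *C) {
    /* A, B, C are n x n matrices stored in row-major order. */
    for (int ii = 0; ii < n; ii += BLK)
        for (int kk = 0; kk < n; kk += BLK)
            for (int jj = 0; jj < n; jj += BLK)
                /* work on one tile at a time so it stays resident in the cache */
                for (int i = ii; i < ii + BLK && i < n; i++)
                    for (int k = kk; k < kk + BLK && k < n; k++)
                        for (int j = jj; j < jj + BLK && j < n; j++)
                            C[i*n + j] += A[i*n + k] * B[k*n + j];
}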
6. Victim Cache
Description:
A small buffer (victim cache) stores recently evicted cache lines from L1 cache.
Purpose:
Trade-off:
7. Compiler Optimizations
Techniques:
Benefit:
9. Multilevel Caching
Description:
Uses two or three cache levels (L1, L2, L3) of increasing size and latency, so that misses in the small, fast cache are often caught by a larger, slower level instead of main memory.
Benefit:
Reduces the average miss penalty and the number of accesses to main memory.
Divides a cache block into smaller sub-blocks or sectors with individual valid bits.
Use Case:
Impact:
Reduces unnecessary memory transfers. Lowers compulsory and capacity misses for fine-
grained accesses.
✅ Conclusion:
Reducing cache misses is critical for improving system performance, especially in modern
processors with deep memory hierarchies. A combination of architectural enhancements (like
associativity and multilevel caches) and software-level optimizations (like tiling and
compiler techniques) provides the best results. Choosing the right strategies depends on the
application workload, cache architecture, and system constraints.
Virtual Memory (VM) is a memory management technique that creates an illusion of a large,
continuous memory space to applications, even if the physical memory (RAM) is limited. It
allows systems to execute programs larger than the available physical memory by using disk
space as an extension of RAM.
Key Features:
Advantages:
Program Isolation: Each process has its own address space, improving security.
Memory Efficiency: Only needed pages are loaded into memory, saving space.
Simplifies Programming: Developers don’t need to manage memory allocation
manually.
Supports Multitasking: Multiple programs can run simultaneously with isolated
memory.
Diagram:
[Virtual Address] -> [Page Number + Offset] -> [Page Table] -> [Frame Number]
-> [Physical Address]
Components:
🔹 1. Paging
Concept:
Virtual memory is divided into fixed-size blocks called pages (e.g., 4KB).
Physical memory is divided into frames of the same size.
A Page Table maps virtual page numbers to physical frame numbers.
Translation:
Virtual Address = [Page Number | Offset]
→ Page Table Lookup → Frame Number
→ Physical Address = [Frame Number | Offset]
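A small worked example (assuming 32-bit addresses and 4 KB pages): the offset field is 12 bits, so virtual address 0x00403ABC splits into page number 0x00403 (the upper 20 bits) and offset 0xABC; if the page table maps page 0x00403 to frame 0x0012F, the resulting physical address is 0x0012FABC.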
Advantages:
Challenges:
🔹 2. Segmentation
Concept:
Memory is divided into variable-sized logical segments (e.g., code, data, stack).
Each segment has a base (starting address) and a limit (length).
The virtual address consists of a segment number and an offset.
Translation:
Virtual Address = [Segment Number | Offset]
→ Segment Table Lookup → Base + Offset = Physical Address
Advantages:
Supports logical program structure.
Facilitates memory protection and sharing.
Challenges:
🔹 3. Segmentation with Paging
Translation:
Virtual Address = [Segment Number | Page Number | Offset]
→ Segment Table → Page Table Base Address
→ Page Table Lookup → Frame Number
→ Physical Address = [Frame Number | Offset]
Advantages:
Challenges:
🔹 4. Inverted Page Table
Concept:
Instead of one entry per virtual page, the inverted page table has one entry per physical
frame.
Each entry stores the virtual address mapped to that frame and a process ID.
Translation:
Requires a search (often hashed) to find the virtual-to-physical mapping.
Helps reduce memory overhead in systems with large virtual address spaces.
Advantages:
Challenges:
🔹 5. Translation Lookaside Buffer (TLB)
Concept:
The TLB is a small, fast hardware cache that stores recent virtual-to-physical address
translations.
Used with all mapping techniques to speed up access.
How it Works:
Advantage:
Management Techniques:
Single-level page tables: Simple but large for big address spaces.
Multi-level page tables: Hierarchical approach; reduces memory overhead.
Inverted page tables: One entry per frame, used in systems with large address spaces.
c. Demand Paging: Pages are loaded into physical memory only when they are first referenced, triggering a page fault on the initial access.
d. Copy-On-Write (COW): Processes share the same physical pages until one of them writes, at which point a private copy of the page is made.
When physical memory is full, the operating system must replace a page to load a new one. The
page replacement policy determines which page to evict, and it significantly impacts system
performance.
a. Least Recently Used (LRU):
Replaces the page that has not been used for the longest time.
Based on the assumption that recently used pages will be used again.
Implementation: Time-stamps or stack-based methods.
Drawback: Expensive to implement in hardware.
b. Optimal Replacement (OPT or MIN):
Replaces the page that will not be used for the longest time in the future.
Ideal but theoretical (needs future knowledge).
Used as a benchmark for other algorithms.
c. Least Frequently Used (LFU):
Maintains a counter for each page; incremented whenever the page is referenced.
Replaces the page with the lowest count.
Approximate but simpler than LRU.
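A short worked trace with an assumed reference string and 3 free frames: for the references 1, 2, 3, 1, 4, 2, LRU incurs 5 faults (when 4 arrives it evicts page 2, the least recently used, and then faults again on the final reference to 2), while OPT incurs only 4 faults because it evicts a page that is never referenced again (e.g., page 3) when 4 arrives.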
✅ Conclusion:
Efficient virtual memory systems rely heavily on organized address translation, effective page
table management, and smart page replacement strategies. Together, these ensure seamless
multitasking, optimized performance, and better memory utilization, making them critical
aspects of modern OS design.
MODULE 3:
🔹 1. Instruction-Level Parallelism (ILP) – Basic Concepts
(10 Marks)
✅ Definition:
📌 Types of ILP:
1. Fine-Grained ILP:
o Executes multiple independent instructions in the same clock cycle.
o Found in superscalar and VLIW architectures.
2. Coarse-Grained ILP:
o Executes large blocks of independent code (e.g., loop unrolling).
o Relies on compiler-level optimizations.
✅ Key Concepts:
a. Types of Parallelism:
b. Dependencies:
1. Data Dependency:
o Occurs when an instruction depends on the result of a previous one.
o Types: RAW (Read After Write), WAR (Write After Read), WAW (Write After
Write).
2. Control Dependency:
o Happens due to branching (e.g., if-else conditions).
3. Resource Dependency:
o Caused by competition for hardware resources (e.g., same ALU).
✅ Importance of ILP:
✅ 2. Superscalar Execution:
✅ 4. Register Renaming:
✅ 5. Branch Prediction:
✅ 6. Loop Unrolling:
Compiler-level technique.
Reduces control instructions and increases instruction parallelism.
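A short C sketch of unrolling by a factor of four (the array, bound, and scale factor are illustrative; the remainder loop handles sizes not divisible by four):

void scale4(float *a, int n, float s) {
    int i = 0;
    for (; i + 3 < n; i += 4) {      // four independent operations per iteration
        a[i]     *= s;
        a[i + 1] *= s;
        a[i + 2] *= s;
        a[i + 3] *= s;
    }
    for (; i < n; i++)               // remainder loop for the leftover elements
        a[i] *= s;
}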
✅ Compiler-Level Techniques:
a. Instruction Scheduling:
Rearranges instructions to avoid pipeline stalls or hazards.
b. Loop Unrolling:
Duplicates the loop body multiple times to expose parallel instructions.
c. Software Pipelining:
Overlaps instructions from different loop iterations.
✅ Hardware-Level Techniques:
a. Pipelining:
Divides instruction execution into stages; multiple instructions proceed simultaneously in
different stages.
b. Out-of-Order Execution:
Executes instructions as their operands become ready, not strictly in program order.
c. Register Renaming:
Eliminates false dependencies by using additional physical registers.
d. Speculative Execution:
Predicts outcomes of branches and executes instructions ahead of time.
e. Branch Prediction:
Reduces stalls by guessing the result of branch instructions early.
A superscalar processor can issue multiple instructions per clock cycle. It includes multiple
pipelines and execution units.
📌 Key Features:
📌 Advantages:
📌 Challenges:
✅ Pipeline Structure:
Stages: Fetch → Decode → Issue → Execute → Writeback
Multiple instructions pass through stages in parallel.
✅ Benefits:
Increased throughput.
Utilizes ILP more effectively.
✅ Challenges:
Complexity in dependency resolution, hazard detection, and instruction dispatch.
Diminishing returns due to limited parallelism in programs.
📌 How it Works:
📌 Features:
📌 Disadvantages:
✅ Key Points:
Pipeline clock frequency is increased (faster stages).
Instruction throughput is improved by shortening stage durations.
✅ Advantages:
Higher clock rates.
Better utilization of each pipeline stage.
✅ Drawbacks:
Increased control complexity.
More prone to pipeline hazards and stalls.
In VLIW architecture, a single instruction word contains multiple operations that are executed
in parallel. The compiler decides which instructions can run together.
📌 Structure:
Each VLIW instruction is composed of several operations (e.g., ALU, memory, branch).
Example: [ADD R1,R2,R3 | LOAD R4, 0(R5) | BRANCH R6]
📌 Key Characteristics:
1. Static Scheduling:
o Compiler handles dependency checking and scheduling.
2. Simple Hardware:
o Less complex than superscalar because no dynamic scheduling is needed.
📌 Advantages:
Compiler complexity.
Wasted instruction slots if parallelism is not found.
Compatibility issues due to fixed instruction formats.
✅ Structure:
Each instruction word may contain multiple independent operations (e.g., 4–8).
Relies on the compiler to handle dependency checking and scheduling.
✅ Features:
Static scheduling by the compiler.
Simplifies hardware (no need for dynamic scheduling or hazard detection).
Suitable for embedded systems, DSPs, and scientific applications.
✅ Advantages:
Efficient use of execution units.
Lower hardware complexity compared to superscalar.
✅ Limitations:
Requires powerful compilers.
Increased code size (instruction words are large).
Less flexible for runtime conditions like branching.
An Array Processor uses a set of identical processing elements (PEs) to perform the same
operation on different data simultaneously.
📌 Types:
📌 Advantages:
📌 Limitations:
Features:
High throughput for structured data.
Data broadcasting and synchronization support.
A Vector Processor executes a single instruction on a vector of data elements using vector
registers.
📌 Key Features:
1. Vector Registers:
o Hold vectors (arrays of data).
2. Vector Instructions:
o Perform operations like ADD.V V1, V2 → V3.
3. Pipelined Functional Units:
o Allow fast processing of large vectors.
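Conceptually, one vector instruction replaces an entire scalar loop. A C sketch of the scalar loop that a single instruction such as ADD.V V1, V2 → V3 would subsume (names are illustrative):

void vec_add_scalar(const float *v1, const float *v2, float *v3, int n) {
    for (int i = 0; i < n; i++)      // one vector ADD covers all n element additions
        v3[i] = v1[i] + v2[i];
}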
📌 Advantages:
📌 Limitations:
Vector Instructions:
Pipeline Execution:
Applications:
✅ Comparison:
✅ Summary Table:
Architecture ILP Type Issued by Key Advantage
Superscalar Dynamic Hardware Multiple instructions per cycle
Superpipelined Sequential Hardware Faster pipelines
VLIW Static Compiler Hardware simplicity
Array Processor SIMD Hardware Massively parallel data ops
Vector Processor SIMD/Vector Compiler+Hardware Efficient vector handling
MODULE 4:
🔹 1. Taxonomy of Parallel Architectures (10 Marks)
✅ Definition:
Taxonomy refers to the classification of parallel architectures based on how instructions and
data are handled. The most widely accepted taxonomy is Flynn’s Taxonomy.
📌 Flynn’s Taxonomy:
SISD (Single Instruction, Single Data): Traditional sequential computer – one instruction stream, one data stream (e.g., a basic CPU).
SIMD (Single Instruction, Multiple Data): One instruction operates on multiple data items – ideal for vector processing and graphics (e.g., GPUs, array processors).
MISD (Multiple Instruction, Single Data): Multiple instructions operate on the same data stream. Rarely used; mostly theoretical.
MIMD (Multiple Instruction, Multiple Data): Most modern multiprocessors – each processor works on different data using different instructions (e.g., multicore CPUs, clusters).
📌 MIMD Subcategories:
📌 Applications:
In this architecture, multiple processors share a single main memory and communicate
through it. The memory is centrally located and accessed by all processors.
📌 Key Features:
📌 Components:
📌 Types:
📌 Benefits:
📌 Challenges:
Synchronization ensures correct execution order when multiple processors access shared data.
It avoids race conditions, deadlocks, and data inconsistency.
📌 Types of Synchronization:
1. Mutual Exclusion:
o Ensures that only one processor accesses a critical section at a time.
o Implemented via locks, mutexes, semaphores.
2. Barriers:
o Forces all threads/processors to reach a point before proceeding.
o Used to coordinate phases in parallel execution.
3. Condition Variables:
o Allow threads to wait for certain conditions to become true.
o Used for producer-consumer models.
📌 Synchronization Primitives:
Test-and-Set, Compare-and-Swap: Hardware-level atomic instructions for lock
implementation.
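A minimal C11 sketch of a spinlock built on an atomic test-and-set primitive (assuming a compiler with <stdatomic.h> support):

#include <stdatomic.h>

atomic_flag lock = ATOMIC_FLAG_INIT;

void acquire(void) {
    while (atomic_flag_test_and_set(&lock))
        ;                      // spin until the flag was previously clear
}

void release(void) {
    atomic_flag_clear(&lock);  // open the critical section for the next processor
}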
📌 Challenges:
Memory consistency defines how memory operations (reads and writes) appear to execute
across multiple processors in a shared-memory system.
📌 Common Models:
1. Strict Consistency:
o Every read returns the most recent write.
o Very hard to implement in real systems.
2. Sequential Consistency:
o The result of execution is as if all operations were executed in some sequential
order.
o Easier to implement, widely used.
3. Weak Consistency:
o Relaxed rules; synchronization is needed to enforce consistency.
o Higher performance at the cost of complexity.
4. Release Consistency:
o Memory operations are grouped around acquire and release synchronization
points.
o Offers better performance in multithreaded programs.
📌 Importance:
Interconnection networks connect processors to memory and other processors. They determine
the communication pattern, bandwidth, and latency.
📌 Types:
1. Bus-Based Networks:
o All processors share a common bus.
o Simple and cost-effective.
o Limited scalability due to contention.
2. Crossbar Switch:
o Full connectivity; any processor can access any memory simultaneously.
o High bandwidth, but expensive for large systems.
3. Multistage Interconnection Networks (MINs):
o Use a layered approach (e.g., Omega, Butterfly).
o Good performance with lower cost than crossbars.
4. Mesh and Torus:
o Used in large systems (e.g., supercomputers).
o Each processor is connected to neighbors.
5. Hypercube:
o Processors connected in a multi-dimensional cube.
o Scalable and efficient.
📌 Key Metrics:
📌 Importance:
✅ Summary
Taxonomy: Flynn's classification (SISD, SIMD, MISD, MIMD)
Centralized Shared Memory: One memory accessed by all CPUs; suitable for small-scale systems
Synchronization: Mechanisms to safely share data (locks, barriers, condition variables)
Memory Consistency: Rules for how memory changes appear to different processors
Interconnection Networks: Structures to connect processors and memory (bus, mesh, crossbar)
📌 Key Characteristics:
Physical Distribution: Memory is located locally with each processor.
Logical Sharing: System software allows all processors to access all memory addresses.
Transparency: Programmers interact with memory as if it’s shared, simplifying coding.
📌 Working Mechanism:
Memory pages are replicated or migrated as needed.
A software layer handles memory accesses, consistency, and coherence.
The system tracks which memory is located where and moves data as needed.
📌 Advantages:
📌 Challenges:
📌 Examples:
📌 Applications:
🔷 2. Cluster Computers
✅ Definition:
A Cluster Computer is a group of loosely coupled, independent computers (nodes) that work
together as a single system. Each node has its own processor(s), memory, and operating system,
but they are connected through a high-speed network to collaborate on tasks.
📌 Key Components:
Nodes: Individual computers/servers with CPU, memory, storage.
Interconnect: Network that connects the nodes (e.g., Ethernet, InfiniBand).
Middleware: Software layer that manages job distribution, synchronization, etc.
Operating System: Usually Linux; may include cluster management tools (e.g., SLURM, OpenMPI).
📌 Types of Clusters:
📌 Advantages:
📌 Applications:
✅ Summary Table
📌 Key Features:
No program counter.
Execution is asynchronous and parallel.
Data dependencies determine instruction execution.
Programs are represented as data flow graphs.
📌 Working Mechanism:
📌 Advantages:
📌 Disadvantages:
📌 Use Cases:
📌 Examples:
📌 Key Features:
📌 Working Principle:
📌 Advantages:
📌 Disadvantages:
1. Harder to implement efficient memory management.
2. Programs may require extensive rewriting.
3. Lack of commercial hardware implementations.
📌 Applications:
Symbolic mathematics.
Compilers for functional programming languages.
Research in declarative computing.
📌 Examples:
🔷 3. Systolic Architectures
📚 (10 Marks Answer)
✅ Definition:
📌 Key Characteristics:
📌 Working Mechanism:
Each processor performs part of a computation.
Data enters at one end and flows through the array.
Each processor works in synchrony with a global clock.
📌 Advantages:
1. Deterministic execution.
2. Highly parallel and pipelined.
3. Efficient for specific applications like matrix operations, DSP.
📌 Disadvantages:
📌 Applications:
Signal processing.
Matrix operations (e.g., convolution in deep learning).
Cryptographic hardware.
📌 Examples:
1. Amdahl’s Law
Amdahl’s Law is used to predict the theoretical speedup of a system when a portion of it is
improved. It states:
Speedup = 1 / ((1 − P) + P / S)
where:
P = fraction of execution time that benefits from the improvement.
S = Speedup of the improved portion.
This helps identify diminishing returns when optimizing system components.
2. Little’s Law
L = λ × W, where:
L = Average number of tasks in the system.
λ = Average arrival rate of tasks.
W = Average task wait time.
This law is critical for designing efficient processors and memory systems.
3. Power-Performance Tradeoff
Techniques like dynamic voltage scaling (DVS) and clock gating help reduce power
consumption.
Caching: Reduces memory access time by storing frequently used data closer to the
CPU.
Performance measurement involves analyzing a system’s efficiency and speed using various
metrics.
Cycles per instruction (CPI): Average number of clock cycles per instruction.
2. Benchmarks
The inclusion property ensures that all data present in a smaller, faster cache level (e.g., L1) is
also present in the larger cache level below it (e.g., L2).
This simplifies coherence tracking in multiprocessor systems and ensures consistency in data retrieval.
Spatial Locality: Memory locations near recently accessed data are likely to be accessed
soon.
Any block from main memory can be placed in any cache block.
Larger caches can store more data, reducing the frequency of cache misses.
3.3 Prefetching
Using L1, L2, and L3 caches improves performance by reducing main memory accesses.
4. Virtual Memory Organization, Mapping, and Management
4.2 Paging
The page table keeps track of mapping between virtual and physical addresses.
Reduces fragmentation but requires Translation Lookaside Buffer (TLB) for fast lookup.
4.3 Segmentation
Divides memory into variable-sized segments based on logical divisions (e.g., code,
stack, heap).
Hierarchical Page Tables: Reduce memory overhead by splitting page tables into levels.
Inverted Page Table: Uses a single page table indexed by frame number rather than page
number.
Replaces the page that will not be used for the longest time.
Requires future knowledge, so it is not practical but serves as a benchmark.
By applying these techniques, memory hierarchy can be optimized to achieve better system
performance.
o VLIW (Very Long Instruction Word) Architectures: Using wide instruction words
to encode multiple operations.
2.1 Pipelining
Instruction Pipeline: Divides instruction execution into multiple stages (Fetch, Decode,
Execute, Memory, Write-Back).
Instructions are executed as resources become available, rather than in program order.
By leveraging these ILP techniques, modern processors achieve significant speedup and
efficiency in executing parallel workloads.
COMPUTER ORGANISATION BASICS
Stored-Program Computer: Organization and Execution
Registers: Small, fast storage locations within the CPU to hold data
and intermediate results. Key registers include:
2. Memory:
Allow the system to interact with the outside world by receiving input
and providing output.
4. Bus System:
The Program Counter (PC) points to the memory address where the
next instruction is stored.
2. Decode:
3. Execute:
4. Store (Optional):
5. Repeat:
The sequence continues with the next instruction being fetched (Step
1). The process continues until the program ends, typically when a
“halt” instruction is encountered or a specific condition is met.
1. Resource Management:
2. Process Management:
4. User Interface:
Provides a Command-Line Interface (CLI) or Graphical User
Interface (GUI) for user interaction with the system.
Compiler:
2. Error Checking:
Assembler:
An assembler translates assembly language (a low-level programming
language) into machine code.
Converts mnemonics (e.g., MOV, ADD, JMP) into binary code for CPU
execution.
Summary of Differences
Component Function
In computer architecture and programming, the concepts of operator, operand, registers, and
storage are fundamental components in how data is manipulated and processed. Here's an
explanation of each:
1. Operator
Examples:
o Arithmetic operators: +, -, *, /
o Logical operators: AND, OR, NOT
o Comparison operators: =, <, >, !=
o Bitwise operators: AND, OR, XOR
2. Operand
An operand is the data or value on which an operator performs its operation. Operands can be
constants (literal values), variables, or expressions that hold data.
Examples:
o In the expression 5 + 3, the operands are 5 and 3.
o In x * y, the operands are x and y.
3. Registers
Registers are small, fast storage locations within the processor (CPU) that are used to hold data
temporarily during the execution of instructions. They are essential for the operation of the CPU,
as they store operands and results of operations, memory addresses, and control information.
Types of registers:
o Data registers: Store intermediate data during calculations.
o Address registers: Store memory addresses for accessing data.
o Program counter (PC): Stores the address of the next
instruction to execute.
o Status registers/Flags: Store flags (like zero, carry, overflow)
indicating the status of operations.
4. Storage
Storage refers to memory or devices that store data persistently, as opposed to registers, which
hold data temporarily. Storage is typically slower than registers, but it has much larger capacity.
Types of storage:
o Primary storage (RAM): Temporarily stores data and
instructions that are actively being used or processed by the
CPU. It is volatile, meaning it loses data when power is off.
o Secondary storage: Non-volatile storage like hard drives, solid-
state drives (SSDs), or optical disks, used for long-term data
storage.
o Cache memory: A smaller, faster type of volatile memory
located close to the CPU, used to store frequently accessed data
for quick retrieval.
Summary
Instruction Format
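A hypothetical 16-bit layout, purely for illustration (the field widths are assumptions, not a real ISA): Opcode (4 bits) | Operand 1 (5 bits) | Operand 2 (5 bits) | Mode (2 bits). For instance, ADD R1, R2 would encode the ADD opcode, the two register operands, and a register addressing mode.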
This example assumes a format where the opcode is followed by two operands and a mode field.
Instruction Set
An instruction set (or instruction set architecture, ISA) is a collection of all the instructions
that a particular CPU can execute. It defines the operations the processor can perform, the types
of operands it can work with, and how instructions are formatted.
Types of Instruction Sets:
Addressing Modes
Addressing modes define the method used to access data (operands) for an instruction. They
specify where the operands are located and how they can be referenced by the instruction.
Different addressing modes provide flexibility in how data is manipulated and accessed.
1. Immediate Addressing: The operand is a constant value embedded directly within the
instruction itself.
o Example: ADD R1, #5 (Add the constant value 5 to the value in
register R1).
4. Indirect Addressing: The operand is located in memory, but the instruction specifies a
register that contains the memory address of the operand.
o Example: MOV R1, [R2] (Move the value stored at the memory
address in register R2 into register R1).
5. Indexed Addressing: The effective memory address is computed by adding a constant
value (index) to the contents of a register.
o Example: MOV R1, [R2 + 5] (Move the value at the memory
address calculated by adding 5 to the contents of register R2 into
register R1).
6. Base-Register Addressing: Similar to indexed addressing, but the base address is stored
in a specific register, and an offset is added to it.
o Example: MOV R1, [R2 + R3] (Move the value at the memory
address computed by adding the values in registers R2 and R3
into register R1).
8. Register Indirect Addressing: The operand is accessed by first retrieving the memory
address from a register.
o Example: MOV R1, (R2) (Move the value stored at the memory
address contained in register R2 into R1).
Summary:
In computer systems, numbers are represented in binary format for processing. There are two
primary ways to represent numbers: fixed-point representation and floating-point
representation. Both have different advantages and are used in different contexts depending on
the requirements of precision, range, and the type of calculations.
1. Fixed-Point Representation
Characteristics:
Fixed Precision: The number of digits before and after the decimal
point is fixed, which means that there is a limited range for both the
integer and fractional parts.
Integer-Based: The number is stored as an integer, and operations
like multiplication or division are done using integer arithmetic. The
decimal point’s position is implied based on the scaling factor.
Advantages:
Disadvantages:
2. Floating-Point Representation
In floating-point representation, numbers are represented in a way that allows for a dynamic
decimal point. This representation is more flexible and is able to handle a much wider range of
values, including very large and very small numbers.
The standard IEEE 754 format defines the structure of floating-point numbers. The most
common formats are:
Advantages:
Disadvantages:
Summary:
This section covers several fundamental concepts in digital circuits, arithmetic logic units
(ALUs), and algorithms for fixed-point and floating-point operations. Below is an
explanation of the key points:
Ripple Carry Adder (RCA): The simplest type of adder. It consists of a series of full
adders connected in a chain, where each full adder takes the carry input from the previous
adder and produces a carry output for the next adder. The main drawback of the RCA is
that it can be slow because the carry bit ripples through all the stages.
Structure:
o Each full adder has 3 inputs: two bits to be added and the carry
input (Cin).
o It produces two outputs: the sum (S) and the carry output (Cout).
o The carry propagation slows down the operation for large bit-
widths.
Carry Look-Ahead Adder (CLA): A faster adder design that solves the delay problem
of the ripple carry adder. It uses a carry look-ahead logic to predict the carry outputs in
advance, reducing the delay compared to the RCA.
Key Principles:
An ALU is a digital circuit that performs arithmetic and logical operations on binary data. It is a
key component in a processor or microcontroller.
Booth's algorithm is a multiplication algorithm that handles both positive and negative numbers
in binary form. It is an efficient way to perform signed multiplication.
Steps:
Booth's algorithm reduces the number of required partial products, making it faster than a simple
bit-by-bit multiplication.
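A small worked example of the recoding idea (assuming 8-bit operands): to multiply by −4 = 1111 1100₂, Booth's algorithm appends an extra 0 bit and scans adjacent bit pairs; the only 1→0 transition is at bit position 2, so the multiplier is recoded as a single −1 × 2². The product M × (−4) therefore needs just one subtraction of M shifted left by 2, instead of adding a partial product for each of the six 1 bits in the multiplier.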
Steps:
The IEEE 754 standard is a widely used standard for representing floating-point numbers in
binary format. It defines the format for 32-bit (single precision) and 64-bit (double precision)
floating-point numbers.
Format:
The number is represented as: value = (−1)^S × 1.M × 2^(E − 127)
Special Values:
Carry Generation and Carry Propagation are key concepts in digital circuits, especially in the
design of binary adders like the Ripple Carry Adder (RCA) or the more advanced Carry
Lookahead Adder (CLA).
Example: If A_i = 1 and B_i = 1, the stage generates a carry regardless of the incoming carry (G_i = A_i · B_i = 1).
Example: If exactly one of A_i and B_i is 1, an incoming carry propagates through the stage to the next one (P_i = A_i ⊕ B_i = 1).
These two concepts are crucial for optimizing the speed of binary adders, as they determine how
quickly carries can be calculated and propagated through the entire operation.
A Carry Look-Ahead Adder (CLA) is an advanced type of binary adder used to improve the
speed of addition by reducing the time delay associated with carry propagation in traditional
ripple carry adders. The primary goal of a CLA is to compute the carries in parallel, thus
speeding up the addition process.
Working Principle:
In a traditional adder like the Ripple Carry Adder (RCA), carries are computed sequentially
from the least significant bit (LSB) to the most significant bit (MSB), causing delays as each bit
must wait for the previous carry to be computed. In contrast, the CLA works by precomputing
the carry signals using carry generation and carry propagation logic, enabling it to generate
carries in parallel for all bits.
The CLA utilizes two key concepts: Carry Generation (G) and Carry Propagation (P).
Key Equations:
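For reference, the standard carry look-ahead relations (as usually stated) are:
G_i = A_i · B_i (generate), P_i = A_i ⊕ B_i (propagate)
C_(i+1) = G_i + P_i · C_i, which expands to:
C_1 = G_0 + P_0·C_0
C_2 = G_1 + P_1·G_0 + P_1·P_0·C_0
C_3 = G_2 + P_2·G_1 + P_2·P_1·G_0 + P_2·P_1·P_0·C_0
Sum bits: S_i = P_i ⊕ C_i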
The carry generation equations for multiple bits are computed in parallel, allowing the
adder to operate faster than the RCA.
CLA Logic:
1. Speed: CLA significantly speeds up the addition process by reducing the propagation
delay of carries. Since carry bits are calculated in parallel, the time complexity is
reduced.
2. Scalability: CLA can be extended to add larger numbers by increasing the number of bits
in the carry look-ahead circuit.
3. Efficiency: Unlike the Ripple Carry Adder, which requires time to propagate carries
through each bit, CLA minimizes this time, making it ideal for high-speed applications
like processors.
1. Complexity: The CLA is more complex to design than simpler adders like Ripple Carry
Adders. The logic circuits required for carry generation and propagation grow
rapidly in gate count and fan-in as the bit-width increases.
2. Area: CLA requires more logic gates than the Ripple Carry Adder, resulting in higher
hardware costs and greater silicon area for implementation.
3. Power Consumption: Due to the complexity of the logic, CLA consumes more power
compared to simpler adders.
Conclusion:
The Carry Look-Ahead Adder (CLA) is a powerful solution for fast binary addition by
addressing the major bottleneck in traditional adder designs, which is carry propagation. While it
significantly improves speed, it comes at the cost of increased hardware complexity, area, and
power consumption. CLA is well-suited for high-performance applications, such as processors,
where speed is critical.
Arithmetic Logic Unit (ALU)
Functions of ALU:
1. Arithmetic Operations:
o Addition: Adds two operands.
o Subtraction: Subtracts one operand from another.
o Multiplication: Multiplies two operands (although multiplication
can sometimes be handled by separate circuits in some
systems).
o Division: Divides one operand by another (similar to
multiplication, this may be offloaded in certain architectures).
2. Logical Operations:
o AND: Performs bitwise AND operation.
o OR: Performs bitwise OR operation.
o XOR: Performs bitwise exclusive OR operation.
o NOT: Performs bitwise NOT operation, flipping all the bits of an
operand.
3. Shift Operations:
o Shift Left/Right: Shifts the bits of a number left or right, often
used for multiplication or division by powers of two.
Structure of ALU:
The ALU's operation is controlled by the control unit of the CPU, which sends control signals to
the ALU. The control signals determine which operation (arithmetic or logical) the ALU should
perform and may also dictate additional operations like setting flags based on the results.
The ALU is an essential part of the CPU architecture. It works closely with other components
like:
Advantages of ALU:
Speed: ALUs are optimized for fast execution of arithmetic and logical
operations, essential for the overall performance of the CPU.
Versatility: They support a wide range of operations, making them
suitable for various applications in computing, from basic calculations
to more complex logical decisions.
Disadvantages of ALU:
Conclusion:
The Arithmetic Logic Unit (ALU) is a critical component in any digital computer, responsible
for executing fundamental arithmetic and logical operations. It plays a central role in the
processing power of CPUs and is an integral part of the system's overall functionality, driving
tasks ranging from simple calculations to complex decision-making processes.
Serial Adder:
Step-1:
The two shift registers A and B are used to store the
numbers to be added.
Step-2:
A single full adder is used to add one pair of bits at a time
along with the carry.
Step-3:
The contents of the shift registers are shifted from left to
right, and their serial outputs, starting with bits a and b, are
fed into the single full adder along with the output of the
carry flip-flop on each clock pulse.
Step-4:
The sum output of the full adder is fed to the most
significant bit of the sum register.
Step-5:
The content of sum register is also shifted to right when
clock pulse is applied.
Step-6:
After four clock pulses, the sum of the contents of the two
registers (A and B) is stored in the sum register.
In computer architecture, the memory unit is a crucial component that stores data and
instructions that the CPU can access for execution. Effective memory unit design and CPU-
memory interfacing are key to enhancing the overall performance of a computer system. Let’s
break down the design of a memory unit, focusing particularly on CPU-memory interfacing.
The memory unit in a computer system is typically composed of several types of memory, each
serving different purposes and characteristics. These include:
The design of memory systems in modern computers aims to minimize the latency of accessing
data from memory and to maximize throughput.
3. CPU-Memory Interfacing
The CPU-memory interface involves communication between the processor and the memory
unit. It determines how data and instructions are transferred between these components. The key
aspects of CPU-memory interfacing include the following:
Address Bus: A collection of lines used to carry memory addresses. The width of the
address bus (number of lines) determines the amount of addressable memory. For
example, a 32-bit address bus can address up to 4 GB of memory (2^32).
Data Bus: A collection of lines that carry the actual data to and from memory. The width
of the data bus (number of lines) influences the amount of data that can be transferred per
clock cycle.
Control Bus: A collection of lines used to carry control signals that manage the
operations between the CPU and memory. This includes signals like:
o Read/Write: Indicates whether data is being read from or
written to memory.
o Memory Access (or Chip Select): Determines which memory
module is being accessed.
o Clock: Synchronized timing for data transfers.
4. Bus Architecture
A bus is a system of communication pathways used for transferring data between the CPU and
memory. A common bus architecture includes:
Single Bus Systems: A single bus used for both addressing and data
transfer. This can be inefficient in systems with high-speed
requirements.
Multiple Bus Systems: Separate buses for data, address, and control
signals. This helps in improving the speed of memory operations, as
these buses can operate simultaneously.
5. Memory Hierarchy
Registers: Directly inside the CPU, used for very fast data access.
Cache Memory: Sits between the CPU and RAM to store frequently
accessed data.
Main Memory (RAM): Stores the programs and data that are in use.
Secondary Memory: Provides long-term storage for data and
programs.
DMA is a method by which peripherals can access memory directly, without involving the CPU.
This frees up the CPU to perform other tasks while data transfer is taking place. DMA is
typically used for high-speed data transfer tasks, such as disk operations, audio/video data, and
networking.
Synchronous Access: The memory and CPU operate in sync with the same clock cycle.
This makes the timing predictable and simpler but can limit speed if the memory is
slower.
Asynchronous Access: Memory access occurs without synchronization with the CPU
clock. This can allow faster operation but requires complex timing protocols.
Pipelined Memory Access: Data access is staged in multiple steps to allow one stage of
memory access to occur while the previous one is still in process. This increases
throughput but requires sophisticated control mechanisms.
Burst Mode: In this mode, multiple data words are transferred in a single operation,
allowing for faster data transfer than standard single-word access.
Interleaving: Memory is divided into multiple banks, and data can be read or written to
different banks simultaneously. This improves throughput by reducing memory access
bottlenecks.
Virtual Memory: Uses a combination of RAM and secondary memory (e.g., hard disk)
to simulate a large amount of memory, with the operating system managing data
swapping between RAM and disk storage.
9. Implementation Challenges
Latency: Memory access time is critical. Techniques like cache memory and pipelining
help mitigate the latency involved in accessing memory.
Data Consistency: In multi-core processors or systems with multiple memory
hierarchies, ensuring that data remains consistent across various levels of memory is
complex. Cache coherence protocols help manage this.
Bandwidth: The bandwidth of the memory system determines the amount of data that
can be transferred per unit of time. High-bandwidth systems are necessary for
applications that involve large data sets, such as gaming or data analytics.
Conclusion
The design of a memory unit and CPU-memory interfacing requires careful attention to speed,
efficiency, and scalability. By optimizing the communication between the CPU and memory
through advanced techniques like cache memory, pipelining, interleaving, and DMA, overall
system performance can be significantly enhanced. Additionally, the implementation of modern
memory hierarchies ensures that data is accessed quickly and efficiently, meeting the demands of
various computational tasks.
1. Memory Organization
Memory organization refers to how data is structured and accessed within a computer’s memory
system. It is essential for improving system performance, ensuring efficient data retrieval, and
optimizing storage space. The organization of memory depends on the type of memory used, its
access method, and its purpose in the system.
Flat Memory Organization: In flat memory systems, all memory locations are viewed
as part of a single, continuous address space. This is typical in smaller or less complex
systems where there is no need to separate different types of memory (e.g., data vs. code).
Hierarchical Memory Organization: Modern computers employ hierarchical memory
systems, where different levels of memory (such as registers, cache, main memory, and
secondary storage) are organized according to speed and capacity. Faster memory (like
registers and cache) is used to store frequently accessed data, while slower memory (like
hard drives or SSDs) stores larger amounts of data.
Address Space Partitioning: Memory can be organized into partitions to separate
system programs, application programs, and user data. This partitioning improves
security and allows better management of resources. Examples of this are segmenting
memory in operating systems using techniques like paging or segmentation.
Virtual Memory Organization: In a virtual memory system, the virtual address space is
divided into fixed-size blocks called pages, and physical memory is divided into frames of
the same size. The operating system maps virtual pages to physical frames. This gives the
illusion of a larger memory than physically available and allows better memory management.
Direct Access: Memory locations are accessed directly via address lines, typical of
Random Access Memory (RAM).
Sequential Access: Memory locations must be accessed in a specific sequence. Tape
drives, for example, use sequential access.
Random Access: Any memory location can be accessed directly and in any order, typical
of systems with RAM and cache.
2. Static and Dynamic Memory
Memory can be broadly classified into static memory and dynamic memory, based on how
they store data and the power required for their operation.
Static Memory
Static memory retains its data as long as power is supplied to the system. This type of memory
does not require periodic refreshing and is faster but more expensive than dynamic memory. The
most common form of static memory is Static RAM (SRAM).
SRAM (Static RAM): SRAM stores data in flip-flop circuits, which maintain their state
as long as power is on. It is used primarily in cache memory due to its speed and
reliability.
Characteristics of Static Memory:
o Faster than dynamic memory.
o No need for periodic refresh cycles.
o More expensive to manufacture.
o Lower memory density compared to dynamic memory.
Dynamic Memory
Dynamic memory loses its data when the power is turned off and requires periodic refreshing to
maintain the data stored in it. The most common type of dynamic memory is Dynamic RAM
(DRAM).
DRAM (Dynamic RAM): DRAM stores data in capacitors, which naturally leak charge
over time, requiring constant refreshing to retain data. DRAM is widely used for main
memory due to its higher storage capacity and lower cost.
Characteristics of Dynamic Memory:
o Slower than static memory.
o Requires periodic refreshing of data.
o More cost-effective for high-capacity memory.
o Higher density, allowing more data to be stored in the same
physical space.
3. Memory Hierarchy
Memory hierarchy is a structure that organizes different types of memory in a layered manner,
based on access speed, cost, and capacity. The idea behind memory hierarchy is to provide fast
access to frequently used data and store less frequently accessed data in slower, larger memories.
Levels of Memory Hierarchy
1. Registers: The fastest and smallest form of memory, directly inside the CPU. They store
data that the CPU is currently processing.
2. Cache Memory: Located between the CPU and main memory, cache memory stores
copies of frequently used data from main memory. It is much faster than RAM but has a
limited capacity. There are typically multiple levels of cache (L1, L2, L3).
3. Main Memory (RAM): This is the primary storage used to hold running programs and
data that the CPU actively uses. It is larger than cache but slower.
4. Secondary Storage (Disk/SSD): This includes hard drives, solid-state drives, and optical
discs. Secondary storage is slower but has much higher capacity than main memory.
4. Associative Memory
Content-Based Searching: Data is accessed by matching the content (value) rather than
using a specific address. For example, if a system needs to find the location of a specific
word, it compares the word with every entry in memory.
Parallel Search: All memory locations are searched simultaneously, which makes
associative memory fast in performing lookups. This is particularly useful in applications
requiring rapid data retrieval, such as database systems or routing tables in networking.
Applications:
o Cache Management: Associative memory is used in cache
systems, where it helps quickly find a specific value stored in
cache memory.
o Pattern Matching: It is used in AI systems and pattern
recognition tasks where identifying patterns from data is needed.
o Networking: In routers, associative memory is used for fast
lookups in routing tables.
Limitations
Conclusion
Memory organization, static and dynamic memory, memory hierarchy, and associative memory
each play essential roles in optimizing the efficiency, speed, and capacity of modern computer
systems. Effective memory design ensures that data is accessed and processed as quickly as
possible while balancing performance and cost.
Cache Memory
Cache memory is a high-speed storage medium located between the CPU and main memory
(RAM), designed to speed up data access. It stores frequently used data and instructions so that
the processor can access them faster than if it were to retrieve them from the main memory.
Cache memory operates much faster than main memory, which reduces the time the CPU spends
waiting for data. Typically, the data stored in the cache comes from the main memory, and when
the CPU needs data, it first checks the cache before accessing the slower main memory.
Cache memory is organized into levels, with each level having its own speed and size
characteristics. L1 cache is the smallest but fastest, located directly on the CPU chip. L2 cache is
larger but slower and can be located either on the CPU chip or near it. L3 cache is the largest but
slowest, typically shared by multiple processor cores.
Working Principle: The fundamental idea behind cache memory is to exploit temporal and
spatial locality. Temporal locality refers to the likelihood that recently accessed data will be
accessed again in the near future. Spatial locality indicates that data near the recently accessed
data is likely to be accessed soon as well. Cache systems exploit both types of locality to keep
relevant data close to the processor.
When the processor needs to access data, it checks if the data is in the cache. If the data is found
(a cache hit), the processor can proceed without waiting. If the data is not found (a cache miss),
the processor retrieves the data from main memory, and this data is then stored in the cache for
future access.
The performance of cache memory is usually measured in terms of the cache hit rate, which is
the percentage of memory accesses that are satisfied by the cache. A high hit rate improves the
overall performance of the system.
Cache Coherence and Consistency: In systems with multiple processors or cores, cache
coherence becomes important. Each processor may have its own cache, and ensuring that each
cache contains the most up-to-date data is essential. Cache coherence protocols, such as MESI
(Modified, Exclusive, Shared, Invalid), manage this by coordinating cache updates.
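As a rough sketch (simplified; real protocols also involve bus transactions and write-backs that are omitted here), the MESI state of a single cache line can be modelled as a small state machine driven by local accesses and snooped accesses from other processors:

#include <stdio.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } MesiState;

/* Next state on a *local* processor read.
 * 'others_have_copy' models the snoop response from other caches. */
MesiState on_local_read(MesiState s, int others_have_copy) {
    if (s == INVALID)
        return others_have_copy ? SHARED : EXCLUSIVE;
    return s;  /* M, E, S are unchanged by a local read */
}

/* Next state on a *local* write: the line becomes Modified and, in the
 * real protocol, copies in other caches are invalidated. */
MesiState on_local_write(MesiState s) {
    (void)s;
    return MODIFIED;
}

/* Next state when *another* processor reads the same line (snooped read).
 * A Modified line must first be written back to memory (omitted here). */
MesiState on_snooped_read(MesiState s) {
    if (s == MODIFIED || s == EXCLUSIVE) return SHARED;
    return s;
}

/* Next state when another processor writes the line: our copy is stale. */
MesiState on_snooped_write(MesiState s) {
    (void)s;
    return INVALID;
}

int main(void) {
    MesiState s = INVALID;
    s = on_local_read(s, 0);   /* -> EXCLUSIVE */
    s = on_local_write(s);     /* -> MODIFIED  */
    s = on_snooped_read(s);    /* -> SHARED    */
    s = on_snooped_write(s);   /* -> INVALID   */
    printf("final state: %d (0=M, 1=E, 2=S, 3=I)\n", s);
    return 0;
}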
Advantages of Cache Memory:
Speed: Cache memory significantly reduces the time taken by the CPU to fetch data from main memory.
Efficiency: It minimizes the CPU's idle time and optimizes the performance of programs.
Cost-effective: Compared to upgrading to larger, faster main memory, increasing cache size is often a more affordable way to improve performance.
Virtual Memory
Virtual memory is a memory management technique that allows a computer to compensate for
physical memory shortages by temporarily transferring data from the RAM to disk storage. It
provides the illusion to the user and programs that they have access to a large and contiguous
block of memory, even if the system's actual physical memory is limited. This is achieved by
using both the computer's RAM and secondary storage (like hard drives or SSDs) to simulate a
larger pool of memory.
Working Principle: The key concept behind virtual memory is the abstraction of memory into
virtual addresses, which the system uses to map to physical addresses in RAM. This allows
programs to reference memory locations as if they have access to a large address space, even
though the system may not have enough physical memory to accommodate all of them
simultaneously.
When a program accesses data, the operating system checks whether the data is currently in the
main memory (RAM). If the data is not in memory (a page fault), it is loaded from the secondary
storage (usually a hard drive or SSD) into RAM. The operating system swaps data between
RAM and disk as needed, a process known as paging or swapping.
Page and Page Tables: Virtual memory is divided into small, fixed-size blocks called "pages"
(typically 4 KB each). Similarly, physical memory is divided into "frames" of the same size. The
operating system maintains a page table that maps virtual pages to physical memory frames.
Each entry in the page table corresponds to a virtual page and its corresponding physical frame.
The operating system uses the page table to translate the virtual page number into a physical
frame number. This process allows the program to access memory without worrying about the
actual physical location.
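To make the translation concrete, the sketch below is a minimal C model with a made-up 4 KB page size and a tiny hand-filled page table (real page tables are multi-level and managed by the OS and MMU); it converts a virtual address into a physical address, or signals a page fault when the page is not resident:

#include <stdio.h>

#define PAGE_SIZE 4096          /* 4 KB pages */
#define NUM_PAGES 8             /* tiny address space for illustration */

/* page_table[v] holds the physical frame number for virtual page v,
 * or -1 if the page is not currently in memory (a page fault). */
int page_table[NUM_PAGES] = { 2, 5, -1, 7, -1, 0, 1, -1 };

/* Translate a virtual address; returns -1 to signal a page fault. */
long translate(unsigned long vaddr) {
    unsigned long vpage  = vaddr / PAGE_SIZE;   /* virtual page number    */
    unsigned long offset = vaddr % PAGE_SIZE;   /* offset within the page */
    if (vpage >= NUM_PAGES || page_table[vpage] < 0)
        return -1;                              /* page fault             */
    return (long)page_table[vpage] * PAGE_SIZE + offset;
}

int main(void) {
    unsigned long v = 1 * PAGE_SIZE + 123;      /* virtual page 1, offset 123 */
    long p = translate(v);
    if (p < 0)
        printf("page fault at virtual address %lu\n", v);
    else
        printf("virtual %lu -> physical %ld\n", v, p);
    return 0;
}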
Demand Paging and Thrashing: In a system using virtual memory, demand paging is used to
load pages only when they are needed. When a program accesses a page that is not currently in
RAM, a page fault occurs, and the page is brought into memory. However, if the system is
overburdened with too many page faults, it can experience "thrashing." Thrashing occurs when
the system spends more time swapping pages in and out of memory than executing instructions,
significantly degrading performance.
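The effect of having too few frames can be seen in a small simulation. The sketch below is illustrative only (the page reference string and frame count are made up): it services a sequence of page references with demand paging and a simple FIFO replacement policy, counting page faults.

#include <stdio.h>

#define NUM_FRAMES 3

int main(void) {
    int frames[NUM_FRAMES];
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};  /* page reference string */
    int n = sizeof(refs) / sizeof(refs[0]);
    int next = 0;        /* FIFO position of the next frame to replace */
    int used = 0;        /* how many frames are filled so far          */
    int faults = 0;

    for (int i = 0; i < n; i++) {
        int present = 0;
        for (int j = 0; j < used; j++)
            if (frames[j] == refs[i]) { present = 1; break; }

        if (!present) {                       /* page fault: load on demand */
            faults++;
            if (used < NUM_FRAMES) {
                frames[used++] = refs[i];
            } else {
                frames[next] = refs[i];       /* FIFO replacement */
                next = (next + 1) % NUM_FRAMES;
            }
        }
    }
    printf("page faults: %d out of %d references\n", faults, n);
    return 0;
}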
In conclusion, virtual memory enables efficient use of system resources by abstracting the
memory hierarchy and allowing programs to run even when there isn't enough physical memory
available. However, it requires careful management to prevent performance degradation.
Data Path Design for Read/Write Access
In digital systems, especially within a processor or microcontroller, the data path refers to the
collection of functional units and interconnections that perform operations such as data
movement, arithmetic, and logic operations. It consists of registers, multiplexers, ALUs
(Arithmetic Logic Units), buses, and memory elements that work together to execute read and
write operations efficiently. The design of the data path is crucial in determining how data is
accessed, moved, and processed within a system.
1. Registers: These are small, fast storage elements that hold data temporarily. Registers are
used to store operands for arithmetic and logic operations, as well as intermediate results.
o General-purpose registers: Used by the CPU for storing
operands, results, and temporary data.
o Special-purpose registers: These include the Program Counter
(PC), Stack Pointer (SP), and status registers, which control and
store the state of the system.
2. Memory: Memory is used for storing both program instructions and data. The system
typically employs both primary memory (RAM) and cache memory to improve
read/write performance.
o Read/Write Memory: A region of memory from which data can
be both read and written.
o ROM (Read-Only Memory): Memory that can only be read, not
written.
3. Multiplexers (MUX): Multiplexers are used to select between multiple input sources and
direct the selected input to a particular output. In data paths, MUXes are used to choose
between different data sources, such as registers, memory, or ALUs, depending on the
operation to be performed.
4. Buses: Buses are used to carry data between registers, memory, and functional units (like
the ALU). A data bus, address bus, and control bus are typically present in the data path
design.
o Data bus: Carries the data between registers, memory, and
other components.
o Address bus: Carries the memory addresses for reading from or
writing to memory.
o Control bus: Carries signals that determine the operation being
performed, such as read, write, or execute.
5. Arithmetic Logic Unit (ALU): The ALU performs arithmetic and logical operations on
the data. It receives inputs from registers or memory and produces output based on the
operation being executed (addition, subtraction, AND, OR, etc.).
6. Control Unit (CU): The control unit sends signals to the other components of the data
path, controlling the operation and flow of data. It decodes instructions, determines what
operations need to be executed, and sends appropriate control signals to the ALU,
memory, and registers.
The process of designing a data path that facilitates both read and write operations involves
determining how data is moved, manipulated, and stored in the system. Here is how read and
write operations typically occur within a data path:
1. Read Operation:
o When the processor needs to read data from memory, the
address of the data is sent over the address bus.
o The control unit issues a signal to enable the memory to be
read (often a "read" signal).
o The memory sends the data back over the data bus to the
register or ALU for further processing. If the read data is to be
used immediately, it is written to a register.
o Depending on the design, the register or ALU might act as the
next destination for the read data, based on the current
operation.
2. Write Operation:
o In a write operation, the data to be written is sent from a
register or ALU through the data bus to the target memory
location.
o The address where the data is to be written is sent over the
address bus.
o The control unit sends a "write" signal to enable writing in
memory.
o The data is written to the specified address in memory.
To handle read/write access efficiently, the data path needs to support different types of read and
write scenarios:
1. Register-to-Memory Write:
o The register provides the data, which is sent over the data bus to
memory.
o The address for where the data will be written is sent over the
address bus.
o The control unit issues a "write" signal to memory, enabling the
memory to accept the data.
2. Memory-to-Register Read:
o The control unit generates a "read" signal to fetch data from
memory.
o The address for the memory location is sent through the address
bus.
o Data from memory is sent back through the data bus to a
register or ALU.
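A very small register-transfer model helps visualise these scenarios. The sketch below is illustrative C with made-up register and memory sizes and no timing or control-signal detail: it performs a memory-to-register read, an ALU add, and a register-to-memory write, mirroring the steps listed above.

#include <stdio.h>

#define NUM_REGS 4
#define MEM_SIZE 16

int reg[NUM_REGS];       /* general-purpose registers   */
int mem[MEM_SIZE];       /* word-addressed main memory  */

/* Memory-to-register read: address on the address bus, data returned
 * on the data bus and latched into a register. */
void load(int r, int addr)  { reg[r] = mem[addr]; }

/* Register-to-memory write: data from a register driven onto the data
 * bus and stored at the address presented on the address bus. */
void store(int r, int addr) { mem[addr] = reg[r]; }

/* ALU operation on two registers, result written back to a register. */
void add(int rd, int rs1, int rs2) { reg[rd] = reg[rs1] + reg[rs2]; }

int main(void) {
    mem[0] = 7;
    mem[1] = 5;
    load(0, 0);          /* R0 <- mem[0]    */
    load(1, 1);          /* R1 <- mem[1]    */
    add(2, 0, 1);        /* R2 <- R0 + R1   */
    store(2, 2);         /* mem[2] <- R2    */
    printf("mem[2] = %d\n", mem[2]);
    return 0;
}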
Key design considerations for the read/write data path include:
1. Efficiency: The data path should minimize the number of clock cycles required to
perform a read or write operation. This is achieved through efficient memory addressing,
register management, and control signal design.
2. Pipeline Design: Pipelining can be used to overlap different stages of data processing
(fetch, decode, execute, memory access, and write-back) to speed up read and write
operations.
3. Access Time: The system should minimize the access time to memory. Techniques such
as cache memory, read buffers, or write buffers are often employed to reduce the time
needed for read/write access.
4. Data Integrity: Proper synchronization between different components of the data path is
crucial to ensure that data is written to and read from the correct locations at the
appropriate time.
5. Control Signals: Proper generation and management of control signals are essential to
select the right path for data movement and to specify whether a read or write operation is
to be performed.
6. Parallelism: In more advanced systems, multiple read and write operations may be
handled simultaneously using multiple memory banks or multiple ALUs.
Conclusion
The data path design for read/write access is a critical aspect of computer architecture. It defines
how data flows within the system, from memory to registers and through functional units like the
ALU. Effective data path design ensures efficient data retrieval and storage, minimizing latency
and maximizing throughput. Careful attention to control signals, memory management, and
optimization techniques such as pipelining and caching is required to enhance performance and
support complex operations.
The Control Unit (CU) is a critical component of the central processing unit (CPU) responsible
for directing the operation of the processor. It generates control signals that manage the activities
of the CPU, including instruction fetching, decoding, and execution. There are two primary
approaches to designing a control unit: Hardwired Control and Microprogrammed Control.
Both have their distinct characteristics, advantages, and limitations.
Hardwired Control Unit
A hardwired control unit uses fixed logic circuits, such as gates, flip-flops, and decoders, to
produce control signals. These control signals dictate the operation of the CPU based on the
instruction being executed. The control logic is hardcoded in hardware, meaning that any change
in the operation requires a physical modification of the circuit.
Working Principle:
In a hardwired control unit, the control signals are generated using combinational logic circuits
based on the opcode (operation code) of the instruction. The opcode is decoded by the control
unit, and the necessary signals for data movement, ALU operation, and memory access are
produced.
The control unit receives the instruction from memory, decodes it, and then generates the
appropriate control signals for the execution of the instruction. The control signals are generated
for operations like:
Register reads/writes
ALU operations (addition, subtraction, etc.)
Memory read/write operations
Instruction fetching
Conditional branching
Design Process:
The instruction set and the control signals required by each instruction are identified first; a state diagram of the fetch-decode-execute sequence is drawn; Boolean expressions for each control signal are derived from the opcodes and timing states; and the resulting logic is implemented directly with gates, decoders, and flip-flops.
Example:
For a simple instruction like ADD R1, R2, the hardwired control unit will decode the ADD opcode, assert the signals that read registers R1 and R2, select the addition operation in the ALU, and enable write-back of the result to R1, all through fixed combinational logic. A simplified code model of this decoding follows.
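The fixed mapping from opcode to control signals can be sketched as pure combinational logic, here modelled with a C switch statement. The opcode set and signal layout below are hypothetical, chosen only for illustration, not taken from any real processor:

#include <stdio.h>

/* Hypothetical control signals asserted by the control unit. */
typedef struct {
    int reg_read;    /* read source registers            */
    int alu_add;     /* ALU performs addition            */
    int alu_sub;     /* ALU performs subtraction         */
    int mem_read;    /* read from data memory            */
    int reg_write;   /* write result back to a register  */
} ControlSignals;

enum { OP_ADD, OP_SUB, OP_LOAD };   /* made-up opcodes */

/* In a hardwired control unit this mapping is fixed logic (gates and
 * decoders); changing it means changing the hardware. */
ControlSignals decode(int opcode) {
    ControlSignals c = {0};
    switch (opcode) {
    case OP_ADD:  c.reg_read = 1; c.alu_add = 1; c.reg_write = 1; break;
    case OP_SUB:  c.reg_read = 1; c.alu_sub = 1; c.reg_write = 1; break;
    case OP_LOAD: c.mem_read = 1; c.reg_write = 1;                break;
    }
    return c;
}

int main(void) {
    ControlSignals c = decode(OP_ADD);
    printf("ADD: reg_read=%d alu_add=%d reg_write=%d\n",
           c.reg_read, c.alu_add, c.reg_write);
    return 0;
}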
Advantages:
Speed: Since hardwired control uses fixed logic circuits, the generation
of control signals is very fast. The control unit can operate at high
speeds with minimal delay.
Simplicity: The design of a hardwired control unit is relatively
straightforward and involves using standard combinational logic
circuits.
Efficiency: For simple and small systems, hardwired control is often
more efficient in terms of both performance and hardware complexity.
Disadvantages:
Inflexibility: Any change to the instruction set requires a physical redesign of the control logic.
Complexity: For CPUs with large instruction sets, the combinational logic becomes large and difficult to design, debug, and modify.
Microprogrammed Control Unit
A microprogrammed control unit generates control signals by executing microinstructions stored in a control memory (typically ROM). Each machine instruction is carried out by a sequence of microinstructions, called a microroutine.
Working Principle:
Each microinstruction typically contains:
Control fields: These fields specify the control signals for the different
units (ALU, memory, registers).
Address field: This specifies the address of the next microinstruction
to be executed.
Design Process:
The microinstruction format (its control fields and address field) is defined first; a microroutine is then written for each machine instruction; and the complete microprogram is stored in control memory, from which a microsequencer fetches one microinstruction per control step.
Example:
For the ADD R1, R2 instruction, the microprogrammed control unit might:
Fetch the microinstruction for the "ADD" operation from control
memory.
Generate control signals for reading from registers R1 and R2.
Generate a signal to perform addition in the ALU.
Generate a control signal to write the result back to R1.
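The same behaviour can be expressed as a microprogram. The sketch below is a toy model (the microinstruction fields, control memory contents, and sequencing are invented for illustration): it stores control words in a small control memory and steps through them with a microsequencer loop.

#include <stdio.h>

/* A toy microinstruction: a few control fields plus a next-address field. */
typedef struct {
    int read_regs;    /* drive source registers onto the ALU inputs       */
    int alu_add;      /* perform addition in the ALU                      */
    int write_back;   /* latch the ALU result into the destination        */
    int next;         /* address of the next microinstruction (-1 = done) */
} MicroInstr;

/* Control memory holding the microroutine for a hypothetical ADD R1, R2. */
MicroInstr control_mem[] = {
    { 1, 0, 0,  1 },   /* step 0: read R1 and R2          */
    { 0, 1, 0,  2 },   /* step 1: ALU performs R1 + R2    */
    { 0, 0, 1, -1 },   /* step 2: write result back to R1 */
};

int main(void) {
    int mpc = 0;   /* micro program counter */
    while (mpc >= 0) {
        MicroInstr mi = control_mem[mpc];
        printf("uPC=%d read_regs=%d alu_add=%d write_back=%d\n",
               mpc, mi.read_regs, mi.alu_add, mi.write_back);
        mpc = mi.next;   /* the microsequencer follows the address field */
    }
    return 0;
}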
Advantages:
Flexibility: The instruction set can be changed or extended by updating the microprogram rather than redesigning hardware.
Easier design for complex instruction sets: Complicated instructions are simpler to implement as microroutines than as fixed logic.
Disadvantages:
Speed: Every control step requires a control-memory access, so signal generation is slower than with hardwired logic.
Extra hardware: A control memory and a microsequencer are required.
Maintenance: Hardwired control is difficult to maintain (changes require hardware modification), whereas microprogrammed control is easier to maintain (changes require only microprogram updates).
Conclusion
The choice between hardwired and microprogrammed control depends on the system requirements: hardwired control is preferred when speed is critical and the instruction set is small and fixed (typical of RISC designs), while microprogrammed control is preferred when flexibility and support for a large, complex instruction set matter more than raw control-signal speed (typical of CISC designs).
Pipelining
In a non-pipelined processor, each instruction must pass through all stages of execution sequentially; one instruction is fully executed before the next one begins. In a pipelined processor, instruction execution is divided into smaller stages, and each stage works on a different instruction at the same time. These stages typically include instruction fetch, instruction decode, execute, memory access, and write-back.
The fundamental idea of pipelining is that while one instruction is being executed in one stage, another instruction can be processed in a different stage of the pipeline. For example, while one instruction is in the execute stage, the next instruction can be decoded and a third can be fetched.
This overlap increases instruction throughput, as the CPU works on multiple instructions at the same time, each in a different stage.
Stages of Pipelining
In a typical pipelined architecture, the following five stages are commonly seen in many processors:
1. Instruction Fetch (IF): The next instruction is fetched from memory.
2. Instruction Decode (ID): The instruction is decoded and its register operands are read.
3. Execute (EX): The ALU performs the required arithmetic or logical operation (or computes an effective address).
4. Memory Access (MEM): Data memory is read or written if the instruction requires it.
5. Write-Back (WB): The result is written back to the destination register.
Pipeline Hazards
Pipelining introduces hazards, situations that prevent the next instruction from completing in its expected clock cycle:
1. Data Hazards: These occur when instructions that are close together in the pipeline
depend on the same data. For example, if an instruction needs data that is not yet
available because a previous instruction is still in the pipeline, this creates a delay.
o Read-after-write (RAW) hazard: A later instruction needs a result that an earlier instruction has not yet produced.
o Write-after-write (WAW) hazard: Two instructions write to the same register, and the writes could complete in the wrong order.
o Write-after-read (WAR) hazard: A later instruction writes to a register before an earlier instruction has finished reading it.
2. Control Hazards: These arise when there is a branch instruction (such as a jump or if-
else condition), which can alter the flow of execution. The pipeline may need to be
stalled or flushed to handle branch predictions and fetch the correct instruction.
3. Structural Hazards: These happen when the hardware resources are insufficient to
handle multiple instructions simultaneously. For example, if both instructions need access
to the memory at the same time, it can create a conflict.
4. Pipeline Stalls: These occur when the pipeline cannot proceed due to hazards. A stall
may be necessary to wait for data to become available or for control decisions to be
made. This can reduce the efficiency gains from pipelining.
Pipeline Performance
The performance of a pipelined processor is typically measured by its throughput (the number
of instructions completed per cycle) and latency (the time taken for a single instruction to
complete). Ideally, with perfect pipelining, an instruction would exit the pipeline in every clock
cycle. However, due to stalls and hazards, the ideal throughput is often not achievable.
The speedup achieved by pipelining depends on how efficiently the pipeline is managed and
how well hazards are handled. In practice, the throughput improvement is less than the
theoretical maximum due to the above challenges.
Consider a simple example where the following sequence of three independent instructions needs to be executed: an ADD, a SUB, and a MUL.
Without pipelining, each instruction would take several cycles to complete (fetch, decode,
execute, etc.). However, with pipelining, each instruction can enter the pipeline and execute
concurrently in different stages:
While the ADD instruction is in the execute stage, the SUB instruction
can be decoded, and the MUL instruction can be fetched, effectively
using all parts of the CPU simultaneously.
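The overlap can be shown cycle by cycle. The sketch below is purely illustrative (it ignores hazards and assumes one instruction enters the pipeline per cycle): it prints which stage each of the three instructions occupies in every clock cycle of an ideal five-stage pipeline.

#include <stdio.h>

int main(void) {
    const char *instr[]  = {"ADD", "SUB", "MUL"};
    const char *stages[] = {"IF", "ID", "EX", "MEM", "WB"};
    int n_instr  = 3;
    int n_stages = 5;
    /* With no stalls, instruction i is in stage (cycle - i). */
    int total_cycles = n_stages + n_instr - 1;   /* 5 + 3 - 1 = 7 cycles */

    for (int cycle = 0; cycle < total_cycles; cycle++) {
        printf("cycle %d:", cycle + 1);
        for (int i = 0; i < n_instr; i++) {
            int stage = cycle - i;
            if (stage >= 0 && stage < n_stages)
                printf("  %s:%s", instr[i], stages[stage]);
        }
        printf("\n");
    }
    return 0;
}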
Conclusion
Pipelining raises instruction throughput by overlapping the execution of multiple instructions, but its benefit is limited by data, control, and structural hazards, which must be managed with techniques such as forwarding, stalling, and branch prediction.
RISC Architecture
RISC (Reduced Instruction Set Computing) is a CPU architecture design philosophy that
emphasizes simplicity and efficiency by using a small set of highly optimized instructions. The
core idea behind RISC is to design processors with a relatively small number of simple
instructions that can be executed in a single clock cycle, allowing for high-speed execution and
efficient use of the processor’s pipeline.
Examples of RISC architectures include the ARM architecture, MIPS, and SPARC.
CISC (Complex Instruction Set Computing) and RISC (Reduced Instruction Set Computing) are
two different philosophies in CPU design. While both aim to improve the performance and
efficiency of computing systems, they differ significantly in terms of instruction sets, design
goals, and implementation.
RISC:
o RISC instructions are designed to be simple and to complete in a
single clock cycle. This makes RISC processors highly efficient at
executing a large number of simple instructions.
o With fewer complex instructions, RISC processors tend to have
higher performance for programs with a large number of
instructions that can be executed in parallel.
CISC:
o CISC processors have more complex instructions that may take
multiple cycles to execute, as some instructions may involve
multiple operations (e.g., a single instruction that performs both
a memory access and an arithmetic operation).
o While CISC processors may require fewer instructions to perform
a task, the execution time for each instruction is typically longer
than in RISC.
3. Memory Access
RISC:
o RISC architectures emphasize the use of registers. Most
instructions perform operations on registers rather than directly
on memory.
o Memory access is done using load and store instructions,
meaning that the processor only interacts with memory to load
values into registers or store values from registers.
CISC:
o CISC processors are designed to directly operate on memory,
and instructions may perform memory-to-memory operations
(e.g., an arithmetic instruction that operates on values in
memory).
o This reduces the need for multiple instructions to move data
between registers and memory, which can make the program
code more compact.
4. Instruction Decoding
RISC:
o RISC processors have simpler instruction sets, leading to simpler
and faster instruction decoding. The uniform length of
instructions further simplifies the decoding process.
o This simplicity allows for highly efficient pipelining, where
multiple instructions can be processed simultaneously in
different stages of execution.
CISC:
o CISC processors have more complex instruction sets, which
means that instruction decoding is more complicated and may
take more cycles. This can lead to slower instruction processing
and less efficient pipelining.
o The variable length of CISC instructions makes instruction
decoding more time-consuming, as the processor must first
determine the length of the instruction before it can be decoded.
5. Program Size
RISC:
o RISC programs tend to be larger in size because more
instructions are required to perform a given task. Since each
RISC instruction is simple and performs only one operation, more
instructions are needed to accomplish complex tasks.
o However, the simplicity and regularity of the instructions can
lead to faster execution times for programs that are optimized
for RISC architectures.
CISC:
o CISC architectures are typically more efficient in terms of
program size because each instruction can perform multiple
operations. This can reduce the overall number of instructions
needed to complete a program.
o However, the complexity of CISC instructions can lead to slower
execution times, especially when the processor must decode a
large number of complex instructions.
6. Hardware Complexity
RISC:
o RISC processors tend to have simpler hardware designs, with a
focus on speed and efficiency. The design of a RISC processor is
typically less complex because it has fewer instruction formats
and simpler decoding logic.
o The reduced complexity of RISC hardware allows for higher clock
speeds and easier integration of advanced features like
pipelining.
CISC:
o CISC processors have more complex hardware, with support for a
wide variety of instructions and addressing modes. The decoding
and execution units are more intricate, which can increase the
overall size and cost of the processor.
o The complexity of CISC hardware can make it more challenging
to implement high-performance features like pipelining or out-of-
order execution.
RISC vs. CISC: A Summary
Instruction Length: RISC uses fixed-length instructions; CISC uses variable-length instructions.
Instruction Decoding: RISC decoding is simple and fast; CISC decoding is complex and slower.
Hardware Complexity: RISC has a simple design with fewer features; CISC has a more complex design with more features.
Conclusion
In summary, RISC and CISC are two different processor design philosophies, each with its own
set of trade-offs:
While RISC architectures excel in speed and efficiency due to their simple instruction set and
pipelining capabilities, CISC architectures aim to reduce the program size by using more
complex instructions. The choice between RISC and CISC depends on the specific needs of the
application, with RISC often being preferred for high-performance systems like smartphones and
embedded devices, while CISC has traditionally been used in general-purpose computers.
I/O Operations: Concept of Handshaking, Polled I/O, Interrupts,
and DMA
Input/Output (I/O) operations are a fundamental part of computer systems, allowing the
processor to communicate with external devices (like keyboards, printers, displays, and storage
devices). To manage these operations, different techniques are used to coordinate the transfer of
data between the CPU and I/O devices. Four key concepts in I/O operations are handshaking,
polled I/O, interrupts, and Direct Memory Access (DMA).
1. Concept of Handshaking
Handshaking is a process used for communication between two devices (typically the CPU and
I/O devices) to ensure that data is transferred in a coordinated manner, preventing data loss or
conflicts.
In handshaking, the sending and receiving devices use control signals to indicate their readiness to send or receive data. There are two main steps in the handshaking process: first, the sender asserts a "request" (or strobe/ready) signal to indicate that data is available; second, the receiver asserts an "acknowledge" signal once it has accepted the data, after which both signals are released.
Handshaking ensures that data is transmitted only when both devices are ready, preventing one device from sending data too quickly or too slowly. The exchange can be either synchronous (both devices share a common clock) or asynchronous (coordination relies solely on the request and acknowledge signals).
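A request/acknowledge handshake can be sketched in software. The model below is illustrative only: both sides run in a single thread and the "signals" are plain variables, unlike real hardware control lines.

#include <stdio.h>

/* Simulated control lines between a sender and a receiver. */
int request = 0;       /* sender asserts: "data is valid"       */
int acknowledge = 0;   /* receiver asserts: "data was accepted" */
int data_line = 0;     /* the data being transferred            */

void sender_put(int value) {
    data_line = value;   /* place data on the data lines    */
    request = 1;         /* assert request: data is ready   */
}

int receiver_get(void) {
    int value = 0;
    if (request) {       /* only take data when the sender is ready */
        value = data_line;
        acknowledge = 1; /* assert acknowledge: data accepted       */
    }
    return value;
}

void sender_finish(void) {
    if (acknowledge) {   /* sender sees the acknowledgement            */
        request = 0;     /* drop request ...                           */
        acknowledge = 0; /* ... and (simplified) acknowledge is dropped */
    }
}

int main(void) {
    sender_put(42);
    int v = receiver_get();
    sender_finish();
    printf("received %d (request=%d acknowledge=%d)\n", v, request, acknowledge);
    return 0;
}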
2. Polled I/O
Polled I/O is a method where the CPU continuously checks or "polls" the status of an I/O device
to determine if it is ready for data transfer. In this technique, the CPU repeatedly reads a status
register or flag associated with the I/O device.
How It Works: The CPU periodically checks if the device is ready for
input/output operations. If the device is ready, the CPU will initiate the
appropriate operation (e.g., reading data from an input device or
writing data to an output device).
Polling Loop: The CPU enters a loop where it constantly checks the
device status. If the status indicates the device is ready (e.g., data is
available for reading), the CPU proceeds with the data transfer. If not, it
keeps checking the status in a cyclic manner.
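In code, polling is just a loop on a device status flag. The sketch below simulates it: the "device" is a plain struct updated by a helper function, whereas on real hardware the status and data would be memory-mapped device registers, which are not modelled here.

#include <stdio.h>

/* Simulated device registers (on real hardware these would be
 * memory-mapped and declared volatile). */
struct {
    int ready;   /* status flag: 1 when data is available */
    int data;    /* data register                         */
} device = {0, 0};

/* Helper that stands in for the device eventually producing data. */
void simulate_device(int countdown) {
    if (countdown == 0) {
        device.data = 99;
        device.ready = 1;
    }
}

int main(void) {
    int polls = 0;
    int countdown = 5;

    /* Polling loop: the CPU does nothing useful while it waits. */
    while (!device.ready) {
        simulate_device(countdown--);
        polls++;
    }
    printf("read %d after %d polls\n", device.data, polls);
    return 0;
}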
Disadvantages of Polled I/O:
CPU Wastage: Since the CPU constantly checks the I/O status, it wastes processing power that could be used for other tasks.
Inefficiency: The CPU is tied up checking the device status, which reduces overall system efficiency.
3. Interrupts
An interrupt is a mechanism that allows an I/O device to signal the CPU when it needs
attention, instead of the CPU constantly checking the device status (like in polling). When an
interrupt occurs, the CPU stops executing its current instructions and jumps to a special function
called the interrupt service routine (ISR) to handle the interrupt. After the interrupt is
processed, the CPU resumes its normal execution.
How It Works:
1. An I/O device sends an interrupt signal to the CPU when it is
ready for data transfer (e.g., input data is available or the output
device is ready to receive data).
2. The CPU saves its current state and starts executing the interrupt
service routine (ISR) for the specific device.
3. After the ISR completes the necessary action (like reading input
data or writing output data), the CPU restores its state and
resumes executing the program from where it left off.
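Interrupt-driven I/O can be modelled, very loosely, in ordinary C: the sketch below keeps the main program doing other work and only runs a handler function when a simulated device raises a flag. Real interrupts involve hardware signalling, vector tables, and automatic state saving, none of which is shown here.

#include <stdio.h>

volatile int interrupt_pending = 0;   /* set "by the device" */
volatile int device_data = 0;

/* Stand-in for an interrupt service routine (ISR). */
void isr(void) {
    printf("ISR: handling device data %d\n", device_data);
    interrupt_pending = 0;            /* acknowledge / clear the interrupt */
}

/* Stand-in for the device raising an interrupt at some point in time. */
void simulate_device(int tick) {
    if (tick == 3) {
        device_data = 77;
        interrupt_pending = 1;
    }
}

int main(void) {
    for (int tick = 0; tick < 6; tick++) {
        printf("main program doing useful work (tick %d)\n", tick);
        simulate_device(tick);
        if (interrupt_pending)        /* CPU checks for pending interrupts */
            isr();                    /* between instructions              */
    }
    return 0;
}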
Types of Interrupts: Interrupts are commonly classified as hardware interrupts (raised by external devices) and software interrupts (raised by programs, for example system calls), and further as maskable or non-maskable depending on whether the CPU is allowed to ignore them.
Advantages of Interrupts:
Efficiency: The CPU is not tied up with constant polling; it can perform other tasks and only handle I/O when necessary.
Better Resource Utilization: Interrupts allow more efficient CPU utilization, as the CPU can focus on other tasks and only be interrupted when needed.
Disadvantages of Interrupts: Each interrupt carries overhead, since the CPU must save and later restore its state, and systems with many devices need careful prioritization so that frequent interrupts do not starve the main program.
4. Direct Memory Access (DMA)
Direct Memory Access (DMA) is a technique that allows I/O devices to directly transfer data to
or from memory without involving the CPU for every byte of data. DMA reduces the CPU's
involvement in I/O operations, allowing it to perform other tasks while the data transfer occurs in
the background.
How It Works:
1. The CPU configures the DMA controller with information about
the source and destination of the data transfer (e.g., from I/O
device to memory).
2. The DMA controller takes over the data transfer, moving data
directly between memory and the I/O device without CPU
intervention.
3. Once the data transfer is complete, the DMA controller sends an
interrupt to notify the CPU that the transfer is finished, and the
CPU can proceed with further processing.
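The CPU's role is limited to programming the DMA controller and then continuing with other work. The sketch below models that division of labour: the "controller" is a struct and the transfer is performed by an ordinary loop, and the register names and completion flag are invented for illustration rather than taken from a real device interface.

#include <stdio.h>
#include <string.h>

/* Hypothetical DMA controller registers. */
typedef struct {
    const char *src;   /* source address                             */
    char *dst;         /* destination address                        */
    int count;         /* number of bytes                            */
    int done;          /* completion flag (would raise an interrupt) */
} DmaController;

/* The CPU only writes these registers; it does not copy the data itself. */
void dma_configure(DmaController *dma, const char *src, char *dst, int count) {
    dma->src = src;
    dma->dst = dst;
    dma->count = count;
    dma->done = 0;
}

/* Stand-in for the controller performing the transfer in the background. */
void dma_run(DmaController *dma) {
    memcpy(dma->dst, dma->src, (size_t)dma->count);
    dma->done = 1;     /* in hardware this would trigger an interrupt */
}

int main(void) {
    char io_buffer[] = "block of data from an I/O device";
    char memory[64] = {0};
    DmaController dma;

    dma_configure(&dma, io_buffer, memory, (int)sizeof(io_buffer));
    dma_run(&dma);     /* happens while the CPU would be doing other work */

    if (dma.done)
        printf("DMA complete: \"%s\"\n", memory);
    return 0;
}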
Types of DMA:
Burst Mode DMA: In this mode, the DMA controller transfers all the
data in one go, effectively blocking the CPU until the transfer is
completed.
Cycle Stealing DMA: The DMA controller steals a single cycle at a
time from the CPU to perform part of the data transfer, allowing the
CPU to continue processing between DMA transfers.
Block Mode DMA: The DMA controller transfers a block of data while
the CPU is idle, then signals when the transfer is complete.
Advantages of DMA: It frees the CPU from copying each byte, allowing high-throughput transfers for devices such as disks and network interfaces while the CPU continues other work.
Disadvantages of DMA: It requires a dedicated DMA controller, and the controller competes with the CPU for the memory bus, which can briefly stall the processor; setting up a transfer also has some overhead, so DMA pays off mainly for larger blocks of data.
Comparison of I/O techniques (partial summary): Handshaking uses control signals for synchronization between devices; it is simple and ensures reliable data transfer, but it is slow and can be inefficient for high-speed communication.
Each I/O operation technique—handshaking, polled I/O, interrupts, and DMA—has its
specific use cases and benefits depending on the system's performance requirements.
Handshaking is useful for low-speed devices, while polled I/O is easy but inefficient. Interrupts
provide a more efficient way of handling I/O operations, and DMA offers the highest performance by allowing direct transfers between devices and memory without the CPU's intervention. In
modern systems, DMA and interrupt-driven I/O are the most commonly used for high-
performance and efficient data handling.
Adders: Ripple Carry vs. Carry Look-Ahead
In conclusion, the Carry Look-Ahead Adder is the better choice for high-speed applications and large bit-width arithmetic operations, while the Ripple Carry Adder may still be suitable for simple and small-scale applications where speed is less critical.
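The difference between the two adders lies in how the carries are produced. The sketch below is a 4-bit example in C (the bit width and variable names are chosen only for illustration): it computes the generate and propagate terms and then derives every carry directly from the inputs, instead of letting each carry ripple from the previous stage.

#include <stdio.h>

/* 4-bit carry look-ahead addition.
 * g[i] = a[i] AND b[i]   (stage i generates a carry)
 * p[i] = a[i] XOR b[i]   (stage i propagates an incoming carry)
 * c[i+1] = g[i] OR (p[i] AND c[i]), expanded so that every carry
 * depends only on the inputs and c0, not on the previous sum stage. */
unsigned cla_add4(unsigned a, unsigned b, unsigned c0, unsigned *carry_out) {
    unsigned g[4], p[4], c[5], sum = 0;

    for (int i = 0; i < 4; i++) {
        unsigned ai = (a >> i) & 1u, bi = (b >> i) & 1u;
        g[i] = ai & bi;
        p[i] = ai ^ bi;
    }
    c[0] = c0 & 1u;
    c[1] = g[0] | (p[0] & c[0]);
    c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
    c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c[0]);
    c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
         | (p[3] & p[2] & p[1] & p[0] & c[0]);

    for (int i = 0; i < 4; i++)
        sum |= (p[i] ^ c[i]) << i;   /* sum bit i = p[i] XOR c[i] */

    *carry_out = c[4];
    return sum;
}

int main(void) {
    unsigned cout;
    unsigned s = cla_add4(0x9, 0x7, 0, &cout);   /* 9 + 7 = 16 */
    printf("sum=0x%X carry_out=%u\n", s, cout);
    return 0;
}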
Instruction Formats by Number of Addresses
Zero-address (stack) instructions: Operands are specified implicitly on the stack. Instruction complexity: simple, usually stack-based operations. Example: ADD (operates on the top two stack entries). Efficiency: low, as operations are limited to stack manipulation.
One-address (accumulator) instructions: One operand is specified explicitly; the accumulator is used implicitly. Instruction complexity: simple, but tied to the accumulator. Example: ADD X (adds X to the accumulator). Efficiency: moderate, using the accumulator for fast operations.
Two-address instructions: Two operands are specified, one of which also serves as the destination. Instruction complexity: more complex, allowing flexible operations with two operands. Example: ADD A, B (A = A + B). Efficiency: higher, as two operands are explicitly addressed.
Three-address instructions: Three operands are specified; two are sources and one is the destination. Instruction complexity: more complex, allowing operations with three operands. Example: ADD A, B, C (A = B + C). Efficiency: highest, allowing more complex operations with multiple operands.
Conclusion