CAO Units PDF
ANSWER KEY
SEM/YEAR: IV/II
PART- A
1. Express the equation for the dynamic power required per transistor.
Ans:
P_dynamic = C_L × V² × f
Where:
C_L = capacitive load being switched, V = supply voltage, and f = switching (clock) frequency.
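A small C sketch of this formula; the capacitance, voltage, and frequency values below are purely illustrative assumptions, not figures from the question.

#include <stdio.h>

/* P_dynamic = C_L * V^2 * f, with assumed illustrative values */
int main(void) {
    double C_L = 1.0e-9;   /* capacitive load: 1 nF (assumed) */
    double V   = 1.2;      /* supply voltage: 1.2 V (assumed) */
    double f   = 2.0e9;    /* switching frequency: 2 GHz (assumed) */
    double P_dynamic = C_L * V * V * f;
    printf("P_dynamic = %.2f W\n", P_dynamic);   /* 1e-9 * 1.44 * 2e9 = 2.88 W */
    return 0;
}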
I. Base Address: The address of the operand is calculated relative to the address of the
current instruction (or program counter).
II. Program Counter (PC): The instruction's offset is added to the program counter to
get the effective address.
III. Flexibility: It allows for easy implementation of control flow operations like loops,
branches, and jumps since the address of the next instruction can be given relative to
the current position.
V. Compact Instructions: It reduces the need for large immediate values or fixed
addresses, making instructions more compact, as the offset is usually small and can
fit within a limited number of bits.
Example:
BEQ -10
-10 is the relative offset from the current program counter (PC).
If the program counter is at address 0x1000, the actual address for the branch would be
calculated as: 0x1000 + (−10) = 0x0FF6.
Ans:
a) Stored-Program Concept
c) Pipelining
d) Parallelism
e) Memory Hierarchy
g) Out-of-Order Execution
h) Speculative Execution
Component – Description
Memory – Stores data and instructions that are being processed.
Input Devices – Devices used to input data into the computer, including keyboard, mouse, microphone, and scanner.
Storage Devices – Devices that store data persistently. Examples include Hard Disk Drives (HDDs), Solid State Drives (SSDs), optical drives, and USB flash drives.
6. Interpret the various instructions based on the operations they perform and give one
example to each category.
1. Data Transfer Instructions
Operation: These instructions move data from one location to another, either
between registers, between memory and registers, or from input/output devices to
memory.
Example: MOV A, B
Explanation: This instruction copies the contents of register B into register A.
2. Arithmetic Instructions
Operation: These instructions perform arithmetic computations such as addition, subtraction, multiplication, and division.
Example: ADD A, B
Explanation: This instruction adds the contents of register B to register A and stores the
result in A.
3. Logical Instructions
Operation: These instructions perform logical operations like AND, OR, XOR, and NOT
on data.
Example: AND A, B
Explanation: This instruction performs a bitwise AND operation between the contents of
register A and B and stores the result in A.
4. Control Transfer (Branch/Jump) Instructions
Operation: These instructions modify the sequence of execution by altering the flow
of control in a program, typically through conditional and unconditional jumps,
branches, and calls.
Example: JMP 1000
Explanation: This instruction causes the program to jump to the instruction at memory
location 1000, continuing execution from there.
5. Comparison Instructions
Operation: These instructions compare two operands and set flags (such as zero,
carry, sign, or overflow) based on the result of the comparison.
Example: CMP A, B
Explanation: This instruction compares the contents of register A with register B and sets the
flags based on whether A is equal to, greater than, or less than B.
Ans:
II. Memory: Temporarily stores data and instructions (RAM and Cache).
IV. Input Devices: Allow user interaction with the computer (Keyboard, Mouse).
V. Output Devices: Present the results from the computer (Monitor, Printer).
Ans:
MIPS Code:
g is in register $s0
h is in register $s1
i is in register $s2
j is in register $s3
Ans:
Ans:
12. Measure the performance of the computers: If computer A runs a program in 10 seconds
and computer B runs the same program in 15 seconds, how much faster is A over B?
Ans: The speedup of computer A over computer B can be calculated using the formula:
Speedup = Execution Time of B / Execution Time of A
Given:
Execution Time of A = 10 seconds, Execution Time of B = 15 seconds
Speedup = 15 / 10 = 1.5
This means Computer A is 1.5 times (or 50%) faster than Computer B.
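A short C sketch of the same speedup calculation:

#include <stdio.h>

/* Speedup = Execution Time of B / Execution Time of A */
int main(void) {
    double time_a = 10.0;   /* Computer A: 10 seconds */
    double time_b = 15.0;   /* Computer B: 15 seconds */
    double speedup = time_b / time_a;            /* 1.5 */
    printf("A is %.1fx (%.0f%%) faster than B\n", speedup, (speedup - 1.0) * 100.0);
    return 0;
}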
Ans:
The equation for CPU execution time for a program is formulated as:
CPU Time = Instruction Count × Cycles Per Instruction × Clock Cycle Time
Where:
Instruction Count = Total number of instructions executed by the program.
Cycles Per Instruction (CPI) = Average number of clock cycles per instruction.
Clock Cycle Time = Time taken for one clock cycle = 1 / clock rate.
This equation helps in understanding how different factors affect the execution time of a
program on a CPU.
14. State the need for an indirect addressing mode. Give an example.
Ans:
a) Efficient Memory Utilization – Allows access to large memory areas beyond the
direct addressing limit.
c) Flexibility in Data Access – Useful for handling arrays, linked lists, and dynamic
memory allocation.
d) Code Reusability – The same instruction can work with different data by changing
the pointer.
Example: MOV A, (R1) – loads into A the value from the memory address stored in R1.
15. Show the formula for CPU clock cycles required for a program.
Ans:
The total number of CPU clock cycles required for a program can be calculated using this
formula:
CPU Clock Cycles = Instruction Count (IC) × Cycles Per Instruction (CPI)
Breaking It Down:
Instruction Count (IC) – The total number of instructions the program executes.
Cycles Per Instruction (CPI) – The average number of clock cycles needed to
complete each instruction.
Total Clock Cycles – The final number of CPU cycles required to finish running the
program.
Example:
Let’s say a program has 500 instructions, and on average, each instruction takes 4 cycles to
execute.
Total Clock Cycles = 500 × 4 = 2000 cycles
So, the CPU will need 2000 cycles to complete the program.
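A short C sketch of this calculation; the 1 GHz clock rate used to also derive a CPU time is an assumption for illustration only.

#include <stdio.h>

int main(void) {
    double ic  = 500;          /* instructions in the example above */
    double cpi = 4;            /* average cycles per instruction */
    double clock_rate = 1e9;   /* 1 GHz, assumed for illustration */
    double cycles   = ic * cpi;               /* 2000 cycles */
    double cpu_time = cycles / clock_rate;    /* 2 microseconds */
    printf("cycles = %.0f, CPU time = %.2e s\n", cycles, cpu_time);
    return 0;
}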
Ans: The Stored Program Concept, proposed by John von Neumann, states that instructions
and data are stored in the same memory and treated alike. This allows a computer to fetch
and execute instructions sequentially from memory.
Key Features:
1. Instructions and data are stored in the same memory and treated alike.
2. CPU fetches, decodes, and executes instructions one by one from memory.
3. Programs can be modified (e.g., self-modifying code) since they are stored in
memory.
Example:
When a program runs, instructions are fetched from RAM and executed by the CPU step by
step, rather than requiring manual reconfiguration of hardware.
Thus, the Stored Program Concept enables flexibility, automation, and ease of
programming in modern computers.
Ans:
c) Direct Addressing Mode – The instruction contains the memory address of the
operand.
Example: MOV A, 5000H (Loads value from memory address 5000H into A)
d) Indirect Addressing Mode – The instruction specifies a register (or memory location) that holds the address of the operand.
Example: MOV A, (R1) (Loads value from the address stored in R1)
Ans:
Parameter | Multiprocessor System | Single-Processor System
Reliability | More reliable; if one processor fails, the others can continue. | Less reliable; system failure occurs if the processor fails.
Power Consumption | Higher power consumption due to multiple processors running. | Lower power consumption since only one processor operates.
Ans:
In Relative Addressing Mode, the address of the operand is determined by adding an offset
to the Program Counter (PC). It is mainly used for branching instructions (jumps, loops).
If the PC = 2000H and the offset is +05H, the jump goes to:
Effective Address = 2000H + 05H = 2005H
Key Points:
20. Consider the following performance measurements for a program. Which computer has
the higher MIPS rating?
Measurement | Computer A | Computer B
CPI | 1.0 | 1.1
Computer A:
CPI = 1.0
MIPS rating = Clock Rate / (CPI × 10⁶) = (4 × 10⁹) / (1.0 × 10⁶) = 4000 MIPS
Computer B:
CPI = 1.1
MIPS rating = (4 × 10⁹) / (1.1 × 10⁶) ≈ 3636.36 MIPS
Conclusion:
Since Computer A has the higher MIPS rating (4000 vs. 3636.36), it has a higher instruction
throughput.
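A short C sketch of the MIPS-rating comparison; the 4 GHz clock rate is an assumption consistent with the 4000 and 3636.36 figures above.

#include <stdio.h>

/* MIPS rating = Clock Rate / (CPI * 10^6) */
int main(void) {
    double clock_rate = 4.0e9;                  /* assumed 4 GHz for both machines */
    double cpi_a = 1.0, cpi_b = 1.1;
    double mips_a = clock_rate / (cpi_a * 1e6); /* 4000 MIPS */
    double mips_b = clock_rate / (cpi_b * 1e6); /* about 3636.36 MIPS */
    printf("MIPS A = %.2f, MIPS B = %.2f\n", mips_a, mips_b);
    return 0;
}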
PART – B
Ans: The eight great ideas of computer architecture are fundamental concepts that have
shaped the design and evolution of computer systems over time. These ideas are key to
understanding how computers function efficiently and are integral to both hardware and
software design. Here’s a summary of the eight great ideas:
1. Instruction Set Architecture (ISA)
What it is: The ISA is the interface between the software and the hardware. It defines
the set of instructions that a processor can execute, along with the machine-level
operations and the format of the instructions.
Why it matters: The ISA allows software to interact with the hardware in a
predictable way. It serves as the foundation for writing assembly programs, which
directly control the machine.
2. Abstraction
What it is: Abstraction involves hiding the details of how something works while
exposing only the necessary aspects to the user. In computer architecture,
abstraction is used to manage complexity.
4. Parallelism
5. Memory Hierarchy
What it is: Memory hierarchy refers to the organization of memory systems with
varying speeds and sizes, from the fastest (registers and caches) to the slowest (main
memory and storage). The goal is to provide faster data access by using smaller,
faster memory levels closer to the CPU.
6. Virtualization
What it is: Virtualization allows a single physical machine to run multiple virtual
machines, each with its own operating system and applications, as if they were
separate physical machines.
7. Latency and Throughput
What it is: Latency refers to the time it takes to process a single piece of data, while
throughput refers to the amount of data that can be processed over a period of time.
Why it matters: Optimizing for both latency and throughput is essential for balancing
the speed of individual operations (latency) with the overall capacity of the system
(throughput). Designing systems to efficiently handle both is key for performance in
various applications, from web servers to data centers.
8. Dependability
What it is: Dependability refers to the ability of a system to consistently perform its
functions without failure. It encompasses reliability, availability, safety, and security.
Ans: Building processors involves a variety of technologies that address both the physical
hardware design and the optimization of computational performance. These technologies
are used to create the core components of processors, such as the arithmetic logic units
(ALUs), control units, memory, and interconnects. Here’s an explanation of the major
technologies used in building processors:
1. Transistor and Semiconductor Technology
What it is: Transistors are the building blocks of modern processors. They function as
switches to control the flow of electrical signals. Semiconductor technology,
especially silicon-based technology, is used to fabricate transistors on chips.
Key Technologies:
o Moore’s Law: This principle states that the number of transistors on a chip
doubles approximately every two years. This enables constant improvements
in processing power and efficiency.
2. Parallelism
Key Technologies:
3. Pipeline Architecture
What it is: Pipeline architecture involves breaking down the processor’s instruction
execution process into several stages, allowing multiple instructions to be processed
at once in different stages.
Key Technologies:
o Deep Pipelines: Modern processors use deep pipelines with many stages
(fetch, decode, execute, etc.), which can allow instructions to be processed
more efficiently.
4. Memory Hierarchy
What it is: Memory hierarchy refers to the different levels of memory in a computer
system (e.g., registers, cache, main memory, disk storage) with varying access speeds
and sizes.
Key Technologies:
o Cache Memory: Small, fast memory located close to the CPU to store
frequently accessed data, improving overall performance.
5. Clocking
What it is: Clocking involves the use of a clock signal to synchronize the operations of
different components of a processor. The clock signal determines the timing of
instruction execution.
Why it matters: Proper clocking ensures that all parts of the processor work together
correctly. A high clock speed leads to more operations per second, improving
performance, but must be balanced with power consumption and heat dissipation.
Key Technologies:
o Clock Gating: A technique to save power by turning off clock signals to parts
of the processor that are not in use.
o Global vs. Local Clocking: Global clocking uses one clock for the entire
processor, while local clocking uses multiple clocks, reducing synchronization
problems in complex processors.
6. Interconnect Technologies
What it is: Interconnects are the pathways used to transfer data between
components of the processor and between processors in a system. High-speed
interconnects are essential for ensuring fast data communication.
Why it matters: The speed and efficiency of interconnects directly impact the
processor’s ability to communicate with memory, I/O devices, and other processors.
Key Technologies:
7. Power Management
What it is: Processor power consumption has become a critical consideration as chip
performance improves. Power management techniques are used to reduce energy
use while maintaining performance.
Why it matters: As processors become more powerful, the heat and energy
consumption increase. Efficient power management can help reduce energy costs,
heat dissipation, and extend battery life in mobile devices.
Key Technologies:
8. Custom Processors and ASICs
What it is: Custom processors or ASICs are specialized hardware designs optimized
for specific applications, such as machine learning, cryptocurrency mining, or video
processing.
Why it matters: Custom processors allow for highly efficient execution of specific
tasks, often outperforming general-purpose processors (like CPUs) in those areas.
Key Technologies:
o ASIC Design: Creating custom processors for tasks that benefit from highly
optimized hardware, such as AI acceleration or encryption.
2. List the various components of a computer system and explain with a neat diagram.
A computer system consists of various components that work together to perform tasks,
process data, and communicate with external devices. These components can be broadly
categorized into hardware and software components, but for this explanation, we will focus
primarily on the hardware components.
1. Central Processing Unit (CPU)
2. Memory
3. Input Devices
4. Output Devices
5. Motherboard
6. Power Supply Unit (PSU)
7. Bus System
8. Communication Devices
1. Central Processing Unit (CPU)
Description: The CPU is the "brain" of the computer and is responsible for executing
instructions and performing calculations. It consists of several key parts:
o Arithmetic and Logic Unit (ALU): Performs all arithmetic and logical
operations (addition, subtraction, comparison, etc.).
o Control Unit (CU): Directs the flow of instructions and data within the system.
o Registers: Small, fast storage locations within the CPU used to hold data
temporarily during processing.
2. Memory
3. Input Devices
Description: These devices allow the user to provide data to the computer.
4. Output Devices
Description: Output devices are used to display the processed data or results of
computation to the user.
5. Motherboard
Description: The motherboard is the main circuit board that houses the CPU,
memory, and other crucial components. It also provides the electrical connections
and pathways for communication between different parts of the computer system.
o It contains slots for expansion cards (e.g., graphics cards, network cards) and
connectors for peripheral devices.
6. Power Supply Unit (PSU)
Description: The power supply unit converts electricity from a wall outlet into a
usable form for the computer's internal components.
7. Bus System
Description: The bus system refers to the communication pathways used by the CPU
to communicate with memory and other components. Buses carry data, addresses,
and control signals.
o Types of Buses:
Address Bus: Carries memory addresses from the CPU to other parts
of the system.
8. Communication Devices
Description: These devices enable the computer to communicate with external
devices and networks.
Ans: The addressing mode is the method to specify the operand of an instruction. The job of
a microprocessor is to execute a set of instructions stored in memory to perform a specific
task. Operations require the following: The operator or opcode which determines what will
be done.
ii) Describe the basic addressing modes for MIPS and give one suitable example instruction
to each category
MIPS (Microprocessor without Interlocked Pipeline Stages) is a RISC (Reduced Instruction Set
Computing) architecture that supports several addressing modes to access operands
efficiently. Below are the fundamental addressing modes used in MIPS, along with example
instructions for each.
1. Immediate Addressing
📌 Example:
addi $t0, $t1, 5 # $t0 = $t1 + 5
Here, 5 is an immediate value added to register $t1, and the result is stored in $t0.
2. Register Addressing
📌 Example:
add $t0, $t1, $t2 # Both operands come directly from registers
3. Direct Addressing
MIPS generally does not use direct addressing because it follows a load-store
architecture.
MIPS does not have direct addressing, but other architectures may use it like:
LDA 1000 # Load data from memory address 1000 into the accumulator (not MIPS)
4. Indirect Addressing
📌 Example:
lw $t0, 0($t1) # Load the word at the address held in $t1
Here, the value stored at the memory location pointed to by $t1 is loaded into $t0.
5. Base (Displacement) Addressing
📌 Example:
lw $t0, 4($t1) # Effective address = contents of $t1 + 4
This loads a word from memory at the address $t1 + 4 into $t0.
6. PC-Relative Addressing
📌 Example:
beq $t0, $t1, LABEL # Branch target is computed relative to the PC
7. Pseudo-Direct Addressing
Used in jump instructions, where the target address is partially specified in the
instruction.
📌 Example:
j TARGET # Jump to address TARGET
The address is formed by combining part of the PC with the target field.
In computer hardware, operands refer to the data on which instructions operate, and
operations define the tasks performed by the CPU. The execution of instructions involves
fetching, decoding, and executing operations on operands stored in registers, memory, or
immediate values.
Operands are the data elements used in computation. They can be classified into the
following types:
a) Types of Operands
Operations define the type of computation or task performed by the CPU. The key categories
of operations include:
Example Instructions:
Example Instructions:
c) Control Operations
Example Instructions:
d) I/O Operations
e) Floating-Point Operations
Example Instructions:
Computers perform different types of operations to process data efficiently. Among them,
logical operations and control operations play crucial roles in decision-making and program
execution.
1. Logical Operations
Logical operations involve bitwise manipulations on binary data. These operations are
primarily used in Boolean logic, bitwise calculations, and condition checking.
Logical operations work on bits (0s and 1s). The common types include:
o Assembly Example:
o Assembly Example:
4. NOT (~) – Inverts the bits (1s become 0s and vice versa).
o Assembly Example:
5. Shift Operations
Assembly Example:
Assembly Example:
Control operations manage the flow of execution in a program. These operations enable
decision-making, looping, and function calls.
2. Unconditional Jumping
o Example:
o Calls a function (subroutine) and returns to the main program after execution.
o Example:
4. Loop Control
o Example:
o Loop:
The power wall concept in computer architecture refers to the power consumption barrier
faced by modern processors due to increasing transistor density and clock speeds: power
dissipation and heat, rather than raw clock rate, become the limit on performance.
As CPU speeds increase, their power consumption and heat generation rise
exponentially.
To overcome these limitations, modern processor designs focus on power efficiency rather
than raw clock speed. Key strategies include:
a) Multi-Core Processors
Instead of increasing clock speed, processors now have multiple cores that execute
tasks in parallel.
b) Power-Efficient Architectures
Reduced Instruction Set Computing (RISC) architectures like ARM focus on lower
power consumption.
Dynamic voltage and frequency scaling (DVFS) adjusts power usage based on
workload.
c) Cooling and Power Management
Liquid cooling and better thermal materials help dissipate heat efficiently.
Low-power states (e.g., Intel’s SpeedStep, AMD’s Cool’n’Quiet) reduce power usage
when the CPU is idle.
d) Heterogeneous Computing
Increased focus on energy efficiency and specialized processors (like GPUs, TPUs).
Conclusion
The PowerWall processor challenge has led to innovative solutions in modern computing,
shifting focus from raw speed to efficient power consumption. Today's processors achieve
high performance without excessive power usage by leveraging multi-core architectures,
low-power designs, and intelligent power management.
6. Consider three different processors PI, P2, and P3 executing the same instruction set. PI
has a 3 GHz clock rate and a CPI of 1.5. P2 has a 2.5 GHz clock rate and a CPI of 1.0. P3 has a
4.0 GHz clock rate and has a CPI of 2.2.
i).Which processor has the highest performance expressed in instructions per second?
ii).If the processors each execute a program in 10 seconds, find the number of cycles and the
number of instructions?
iii).We are trying to reduce the execution time by 30% but this leads to an increase of 20% in
the CPI. What clock rate should we have to get this time reduction?
Ans: To analyze the performance of the three processors P1, P2, and P3, let's go step
by step.
Given Data
(i) Which processor has the highest performance in Instructions Per Second (IPS)?
Instructions Per Second (IPS) is given by:
IPS = Clock Rate / CPI
For P1: IPS = 3.0 × 10⁹ / 1.5 = 2.0 × 10⁹ instructions per second
For P2: IPS = 2.5 × 10⁹ / 1.0 = 2.5 × 10⁹ instructions per second
For P3: IPS = 4.0 × 10⁹ / 2.2 ≈ 1.82 × 10⁹ instructions per second
Answer:
Processor P2 has the highest IPS at 2.5 billion instructions per second.
(ii) If each processor executes a program in 10 seconds, find the number of cycles and the
number of instructions executed.
Clock Cycles = Clock Rate × Execution Time
For P1: Cycles = 3.0 × 10⁹ × 10 = 30 × 10⁹
For P2: Cycles = 2.5 × 10⁹ × 10 = 25 × 10⁹
For P3: Cycles = 4.0 × 10⁹ × 10 = 40 × 10⁹
Instruction Count = Clock Cycles / CPI
For P1: Instructions = 30 × 10⁹ / 1.5 = 20 × 10⁹
For P2: Instructions = 25 × 10⁹ / 1.0 = 25 × 10⁹
For P3: Instructions = 40 × 10⁹ / 2.2 ≈ 18.18 × 10⁹
Answer:
Processor | Instructions (×10⁹) | Clock Cycles (×10⁹)
P1 | 20 | 30
P2 | 25 | 25
P3 | 18.18 | 40
(iii) If execution time is reduced by 30% and CPI increases by 20%, what should be the new
clock rate?
Given that execution time is reduced by 30%, the new execution time is:
T_new = 10 × 0.7 = 7 seconds, and the new CPI = 1.2 × old CPI.
For P1: New Clock Rate = (Instruction Count × New CPI) / T_new = (20 × 10⁹ × 1.8) / 7 ≈ 5.14 GHz
Thus, to achieve a 30% reduction in execution time, P1 needs a new clock rate of 5.14 GHz.
Final Answers:
Processor | Instructions (×10⁹) | Clock Cycles (×10⁹)
P1 | 20 | 30
P2 | 25 | 25
P3 | 18.18 | 40
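A short C sketch that reproduces the calculations above (IPS, cycles, instruction counts, and the new clock rate for P1):

#include <stdio.h>

int main(void) {
    const char *name[] = {"P1", "P2", "P3"};
    double clock[] = {3.0e9, 2.5e9, 4.0e9};
    double cpi[]   = {1.5, 1.0, 2.2};
    double t = 10.0;                                   /* each program runs for 10 s */
    for (int i = 0; i < 3; i++) {
        double ips = clock[i] / cpi[i];                /* instructions per second */
        double cycles = clock[i] * t;                  /* total clock cycles */
        double instructions = cycles / cpi[i];         /* instructions executed */
        printf("%s: IPS=%.2e cycles=%.2e instructions=%.2e\n", name[i], ips, cycles, instructions);
    }
    /* (iii) for P1: 30 percent less time, 20 percent higher CPI */
    double new_clock = (20e9 * 1.5 * 1.2) / (10.0 * 0.7);   /* about 5.14 GHz */
    printf("P1 new clock rate = %.2f GHz\n", new_clock / 1e9);
    return 0;
}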
i).By how much must we improve the CPI of FP instructions if we want the program to run
two times faster?
iii).By how much is the execution time of the program improved if the CPI of INT and FP
Instructions are reduced by 40% and the CPI of L/S and Branch is reduced by 30%?
Given Data:
o FP: 1
o INT: 1
o L/S: 4
o Branch: 2
Let the new CPI of FP instructions be CPI_newFP. The new total cycle count is:
Since CPI cannot be negative, this isn't possible. Reducing only FP CPI won't be enough to
double the speed.
Let the new CPI of L/S instructions be CPI_newL/S, and keep all other CPI values unchanged.
Answer:
The CPI of L/S instructions must be reduced from 4 to 0.8 to make the program run
2× faster.
(iii) Execution Time Improvement with 40% CPI Reduction for FP and INT, and 30%
Reduction for L/S and Branch
Improvement Factor
Speedup = T_original / T_new = 0.256 / 0.1712 ≈ 1.496
Percentage Improvement
Improvement = (1 − T_new / T_original) × 100 = (1 − 0.1712 / 0.256) × 100 ≈ 33.1%
Final Answer
Execution time improves by 33.1% when FP and INT CPI are reduced by 40% and L/S
and Branch CPI are reduced by 30%.
8. Recall how performance is calculated in a computer system and derive the necessary
performance equations.
Execution time (also called response time or latency) depends on the instruction count, clock
cycles per instruction (CPI), and clock cycle time. The fundamental performance equation is:
CPU Execution Time = Total Clock Cycles × Clock Cycle Time
Since Total Cycles is determined by the number of instructions executed (IC\text{IC}) and the
average CPI, we substitute:
Total Cycles = IC × CPI
where IC is the instruction count and CPI is the average cycles per instruction, so:
CPU Time = IC × CPI × Clock Cycle Time
Since clock rate is the inverse of clock cycle time, we can also write:
CPU Time = (IC × CPI) / Clock Rate
Another important performance metric is Instructions Per Second (IPS), which is the number
of instructions a processor can execute per second:
IPS = Instruction Count / Execution Time = Clock Rate / CPI
The higher the IPS, the faster the processor can execute instructions. However, IPS alone
does not fully determine performance because different instructions take different amounts
of time to execute.
A related metric is MIPS (Millions of Instructions Per Second), MIPS = Clock Rate / (CPI × 10⁶).
While MIPS is useful, it does not always reflect real performance, because different
instruction sets have different instruction complexities.
If we improve only part of a system (e.g., floating point operations), the overall speedup is
limited by the fraction of execution time that part contributes. Amdahl's Law states:
Overall Speedup = 1 / ((1 − F) + F / S)
where:
F = the fraction of execution time affected by the improvement, and S = the speedup of that fraction.
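A short C sketch of Amdahl's Law; the 40% fraction and 4x enhancement used here are illustrative values, not figures from the question.

#include <stdio.h>

/* Overall Speedup = 1 / ((1 - F) + F / S) */
double amdahl(double F, double S) {
    return 1.0 / ((1.0 - F) + F / S);
}

int main(void) {
    /* e.g., floating point work is 40% of run time and is made 4x faster (assumed) */
    printf("Overall speedup = %.3f\n", amdahl(0.40, 4.0));   /* about 1.429 */
    return 0;
}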
Relative performance is defined as:
(Performance of A / Performance of B)
= (Execution Time of B / Execution Time of A)
If given that Processor A is faster than processor B, that means execution time of A is less
than that of execution time of B. Therefore, performance of A is greater than that of
performance of B. Example – Machine A runs a program in 100 seconds, Machine B runs the
same program in 125 seconds
(Performance of A / Performance of B)
= (Execution Time of B / Execution Time of A)
= 125 / 100 = 1.25
That means machine A is 1.25 times faster than Machine B. And, the time to execute a given
program can be computed as:
Since clock cycle time and clock rate are reciprocals, this gives:
Execution time
= Instruction Count x CPI x clock cycle time
= Instruction Count x CPI / clock rate
Ans:
Clock speed
Clock speed is the number of pulses the central processing unit's (CPU) clock generates per
second. It is measured in hertz.
CPU clocks can sometimes be sped up slightly by the user. This process is known as
overclocking. The more pulses per second, the more fetch-decode-execute cycles that can be
performed and the more instructions that are processed in a given space of time.
Overclocking can cause long term damage to the CPU as it is working harder and producing
more heat.
Cache size
Cache is a small amount of high-speed random access memory (RAM) built directly within
the processor. It is used to temporarily hold data and instructions that the processor is likely
to reuse.
The bigger its cache, the less time a processor has to wait for instructions to be fetched.
Number of cores
A processing unit within a CPU is known as a core. Each core is capable of fetching, decoding
and executing its own instructions.
The more cores a CPU has, the greater the number of instructions it can process in a given
space of time. Many modern CPUs are dual (two) or quad (four) core processors. This
provides vastly superior processing power compared to CPUs with a single core.
10. i) Illustrate the following sequence of instructions and identify the addressing modes
used and the operation done in every instruction
i) Move (R5)+, R0
j) Add (R5)+, R0
k) Move R0, (R5)
l) Move 16(R5), R3
m) Add #40, R5
Ans: Illustration of Instructions, Addressing Modes, and Operations
Let's analyze the given sequence of instructions, identify their addressing modes, and
explain the operation performed.
Operation: (R5)+ uses autoincrement addressing; R0 uses register addressing.
1. The value at the memory location pointed to by R5 is loaded into R0, and R5 is then incremented.
Illustration:
Memory [1000] 50 -
R0 - 50
Operation: Autoincrement addressing; the value at the memory location now pointed to by R5 is added to R0, and R5 is incremented again.
Illustration:
R5 1004 1008
Memory [1004] 30 -
R0 50 50 + 30 = 80
Operation: Register indirect addressing; the contents of R0 are stored into the memory location pointed to by R5.
Illustration:
R5 1008 1008
R0 80 80
Register Before Execution After Execution
Memory [1008] - 80
Operation: Indexed (displacement) addressing.
1. Fetch the value from the memory location (R5 + 16) and store it in R3.
Illustration:
R5 1008 1008
Memory[1024] (1008+16) 70 -
R3 - 70
Operation: Immediate addressing; the constant 40 is added to R5 (R5 = R5 + 40).
Illustration:
R5 1008 1048
ii) Calculate which code sequence will execute faster according to execution time for the
following conditions:
Consider the computer with three instruction classes and CPI measurements as given below
and instruction counts for each
instruction class for the same program from two different compilers are given. Assume that the computer's
clock rate is 1 GHz.
Compiler 1 2 1 2
Compiler 2 2 1 1
Ans: To determine which code sequence executes faster, we need to calculate the execution
time for each compiler based on the given CPI values and instruction counts.
Since the clock rate is 1 GHz (1 × 10⁹ Hz), the execution time simplifies to:
Compiler CPI for Class 1 CPI for Class 2 CPI for Class 3
Compiler 1 2 1 2
Compiler 2 2 1 1
Compiler 1:
Compiler 2:
11. Consider two different implementation of the same instruction (13) set architecture,
The instruction can be divided into four classes according to their CPI ( class A,B,C and
D). P1 with clock rate 2.5 GHz and CPIs of 1, 2, 3, and 3 respectively and P2 with clock
rate 3 GHz and CPIs of 2, 2, 2 and 2 respectively. Given a program with a dynamic
instruction count of 1.0 × 10⁹ instructions divided into classes as follows: 10% class A,
20% class B, 50% class C, and 20% class D, which implementation is faster? What is the
global CPI for each implementation? Find the clock cycles required in both cases.
Ans: To determine which implementation is faster, we will compute the execution time for
both processors P1 and P2. This requires calculating the global CPI, total clock cycles, and
then computing the execution time.
Processor Specifications
Processor Clock Rate CPI for Class A CPI for Class B CPI for Class C CPI for Class D
P1 2.5 GHz 1 2 3 3
P2 3.0 GHz 2 2 2 2
Instruction Distribution
Global CPI for P1 = 0.10×1 + 0.20×2 + 0.50×3 + 0.20×3 = 2.6
Global CPI for P2 = 0.10×2 + 0.20×2 + 0.50×2 + 0.20×2 = 2.0
Clock Cycles for P1 = 1.0 × 10⁹ × 2.6 = 2.6 × 10⁹ cycles
Clock Cycles for P2 = 1.0 × 10⁹ × 2.0 = 2.0 × 10⁹ cycles
Step 5: Conclusion
Execution Time of P1 = 2.6 × 10⁹ / 2.5 × 10⁹ = 1.04 s; Execution Time of P2 = 2.0 × 10⁹ / 3.0 × 10⁹ ≈ 0.67 s.
Therefore P2 is faster, with a global CPI of 2.0 versus 2.6 for P1.
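A short C sketch of the global-CPI and execution-time calculation above:

#include <stdio.h>

int main(void) {
    double mix[4]    = {0.10, 0.20, 0.50, 0.20};   /* classes A, B, C, D */
    double cpi_p1[4] = {1, 2, 3, 3};
    double cpi_p2[4] = {2, 2, 2, 2};
    double ic = 1.0e9;                             /* dynamic instruction count */
    double g1 = 0, g2 = 0;
    for (int i = 0; i < 4; i++) { g1 += mix[i] * cpi_p1[i]; g2 += mix[i] * cpi_p2[i]; }
    printf("P1: global CPI=%.1f time=%.3f s\n", g1, ic * g1 / 2.5e9);   /* 2.6, 1.040 s */
    printf("P2: global CPI=%.1f time=%.3f s\n", g2, ic * g2 / 3.0e9);   /* 2.0, 0.667 s */
    return 0;
}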
Ans:
Parameter | Single Processor Systems | Multiprocessor Systems
Description | The name itself says that a single processor system contains only one processor for processing. | The name indicates that a multiprocessor system contains two or more processors for processing.
Reliability | Less reliable, because a failure of the one processor brings the whole system down. | More reliable, because failure of one processor does not stop the system; the remaining processors continue.
ii) Analyze how instructions that involve decision making are executed with an example.
beq $t0, $t1, LABEL # Branch to LABEL if $t0 == $t1
add $t2, $t2, $t3 # Normal execution: This runs if branch is not taken
sub $t4, $t4, $t5
LABEL:
o The CPU decodes the instruction and reads the values of $t0 and $t1.
3. Execution (EX):
4. Decision Making:
o If $t0 == $t1, the CPU branches to LABEL (jumps to the new address).
o The CPU executes add $t2, $t2, $t3, and then sub $t4, $t4, $t5 before
reaching LABEL.
In pipelined processors, branch instructions cause hazards (delays) because the CPU
prefetches instructions assuming sequential execution.
Branch Prediction techniques help mitigate delays by guessing whether a branch will
be taken.
Conclusion
13. Analyze the various instruction formats and illustrate them with an example.
Ans:
Used for arithmetic and logical operations that involve only registers.
Breakdown:
Rs = $t1
Rt = $t2
Rd = $t0
Breakdown:
Rs = $t1
Rt = $t0
Fields: opcode (6 bits) | target address (26 bits)
Breakdown:
3. Example Illustration
j 10000 # J-type
Instruction Breakdown
add $s0, $s1, $s2 → R-type: op = 000000, rs = $s1, rt = $s2, rd = $s0, shamt = 00000, funct = 100000
Conclusion
Each format is optimized for efficient execution and reduced instruction size in machine
code.
14. i) Summarize the compilation of assignment statements into MIPS with suitable
examples.
Ans:
Assignment statements in high-level languages (such as C or Java) are compiled into MIPS
assembly using load/store operations and arithmetic instructions. Since MIPS is a load/store
architecture, data must be explicitly loaded from memory to registers before processing and
then stored back.
int a = 10;
MIPS Assembly
int a = b + c;
MIPS Assembly
int a = b * 5;
MIPS Assembly
A[2] = 5;
MIPS Assembly
Since each integer takes 4 bytes, the offset for A[2] is 2 × 4 = 8 bytes.
if (a > b) {
c = a;
MIPS Assembly
label:
sw $t1, 0($s3) # c = a
end:
Conclusion
Load and Store (lw, sw) are used for memory access.
ii) Translate the following C code to MIPS assembly code .Use a minimum number of
instructions. Assume that I and k correspond to register $s3 and $s5 and the base of the
array save is in $s6.What is the MIPs assembly code corresponding to this is C segment
While(save[i]==k) i+=1;
Ans:
C Code:
while (save[i] == k) {
i += 1;
Assumptions:
loop:
sll  $t0, $s3, 2      # $t0 = i * 4
add  $t0, $t0, $s6    # $t0 = address of save[i]
lw   $t1, 0($t0)      # $t1 = save[i]
bne  $t1, $s5, exit   # exit the loop if save[i] != k
addi $s3, $s3, 1      # i = i + 1
j    loop             # repeat
exit:
1. sll $t0, $s3, 2 → Shift i left by 2 (equivalent to multiplying by 4) to calculate the byte
offset in the array.
2. add $t0, $t0, $s6 → Add the base address of save to get the address of save[i].
3. lw $t1, 0($t0) → Load save[i] into $t1.
4. bne $t1, $s5, exit → If save[i] is not equal to k, exit the loop.
5. addi $s3, $s3, 1 → Increment i by 1.
6. j loop → Jump back to the top of the loop.
7. exit: → Label where the program jumps when the loop terminates.
1. Assume that the variables f and g are assigned to registers $s0 and $s1 respectively.
Assume that the base address of the array A is in register $s2. Assume f is zero initially.
f = g − A[4]; A[5] = f + 100; Translate the above C statements into MIPS code. How many MIPS
assembly instructions are needed to perform the C statements and how many different
registers are needed to carry out the C statements?
Ans: Let's break down the C code and translate it into MIPS assembly instructions step-by-
step:
C Code:
f = g - A[4];
A[5] = f + 100;
Assumptions:
f is in register $s0.
g is in register $s1.
A is an array with the base address in $s2.
Initially, f = 0.
A[4] is the 5th element of the array (since arrays are zero-indexed).
The base address of the array A is in $s2, so the address of A[4] is base_address + 4 *
element_size.
Assuming each element in A is 4 bytes (standard for integers), the offset to A[4] is 4 *
4 = 16 bytes.
The MIPS code to load A[4] into a register (say, $t0) would be:
lw $t0, 16($s2) # Load A[4] into $t0 (16 bytes offset from base address)
sub $s0, $s1, $t0 # f = g - A[4], where $s1 is g and $t0 is A[4]
The base address of the array A is in $s2, and A[5] is at an offset of 20 bytes (5 * 4 =
20).
addi $t1, $s0, 100 # $t1 = f + 100
sw $t1, 20($s2) # Store result of f + 100 into A[5] (20 bytes offset from base address)
Full instruction sequence:
lw $t0, 16($s2) # $t0 = A[4]
sub $s0, $s1, $t0 # f = g - A[4]
addi $t1, $s0, 100 # $t1 = f + 100
sw $t1, 20($s2) # A[5] = f + 100
$s0: For f.
$s1: For g.
$s2: For the base address of the array A.
Summary: Four MIPS instructions (lw, sub, addi, sw) and five registers ($s0, $s1, $s2, $t0, $t1) are needed to carry out the two C statements.
2. Integrate the eight ideas from computer architecture to the following ideas from other
fields:
Integrating the eight ideas from computer architecture with concepts from other fields can
provide deeper insights into optimization, efficiency, and functionality. Let's break this down
for each of the fields you mentioned, comparing them with key concepts from computer
architecture.
Ans:
In computer architecture, there are several key concepts that can be translated to improve
the functioning of an assembly line in automobile manufacturing:
1. Pipelining:
2. Parallelism:
3. Caching:
o Computer Architecture Concept: Caches store frequently accessed data
closer to the processor to speed up access.
Ans:
1. Cache Memory:
2. Load Balancing:
3. Instruction-level Parallelism:
4. Pipeline Scheduling:
o Computer Architecture Concept: Instruction pipelining allows the
simultaneous execution of instructions, with different stages operating at the
same time.
iii) Aircraft and marine navigation systems that incorporate wind information.
1. Branch Prediction:
2. Pipeline Processing:
By applying key concepts from computer architecture to these domains, we can improve
efficiency, reliability, and performance. Here’s a brief summary of the connections:
Assembly lines: Concepts like pipelining, parallelism, caching, and task scheduling
can optimize the manufacturing process.
These ideas show how principles from computer architecture can be applied to a variety of
real-world systems to improve their performance and functionality.
3. Evaluate a MIPS assembly instruction into a machine instruction, for the add $s0, $s1, $s2
MIPS instruction.
Ans: To convert the MIPS assembly instruction add $s0, $s1, $s2 into a machine instruction,
we need to follow the MIPS instruction format.
The add instruction in MIPS is an R-type (Register-type) instruction, which uses the following
format:
Field: op | rs | rt | rd | shamt | funct
Bits:  31–26 | 25–21 | 20–16 | 15–11 | 10–6 | 5–0
op: The operation code (6 bits) for the instruction type (for R-type instructions, it's 0).
rs, rt: The source registers (5 bits each); rd: the destination register (5 bits); shamt: the shift amount (5 bits, 0 for add).
funct: The function code (6 bits), specific to the instruction (for add, it's 100000).
The instruction add $s0, $s1, $s2 performs the addition of the values in registers $s1 and $s2
and stores the result in $s0.
Now, we can put all the values together in the proper order:
op = 0, rs = 17 ($s1), rt = 18 ($s2), rd = 16 ($s0), shamt = 0, funct = 32
000000 10001 10010 10000 00000 100000
This is the binary representation of the add $s0, $s1, $s2 instruction in MIPS machine code.
Group the binary representation into 4-bit chunks and convert each chunk to hexadecimal:
0000 0010 0011 0010 1000 0000 0010 0000 → 0x02328020
Final Answer:
The MIPS machine instruction corresponding to the assembly instruction add $s0, $s1, $s2
is:
0x02328020
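A short C sketch of the field packing; encode_rtype is a hypothetical helper written only for this illustration.

#include <stdio.h>
#include <stdint.h>

/* Packs the six R-type fields into one 32-bit MIPS word. */
uint32_t encode_rtype(uint32_t op, uint32_t rs, uint32_t rt,
                      uint32_t rd, uint32_t shamt, uint32_t funct) {
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct;
}

int main(void) {
    /* add $s0, $s1, $s2 -> op=0, rs=$s1(17), rt=$s2(18), rd=$s0(16), shamt=0, funct=32 */
    printf("0x%08X\n", encode_rtype(0, 17, 18, 16, 0, 32));   /* prints 0x02328020 */
    return 0;
}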
4. Explain the steps to convert the following high level language (15) such as C into a MIPS
code. a=b+e; c=b+f;
Given C Code:
a = b + e;
c = b + f;
In MIPS assembly, we use registers to store variables. Assuming the variables a, b, c, e, and f
are mapped to specific registers, we will use the following typical conventions (though this
can vary depending on the calling convention):
a → $s0, b → $s1, c → $s2, e → $s3, f → $s4
Statement 1: a = b + e;
This statement means that the value of b and e are added together, and the result is
assigned to a.
In MIPS:
MIPS Assembly:
add $s0, $s1, $s3 # a = b + e, where $s1 contains b and $s3 contains e
Statement 2: c = b + f;
This statement means that the value of b and f are added together, and the result is assigned
to c.
In MIPS:
MIPS Assembly:
add $s2, $s1, $s4 # c = b + f, where $s1 contains b and $s4 contains f
Step 3: Final MIPS Code
Now, combining both statements, the final MIPS code will be:
add $s0, $s1, $s3 # a = b + e
add $s2, $s1, $s4 # c = b + f
add $s0, $s1, $s3: This adds the values in registers $s1 (representing b) and $s3
(representing e), and stores the result in $s0 (representing a).
add $s2, $s1, $s4: This adds the values in registers $s1 (representing b) and $s4
(representing f), and stores the result in $s2 (representing c).
1. Identify which registers will hold each variable from the high-level code (e.g., a, b, c,
e, f).
2. For each statement in the high-level code, identify the operands and the operation
(e.g., addition).
3. Map the operands to appropriate registers, perform the operation using MIPS
instructions, and store the result in the corresponding register.
UNIT - 2
PART A
1. Calculate the following: Add 5₁₀ to 6₁₀ in binary and subtract −6₁₀ from 7₁₀ in binary.
Ans:
1. 5₁₀ + 6₁₀ in binary: 0101₂ + 0110₂ = 1011₂ (= 11₁₀)
2. 7₁₀ − (−6₁₀) in binary: 0111₂ + 0110₂ = 1101₂ (= 13₁₀)
• Addition Overflow: Occurs when the sum exceeds the maximum value that can be
represented within the number of bits. This happens if the carry-out of the most significant
bit is different from the carry-in.
• Subtraction Overflow: Occurs when the difference results in a value outside the
representable range. This happens if subtracting a larger positive number from a smaller
positive number or subtracting a smaller negative number from a larger negative number.
• Add and Shift: Repeatedly add the multiplicand to the result and shift the multiplicand
left until all bits of the multiplier are processed.
ALU Fast Multiplication: A method used to speed up multiplication in the Arithmetic Logic
Unit (ALU) by utilizing techniques like parallel processing, pipelining, or specialized
algorithms (e.g., Booth's algorithm) to perform multiplications more quickly and efficiently.
complement method?
8. Perform X − Y using 2's complement arithmetic for the given two 16-bit numbers X = 0000
1011 1110 1111 and Y = 1111 0010 1001 1101.
Ans: X − Y = X + (2's complement of Y); 2's complement of Y = 0000 1101 0110 0011
X − Y = 0000 1011 1110 1111 + 0000 1101 0110 0011 = 0001 1001 0101 0010 (= 6482₁₀)
Overflow: Occurs when a calculation exceeds the maximum limit of the number's range.
• Example: Adding two 8-bit numbers (127 + 1) results in 128, which cannot be
represented in an 8-bit signed integer.
Underflow: Occurs when a calculation goes below the minimum limit of the number's range.
1. Same Sign Addition: Add the absolute values and keep the common sign (e.g., +5 + 3 = +8, −5 + (−3) = −8).
2. Different Sign Addition: Subtract the smaller absolute value from the larger and take the sign of the larger (e.g., +7 + (−3) = +4, −7 + 3 = −4).
3. Zero Property: Adding zero to any integer does not change its value (e.g., 5 + 0 = 5).
4. Commutative & Associative Properties: Order and grouping do not affect the sum (e.g., a + b = b + a and (a + b) + c = a + (b + c)).
1. Arithmetic Instructions: add.s , sub.s , mul.s , div.s (for single precision) and add.d,
sub.d , mul.d , div.d (for double precision).
2. Data Transfer Instructions: lwc1 (load word to FPU), swc1 (store word from FPU),
ldc1, sdc1 (for double precision).
3. Comparison Instructions: c.eq.s, c.lt.s, c.le.s (single precision) and c.eq.d, c.lt.d, c.le.d (double precision).
1. Sign Bit: Indicates the sign of the number (0 for positive, 1 for negative).
2. Exponent: Encoded exponent value (with a bias added).
3. Significant (Mantissa): Represents the precision bits of the number.
3. Data Path in CPU: The internal structure handling data movement, including ALU,
registers, buses, and control units, ensuring efficient instruction execution.
4.Components: Key elements include instruction fetch, decode, execute, memory access,
and write-back stages for processing data efficiently.
Edge Triggered Clocking: A method in digital circuits where changes to the state of a
flipflop or other storage element occur only at specific transitions (edges) of the clock signal,
rather than at the level state of the clock. This edge can be either the rising edge (transition
from low to high) or the falling edge (transition from high to low).
20. For the following MIPS assembly instructions above, decide the corresponding C
statement?
add f, g, h & add f, i , f?
For the given MIPS assembly instructions:
1. add f, g, h → This means f = g + h; in C, where f, g, and h are integer registers or
variables.
2. add f, i, f → This means f = i + f; in C, where f is updated by adding i to its previous
value.
3. Both instructions perform integer addition and store the result in register f.
4. The equivalent C statements are:
5. f = g + h;
6. f = i + f;
PART B
1. i).Discuss the multiplication algorithm its hardware and its sequential version with
diagram?
+-----------------------------------------+
| Control Unit |
+-----------------------------------------+
| | |
+---------+ +------+ +------+
| ALU | | A | | Q |
+---------+ +------+ +------+
| | |
Multiplicand Product Multiplier
ii).Express the steps to Multiply 2*3?
Let's illustrate the multiplication of signed numbers using Booth's algorithm with A = (−34) and B = 22.
1. Initialize Variables:
o A = (−34)₁₀ = (1011110)₂ (multiplier)
o B = (22)₁₀ = (0010110)₂ (multiplicand)
o Booth's algorithm uses an additional bit for sign extension, so we consider
both numbers in 7-bit representation.
2. Set Up Registers:
o Multiplicand (B): 0010110
o Multiplier (A): 1011110 (Booth's algorithm considers the 2's complement for
negative numbers)
o Product Register: Initial value of 0
3. Algorithm Process:
o Align the multiplicand and the product register.
o Apply the Booth's encoding for every bit of the multiplier:
▪ If the current bit is 0 and the previous bit is 1, add the multiplicand to
the product.
▪ If the current bit is 1 and the previous bit is 0, subtract the
multiplicand from the product.
▪ If the current bit is the same as the previous bit, perform arithmetic
right shift.
Iteration Process:
4. Final Result: Combine the product register after all iterations to get the final
multiplication result in binary.
This process repeats until all bits of the multiplier are processed. Booth's algorithm
efficiently handles signed number multiplication by reducing the number of necessary
additions/subtractions through encoding.
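A compact C sketch of Booth's algorithm as described above; it simulates the combined [A | Q | Q-1] register in one 64-bit word and reproduces the (-34) x 22 example.

#include <stdio.h>
#include <stdint.h>

int64_t booth_multiply(int64_t multiplicand, int64_t multiplier, int n) {
    int64_t Mshift = multiplicand * (1LL << (n + 1));   /* multiplicand aligned with A */
    int64_t P = (multiplier & ((1LL << n) - 1)) << 1;    /* A = 0, Q = multiplier, Q-1 = 0 */
    for (int i = 0; i < n; i++) {
        int q0  = (P >> 1) & 1;                  /* current multiplier bit */
        int qm1 =  P       & 1;                  /* previous bit (Q-1) */
        if (q0 == 0 && qm1 == 1) P += Mshift;    /* 01: add the multiplicand */
        if (q0 == 1 && qm1 == 0) P -= Mshift;    /* 10: subtract the multiplicand */
        P >>= 1;                                 /* arithmetic right shift */
    }
    return P >> 1;                               /* drop Q-1: signed product */
}

int main(void) {
    /* multiplier A = -34, multiplicand B = 22, 7-bit operands */
    printf("%lld\n", (long long)booth_multiply(22, -34, 7));   /* prints -748 */
    return 0;
}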
Design: 1. Functionality:
4. Develop an algorithm to implement A × B. Assume A and B are a pair of signed 2's
complement numbers with values: A = 010111, B = 101100.
1. Initialize:
register to 0.
- For negative values, find the 2's complement of B (if not already done).
4. Combine Results:
- After processing all bits, combine the results to get the final product.
Example:
1. A = 010111
2. B = 101100
Step-by-Step:
Division Algorithm:
1. Initialize
- Repeatedly shift the divisor to the right and subtract it from the dividend until the divisor
cannot be subtracted from the current remainder.
3. Update Quotient
- For each successful subtraction, shift the quotient left and set the least significant bit to 1.
4. End
- The final values in the quotient and remainder registers represent the result of the
division.
Example: Dividing 27 by 4
1. Binary representation
2.Steps
- Subtract divisor from dividend where possible, updating the quotient and remainder.
Step-by-Step Division:
- 27 ÷ 4 = 6 (Quotient: 6, Remainder: 3)
Binary Result
- Quotient = 110₂, Remainder = 011₂
A Carry Look ahead Adder (CLA) is a type of adder used in digital circuits to improve the
speed of arithmetic operations by reducing the time required to determine carry bits. Unlike
the simpler ripple-carry adder, which calculates each carry bit sequentially, the CLA
calculates one or more carry bits before the sum, significantly reducing the wait time.
Key Concepts:
- The CLA uses generate and propagate signals to quickly determine carry bits for each bit
position.
- The carry-out for each bit position is calculated using the formula:
  C(i+1) = G(i) + P(i)·C(i), where G(i) = A(i)·B(i) (generate) and P(i) = A(i) XOR B(i) (propagate).
- This allows the carry bits to be calculated in parallel, rather than sequentially.
Example:
- Generate (G) and Propagate (P) signals are calculated for each bit position.
- Carry bits** are determined using the carry look ahead logic.
Diagram (simplified textual representation):
A: 1010
B: 1101
G = A·B: 1000 (Generate)
P = A XOR B: 0111 (Propagate)
Carries (c4 c3 c2 c1, with c0 = 0): 1000
Sum: 0111, Carry-out: 1 (1010 + 1101 = 10111)
Advantages:
1. Speed: Significantly faster than ripple-carry adders due to parallel carry calculation.
2. Efficiency: Reduces the propagation delay, making it suitable for high-speed arithmetic
operations.
Applications:
- Used in CPUs and other digital systems where fast arithmetic operations are crucial.
- Commonly found in ALUs (Arithmetic Logic Units) and other processing units.
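A short C sketch of the 4-bit carry look-ahead equations; the loops stand in for logic that real hardware evaluates in parallel.

#include <stdio.h>

int main(void) {
    unsigned a = 0xA, b = 0xD, c0 = 0;   /* A = 1010, B = 1101 */
    unsigned g = a & b;                  /* Generate:  G(i) = A(i) AND B(i) */
    unsigned p = a ^ b;                  /* Propagate: P(i) = A(i) XOR B(i) */
    unsigned c[5];
    c[0] = c0;
    for (int i = 0; i < 4; i++)          /* C(i+1) = G(i) + P(i)*C(i) */
        c[i + 1] = ((g >> i) & 1) | (((p >> i) & 1) & c[i]);
    unsigned sum = 0;
    for (int i = 0; i < 4; i++)          /* S(i) = P(i) XOR C(i) */
        sum |= (((p >> i) & 1) ^ c[i]) << i;
    printf("G=%X P=%X carry-out=%u sum=%X\n", g, p, c[4], sum);
    /* 1010 + 1101 = 1 0111: G=8, P=7, carry-out=1, sum=7 */
    return 0;
}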
6 ii). Divide (12)10 by (3)10
1. Convert (12)₁₀ to binary: 12 in decimal = 1100₂
2. Convert (3)₁₀ to binary: 3 in decimal = 11₂
Binary Division:
1100₂ ÷ 11₂ = 100₂ (quotient = 4, remainder = 0)
7. Point out how ALU performs division with flow chart and block diagram.
The Arithmetic Logic Unit (ALU) performs division using restoring or non-restoring
division algorithms, depending on the architecture. Below, I describe the basic restoring
division method, which is commonly used in ALUs.
1. Initialize:
o Load the dividend into the dividend register.
o Load the divisor into the divisor register.
o Set the quotient register to zero.
o Clear the remainder register.
2. Shift Left:
o Shift the remainder and the most significant bit (MSB) of the dividend left.
3. Subtract Divisor:
o Subtract the divisor from the remainder.
o If the result is negative, restore the previous remainder (by adding back the
divisor) and set quotient bit to 0.
o If the result is positive, keep the remainder and set quotient bit to 1.
4. Repeat:
o Repeat steps 2 and 3 for n iterations, where n is the number of bits in the
dividend.
Flowchart Representation:
Start
↓
Initialize Registers (Dividend, Divisor, Quotient, Remainder)
↓
Shift Left (Dividend + Remainder)
↓
Subtract Divisor from Remainder
↓
If Remainder < 0?
→ Yes: Restore Remainder (Add Back Divisor), Set Quotient Bit = 0
→ No: Keep Remainder, Set Quotient Bit = 1
↓
Repeat Steps for n Bits
↓
Store Quotient & Remainder
↓
End
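A short C sketch of restoring division following the flowchart above; it tests before subtracting, which is equivalent to subtracting and restoring on a negative result.

#include <stdio.h>
#include <stdint.h>

void restoring_divide(uint32_t dividend, uint32_t divisor, int n,
                      uint32_t *quotient, uint32_t *remainder) {
    uint32_t q = dividend;   /* quotient register initially holds the dividend */
    uint32_t r = 0;          /* remainder register cleared */
    for (int i = 0; i < n; i++) {
        r = (r << 1) | ((q >> (n - 1)) & 1);   /* shift remainder left, bring in next bit */
        q = (q << 1) & ((1u << n) - 1);
        if (r >= divisor) {                    /* subtraction would be non-negative */
            r -= divisor;
            q |= 1;                            /* set quotient bit to 1 */
        }                                      /* otherwise the remainder is "restored" */
    }
    *quotient = q;
    *remainder = r;
}

int main(void) {
    uint32_t q, r;
    restoring_divide(27, 4, 8, &q, &r);           /* 27 / 4 from the example above */
    printf("quotient=%u remainder=%u\n", q, r);   /* quotient=6 remainder=3 */
    return 0;
}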
8. i).Examine with a neat block diagram how floating point addition is carried out in a
computer system.
• Sign Bit (S) – Determines whether the number is positive (0) or negative (1).
• Exponent (E) – Represents the exponent using a biased notation.
• Mantissa (M or Fraction) – Stores the significant digits of the number.
|
v
+---------------------------+
| Align Exponents |
| (Shift smaller mantissa) |
+---------------------------+
|
v
+---------------------------+
| Add/Subtract Mantissas |
| (Depends on sign bits) |
+---------------------------+
|
v
+---------------------------+
| Normalize Result |
| (Shift & Adjust Exponent) |
+---------------------------+
|
v
+---------------------------+
| Round the Result |
+---------------------------+
|
v
+---------------------------+
| Handle Exceptions |
| (Overflow/Underflow) |
+---------------------------+
|
v
+---------------------------+
| Store Final Result |
+---------------------------+
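A short C sketch of the very first step, extracting the sign, exponent, and mantissa fields of a single-precision operand; it uses 5.75 from the example that follows.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float x = 5.75f;                  /* 5.75 = 101.11 in binary */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the float's bit pattern */
    uint32_t sign     = bits >> 31;            /* 1 bit */
    uint32_t exponent = (bits >> 23) & 0xFF;   /* 8 bits, biased by 127 */
    uint32_t mantissa = bits & 0x7FFFFF;       /* 23 bits, implicit leading 1 */
    printf("S=%u E=%u (unbiased %d) M=0x%06X\n", sign, exponent, (int)exponent - 127, mantissa);
    /* For 5.75: S=0, E=129 (unbiased 2), M=0x380000, i.e. 1.0111 x 2^2 */
    return 0;
}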
We will add 5.75 (101.11₂) + 3.25 (11.01₂) using IEEE 754 single-precision format.
• The larger exponent is 129 (for 5.75), so we shift the mantissa of 3.25 right by 1 bit:
  1.0111   (5.75)
+ 0.1101   (3.25, aligned)
-------------------
 10.0100   (result, normalized to 1.00100 × 2³ = 9.0)
Format | Sign | Exponent | Mantissa
Single (32-bit) | 0 | 01111110₂ (126, 8-bit field, bias 127) | 10000000000000000000000 (23 bits)
Double (64-bit) | 0 | 01111111110₂ (1022, 11-bit field, bias 1023) | 1000000000000000000000000000000000000000000000000000 (52 bits)
• Determines the sign of the result based on the operands and the operation:
o Addition: If signs are the same, add mantissas; if different, subtract the smaller
from the larger.
o Multiplication/Division: Uses XOR of input signs.
iv) Normalization & Rounding Unit
+--------------------------------+
| Floating Point ALU |
+--------------------------------+
| | |
+--------+ +-------+ +--------+
| Sign Unit | | Exp | | Mantissa |
+--------+ +-------+ +--------+
| | |
+--------------------------------+
| Normalization & Rounding Unit |
+--------------------------------+
|
+----------------+
| Final Result |
+----------------+
4 .Implementation in Hardware (FPGA/ASIC)
• Floating-Point Adders/Subtractors use barrel shifters for exponent alignment.
• Floating-Point Multipliers use Booth’s algorithm for efficient multiplication.
• Floating-Point Dividers use Newton-Raphson or Goldschmidt’s algorithm
Traditional processors operate on word-sized data (e.g., 32-bit or 64-bit words). However,
many applications involve smaller data types, such as 8-bit characters or 16-bit integers.
Sub-word parallelism breaks a single word into multiple smaller units (sub-words) and
processes them in parallel.
For example, a 32-bit register can be split into:
• Four 8-bit integers (Byte-wise parallelism)
• Two 16-bit integers (Half-word parallelism)
This allows a single instruction to process multiple sub-words simultaneously, improving
performance.
2. Example of Sub-Word Parallelism in SIMD Execution
Consider adding two 32-bit registers, each containing four 8-bit values:
Without SWP (Scalar Execution)
Each 8-bit addition is done sequentially:
A = [10] [20] [30] [40] (4 bytes in 32-bit register)
B = [ 5] [ 5] [10] [10]
----------------------------
Result = [15] [25] [40] [50]   (4 separate additions)
Time taken = 4 cycles (1 cycle per addition).
With SWP (SIMD Execution)
Using SWP-based SIMD, a single instruction can add all four 8-bit values at once:
A = [10] [20] [30] [40]
B = [ 5] [ 5] [10] [10]
----------------------------
Result = [15] [25] [40] [50]   (All added in 1 cycle)
Time taken = 1 cycle (4 operations in parallel) → 4× speedup!
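A short C sketch of the same idea in software (sometimes called SWAR); it assumes no per-byte sum needs more than 8 bits, so no carry ever has to cross a lane boundary.

#include <stdio.h>
#include <stdint.h>

/* Add four packed 8-bit values in one 32-bit operation. */
static uint32_t add_bytes(uint32_t a, uint32_t b) {
    const uint32_t H = 0x80808080u;   /* MSB of each byte lane */
    /* Add the low 7 bits of every lane, then patch the MSBs so a carry never crosses a lane. */
    return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
}

int main(void) {
    /* Bytes 10,20,30,40 and 5,5,10,10 packed into 32-bit words, as in the example above */
    uint32_t a = (10u << 24) | (20u << 16) | (30u << 8) | 40u;
    uint32_t b = ( 5u << 24) | ( 5u << 16) | (10u << 8) | 10u;
    uint32_t r = add_bytes(a, b);
    printf("%u %u %u %u\n", r >> 24, (r >> 16) & 0xFF, (r >> 8) & 0xFF, r & 0xFF);
    /* prints 15 25 40 50 */
    return 0;
}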
3. Applications of Sub-Word Parallelism:
o Multimedia and digital signal processing of audio, video, and communication signals.
o AI & Machine Learning: used in matrix multiplications.
Introduction:
Floating-point addition is used in computers to add numbers represented in the IEEE 754
format. Since floating-point numbers have separate sign, exponent, and mantissa, addition
is more complex than integer addition.
• Extract sign (S), exponent (E), and mantissa (M) from both operands.
Step 2: Align Exponents and Add Mantissas:
  1.0111   (5.75)
+ 0.1101   (3.25, shifted right by 1)
-------------
 10.0100
No rounding needed.
0 10000010 00100000000000000000000
+---------------------------------+
|  Exponent Compare & Align       |
+---------------------------------+
                |
                v
+---------------------------------+
|  Mantissa Addition/Subtraction  |
+---------------------------------+
                |
                v
+---------------------------------+
|  Normalize & Round              |
+---------------------------------+
                |
                v
        +----------------+
        |  Final Result  |
        +----------------+
Advantages of Floating-Point Addition:
11 ii).Assess the result of the numbers (0.5)10 and (0.4375)10 using binary Floating point
Addition algorithm?
We will follow the IEEE 754 Floating-Point Addition Algorithm to add 0.5 (₁₀) + 0.4375 (₁₀).
1. Convert to Binary:
0.5₁₀ = 0.1₂
2. Normalize:
1.0 × 2⁻¹
0 01111110 00000000000000000000000
Convert 0.4375₁₀ to IEEE 754 Format:
1. Convert to Binary:
0.4375₁₀ = 0.0111₂
2. Normalize:
1.11 × 2⁻²
→ 0.11100000000000000000000
Step 3: Add Mantissas:
1.00000000000000000000000 (0.5)
+ 0.11100000000000000000000 (0.4375)
----------------------------------
1.11100000000000000000000
• Sign = 0
• Exponent = 126 (01111110₂)
• Mantissa = 11100000000000000000000 Final IEEE 754 Representation (32-
bit): 0 01111110 11100000000000000000000
Step 6: Convert Back to Decimal:
1.111₂ × 2⁻¹ = 0.9375₁₀, which is indeed 0.5 + 0.4375.
1.0000011 × 2⁵
S | E (8-bit) | M (23-bit)
--------------------------------------
0 | 10000100 | 00001100000000000000000
0x420C0000
Binary Representation:
0 10000100 00001100000000000000000
3. 18.125₁₀ = 10010.001₂
Step 2: Normalize the Binary Number:
1.0010001 × 2⁴
S | E (8-bit) | M (23-bit)
--------------------------------------
0 | 10000011 | 00100010000000000000000
0x41910000
Binary Representation:
0 10000011 00100010000000000000000
0.0625₁₀ = 0.0001₂
Step 2: Normalize the Binary Number:
1.0 × 2⁻⁴
S | E (8-bit) | M (23-bit)
--------------------------------------
0 | 01111011 | 00000000000000000000000
Final IEEE 754 (Single Precision) Representation (Hexadecimal):
0x3D800000
Binary Representation:
0 01111011 00000000000000000000000
IEEE 754 (Single Precision) for 0.0625 = 0x3D800000
For double precision, 0.0625 is likewise normalized as 1.0 × 2⁻⁴, so the biased exponent is −4 + 1023 = 1019 = 01111111011₂.
S | E (11-bit) | M (52-bit)
---------------------------------------------------
0 | 01111111011 | 0000000000000000000000000000000000000000000000000000
Final IEEE 754 (Double Precision) Representation (Hexadecimal):
0x3FB0000000000000
Binary Representation:
0 01111111011 0000000000000000000000000000000000000000000000000000
IEEE 754 (Double Precision) for 0.0625 = 0x3FB0000000000000
Final Answer:
Decimal | IEEE 754 Binary | Hexadecimal
0.0625 (Single) | 0 01111011 00000000000000000000000 | 0x3D800000
0.0625 (Double) | 0 01111111011 0000000000000000000000000000000000000000000000000000 | 0x3FB0000000000000
o Decimal: 1.1010₂ = 1.625₁₀
o IEEE exponent calculation: 10¹⁰ ≈ 2³³ (base-2 approximation), so Exponent = 33 + 127 = 160 (10100000₂)
o Mantissa: 10100000000000000000000
S = 0, E = 10100000, M = 10100000000000000000000
• B = 9.200 × 10⁻⁵
o Normalized (Single Precision): S = 0, E = 01111010, M = 00111001100000000000000
(1.1010₂) × (1.001110011₂) = 1.1011110110011₂
Step 5: Normalize the Result:
S = 0, E = 10011011, M = 10111101100110000000000
Result in IEEE 754 (Single Precision)
0 10011011 10111101100110000000000
ii) Solve 0.5₁₀ × 0.4375₁₀?
o Binary: 0.1000001100₂
o Normalized: 1.000001100 (exponent field 01111110₂)
o Mantissa: 00000110000000000000000
o Mantissa (of 0.4375): 11000000000000000000000
(1.000001100₂) × (1.110₂) = 1.110001100₂
Step 5: Normalize the Result:
S = 0, E = 01111100, M = 11000110000000000000000
Result in IEEE 754 (Single Precision): 0 01111100 11000110000000000000000
Final Answers:
1.1010 × 10¹⁰ × 9.200 × 10⁻⁵ → 0 10011011 10111101100110000000000
0.5₁₀ × 0.4375₁₀ → 0 01111100 11000110000000000000000
PART C
1. Create the logic circuit for a CLA. What are the disadvantages of ripple carry addition and
how are they overcome in a carry look-ahead adder?
Carry Look-Ahead Adder (CLA):
A Carry Look-Ahead Adder (CLA) is a high-speed adder that improves upon the Ripple
Carry Adder (RCA) by reducing carry propagation delay. Instead of waiting for each bit's
carry to propagate, CLA computes carries in advance using the Generate (G) and Propagate
(P) functions.
1. Logic Circuit for Carry Look-Ahead Adder (CLA)
o CLA precomputes carry values using P and G, making it O(1) (constant time
complexity).
2. Parallel Carry Calculation:
A' = 0.1010010000₂
Now, both have exponent 18.
Step 3: Perform Mantissa Addition
A' = 0.1010010000₂
B = 1.0000100100₂
Binary addition:
0.1010010000
+ 1.0000100100
-----------------------
1.1010110100
New Mantissa: 1.1010110100 (unnormalized)
Step 4: Normalize the Sum:
Since the sum is 1.1010110100, it's already normalized.
• Exponent remains 18.
• Mantissa: 1010110100.
Step 5: Apply Guard, Round, and Sticky Bits:
We extend to 13 bits for rounding (including G, R, S bits):
1.1010110100 010  (the last three bits are the Guard, Round, and Sticky bits)
• Guard (G) = 0, Round (R) = 1, Sticky (S) = 0
• Since R = 1 and rounding is to nearest even, we round up.
Rounded Mantissa:
1.1010110110₂
Step 6: Store in IEEE 754 Format:
Final values:
• S=0
• E = 10010 (18)
• M = 1010110110
Decimal value:
1.1010110110₂ × 2^(18−15)
The binary mantissa is 1.1010110110₂. Expanding it:
= 1 + 1/2 + 0/4 + 1/8 + 0/16 + 1/32 + 1/64 + 0/128 + 1/256 + 1/512 + 0/1024
= 1 + 0.5 + 0.125 + 0.03125 + 0.015625 + 0.00390625 + 0.001953125
= 1.677734375
So the value is 1.677734375 × 2³ = 13.421875.
3. Accumulator (4-bit)
0111 - 0010 = No
1 0111 0001
0101 Restore
1010 - 0010 = No
2 1010 0011
1000 Restore
0000 - 0010 =
3 0000 Restore 0110
1110
1100 - 0010 = No
4 1100 0111
1010 Restore
Part – A
1.Express the truth table for AND gate and OR gate.
Answer :
AND Gate:
A B Output
0 0 0
1 0 0
0 1 0
1 1 1
OR Gate:
A B Output
0 0 0
1 0 1
0 1 1
1 1 1
2.Define hazard. Give an example for data hazard.
Answer :
Define hazard:
A hazard is a situation in a pipeline where the next instruction cannot execute in its designated clock cycle, because of a data or control dependence on an earlier instruction or a conflict in accessing resources.
Example for data hazard:
Consider the following instructions:
1. lw $t0, 0($t1)
2. add $t2, $t0, $t3
Data hazard occurs because the add instruction uses $t0 before it's loaded.
Answer :
Pipeline bubble.
Answer :
Answer :
Answer :
Answer :
Answer :
Branch taken: update PC with target address
Answer :
Answer :
Example(MIPS):
Answer :
Data Hazards:
Example: add $t1, $t2, $t3; sub $t4, $t1, $t5; (sub depends on the result of add).
Control Hazards:
Example: beq $t1, $t2, target; (the next instruction depends on whether the branch is taken).
Structural Hazards:
Example: If the memory unit can only perform one access per cycle, and both instruction fetch and data
access occur in the same cycle.
12.Illustrate the two steps that are common to implement any type of instruction.
Answer :
Instruction Fetch (IF): Send the program counter (PC) to instruction memory and fetch the instruction.
Instruction Decode (ID): Decode the instruction and read the required operands from registers.
Answer :
Forwarding (Data Bypassing): Passing results directly from one pipeline stage to another.
Answer :
Answer :
3. Execute (EX)
16.Point out the concept of exceptions. Give one example of MIPS exception.
Answer :
An exception is an unexpected event that disrupts the normal flow of instruction execution.
Example (MIPS):
o Overflow Exception: Occurs when an arithmetic operation produces a result that exceeds the
register's capacity.
17.What is pipelining?
Answer :
Pipelining is an implementation technique in which the execution of multiple instructions is overlapped: each instruction is divided into stages (IF, ID, EX, MEM, WB), and different instructions occupy different stages in the same clock cycle.
Answer :
Superscalar: Multiple instructions can be issued and executed in the same clock cycle.
VLIW (Very Long Instruction Word): The compiler packages multiple independent instructions into a single
long instruction word for concurrent execution.
Answer :
+-------------------+
| Instruction Queue |
+-------------------+
|
v
+------------------------+
| Reservation Stations |------> +------------+
+------------------------+ | Execution |
| | Units |
v +------------+
+------------------------+
+------------------------+
Reservation Stations: Buffer operands and instructions until they are ready to execute.
Common Data Bus (CDB): Broadcasts results to reservation stations and registers.
Answer :
An exception is an unscheduled event that disrupts the normal execution flow of a program. It can arise from
various sources, such as hardware failures, software errors, or program interrupts.
Example(MIPS):
Address Error Exception: Occurs when a program attempts to access a memory address that is not
properly aligned or does not exist.
Part – B
Logic design is the art of creating circuits that perform specific functions based on binary inputs (logic
“0” and “1”). Its foundation is built on Boolean algebra—a mathematical system using binary variables and
operations (AND, OR, NOT, etc.) that model the behavior of digital components.
2. Boolean Algebra and Canonical Forms:
At the heart of digital circuit design is Boolean algebra. The two canonical forms often used are:
Sum-of-Products (SOP): The logic function is expressed as a logical OR (sum) of multiple AND
(product) terms. This representation simplifies implementation using AND-OR gate networks.
Product-of-Sums (POS): The function is represented as an AND (product) of several OR (sum) terms.
This form is particularly useful in designing NOR-logic based circuits.
For example, the Boolean function F = A·B + A′·C represents a typical SOP that can be directly
implemented using an AND-OR logic network.
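A tiny C sketch of this SOP function, printing its full truth table (illustrative only):
c
#include <stdio.h>

/* F = A·B + A'·C, evaluated on 0/1 inputs. */
static int F(int a, int b, int c) {
    return (a & b) | ((!a) & c);
}

int main(void) {
    /* Print the full truth table of F. */
    for (int a = 0; a <= 1; a++)
        for (int b = 0; b <= 1; b++)
            for (int c = 0; c <= 1; c++)
                printf("A=%d B=%d C=%d -> F=%d\n", a, b, c, F(a, b, c));
    return 0;
}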
Standardized symbols ensure consistency and clarity in circuit diagrams. Some key conventions
include:
AND Gate: Usually drawn with a flat left side and a curved right side.
OR Gate: Designed with a curved input side that funnels together, ending in a pointed output.
NOT Gate (Inverter): Represented as a triangle with a small circle (bubble) at the output, indicating
logical inversion.
NAND, NOR, XOR, and XNOR Gates: These are variations where additional bubbles may be added or
shapes adjusted slightly to indicate their specific operation.
The use of bubbles on inputs or outputs also serves as a convention to denote active-low signals. If a gate input
is “active low,” a small circle is placed at that terminal, indicating that a low (logic 0) represents the active or
asserted state.
Beyond static gate-level diagrams, logic design conventions extend to timing diagrams in sequential
circuits. Proper labelling of signals, clock edges, setup/hold times, and propagation delays is critical for ensuring
that the circuit works reliably under dynamic conditions. A consistent naming convention for signals and nets
helps in debugging, simulation, and later hierarchical integration.
Good practice in logic design involves breaking complex circuits into smaller, manageable blocks or
modules. Each module adheres to standard interface conventions—with clearly defined inputs, outputs, and
enable signals. This modular approach improves testability, reusability, and overall clarity of the design process.
Example Diagram
Below is an ASCII diagram representing a simple combinational logic circuit using these conventions. The
diagram shows the implementation of a function using SOP form:
A B C
| | |
| | +-----------+
| | |
| +----[ AND ]---|
| |
+----[ AND ]-----|----[ OR ]---- F = (A AND B) OR (A' AND C)
(A, B) /
NOT --| /
A'
Explanation of the Diagram:
The first AND gate combines inputs A and B directly.
The second AND gate uses the inverted input A' (obtained from an inverter, shown by the bubble on
the NOT gate) along with input C.
The outputs of both AND gates feed into an OR gate to produce the final output F.
Inversion (active-low representation) is indicated by the bubble (even though text-based, the term
“NOT” and the label A' clearly denote inversion).
Hierarchical Flow: The function F is clearly built by combining simpler logic blocks.
2. i)State the MIPS implementation in detail with necessary multiplexers and control
lines.
ii)Examine and draw a simple MIPS datapath with the control unit and the execution
of ALU instructions.
Answer :
The MIPS architecture is a Reduced Instruction Set Computer (RISC) architecture. Its implementation involves
breaking down instructions into distinct stages: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX),
Memory Access (MEM), and Write Back (WB). To handle the various instruction types and data flows, we need
multiplexers and control lines.
Execute (EX):
o The ALU performs the operation specified by the instruction.
Control Lines:
RegDst: Determines the destination register for write-back.
ALUSrc: Determines the second ALU operand.
MemtoReg: Determines the data source for write-back.
RegWrite: Enables writing to the register file.
MemRead: Enables reading from data memory.
MemWrite: Enables writing to data memory.
Branch: Enables branching.
ALUOp: Specifies the ALU operation.
Jump: Enables Jump instruction.
PCSource: Selects the next PC address.
ii) Simple MIPS Datapath with Control Unit and ALU Instructions
Now, let's focus on the datapath for ALU instructions, including the control unit.
1. Instruction Fetch:
o The instruction is fetched from instruction memory at the address held in the PC.
2. Instruction Decode:
o The instruction bits [31:26] (opcode) are sent to the control unit.
o The register file reads the source registers (rs and rt).
o The sign-extend unit extends the 16-bit immediate value to 32 bits.
3. ALU Execution:
o The ALU performs the operation specified by the ALUOp control signals.
o ALUSrc selects the second operand (either a register value or the sign-extended immediate).
4. Write Back:
o The ALU result is written back to the destination register (rd); RegWrite is asserted and MemtoReg selects the ALU result.
The control unit is the brain of the datapath. It generates the necessary control signals based on the instruction's
opcode. For ALU (R-type) instructions:
RegDst: 1 (destination is rd).
ALUSrc: 0 (second operand comes from a register).
MemtoReg: 0 (write-back data comes from the ALU).
RegWrite: 1.
MemRead: 0.
MemWrite: 0.
Branch: 0.
ALUOp: 10 (the funct field selects the exact ALU operation).
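A compact C sketch of such a main-control truth table (the signal values follow the standard single-cycle MIPS design; the struct and function names are illustrative):
c
#include <stdio.h>
#include <stdint.h>

/* Control signals produced by the main control unit. */
struct Control {
    uint8_t RegDst, ALUSrc, MemtoReg, RegWrite;
    uint8_t MemRead, MemWrite, Branch, ALUOp;
};

/* Decode a 6-bit opcode into control signals (subset of instructions;
   don't-care fields for sw/beq are set to 0). */
static struct Control decode(uint8_t opcode) {
    switch (opcode) {
    case 0x00: return (struct Control){1,0,0,1, 0,0,0, 2}; /* R-type */
    case 0x23: return (struct Control){0,1,1,1, 1,0,0, 0}; /* lw     */
    case 0x2B: return (struct Control){0,1,0,0, 0,1,0, 0}; /* sw     */
    case 0x04: return (struct Control){0,0,0,0, 0,0,1, 1}; /* beq    */
    default:   return (struct Control){0};
    }
}

int main(void) {
    struct Control c = decode(0x00);
    printf("R-type: RegWrite=%d ALUOp=%d\n", c.RegWrite, c.ALUOp);
    return 0;
}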
Definition: Parallelism is the technique of performing multiple computations simultaneously, rather than
sequentially. Its purpose is to improve performance and throughput by taking advantage of multiple processing
elements that work concurrently. In computer architecture, parallelism is key to speeding up data processing and
achieving high computational performance.
Types of Parallelism:
1. Instruction-Level Parallelism (ILP):
o What It Is: ILP refers to the ability of a processor to execute multiple instructions at the same time.
o How It's Achieved: This is typically accomplished using techniques such as pipelining,
superscalar execution (multiple instructions issued per clock cycle), out-of-order execution,
and speculative processing.
2. Data-Level Parallelism (DLP):
o What It Is: DLP involves performing the same operation on multiple data items concurrently.
o How It's Achieved: Modern processors use vector units or specialized SIMD instructions that
work on arrays of data (for example, in multimedia processing or scientific computations).
3. Task-Level (Thread-Level) Parallelism:
o What It Is: Separate, largely independent tasks or threads execute at the same time.
o How It's Achieved: Multicore processors and multi-threaded software allow separate tasks
(which may be parts of a larger application) to run simultaneously on different cores or
hardware threads.
4. Bit-Level Parallelism:
o What It Is: Bit-level parallelism deals with the simultaneous processing of multiple bits
within a single machine word.
o How It's Achieved: By using wider words (for example, moving from 8-bit to 32-bit or 64-bit
processing), more bits are handled in a single operation, effectively processing data in parallel.
plaintext
                              [Parallelism]
        +------------------+------------------+------------------+
        |                  |                  |                  |
  Instruction-Level    Data-Level       Task/Thread-Level    Bit-Level
    Parallelism        Parallelism         Parallelism       Parallelism
This diagram illustrates that while all types of parallelism share the goal of concurrent execution, they operate at
different levels—from individual bits and instructions up to entire tasks or threads.
Characteristics of ILP:
1. Pipelining:
o Concept: ILP often leverages pipelining to overlap the execution of various instruction stages
(fetch, decode, execute, memory access, write-back).
o Benefit: This overlaps independent instructions so that while one instruction is being decoded,
another can be fetched, and yet another can be executed.
2. Superscalar Execution:
o Concept: Many modern processors can issue and execute more than one instruction per clock
cycle by having multiple execution units.
o Benefit: Increases the number of instructions completed per cycle (IPC), thereby boosting
performance.
3. Out-of-Order Execution:
o Concept: Instructions are allowed to execute as soon as their operands are available rather
than strictly in program order.
o Benefit: Improves resource utilization and minimizes stalls due to instruction dependencies.
4. Branch Prediction and Speculation:
o Concept: Processors predict paths of branch instructions to avoid stalling the pipeline.
Speculative results are committed only if the predictions are correct.
o Benefit: Mitigates control hazards and improves ILP by reducing delays on branch
instructions.
5. Hardware Complexity:
o Design Note: Implementing ILP requires additional hardware (such as reservation stations,
reorder buffers, and complex scheduling logic), which adds to design complexity and power
consumption.
Limitations of ILP:
1. Data Hazards:
o Definition: Occur when instructions depend on the results of previous instructions (e.g.,
Read-after-Write hazards).
o Impact: These dependencies force the processor to delay execution, which can limit parallel
instruction throughput.
2. Control Hazards:
o Impact: Mis-predicted branches lead to pipeline flushes and wasted cycles, which reduce the
benefits of ILP.
3. Limited Program Parallelism:
o Impact: Highly sequential code, with many dependencies, limits the number of instructions
that can be executed in parallel.
4. Memory Latency:
o Impact: These delays (e.g., cache misses) can stall the execution pipeline, reducing effective ILP, even if there are
many independent instructions.
5. Hardware Complexity and Power:
o Definition: As more hardware is added to expose further ILP, the complexity and power
requirements increase.
o Impact: There is a practical limit (the "ILP wall") beyond which additional parallel execution
does not yield significant performance improvements due to overhead, dependencies, and
resource contention.
plaintext
Pipeline Stages
+--------------------+--------------------+--------------------+--------------------+--------------------+
| IF / Fetch         | IF / Fetch         | IF / Fetch         | IF / Fetch         | IF / Fetch         |
+--------------------+--------------------+--------------------+--------------------+--------------------+
          ↓                    ↓                    ↓                    ↓                    ↓
+--------------------+--------------------+--------------------+--------------------+--------------------+
| ID / Decode & Reg  | ID / Decode & Reg  | ID / Decode & Reg  | ID / Decode & Reg  | ID / Decode & Reg  |
+--------------------+--------------------+--------------------+--------------------+--------------------+
          ↓                    ↓                    ↓                    ↓                    ↓
+--------------------+--------------------+--------------------+--------------------+--------------------+
| EX / Execute       | EX / Execute       | EX / Execute       | EX / Execute       | EX / Execute       |
+--------------------+--------------------+--------------------+--------------------+--------------------+
          ↓                    ↓                    ↓                    ↓                    ↓
+--------------------+--------------------+--------------------+--------------------+--------------------+
| MEM Access         | MEM Access         | MEM Access         | MEM Access         | MEM Access         |
+--------------------+--------------------+--------------------+--------------------+--------------------+
          ↓                    ↓                    ↓                    ↓                    ↓
+--------------------+--------------------+--------------------+--------------------+--------------------+
| WB / Write Back    | WB / Write Back    | WB / Write Back    | WB / Write Back    | WB / Write Back    |
+--------------------+--------------------+--------------------+--------------------+--------------------+
Explanation:
Each column represents different instructions advancing through pipeline stages concurrently.
ILP is achieved because multiple instructions are at different processing stages simultaneously.
The diagram also implies that hazards (data or control) may force certain instructions to stall or require
forwarding, which limits the overall instruction throughput.
An instruction pipeline divides instruction processing into several consecutive stages. A typical RISC pipeline
has five stages: IF (Instruction Fetch), ID (Instruction Decode), EX (Execute), MEM (Memory Access), and WB (Write Back).
By overlapping these stages—for example, while one instruction is in EX, another can be decoded
simultaneously—the processor improves throughput. However, overlapping execution can create hazards that
force the pipeline to stall.
Pipeline stalls (or bubbles) occur when hazards interrupt the normal flow of instructions. The main hazards
include:
Structural Hazards: Occur when hardware resources are insufficient. For example, if an instruction in
MEM and one in IF both need the same memory port, one must wait.
Data Hazards: Happen when an instruction depends on a result not yet available. For example, a load–
use hazard arises if the instruction in the ID stage must wait for a preceding load instruction still in the
EX/MEM stage.
Control Hazards: Arise mainly from branch instructions. When the outcome of a branch is uncertain,
subsequent instructions may have to be flushed or stalled until branch resolution.
To handle these, modern pipeline designs include a hazard detection unit (HDU) plus techniques such as
forwarding (or bypassing) and stall insertion.
Hazard Detection Unit (HDU):
o Monitors instructions in the pipeline to detect conflicts (e.g., when a register needed in ID is
being produced in EX/MEM).
The Program Counter (PC) and IF/ID register are frozen so that no new
instruction is fetched.
A bubble (NOP) is inserted into the pipeline at the ID/EX boundary.
o This controlled delay ensures that the dependent instruction waits until the correct data is
available.
Forwarding (Bypassing) Unit:
o For many data hazards, the forwarding unit can reroute data from later stages (EX/MEM or
MEM/WB) back to the EX stage.
o This minimizes the number of required stalls, although some hazards (e.g., load–use) may still
force a stall.
Data Hazard Example: Suppose Instruction I1 loads data from memory, and Instruction I2 in the next
cycle needs that data. If I2 starts in the ID stage before I1’s data reaches the WB stage, the hazard
detection unit issues a stall. The PC and IF/ID register do not update, and a NOP is injected into the
pipeline to delay I2 until the data is available.
Control Hazard Example: For a branch instruction, if the branch decision is not resolved early, the
instructions fetched after the branch may need to be canceled or stalled, wasting cycles.
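A minimal sketch in C of the load–use check an HDU performs for cases like the data hazard example above, assuming the usual MIPS pipeline-register fields (all names here are illustrative):
c
#include <stdbool.h>
#include <stdio.h>

/* Fields of interest from two pipeline registers. */
struct IDEX { bool MemRead; int rt; };   /* instruction currently in EX */
struct IFID { int rs; int rt; };         /* instruction currently in ID */

/* Classic load-use hazard: the instruction in EX is a load whose destination
   (rt) is a source of the instruction now being decoded. If true, freeze the
   PC and IF/ID register and insert a bubble into ID/EX. */
static bool load_use_hazard(struct IDEX ex, struct IFID id) {
    return ex.MemRead && (ex.rt == id.rs || ex.rt == id.rt);
}

int main(void) {
    struct IDEX lw  = { .MemRead = true, .rt = 8 };   /* lw  $t0, 0($t1) */
    struct IFID add = { .rs = 8, .rt = 11 };          /* add $t2, $t0, $t3 */
    printf("stall needed: %s\n", load_use_hazard(lw, add) ? "yes" : "no");
    return 0;
}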
Below is an ASCII diagram illustrating a five-stage pipeline along with the hazard detection and stall insertion
logic:
+------------------------+
| Instruction Memory |
| (Fetch Instruction) |
+-----------+------------+
|
v
+-----------------+
| IF Stage |
| (Fetch & PC MUX)|
+-----------------+
|
[IF/ID Pipeline Register]
|
v
+-----------------+
| ID Stage |
| (Decode & Reg |
| Read) |
+-----------------+
| ^
| | Hazard
v | Detection Unit
+-----------------+ (Monitors reg dependencies)
| ID/EX Pipeline |
| Register |<--------- Stall Signal ----+
+-----------------+ |
| |
v | Freeze IF/ID
+--------------+ | & PC update
| EX Stage | |
| (Execute) | |
+--------------+ |
| |
+----------------+----------------+ |
| | | |
v v v |
+-------------+ +-------------+ +-------------+ |
| EX/MEM | | Control/ | | MEM | |
| Pipeline | | Status Unit | | Stage | |
| Register | +-------------+ | (Data Mem) | |
+-------------+ +-------------+ |
| ^ |
v | |
+-------------+ | |
| MEM/WB | | |
| Pipeline | | |
| Register | | |
+-------------+ | |
| | |
v | |
+-------------+ | |
| WB Stage | | |
| (Write Back)| | |
+-------------+ | |
| |
[Stall/Bubble Insertion] <-------+-----------+
Explanation:
Normal Operation: Instructions flow from IF through ID, EX, MEM to WB using intermediate
pipeline registers.
Hazard Detection: The hazard detection unit monitors instructions in the ID stage for dependencies
against those in later stages. When a hazard (for example, a load–use hazard) is detected, it sends a stall
signal.
Stall Insertion: When the stall signal is asserted, the control logic:
o Freezes the PC and the IF/ID register, thereby halting the pipeline fetch.
o Inserts a bubble (NOP) into the ID/EX register, delaying the execution of the dependent
instruction until the hazard is resolved.
Legend:
- The forwarding unit detects that I2’s operand (in its EX stage) is the destination of I1.
- Data is forwarded from I1’s MEM stage (or EX, if available) to the ALU input for I2.
- Multiplexers at I2’s ALU input select between the register file output and the forwarded
result.
Detailed Explanation:
I1 Execution: I1 calculates the result in its EX stage. Instead of waiting until WB to
write the result into the register file, the value can be used immediately.
I2 Dependency: I2 enters its EX stage and requires the operand that I1 is about to
produce. The forwarding unit checks and identifies that I1’s result should be
forwarded.
MUX Operation: A multiplexer at the ALU input of I2 selects the forwarded value
from I1 (coming from the EX/MEM pipeline register) rather than the stale value from
the register file.
Impact: This bypassing minimizes stalls and increases throughput by keeping the
pipeline full and reducing wait cycles.
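The forwarding decision itself can be sketched in C as follows (the EX/MEM-over-MEM/WB priority mirrors the rule described above; the names and MUX encodings are illustrative):
c
#include <stdbool.h>
#include <stdio.h>

enum FwdSel { FROM_REGFILE = 0, FROM_MEMWB = 1, FROM_EXMEM = 2 };

/* Forwarding selection for one ALU source register of the instruction in EX. */
static enum FwdSel forward(int src_reg,
                           bool exmem_regwrite, int exmem_rd,
                           bool memwb_regwrite, int memwb_rd) {
    /* Prefer the most recent producer (EX/MEM) over the older one (MEM/WB). */
    if (exmem_regwrite && exmem_rd != 0 && exmem_rd == src_reg)
        return FROM_EXMEM;
    if (memwb_regwrite && memwb_rd != 0 && memwb_rd == src_reg)
        return FROM_MEMWB;
    return FROM_REGFILE;
}

int main(void) {
    /* I1 (now in MEM) wrote register 9; I2 (now in EX) reads register 9. */
    enum FwdSel sel = forward(9, true, 9, false, 0);
    printf("ALU input mux select = %d (2 = forward from EX/MEM)\n", sel);
    return 0;
}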
11. Express the simple data path with control unit and modified data path to
accommodate pipelined executions with a diagram
Answer :
1. Simple (Single‐Cycle) Datapath with Control Unit
In a simple single-cycle processor, an instruction is fetched, decoded, executed, and its result
written back in one clock cycle. All functional units work in a sequential “data‐path” that
directly reflects the instruction’s requirements. The main components include:
Program Counter (PC): Holds the address of the next instruction.
Instruction Memory: Retrieves the instruction using the PC.
Instruction Register (IR): Temporarily holds the fetched instruction.
Control Unit: Decodes the instruction’s opcode (and other fields) to generate control
signals that steer the data—for example, to choose ALU inputs or select between
register file or immediate data.
Register File: Contains the processor’s general registers. It supplies operands (using
read ports) and accepts the result (using write port).
Sign-Extension Unit: Extends immediate fields (from 16 bits to 32 bits) when
needed.
ALU (Arithmetic Logic Unit): Performs arithmetic or logic operations.
Data Memory: Accessed during load/store instructions; reads or writes data.
Multiplexers (MUXes): Switch between alternative data sources (e.g., ALUSrc MUX
picks between a register operand and an immediate) or select the proper destination
register (RegDst MUX).
Diagram: Simple (Single-Cycle) Datapath
plaintext
+-----------------------------+
PC ------------>| Instruction Memory | <-- Fetch instruction
+--------------+--------------+
|
V
+----------------------+
| Instruction Reg |
+----------------------+
|
V
+---------------------------------------+
| Control Unit |
| (Generates signals: RegDst, ALUSrc, |--+
| ALUOp, MemRead, MemWrite, MemtoReg, | | Control signals
| RegWrite, etc.) | |
+---------------------------------------+ |
| |
| |
+---------+---------+ |
| | |
V V V
+----------------------+ +----------------------+
| Register File |<--- read ports ---| Sign Extend Unit |
| (Read Registers Rs, | | (Immediate to 32-bit)|
| Rt; Write Rd) | +----------------------+
+---------+------------+
| \
read data1| \read data2
| \
| \
| +-------------+ +--------------------------------+
+----->| ALUSrc |<---------| Control signal ALUSrc |
| MUX | | (Selects second ALU operand) |
+------+------+ +--------------------------------+
|
V
+-------+
| ALU | <--- ALUOp (arithmetic/logic control)
+---+---+
|
V
+---------+
| Data |
| Memory | <--- MemRead/MemWrite control signals
+---------+
|
V
+---------------+
| Write Back |<--- MUX selecting output: ALU result or
| (Register File| Data Memory output as per MemtoReg
| Write) | control signal.
+---------------+
Explanation:
The PC supplies the address to the Instruction Memory, which returns an instruction;
that instruction is stored in the Instruction Register.
The Control Unit decodes the instruction, generating signals that drive multiplexers
and direct operations in the ALU, Data Memory, and Register File.
The Register File supplies operands to the ALU and receives write-back data.
Multiplexers (such as the ALUSrc MUX and MemtoReg MUX) are used to select
alternative data paths according to the type of instruction.
All operations complete in one clock cycle in this design.
2. Modified Datapath for Pipelined Execution
To increase throughput, the processor is modified into a pipelined design that overlaps the
execution of multiple instructions. The stages are generally divided into:
1. IF (Instruction Fetch)
2. ID (Instruction Decode / Register Read)
3. EX (Execution / Computation)
4. MEM (Memory Access)
5. WB (Write Back)
Each stage is separated by pipeline registers that hold the intermediate values. The control
unit remains, but additional hazard detection, forwarding, and stall mechanisms may be used
(not shown in full detail in our diagram).
Key Modifications:
Pipeline Registers: Insert registers between IF/ID, ID/EX, EX/MEM, and MEM/WB
stages. These hold the intermediate data (instruction fields, operand values, ALU
result, etc.).
Stage-Specific Control: Control signals generated in one stage are passed along in
pipeline registers to ensure that each instruction carries the necessary control
information.
Increased Complexity: The pipeline requires careful handling of hazards such as
data, control, and structural hazards through techniques like forwarding and stalling;
however, the basic concept is partitioning the datapath into distinct stages operating
concurrently.
Diagram: Pipelined Datapath with Pipeline Registers
plaintext
----------------------- ----------------------- -----------------------
-----------------------
| IF Stage | | ID Stage | | EX Stage | | MEM Stage |
----------------------- ----------------------- -----------------------
-----------------------
| Instruction Fetch | -->| Instruction Decode | -->| ALU Operation | -->| Data
Memory Access |
----------------------- ----------------------- -----------------------
-----------------------
| | | |
V V V V
+--------------+ +--------------+ +--------------+ +--------------+
| IF/ID Reg | ------------>| ID/EX Reg | ------------>| EX/MEM Reg | ------------>|
MEM/WB Reg |
+--------------+ +--------------+ +--------------+ +--------------+
(Holds control, (ALU result, & (Data from Mem or
register data, etc.) forwarding signals) ALU result, etc.)
|
V
+--------------+
| WB Stage |
| (Write Back |
| to Reg File)|
+--------------+
Detailed Explanation:
IF Stage: The Program Counter (PC) and Instruction Memory function as in the
simple design. The fetched instruction is stored in the IF/ID pipeline register.
ID Stage: In this stage, the instruction is decoded and operands are read from the
Register File. Meanwhile, the Control Unit generates the control signals (which are
stored in the ID/EX pipeline register) along with the register read data.
EX Stage: The ALU operates on the operands received (with possible modifications
from forwarding paths). The ALU result and any control signals are passed on to the
EX/MEM pipeline register.
MEM Stage: For load and store instructions, the Data Memory is accessed. The
resulting data and/or the ALU result (for other instructions) are stored in the
MEM/WB pipeline register.
WB Stage: Finally, the correct result is written back to the Register File.
The pipelined datapath allows concurrent execution of different parts of multiple instructions
—thus increasing throughput. However, extra logic like hazard detection units and
forwarding paths (not fully depicted here) are needed to handle dependencies between
instructions.
12.With a suitable set of sequence of instructions show what happens when the branch
is taken, assuming the pipeline is optimized for branches that are not taken and that we
moved the branch execution to the stage.
Answer :
1. Instruction Sequence Example
Consider the following (MIPS-like) sequence of instructions:
mips
I1: add $t0, $t1, $t2 ; Compute $t0 = $t1 + $t2
I2: beq $t0, $zero, L1 ; Branch to label L1 if $t0 == 0
I3: sub $t3, $t4, $t5 ; Instruction following branch (fetched speculatively)
...
L1: or $t6, $t7, $t8 ; Branch target instruction at label L1
Assumptions:
The pipeline is optimized for branches that are not taken: By default, the hardware
predicts that branches will not be taken so that sequential instructions (like I3) are
fetched immediately.
The branch decision (i.e., the branch “execution”) has been moved to an early stage
(typically EX) to minimize penalty when the branch is not taken.
In this example, although the hardware predicts “not taken,” the branch is actually
taken because I2 finds that $t0 equals zero.
2. What Happens When the Branch Is Taken
1. Speculative Fetch under “Not Taken” Assumption:
o I1 is fetched and executed normally.
o I2 (the BEQ instruction) is fetched and starts its journey through the pipeline.
o Because the processor is optimized for “branch not taken,” the next sequential
instruction (I3) is fetched speculatively as if the branch will not occur.
2. Branch Evaluation in the EX Stage:
o When I2 reaches the EX stage, its branch condition (whether $t0 equals zero)
is evaluated.
o In our case, the condition is met—so the branch is taken.
o At this point, the branch target address (label L1) is computed.
3. Flushing Incorrectly Fetched Instructions:
o Because I3 was fetched under the wrong assumption (that the branch would
not be taken), it is now recognized as a mis-speculated instruction.
o The control logic flushes (or cancels) I3 (and any later instructions also
fetched along the wrong path).
o The PC is then updated to the branch target address (address of L1).
4. Resuming at the Correct Target:
o New instructions are fetched starting at label L1.
o The pipeline continues execution from the branch target.
3. Pipeline Timeline Diagram
Below is an ASCII timeline diagram that shows the progress of the instructions through a
classic five-stage pipeline (IF, ID, EX, MEM, WB). Assume that the branch decision is made
in the EX stage.
plaintext
Pipeline Stages: IF ID EX MEM WB
Cycle 1: I1: IF
-------------------------------
Cycle 2: I1: ID | I2: IF
-------------------------------
Cycle 3: I1: EX | I2: ID | I3: IF <-- I3 is speculatively fetched!
-------------------------------
Cycle 4: I1: MEM | I2: EX* | I3: ID <-- I2 EX: Branch decision is made:
------------------------------- condition true → branch taken.
(I3 is flushed since branch is taken)
Cycle 5: I1: WB | I2: MEM | L1: IF <-- Fetch from branch target address L1.
-------------------------------
Cycle 6: | I2: WB | L1: ID
-------------------------------
Cycle 7: | | L1: EX ... (and so on)
Notes on the Diagram:
Cycle 2–3:
o I1 proceeds normally.
o I2 is fetched and decoded.
o I3 is fetched because the hardware assumes “branch not taken.”
Cycle 4:
o I2 is in the EX stage and the branch condition is evaluated (marked with *).
o Since the branch is taken, I3 (which is in the ID stage) is no longer valid and
will be flushed.
Cycle 5 onward:
o The pipeline begins fetching instructions from the correct branch target—L1
—as indicated by the updated PC.
o The instructions following the branch target (at L1) now enter the pipeline.
Cycle 1: I1: IF
Cycle 2: I1: ID | I2: IF
Cycle 3: I1: EX | I2: ID | I3: IF
Cycle 4: I1: MEM | I2: EX | I3: ID | I4: IF
Cycle 5: I1: WB | I2: MEM | I3: EX | I4: ID | I5: IF
Cycle 6: I2: WB | I3: MEM | I4: EX | I5: ID | I6: IF
Explanation:
o After the pipeline is full (from cycle 5 onward), one instruction is finished
every cycle, even though each instruction still takes 5 stages (or cycles,
ignoring pipeline overhead) to flow through.
o The clock cycle can be shorter (if each stage is optimized) than the worst-case
delay of the single-cycle design, and the overall throughput is higher.
Performance Implication: The ideal throughput for a pipelined processor is near one
instruction per cycle. However, the actual performance can be reduced by hazards and
pipeline stall cycles.
Part (ii): Advantages and Limitations of Pipelining (with Overcoming Methods)
Advantages of Pipelining Over Single-Cycle
1. Increased Throughput:
o With overlapping instruction execution, a fully loaded pipeline ideally
completes one instruction per cycle.
2. Faster Clock Cycle:
o As the execution is divided into shorter stages, the clock period can be reduced
(since each stage performs a fraction of work).
3. Efficient Hardware Utilization:
o Functional units are continuously active as different instructions occupy
different stages.
4. Higher Performance Potential:
o Overall, pipelining leads to better instruction throughput compared to waiting
for long worst-case cycles in a single-cycle design.
Limitations of Pipelining a Processor’s Datapath
1. Pipeline Hazards:
o Data Hazards: Dependencies between instructions can cause stalls if an
instruction requires a result that isn’t yet available.
o Control Hazards: Branch instructions may lead to mispredicted paths and
require the pipeline to be flushed.
o Structural Hazards: Resource conflicts occur when hardware units needed
by different stages overlap.
2. Increased Complexity and Overhead:
o Additional pipeline registers and control logic (hazard detection, forwarding
logic, branch prediction, etc.) increase design complexity, power consumption,
and potential clock cycle overhead.
3. Pipeline Stalls and Bubbles:
o Hazards can cause idle cycles (stalls) that reduce the ideal throughput.
4. Difficulty in Handling Non-Uniform Instructions:
o Some instructions (e.g., memory operations) may have variable latency,
complicating pipeline balancing.
Methods to Overcome Limitations
1. Operand Forwarding and Bypassing:
o Forward data directly from one pipeline stage (e.g., EX/MEM) to an earlier
stage (e.g., EX), reducing data hazard stalls.
2. Hazard Detection Units:
o Implement hardware that detects potential hazards early and stalls the pipeline
only when necessary.
3. Branch Prediction and Delay Slot Scheduling:
o Use dynamic branch prediction to reduce control hazards. Alternatively,
restructure code to fill branch delay slots.
4. Superscalar and Out-of-Order Execution Techniques:
o With clever scheduling (dynamic issue), instructions can be reordered to avoid
stalls.
5. Pipeline Balancing and Optimization:
o Split stages evenly so that no stage becomes a bottleneck, and use pipeline
registers optimized for minimal overhead.
Diagram: Pipelined Datapath with Hazard Resolution Techniques
plaintext
+----------------------------------------------+
| Pipelined Datapath |
+----------------------------------------------+
| IF | ID | EX | MEM | WB |
+---------+---------+---------+---------+------+
| Inst | Control | ALU | Data | Write|
| Fetch | / Reg | Ops | Memory | Back |
+---------+---------+---------+---------+------+
| | | |
V V V V
+------------------------------------------------+
| Pipeline Registers (IF/ID, ID/EX, EX/MEM, |
| MEM/WB) with Hazard Detection |
+------------------------------------------------+
| ^
| |-- Operand Forwarding Path
V
[ Hazard Detection Unit & Branch Predictor ]
Explanation of Diagram:
The diagram shows a classic five-stage pipeline.
Pipeline registers separate each stage and carry control information.
The hazard detection unit monitors for data and control hazards and, along with
operand forwarding logic, minimizes stalls.
A branch predictor is incorporated to reduce control hazards by guessing branch
behavior early.
PART -C
1.Assume the following sequence of instructions are executed on a 5 stage pipelined
processor
I1: or r1, r2, r3
I2: or r2, r1, r4
I3: or r1, r1, r2
i) Indicate dependences and their type.
ii) Assume there is no forwarding in this pipelined processor. Indicate hazards and
add NOP instructions to eliminate them.
iii) Assume there is full forwarding. Indicate hazards and add NOP instructions to
eliminate them.
Answer :
i): Data Dependences and Their Types
1. Between I1 and I2: • I1 writes to r1 in WB. • I2 reads r1 (as a source
operand) in its ID stage. → Hazard: Read‐After‐Write (RAW) on r1.
2. Between I2 and I3: • I2 writes to r2 in WB. • I3 reads r2 as a source
operand in its ID (and uses it in EX). → Hazard: RAW on r2.
3. Between I1 and I3: • I1 writes to r1 and I3 also reads r1 as a source operand.
→ Hazard: RAW on r1 (I3 depends on the original r1 value produced by I1, even
though I2 does not write r1).
Thus, the sequence has two independent RAW hazards:
I1 → I2: Hazard on r1.
I2 → I3: Hazard on r2.
(And additionally, I3’s read of r1 depends on I1.)
Part (ii): No Forwarding Case – Hazards and NOP Insertion
Key Issue: Without forwarding, a consuming instruction must wait until the producer writes
its result to the register file in the WB stage. That is, the ID stage (which reads the operands)
must be delayed until after the WB of the instruction that produces the needed data.
Timing Requirements
I1: • IF: Cycle 1 • ID: Cycle 2 • EX: Cycle 3 • MEM: Cycle 4 • WB: Cycle 5
→ The new value of r1 is written in cycle 5.
I2 Dependency (on r1): I2 must read r1 in ID after cycle 5. → Its ID stage must be
scheduled at cycle 6 or later.
I2 → I3 Dependency (on r2): Suppose we reschedule I2 so that its IF starts in cycle
5: • I2: IF=Cycle 5, ID=Cycle 6, EX=Cycle 7, MEM=Cycle 8, WB=Cycle 9 → The
new value of r2 will be available in cycle 9. I3 then must have its ID stage no earlier
than cycle 10.
Inserting NOPs
Without stalling, the processor would normally fetch instructions one per cycle. However, to
avoid the hazards we must delay I2 and I3 as follows:
Between I1 and I2: In an unstalled pipeline, I2 would be fetched in cycle 2. To have
I2’s ID stage occur in cycle 6, we delay its IF to cycle 5. Thus, insert 3 NOPs after
I1.
Between I2 and I3: With I2’s IF in cycle 5, I2’s WB is in cycle 9. To ensure I3’s ID is
after cycle 9, delay I3’s IF to cycle 9. Thus, insert 3 NOPs after I2.
Pipeline Timeline Without Forwarding
A simplified timeline is shown below:
With No Forwarding (NOPs inserted)
------------------------------------------------------------------------
Cycle: 1 2 3 4 5 6 7 8 9 10 ...
------------------------------------------------------------------------
I1: IF -> ID -> EX -> MEM -> WB
NOP: NOP -> NOP -> NOP (3 NOPs inserted)
I2: IF -> ID -> EX -> MEM -> WB
NOP: NOP -> NOP -> NOP (3 NOPs inserted)
I3: IF -> ID -> EX -> MEM -> WB
------------------------------------------------------------------------
I1: Completes WB in cycle 5 so that I2’s ID in cycle 6 reads updated r1.
I2: Completes WB in cycle 9 so that I3’s ID in cycle 10 reads updated r2 (and r1
from I1).
Total NOPs inserted: 3 between I1 and I2, and 3 between I2 and I3.
Part (iii): Full Forwarding Case – Hazards and Resolution
When full forwarding is available, the ALU result computed in the EX stage can be directly
forwarded to a subsequent instruction’s EX stage. Thus, the register file update (WB) delay is
masked by bypassing.
How Forwarding Resolves the Hazards
I1 → I2 (on r1): I1 computes r1 in its EX stage (cycle 3) and the result is available
for forwarding. I2’s EX stage occurs in cycle 4 (if fetched in the normal order). With
forwarding, I2 receives the value immediately from I1’s EX or MEM stage. → No
stall is required.
I2 → I3 (on r2): Similarly, I2 computes r2 in its EX stage (cycle 4 if I2 is fetched
immediately after I1) and can forward that result to I3’s EX stage (cycle 5). → No
stall is required.
I1 → I3 (on r1): I3’s use of r1 is also handled by forwarding from I1’s EX stage if
needed. → No stall is required.
Ideal Pipeline Schedule with Full Forwarding
A typical schedule (assuming back-to-back fetch) is:
With Full Forwarding (No NOPs needed)
------------------------------------------------------------------------
Cycle: 1 2 3 4 5 6 7
------------------------------------------------------------------------
I1: IF -> ID -> EX -> MEM -> WB
I2: IF -> ID -> EX -> MEM -> WB
I3: IF -> ID -> EX -> MEM -> WB
------------------------------------------------------------------------
In this ideal schedule:
I2’s EX stage (cycle 4) gets the value of r1 forwarded from I1’s EX/MEM output.
I3’s EX stage (cycle 5) gets the value of r2 forwarded from I2’s EX/MEM output (and
r1 from I1, if necessary).
Thus, with full forwarding, no NOPs (stalls) are required.
Final Answers Summary
1. Dependences and Types: – I1 → I2: RAW hazard on r1. – I2 → I3: RAW
hazard on r2. – I1 → I3: RAW hazard on r1.
2. No Forwarding: – To ensure that the ID stage for an instruction (which reads
operands) occurs only after the previous instruction’s WB stage, you must insert
stalls. • Insert 3 NOPs between I1 and I2 (delaying I2 so that its ID is in cycle
6, after I1’s WB in cycle 5). • Insert 3 NOPs between I2 and I3 (delaying I3 so
that its ID is after I2’s WB in cycle 9). – Total: 6 NOPs are needed.
3. Full Forwarding: – With a full forwarding mechanism, results computed in the EX
stage are immediately forwarded to the following instruction’s EX stage. – No
NOPs are required since all RAW hazards (I1→I2, I2→I3, and I1→I3) are resolved
by bypassing the data directly.
3. Consider the following code segment in C: A = b + e; c = b + f;
Evaluate the generated MIPS code for this segment, assuming all variables
are in memory and are addressable as offsets from $t0:
lw $t1, 0($t0)
lw $t2, 4($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
Find the hazards in the preceding code segment and reorder the instructions to
avoid any pipeline stalls.
Answer :
The Given Code Segment
The original MIPS code (with a slight interpretation of variable names) is:
lw $t1, 0($t0) # (I1) Load b
lw $t2, 4($t0) # (I2) Load e
add $t3, $t1, $t2 # (I3) Compute A = b + e
sw $t3, 12($t0) # (I4) Store result A
lw $t4, 8($t0) # (I5) Load f
add $t5, $t1, $t4 # (I6) Compute c = b + f
sw $t5, 16($t0) # (I7) Store result c
Variables interpretation: • b is at offset 0, e at offset 4, f at offset 8; • Result A (b + e)
is stored at offset 12; and • Result c (b + f) is stored at offset 16.
Part (i): Identify Data Dependences and Their Types
Examine each instruction for Read-After-Write (RAW) hazards:
1. I1 → I3: – I1 (lw $t1, 0($t0)) loads b into $t1. – I3 (add $t3, $t1, $t2) reads $t1. →
RAW hazard on $t1.
2. I2 → I3: – I2 (lw $t2, 4($t0)) loads e into $t2. – I3 uses $t2 as the second operand. →
RAW hazard on $t2.
3. I1 → I6: – I1 provides $t1 (b) and I6 (add $t5, $t1, $t4) uses $t1 as the first operand.
→ RAW hazard on $t1 (again).
4. I5 → I6: – I5 (lw $t4, 8($t0)) loads f into $t4. – I6 needs $t4 as its second operand.
→ RAW hazard on $t4.
In summary, dependencies exist between: • I1 and I3 (and I6) on $t1; • I2 and I3 on
$t2; and • I5 and I6 on $t4.
Part (ii): Reordering to Avoid Pipeline Stalls (No Forwarding Assumed)
In a simple 5‐stage non‐forwarding pipeline, a loaded value is not available for an
instruction using it until the WB stage. For example, a load instruction’s result (issued
in IF at cycle 1) writes its register in WB at cycle 5. If a subsequent instruction’s ID
occurs earlier than cycle 6, it reads the stale value.
In the original order, the add instructions (I3 and I6) occur too soon after the
corresponding loads (I1, I2, and I5).
A proven reordering technique is to “group” independent instructions together so that
the value–producing loads have enough time to complete before their values are
needed. Notice that both add instructions depend on $t1, and one add depends on $t2
while the other on $t4. Since these loads come from different memory locations, we
can perform all three loads first, then execute the arithmetic operations, and finally
perform the stores.
Reordered Code:
lw $t1, 0($t0) # Load b
lw $t2, 4($t0) # Load e
lw $t4, 8($t0) # Load f
add $t3, $t1, $t2 # Compute A = b + e
add $t5, $t1, $t4 # Compute c = b + f
sw $t3, 12($t0) # Store A
sw $t5, 16($t0) # Store c
Why This Works: – All loads are performed consecutively. By the time we reach the
add instructions, the values for $t1, $t2, and $t4 are already available (or at least are
scheduled far enough apart to avoid hazards in a no‐forwarding design). – No add
instruction immediately follows its corresponding load, so the pipeline has time to
write back the loaded values before they are used. – The stores occur after the
computations, naturally.
If one were to schedule the original instructions without reordering, NOPs (stall
cycles) would be needed. For instance, between lw $t1 and add $t3, at least two
cycles of delay might be necessary. Reordering eliminates the need for these stalls.
Part (iii): Diagram Illustrating the Reordered Sequence in the Pipeline
Below is an idealized pipeline timeline for the reordered code (assuming one
instruction is fetched per cycle):
Pipeline Stages: IF → ID → EX → MEM → WB
Reordered Code:
I1: lw $t1, 0($t0)
I2: lw $t2, 4($t0)
I3: lw $t4, 8($t0)
I4: add $t3, $t1, $t2
I5: add $t5, $t1, $t4
I6: sw $t3, 12($t0)
I7: sw $t5, 16($t0)
Answer :
Loop Instructions
1. lw r1, 0(r1) - Load word into r1 from memory
2. and r1, r1, r2 - Perform bitwise AND operation on r1 and r2
3. lw r1, 0(r1) - Load word into r1 from memory
4. lw r1, 0(r1) - Load word into r1 from memory
5. beq r1, r0, loop - Branch to loop if r1 equals r0
Pipeline Execution Diagram
For the third iteration:
Cycle | Instr 1 | Instr 2 | Instr 3 | Instr 4 | Instr 5
  1   |   IF    |         |         |         |
  2   |   ID    |   IF    |         |         |
  3   |   EX    |   ID    |   IF    |         |
  4   |   MEM   |   EX    |   ID    |   IF    |
  5   |   WB    |   MEM   |   EX    |   ID    |   IF
  6   |         |   WB    |   MEM   |   EX    |   ID
  7   |         |         |   WB    |   MEM   |   EX
  8   |         |         |         |   WB    |   MEM
  9   |         |         |         |         |   WB
In the above table:
IF = Instruction Fetch
ID = Instruction Decode
EX = Execute
MEM = Memory Access
WB = Write Back
Instructions in the Pipeline During These Cycles
During cycle 1, only the first instruction (lw r1, 0(r1)) is in the IF stage.
During cycle 2, the first instruction is in the ID stage, and the second instruction (and
r1, r1, r2) is in the IF stage.
During cycle 3, the first instruction moves to the EX stage, the second instruction is in
the ID stage, and the third instruction (lw r1, 0(r1)) is in the IF stage.
And so on...
4. Plan the pipelining in MIPS architecture and generate the exceptions handled in
MIPS.
Answer :
MIPS Pipeline Overview
MIPS architecture uses a classic 5-stage pipeline:
1. IF (Instruction Fetch): Fetch the instruction from memory.
2. ID (Instruction Decode): Decode the instruction and read the registers.
3. EX (Execution): Perform the operation or calculate an address.
4. MEM (Memory Access): Access memory operand.
5. WB (Write Back): Write the result back into a register.
Planning the Pipelining
Each instruction moves through the stages one at a time, with one instruction entering
the pipeline at each clock cycle. Here's a simple representation for three instructions:
Cycle:     1    2    3    4    5    6    7
Instr 1:   IF   ID   EX   MEM  WB
Instr 2:        IF   ID   EX   MEM  WB
Instr 3:             IF   ID   EX   MEM  WB
memory locations.
For example, if you have an array of numbers and you access each
types of memory based on how fast they are and how much data they can
hold.
Offset (to locate a byte in a block) = 4 bits (since each block is 16 bytes)
Index (to select a block) = 10 bits (since there are 1024 blocks)
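Using those widths, a short C sketch (assuming 32-bit byte addresses; the example address is arbitrary) splits an address into tag, index, and offset:
c
#include <stdio.h>
#include <stdint.h>

/* Split a 32-bit byte address for a cache with 16-byte blocks
   (4 offset bits) and 1024 blocks (10 index bits). */
int main(void) {
    uint32_t addr   = 0x1234ABCD;            /* arbitrary example address */
    uint32_t offset = addr & 0xF;            /* low 4 bits                */
    uint32_t index  = (addr >> 4) & 0x3FF;   /* next 10 bits              */
    uint32_t tag    = addr >> 14;            /* remaining 18 bits         */
    printf("tag=0x%X index=%u offset=%u\n", tag, index, offset);
    return 0;
}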
Storage
Lower (less dense) Higher (more dense)
Capacity
Simple Summary
memory).
DRAM is slower but cheaper and used for main memory (RAM) in
computers.
If the data is not there (cache miss), it must be fetched from RAM.
Example:
Key Point:
A high miss penalty slows down performance, so caches help reduce it.
Rotational Latency
Rotational latency is the waiting time for a hard disk’s spinning platter to
rotate and bring the required data under the read/write head.
How It Works?
2. The read/write head stays in place but waits for the correct sector to rotate
under it.
4.
Key Point:
Faster spinning disks have lower rotational latency, improving data access
speed.
How It Works?
3. If two blocks map to the same cache line, the old block is replaced when
of lines in Cache)
Key Point:
✔ Fast and simple, but if two frequently used blocks map to the same line,
1. Hit Ratio
EAT = (Hit Ratio × Cache Access Time) + (Miss Ratio × Miss Penalty)
Example Calculation
Cache time = 10 ns
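As an illustration of the EAT formula, a small C sketch follows; because the original example is truncated, the hit ratio and miss penalty below are assumed values:
c
#include <stdio.h>

int main(void) {
    /* Assumed example values (only the 10 ns cache time comes from the text). */
    double hit_ratio    = 0.95;
    double cache_time   = 10.0;    /* ns */
    double miss_penalty = 100.0;   /* ns */

    /* EAT = (Hit Ratio x Cache Access Time) + (Miss Ratio x Miss Penalty) */
    double eat = hit_ratio * cache_time + (1.0 - hit_ratio) * miss_penalty;
    printf("Effective Access Time = %.1f ns\n", eat);   /* 0.95*10 + 0.05*100 = 14.5 */
    return 0;
}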
Fragmentation happens when memory is not used efficiently. There are two
types:
too big for what’s needed, wasting space inside each block
Write-back: Data is only written to the cache, and then later, when the
When the processor needs an instruction but it’s not in the cache:
Miss Detected: The system realizes the instruction isn’t in the cache.
memory.
Store in Cache: The instruction is saved in the cache for future use.
Hit Rate: The percentage of memory accesses that result in a cache hit (i.e.,
Miss Rate: The percentage of memory accesses that result in a cache miss
Block placement in cache memory refers to how blocks of data are mapped
line.
The Dirty/Modified bit tells us if the data in the cache has been changed
If the bit is set to 1, it means the data in the cache has been modified, but
the change hasn't been written back to the main memory yet.
If the bit is set to 0, the data in the cache is the same as in the main
short. It's also used to increase the throughput of data transfer between a
The Translation Lookaside Buffer (TLB) is a small, fast memory unit used to
1. Run More Programs: You can run more apps at the same time.
2. Keeps Programs Safe: Each program has its own memory, so they don’t
3. Better Use of Memory: Virtual memory makes sure RAM is used wisely.
6. Faster: Only the parts of programs needed are loaded, making things
quicker.
7. Cheap: Virtual memory lets you do more with less physical memory.
loading only the parts of the program needed at any moment. It helps save
Paging: Divides memory into fixed-size pages and maps them to physical
based on the logical structure of a program, such as code, data, and stack.
It allows for more flexible memory usage but can lead to fragmentation
Memory Access.
DMA allows devices to directly read/write to memory without involving the
3. The DMA controller moves the data between the device and memory.
4. When done, the DMA controller tells the CPU, and the CPU takes control
back
PART-B
Parallelism is the technique of performing several computations at the same time. It helps the computer work faster by dividing a job into smaller parts and
working on them simultaneously.
Types of Parallelism:
Data Parallelism
o What it is: The same task is done on different pieces of data at the
same time.
o What it is: Different tasks are done at the same time. Each task is
independent.
fetched or decoded.
4. Thread-Level Parallelism
5. Bit-Level Parallelism
o What it is: Operations are done on multiple bits at the same time.
just 1 bit.
6. Pipeline Parallelism
o What it is: A task is broken into stages, and each stage is worked
decodes it, and another executes it, all at the same time.
parallelism.
Characteristics of ILP:
ILP allows the CPU to run multiple instructions at the same time,
2. Pipelining
3. Out-of-Order Execution
Instructions that don’t depend on each other can be executed out of order
to save time.
4. Superscalar Architecture
Modern CPUs can handle more than one instruction at a time using
5. Faster Processing
instructions at once.
Limitations of ILP:
1. Data Dependency
2. Branching
3. Hardware Limits
4. Resource Conflicts
5. Complex Scheduling
The system has to figure out which instructions can be run together,
6. Limited Improvement
As you try to run more instructions in parallel, the benefits may start to diminish.
Virtual memory is a memory-management technique that uses part of secondary storage (a hard
drive or SSD) as an extension of RAM. This means that a computer can run larger programs, or
more programs at once, than the physical RAM would normally allow.
In simple terms, virtual memory enables the system to swap data between
RAM and storage when needed, providing the illusion of more memory
2. Improved Multitasking
3. Memory Isolation
It ensures that each process in the system operates in its own separate
used more efficiently, only loading parts of programs that are currently
5. Error Handling
6. Cost-Effective
7. Security
Virtual memory isolates processes from each other, which prevents one
process from accessing or damaging another’s memory. This improves
program accesses memory, the system needs to translate the virtual address
translation is usually done through a page table. The TLB stores recent address
memory.
2. TLB Check: The TLB is checked to see if the translation for that virtual
3. Page Table Check: If the translation is not in the TLB (a "TLB miss"), the
1. Faster Address Translation: The TLB reduces the time spent in translating
3. Efficient Use of Virtual Memory: TLB helps manage large memory spaces
Conclusion:
The TLB is a key component in speeding up memory access by caching
in architecture design.
o Relevance: Used in cache memory due to its speed. It's crucial for
consumes more power than other memory types. It’s used in small
3. Flash Memory:
is off.
o Relevance: Used for storage in devices like SSDs and USB drives.
o Relevance: PCM could replace both DRAM and Flash due to its
resistance in a material.
store data.
systems for both fast memory and long-term data storage, reducing
7. 3D XPoint:
Capacity:
increasing capacity often comes with trade-offs in cost, speed, and power
consumption.
Speed (Latency):
Definition: The time it takes to access data from memory, typically
measured in nanoseconds.
(e.g., cache vs. main memory) use different memory types to balance
Volatility:
Definition: Whether the memory retains data when the system is powered
off.
Relevance: Volatile memory (like DRAM) loses data when power is off
Cost:
Definition: The price per unit of memory (e.g., cost per GB).
Power Consumption:
Definition: The amount of power the memory consumes during
operation.
laptops. Technologies like MRAM and ReRAM are designed for low
Access Time:
access data.
Relevance: Faster access times (e.g., with SRAM) are important for tasks
time memory like DRAM is used for larger storage at lower costs.
4. Apply how Internal Communication Methodologies is useful in
components (like the CPU, memory, and I/O devices) exchange data within the
What it is: A bus is a shared path used for transferring data between
computer. The width and speed of the bus (e.g., 32-bit or 64-bit) affect
memory, from fast but small (like cache) to slower but large (like hard
drives).
ensures that frequently used data is accessed faster, reducing delays and
improving performance.
Chip (NoC)
Network-on-Chip (NoC) is a
efficiency.
5. Communication Protocols
What it is:
Communication
data is exchanged
together.
What it is: Scalability means a system can grow in size and capability
communication systems are essential. Low latency ensures that tasks, like
extend battery life while keeping the device fast and responsive.
some parts fail. Error handling ensures any issues are detected and
corrected.
applications.
detection ensures that data remains correct, even when failures occur.
5. i).Demonstrate the DMA controller. Discuss how it improves the
Performance
allows the CPU to focus on other tasks while data is being transferred.
How DMA Works:
1. The CPU tells the DMA controller where to move data (from an I/O
2. The DMA controller handles the data transfer on its own, without
3. Once the transfer is done, the DMA controller tells the CPU through an
interrupt.
o Without DMA, the CPU would need to manage every data transfer,
o With DMA, the CPU doesn't have to worry about moving the data
and can focus on other tasks, making the system more efficient.
o DMA transfers data directly between the I/O device and memory,
video streaming.
3. Better Use of System Resources:
CPU’s time.
o This makes the whole system run more smoothly and faster.
4. Lower Latency:
system.
o Since the CPU isn't busy handling the data transfer, it can work on
system performance.
Example:
Imagine you're copying a large file from a hard drive to memory. Without
DMA, the CPU would be involved in moving each bit of data, making
everything slower. With DMA, the DMA controller moves the data directly,
allowing the CPU to continue working on other tasks (like opening a web
reduces the CPU workload, and makes the system more efficient. This is
especially important for tasks that require transferring large amounts of data,
between the memory and peripheral devices (such as hard drives, printers,
network interfaces) without involving the CPU. This process allows data to
move quickly and efficiently, freeing up the CPU to perform other tasks. Here's
1. Initialization by CPU:
a peripheral device).
memory).
and memory.
o The CPU is bypassed during this data transfer, meaning the CPU
processing time.
4. Interrupt Notification:
finished.
o The CPU can now process the data that has been transferred or
For example, when you download a file from the internet, the data comes from
the network card (a peripheral device) and is stored in memory. The DMA
controller will handle transferring the data from the network card directly to
Reduces CPU workload: The CPU can focus on other tasks while DMA
Conclusion:
CPU and making data transfers faster and more efficient. By handling data
6. Point out the need for cache memory. Explain the following three mapping techniques:
i) Direct, ii) Associative, iii) Set-associative.
Cache memory is a small, very fast memory that sits between the CPU and the
main memory (RAM). The main reason for using cache memory is to speed up
data access. Main memory (RAM) is slower compared to the CPU. When the
CPU needs data, it has to fetch it from RAM, which can take a lot of time.
However, many times, the CPU repeatedly needs the same data from memory.
Cache memory stores this frequently accessed data and instructions, allowing
o To reduce the time the CPU spends waiting for data from memory.
There are three common ways to map data from the main memory to cache
memory:
1. Direct Mapping
2. Associative Mapping
3. Set-Associative Mapping
1. Direct Mapping:
What it is: In direct mapping, each block of memory is mapped to exactly one
specific line in the cache. This means that there is a fixed location for each
memory block.
Cache Line Number = Main Memory Block Number % Number of lines in Cache
2. Associative Mapping
cache line. The cache checks all its lines to find the required data when
3. Set-Associative Mapping
Associative Mapping. The cache is divided into sets, and each set has
multiple cache lines. Each memory block is assigned to a specific set, but
within that set, the block can be stored in any of the cache lines.
Cache Set Number = Main Memory block number % Number of sets in cache
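A one-line computation of this mapping in C (the block number and number of sets are assumed for illustration):
c
#include <stdio.h>

int main(void) {
    unsigned block_number = 1234;   /* assumed main-memory block number */
    unsigned num_sets     = 256;    /* assumed number of sets in the cache */
    /* Cache Set Number = block number mod number of sets */
    printf("set = %u\n", block_number % num_sets);   /* 1234 mod 256 = 210 */
    return 0;
}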
7.Evaluate the features of Bus Arbitration-Masters and Slaves.
I/O devices, etc.) share the same communication bus. Since only one device can
use the bus at any time, arbitration ensures that each device gets access in a
In a bus system:
Masters are devices that request the bus to send or receive data.
Slaves are devices that respond to the master's requests but do not
Masters: Features
Controls the Bus: When the master has control over the bus, it can read
communicate with (e.g., memory, I/O devices), what data to send, and the
Example:
In a computer system with a CPU and a disk controller, the CPU is a master
because it controls the data flow and requests access to memory. Similarly, the
disk controller may also be a master when it needs to read from or write to a
disk.
Slaves: Features
A slave is a device that responds to the master's requests but does not initiate
Passive Role: The slave is passive; it does not request the bus. It only
Fixed Tasks: Slaves usually have specific, predefined tasks (e.g., storing
Example:
In the same system, the RAM is a slave. The master (CPU) requests data from
the RAM, and the RAM only responds when asked by the CPU.
access. It is responsible for deciding which master gets access to the bus
round-robin schemes.
o Features: The arbiter decides who gets control of the bus based on
centralized arbitration.
Conclusion
Bus arbitration is essential in systems with multiple devices sharing the same
bus. It ensures that one device gets access to the bus at a time, preventing
Masters are devices that control the bus, initiate data transfers, and
Architecture
Bus Structure
1. Data Bus: Carries the actual data being transferred between devices.
2. Address Bus: Carries the address of the memory location or device involved in the transfer.
3. Control Bus: Manages the flow of data by sending control signals, such
as read/write and timing signals.
The bus protocol defines the rules for communication between the devices on
the bus. It ensures that the devices can transfer data reliably and in the correct
sequence.
Timing: Defines when the devices send and receive data, using signals such as a clock or handshake signals.
Bus control is responsible for managing the access to the bus, ensuring that only
one device uses it at a time. This is usually managed by the bus controller.
Architecture
Bus Structure
The bus structure refers to the physical and logical organization of the bus
system, which connects multiple devices (such as CPUs, memory, I/O devices)
1. Address Lines:
Used to carry the address of the location in main memory or the I/O device being accessed.
Unidirectional (addresses flow from the CPU outward).
The number of address lines decides the maximum size of addressable
main memory.
Example: With 16 address lines, 2^16 = 64K locations can be addressed.
2. Data Lines:
Used to carry the binary data between the CPU, memory and I/O.
Bidirectional.
Based on the width of the data bus we can determine the word length of a
CPU.
Example: A CPU with a 32-bit data bus has a word length of 32 bits.
3. Control Lines:
Used to carry control and timing signals, such as read/write signals and
a CPU clock.
Bus Protocol
The bus protocol defines the rules for how devices communicate over the bus:
Timing: Defines when the data on the bus is valid so it can be read by the receiving
device.
Arbitration: If multiple devices want to use the bus, this decides who gets access.
Bus Control
Bus Controller: A component that manages when each device gets to use
the bus.
Control Signals: These are signals that control operations, like whether
the data should be read or written, and timing signals that synchronize the
transfer.
In short:
Bus Structure is the physical setup for transferring data.
Bus Control manages the timing and decides who gets to use the bus.
1. ROM (Read-Only Memory): Non-volatile memory whose contents can only
be read.
2. RAM (Random Access Memory): Data can be both read from and
written to. It is volatile, meaning data is lost when the power is turned off.
o Types: SRAM (Static RAM), which is faster and does not
need refreshing, and DRAM (Dynamic RAM), which is cheaper and denser but needs
periodic refreshing.
o Types: L1, L2, and L3 cache, depending on where they are located
relative to the CPU.
Cache Memory:
Faster Access: Cache memory stores frequently used data, which allows
the CPU to retrieve it quickly.
Improved Performance: Reduces the time the CPU spends waiting for
data from main memory.
Efficiency: By holding data closer to the CPU, cache reduces delays and
improves overall throughput.
Virtual Memory:
Larger Address Space: Lets programs use more memory than is physically installed.
Protection: Keeps each process in its own address space, giving
better security.
Flexibility: Memory can be allocated and managed
dynamically.
In a synchronous bus, a common clock controls the transfer, so
data is sent and received at regular intervals. The clock provides a timing
reference that keeps all devices connected to the bus operating in sync.
Key Features of a Synchronous Bus:
Clock Signal: Every transfer is synchronized
with a common clock signal that is generated by the sending device and
used by both the sending and receiving devices. This ensures that both
devices are in sync and ready to receive or transmit data at the same time.
Transfer Mode: A synchronous bus can use
either the parallel or serial mode of data transfer. In parallel data transfer,
multiple bits of data are transferred simultaneously, while in serial data
transfer the bits are sent one after another.
Control Signals: Control signals make sure
the data is transferred correctly. This can involve the use of signals such
as read/write control signals.
Data Rate: The data transfer rate in synchronous data transfer is typically
limited by the clock frequency and the number of bits that can be
transferred per clock cycle.
Impedance Matching: The bus lines must match the
impedance of the devices to ensure that data is not lost due to reflections.
Advantages of Synchronous Data Transfer
The design procedure is easy. The master does not wait for any
acknowledgement signal from the slave, though the master waits for a
fixed time (one clock period) before assuming the transfer is done.
The slave does not generate an acknowledge signal, though it obeys the
timing rules as per the protocol set by the master or system designer.
Disadvantages of Synchronous Data Transfer
If the slave operates at a slow speed, the master will be idle for some time
during the transfer, which wastes bus cycles.
Asynchronous Bus
An asynchronous bus does not use a common clock; instead, control signals indicate when data can
be sent or received. Data is transferred when both the sender and receiver
are ready, making this bus more flexible but potentially slower compared
to a synchronous bus.
Strobe Control Method
In asynchronous transfer there is no common clock,
which can make it hard to know when to send or read data. Strobe control
solves this problem.
1. Strobe Signal:
o The transmitting device sends a strobe signal along with the data.
o The receiving device waits for the strobe signal to know when the
data on the bus is valid.
o If the strobe signal is sent before (or with) the data, it’s called a leading
strobe.
o If the strobe signal is sent after the data, it’s called a trailing
strobe.
o It helps the receiving device know when to read the data, even if
there is no shared clock.
Handshaking Method
In asynchronous data transfer, devices don’t have a shared clock, so they need
a way to make sure both are ready to send and receive data. Handshaking is
used to do this.
How Handshaking Works:
1. RTS (Request-to-Send):
o The sending device raises an RTS signal to ask whether the receiver is ready.
2. CTS (Clear-to-Send):
o The receiving device responds with a CTS signal, saying it’s ready
to accept data.
3. Data Transfer:
o The sender then places the data on the bus and transmits it.
4. ACK (Acknowledgment):
o The receiver acknowledges once the data has been successfully
transferred.
Flow Control: It also helps control how much data is sent, preventing the receiver from being overwhelmed.
Cache is a small, high-speed memory located between the CPU and main
memory (RAM). It speeds up data
retrieval by reducing the time it takes for the CPU to access data.
Advantages of Cache:
1. Faster Data Access:
o When the CPU needs data, it first checks the cache, which significantly
reduces access time compared to fetching it from
RAM.
2. Improved Performance:
o Faster data access leads to better overall
system performance.
3. Reduced Latency:
o Cache reduces the latency (delay) between the CPU and memory.
6. Energy Efficiency:
o Because the cache is faster, the CPU spends less time accessing the
slower main memory, which saves power.
Cache operation refers to how cache memory works to speed up data retrieval by
storing frequently used data. Cache memory is a small, fast memory located
close to the CPU, and it helps avoid slow access to the main memory (RAM).
Cache operations use two main principles to make the system faster: temporal locality and spatial locality.
1. Temporal Locality:
What it is: If a CPU accesses certain data, it is likely to need that same
data again soon.
Cache operation: The data that was just accessed is stored in the cache
so that the CPU can quickly retrieve it again without needing to fetch it
from main memory.
Example: If you are working on a document and keep accessing the same
sentence, the cache will store that sentence, so next time you need it, it's
retrieved almost instantly.
2. Spatial Locality:
What it is: When the CPU accesses a memory location, not only is that data
stored in the cache, but also nearby data locations, anticipating future
needs.
Example: If you're reading a list of items, once the cache stores the
current item, it may also store the next few items in the list to speed up
future accesses.
Cache Performance
Cache performance is measured by how often the CPU finds data in the cache
(cache hit) vs. how often it has to go to the slower main memory (cache miss).
Cache Hit: When the data is found in the cache, it's called a hit, and access is faster.
Cache Miss: When the data is not found in the cache, it's called a miss,
and the data has to be fetched from the main memory, which is slower.
Hit Ratio:
The hit ratio is a percentage of how often the CPU finds data in the cache. A
higher hit ratio means better performance.
Hit Ratio = Hits / (Hits + Misses)
Miss Ratio:
The miss ratio is the opposite, showing how often the data was not in the cache.
Miss Ratio = Misses / (Hits + Misses) = 1 − Hit Ratio
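A tiny C illustration of the two ratios above; the hit and miss counts used here are made-up numbers:

```c
/* Sketch: computing hit and miss ratios from assumed access counts. */
#include <stdio.h>

int main(void) {
    double hits = 950, misses = 50;            /* assumed counts */
    double hit_ratio  = hits   / (hits + misses);
    double miss_ratio = misses / (hits + misses);
    printf("Hit ratio : %.2f (%.0f%%)\n", hit_ratio, hit_ratio * 100);
    printf("Miss ratio: %.2f (%.0f%%)\n", miss_ratio, miss_ratio * 100);
    return 0;
}
```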
Simultaneous Access: If the CPU can access both cache and memory at
the same time, the average access time is h × tc + (1 − h) × tm, where h is the
hit ratio, tc the cache access time, and tm the main-memory access time.
Explain serial bus architectures with
necessary diagrams.
A serial bus uses a single wire or
fiber to send data one bit at a time. Serial buses can be used to connect devices
to a computer.
The host first sends a packet that identifies which device it wants to
communicate with.
The device responds with a data packet that contains the requested information.
RS-232: A serial bus that defines standards for serial binary communication.
RS-422 and RS-485: Use differential signaling to allow longer distances and
higher speeds.
Point-to-Point Serial Bus Architecture
In this arrangement two devices are connected directly, and
data is transmitted between them without any intermediate devices. This type of
bus architecture is often used for simpler, faster, and more reliable
communication.
Example: USB (Universal Serial Bus)
The USB is another form of serial bus that is commonly used for connecting
peripherals such as keyboards, mice, and storage devices to a host computer (a one-
to-many connection).
Characteristics of USB:
Hot-pluggable: Devices can be connected and removed without restarting the
system.
Data Transfer Modes: USB supports several transfer types: control, bulk,
interrupt, and isochronous.
In Multipoint Serial Bus Architecture, multiple devices (nodes) are connected over
a single bus, allowing communication between any two or more devices on the
network. The bus typically involves a shared data line, meaning all connected devices see the traffic and only the addressed device responds.
Cost-effective: Uses fewer wires and can easily expand by adding more
devices.
Examples include I2C, which connects multiple masters and slaves over
a single shared bus, while SPI allows multiple slave devices but requires
a separate chip-select line for each slave.
Characteristics of SPI:
Magnetic disks store data using a rotating disk coated with a magnetic material.
Common examples include hard drives (HDDs) and older floppy disks.
Structure: They have circular platters that spin, and a read/write head
that moves across the surface to read or write data.
Data Access: Data is accessed randomly, meaning you can directly get
to any piece of data without reading everything before it.
Storage Capacity: Modern hard drives can hold a lot of data, from
hundreds of gigabytes to several terabytes, and capacities keep growing
over time.
Advantages:-
Disadvantages:-
These are less expensive than RAM but more expensive than magnetic
tape memories.
2.Magnetic Tape
Magnetic tape stores data on a long, thin tape coated with magnetic
material.
Storage Capacity: Magnetic tapes can store large amounts of data, often
used for backups and archives.
Cost: Magnetic tapes are cheaper per gigabyte than many other storage
media.
Advantages :
5. It is a reusable memory.
Disadvantages :
1. Data cannot be accessed
randomly or directly; it must be read sequentially.
2. It requires careful storage, i.e., a dust-free environment with suitable
humidity.
3. Stored data cannot be easily updated or modified, i.e., it is difficult to make
updates on the data.
3.Optical Disks
Structure: The disk has tiny pits and flat areas that the laser reads to find
the data.
Key Features:
Data Storage: Optical disks store data as tiny pits and lands on their
surface. A laser is used to read these patterns, which represent binary data
Types:
o CDs hold about 700 MB (used for music and small files).
o DVDs hold about 4.7 GB (used for movies and larger data).
o Blu-ray discs hold 25 GB or more (used for high-definition
movies).
Data Access: They are usually slower than magnetic disks and data is
accessed sequentially.
Durability: They are resistant to magnetic interference but can be scratched or damaged by heat.
An Optical Disk is a storage medium that uses laser technology to read and
write data. It is a flat, circular disk made from materials like polycarbonate, with
a reflective surface. Optical disks are commonly used for storing and sharing
data, as they have a longer lifespan and higher capacity than older technologies
such as floppy disks.
Applications:
Data Storage & Backup: Used for storing and securing data, often for
archival purposes.
Software & Media Distribution: Operating systems, applications, movies, and
games.
Archiving: Long-term preservation of important
data.
Media Sharing: Distributing music, photos, and other
files.
Advantages:
High Storage Capacity: Much more than older technologies like floppy
disks.
Disadvantages:
Fragile: Disks can be scratched or warped by rough handling and high temperatures.
Limited Rewrite Ability: Some disks can only be written once (e.g., CD-
R, DVD-R).
Slower Access: Reading and writing are slower than hard drives or SSDs for large amounts of data.
14. Discuss the following in detail: (i) Input devices (ii) Output devices.
Input devices are hardware devices that allow users to interact with a computer
and provide data or instructions for processing. They serve as the interface
between the user and the computer, allowing users to input text, commands,
audio, video, and more. Here are the key input devices:
1. Keyboard:
o The keyboard is one of the most widely used input devices. It is
used to enter text, numbers, and
commands.
2. Mouse:
o The mouse is a pointing device that moves the cursor on the screen,
letting users point, click, and drag (using one or more
buttons).
o Types: Wired and wireless mice, optical mice (use light to detect
movement), and specialised mice (for gaming or design
applications).
3. Scanner:
o A scanner is used to digitize physical documents, images, and other
visuals into digital format, which can then be stored and processed
by the computer.
o Types: Flatbed scanners, sheet-fed scanners, and handheld scanners (moved over small
areas).
o Uses: Digitizing documents and photos, and barcode
scanning.
4. Microphone:
o A microphone captures sound input (voice commands, recordings, and calls made through the computer).
5. Touchscreen:
o A touchscreen is an input device that allows users to interact
directly with the display by touching it (common in smartphones, tablets, and kiosk
devices).
6. Joystick/Gamepad:
o Provides directional and button input, mainly for games and simulators.
7. Digital Camera/Webcam:
o Digital cameras and webcams are used to capture images and
video, either for storage
or streaming.
o Webcams are commonly used for video calls and capturing still
images.
o Digital cameras are used for photography and content
creation.
Input devices are essential for user interaction with a computer. They
come in many forms, which makes
them versatile for different applications, from everyday tasks like typing
and pointing to capturing images and
audio.
Output devices are hardware components that display, play, or convey the
processed data from a computer to the user in a form that can be perceived, such
as visual, audio, or tactile feedback. They are crucial for users to receive results
from the computer's processing.
1. Monitor:
o The monitor receives video signals from the computer's graphics
card and converts it into a visible image using pixels on the screen.
2. Printer:
o Types: Inkjet printers (good for photos), laser printers (fast and
efficient for text), dot matrix printers (less common, used for
impact printing).
o Uses: Printing documents, photos, and graphics for personal or
professional use.
3. Speakers/Headphones:
o Types: Built-in or external speakers, and wired or
wireless headphones.
o Uses: Listening to music, watching videos,
gaming, etc.
4. Projector:
o Projects the computer's display onto a large screen or wall, used in
classrooms, meetings, and home
theaters.
5. Plotter:
o Produces large, precise line drawings such as engineering plans and maps.
6. Haptic Devices:
o Provide tactile feedback such as vibrations and force
sensations.
o Let users "feel" and manipulate
virtual objects.
o Used in gaming, virtual reality,
and robotics.
Conclusion on Output Devices:
Output devices turn the computer's processed data into forms the user can see, hear, or feel.
Parallel Bus Architecture uses multiple data lines to transmit several
bits of data simultaneously across the bus. It allows for high-speed data
transfer over short distances.
Merits:
2. Low Latency:
o A whole word arrives in one transfer cycle,
making parallel buses efficient for tasks where quick access to data
is essential.
4. Simpler Design:
o The interface logic can be simpler than high-speed
serial buses and bridge-based systems, since multiple data lines are
used in parallel, which suits short-distance, high-bandwidth
applications.
Demerits:
3. Scalability Limitations:
o Speed cannot easily be raised by simply adding more
data paths.
4. Cost:
o More lines mean more pins, wider connectors, and more wiring, which raises cost.
5. Electromagnetic Interference:
o Crosstalk and skew between the many parallel lines limit reliable operation over long
distances.
Bridge-Based (Hierarchical) Bus Architecture uses bridges (such
as bus controllers) that connect different bus segments and manage the
data flow between them. This architecture typically splits the system into
multiple bus levels, such as a fast system bus and slower peripheral buses.
Merits:
1. Mixed Speeds:
o Fast devices (CPU, memory) and slower devices can each use a bus suited to
their speed (e.g., peripheral
devices).
2. Reduced Contention:
o Devices on one segment do not compete with devices on another segment for the same
bus.
3. Scalability:
o New segments can be added through additional bridges.
4. Traffic Segmentation:
o Local traffic stays on its own segment, reducing load on the main bus.
5. Fault Isolation:
o If a failure occurs on one segment, the bridge can
help isolate the fault, preventing it from affecting the entire system.
Demerits:
1. Increased Complexity:
o Bridges must be designed and configured carefully to avoid
bottlenecks.
2. Potential Bottlenecks:
o A heavily loaded bridge can become a choke point and degrade
system performance.
3. Latency:
o Data that crosses a bridge incurs extra delay,
which can slow down the overall data transfer rate, especially if the
transfer passes through several bridges.
4. Cost:
o The additional bridge hardware increases system cost.
Serial Bus Architecture uses a single or few data lines to transmit data
one bit at a time. In contrast to parallel buses, serial buses are designed for fewer wires, longer distances, and very high per-line signalling rates.
Merits:
1. Fewer Physical Lines:
o Serial buses require fewer physical lines, reducing the space and
wiring needed.
2. Long-Distance Communication:
o Serial links work reliably over longer distances than wide parallel buses.
3. Reduced Complexity:
o Fewer lines make serial buses simpler to design and implement,
avoiding skew problems and
complex routing.
4. Lower Cost:
o Fewer wires and connectors reduce cost, and modern serial links still
achieve very high speeds, making them suitable for a wide range of
applications.
Demerits:
1. One Bit at a Time:
o Only one bit is sent per line per clock; although high signalling rates close
this gap, some legacy systems may still experience slower speeds.
2. Higher Latency:
o Data must be serialized at the sender and deserialized at the receiver, adding delay.
3. Limited Bandwidth:
o Despite advancements in technology, serial buses may still offer
less bandwidth than wide parallel buses where very high
throughput is required.
o Extra protocol overhead is needed to address and manage multiple
devices.
2. For a direct mapped cache design with a 32 bit address, the following
bits of the address are used to access the cache. Tag : 31-10 Index: 9-5
Offset: 4-0
iii).Assess what is the ratio between total bits required for such a cache
implementation and the bits used purely for data storage.
The Offset tells us how many bits are used to find the data inside a block.
Here the Offset is 5 bits, so the block size is 2^5 = 32 bytes.
The Index tells us how many lines or entries the cache has. Here, the
Index is 5 bits, so the cache has 2^5 = 32 entries.
The Tag is the remaining 32 − 5 − 5 = 22
bits:
1. Total bits required for cache (each entry includes tag, valid bit, and
data):
Bits per entry = 22 + 1 + 256 = 279 bits
Total bits = 32 × 279 = 8928 bits
2. Data storage bits: Each cache entry stores 32 bytes of data, so:
Data bits = 32 × 32 × 8 = 8192 bits
3. Ratio:
Ratio = 8928 / 8192 ≈ 1.09
Final Answers:
1. Cache block size: 32 bytes
2. Number of cache entries: 32
3. Ratio: 1.09
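The same calculation can be checked with a short C sketch based only on the field widths given in the question (Tag 31-10, Index 9-5, Offset 4-0):

```c
/* Sketch: total cache bits vs. data storage bits for the given field widths. */
#include <stdio.h>

int main(void) {
    int tag_bits = 22, index_bits = 5, offset_bits = 5;

    int block_size_bytes = 1 << offset_bits;        /* 2^5 = 32 bytes        */
    int num_entries      = 1 << index_bits;         /* 2^5 = 32 entries      */
    int data_bits        = block_size_bytes * 8;    /* 256 data bits/entry   */

    int bits_per_entry = tag_bits + 1 + data_bits;  /* tag + valid + data    */
    int total_bits     = num_entries * bits_per_entry;   /* 32 * 279 = 8928  */
    int storage_bits   = num_entries * data_bits;        /* 32 * 256 = 8192  */

    printf("Total bits: %d, data bits: %d, ratio: %.2f\n",
           total_bits, storage_bits, (double)total_bits / storage_bits);
    return 0;
}
```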
Memory Modules
Several small RAM modules can be combined to provide a
system with more memory. Here's how we can do that in an easy way:
1. Memory Banks
What is it?
A memory bank is like a "storage unit" where multiple RAM chips are
grouped together and accessed as one block.
Divide the total memory required into smaller parts (small RAM
modules).
Group these small modules into different banks (like multiple storage
shelves).
Each bank is selected by a part of the memory
address.
Example: To build 4GB of memory we can use 16 RAM modules of
256MB each. These can be grouped into 4 memory banks (each with 4
modules).
Advantages:
Faster memory access because data can be retrieved from multiple banks
at once.
2. Memory Interleaving
What is it?
Consecutive memory addresses are spread across different modules, so each
successive word sits in a different
module.
Example: With 4 modules, interleaving
splits the total memory into 4 blocks, and the memory system accesses
them in parallel.
Advantages:
Faster effective access because consecutive addresses can be fetched from
different modules.
Several modules can be kept busy at
once.
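A small C sketch of how interleaving maps consecutive addresses onto modules; the 4-module count matches the example above and is otherwise an assumption:

```c
/* Sketch: 4-way interleaving sends consecutive word addresses to
 * different modules, so they can be accessed in parallel. */
#include <stdio.h>

int main(void) {
    unsigned int num_modules = 4;                 /* assumed module count   */
    for (unsigned int addr = 0; addr < 8; addr++) {
        unsigned int module = addr % num_modules; /* which module holds it  */
        unsigned int offset = addr / num_modules; /* location inside module */
        printf("Address %u -> module %u, offset %u\n", addr, module, offset);
    }
    return 0;
}
```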
3. Cascading (Chaining Modules)
What is it?
A memory controller makes sure data is stored in the correct module and
retrieved from it correctly. When you connect several
modules, cascading allows you to access them one after the other when
one module becomes full.
Advantages:
A simple way to extend total capacity when more memory is needed.
Disadvantages:
Slower for random access because data isn't stored in one place.
4. Using Ready-Made Memory Modules
What is it?
Small RAM chips come pre-assembled on standard plug-in modules.
Simply install these ready-to-use modules into the available memory slots
on the motherboard.
More modules can be added later as memory
needs grow.
Advantages:
Easy to install, upgrade, and replace.
Conclusion
In simple terms, to build large RAMs from small RAMs, you can:
1. Group them into memory banks for better organization and access.
2. Interleave addresses across the modules for faster, parallel access.
3. Chain the RAMs together to extend memory when one module is full.
This makes large memories easier to build, manage,
and expand.
4.Summarize the virtual memory organization followed in digital
computers.
Virtual memory is a memory-management technique in which data is moved
from RAM to disk storage. This process enables programs to access more
memory than is physically installed. The organization is summarized below in
an easy-to-understand way.
Virtual memory allows a computer to use more memory than the physical
RAM by treating part of the disk as an extension of memory. The operating system moves data between
physical memory (RAM) and the storage disk. This ensures programs can
run even when they need more than the available RAM.
1. Virtual Address Space: Each program is given its own virtual address
space.
This virtual address space is much larger than the physical RAM
available.
The program accesses memory using virtual addresses, which are then
translated into physical addresses.
2. Paging: Virtual memory is divided into fixed-size blocks called pages
(typically 4KB).
Physical memory is likewise divided into
blocks called page frames, where each page from virtual memory is
mapped.
a. Page Table
A page table keeps track of the mapping between virtual memory pages
and physical page frames. When the program accesses memory, the memory
management unit (MMU) looks up the virtual address in the page table
to find the corresponding frame.
b. Address Translation
The MMU translates each virtual address into the physical address where the data
is stored.
c. Page Faults
If the required page is not present in RAM, a page fault
occurs.
The operating system then retrieves the required page from the hard drive
and loads it into RAM.
If RAM is full, the operating system may swap out less-used pages to the
disk to make room.
Physical memory is divided into page frames. When a program needs more memory than
available in RAM, the operating system moves pages between RAM and
disk storage.
1. A program accesses a virtual address whose page is not currently in RAM.
2. The MMU checks the page table and sees the page isn’t in RAM.
3. A page fault occurs, and the operating system fetches the required page
from the disk into RAM and updates the page table.
Memory Isolation and Protection: Each program runs in its own virtual
address space, so one program cannot read or corrupt another program's data; programs simply use virtual addresses whenever they
request memory.
3. Page Fault (if needed): If the page is not in RAM, the operating system
handles a page fault and loads the required page from disk storage into
RAM.
4. Memory Swapping: The OS can swap pages between RAM and disk as
needed, using physical memory
more efficiently by managing which pages to load into RAM and which
to keep on disk.
5. Protection: Each program has
its own isolated memory space, preventing one program from affecting
another.
Slower Performance: If too many pages are swapped in and out of RAM
(thrashing), performance drops because
accessing data from the hard disk is slower than accessing RAM.
Disk Space Usage: Virtual memory requires disk space (swap space) for
storing data that doesn’t fit into RAM. Using too much disk space can fill
up the drive.
Conclusion:
Virtual memory lets programs use more memory than is physically
available. By dividing memory into pages, using page tables for address
translation, and swapping data between RAM and disk storage, virtual
memory ensures that the limited physical memory is efficiently
managed.
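As a simplified C sketch of the translation and page-fault steps described above (the page-table contents and the 4KB page size are illustrative assumptions):

```c
/* Sketch: virtual-to-physical translation with a tiny page table.
 * A frame value of -1 stands for "not in RAM" and triggers a page fault. */
#include <stdio.h>

#define PAGE_SIZE 4096
#define NUM_PAGES 8

int page_table[NUM_PAGES] = {3, 7, -1, 5, -1, 1, 0, 2};  /* page -> frame */

long translate(long vaddr) {
    long page   = vaddr / PAGE_SIZE;
    long offset = vaddr % PAGE_SIZE;
    if (page >= NUM_PAGES || page_table[page] == -1) {
        printf("Page fault on page %ld: OS must load it from disk\n", page);
        return -1;
    }
    return (long)page_table[page] * PAGE_SIZE + offset;   /* physical address */
}

int main(void) {
    printf("Virtual 0x1ABC -> physical 0x%lX\n", translate(0x1ABC)); /* page 1 */
    translate(2 * PAGE_SIZE + 16);                          /* page 2: faults */
    return 0;
}
```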
22VLT402-COMPUTER ARCHITECTURE AND ORGANIZATION
ANSWER KEY
UNIT V- ADVANCED COMPUTER ARCHITECTURE
PART-A
A multicore microprocessor is a single CPU chip with multiple processing cores, allowing it
to execute multiple tasks simultaneously. This improves performance, efficiency, and
multitasking in modern computing.
8. State the overall speedup if a webserver is to be enhanced with a new CPU which is
10 times faster on computation than an old CPU . The original CPU spent 40% of its
time processing and 60% of its time waiting for I/O
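Ans: Applying Amdahl's law with the enhanced fraction f = 0.4 and enhancement factor S = 10:
Overall speedup = 1 / ((1 − f) + f / S) = 1 / (0.6 + 0.4 / 10) = 1 / 0.64 ≈ 1.56
So the web server becomes about 1.56 times faster overall.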
Processor Coordination: synchronized execution vs. independent execution.
Feature: Resource Use — Hardware Multithreading (time-sliced): allocates resources in time slots; SMT (Simultaneous Multithreading): shares CPU resources dynamically.
13. Integrate the ideas of multistage network and cross bar network.
Memory Access: UMA — same for all processors; NUMA — varies based on memory location.
Fine-Grained Multithreading:
Switches threads after every instruction cycle to reduce CPU stalls.
Hides execution delays (e.g., memory latency) by continuously interleaving multiple
threads.
Efficient for high-latency operations but may reduce single-thread performance.
Used in GPU architectures and high-throughput processors.
Type: Simultaneous Multithreading (SMT) — Description: Executes multiple threads simultaneously in a single core.
CPU Utilization: fine-grained multithreading keeps execution units busy, while coarser thread switching may leave idle cycles between switches.
Example: fine-grained — GPUs and high-throughput processors; coarser switching — traditional multi-threaded CPUs.
20. Classify shared memory multiprocessor based on the memory access latency.
NUMA (Non-Uniform Memory Access): memory access time varies based on location. Examples: large-scale servers, AMD EPYC, Intel Xeon.
PART-B
Parallelism:
Parallelism is the ability of a system to perform multiple operations or tasks simultaneously,
rather than sequentially, to enhance computational efficiency and performance. It is widely
used in modern computing architectures to maximize processing power and reduce execution
time. Parallelism is essential in high-performance computing, real-time processing, and large-
scale data analysis.
Parallelism is classified into different types based on the level at which tasks are executed
concurrently. The main types include Bit-level parallelism, Instruction-level parallelism,
Data-level parallelism, Task-level parallelism, and Thread-level parallelism.
Types of Parallelism:
1. Bit-Level Parallelism (BLP):
o Focuses on increasing the processor’s ability to process multiple bits in a
single clock cycle.
o By widening the data path, processors can perform operations on larger bit-
sized operands (e.g., 32-bit instead of 16-bit).
o Example: A 32-bit processor can process a 32-bit number in one cycle,
whereas a 16-bit processor would require two cycles.
2. Instruction-Level Parallelism (ILP):
o Involves executing multiple instructions simultaneously within a single
processor.
o Achieved using techniques like pipelining, superscalar execution, out-of-order
execution, and speculative execution.
o Example: Modern CPUs fetch, decode, and execute multiple instructions in
parallel, reducing overall execution time.
3. Data-Level Parallelism (DLP):
o Executes the same operation on multiple data points at the same time.
o Commonly used in vector processing and SIMD (Single Instruction, Multiple
Data) architectures.
o Example: Graphics processing units (GPUs) use DLP to process large sets of
pixel data concurrently.
4. Task-Level Parallelism (TLP):
o Different tasks or functions run in parallel, each performing a unique job
independently.
o Achieved in multi-core and distributed computing environments where
different cores or machines execute separate tasks.
o Example: A web browser rendering a webpage while simultaneously
downloading files in the background.
5. Thread-Level Parallelism (ThLP):
o Multiple threads of the same process execute concurrently, improving
application performance.
o Used in multi-threaded applications where different parts of a program run in
parallel.
o Example: In a multi-core processor, separate threads handle different aspects
of a game, such as physics simulation and AI processing.
Limitations of ILP:
Despite its benefits, ILP has some challenges that limit its efficiency.
1. Data Hazards: Some instructions depend on the results of previous ones, causing
delays.
2. Control Hazards: Conditional statements (like if-else) can disrupt parallel execution.
3. Structural Hazards: Limited processing units and memory bandwidth can slow
execution.
4. Diminishing Returns: Adding more ILP techniques does not always lead to big
improvements.
5. Complex Hardware Design: ILP requires advanced processors, making them harder
to design and expensive.
6. Code Dependency Issues: Some programs are naturally difficult to parallelize.
7. Memory Latency: If memory access is slow, it can limit ILP efficiency.
8. High Power Consumption: More ILP features require more energy, increasing heat
and power use.
9. Compiler Challenges: Writing optimized code to take full advantage of ILP is
difficult.
Conclusion:
ILP helps processors run faster by executing multiple instructions at the same time. However,
it faces challenges like instruction dependencies, hardware limitations, and power
consumption. Despite these issues, ILP is widely used in modern processors to improve
performance.
2. i).Give the software and hardware techniques to achieve Instruction level parallelism.
Instruction Level Parallelism (ILP) refers to the ability of a processor to execute multiple
instructions simultaneously to improve performance. (ILP) can be achieved using various
software and hardware techniques. These techniques help improve CPU performance by
executing multiple instructions simultaneously.
Example:
Before Scheduling (Stalls Present)
LOAD R1, A ; Load value from memory (takes time)
ADD R2, R1, R3 ; Must wait for LOAD to finish
STORE R2, B ; Store result in memory
Problem: ADD depends on LOAD, causing a delay.
3.Software Pipelining:
Example:
Before Software Pipelining (Sequential Execution)
LOOP:
LOAD R1, 0(R2) ; Load
ADD R3, R1, R4 ; Compute
STORE R3, 0(R5) ; Store
ADDI R2, R2, 4
ADDI R5, R5, 4
BNE R2, R6, LOOP
Each iteration completes Load → Compute → Store before the next starts, causing
idle time.
After Software Pipelining (Optimized Overlapping Execution)
LOOP:
LOAD R1, 0(R2) ; Load (Next Iteration)
ADD R3, R7, R4 ; Compute (Current Iteration)
STORE R3, 0(R5) ; Store (Previous Iteration)
ADDI R2, R2, 4
ADDI R5, R5, 4
BNE R2, R6, LOOP
4. Register Renaming
Eliminates false dependencies by renaming registers.
Example: If two independent instructions both write to R1, the second write can be renamed to a spare register so that both can be scheduled in parallel.
Branch Prediction: The CPU guesses the outcome of a branch (if-else, loops) to avoid stalls.
Mispredictions cause performance penalties.
Example:
if (x > 0)
y = x * 2;
else
y = x - 2;
If-Conversion: Replaces branches with conditional instructions to eliminate branch penalties.
Example:
y = (x > 0) ? x : -x; // No branching, executes faster
1.Superscalar Execution:
Superscalar execution is a CPU architecture technique that allows multiple instructions to be
processed simultaneously, increasing performance by using multiple execution units within
the CPU.
Example: If a CPU has two arithmetic units, it can execute two arithmetic instructions at the
same time, effectively doubling the speed for those instructions.
1. ADD R1, R2, R3 (add contents of R2 and R3, store in R1)
2. SUB R4, R5, R6 (subtract R6 from R5, store in R4)
In a superscalar CPU with two arithmetic units, both instructions can be executed
simultaneously, boosting performance.
2. Out-of-Order Execution:
Out-of-order execution lets the CPU execute instructions as soon as their operands and execution units are available, rather than strictly in program order.
Example:
1. LOAD R1, 100(R2) (load data from memory address 100 + R2 into R1)
2. ADD R3, R1, R4 (add the contents of R1 and R4, store the result in R3)
3. MUL R5, R6, R7 (multiply the contents of R6 and R7, store the result in R5)
The CPU can execute MUL R5, R6, R7 before ADD R3, R1, R4 if the resources for
multiplication are available, improving overall performance.
3.Register Renaming:Register renaming is a CPU technique that eliminates false data
dependencies by dynamically mapping logical registers to physical registers, allowing parallel
execution of instructions.
Example:
4. Branch Prediction:
Branch prediction is a CPU technique that guesses the outcome of conditional branches (like
`if` statements) to continue executing instructions without waiting for the branch to be
resolved.
Example:
5.Speculative Execution:
Speculative execution is a CPU technique that executes instructions before it's certain
they're needed, based on predictions.
Example:
1. The CPU reaches a conditional branch whose condition is not yet known.
2. The CPU predicts the outcome (true or false) and executes instructions based on that
prediction.
3. If the prediction turns out to be wrong, the speculative results are discarded.
These techniques, when combined, help maximize the CPU's ability to execute multiple
instructions in parallel, improving overall performance.
ii).Summarize the facts or challenges faced by parallel processing in enhancing computer
architecture.
Hardware multithreading:
(i) Hardware multithreading allows a single processor core to hold the state of several threads and switch between them in hardware.
(ii) In a multithreaded processor, instructions from multiple threads are interleaved, enabling
better CPU utilization.
Key Observations:
Coarse-Grained Multithreading:
It is a multithreading technique where the processor switches to another thread only when
the current thread encounters a long stall (such as a cache miss or memory access delay).
Unlike fine-grained multithreading, which switches threads every cycle, CGM allows a thread
to execute continuously until it faces a significant delay.
Key Observations:
(i) Different colored blocks represent different threads (Red, Green, Yellow).
(ii) Each thread runs continuously until it hits a stall (represented by white blocks).
(iii) Once a stall occurs, the processor switches to the next available thread instead of
waiting.
(iv)This reduces idle time in the processor but may result in some overhead when switching
between threads.
It improves CPU efficiency by reducing idle cycles due to long stalls but does not switch
threads as frequently as fine-grained multithreading, which switches every cycle.
Simultaneous Multithreading (SMT):
Key Observations:
(ii) The label "Skip C" and "Skip A" indicate that certain execution slots are skipped due to
resource unavailability or dependency issues.
Keeps the processor busy by executing another thread when one thread stalls (e.g.,
waiting for memory access).
Maximizes the use of execution units, reducing idle cycles.
2. Improved Performance in Multitasking:
Enables multiple threads to run efficiently on the same core, enhancing multitasking
capabilities.
Reduces execution time for parallel workloads.
3. Better System Responsiveness
Multiple Registers: Each thread requires its own set of registers to store intermediate
results. The processor must support multiple sets of registers or a register renaming
mechanism to switch between threads quickly.
Thread Contexts: A processor running multiple threads must manage the context of
each thread, which includes its instruction pointers, program counters, and other
state information.
Execution Units: These are the components of the processor that perform
computations. In multithreading, multiple execution units may be used in parallel to
execute different threads concurrently.
Instruction Pipeline: To execute multiple threads concurrently, processors need to
have deep pipelines, allowing them to fetch, decode, and execute multiple
instructions from multiple threads at the same time
Applications:
4. Apply your knowledge on graphics processing units and explain how it helps computer
to improve processor performance.
(i) A Graphics Processing Unit (GPU) is a specialized electronic circuit designed to accelerate
the creation of images, animations, and video.
(ii) Initially designed for rendering graphics, GPUs are now used for general-purpose
computing (GPGPU), AI, and scientific simulations.
(iii) Modern GPUs are essential in gaming, AI, data science, machine learning, and high-
performance computing (HPC).
Key Components:
Cores:
GPUs consist of numerous smaller processing units called cores. These cores work
together to execute thousands of threads simultaneously, making GPUs highly
efficient for tasks involving large-scale computations.
Memory:
GPUs have dedicated memory called Video RAM (VRAM), optimized for high-speed
access and data transfer. VRAM allows the GPU to store and quickly access the data
needed for rendering and computations.
Graphics Pipeline:
The graphics pipeline is a series of stages that transform 3D models and scenes into
2D images displayed on the screen. Key stages include vertex processing, geometry
processing, rasterization, and fragment processing.
Shader Units:
Shader units are programmable processors within the GPU that perform shading
calculations to determine the color, lighting, and texture of each pixel. Shaders
include vertex shaders, geometry shaders, and fragment shaders.
1. Parallel Processing
Thousands of Cores: GPUs have many small cores optimized for handling multiple
tasks simultaneously.
SIMD Architecture: Executes the same operation on multiple data points at once,
improving efficiency.
High Throughput: Processes large datasets faster than CPUs, ideal for graphics and
computations.
2. Offloading Work from CPU
Graphics Rendering: Takes over rendering, freeing CPU for logic and system tasks.
AI & Machine Learning: Accelerates deep learning while CPU manages data flow.
Scientific Computing: Handles simulations, physics calculations, and big data tasks.
3. Specialized Hardware for Performance
GDDR & HBM Memory: Faster than system RAM, reducing memory bottlenecks.
Memory Parallelism: Accesses thousands of memory locations simultaneously.
Efficient Data Handling: Keeps textures, models, and computation data readily
available.
5. Gaming & Real-Time Rendering
Dedicated Video Encoders: NVENC (NVIDIA) and VCE (AMD) speed up video
rendering.
Real-Time Streaming: Reduces CPU usage, improving stream quality.
Efficient Video Editing: Accelerates rendering in software like Adobe Premiere.
8. Improves Overall System Performance
Challenges of GPUs:
1. Power Consumption:
- High power usage can lead to increased energy costs and heat generation.
2. Resource Contention:
- Multiple tasks competing for GPU resources can lead to performance bottlenecks.
3. Complex Programming:
- Writing software to fully utilize GPU capabilities requires specialized knowledge and skills.
4. Cost:
- High-performance GPUs can be expensive, making them less accessible for some users.
5. Compatibility Issues:
- Some software applications may not be optimized to take full advantage of GPU
capabilities.
Definition:
SIMD (Single Instruction, Multiple Data) is a parallel processing technique in which a single
instruction operates on multiple data elements simultaneously. This approach enhances
computational efficiency, particularly in tasks that involve repetitive calculations on large
datasets.
Example: adding two arrays element by element.
A = [1, 2, 3, 4]
B = [5, 6, 7, 8]
Without SIMD, the additions are performed one at a time:
1 + 5 = 6
2 + 6 = 8
3 + 7 = 10
4 + 8 = 12
With SIMD, multiple processing units (PU1, PU2, PU3, PU4) execute the same
instruction on multiple data elements simultaneously:
PU1: 1 + 5 = 6
PU2: 2 + 6 = 8
PU3: 3 + 7 = 10
PU4: 4 + 8 = 12
All four additions happen at the same time instead of one by one.
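As a sketch (not part of the original answer), the same four additions can be written with x86 SSE2 intrinsics, where one instruction adds all four integer pairs at once:

```c
/* Sketch: SIMD addition of two 4-element integer arrays using SSE2.
 * Requires an x86 CPU with SSE2 support. */
#include <stdio.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

int main(void) {
    int a[4] = {1, 2, 3, 4};
    int b[4] = {5, 6, 7, 8};
    int c[4];

    __m128i va = _mm_loadu_si128((__m128i const *)a);
    __m128i vb = _mm_loadu_si128((__m128i const *)b);
    __m128i vc = _mm_add_epi32(va, vb);      /* one SIMD add = 4 additions */
    _mm_storeu_si128((__m128i *)c, vc);

    for (int i = 0; i < 4; i++)
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    return 0;
}
```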
Applications:
Image & Video Processing – Speeds up filtering, compression, and transformations
in photo editing, video encoding, and streaming.
Scientific Simulations – Used in weather modeling, fluid dynamics, and molecular
simulations for fast computations.
Multimedia Applications – Enhances audio processing, speech recognition, and 3D
rendering by parallelizing data operations.
Cryptography – Boosts encryption/decryption speeds for secure communication and
data protection.
Machine Learning – Accelerates deep learning by optimizing matrix and vector
computations.
SIMD is used because it is faster, more efficient, and optimized for large-scale computations.
Definition:
MISD (Multiple Instruction, Single Data) is an architecture in which each processor has its own instruction stream, meaning the processors perform different
operations on the same data.
For example:
o CPU 1 might perform error checking.
o CPU 2 might apply a mathematical transformation.
o CPU 3 might filter noise from the data.
o CPU 4 might perform encryption.
3. Parallel Processing
All processors execute their unique instructions simultaneously on the same data.
Unlike SIMD (Single Instruction, Multiple Data), where all CPUs execute the same
operation, each processor in MISD does something different.
4. Output Stage
Examples:
Comparison:
Performance Impact: SIMD is highly efficient for vector processing; MISD has specialized use cases and is not common in general-purpose computing.
Data Level Parallelism in SIMD and MISD serves different computational needs. SIMD is
widely used in modern computing for accelerating parallel tasks in graphics, AI, and scientific
simulations, while MISD is limited to specialized domains like fault-tolerant systems and
signal processing. Understanding these architectures helps optimize performance for specific
applications.
6. i). Point out how you will classify shared memory multiprocessors based on memory
access latency.
Definition:
In a Uniform Memory Access (UMA) system, all processors share a single, centralized
memory, and each processor experiences equal latency when accessing any memory
location. This is commonly found in Symmetric Multiprocessing (SMP) systems.
Step-by-Step Working of UMA:
Processors are connected to the memory and I/O devices via an Interconnect (Bus,
Crossbar, or Multistage Network).
This interconnect manages communication between processors and memory.
It ensures that data is consistently updated across all processors.
4. Fetching & Executing Instructions
Since multiple processors share the same memory, they must coordinate to avoid
conflicts.
Techniques like locks, semaphores, and cache coherence protocols ensure that all
processors see the same updated data.
6. Handling Input/Output (I/O) Operations
Suppose your computer has a quad-core processor (4 cores) and 8GB of RAM. You open
multiple application
Each core (processor) fetches and stores data in the same 8GB RAM at equal access speed
because it follows Uniform Memory Access (UMA).
Definition: (NUMA) systems use a distributed memory model where memory is divided
among multiple nodes, and access time varies depending on whether a processor is
accessing local or remote memory.
Key Components:
Working:
Examples:
Types of NUMA:
(A) Non-caching NUMA (B) Cache-Coherent NUMA
1. Each CPU has its own local memory – Reduces contention for shared memory.
2. No cache coherence mechanism – CPUs directly access memory without caching
remote data.
3. MMU manages memory access – Requests to local memory are fast, while remote
memory access is slower via the system bus.
4. System bus connects multiple nodes – Enables inter-node communication but
increases latency for remote memory access.
5. Efficient for tasks with localized memory access – Not ideal for workloads requiring
frequent remote memory access.
Each CPU mainly uses its own memory, and if it needs data from another CPU's memory, it
takes longer to fetch it because there is no shared caching system.
Each node has its own CPU and memory, connected via a local bus.
CPUs access local memory quickly for faster performance.
If data is not in local memory, the CPU retrieves it from another node via the
interconnection network.
Directory-based system maintains cache coherence, ensuring all nodes have updated
data.
Improves performance by balancing local memory speed with shared memory
access.
CC-NUMA ensures CPUs can efficiently share memory while keeping data consistent across
multiple caches. It balances fast local memory access with global shared memory
coordination.
Definition:
The classification is based on how memory access latency varies depending on the system’s
architecture, and choosing the right system depends on the workload and memory access
patterns.
The "Skip A" label suggests that the system skips a particular step (likely due to a
delay or stall) and moves forward to execute other instructions.
Left Side: Sequential execution where delays (red blocks) cause stalls.
Right Side: Fine-grained multithreading allows skipping over stalled operations,
keeping the processor active.
Different colors represent different instructions or threads.
This concept is used in processors like GPUs and some CPUs to improve
efficiency and minimize idle cycles.
Simultaneous Multithreading:
Simultaneous Multithreading (SMT) is a CPU execution technique that allows multiple
threads to run in parallel within a single processor core. Unlike traditional multithreading,
which switches between threads based on stalls or availability, SMT enables different threads
to share execution resources at the same time.
The execution diagram shows rows of colored blocks, each representing a different thread executing in parallel. The key observations from the
image include:
1. Parallel Thread Execution: Multiple threads (represented by different colors) are
running simultaneously within a single core.
2. Skipping Stalled Threads: When certain threads encounter stalls (e.g., waiting for
data from memory), the processor skips them (labeled as "Skip A" and "Skip C") and
continues executing instructions from other available threads.
3. Efficient CPU Utilization: Instead of allowing the processor to remain idle during
stalls, SMT ensures that execution units remain active by processing instructions from
other threads.
SMT Works:
Advantages of SMT
1. Increased CPU Utilization: By allowing multiple threads to share execution units,
SMT ensures that the CPU is always performing useful work.
2. Better Performance for Multithreaded Workloads: Applications designed for
multithreading, such as databases, web servers, and gaming engines, benefit
significantly from SMT.
3. Improved Responsiveness: Even if one thread is stalled, other threads can continue
execution, leading to better system responsiveness.
4. Energy Efficiency: SMT improves performance without requiring additional cores,
leading to better power efficiency compared to adding more physical cores.
Disadvantages of SMT
1. Resource Contention: Since multiple threads share CPU resources, contention can
occur, leading to performance degradation if too many threads compete for the same
resources.
2. Security Concerns: SMT can introduce security vulnerabilities, such as side-channel
attacks (e.g., Spectre and Meltdown), where one thread might infer data from another
thread running on the same core.
3. Not Always Beneficial: In workloads that are not optimized for multithreading, SMT
may not provide significant performance gains and can sometimes introduce
overhead.
Despite these challenges, SMT remains a fundamental technology in improving CPU efficiency and
performance in various computing applications.
Multicore processors:
A multicore processor is a single computing component (a CPU) that has multiple
independent processing units called cores. Each core can execute instructions independently,
allowing for parallel processing,which improves performance, efficiency, and multitasking
capabilities
Advantages:
Enhanced computing power.
Energy efficiency.
Ideal for high-performance tasks like AI, gaming, and data processing.
Disadvantages:
More complex programming required.
Higher manufacturing costs.
Some applications may not fully utilize all cores.
Multithreading:
Multithreading is a programming technique that allows multiple threads (smallest units of a
process) to execute independently within a single process. It enables concurrent execution of
multiple tasks, improving performance, responsiveness, and resource utilization.
Types:
1. Pre-emptive Multithreading:
Explanation
Pre-emptive multithreading is controlled by the operating system (OS). The OS decides when
to switch between threads based on priority, time slices, or resource availability. Threads do
not have control over execution switching, ensuring fair CPU distribution.
Working
1. The OS assigns time slices (quantum) to each thread.
2. When a thread's time expires or it enters a waiting state, the OS preempts it and
switches to another thread.
3. This process continues, ensuring multitasking and system stability.
Example
Modern operating systems (Windows, Linux) use preemptive multithreading for
process scheduling.
Web browsers handle multiple tabs and background processes efficiently using this
technique.
2. Cooperative Multithreading
Explanation
Cooperative multithreading relies on threads voluntarily yielding control to allow other
threads to execute. If a thread does not yield control, it can monopolize CPU time, potentially
causing system slowdowns.
Working
1. A thread executes its task until it voluntarily releases the CPU.
2. Another thread is scheduled to run once the previous thread yields control.
3. If a thread does not yield, it can block other threads, leading to inefficiencies.
Example
Early macOS versions used cooperative multithreading before adopting preemptive
scheduling.
Single-threaded applications where tasks are manually scheduled by developers.
3. Concurrent Multithreading
Explanation
Concurrent multithreading allows multiple threads to execute independently within a single
process. However, threads may not execute simultaneously but rather take turns running in an
interleaved fashion.
Working
1. Multiple threads exist within a process and share resources.
2. The system switches between threads when one becomes idle or blocked.
3. This ensures efficiency without requiring multiple CPU cores.
Example
Java's multithreading model (using Thread and Runnable interfaces) allows
concurrent execution.
Music players run UI and playback threads concurrently.
4. Parallel Multithreading
Explanation
Parallel multithreading involves executing multiple threads simultaneously on different CPU
cores. It is used to fully utilize modern multicore processors, significantly improving
performance.
Working
1. Threads are assigned to different CPU cores.
2. Multiple threads execute at the same time without waiting.
3. This approach is ideal for high-performance computing (HPC) and real-time
applications.
Example
Multicore processors handle AI computations using parallel execution.
Gaming engines run physics, graphics, and AI logic in parallel for smooth gameplay.
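A minimal POSIX-threads sketch of this idea; the task names and printed messages are illustrative, not from the original answer:

```c
/* Sketch: two threads running in parallel, one standing in for "physics"
 * work and one for "AI" work, as in the game-engine example.
 * Compile with: gcc -pthread parallel.c */
#include <stdio.h>
#include <pthread.h>

void *physics_task(void *arg) {
    (void)arg;
    printf("Physics thread: updating object positions\n");
    return NULL;
}

void *ai_task(void *arg) {
    (void)arg;
    printf("AI thread: choosing enemy actions\n");
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, physics_task, NULL);  /* may run on one core  */
    pthread_create(&t2, NULL, ai_task, NULL);       /* may run on another   */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Main thread: both tasks finished\n");
    return 0;
}
```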
Advantages of Multithreading
1. Efficient CPU Utilization
o Allows multiple threads to run simultaneously, keeping the CPU busy and
reducing idle time.
2. Improved Responsiveness
o Ensures smooth execution of applications, especially in GUIs and real-time
systems.
o Example: A web browser remains responsive while loading a webpage in the
background.
3. Faster Execution
o Tasks get executed concurrently, improving performance and reducing
execution time.
o Example: A gaming application running AI, physics, and graphics as separate
threads.
4. Better Resource Sharing
o Threads share process memory and resources, reducing overhead compared to
multiple processes.
5. Parallel Processing Capability
o On multicore processors, threads can run truly in parallel, boosting speed for
complex computations.
Disadvantages of Multithreading
1. Complex Debugging and Synchronization Issues
o Managing multiple threads can lead to race conditions, deadlocks, and
resource conflicts.
o Requires proper synchronization (mutexes, semaphores) to prevent data
inconsistency.
2. Increased Resource Consumption
o Each thread requires CPU time and memory, which can lead to overhead if too
many threads are created.
3. Context Switching Overhead
o The CPU spends time switching between threads, which can slow down
performance in certain cases.
4. Security Risks
o Threads share the same memory space, so a bug in one thread can affect
others, leading to potential vulnerabilities.
Applications of Multithreading
1. Operating Systems
o Used for multitasking (running multiple applications at once).
o Example: Windows, Linux thread scheduling.
2. Web Browsers
o Handles multiple tabs, downloads, and rendering in parallel.
o Example: Google Chrome using separate threads for each tab.
3. Gaming and Graphics Processing
o Separate threads for rendering, physics calculations, and AI.
o Example: First-person shooter games with real-time effects.
4. Multimedia Applications
o Allows simultaneous audio/video playback and background processing.
o Example: Video players, music streaming services.
5. High-Performance Computing (HPC)
o Parallel processing for big data analysis, AI, and machine learning.
o Example: Scientific simulations, weather prediction models.
6. Networking Applications
o Servers handle multiple client requests simultaneously using multithreading.
o Example: Web servers like Apache, Nginx, and cloud computing platforms.
Advantages of Multithreading
1. Efficient CPU Utilization
o Allows multiple threads to run simultaneously, keeping the CPU busy and
reducing idle time.
2. Improved Responsiveness
o Ensures smooth execution of applications, especially in GUIs and real-time
systems.
o Example: A web browser remains responsive while loading a webpage in the
background.
6. Faster Execution
o Tasks get executed concurrently, improving performance and reducing
execution time.
o Example: A gaming application running AI, physics, and graphics as separate
threads.
7. Better Resource Sharing
o Threads share process memory and resources, reducing overhead compared to
multiple processes.
8. Parallel Processing Capability
o On multicore processors, threads can run truly in parallel, boosting speed for
complex computations.
Definition
MIMD architecture consists of multiple processors executing different instructions on
different data sets simultaneously. This is the most general form of parallel computing used in
multi-core and distributed systems.
Working
Multiple processors fetch different instructions from the instruction pool.
Each processor operates on its own data from the data pool.
Processors run independently and in parallel.
Execution continues concurrently, making this model highly scalable.
Why is MIMD used?
Best suited for multi-core processors and parallel computing.
Used in high-performance computing, cloud computing, and distributed systems.
Allows multiple tasks to be executed simultaneously, improving efficiency and speed.
Examples
Multi-core processors (Intel Core i7, AMD Ryzen), where each core can execute
different threads.
Supercomputers like IBM Blue Gene and Cray XC40.
Distributed computing systems, such as Hadoop clusters.
Advantages
Highly scalable and efficient.
Supports complex and diverse computations.
Maximizes system performance.
Disadvantages
Complex programming required.
High power consumption.
Expensive hardware implementation.
SISD is the foundation of classical computing, with a single control unit and
processing unit handling operations sequentially.
It remains relevant in basic computing, embedded systems, and low-cost applications.
However, with modern performance demands, more advanced architectures like
SIMD (Single Instruction, Multiple Data) and MIMD (Multiple Instruction, Multiple
Data) are widely used.
Simultaneous Multithreading:
Simultaneous Multithreading (SMT) is a technique used in modern processors to improve
CPU performance and efficiency. It allows multiple threads to execute simultaneously on a
single physical core by sharing resources more effectively. This results in better utilization of
CPU resources, improved parallelism, and enhanced throughput.
The SMT execution diagram consists of a structured grid, where:
Each small square represents an instruction.
Different colors indicate different threads executing simultaneously.
The "Skip" annotations highlight scenarios where certain instructions from a thread
cannot execute at a given cycle, often due to dependencies or resource constraints.
Working of SMT
SMT enhances CPU performance by enabling multiple threads to share execution resources
within a single core. The key working principles of SMT include:
1. Thread-Level Parallelism (TLP):
o The processor executes multiple threads simultaneously by sharing execution
units, registers, and caches.
o It dynamically allocates resources to different threads based on availability.
2. Instruction Dispatch and Scheduling:
o The CPU fetches instructions from different threads in parallel.
o A thread scheduler determines which instructions are executed based on
available execution units and dependencies.
o If an instruction in one thread is stalled due to memory latency or a
dependency, another thread’s instruction can be executed instead, reducing
idle time.
3. Pipeline Execution:
o SMT allows multiple threads to share the CPU pipeline, ensuring that no
pipeline stage remains idle.
o When a thread experiences a stall, other active threads can utilize the pipeline,
improving overall efficiency.
4. Register and Cache Sharing:
o Threads share resources like registers and caches to optimize execution.
o The CPU ensures fair allocation of cache memory among threads to prevent
one thread from monopolizing resources.
5. Handling Resource Contention:
o If multiple threads compete for the same execution unit, the processor
schedules instructions efficiently to avoid bottlenecks.
o Prioritization mechanisms ensure critical instructions receive processing
priority.
Advantages of SMT
1. Improved CPU Utilization: SMT enables efficient use of processor execution units
by interleaving multiple threads.
2. Higher Throughput: More instructions per cycle can be completed compared to
single-threaded execution.
3. Reduced Execution Stalls: If one thread encounters a stall (e.g., memory access
latency), other threads can continue executing.
4. Power Efficiency: While SMT increases power consumption slightly, the
performance gains per watt are usually beneficial.
5. Better Performance in Multi-Tasking: Applications that use multiple threads (e.g.,
web servers, video rendering, and databases) benefit significantly.
Challenges of SMT
1. Resource Contention: Multiple threads share execution resources, potentially leading
to performance degradation if too many threads compete for the same resources.
2. Security Concerns: Side-channel attacks like Spectre and Meltdown exploit shared
resources in SMT architectures.
3. Performance Variability: Not all workloads benefit equally from SMT. Some single-
threaded applications may not see significant improvements.
4. Increased Power Consumption: While efficiency improves, SMT does require more
power than single-thread execution.
By leveraging SMT, modern processors achieve higher throughput and better responsiveness,
making it a crucial technology for multi-threaded workloads. Understanding its execution, as
depicted in the image, helps in optimizing applications for better performance.
Each of these approaches differs in how they manage thread execution and resource
allocation. The provided diagram illustrates these methods using functional unit (FU)
utilization across processor cycles.
1. Superscalar Execution
Definition:
Superscalar execution is a technique where a single thread issues multiple instructions per
cycle, optimizing CPU performance by exploiting instruction-level parallelism (ILP).
Explanation:
Superscalar processors have multiple execution units (FUs) allowing simultaneous
execution of independent instructions from the same thread.
The efficiency depends on the ability to find independent instructions that can be
executed in parallel.
Pipeline stalls due to data hazards or control dependencies can limit performance
gains.
Image illustrates:
Represents a single-threaded execution model where multiple instructions from the
same thread are issued per cycle.
The orange blocks show active execution units (FUs) being used, while the white
spaces indicate idle units due to stalls or dependencies.
Inefficiencies arise due to stalls, limiting parallel execution.
Working:
1. Instructions from a single thread are fetched and decoded.
2. The processor identifies independent instructions.
3. These instructions are executed simultaneously across multiple execution units.
4. If dependencies exist, execution units may remain idle, causing stalls.
5. The next set of instructions is fetched and processed in the next cycle.
Advantages:
Increased performance due to parallel execution.
Efficient for single-threaded applications with high ILP.
Low complexity compared to multithreading approaches.
Disadvantages:
Limited by the availability of independent instructions.
Poor utilization during stalls or dependencies.
Not suitable for multi-threaded workloads.
Definition:
Coarse-Grained Multithreading (CGMT) switches between threads only when a long-latency
stall occurs, such as a cache miss.
Explanation:
Unlike superscalar execution, CGMT introduces multithreading by switching threads
when the current thread encounters a delay.
This prevents execution units from remaining idle for long periods.
However, CGMT does not utilize resources efficiently when threads do not encounter
long stalls.
Image illustrates:
Uses multiple threads but switches between them only when a long-latency stall
occurs (e.g., cache miss).
Different colors (blue, green, yellow, orange) represent separate threads.
A thread executes until it encounters a stall, after which the processor switches to a
different thread.
Working:
1. A single thread executes until a long-latency stall occurs.
2. The processor switches execution to another thread.
3. The new thread continues execution until it encounters a stall.
4. The processor cycles through available threads in a sequential manner.
5. Once the stalled thread is ready, execution resumes.
Advantages:
Reduces idle time due to long-latency stalls.
Simple hardware implementation compared to other multithreading techniques.
Effective for workloads with occasional stalls.
Disadvantages:
Poor efficiency when thread switching is infrequent.
Performance suffers if all threads experience stalls at the same time.
Delays occur when switching threads, causing execution gaps.
Modern processors often combine Superscalar Execution, SMT, and CMP (Chip
Multiprocessing) to achieve the best balance between efficiency and performance.
13. Illustrate the following in detail: (i) Clusters (ii) Warehouse-scale computers
(i)Clusters:
A cluster is a collection of interconnected computers (nodes) that work together as a single
system to enhance performance, scalability, and reliability. These systems are widely used in
high-performance computing (HPC), data centers, cloud computing, and parallel processing.
The computers in a cluster share resources and coordinate tasks to achieve higher efficiency
compared to standalone machines.
Working of Clusters:
Clusters function by distributing tasks among multiple computers (nodes) that work in
parallel. These nodes communicate with each other using high-speed network connections.
The process generally involves:
1. Task Assignment: A task is divided into smaller subtasks and assigned to different
nodes.
2. Processing: Each node processes its assigned task independently.
3. Communication: Nodes share intermediate results with each other or the master node.
4. Aggregation: The final result is compiled from all nodes and sent to the user.
Types of Clusters:
Clusters can be categorized based on their purpose and architecture:
a) High-Availability (HA) Clusters
Designed to ensure continuous service availability.
If one node fails, another node takes over (failover mechanism).
Used in banking systems, e-commerce, and enterprise applications.
b) Load Balancing Clusters
Distributes workload among nodes to optimize performance.
Ensures no single node is overwhelmed.
Commonly used in web servers and cloud computing.
c) High-Performance Computing (HPC) Clusters
Used for computationally intensive tasks such as scientific simulations.
Parallel processing helps speed up execution.
Examples include supercomputers and research laboratories.
d) Storage Clusters
Provides redundant and scalable data storage.
Used in big data applications and enterprise storage solutions.
Components of a Cluster System
Clusters typically consist of the following components:
Nodes: Individual computers that perform computations.
Master Node (Root Node): Assigns tasks and aggregates results.
Slave Nodes: Execute assigned tasks and send back results.
Networking: High-speed interconnects (Ethernet, InfiniBand) for communication.
Storage System: Shared or distributed storage for data handling.
Advantages of Clusters
1. Improved Performance: Parallel processing allows faster execution of tasks.
2. Scalability: Additional nodes can be added to meet growing demand.
3. Fault Tolerance: Redundant nodes prevent system failure.
4. Cost-Effectiveness: Uses commodity hardware instead of expensive mainframes.
5. Load Balancing: Distributes tasks evenly to prevent overloading any node.
Disadvantages of Clusters
1. Complex Configuration: Requires specialized knowledge to set up and maintain.
2. High Network Dependency: Performance depends on network speed and reliability.
3. Synchronization Issues: Coordinating tasks between nodes can be challenging.
4. Power Consumption: Running multiple nodes increases energy costs.
5. Security Risks: Interconnected systems are vulnerable to cyber threats.
Applications of Clusters
Clusters are used in various domains, including:
Scientific Research: Weather forecasting, genetic sequencing, simulations.
Finance: High-frequency trading, risk analysis.
Big Data Analytics: Processing large-scale data in cloud environments.
Gaming: Multiplayer online games requiring real-time synchronization.
Enterprise IT: Running database servers, email systems, and web hosting.
Clusters play a crucial role in modern computing by providing enhanced processing power,
scalability, and reliability. Whether used for scientific research, cloud computing, or high-
availability systems, clusters are an essential technology driving innovation in various
industries.
(ii) Warehouse-Scale Computers (WSCs):
A warehouse-scale computer is a data-center-scale system in which a very large number of commodity servers, together with their networking and storage, operate as one massive computer to support cloud computing, big data processing, and large-scale web services.
Advantages of WSCs
1. Scalability: Can handle vast amounts of data and scale easily as demand increases.
2. Cost Efficiency: Optimized resource management reduces operational costs.
3. High Performance: Parallel processing enables quick execution of complex tasks.
4. Reliability and Fault Tolerance: Built-in redundancy ensures continued operation
even in case of hardware failures.
5. Energy Efficiency: Advanced cooling and power management systems reduce energy
consumption.
6. Support for AI and Big Data: WSCs are essential for AI model training, big data
analytics, and large-scale simulations.
Disadvantages of WSCs
1. High Initial Investment: Setting up a WSC requires significant financial investment.
2. Complex Management: Requires specialized skills to manage networking, storage,
and compute resources.
3. Security Concerns: Storing and processing vast amounts of data make WSCs a target
for cyber threats.
4. Latency Issues: Large-scale distributed systems may experience latency in data
transfer and processing.
5. Environmental Impact: High energy consumption can contribute to environmental
concerns if not managed efficiently.
Applications of WSCs
1. Cloud Computing Services: Used by Google Cloud, AWS, and Microsoft Azure for
hosting web applications.
2. Big Data Processing: Supports large-scale analytics using frameworks like Hadoop
and Spark.
3. Artificial Intelligence and Machine Learning: Enables deep learning model training
with GPUs and TPUs.
4. Social Media Platforms: Facebook, Twitter, and LinkedIn use WSCs to manage user
data and services.
5. E-Commerce: Amazon and other online retailers use WSCs for recommendation
engines and transaction processing.
6. Scientific Research: Used for climate modeling, genome sequencing, and simulations
in physics and chemistry.
This system enables efficient large-scale computing by combining multiple servers into a
single, powerful computational unit.
14. Discuss the multiprocessor network topologies in detail.
Bus Topology:
A network topology in which all devices are connected to a single central communication cable (the bus); all processors share this common communication bus.
It is simple and cost-effective but suffers from bus contention.
Best suited for small-scale multiprocessor systems.
Star Topology:
A network topology in which all devices are connected to a central hub or switch that manages data transmission. It offers efficient communication, easy troubleshooting, and high performance by reducing data collisions.
This topology is scalable, allowing new devices to be added without disruption.
However, it is more expensive due to additional cabling and hub costs, and the entire
network depends on the hub—if it fails, the network goes down.
Star topology is commonly used in home and office networks due to its reliability and
efficiency.
Ring Topology:
Ring Topology connects each node to exactly two other nodes, forming a closed loop.
Data flows in a unidirectional or bidirectional manner.
Reduces data collisions and ensures efficient transmission.
A single node failure can disrupt the network unless fault tolerance is implemented.
Commonly used in telecommunications and local area networks (LANs).
Mesh Topology:
In a mesh topology, each node is connected to every other node by a point-to-point link, which makes it possible for data to be transmitted simultaneously between any pair of nodes.
A fully connected mesh of n processors therefore needs n(n-1)/2 links (for example, 8 processors require 28 links), giving high fault tolerance at the cost of expensive wiring.
Hybrid Topology:
A hybrid topology combines two or more basic topologies (for example, a star of rings) within a single network.
It inherits the flexibility and scalability of its component topologies but is more complex and costly to design and maintain.
Tree Topology:
Tree Topology follows a hierarchical structure with a root node at the top.
Each node connects to a fixed number of lower-level nodes (branching factor).
Uses point-to-point links to connect different levels of nodes.
The top-level (root) node has no parent and serves as the main connection point.
Provides easy scalability by adding more branches.
Commonly used in large networks and organizational structures.
If the root node fails, the entire network may be affected.
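As a quick comparison of wiring cost across these topologies, the sketch below counts the point-to-point links each one needs for a small example system (n = 8 is an assumed example size, and the star count assumes a separate central hub):

#include <stdio.h>

/* Hypothetical sketch: number of links needed by common multiprocessor
   topologies for n nodes (n >= 2). */
int main(void)
{
    int n = 8;                                            /* assumed example size */
    printf("Bus  : 1 shared bus\n");
    printf("Star : %d links (one per node to the hub)\n", n);
    printf("Ring : %d links (closed loop)\n", n);
    printf("Tree : %d links (a tree of n nodes has n-1 edges)\n", n - 1);
    printf("Mesh : %d links (fully connected, n(n-1)/2)\n", n * (n - 1) / 2);
    return 0;
}

The output makes the trade-off concrete: a fully connected mesh offers the most redundancy but its link count grows quadratically, while bus, ring, star, and tree grow only linearly.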
Applications of Multiprocessor Network Topologies
Multiprocessor network topologies play a crucial role in various high-performance
computing environments. Their applications include:
1. Supercomputing – Used in large-scale scientific simulations, weather forecasting,
climate modeling, and molecular dynamics simulations.
2. Cloud Computing & Data Centers – Supports distributed computing for cloud
services like AWS, Google Cloud, and Microsoft Azure.
3. Artificial Intelligence & Machine Learning – Enables efficient training and
deployment of deep learning models using parallel processing.
4. Big Data Analytics – Helps in processing large datasets for business intelligence,
fraud detection, and real-time analytics.
5. Telecommunications – Used in network infrastructure for routing and data packet
switching.
6. Internet of Things (IoT) – Supports edge computing and real-time data processing in
IoT environments.
PART-C
1. Evaluate the below C code using MIMD and SIMD machines as efficiently as possible:
for (i = 0; i < 2000; i++)
    for (j = 0; j < 3000; j++)
        array[i][j] = array[j][i] + 200;
Explanation:
The #pragma omp parallel for directive splits the outer loop (i) across multiple
processors.
Each processor computes a portion of the rows (i) independently.
No data dependencies exist between rows, so this approach is efficient.
Explanation:
The #pragma omp simd directive vectorizes the inner loop (j).
SIMD instructions process multiple j values in parallel, adding 200 to each element.
Optimized Code for Both MIMD and SIMD: To combine the benefits of MIMD and SIMD, we can parallelize the outer loop (i) using MIMD and vectorize the inner loop (j) using SIMD, as shown in the sketch below.
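A minimal OpenMP sketch of this combined approach is given below (the array is assumed to be declared as 3000 x 3000 so that both index orders stay in bounds; the function and array names are chosen only for illustration):

#include <omp.h>

#define ROWS 2000
#define COLS 3000

/* Assumed to be at least 3000 x 3000 so that array[j][i] is also in bounds. */
static double array[3000][3000];

void add_200(void)
{
    /* MIMD: the outer loop is split across processor cores. */
    #pragma omp parallel for
    for (int i = 0; i < ROWS; i++) {
        /* SIMD: the inner loop is vectorized within each core. */
        #pragma omp simd
        for (int j = 0; j < COLS; j++) {
            array[i][j] = array[j][i] + 200;
        }
    }
}

Compiled with an OpenMP-capable compiler (for example, gcc -fopenmp), the parallel for directive distributes the rows among threads while the simd directive lets each thread process several j values per instruction.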
Performance:
MIMD: Parallelizes the outer loop across multiple processors.
SIMD: Vectorizes the inner loop for each processor.
Combined Speedup: Significant improvement over sequential execution.
2. Write down a list of the daily activities that you typically do on a weekday, for instance: get out of bed, take a shower, get dressed, eat breakfast, brush your teeth, dry your hair, etc. (minimum ten activities). Which of these activities can be done in parallel? For each activity, discuss whether it can be done in parallel with another; if not, explain why not. Estimate how much less time it will take to complete all the activities if they are done in parallel.
Activity | Can be Done in Parallel? | Reason | Estimated Time Saved
Brush teeth | ✅ Yes | Can do while checking phone/emails | 2 min
Get dressed | ✅ Yes | Can listen to news/music while dressing | 1 min
Pack bag | ✅ Yes | Can do while waiting for food to cook | 2 min
Wear shoes | ✅ Yes | Can do while listening to a podcast | 1 min
Commute to work/school | ✅ Yes | Can read emails/listen to an audiobook while traveling | 5 min
Time Estimation Without Parallelism (Sequential Execution):
In a sequential approach, every activity is performed one after another. Assuming an
approximate time for each task:
Wake up 2 min
Time Estimation With Parallelism (Parallel Execution):
In a parallel approach, multiple tasks are performed at the same time wherever possible. The tasks that can overlap are those marked "Yes" in the table above; adding their estimated savings gives roughly 2 + 1 + 2 + 1 + 5 = 11 minutes saved compared with fully sequential execution.
3. Consider the following portions of two different programs running at the same time on four processors in a symmetric multicore processor (SMP). Assume that before this code is run, both x and y are 0.
Core 1: x = 2;
Core 2: y = 2;
Core 3: w = x + y + 1;
Core 4: z = x + y;
(i) What are all the possible resulting values of w and z? For each possible outcome, explain how we might arrive at those values.
(ii) How can the execution be made more deterministic so that only one set of values is possible?
Let’s analyse the given program running on four cores in a symmetric multicore processor
(SMP). The initial values of x and y are 0. The code is as follows:
Core 1: x = 2;
Core 2: y = 2;
Core 3: w = x + y + 1;
Core 4: z = x + y;
We need to determine the possible values of w and z based on the order of execution of these
instructions. Since the cores are running concurrently, the order of execution is not fixed, and
different interleaving can lead to different results.
The values of w and z depend on the order in which the cores execute their instructions. Let's analyse the possible scenarios:
If Core 3 runs before either write is visible (x = 0, y = 0), then w = 0 + 0 + 1 = 1; similarly Core 4 would compute z = 0.
If exactly one of the writes has completed (x = 2, y = 0, or x = 0, y = 2), then w = 3 and z = 2.
If both Core 1 and Core 2 have completed their writes, then w = 2 + 2 + 1 = 5 and z = 4.
Because Core 3 and Core 4 read x and y independently, w and z may also come from different interleavings (for example, w = 5 while z = 2), so without synchronization w can be 1, 3, or 5 and z can be 0, 2, or 4.
To ensure that only one set of values is possible, we need to enforce a specific order of
execution. This can be achieved using synchronization mechanisms such
as barriers or locks. Here’s how we can make the execution deterministic:
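One possible way to realise this in C is with a POSIX barrier shared by all four threads (a sketch only; the thread functions and variable names are chosen to mirror the four cores above, and the program must be compiled with -pthread):

#include <pthread.h>
#include <stdio.h>

/* Shared variables, initially 0, as in the question. */
int x = 0, y = 0, w = 0, z = 0;
pthread_barrier_t barrier;

void *core1(void *arg) { x = 2; pthread_barrier_wait(&barrier); return NULL; }
void *core2(void *arg) { y = 2; pthread_barrier_wait(&barrier); return NULL; }
void *core3(void *arg) { pthread_barrier_wait(&barrier); w = x + y + 1; return NULL; }
void *core4(void *arg) { pthread_barrier_wait(&barrier); z = x + y;     return NULL; }

int main(void)
{
    pthread_t t[4];
    pthread_barrier_init(&barrier, NULL, 4);   /* all four "cores" meet at the barrier */
    pthread_create(&t[0], NULL, core1, NULL);
    pthread_create(&t[1], NULL, core2, NULL);
    pthread_create(&t[2], NULL, core3, NULL);
    pthread_create(&t[3], NULL, core4, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("w = %d, z = %d\n", w, z);          /* always prints w = 5, z = 4 */
    return 0;
}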
How It Works:
1. Core 1 sets x = 2 and waits at the barrier.
2. Core 2 sets y = 2 and waits at the barrier.
3. Once both Core 1 and Core 2 reach the barrier, Core 3 and Core 4 are allowed to
proceed.
4. Core 3 calculates w = 2 + 2 + 1 = 5.
5. Core 4 calculates z = 2 + 2 = 4.
Result: Always w = 5, z = 4.
Without synchronization, the values of w and z can vary depending on the order of
execution.
By introducing barriers, we can enforce a deterministic execution order, ensuring that
w = 5 and z = 4 are the only possible outcomes.
4. Summarize the merits and demerits of clusters and warehouse-scale computers.
Clusters and warehouse-scale computers (WSCs) are two essential architectures in modern
computing. Clusters are a collection of interconnected computers that work together to
function as a single system, while WSCs are large-scale data centers designed for cloud
computing, big data processing, and large-scale web services. Each has its own advantages
and disadvantages, making them suitable for different applications.
Merits and Demerits of Clusters
Merits of Clusters:
1. High Performance: Multiple nodes work together, providing better computational
power and processing speed.
2. Scalability: Additional nodes can be added to enhance performance as demand
increases.
3. Cost-Effective: Clusters use commodity hardware, making them more affordable
compared to supercomputers.
4. Fault Tolerance: If one node fails, the workload can be redistributed among other
nodes, ensuring system reliability.
5. Parallel Processing: Enables efficient task execution by distributing workloads
across multiple nodes.
6. Customizability: Can be tailored to specific needs such as high-performance
computing (HPC) or data storage solutions.
Demerits of Clusters:
1. Complex Setup and Management: Requires specialized knowledge to configure and
maintain.
2. Network Dependency: Performance may be limited by network latency and
communication overhead.
3. Power and Cooling Requirements: Large clusters generate significant heat and
consume high power.
4. Software Compatibility Issues: Some applications may not be optimized for cluster
environments.
5. Synchronization Challenges: Tasks running in parallel need to be well-coordinated
to avoid inefficiencies.
Both clusters and warehouse-scale computers have their own strengths and weaknesses.
Clusters offer a cost-effective and scalable solution for high-performance computing, whereas
WSCs provide massive processing power and storage capabilities suited for large-scale
applications. The choice between the two depends on factors like computational needs,
budget, scalability requirements, and energy efficiency considerations. Future advancements
in distributed computing and cloud technologies will continue to enhance their capabilities
and efficiency.