COA For Midterm

The document outlines the course structure for Computer Organization and Architecture (COA CSE/IT-222) taught by Dr. Rahul Chaurasia, including attendance policies, evaluation criteria, and key industries related to the subject. It covers fundamental concepts, internal hardware structures, instruction set architectures, and performance measures, emphasizing the importance of computer architecture in various fields such as embedded systems, AI, and cloud computing. The course aims to equip students with both theoretical knowledge and practical skills in computer architecture and organization.

Computer Organization and Architecture

(COA CSE/IT-222)

By- Dr. Rahul Chaurasia


Assistant Professor
Dept. of CSE, IIIT Bhopal
Class Instructions:
• Adherence to the institute's attendance rules is mandatory.
• Students with less than 75% attendance will be detained without any exceptions.
• Mass bunks will not be condoned; ABSENT will be awarded to the entire class.
• Quizzes will be conducted in class after each topic.
• Related industries: semiconductor.
• GATE weightage: 8-14 marks.
• Evaluation will be based on:
Theory:
  Mini Test- 10 Marks
  Midterm- 20 Marks
  Attendance- 10 Marks
  End term- 60 Marks
Practical:
  Lab assignments/file- 50 Marks (5 Marks/assignment)
  Lab Attendance- 10 Marks (1 Mark/Lab)
  Final written + Implementation + Viva- 40 Marks
Text/Reference Books:
1. Computer Architecture and Organization, by J.P. Hayes
2. Computer Organization and Architecture (Designing for Performance), by William Stallings
Utility:
 It builds the foundation for understanding how computer systems work. This knowledge is essential
for various industries and research domains, including the following:
 Hardware Design and Development
• Industries: Intel, AMD, NVIDIA, ARM, Qualcomm, Broadcom.
• Research Domains: High-performance computing (HPC), custom processor design, and
microarchitecture optimization.
• Applications: Designing efficient processors, GPUs, ASICs, and FPGAs.
 Embedded Systems and IoT
• Industries: Texas Instruments, NXP Semiconductors, STMicroelectronics, Bosch.
• Research Domains: Embedded hardware design, energy-efficient computing, and sensor-based
systems.
• Applications: Developing IoT devices, robotics, and real-time systems.
 Hardware Security
• Industries: Cybersecurity companies, defense organizations, and research labs.
• Research Domains: Secure processor design, cryptographic hardware, and tamper-resistant
architectures.
• Applications: IP protection, secure boot mechanisms, and hardware-based cryptographic systems.
 Software Development and Systems Programming
• Industries: Microsoft, Google, Red Hat, VMware.
• Research Domains: Operating systems, compilers, and virtual machine design.
• Applications: Writing performance-optimized software that leverages hardware features.
 Artificial Intelligence and Machine Learning (Hardware Accelerators)
• Industries: NVIDIA, Google (TPUs), IBM (Watson), and various startups focusing on AI hardware.
• Research Domains: AI accelerators, neuromorphic computing, and parallel processing.
• Applications: Designing hardware for AI/ML workloads, like GPUs, TPUs, and edge AI devices.
 Cloud Computing and Data Centers
• Industries: AWS, Google Cloud, Microsoft Azure.
• Research Domains: Server architecture, virtualization, and power-efficient design.
• Applications: Building scalable and energy-efficient data centers.
 Multimedia Systems and Consumer Electronics
• Industries: Samsung, Sony, LG, Apple.
• Research Domains: Multimedia processing units, display technologies, and codecs.
• Applications: Optimizing hardware for audio/video processing and gaming devices.
 Quantum Computing
• Industries: IBM, Google Quantum AI, D-Wave.
• Research Domains: Quantum processor architectures, hybrid quantum-classical systems.
• Applications: Exploring new computational paradigms beyond classical architectures.
 Automotive and Aerospace Systems
• Industries: Tesla, Boeing, SpaceX, Bosch.
• Research Domains: Autonomous vehicles, avionics systems, and resilient computing.
• Applications: Designing onboard computers and real-time embedded systems.
 Research in Academic and Government Labs
• Institutions: IITs, NITs, IISc, DRDO, ISRO, and international labs like MIT, Stanford.
• Research Domains: Processor innovation, fault-tolerant systems, and bio-inspired computing.
• Applications: Advancing the state-of-the-art in computer architecture.
Course Outcomes:
• To understand the fundamental concepts of computer organization and architecture, including how a computer
functions and executes instructions.
• To explore the internal structure and operations of computer hardware components such as processors,
memory, and input/output devices.
• To study the design and working of key components like ALU, control unit, and memory hierarchy (caches, main
memory, virtual memory).
• To analyze different instruction set architectures (ISA) and their impact on system performance.
• To examine the concepts of data representation, addressing modes, and instruction formats.
• To introduce pipeline processing and its role in improving instruction throughput.
• To understand I/O organization and the mechanisms used to interface devices with processors.

Learning Objectives:
--By the end of this course, students should be able to:
Knowledge-Based Objectives:
• Explain the architecture of modern computer systems, including the role of each major component.
• Analyze the memory hierarchy and the role of caches, paging, and virtual memory.
• Describe different types of instruction set architectures and evaluate their trade-offs.
• Understand the principles of pipelining and how it improves system performance.
Skill-Based Objectives:
• Design and simulate simple digital circuits related to computer architecture.
• Write low-level assembly language programs to illustrate machine-level operations.
• Calculate performance metrics such as CPI (cycles per instruction) and assess how architectural features
influence performance.
Application-Based Objectives:
• Apply the principles of computer organization to analyze the behavior of real-world systems.
• Compare and contrast different architectural paradigms such as RISC, CISC.
• Demonstrate an understanding of how hardware and software interact to execute a program.
Computer Architecture
• Deals with the conceptual design of the system.
--Focusing on what the system should do.
--It addresses how to meet system goals, such as performance, power efficiency, and compatibility.
• Instructions (CPU architecture paradigms): CISC // Intel Pentium
                                             RISC // PowerPC
 CISC: suited for desktops and servers where instruction-level performance is critical.
 RISC: ideal for embedded systems, multimedia, real-time applications, and automotive systems.

• Addressing mode: how the operand is given in the instruction.
Immediate addressing → the operand itself is part of the instruction.
Direct addressing → the address of the operand is given.
Indirect addressing → a reference (the address of the address) is given.
• Data format → how to interpret the binary string.
Reduced Instruction Set Computing (RISC) vs Complex Instruction Set Computing (CISC):

• CPI: RISC has CPI = 1 (cycle/instruction), leading to predictable and efficient performance. CISC has a variable CPI, as some instructions may take multiple cycles.
• Instruction size: RISC instructions are fixed in size (e.g., 32 bits), which simplifies decoding and pipelining in hardware. CISC instruction size is variable (e.g., 1 to 15 bytes in x86); the length of an instruction depends on its components (opcode, operands, addressing mode), which helps optimize memory usage but can make instruction decoding more complex.
• Addressing modes: RISC has limited addressing modes (to simplify the hardware and improve execution speed); CISC has more addressing modes to support complex instructions. Addressing modes ∝ flexibility: RISC is less flexible than CISC but optimized for high performance with simple instructions.
• Registers: RISC has more general-purpose registers, giving efficient parameter passing and reduced memory access; this minimizes bottlenecks caused by memory latency. CISC has fewer general-purpose registers, with reliance on memory for parameter passing.
• Examples: RISC: ARM, MIPS, SPARC, RISC-V processors. CISC: Intel x86, AMD, and IBM processors.

SPARC (Scalable Processor Architecture) was developed by Sun Microsystems in the mid-1980s.


Computer Organization
Deals with the logical design/implementation of the system. It is concerned with the physical devices and their interconnection, with the aim of improving performance (work/time).
• It involves the physical realization and interconnection of components.

GOAL: reduce the time; an increased clock frequency gives improved performance.

Locality of reference: roughly 10% of the instructions consume 90% of the execution time, while the remaining 90% of the instructions consume only 10% of the time.
So, to reduce that dominant 90% of the time (e.g., 9 ms out of a 10 ms run), we keep the frequently used portion in cache; it then takes much less time.

So, In a nutshell-
 Architecture defines "what" the system should do at a conceptual level.
 Organization focuses on "how" to build the system to achieve that conceptual
design, efficiently in physical terms.
Performance Factors
• Representation → faster access
• Storing and accessing → cache: reducing the avg. access time
• Processing → pipelining: reducing the avg. processing time
• Communication → Direct Memory Access (DMA)

[Diagram: CPU, memory, and a device connected through a DMA controller; DMA is a communicating device.]

Architectural Classification
 Von Neumann Architecture:
--First memory-based architecture.
• Supports the stored-program concept.
• Single pathway to access data and instructions.
• Only the instructions need to be changed for modification.

[Diagram (Von Neumann): a single memory unit (RAM) holding the O.S., connected through an input/output processor with DMA to the input and output devices.]

 Harvard Architecture:
• Storage of program and data is still required before processing.
• Separate memories: instructions and data are stored in separate memory units, with dedicated buses for each.
• Parallel access: the CPU can fetch instructions and access data simultaneously, improving performance.

 A laptop, desktop, or smartphone likely uses a modified Harvard architecture.

Flynn’s Classification
• It is a method used to categorize computer architectures based on the number of
instruction streams and data streams they can handle simultaneously.
• Instruction Stream: The sequence of instructions executed by a CPU.
• Data Stream: The set of data that the CPU processes.

1) SISD: One CPU, one task, one dataset. //Serial processing


2) SIMD: One CPU controlling multiple processing units, operating on multiple datasets in
parallel. //Array or Vector processing, Exp. Playlist
3) MISD: Multiple CPUs processing the same dataset with different instructions (rare).
Exp. Pipeline Architectures, sensor data analysis
4) MIMD: Multiple CPUs performing independent tasks on separate datasets.
//Multiprocessing or parallel processing.

Utility:
• In modern computing:
MIMD is dominant in general-purpose multi-core processors (and distributed systems), and
SIMD is extensively used in GPUs for data-intensive tasks (vector calculations).
Primary Components of Architecture

Class | Control Unit | Processing Unit | Memory Unit
SISD  | 1            | 1               | 1
SIMD  | 1            | N               | modules
MIMD  | N            | M               | modules
Memory Organization

[Diagram: hierarchy below the CPU; the ith level holds a subset of the (i+1)th level. Random-access memories (cache, main memory) exchange blocks; semi-random-access disk memory exchanges sectors; serial-access tape memory exchanges records.]

Each level is characterized by:
• Access time 'T'
• Size 'S'
• Cost 'C'
• Access frequency 'f'
• Level 'i', i.e. CM is a subset of MM.
Performance Measures
• The performance of a memory hierarchy is given by the HIT ratio (availability of the referred information at the referred place is termed a HIT).
• HIT ratio increases down the hierarchy: H1 < H2 < H3 < ...
• H ∝ 1/T_avg: more HITs at the upper levels mean less avg. access time.
• H ∝ S: a larger level captures more references, so its HIT ratio is higher (but its access time is also larger: T2 > T1 when S2 > S1).
• Issue with a memory hierarchy? Data inconsistency. Soln: proper updation.

 Mathematical expression for the memory hierarchy:
Let, in a two-level memory hierarchy, H1, T1, S1, C1 be the specifications of level 1 and H2, T2, S2, C2 be the specifications of level 2. Then:

• T_avg = H1·T1 + (1 - H1)·{T1 + T2}   // a miss at level 1 falls through to the larger memory
• T_avg = T1 + (1 - H1)·T2

• For a three-level cache:
T_avg = H1·T1 + (1 - H1)·[H2·(T1 + T2) + (1 - H2)·(T1 + T2 + T3)]
If H1 = 1:          T_avg = T1
If H1 = 0, H2 = 1:  T_avg = T1 + T2
If H1 = 0, H2 = 0:  T_avg = T1 + T2 + T3
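The formulas above telescope over "first hit at level i" probabilities, which makes them easy to compute for any depth. A minimal C sketch of that computation, assuming hierarchical (inclusive) access as in the slides; the function and array names are mine:

#include <stdio.h>

/* T_avg for an inclusive hierarchy: a reference that first hits at level i
   has already paid T1 + ... + Ti; the last level is assumed to always hit. */
double avg_access_time(const double H[], const double T[], int levels) {
    double t_avg = 0.0, p_miss = 1.0, path = 0.0;
    for (int i = 0; i < levels; i++) {
        path  += T[i];                   /* cumulative time down to level i   */
        t_avg += p_miss * H[i] * path;   /* weight by P(first hit at level i) */
        p_miss *= (1.0 - H[i]);          /* still missing after level i       */
    }
    return t_avg;
}

int main(void) {
    double H[] = {0.8, 0.9, 1.0};        /* H1, H2, H3 (last level always hits) */
    double T[] = {10.0, 50.0, 500.0};    /* T1, T2, T3 in ns                    */
    /* 0.8*10 + 0.2*[0.9*60 + 0.1*560] = 30 ns */
    printf("T_avg = %.1f ns\n", avg_access_time(H, T, 3));
    return 0;
}

Setting H1 = 1, or H1 = 0 with H2 = 1 or 0, reproduces the three special cases listed above.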
Que.1: In a two-level cache memory system, the avg. access time without level-1 is 150 ns; with level-1 the avg. access time is 30 ns. The level-1 access time is 20 ns. What is the HIT ratio of level-1?
Soln: T_avg = 30 ns, T1 = 20 ns, T2 = 150 ns, H1 = ?, H2 = 1 (by default)
T_avg = H1·T1 + (1 - H1)·T2   // here T2 is the full miss-path time: the avg. access time without level-1
30 = 20·H1 + (1 - H1)·150
H1 = (150 - 30)/(150 - 20) = 120/130 = 12/13 = 92.3%
Cont.
Que.1.1: In this system, if the maximum tolerated avg. access time is 40 ns, then what is the minimum H1?
Soln: at the limit, T_avg = 40 ns:
40 = 20·H1 + (1 - H1)·150
H1 = 110/130 = 11/13 = 0.84   // if the HIT ratio drops below 0.84, then T_avg > 40 ns

Que.1.2: Which of the following HIT ratios satisfies the tolerance (H1 must be at least 0.84)?
a) .6  b) .78  c) .82  d) .94
If T_avg is not to exceed the tolerated limit, the HIT ratio can never fall below 0.84, so only (d) qualifies.

Que.1.3: In the above problem, if the HIT ratio is made 100%, what will be the value of T2?
a) 0  b) 100  c) 150  d) 120
// H1 does not influence T2 (it stays 150 ns, option c); T2 only influences the avg. access time.
Que.2: Consider, in a multilevel memory hierarchy, information distributed across an L1 cache (instruction), an L2 cache (data), an L3 unified cache (data + instruction) and main memory. The hit ratios of the respective memories are .8, .85, .9 and 1. The respective access times are 10 ns, 10 ns, 50 ns and 500 ns. Among the total references, 60% refer to data. If the referred word is not present in L1/L2, it must be brought from level L3. If it is not present in L3, the word is obtained from main memory into L3 and from L3 into the respective cache memory.
- What will be the average instruction time?
- The average data time? and
- The average access time?

[Diagram: the CPU fetches instructions through L1 and data through L2; both caches share a bus to the unified L3, which connects to MM.]

T_ins = H1·T1 + (1 - H1)·[H3·(T1 + T3) + (1 - H3)·(T1 + T3 + T_MM)]
      = .8×10 + .2×[.9×60 + (1 - .9)×560]
      = 30 ns

T_data = H2·T2 + (1 - H2)·[H3·(T2 + T3) + (1 - H3)·(T2 + T3 + T_MM)]
       = .85×10 + .15×[.9×60 + .1×560]
       = 8.5 + 16.5 = 25 ns

F_data = 60%, F_ins = 40%

T_avg = F_ins×T_ins + F_data×T_data = 30×40/100 + 25×60/100 = 27 ns

 Throughput of the above system:
= 1/T_avg = 1/(27×10⁻⁹) words/sec ≈ 37 million words/sec
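A numeric cross-check of Que.2 in C (values straight from the question; the variable names are mine):

#include <stdio.h>

int main(void) {
    double t1 = 10, t2 = 10, t3 = 50, tmm = 500;   /* access times, ns */
    double h1 = 0.8, h2 = 0.85, h3 = 0.9;          /* hit ratios; MM always hits */

    /* A miss in L1 (or L2) falls through to L3; a miss in L3 adds the MM path. */
    double t_ins  = h1 * t1 + (1 - h1) * (h3 * (t1 + t3) + (1 - h3) * (t1 + t3 + tmm));
    double t_data = h2 * t2 + (1 - h2) * (h3 * (t2 + t3) + (1 - h3) * (t2 + t3 + tmm));
    double t_avg  = 0.4 * t_ins + 0.6 * t_data;    /* 40% instruction, 60% data */

    printf("T_ins = %.0f ns, T_data = %.0f ns, T_avg = %.0f ns\n", t_ins, t_data, t_avg);
    printf("throughput = %.1f million words/sec\n", 1e3 / t_avg);  /* 1/(27 ns) */
    return 0;
}

It prints T_ins = 30 ns, T_data = 25 ns, T_avg = 27 ns and roughly 37.0 million words/sec, matching the derivation above.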
Que.3: Consider a three-level memory system which contains a level-1 cache, a level-2 cache, and main memory, connected as per the following arrangement:

[Diagram: CPU, then L1 (2 ns, block size 4 B), then L2 (20 ns, block size 16 B), then MM (200 ns); each bus transfer moves 4 B (4 words of 1 B).]

1. If L1 has a Miss and L2 has a HIT, a block is transferred from L2 to L1. What is the block transfer time?
a) 2ns b) 20ns c) 80ns d) 88ns
Soln: the 16 B L2 block moves as four 4 B transfers: 4×(20+2) = 88 ns   // note: block transfer time, not access time

2. If there is a Miss at level 1 as well as level 2, the block must be transferred from MM to L2 and then the associated block must be moved from L2 to L1. What will be the total block transfer time?
a) 222ns b) 888ns c) 902ns d) 968ns
Soln: MM to L2 also takes four 4 B transfers of (200+20) ns each:
88 + (200+20)×4 = 968 ns   // the block size is the same, but the number of blocks may differ due to the size of the memory

3. If both buses act in parallel, then:
a) 222ns b) 888ns c) 902ns d) 968ns
Soln: overlapped. Each MM-to-L2 chunk takes 220 ns; while the next chunk streams from MM, the previous one is forwarded from L2 to L1 in 22 ns. The first three L2-to-L1 transfers complete within the 880 ns of MM traffic; only the last one is exposed.
If the bus transfers from MM to L2 and from L2 to L1 are overlapped, the effective block transfer time is
880 + 22 = 902 ns
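A small C check of all three sub-questions (timings from the figure; the four-transfer counts follow from the 16 B : 4 B block-size ratio):

#include <stdio.h>

int main(void) {
    double t_l1 = 2, t_l2 = 20, t_mm = 200;     /* access times in ns */
    int chunks = 4;                             /* one 16 B block = four 4 B transfers */

    double l2_to_l1 = chunks * (t_l2 + t_l1);   /* 4*(20+2)   =  88 ns */
    double mm_to_l2 = chunks * (t_mm + t_l2);   /* 4*(200+20) = 880 ns */

    printf("Q1: L2 hit, L1 miss:       %.0f ns\n", l2_to_l1);                    /*  88 */
    printf("Q2: both miss, sequential: %.0f ns\n", l2_to_l1 + mm_to_l2);         /* 968 */
    /* Q3: with overlapped buses only the final L2->L1 transfer is exposed;
       the first three drain while MM->L2 traffic is still streaming. */
    printf("Q3: both miss, overlapped: %.0f ns\n", mm_to_l2 + (t_l2 + t_l1));    /* 902 */
    return 0;
}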
Registers
• Registers are often referred to as the CPU’s working memory.
• Registers are high-speed storage areas within the CPU, but have the least storage
capacity.
• Registers are not referenced by their address, but are directly accessed and
manipulated by the CPU during instruction execution.
• Registers store data, instructions, addresses and intermediate results of processing.
• The data and instructions that require processing must be brought in the registers
of CPU before they can be processed.

Some of the important registers in CPU are as follows—


1. Accumulator (ACC) stores the result of arithmetic and logic operations.
2. Instruction Register (IR) contains the most recently fetched instruction (the current instruction).
3. Program Counter (PC) contains the address of next instruction to be processed.
4. Memory Address Register (MAR) contains the address of next location in the
memory to be accessed.
5. Memory Buffer Register (MBR) temporarily stores data from memory or the
data to be sent to memory.
6. Data Register (DR) stores the operands and any other data.
BUS Architecture
 A bus is a set of wires used for interconnection, where each wire can carry
one bit of data. (A bus width is defined by the number of wires in the bus).
 CPU sends data, instructions and information to the components inside the
computer as well as to the peripherals and devices attached to it.
 Bus is a set of electronic signal pathways that allows information and
signals to travel between components inside or outside of a computer.
1. The data to be transferred is carried by the data bus. The size of the data bus defines the size of the processor: a processor can be an 8-, 16-, 32- or 64-bit processor.
2. The address of the I/O device or memory is carried by the address bus. The width of the address bus determines the maximum number of memory locations the computer can address.
3. The command to access the memory or the I/O device is carried by the control bus; it specifies whether data is to be read from or written to the memory.

[Diagram: interaction between the CPU and memory over the data, address, and control buses.]

• The Pentium Pro, II, III and IV have a 36-bit address bus that can address 2^36 bytes, or 64 GB, of memory.
Memory Mapping:
Direct mapping:
 M/N blocks of main memory compete for the same position in the cache.
 The Kth block of main memory is placed in cache block (K mod N).
 Requires fewer TAG bits and a single TAG comparator.

Physical address (P = 16 words/block):
| TAG info. (log M/N) | cache block offset (log N) | word offset (log P) |

[Diagram: N = 4 cache blocks, M = 32 MM blocks. MM blocks (0,4,8,...,28) map to cache block 0; (1,5,...,29) to block 1; (2,6,...,30) to block 2; (3,7,...,31) to block 3. The TAG directory in the example holds 001, 000, 000, 001 for cache blocks 0-3.]

 Conflict problem exists.
 TAG controller size = N × log(M/N) = total no. of TAG bits

Exp. Is MM word '8' a HIT or a MISS? Word 8 lies in MM block ⌊8/16⌋ = 0, which maps to cache block 0.
Address 8 = 000 | 00 | 1000: the stored TAG is 001 ≠ 000, so it is a MISS.

Exp. Is MM word '17' a HIT or a MISS? Word 17 lies in MM block ⌊17/16⌋ = 1, which maps to cache block 1.
Address 17 = 000 | 01 | 0001: the stored TAG is 000, so it is a HIT.
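The two HIT/MISS checks can be mechanized. A C sketch of the address split for this exact configuration (M = 32, N = 4, P = 16; the TAG directory values are the ones shown in the figure):

#include <stdio.h>

/* Direct-mapped address split for the slide's example:
   | TAG (3 bits) | cache block (2 bits) | word offset (4 bits) |. */
#define INDEX_BITS  2
#define OFFSET_BITS 4

int main(void) {
    unsigned tag_dir[4] = {1, 0, 0, 1};   /* TAG directory: 001, 000, 000, 001 */
    unsigned words[] = {8, 17};
    for (int i = 0; i < 2; i++) {
        unsigned a     = words[i];
        unsigned index = (a >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        unsigned tag   =  a >> (OFFSET_BITS + INDEX_BITS);
        printf("word %2u -> cache block %u, tag %u: %s\n",
               a, index, tag, tag_dir[index] == tag ? "HIT" : "MISS");
    }
    return 0;   /* word 8 -> MISS, word 17 -> HIT, as above */
}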


Cont.
Hardware design:

[Diagram: the cache-block-offset field of the address (e.g., 000 | 01 | 0000) drives the select lines of the MUXes that read the TAG directory; the selected TAG entry and the address's TAG field feed a TAG comparator whose output is HIT/MISS.]

No. of MUXes required = TAG bits = log(M/N), to select the TAG info., each of size N:1 (N = cache size in blocks)
HIT latency = MUX delay + comparator delay (= EX-NOR delay + AND delay)

 Limitations of direct mapping:
In an initially empty cache, if the referred MM blocks are [16, 0, 16, 0, 4, 0, 4], then the HIT ratio = 0/7: all the referred blocks map to the same cache block (16 mod 4 = 0 mod 4 = 4 mod 4 = 0) and keep evicting each other.
Solution: associative mapping / set-associative mapping.
Quiz:-
Que. If the Kth MM block is referred with direct mapping, which MM words move into the cache?
a) K*P to P
b) K*P to K*P+P
c) K*P to P-1
d) K*P to K*P+P-1
Ans: d // block K spans words K*P to K*P+P-1

Que. How many 2x1 MUXes are required to construct an Mx1 MUX?
Ans: M-1 // M is a power of 2

Que: How many Nx1 MUXes are required to construct an Mx1 MUX?
Ans: (M-1)/(N-1) // provided that M is a power of N
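The layered construction behind the last two answers can be counted directly; a tiny C sketch (muxes() is my name for the helper):

#include <stdio.h>

/* Building an Mx1 mux from Nx1 muxes takes (M-1)/(N-1) of them when M is a
   power of N: each layer divides the number of live inputs by N. */
int muxes(int m, int n) {
    int total = 0;
    while (m > 1) { total += m / n; m /= n; }
    return total;
}

int main(void) {
    printf("16x1 from 2x1: %d\n", muxes(16, 2));   /* 15 = (16-1)/(2-1) */
    printf("64x1 from 4x1: %d\n", muxes(64, 4));   /* 21 = (64-1)/(4-1) */
    return 0;
}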


Associative mapping:
Any block of MM can be placed anywhere in the cache, so no conflict occurs and no conflict-resolving time is needed; it is the fastest.
It requires searching all cache blocks before declaring a HIT/MISS.
 No. of TAG comparators required = N. Costly.
 More TAG bits.

Physical address:
| TAG info. (log M) | word offset (log P) |

[Diagram: 4 cache blocks, each tagged with the full MM block number (e.g., 10000, 10100); the address TAG is compared against all four entries in parallel by comparators 1-4, whose outputs feed an OR gate producing HIT/MISS.]

 No conflict problem.
 Associative mapping requires more TAG bits.
 HIT latency = comparator delay + OR delay (= EX-NOR delay + AND delay)
 In an initially empty cache, if the referred MM blocks are [16, 0, 16, 0, 4, 0, 4], then the HIT ratio = 4/7.
Set-Associative Mapping:-
 The Kth block of main memory is placed in set (K mod S), where 'S' is the total no. of sets in the CACHE. Within the set it can be placed anywhere.
 The cache is divided into logical sets; in two-way set-associative, each set is allocated two cache blocks.
 With K-way set association, the total no. of sets in the cache is S = N/K.
 If K = 1, it becomes direct mapping.
 If K = N, it becomes associative mapping.
 M/S MM blocks compete for each set in the cache.

Physical address:
| TAG info. (log M/S) | set offset (log S) | word offset (log P) |
(With S = 1 the TAG degenerates to the associative width, log M; with S = N, to the direct-mapped width, log M/N.)

 It requires searching all the blocks within the SET before declaring a HIT/MISS.
 For K-way set association, the no. of TAG comparators = K.
 In all cases, TAG controller size = N × TAG bits.
 Conflict problem: reduced but not eliminated.

TAG comparators: DM = 1, AM = N, K-way SAM = K

 Conflicts:
Direct mapping ------ more
Associative mapping ------ not possible
Set-associative ------ less
So, there can be:
Conflict problem → direct mapping, set-associative mapping
Compulsory problem → all mappings
Capacity problem → all mappings
Compulsory problem: compulsory miss references result from the finite block size.
[Example: a block holds words 0-15; the first reference to the 16th word is a compulsory MISS.]
Capacity problem: due to the finite capacity of the cache.
[Example: a 1024-word cache is full; referencing the next (1024th) word is a capacity MISS.]
• So, compulsory miss references can be minimized by increasing the block size.
• Capacity miss references can be minimized by minimizing the block size.
• So, when will there be none of these problems? When CM = MM.
Quiz:-
Que. Consider a K-way set-associative cache with 'V' sets. Each block contains K words. If the MM is referred sequentially, what words are covered when the Kth block is referred?

a) (KmodV)*K to (KmodV)*K+K-1
b) (K*V) to (K*V)+K-1
c) (VmodK)*K to (VmodK)*K+K-1
d) (K*V) to (K*V)+V-1

Que. Consider an MM containing 2^M blocks and a cache memory containing 2^C blocks. If the cache is used in a 2-way set-associative manner, the position of the Kth MM block in the cache will be?

a) the (K mod C)th set
b) the (K mod 2^(C-1))th set
c) the (C mod K)th set
d) the (2^C mod K)th set
Quiz:-
Que. Consider a 64 MB main memory and a 16 KB cache divided into 128-byte blocks. The word size is one byte.
1) What is the size of the physical address?
2) How many blocks are present in MM and in the cache?
3) Compute the no. of TAG bits and TAG comparators for the following mappings:
a) Direct mapping
b) Associative mapping
c) 8-way set-associative mapping

Soln: physical address = log2(64 MB) = 26 bits.
Blocks in MM: M = 64 MB / 128 B = 2^19; blocks in cache: N = 16 KB / 128 B = 2^7 = 128; sets for 8-way: S = 128/8 = 16.

Field layouts:
Direct:          | TAG (log M/N) | cache block offset (log N) | word offset (log P) |
Set-associative: | TAG (log M/S) | set offset (log S) | word offset (log P) |
Associative:     | TAG (log M) | word offset (log P) |

Mapping | TAG bits | TAG comparators
DM      | 12       | 1 (12 bits)
AM      | 19       | N = 128 (19 bits)
SAM     | 15       | 8 (15 bits)

Note: the higher the set association, the more the TAG bits.
Quiz:-
Que. Among 1000 references to a 64-word cache, the following observations are made:
100 references result in a Miss due to the conflict problem, 200 references result in a Miss due to the compulsory problem, and 100 due to the capacity problem. What will be the HIT ratios if the cache is used with direct mapping, and with associative mapping?
a) .6, .7
b) .6, .6
c) .7, .7
With direct mapping all 400 misses apply: H = (1000-400)/1000 = .6; with associative mapping the 100 conflict misses do not occur: H = (1000-300)/1000 = .7. Option (a).
CACHE Memory:-
• Cache is the smallest and fastest component in the memory hierarchy.
• It is used to bridge the speed mismatch between the fast processor and the slower memory components at a reasonable cost.
• The cache holds the locality of reference (the frequently used portion of the program).
• The cache and main memory are divided into equal-sized blocks (the number of blocks in cache is small; the number of blocks in main memory is large).
• Hence more than one main-memory block competes for the same cache position. TAG bits are kept for each cache block to identify which main-memory block currently occupies it.
• The TAG controller is a memory which maintains the TAG information of all cache blocks (the cache directory).

• Address mapping will decide, which block of main memory has to be placed
where in the cache.
CACHE Updation Techniques:-
• Cache coherence problem:
• It results when the contents of the cache and the associated contents in MM differ from each other.
• Solution:
• The cache coherence problem is resolved with write-through updation or write-back updation.

• In write-through updation, the main memory and the cache memory are updated simultaneously.
• The updation time for write-through:
• T_updation = max(T_c, T_mm) = T_mm

• Write-through updation gives better performance when updations are few; it is not suitable when a variable is frequently updated and the intermediate values are not required.
• In write-back updation, MM is updated only when a dirty block is selected for replacement.
• Write-back gives better performance when updations are many.
• To know whether a block brought into the cache has been modified, one extra bit (the Dirty bit) is allocated per block: 0 - not modified, 1 - modified/dirty.

[Diagram: cache blocks B1-B4 with dirty bits (B2's bit = 1); on replacement, the dirty block is copied back to MM before the new block (e.g., B7) takes its place.]

• If a clean block is selected for replacement:
• T_updation = T_Bnew + T_c   // for a clean page; T_c for the current updation
• If a dirty block is selected for replacement:
• T_updation = T_Bold + T_Bnew + T_c   // for a dirty page
Quiz:-
Que. Consider a cache memory with an 80% HIT ratio for read operations and a 90% HIT ratio for write operations. The cache access time, 20 ms, is 5 times faster than the main-memory access time. Write-through updation is used; if there is a miss for a read or a write, a 2-word block must be moved from main memory to cache memory (one word per MM access). If there are 20% write operations:
- What is the avg. read access time?
- The avg. write access time?
- What is the avg. access time when both read and write are taken into consideration?

Soln: H_R = 80%, H_W = 90%, T_C = 20 ms, T_C = T_MM/5, so T_MM = 100 ms
Write-through: F_W = 20%, F_R = 80%
Write-allocate: T_B = 2 × T_MM = 200 ms   // the block is fetched on a miss to reduce the future miss penalty
T_U = max(T_C, T_MM) = 100 ms

T_avg.R = H_R × T_C + (1 - H_R)·(T_Bnew + T_C) = 0.8×20 + 0.2×(220) ms = 60 ms
T_avg.W = H_W × T_U + (1 - H_W)·(T_Bnew + T_U) = 0.9×100 + 0.1×(200+100) = 120 ms
T_avg = F_R × T_avg.R + F_W × T_avg.W = (80/100)×60 + (20/100)×120 = 72 ms
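A C transcription of this solution (the names are mine; times kept in ms as on the slide):

#include <stdio.h>

int main(void) {
    /* Write-through + write-allocate; times in ms. */
    double Tc = 20, Tm = 5 * Tc;                /* cache is 5x faster: Tm = 100 ms */
    double Hr = 0.8, Hw = 0.9, Fr = 0.8, Fw = 0.2;
    double Tb = 2 * Tm;                         /* 2-word block fetch on a miss    */
    double Tu = Tm > Tc ? Tm : Tc;              /* write-through update = max(Tc,Tm) */

    double Tr = Hr * Tc + (1 - Hr) * (Tb + Tc); /* 0.8*20  + 0.2*220 =  60 ms */
    double Tw = Hw * Tu + (1 - Hw) * (Tb + Tu); /* 0.9*100 + 0.1*300 = 120 ms */
    printf("T_read = %.0f ms, T_write = %.0f ms, T_avg = %.0f ms\n",
           Tr, Tw, Fr * Tr + Fw * Tw);          /* T_avg = 72 ms */
    return 0;
}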
Block Replacement Techniques:-
• Block replacement techniques select a victim block such that the number of future miss penalties is reduced.

• The Optimal replacement technique gives the best performance by choosing the block which is not required in the future, or which is not needed for the longest time.
• Its implementation complexity makes it impractical; it is used as the reference for other replacement techniques.
• In FIFO, the block that has spent the longest time in the cache is selected for replacement. The assumption here is that most of that block's references might already have been exhausted.
• The data structure Queue is used for implementation.
• LRU protects the most recently used blocks from being replaced.
• Each block is assigned a USE bit; whenever a block is referred, its USE bit is set to 1 and the other blocks' USE bits are cleared to 0.
• In the event of replacement, only a block whose USE bit is 0 is selected.
• The QUEUE data structure can be extended for LRU, with the only exception that the queue is reorganized in the event of a HIT.
FIFO
HIT: if the block is already present in the cache.
Removed: if the cache is full and another block must be placed, an existing one has to be removed from the cache.
Queue order: left to right (oldest to newest).

• FIFO: N = 4; refs: 1, 7, 18, 7, 13, 20, 1, 17, 7, 13, 25, 13, 22, 19, 10
HITs | final STATUS of CACHE | LAST replaced block
2    | 25, 22, 19, 10        | 13

• LRU: N = 4; refs: 1, 7, 18, 7, 13, 20, 1, 17, 7, 13, 25, 13, 22, 19, 10
HITs | final STATUS of CACHE | LAST replaced block
2    | 13, 22, 19, 10        | 25

• Direct mapping: refs 1, 7, 18, 7, 13, 20, 1, 17, 7, 13, 25, 13, 22, 19, 10
  map to cache blocks (ref mod 4): 1, 3, 2, 3, 1, 0, 1, 1, 3, 1, 1, 1, 2, 3, 2
HITs | final STATUS of CACHE (blocks 0,1,2,3) | LAST replaced block
2    | 20, 13, 10, 19                         | 22
Block Replacement Techniques:-
Que: Consider a 4-way set-associative cache, initially empty, containing 16 blocks. The MM has 256 blocks. LRU replacement policy is used and the following MM blocks are referred:
• 0, 255, 1, 4, 3, 8, 133, 159, 216, 129, 63, 8, 48, 32, 73, 92, 155
  SET (ref mod 4): 0, 3, 1, 0, 3, 0, 1, 3, 0, 1, 3, 0, 0, 0, 1, 0, 3
• N = 16, S = 16/4 = 4

• Then which MM block will not be present in the cache?
a) 216 b) 3 c) 129 d) 8
• How many MM blocks will be present in the cache at the end of the references?
a) 12 b) 11 c) 14 d) 16
Que: Which blocks are replaced? Or how many blocks are replaced? = 4
Que: Which blocks will be present at the end?
Que: Last replaced?
Que: Number of HITs?
Que: HIT ratio?
Que: In the above problem, if associative mapping with LRU is used, what will be the no. of replacements?
a) 0 b) 1 c) 2 d) 3
Que: HIT ratio?
= 1/17
Que: If it is a direct-mapped cache, what will be the total no. of replacements?
a) 7 b) 8 c) 9 d) 6
N = 16; block K goes to cache block (K mod 16).
Que: At the end of the references, what will be the total no. of MM blocks present in the cache with direct mapping?
a) 10 b) 9 c) 11 d) 12
Because six of the sixteen (mod 16) remainders never occur in the reference string.
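These counts can be cross-checked with a small simulator. A C sketch of 4-way set-associative LRU for this exact reference string (structure and names are mine); it reports 1 hit out of 17 references and 4 replacements, matching the answers above:

#include <stdio.h>

#define SETS 4
#define WAYS 4

int main(void) {
    int block[SETS][WAYS], age[SETS][WAYS];
    int refs[] = {0,255,1,4,3,8,133,159,216,129,63,8,48,32,73,92,155};
    int n = (int)(sizeof refs / sizeof refs[0]), hits = 0, replacements = 0;

    for (int s = 0; s < SETS; s++)
        for (int w = 0; w < WAYS; w++) { block[s][w] = -1; age[s][w] = 0; }

    for (int i = 0; i < n; i++) {
        int s = refs[i] % SETS, hit = 0;
        for (int w = 0; w < WAYS; w++) age[s][w]++;           /* everyone ages     */
        for (int w = 0; w < WAYS; w++)
            if (block[s][w] == refs[i]) { hit = 1; hits++; age[s][w] = 0; break; }
        if (!hit) {                                           /* miss: place block */
            int v = 0;
            for (int w = 0; w < WAYS; w++) {
                if (block[s][w] == -1) { v = w; break; }      /* fill an empty way  */
                if (age[s][w] > age[s][v]) v = w;             /* else track the LRU */
            }
            if (block[s][v] != -1) replacements++;            /* a real eviction    */
            block[s][v] = refs[i];
            age[s][v] = 0;
        }
    }
    printf("hits = %d/%d, replacements = %d\n", hits, n, replacements); /* 1/17, 4 */
    return 0;
}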
Temporal and Spatial Locality:-
• Cache memory performance depends significantly on temporal locality and spatial locality. The choice of block size influences these two types of locality differently.
• Temporal locality: refers to the tendency of recently accessed data to be accessed again soon.
• Exp: consider a loop accessing the same variable repeatedly:
for (int i = 0; i < 1000; i++) {
    sum += A[10];   // accessing the same memory location repeatedly
}
• Ensures frequently accessed data stays in the cache and reduces unnecessary data fetches.
• A smaller block size (so that the number of blocks in the cache is larger) is good for temporal locality.
Temporal and Spatial Locality:-
• Spatial locality: refers to the tendency of memory locations near recently accessed locations to be accessed soon.
• Exp: a loop accessing an array sequentially:
for (int i = 0; i < 1000; i++) {
    sum += A[i];   // accessing consecutive memory locations
}
// When accessing A[i], fetching a whole block that includes A[i+1], A[i+2], ...
// ensures that subsequent accesses hit in the cache.
• If the block size is larger, more words are accommodated per block, so there are fewer misses. So a larger block size is good for spatial locality.
Memory Interleaving:-
• Memory Interleaving is a technique used to enhance memory access
performance by distributing consecutive memory addresses across multiple
memory banks.
• It is widely used in high-performance computing, multiprocessor
systems, and modern architectures to overcome memory bottlenecks.
• Memory is divided into modules and data is distributed across modules.
• The physical address (PA) is divided into two parts: word selection and module selection.
• Types: A) Lower-order memory interleaving
         B) Higher-order memory interleaving

A) Lower-order memory interleaving:
• Uses the least significant bits (LSB) of the address to determine the memory module (bank).
• Consecutive memory addresses are placed in different memory banks.
• Improves parallel access to memory and reduces memory access latency.
• Best suited for programs with high spatial locality, such as array accesses and loop iterations.

Address split: | word selection (MSB, 4 bits) | module selection (LSB, 2 bits) |

[Diagram: four modules behind a 2x4 decoder (2 ns); each module access takes 50 ns. Module 0 holds words 0, 4, ..., 60; module 1 holds 1, 5, ..., 61; module 2 holds 2, 6, ..., 62; module 3 holds 3, 7, ..., 63.]

0000 00 → W0 M0
0000 01 → W0 M1
0000 10 → W0 M2
0000 11 → W0 M3
(the higher-order bits select the word, the lower-order bits select the module)

While interleaving, the four module accesses overlap: 50 + 4×2 = 58 ns for four consecutive words.
Cont:
• In general (without interleaving), the time to access 4 words = 4×50 = 200 ns.

B) Higher-order memory interleaving:
• Uses the most significant bits (MSB) of the address to determine the memory bank.
• Consecutive memory addresses may end up in the same bank.
• This can lead to memory bank conflicts and higher latency.
• Better suited for random-access workloads where consecutive accesses are not common.

Address split: | module selection (MSB, 2 bits) | word selection (LSB, 4 bits) |

[Diagram: module 0 holds words 0-15, module 1 holds 16-31, module 2 holds 32-47, module 3 holds 48-63.]

00 0000 → M0 W0
00 0001 → M0 W1
00 0010 → M0 W2
00 0011 → M0 W3
(the higher-order bits select the module, the lower-order bits select the word)

• Time for accessing 4 consecutive words: 2 + 4×50 = 202 ns (the accesses serialize in one module).
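A back-of-envelope comparison of the three cases in C (50 ns modules, 2 ns decoder, 4 consecutive words, as in the figures above):

#include <stdio.h>

int main(void) {
    int words = 4;                 /* 4 consecutive word accesses         */
    double t_mod = 50, t_dec = 2;  /* module access and decoder delay, ns */

    /* No interleaving: strictly serial module accesses. */
    printf("no interleaving: %.0f ns\n", words * t_mod);           /* 200 ns */
    /* Lower-order: consecutive words sit in different modules, so accesses
       overlap; after the first 50 ns, one word completes every 2 ns. */
    printf("lower-order:     %.0f ns\n", t_mod + words * t_dec);   /*  58 ns */
    /* Higher-order: consecutive words share a module, so nothing overlaps. */
    printf("higher-order:    %.0f ns\n", t_dec + words * t_mod);   /* 202 ns */
    return 0;
}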
Memory Interleaving:-
• In order to obtain optimal locality, memory is divided into banks, and each bank contains multiple modules.
• Banks support spatial locality.
• Modules within a bank support temporal locality.
• Que: A processor has a cache memory with a 64-byte block. Main memory is divided into K banks, each bank accommodating 'C' bytes, with C = 2 and K = 24. Consecutive C-byte chunks are mapped onto consecutive banks with wrap-around. All K banks can be accessed in parallel; if the amount of data exceeds the K banks, a second access is required. Bank decoding takes (banks used)/2 ns, and the latency of a bank access is 80 ns. What is the total latency to fetch the initial block from main memory into cache memory?

[Diagram: banks 0-23, two bytes per bank; a 64 B cache block spans 32 chunks, wrapping around the 24 banks.]

• 1st iteration: 48 bytes = 24 banks accessed → decode K/2 = 12 ns + access 80 ns = 92 ns
• 2nd iteration: 16 bytes = 8 banks accessed → decode 8/2 = 4 ns + access 80 ns = 84 ns
• Total latency = 92 + 84 = 176 ns
Instructions, Addressing Modes
An instruction = | Opcode | Operand reference(s) |
Based on functional classification: machine control instructions, arithmetic & logic (A&L) instructions, conditional/iterative instructions (branching), and data transfer instructions.
Data transfer variants: • Reg-Reg • Reg-M/r • M/r-Reg • M/r-M/r

• 4-address instruction: ADD A1, A2, A3, A4
• M[A1] ← M[A2] + M[A3]
• A4? → instruction sequencing // self-sequencing: A4 holds the address of the next instruction
• Adv: simplicity, and any no. of registers can be used
• Disadv: lengthy, requires more memory accesses, slowest

• 3-address instruction: ADD A1, A2, A3
M[A1] ← M[A2] + M[A3]
Uses the program counter for sequencing, with a smaller instruction size.
Adv: less instruction length than a 4-address instruction, faster
Disadv: requires the PC for address sequencing

• 2-address instruction: ADD A1, A2
• M[A1] ← M[A1] + M[A2]
A 2-address instruction is formed by using one of the references to denote both a source operand and the destination of the result. It reduces the size, but overwriting (the source is destroyed) is the problem.
Instructions
• 1-address instruction: ADD A1
• Acc ← Acc + M[A1]
• It resolves the overwriting problem by adding one more processor register, the Accumulator. The accumulator content is overwritten; memory remains the same.
Adv: smaller in length
Disadv: costly due to the Accumulator

• 0-address / stack-address instruction: ADD
• Uses an internal stack for implementation. This results in more complexity and exposure to overflow/underflow problems.
Que: Consider a hypothetical system which supports both one-address and two-address instructions. 16-bit fixed-length instructions are stored in a 128-word memory. If the system uses 2 two-address instructions, what will be the number of one-address instructions supported by the system?
A) 512 B) 384 C) 256 D) 128 E) 0

----16 bits----                              ----16 bits----
1-Address: | Opcode (9) | Address (7) |      2-Address: | Opcode (2) | Address1 (7) | Address2 (7) |

To accommodate a two-address instruction in the system, it is required to forgo 2^7 one-address instructions (the 7 bits of the second address field fold into the one-address opcode space). Since there are 2 two-address instructions, 2 × 2^7 = 256 one-address encodings have been sacrificed among the 512 possible one-address instructions.
Hence the number of available one-address instructions in the instruction set = 512 - 256 = 256.

// P → one-address instructions, Q → two-address instructions:
P = 512 - Q × 2^7
(total possible one-address) - (invalid one-address) = (valid one-address)
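A quick C check of the expanding-opcode count (the field widths are the ones in the layout above):

#include <stdio.h>

int main(void) {
    /* 16-bit instructions: one-address = 9-bit opcode + 7-bit address,
       two-address = 2-bit opcode + two 7-bit addresses (expanding opcode). */
    int q = 2;                          /* two-address instructions in use     */
    int one_addr_space = 1 << 9;        /* 512 possible one-address opcodes    */
    int lost_per_two   = 1 << 7;        /* each 2-addr opcode eats 2^7 of them */
    printf("valid one-address instructions = %d\n",
           one_addr_space - q * lost_per_two);   /* 512 - 2*128 = 256 */
    return 0;
}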
Addressing Modes
| Opcode | Operand reference |
Specifies how the operand reference is given in the instruction.
Non-computable: • Immediate • Direct • Indirect
Computable: • Base • Index • Based Index • Relative Addressing
Addressing Mode (AM) | Definition | Usage | Limitation
Immediate | Operand is part of the instruction | Constants, fast response | Restricted operand size
Direct | EA is present in the instruction | Static variables | Limited address range
Indirect | Reference of the EA is given | Pointers; provides security | SLOW

Instruction layouts (----16 bits----):
Immediate: | Opcode (8) | Operand (8) |
Direct:    | Opcode (8) | Ref. of operand = effective address (8) |
Indirect:  | Opcode (8) | Ref. of effective address (8) |
Computable
• In computable addressing modes, the effective address (EA) is computed using the contents of a fixed processor register and the displacement field of the instruction:

| Opcode | Displacement |

• Base addressing: this mode is useful for relocatable programs.
• Index addressing: this mode is useful for accessing array elements.
• Based Index: hybrid of the two.
• Relative addressing: this mode, with the PC, is useful for intra-segment branching; PC = PC ± d.
• Auto-index addressing modes are useful for implementing the STACK.
• Auto-increment addressing mode: useful for fetching elements from an array. The amount of increment depends on the size of the data item accessed.
Cont.
Given: PC = 2000, Base = 100, Index = 200, A (accumulator) = 100
Memory: M[50] = 75, M[75] = 5, M[150] = 10, M[250] = 50, M[350] = 15
Instruction at 2000: ADD | 50

Addressing Mode | EA | Operand | Result (A + X)
Immediate   | -                | 50 | 150   (no EA required, the operand is part of the instruction)
Direct      | 50               | 75 | 175
Indirect    | 75               | 5  | 105
Based       | 100+50 = 150     | 10 | 110
Index       | 200+50 = 250     | 50 | 150
Based Index | 100+200+50 = 350 | 15 | 115
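The whole table can be reproduced mechanically. A C sketch using the slide's register values and the small memory snapshot above (M() is my stand-in for memory; A = 100 is the accumulator the operand is added to):

#include <stdio.h>

/* Memory snapshot from the slide's table. */
int M(int a) {
    switch (a) {
        case  50: return 75;  case  75: return 5;   case 150: return 10;
        case 250: return 50;  case 350: return 15;  default:  return 0;
    }
}

int main(void) {
    int d = 50, base = 100, index = 200, A = 100;   /* displacement, registers, Acc */
    printf("Immediate:   operand=%2d  result=%d\n", d, A + d);                     /* 150 */
    printf("Direct:      operand=%2d  result=%d\n", M(d), A + M(d));               /* 175 */
    printf("Indirect:    operand=%2d  result=%d\n", M(M(d)), A + M(M(d)));         /* 105 */
    printf("Based:       operand=%2d  result=%d\n", M(base + d), A + M(base + d)); /* 110 */
    printf("Index:       operand=%2d  result=%d\n", M(index + d), A + M(index + d)); /* 150 */
    printf("Based Index: operand=%2d  result=%d\n",
           M(base + index + d), A + M(base + index + d));                          /* 115 */
    return 0;
}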
Instruction Pipeline
• Number of stages a pipeline supports efficiently:

[Diagram: 1-stage through 6-stage pipelines over the same instruction stream. 2-stage: (F+D+E) | S; 4-stage: F | D | E | S; 5-stage: IF | ID | OF | E | S; 6-stage: IF | ID | OF | E | Memory access | Write back to reg. Ideal speedups: 2, 4, 5 and 6 respectively.]

Cont.
• Beyond a certain depth, adding stages stops paying off. Reason: the management activity (latching, stage control) starts taking more time than the useful work per stage.
• Speedup S = Time without pipeline / Time with pipeline; if S < 1, pipelining is not feasible.
• Exp: a 6-stage pipeline, K_OPT = 6.
Instruction Pipeline
• Conventional processing: F | D | E | S on Hardware1 | Hardware2 | Hardware3 | Hardware4.
• For 100 instructions, time = 100 × 4 clocks = 400 clocks.
• IP allows efficient usage of resources.
• Instruction pipeline (by clock 4, all hardware is in use):
S:          I1
E:       I1 I2
D:    I1 I2 I3
F: I1 I2 I3 I4
• In a K-stage instruction pipeline, the time required to implement n instructions:
Tn = (K + n - 1) clocks
• The performance of a pipeline is given by the speedup factor:
S = Time without pipeline / Time with pipeline = K·n clocks / (K + n - 1) clocks, (S > 1)
• Only 100 clocks are required for 97 instructions: CPI_avg = 100/97 = 1.03 clocks; Speedup = 4/1.03 = 3.88
• Only 1000 clocks are required for 997 instructions: CPI_avg = 1.003 clocks; Speedup = 4/1.003 = 3.98
• The more the instructions (n >> K), the more the speedup and the efficiency, the closer to the ideal case: S → K.
• Pipelining reduces the avg. instruction time; it is expected to provide CPI_avg = 1.
T_clock(pipelined) = max(stage delay) + overhead
S_ideal = K (= no. of stages)
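The speedup numbers above all come from one formula; a tiny C helper reproduces them:

#include <stdio.h>

/* Speedup of a K-stage pipeline over n instructions: S = K*n/(K+n-1),
   which approaches K (the ideal speedup) as n >> K. */
double speedup(int k, long n) { return (double)k * n / (k + n - 1); }

int main(void) {
    printf("K=4, n=97:   S = %.2f\n", speedup(4, 97));    /* 3.88  */
    printf("K=4, n=997:  S = %.2f\n", speedup(4, 997));   /* 3.99  */
    printf("K=5, n=100:  S = %.3f\n", speedup(5, 100));   /* 4.807 */
    printf("K=5, n=2000: S = %.2f\n", speedup(5, 2000));  /* 4.99  */
    return 0;
}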
Cont.
 Classification of I.P.:
1) Linear Pipelining: [feed-forward only]
--The processing of data is done in a linear and sequential manner. The input is supplied to the first block/stage and we get the output from the last block/stage.
a) Synchronous:

[Diagram: stages S1 (F+D), S2 (E), S3 (S), separated by buffer registers and driven by a common clock.]

• Synchronization is achieved using buffer registers between the stages.
• The performance of the pipeline is influenced by uneven stage delays.
• As dependencies among instructions reduce the performance of a pipeline, the effective speedup becomes
S_effective = (K × n) / (K + n - 1 + stall cycles)

b) Asynchronous: neighbouring stages exchange READY?/ACK handshake signals; data is transferred when the next stage is READY.
Cont.
2. Non-Linear Pipelining: [feed-forward as well as feedback]
• In a non-linear pipeline, a stage may be used again and again based on the need.
• A reservation table is used to calculate the latency for overlapping instructions in a non-linear pipeline (no use of buffers).

Que1: If, for the first instruction, the time with the pipeline is T1 and the time without the pipeline is T2, then:
a) T1 < T2
b) T1 > T2
c) T1 = T2
d) T1 ≤ T2

[Stages: S1 (10 ns), S2 (5 ns), S3 (3 ns), each followed by a 2 ns buffer.]

Without pipeline: 10 + 5 + 3 = 18 ns
With pipeline: (n + K - 1) clocks = (1 + 3 - 1) = 3 clocks = K clocks, with T_clock = max(stage delay) + buffer = 10 + 2 = 12 ns, giving 3 × 12 = 36 ns.
So for a single instruction T1 > T2: option (b).
Cont.
Que.1: A program P1 of 100 instructions is implemented on a 5-stage instruction pipeline. Each stage has a uniform delay of 10 ns. A program P2 of 2000 instructions is also implemented using the same pipeline.
a) Calculate the time required for both programs.
b) Compute the speedup factor for each case.
c) What efficiencies are extracted from the pipeline by the programs?

Soln: P1: 100 instr., K = 5, stage delay = 10 ns.  P2: 2000 instr., K = 5.

a) T_100 = (5 + 100 - 1) × 10 ns = 1.04 µs;  T_2000 = (5 + 2000 - 1) × 10 ns = 20.04 µs

b) S_P1 = (5 × 100)/104 = 4.807;  S_P2 = (5 × 2000)/2004 = 4.99

c) Efficiency = (S/K) × 100%:
If S = 5, efficiency = 100%; if S = 1, efficiency = 100/5 %.
For S_P1 = 4.807, efficiency = (4.807 × 100)/5 = 96.14%; for S_P2 = 4.99, efficiency = (4.99 × 100)/5 = 99.8%.

d) S_ideal = 5
Cont.
Que.3: Consider a 4-stage instruction pipeline where the stages are implemented using combinational circuits. Between every two stages, and after the last stage, there are interstage buffer registers. The delays of the stages and buffer registers are given below:

S1 (8 ns) | 1 ns | S2 (10 ns) | 1 ns | S3 (11 ns) | 1 ns | S4 (10 ns) | 1 ns

Under ideal conditions, in the steady (stable) state, what is the minimum speedup gained with the pipeline compared to a non-pipelined system?
Soln:
S = Time without pipeline / Time with pipeline = ΣTi / (max stage delay + buffer delay) = (8+10+11+10) / (11+1) = 39/12 = 3.25
Cont.
If different instructions spend different numbers of clocks, then check for:
--Availability // when the stage is released by the previously executing instruction.
--Eligibility // when the instruction's previously assigned task is done.
• The efficiency of the pipeline gets reduced due to dependency: if an instruction is eligible but the stage is not available, that instruction will incur STALL cycles.
• This in turn reduces efficiency, as dependencies require reorganization of the instruction pipeline.

Que: Consider a hypothetical pipeline in which different instructions spend different numbers of clocks. The following table gives the number of clocks required in each stage for each instruction. How many clocks are required to implement the following 4 instructions?

Instruction | S1 | S2 | S3 | S4
I1          | 2  | 1  | 3  | 2
I2          | 1  | 1  | 2  | 3
I3          | 3  | 2  | 2  | 2
I4          | 1  | 1  | 1  | 1

Speedup = Time without pipeline / Time with pipeline = 28 clocks / 14 clocks = 2
Dependencies
I1: ADD R1, R2, R3   // R1 ← R2 + R3
I2: ADD R4, R1, R5   // R4 ← R1 + R5

1) Data dependencies
2) Control dependencies
3) Structural dependencies

1) Data dependencies: arise if one instruction in the pipeline is waiting for the result of another instruction, which has not yet been computed.
Source: arithmetic instructions are the main source of data dependency.
Problem: data inconsistency.

Clock:            1    2    3           4
I1: ADD R1,R2,R3  F    D    E(R2+R3)    Store(R1)
I2: ADD R4,R1,R5       F    D           E(R1+R5)    Store(R4)   ← reads R1 before I1 has stored it
Cont.
Solutions:
a) Introduction of STALL cycles [wait till things are over]:

Clock:            1    2    3      4          5    6           7
I1: ADD R1,R2,R3  F    D    E      Store(R1)
I2: ADD R4,R1,R5       F    ###    ###        D    E(R1+R5)    Store(R4)
                            (two STALL cycles)

• If the clock count increases, the performance decreases. Then:
b) Instruction rescheduling: to minimize the STALL penalty, independent instructions (Ii, Ij, ...), equal in number to the STALL cycles, are placed between the dependent instructions I1 and I2.
Cont.
c) Operand forwarding:
• Operand forwarding is capable of dealing with most of the data dependencies: the value of the operand is given to the consuming stage before it is stored, using the interstage buffer registers.
• It reduces the stall penalty (where it cannot eliminate it).

I1: ADD R1, R2, R3   // R1 ← R2 + R3
I2: ADD R4, R1, R5   // R4 ← R1 + R5

Clock:            1    2    3           4                 5
I1: ADD R1,R2,R3  F    D    E(R2+R3)    Store(R1)
I2: ADD R4,R1,R5       F    D           E((R2+R3)+R5)     Store(R4)
The E-stage result of I1 is forwarded straight into I2's E stage, so no stall is needed.

A load still exposes one stall even with forwarding:
I1: LD R1, R2(100)   // R1 ← M[R2+100]
I2: ADD R3, R1, R4   // R3 ← R1 + R4

Clock:    1    2     3           4              5         6        7
I1: LD    IF   ID    E(R2+100)   MEM(M[R2+100]) WB(R1)
I2: ADD        IF    STALL       ID             E(X+R4)   MEM(-)   WB(R3)
(X = the loaded value, forwarded from I1's MEM stage.)
Que.: Consider a five-stage instruction pipeline with instruction fetch (IF), instruction decode (ID), operand fetch (OF), perform operation (PO) and write operand (WO) stages. All the stages except perform operation (PO) consume 1 clock/instruction/stage. The perform-operation stage requires 6 clocks for division, 3 clocks for multiplication, and 1 clock each for addition and subtraction. How many minimum clocks are required to implement the following four instructions with operand forwarding?
I0: MUL R2, R0, R1
I1: DIV R5, R3, R4
I2: ADD R2, R5, R2
I3: SUB R5, R2, R6

I0 MUL: IF(1)  ID(2)  OF(3)        PO(4-6): R0×R1 = X        WO(7): R2
I1 DIV: IF(2)  ID(3)  STALL        OF(6)  PO(7-12): R3/R4 = Y  WO(13): R5
I2 ADD: IF(3)  ID(4)  STALL        OF(12) PO(13): Y+R2         WO(14): R2
I3 SUB: IF(4)  ID(5)  STALL        OF(13) PO(14): (Y+R2)-R6    WO(15): R5

Y and (Y+R2) are operand-forwarded into the dependent OF/PO stages as soon as they are produced.
Minimum total = 15 clocks.
