COA For Midterm
(COA CSE/IT-222)
Learning Objectives:
--By the end of this course, students should be able to:
Knowledge-Based Objectives:
• Explain the architecture of modern computer systems, including the role of each major component.
• Analyze the memory hierarchy and the role of caches, paging, and virtual memory.
• Describe different types of instruction set architectures and evaluate their trade-offs.
• Understand the principles of pipelining and how it improves system performance.
Skill-Based Objectives:
• Design and simulate simple digital circuits related to computer architecture.
• Write low-level assembly language programs to illustrate machine-level operations.
• Calculate performance metrics such as CPI (cycles per instruction) and assess how architectural features
influence performance.
Application-Based Objectives:
• Apply the principles of computer organization to analyze the behavior of real-world systems.
• Compare and contrast different architectural paradigms such as RISC, CISC.
• Demonstrate an understanding of how hardware and software interact to execute a program.
Computer Architecture
• Deals with the conceptual design of the system.
--Focusing on what the system should do.
--It addresses how to meet system goals, such as performance, power
efficiency, and compatibility.
• Instructions (CPU architecture paradigms):
---CISC // e.g. Intel x86, AMD, and IBM processors (Pentium)
---RISC // e.g. ARM, MIPS, SPARC, RISC-V, PowerPC processors
CISC is suited for desktops and servers, where instruction-level performance is critical.
RISC is ideal for embedded systems, multimedia, real-time applications, and automotive systems.
So, in a nutshell-
Architecture defines "what" the system should do at a conceptual level.
Organization focuses on "how" to build the system to achieve that conceptual
design efficiently, in physical terms.
Performance Factors
• Representation - faster access
• Storing and Accessing - Cache: reducing avg. access time
• Processing - Pipelining: reducing avg. processing time
• Communication - Direct Memory Access (DMA)
[Figure: CPU and Memory connected to input and output devices through a DMA device and an Input/Output processor]
Utility:
• In modern computing-
MIMD is dominant in general-purpose multi-core processors (distributed systems), and
SIMD is extensively used in GPUs for data-intensive tasks (vector calculations).
Primary Components of Architecture (Flynn's classification)
Instruction streams | Data streams | Class
1                   | 1            | SISD
1                   | N modules    | SIMD
N                   | M modules    | MIMD
Memory Organization
[Figure: CPU with the ith level (random-access memory) above the (i+1)th level (serial-access tape memory); data moves between levels in blocks/records]
Parameters of level i:
• Access time - 'Ti'
• Size - 'Si'
• Cost - 'Ci'
• Access frequency - 'fi'
• Level - 'i', i.e. CM (cache memory) is a subset of MM (main memory)
Performance Measures
• The performance of a memory hierarchy is given by the HIT ratio (availability of the
referred information at the referred place is termed a HIT), where T1 < T2 < T3 < ... across levels.
• More HITs mean less avg. access time, so H varies inversely with Tavg.
• H grows with the cache size S.
• Issue with Memory Hierarchy?
Data inconsistency. Soln: proper updation.
• Tavg = H.T1 + (1-H).(T1 + T2)   {hierarchical access}
• Tavg = H.T1 + (1-H).T2          {simultaneous access}
Que1.2. Which of the following HIT ratios satisfies the condition if the HIT ratio is at most .84?
a) .6  b) .78  c) .82  d) .94
// If Tavg is not to go too high, then the HIT ratio will never decrease below that value.
Que1.3. In the above problem, if the HIT ratio is made 100%, what will be the
value of T2?
a) 0  b) 100  c) 150  d) 120
// T2 does not influence the HIT ratio; instead it influences the Avg. access time.
Que.2: Consider, in a multilevel memory hierarchy, information is
distributed in L1 cache (instruction), L2 cache (data), L3 unified
cache (data+instruction) and main memory. The hit ratios of the
respective memories are .8, .85, .9 and 1. The respective access
times are 10 nsec, 10 nsec, 50 nsec and 500 nsec. Among the
total references, 60% refer to data. If the referred word is
not present in L1/L2, it must be brought from level L3. If it is not
present in L3, the word is obtained from main memory into
L3 and from L3 into the respective cache memory.
-What will be the average instruction time?
-The average data time? and
-The average access time?
[Figure: CPU connected to the L1 (instruction) and L2 (data) caches, both backed by the unified L3 cache, which connects over the BUS to MM]
Tinst = H1.TL1 + (1-H1).[H3.(TL1+TL3) + (1-H3).(TL1+TL3+TMM)]
      = .8x10 + .2x[.9x60 + .1x560]
      = 30 nsec
Tdata = H2.TL2 + (1-H2).[H3.(TL2+TL3) + (1-H3).(TL2+TL3+TMM)]
      = .85x10 + .15x[.9x60 + .1x560]
      = 25 nsec
Tavg  = Tinst x 40/100 + Tdata x 60/100
      = 30x40/100 + 25x60/100
      = 27 nsec
Throughput of the above system:
      = 1/Tavg = 1/(27x10^-9) words/sec
      ≈ 37 million words/sec
Que.2: Consider a three-level memory system which contains a
Level-1 cache, a Level-2 cache, and main memory. They are
connected as per the following arrangement:
Word size = 4 B; block size: L2 = 4 B, MM = 16 B (4 words);
access times: L1 = 2 ns, L2 = 20 ns, MM = 200 ns.
[Figure: CPU - L1 - L2 - MM, with 4 B transfers between CPU/L1/L2 and 16 B (4-word) blocks between L2 and MM]
Working (from the notes): 88 + (200+20)x4
Exp. Is MM word '8' a HIT or MISS? 8 mod 16 = 0, so it lies in the 0th block of MM, which maps to the 0th cache block.
[Figure: direct-mapped lookup - the physical address splits into TAG, cache-block offset, and word offset; the block offset drives N:1 MUXes (select lines) over the TAG directory, and the selected TAG is compared with the address TAG to signal HIT/MISS; example shown with N = 4 cache blocks]
No. of MUX required = TAG bits = log2(M/N), to select the TAG info.; each MUX is of size N:1 (N = no. of cache blocks)
HIT Latency = MUX delay + comparator delay (= EX-NOR delay + AND delay)
The TAG comparator is the slowest element in the path.
Limitations of Direct Mapping:
In an initially empty cache, if the referred MM blocks are [16, 0, 16, 0, 4, 0, 4] then HIT ratio = 0/7,
because all the referred blocks map to the same cache block.
Solution- Associative mapping/Set-associative mapping
Quiz:-
Que. If the Kth MM block is referred with direct mapping, then which MM words
move into the cache? ('P' = block size in words)
a) K*P to P
b) K*P to K*P+P
c) K*P to P-1
d) K*P to K*P+P-1
Que. How many 2x1 MUX are required to construct an Mx1 MUX?
Que. How many Nx1 MUX are required to construct an Mx1 MUX?
Set-Associative Mapping:-
The Kth block of main memory is placed in set (K mod S). // 'S' is the total no. of sets in the CACHE.
Within the set it can be placed anywhere.
The cache is divided into logical sets; in two-way set-associative, each set is allocated two cache blocks.
With K-way set association the total no. of sets in the cache is S = N/K.
If K=1, it becomes Direct mapping.
If K=N, it becomes Associative mapping.
M/S MM blocks are competing for each set in the cache.
Physical Address: | TAG info. | Set offset | Word offset |
                    log2(M/S)   log2(S)      log2(P)
Associative (S=1): TAG bits = log2(M)
Direct (S=N):      TAG bits = log2(M/N)
It requires searching all the blocks within the SET before telling HIT/MISS.
For K-way set association, the no. of TAG comparators = K.
In all cases, TAG Controller Size = N x TAG bits.
Conflict Problem- Reduced but not eliminated.
TAG comparators: DM = 1, AM = N, K-way SAM = K
Conflicts:
Direct Mapping ------ More
Associative Mapping ------ Not possible
Set-associative ------- Less
So, there can be:
Conflict Problem    -> Direct mapping, Set-associative mapping
Compulsory Problem  -> All mappings
Capacity Problem    -> All mappings
Compulsory Problem: compulsory-miss references result from the finite block
size. Exp: with a block holding words 0-15, the first reference to the 16th word is a MISS.
Capacity Problem: due to the finite capacity of the cache.
Exp: with a 1024-word cache (words 0-1023), the reference to the 1024th word is a MISS.
a) (KmodV)*K to (KmodV)*K+K-1
b) (K*V) to (K*V)+K-1
c) (VmodK)*K to (VmodK)*K+K-1
d) (K*V) to (K*V)+V-1
a) (KmodC) set
b) (Kmod2C) set
c) (CmodK) set
d) (2^CmodK) set
Quiz:-
Que. Consider 64MB main memory and a 16KB cache divided into 128-byte blocks.
The word size is one byte.
1) What is the size of the physical address?
2) How many blocks are present in MM and in the Cache?
3) Compute the no. of TAG bits and TAG comparators for the following mappings:
a) Direct mapping
b) Associative mapping
c) 8-way Set-associative mapping
Note:
a) Higher the set association, more the TAG bits
b) Higher the set association, less the TAG bits
Quiz:-
Que. Among 1000 references to a 64-word cache, the following observations
are made: 100 references result in a Miss due to the conflict problem, 200 references
result in a Miss due to the compulsory problem, and 100 due to the capacity problem.
What will be the HIT ratio if the cache is used with Direct mapping, and with
Associative mapping?
a) .6, .7
b) .6, .6
c) .7, .7
CACHE Memory:-
• Cache is the smallest and fastest component in the memory hierarchy.
• It is used to bridge the speed mismatch between the fast processor and
the slower memory components at a reasonable cost.
• The cache maintains locality of reference (frequently used portion of the
program).
• The cache and main memory are divided into equal sized blocks (number of
blocks in cache is less, no. of blocks in main memory is more).
• Hence more than one main-memory block competes for the same cache
position. TAG bits are kept for each cache block to identify which main-
memory block is currently in the cache.
• TAG controller is a memory which maintains the TAG information of all
cache blocks (cache directory).
• Address mapping will decide, which block of main memory has to be placed
where in the cache.
CACHE Updation Techniques:-
• Cache coherence Problem:
• It results when the contents of the cache and the associated
contents in MM differ from each other.
• Solution:
• The cache coherence problem is resolved with write through updation,
write back updation.
• In the write through updation, the main memory and cache memory are
simultaneously updated.
• The time for updation in Write through:
• Tupdation = Max(Tc, Tmm) = Tmm
• Write-through updation gives better performance for a smaller number of
updations; it is not suitable when a variable is frequently updated and the
intermediate values are not required.
• In Write back updation, MM is updated only when a Dirty block is
selected for replacement.
• Write back gives better performance for a larger number of updations.
• In order to know whether the block brought into the cache has been modified, one
extra bit (the Dirty bit) is allocated per block: 0 - not modified, 1 - modified/Dirty.
[Figure: write back - cache blocks 0-3 hold MM blocks B1-B4 with their dirty bits; the dirty block (bit = 1) is copied back to MM before the new block B7 replaces it]
• FIFO, N=4; references (left to right): 1, 7, 18, 7, 13, 20, 1, 17, 7, 13, 25, 13, 22, 19, 10
• 4-way set-associative, N=16, S = 16/4 = 4; block references: 0, 255, 1, 4, 3, 8, 133, 159, 216, 129, 63, 8, 48, 32, 73, 92, 155
  SET (block mod 4): 0, 3, 1, 0, 3, 0, 1, 3, 0, 1, 3, 0, 0, 0, 1, 0, 3
• How many MM blocks will be present in the cache at the end of the references?
a) 12 b) 11 c) 14 d) 16
Que: Which blocks are replaced? Or how many blocks are replaced? = 4
Que: Which blocks will be present at the end?
Que: Which block was replaced last?
Que: Number of Hits?
Que: Hit Ratio?
Que: In the above problem, if Associative mapping with LRU is used, what will
be the no. of replacements?
a) 0 b) 1 c) 2 d) 3
Temporal and Spatial Locality:-
• Temporal locality: refers to the tendency of recently accessed memory
locations to be accessed again soon.
• Spatial locality: refers to the tendency of memory locations near recently
accessed locations to be accessed soon.
• Exp:
// A loop accessing an array sequentially:
for (int i = 0; i < 1000; i++) {
    sum += A[i];   // Accessing consecutive memory locations
}
// When accessing A[i], fetching a whole block that includes A[i+1], A[i+2],
// ... ensures that subsequent accesses hit in the cache.
Cont.
• Consecutive memory addresses are placed in different memory banks.
• Improves parallel access to memory and reduces memory access latency.
• Best suited for programs with high spatial locality, such as array
accesses and loop iterations.
Lower-order Memory Interleaving: the lower-order bits of the address select
the module, and the higher-order bits select the word within it. With a 2x4
decoder (2 ns) in front of four modules (50 ns each):
  0000 00 -> W0 of M0
  0000 01 -> W0 of M1
  0000 10 -> W0 of M2
  0000 11 -> W0 of M3
[Figure: 64-word memory as 4 modules; 4 word-select bits (MSB) + 2 module-select bits (LSB); consecutive words 0, 1, 2, 3 sit in different modules, so their 50 ns accesses overlap]
Time for 4 words (while interleaving) = 4x2 ns (decoding) + 50 ns (overlapped access) = 58 ns
Cont:
• In general, time to access 4 words = 4x50 = 200 ns.
Higher-order interleaving: the higher-order bits select the module and the
lower-order bits select the word, so consecutive addresses fall in the SAME module:
  00 0000 -> M0 W0
  00 0001 -> M0 W1
  00 0010 -> M0 W2
  00 0011 -> M0 W3
[Figure: words 0-15 in M0, 16-31 in M1, 32-47 in M2, 48-63 in M3]
• Time for accessing 4 words: 2 ns + 4x50 = 202 ns (no overlap possible).
Memory Interleaving:-
• In order to obtain optimal locality, memory is divided into Banks, and each
Bank contains multiple modules.
• Banks support Spatial Locality.
• Modules within a bank support Temporal Locality.
• Que: A processor has a cache memory with a 64-byte block. The main
memory is divided into K banks, each accommodating 'C' bytes, with C = 2
and K = 24. Consecutive C-byte chunks are mapped onto consecutive banks
with wrap-around. All K banks can be accessed in parallel, but two rounds of
accesses are required if the amount of data exceeds the K banks. The bank
decoding takes K/2 ns and the latency of a bank access is 80 ns. What is the
total latency to bring the initial block from main memory to cache memory?
[Figure: the 32 chunks of a 64 B block mapped round-robin onto banks 0-23; chunks 0-23 are fetched in one parallel access and the remaining chunks wrap around into the low-numbered banks for a second access]
Computable addressing modes:
• Base addressing
• Index addressing
• Base-Index addressing
• Relative addressing
Addressing Mode (AM) | Definition                         | Usage                       | Limitation
Immediate            | Operand is part of the instruction | Constants, fast response    | Restricted operand size
Direct               | EA is present in the instruction   | Static variables            | Limited address range
Indirect             | Reference of EA is given here      | Pointers, provides security | SLOW

Instruction formats (16 bits each):
Immediate: | Opcode (8) | Operand (8) |
Direct:    | Opcode (8) | Ref. of operand (Effective address) (8) |
Indirect:  | Opcode (8) | Ref. of Effective address (8) |
Computable
• In computable addressing modes, the Effective Address is computed using
the contents of a fixed processor register and the displacement field of the instruction.
Instruction format: | Opcode | Displacement |
Relative addressing: PC = PC ± d
• Auto Index Addressing Modes: are useful for implementing the STACK.
[Figure: speedup vs. number of pipeline stages - speedup grows (2, 4, 5, 6, ...) but saturates, so increasing stages indefinitely is not feasible]
Cont.
• Arbitrarily deep pipelines are not feasible.
• P-IV having 6-stage pipelining: KOPT = 6
Instruction Pipeline
• Conventional Processing: each instruction passes through F, D, E, S
(Hardware 1 - Hardware 4) one after another.
• For 100 Instructions, time = 100*4 clocks = 400 clocks.
• IP allows efficient usage of resources.
• Instruction Pipeline (space-time diagram):
  S:           I1 ...
  E:        I1 I2 ...
  D:     I1 I2 I3 ...
  F:  I1 I2 I3 I4 ...   (all hardware in use)
• In a K-stage instruction pipeline the time required to implement n instructions:
  Tn = (K+n-1) clocks
• The performance of a pipeline is given by the speedup factor:
  S = (Time without Pipeline)/(Time with pipeline) = (K*n clocks)/((K+n-1) clocks), (S>1)
• Only 100 clocks are required for 97 instructions (K=4): CPIavg = 100/97 = 1.03 clocks.
• Speedup = 4/1.03 = 3.88
• Only 1000 clocks are required for 997 instructions: CPIavg = 1000/997 = 1.003 clocks.
• Speedup = 4/1.003 = 3.98
• The more the instructions (n >> K), the more the speedup and the efficiency,
approaching the ideal case: S -> K.
• It reduces the Avg. Instruction Time and is expected to provide CPIavg = 1.
TclockP = max(Stage Delay) + Overhead
Sideal = K (= No. of Stages)
Cont.
Classification of I.P.:
1) Linear Pipelining: [Feed Forward Only]
--The processing of data is done in a linear, sequential manner. The input is
supplied to the first block/stage and we get the output from the last block/stage.
a) Synchronous: INPUT -> S1 (F+D) -> Buffer -> S2 (E) -> Buffer -> S3 (S) -> OUTPUT,
with all stages driven by a common Clock.
• Synchronization is achieved using Buffer Registers between Stages.
• Performance of the Pipeline is influenced by uneven stage delays.
b) Asynchronous: S1, S2, S3 communicate with READY?/ACK handshake signals;
data is transferred only if the next stage signals READY.
Cont.
2. Non-Linear Pipelining: [Feed Forward as well as Feed Back]
Que1: If for the first instruction the Time with Pipeline is T1 and the time without Pipeline
is T2, then:
a) T1 < T2
b) T1 > T2
c) T1 = T2
d) T1 ≤ T2
[Figure: 3-stage pipeline S1 (10ns), S2 (5ns), S3 (3ns) with 2ns buffer registers between stages]
Soln:
P1: 100 instr., K=5; P2: 2000 instr., K=5; clock delay = 10ns
b) SP1 = (K*n)/(K+n-1) = (5*100)/(5+100-1) = 500/104 = 4.807
   SP2 = (5*2000)/(5+2000-1) = 10000/2004 = 4.99
Que: A 4-stage pipeline has stage delays S1 (8ns), S2 (10ns), S3 (11ns), S4 (10ns).
Under ideal conditions, in steady (stable) state, what is the minimum speedup
gained with the pipeline compared to the non-pipeline system?
Soln: In steady state I1, I2, I3, I4, ... complete one per clock, and the pipeline
clock equals the slowest stage, so
Speedup = (8+10+11+10)/max(8,10,11,10) = 39/11 ≈ 3.54
Dependencies
1) Data Dependencies
2) Control Dependencies
3) Structural Dependencies
1) Data Dependencies: one instruction in the pipeline is waiting for the result of another
instruction, which is not yet computed.
Source: Arithmetic instructions are the main source of data dependency.
Problem: Data inconsistency.
I1 : ADD R1, R2, R3   // R1 <- R2 + R3
I2 : ADD R4, R1, R5   // R4 <- R1 + R5
Clock:  1    2    3          4
I1:     F    D    E(R2+R3)   Store(R1)
I2:          F    D          E(R1+R5)   Store(R4)   <- E in clock 4 needs R1, which is stored only in clock 4
Cont.
Solutions:
a) Introduction of STALL cycles: [Wait till things are over]
Clock:  1    2    3          4           5    6          7
I1:     F    D    E(R2+R3)   Store(R1)
I2:          F    ###        ###         D    E(R1+R5)   Store(R4)
(### = STALL cycles until R1 is stored)
• If the stall clocks increase, the performance decreases. Then-
b) Instruction Reordering: fill the stall slots with independent instructions
Ii, Ij scheduled between the dependent pair (compiler-level code scheduling).
Cont.
c) Operand Forwarding:
• Operand forwarding is capable of dealing with most of the data dependencies:
the value of an operand is given to the needing stage before it is Stored, using
inter-stage buffer registers.
• It reduces the stall penalty (if it cannot eliminate it).
I1 : ADD R1, R2, R3   // R1 <- R2 + R3
I2 : ADD R4, R1, R5   // R4 <- R1 + R5
Clock:  1    2    3          4               5
I1:     F    D    E(R2+R3)   Store(R1)
I2:          F    D          E((R2+R3)+R5)   Store(R4)
(the E-stage result of I1 is forwarded directly to the E stage of I2, so no stall is needed)
Que.: Consider a five-stage instruction pipeline with instruction fetch (IF), instruction
decode (ID), operand fetch (OF), perform operation (PO) and write operand (WO) stages.
All the stages except perform operation (PO) consume 1 clock/instruction/stage. The
perform operation requires 6 clocks for Division, 3 clocks for Multiplication and 1 clock
each for Addition and Subtraction. How many minimum clocks are required to implement
the following four instructions with operand forwarding?
I0: MUL R2, R0, R1
I1: DIV R5, R3, R4
I2: ADD R2, R5, R2
I3: SUB R5, R2, R6
Clock:  1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
I0:     IF  ID  OF  PO  PO  PO  WO                                    (R0*R1 = X -> R2)
I1:         IF  ID  OF  --  --  PO  PO  PO  PO  PO  PO  WO            (R3/R4 = Y -> R5)
I2:             IF  ID  --  --  --  --  --  --  --  OF  PO  WO        (Y+R2 -> R2)
I3:                 IF  ID  --  --  --  --  --  --  --  OF  PO  WO    ((Y+R2)-R6 -> R5)
Minimum clocks = 15 (-- = STALL; the single PO unit and the back-to-back
dependencies serialize the PO stages, with results forwarded PO-to-PO).