ACA - All Unit

The document discusses various computer architecture concepts including MIPS, IPC, CPI, Amdahl's Law, instruction sets (RISC and CISC), instruction pipelining, and pipeline hazards. It also covers advanced techniques such as operand forwarding, branch prediction, dynamic scheduling, and multithreading, along with their advantages and disadvantages. Additionally, it introduces VLIW architecture, superscalar execution, and super pipelining, emphasizing their roles in enhancing processing efficiency.

5. MIPS:

 Million instructions per second (MIPS) is an approximate measure of a computer's raw processing power.
 MIPS figures can be misleading because measurement techniques often differ, and different computers may require different sets of instructions to perform the same activity.
 It is most meaningful when the amount of work being measured is large.
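
A commonly used formula (added here as a worked example, not part of the original notes):
MIPS = Instruction count / (Execution time x 10^6) = Clock rate / (CPI x 10^6).
For example, a 2 GHz processor with an average CPI of 2 delivers about (2 x 10^9) / (2 x 10^6) = 1000 MIPS.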


6. IPC:

 Interprocess communication is the mechanism provided by the operating system that allows processes to communicate with each other.
 This communication could involve a process letting another process know that some event has occurred, or the transferring of data from one process to another.

7. CPI:

 Cycles per instruction (aka clock cycles per instruction, clocks per instruction, or CPI) is
one aspect of a processor's performance: the average number of clock
cycles per instruction for a program or program fragment.
 It is the multiplicative inverse of instructions per cycle.
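
As a quick worked example (numbers added for illustration): a program of 1,000,000 instructions that takes 2,000,000 clock cycles has CPI = 2,000,000 / 1,000,000 = 2, and therefore IPC = 1 / CPI = 0.5.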

8. Amdahl’s Law:

 It is named after computer scientist Gene Amdahl.


 It is also known as Amdahl’s argument.
 It is a formula that gives the theoretical speedup in latency of the execution of a task
at a fixed workload that can be expected of a system whose resources are improved.
 In other words, it is a formula used to find the maximum improvement possible by just
improving a particular part of a system.
 Amdahl’s law uses two factors to find speedup from some enhancement:

i. Fraction enhanced

ii. Speedup enhanced
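
In formula form: Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced).
A minimal C sketch of this calculation (the function name and numbers below are illustrative, not from the notes):

#include <stdio.h>

/* Overall speedup predicted by Amdahl's Law. */
double amdahl_speedup(double fraction_enhanced, double speedup_enhanced)
{
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced);
}

int main(void)
{
    /* If 60% of the execution time can be made 3x faster, the overall
       speedup is 1 / (0.4 + 0.6/3) = 1.67x, no matter how much faster
       the enhanced part alone becomes. */
    printf("%.2f\n", amdahl_speedup(0.6, 3.0));
    return 0;
}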

9. Instruction Set:

 An instruction set is the collection of codes that the computer's processor can understand.
 The code is usually in 1s and 0s, i.e. machine language.
 It contains instructions or tasks that control the movement of bits and bytes within the processor.
 The instruction set provides commands to the processor, to tell it what it needs to do.
Types: Generally, there are two types of instruction sets used in computers.

i). RISC(Reduced Instruction set Computer):


 Relatively few instructions.
 Relatively few addressing modes.
 Memory access limited to load and store instructions.
 All operations done within the register of the CPU.
 Single-cycle instruction execution.
 Fixed length, easily decoded instruction format.
 Hardwired rather than microprogrammed control.

ii). CISC(Complex Instruction Set Computer):

 A large number of instructions typically from 100 to 250 instructions.


 Some instructions that perform specialized tasks and are used infrequently.
 A large variety of addressing modes, typically from 5 to 20 different modes.
 Variable length instruction formats.
 Instructions that manipulate operands in memory.

10. RISC vs CISC:

RISC                                              | CISC
Focus on software                                 | Focus on hardware
Uses only hardwired control unit                  | Uses both hardwired and microprogrammed control unit
Transistors are used for more registers           | Transistors are used for storing complex instructions
Fixed-size instructions                           | Variable-size instructions
Can perform only register-to-register arithmetic  | Can perform REG to REG, REG to MEM, or MEM to MEM operations
Requires more registers                           | Requires fewer registers
Code size is large                                | Code size is small
An instruction executes in a single clock cycle   | An instruction takes more than one clock cycle
An instruction fits in one word                   | Instructions can be larger than the size of one word

11. Instruction Pipelining :

 An instruction pipeline receives sequential instructions from memory while earlier instructions are being executed in other pipeline segments.
 Pipeline processing can happen not only in the data stream but also in the instruction stream. To perform tasks such as fetching, decoding and execution of instructions, most digital computers with complicated instructions require an instruction pipeline.
 In general, each and every instruction must be processed by the computer in the following order:
1. Fetching the instruction from memory
2. Decoding the obtained instruction
3. Calculating the effective address
4. Fetching the operands from the given memory
5. Execution of the instruction
6. Storing the result in a proper place
 In a four-segment instruction pipeline, the instruction cycle is divided into four parts:

Segment 1
The implementation of the instruction fetch segment can be done using the FIFO or first-in,
first-out buffer.

Segment 2
In the second segment, the memory instruction is decoded, and the effective address is then
determined in a separate arithmetic circuit.

Segment 3
In the third segment, some operands would be fetched from memory.

Segment 4
The instructions would finally be executed in the very last segment of a pipeline organisation.
12. RISC 5 stages pipeline : In the early days of computer hardware, Reduced Instruction Set Computer Central Processing Units (RISC CPUs) were designed to execute one instruction per cycle, using five stages in total. Those stages are Fetch, Decode, Execute, Memory, and Write-back. The simplicity of the operations performed allows every instruction to be completed in one processor cycle.

Fetch: In the Fetch stage, the instruction is fetched from memory.

Decode: During the Decode stage, we decode the instruction and fetch the source operands.

Execute: During the Execute stage, the computer performs the operation specified by the instruction.

Memory: If there is any data that needs to be accessed, it is done in the Memory stage.

Write: If we need to store the result in the destination location, it is done during the write-back stage.
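
A simple timing illustration (added here; the instruction labels I1-I3 are arbitrary) of how three instructions overlap in the five stages:

Cycle:  1   2   3   4   5   6   7
I1:     F   D   E   M   W
I2:         F   D   E   M   W
I3:             F   D   E   M   W

Once the pipeline is full, one instruction completes every cycle, even though each individual instruction still takes five cycles.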

13. Pipeline Hazards : Pipeline hazards are conditions that can occur in a pipelined machine
that impede the execution of a subsequent instruction in a particular cycle for a variety of
reasons.

Types:

i). Structural Hazards:

 Hardware resource conflicts among the instructions in the pipeline cause structural
hazards.
 Memory, a GPR Register, or an ALU might all be used as resources here.
 When more than one instruction in the pipe requires access to the very same resource
in the same clock cycle, a resource conflict is said to arise.

ii). Data Hazards:

 Data hazards in pipelining emerge when the execution of one instruction is dependent
on the results of another instruction that is still being processed in the pipeline.
 The order of the READ or WRITE operations on the register is used to classify data hazards into three groups.
iii). Control Hazards:

 Branch hazards are caused by branch instructions and are known as control hazards in
computer architecture.
 The flow of program/instruction execution is controlled by branch instructions.
 Remember that conditional statements are used in higher-level languages for iterative
loops and condition testing (correlate with while, for, and if case statements). These are
converted into one of the BRANCH instruction variations.
 As a result, when the decision to execute one instruction is reliant on the result of
another instruction, such as a conditional branch, which examines the condition’s
consequent value, a conditional hazard develops.

14. Operand Forwarding :

 To minimize data dependency stalls in the pipeline, operand forwarding is used.


 In operand forwarding, we use the interface registers present between the stages to
hold intermediate output so that dependent instruction can access new value from
the interface register directly.
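
For example (an illustrative two-instruction sequence, not taken from the notes): if ADD R1, R2, R3 is immediately followed by SUB R4, R1, R5, the ADD's ALU result is forwarded from the EX/MEM interface register directly to the SUB's ALU input, so the SUB does not have to wait for R1 to be written back to the register file.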

15. Branch Prediction Techniques :

 Branch prediction is a technique used to speed up the execution of instructions on processors that use pipelining.
 A branch predictor guesses the outcome (taken or not taken) of a conditional branch before the branch is actually resolved, so the pipeline can keep fetching along the predicted path.
 Branch prediction is implemented in CPU logic with a branch predictor.
 When predictions are correct, the pipeline does not sit idle waiting for branch outcomes, so the processor can work much more efficiently.

Types of Branch Prediction Techniques:

i). Static Branch Prediction Technique : In the Static branch prediction technique, the underlying hardware assumes either that the branch is always not taken or that it is always taken.

ii). Dynamic Branch Prediction Technique : In the Dynamic branch prediction technique, the prediction made by the underlying hardware is not fixed; rather, it changes dynamically based on the observed branch behaviour. This technique has higher accuracy than the static technique.
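
One common dynamic scheme (added as an illustrative sketch in C; the notes do not prescribe a specific predictor) is a 2-bit saturating counter kept per branch:

#include <stdio.h>

/* 2-bit saturating counter: states 0,1 predict not-taken; 2,3 predict taken. */
typedef struct { unsigned counter; } predictor_t;

int predict(const predictor_t *p) { return p->counter >= 2; }

void update(predictor_t *p, int taken)
{
    if (taken  && p->counter < 3) p->counter++;   /* strengthen "taken"     */
    if (!taken && p->counter > 0) p->counter--;   /* strengthen "not taken" */
}

int main(void)
{
    predictor_t p = { 2 };                 /* start in "weakly taken"       */
    int outcomes[] = { 1, 1, 0, 1, 1 };    /* actual branch behaviour       */
    for (int i = 0; i < 5; i++) {
        printf("predict %d, actual %d\n", predict(&p), outcomes[i]);
        update(&p, outcomes[i]);
    }
    return 0;
}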

16. Pipeline Scheduling :

 Pipeline scheduling refers to the act of automating parts or all of a data pipeline’s
components at fixed times, dates or intervals.
 Pipeline scheduling is not to be confused with data streaming which involves a constant,
real-time feed of data from one or more sources that passes through the processes
specified in the pipeline.
 Data pipeline tools make pipeline scheduling easy.

17. Loop Unrolling :

 Loop unrolling is a technique used to increase the number of instructions executed between executions of the loop branch logic.
 This occurs by manually adding the necessary code for the loop to occur multiple times
within the loop body and then updating the conditions and counters accordingly.
 This reduces the number of times the loop branch logic is executed.
 Loop unrolling is a well-known loop transformation.
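
A minimal C sketch (the summing loop and the unroll factor of 4 are illustrative choices, not from the notes):

#include <stddef.h>

/* Sum an array with the loop unrolled by a factor of 4, so the
   loop-branch logic executes roughly n/4 times instead of n times. */
long sum_unrolled(const int *a, size_t n)
{
    long sum = 0;
    size_t i;
    for (i = 0; i + 3 < n; i += 4) {
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    for (; i < n; i++)          /* leftover iterations */
        sum += a[i];
    return sum;
}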

Advantages:
 Increases program efficiency.
 Reduces loop overhead.
 If statements in loop are not dependent on each other, they can be executed in parallel.

Disadvantages:

 Increased program code size, which can be undesirable.


 Possible increased usage of register in a single iteration to store temporary variables which
may reduce performance.
 Apart from very small and simple codes, unrolled loops that contain branches are even
slower than recursions.

18. Dynamic Scheduling :

 Dynamic Scheduling is a technique in which the hardware rearranges the instruction execution to reduce the stalls, while maintaining data flow and exception behavior.
 The dynamic scheduler maintains three data structures: the reservation stations, a register result data structure that keeps track of the instruction that will modify each register, and an instruction status data structure.
 The three steps in a dynamic scheduler are- Issue, Execute and Write Result.

Advantages:
 It handles cases when dependences are unknown at compile time
 It simplifies the compiler
 It allows code compiled for one pipeline to run efficiently on a different pipeline

Hardware speculation, a technique with significant performance advantages, builds
on dynamic scheduling.
19. Hardware based Speculation :
 Hardware-based speculation follows the predicted flow of data values to choose when to
execute instructions.
 This method of executing programs is essentially a data-flow execution: operations execute
as soon as their operands are available.
 Hardware-based speculation combines three key ideas:
 Dynamic branch prediction to choose which instructions to execute,
 Speculation to allow the execution of instructions before the control dependences are
resolved and
 Dynamic scheduling to deal with the scheduling of different combinations of basic blocks.

Advantages:
 Works on legacy code without recompilation
 No "fix-up" code is required
 Maintains precise exceptions, even with speculation.
 Hardware speculation is better because dynamic branch prediction can be
better than static, especially in integer programs.

20. Tomasulo’s Approach :


 Tomasulo's algorithm is a computer architecture hardware algorithm for dynamic
scheduling of instructions that allows out-of-order execution and enables more efficient
use of multiple execution units.
 The major innovations of Tomasulo's algorithm include register renaming in hardware, reservation stations for all execution units, and a common data bus (CDB) on which computed values are broadcast to all reservation stations that may need them.
 The three stages listed below are the stages through which each instruction passes from
the time it is issued to the time its execution is complete.
 Stage 1: Issue
In the issue stage, instructions are issued for execution if all operands and reservation stations
are ready or else they are stalled. Registers are renamed in this step, eliminating WAR and
WAW hazards.
 Stage 2: Execute
In the execute stage, the instruction operations are carried out. Instructions are delayed in this
step until all of their operands are available, eliminating RAW hazards. Program correctness is
maintained through effective address calculation to prevent hazards through memory.
 Stage 3: Write result
In the write Result stage, ALU operations results are written back to registers and store
operations are written back to memory.
21. VLIW(Very Long Instruction Word) :
 The processors in this architecture have multiple functional units and fetch Very Long Instruction Words from the instruction cache.
 Multiple independent operations are grouped together in a single VLIW Instruction. They
are initialized in the same clock cycle.
 Each operation is assigned an independent functional unit.
 All the functional units share a common register file.
 Instruction words are typically 64-1024 bits long, depending on the number of execution units and the code length required to control each unit.
 Instruction scheduling and parallel dispatch of the word is done statically by the compiler.
 The compiler checks for dependencies before scheduling parallel execution of the
instructions.

Advantages :
 Reduces hardware complexity.
 Reduces power consumption because of reduction of hardware complexity.
 Since compiler takes care of data dependency check, decoding, instruction issues, it
becomes a lot simpler.
 Increases potential clock rate.
 Functional units are assigned operations by the compiler according to their position in the instruction packet.
Disadvantages :
 Complex compilers are required which are hard to design.
 Increased program code size.
 Larger memory bandwidth and register-file bandwidth.
 Unscheduled events, for example a cache miss could lead to a stall which will stall the
entire processor.
 In case of un-filled opcodes in a VLIW, there is waste of memory space and instruction
bandwidth.
22. Multithreading :
 Multithreading is a function of the CPU that permits multiple threads to run
independently while sharing the same process resources.
 A thread is an independent sequence of instructions that can run within the same parent process as other threads.
 Multithreading allows many parts of a program to run simultaneously.
 These parts are referred to as threads, and they are lightweight processes that are
available within the process.
 As a result, multithreading increases CPU utilization through multitasking. In
multithreading, a computer may execute and process multiple tasks simultaneously.
 Multithreading needs a detailed understanding of these two terms: process and thread.
A process is a running program, and a process can also be subdivided into independent
units called threads.
Advantages
a. Responsive
b. Resource sharing
c. Economy
d. Scalability
e. Better communication
f. Utilization of multiprocessor architecture
g. Minimized system resource usage
Disadvantages
a. It needs more careful synchronization.
b. It can consume a large amount of memory for the stacks of blocked threads.
c. It needs thread support from the operating system.
d. If a parent process has several threads for proper process functioning, the child
processes should also be multithreaded because they may be required.
e. It imposes context switching overhead.
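
A minimal POSIX-threads sketch in C (illustrative; the thread count and the work each thread does are arbitrary choices, not from the notes):

#include <pthread.h>
#include <stdio.h>

/* Each thread runs this function independently while sharing the
   parent process's address space. */
static void *worker(void *arg)
{
    int id = *(int *)arg;
    printf("thread %d running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[4];
    int ids[4];

    for (int i = 0; i < 4; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);   /* wait for all threads to finish */
    return 0;
}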

23. Types of Multithreading :


i). Fine-grained :-
 In fine-grained multithreading, the threads are executed in a round-robin fashion in consecutive cycles.
 A multithreading mechanism in which switching among threads happens every cycle, regardless of whether a thread's instruction causes a cache miss.
 Requires more threads to keep the CPU busy.
 It is more efficient than coarse-grained multithreading.

ii). Coarse grained :-

 In coarse-grained multithreading, a thread keeps issuing instructions until it is forced to stop issuing.
 This event is called a stall. When a stall occurs, the next thread starts issuing instructions. At this point, a cycle is lost due to the thread switch.
 A multithreading mechanism in which the switch only happens when the thread in execution causes a stall, thus wasting a clock cycle.
 It is less efficient.
 Requires fewer threads to keep the CPU busy.

24. Superscalar :

 It executes multiple independent instructions in parallel.


 Applicable to both RISC & CISC, but usually in RISC.
 In superscalar multiple independent instruction pipelines are used.
 A superscalar processor typically fetches multiple instructions at a time and then attempts to find nearby instructions that are independent of one another and can therefore be executed in parallel.

25. Super pipelining :

 It is the breaking of stages in an attempt to shorten the clock period and thus enhancing
the instruction throughput by keeping more and more instructions in flight at a time.
 It performs only one pipeline stage per clock cycle.
 The more pipe stages there are, the faster the pipeline is because each stage is then
shorter.
 Ideally, a pipeline with five stages should be five times faster than a non-pipelined
processor.

26. Hyper Threading :

 Hyper-threading is Intel's proprietary simultaneous multithreading (SMT)


implementation used to improve parallelization of computations (doing multiple tasks at
once) performed on x86 microprocessors.
 It was introduced on Xeon server processors in February 2002 and
on Pentium 4 desktop processors in November 2002.
 Hyper-Threading Technology is a form of simultaneous multithreading technology
introduced by Intel, while the concept behind the technology has been patented by Sun
Microsystems.
 Architecturally, a processor with Hyper-Threading Technology consists of two logical
processors per core, each of which has its own processor architectural state.
 Each logical processor can be individually halted, interrupted or directed to execute a
specified thread, independently from the other logical processor sharing the same
physical core.
 Hyper-threading works by duplicating certain sections of the processor—those that
store the architectural state—but not duplicating the main execution resources. This
allows a hyper-threading processor to appear as the usual "physical" processor and an
extra "logical" processor to the host operating system, allowing the operating system to
schedule two threads or processes simultaneously and appropriately.

27. Vector Architecture :

 Vector architecture includes instruction set extensions to an ISA to support vector operations,
which are deeply pipelined.
 Vector operations are on vector registers, which are fixed-length banks of registers. Data is transferred between a vector register and the memory system.
 Each vector operation takes vector registers or a vector register and a scalar value as input.
 Vector architecture can only be effective on applications that have significant data-level parallelism (DLP). Vector processing greatly reduces the dynamic instruction bandwidth. Generally, execution time is reduced due to:

(1) Eliminating loop overhead

(2) Stalls only occurring on the first vector element rather than on each vector element,

(3) Performing vector operations in parallel.


28. GPU :

 GPU stands for Graphics Processing Unit.


 GPUs are also known as video cards or graphics cards.
 In order to display pictures, videos, and 2D or 3D animations, each device uses a GPU.
 A GPU performs fast calculations of arithmetic and frees up the CPU to do different
things.
 Originally, GPUs were designed to accelerate 3D graphics rendering.
 It enables graphics programmers to use shadowing techniques and advanced lighting to create more exciting visual effects and more realistic scenes.
 GPUs are generally used to drive high-quality gaming experiences, creating life-like
super-slick rendering and graphic design.
 However, there are also many business applications, which depend on strong graphics
chips.
 Today, the GPU is more programmable than ever before, giving them the potential to
speed up a wide variety of applications that go way beyond conventional graphics
rendering.
29. CUDA Programming :

 CUDA stands for Compute Unified Device Architecture.


 It is an extension of C/C++ programming. CUDA is a programming language that uses
the Graphical Processing Unit (GPU).
 It is a parallel computing platform and an API (Application Programming Interface)
model.

Why do we need CUDA?

 GPUs are designed to perform high-speed parallel computations to display graphics such
as games.
 Use available CUDA resources. More than 100 million GPUs are already deployed.
 It provides 30-100x speed-up over other microprocessors for some applications.
 GPUs have very small Arithmetic Logic Units (ALUs) compared to the somewhat larger
CPUs. This allows for many parallel calculations, such as calculating the color for each pixel
on the screen, etc.

How CUDA works?

 GPUs run one kernel (a group of tasks) at a time.


 Each kernel consists of blocks, which are independent groups of ALUs.
 Each block contains threads, which are levels of computation.
 The threads in each block typically work together to calculate a value.
 Threads in the same block can share memory.
 In CUDA, sending information from the CPU to the GPU is often the most costly part of the computation.
 For each thread, local memory is the fastest, followed by shared memory, with global, static, and texture memory being the slowest.
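
A minimal CUDA sketch (added for illustration; the vector-addition kernel, array size, and launch configuration are arbitrary example choices, not from the notes):

#include <stdio.h>
#include <cuda_runtime.h>

/* Kernel: each thread adds one pair of elements. */
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_a[1024], h_b[1024], h_c[1024];
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    /* Copy inputs from CPU (host) memory to GPU (device) memory. */
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* Launch 4 blocks of 256 threads each (4 * 256 = 1024 threads). */
    vec_add<<<4, 256>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", h_c[10]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}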

30. Memory Hierarchy :


 Memory Hierarchy, in Computer System Design, is an enhancement that helps in organising the memory so that it can actually minimise the access time.
 The design of the Memory Hierarchy is based on a behaviour of programs known as locality of reference.

Memory Hierarchy Design


This Hierarchy Design of Memory is divided into two main types. They are:

i. External or Secondary Memory


It consists of Magnetic Tape, Optical Disk, Magnetic Disk, i.e. it includes peripheral
storage devices that are accessible by the system’s processor via I/O Module.
ii. Internal Memory or Primary Memory
It consists of CPU registers, Cache Memory, and Main Memory. It is accessible directly
by the processor.

Design of Memory Hierarchy


In computers, the memory hierarchy primarily includes the following:

1. Registers
The register is usually an SRAM or static RAM in the computer processor that is used to hold the
data word that is typically 64 bits or 128 bits. A majority of the processors make use of a status
word register and an accumulator. The accumulator is primarily used to store the data in the
form of mathematical operations, and the status word register is primarily used for decision
making.

2. Cache Memory
The cache holds a chunk of the information that is used frequently from the main memory. We can also find cache memory inside the processor. If the processor has a single core, it will rarely have multiple cache levels. Present multi-core processors typically have two cache levels private to each core, plus a third level that is shared among the cores.

3. Main Memory
In a computer, the main memory is the memory unit that communicates directly with the CPU. It is the primary storage unit of a computer system. The main memory is a comparatively fast and large memory that is used for storing information throughout the computer's operations. This type of memory is made up of ROM as well as RAM.

4. Magnetic Disks
In a computer, magnetic disks are circular plates fabricated from metal or plastic and coated with a magnetised material. Both faces of a disk are frequently used, and several disks can be stacked on a single spindle, with read/write heads available for every surface. All the disks rotate together at high speed.

5. Magnetic Tape
Magnetic tape is a magnetic recording medium made of a slender magnetizable overlay covering an extended, thin strip of plastic film. It is used mainly to back up huge chunks of data. When a computer needs to access a tape, it first mounts it to access the information; once the information has been accessed, the tape is unmounted. Access time is much slower with magnetic tape, and it can take a few minutes to access a strip.
31. Locality of Reference :
 Locality of reference refers to a phenomenon in which a computer program tends to access the same set of memory locations over a particular time period.
 In other words, Locality of Reference refers to the tendency of the computer program
to access instructions whose addresses are near one another.
 The property of locality of reference is mainly shown by loops and subroutine calls in a
program.

Cache Operation:
It is based on the principle of locality of reference. There are two ways in which data or instructions are fetched from main memory and stored in cache memory. These two ways are the following:
1. Temporal Locality –
Temporal locality means current data or instruction that is being fetched may be needed
soon. So we should store that data or instruction in the cache memory so that we can
avoid again searching in main memory for the same data.
2. Spatial Locality –
Spatial locality means instruction or data near to the current memory location that is being
fetched, may be needed soon in the near future. This is slightly different from the temporal
locality. Here we are talking about nearly located memory locations while in temporal locality
we were talking about the actual memory location that was being fetched.
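
A short C sketch (the matrix traversal is an illustrative example, not from the notes) showing both kinds of locality:

#include <stdio.h>

#define N 100

/* Row-major traversal of a matrix illustrates both kinds of locality. */
long sum_matrix(int a[N][N])
{
    long sum = 0;                    /* 'sum', 'i' and 'j' are reused on every
                                        iteration: temporal locality          */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];          /* consecutive elements of a row are
                                        adjacent in memory: spatial locality  */
    return sum;
}

int main(void)
{
    static int a[N][N];              /* zero-initialised example matrix */
    printf("%ld\n", sum_matrix(a));
    return 0;
}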

32. Cache Memory :


 Cache Memory is a special very high-speed memory.
 It is used to speed up and synchronize with high-speed CPU.
 Cache memory is costlier than main memory or disk memory but more economical
than CPU registers.
 Cache memory is an extremely fast memory type that acts as a buffer between RAM
and the CPU.
 It holds frequently requested data and instructions so that they are immediately
available to the CPU when needed.
 Cache memory is used to reduce the average time to access data from the Main
memory.
 The cache is a smaller and faster memory that stores copies of the data from
frequently used main memory locations.
 There are various independent caches in a CPU, which store instructions and data.

Cache Performance: When the processor needs to read or write a location in main memory, it
first checks for a corresponding entry in the cache.
 If the processor finds that the memory location is in the cache, a cache hit has occurred
and data is read from the cache.
 If the processor does not find the memory location in the cache, a cache miss has
occurred. For a cache miss, the cache allocates a new entry and copies in data from main
memory, then the request is fulfilled from the contents of the cache.
The performance of cache memory is frequently measured in terms of a quantity called Hit
ratio.
Hit ratio = hit / (hit + miss) = no. of hits/total accesses
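For example (numbers added for illustration): if 950 out of 1,000 accesses hit in the cache, the hit ratio is 950 / 1000 = 0.95 and the miss ratio is 0.05.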
We can improve Cache performance using higher cache block size, and higher associativity,
reduce miss rate, reduce miss penalty, and reduce the time to hit in the cache.
Cache Mapping: There are three different types of mapping used for the purpose of cache
memory:-
A. Direct Mapping
The simplest technique, known as direct mapping, maps each block of main memory into only one possible cache line. In direct mapping, each memory block is assigned to a specific line in the cache. If a line is already occupied by a memory block when a new block needs to be loaded, the old block is replaced. The address is split into two parts: an index field and a tag field. The tag is stored in the cache along with the data, while the index selects the cache line to check.
B. Associative Mapping
In this type of mapping, the associative memory is used to store content and addresses of the
memory word. Any block can go into any line of the cache. This means that the word id bits
are used to identify which word in the block is needed, but the tag becomes all of the
remaining bits. This enables the placement of any word at any place in the cache memory. It
is considered to be the fastest and the most flexible mapping form. In associative mapping
the index bits are zero.
C. Set-associative Mapping
This form of mapping is an enhanced form of direct mapping where the drawbacks of direct
mapping are removed. Set associative addresses the problem of possible thrashing in the
direct mapping method. Set-associative mapping allows that each word that is present in the
cache can have two or more words in the main memory for the same index address. Set
associative cache mapping combines the best of direct and associative cache mapping
techniques. In set associative mapping the index bits are given by the set offset bits.
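
A small C sketch (illustrative; the cache geometry chosen below, 16-byte blocks and 256 lines, is just an example) showing how a direct-mapped cache splits an address into tag, index, and offset:

#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE 16u    /* bytes per block -> 4 offset bits */
#define NUM_LINES  256u   /* lines in cache  -> 8 index bits  */

int main(void)
{
    uint32_t addr = 0x12345678;

    uint32_t offset = addr % BLOCK_SIZE;
    uint32_t index  = (addr / BLOCK_SIZE) % NUM_LINES;
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_LINES);

    /* In direct mapping, the block can only go into line 'index';
       the 'tag' stored there identifies which memory block is present. */
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}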
33. Write Strategy :
a. Write through :- In write-through, data is simultaneously updated to cache and
memory. This process is simpler and more reliable. This is used when there are no
frequent writes to the cache(The number of write operations is less). It helps in data
recovery (In case of a power outage or system failure). A data write will experience
latency (delay) as we have to write to two locations (both Memory and Cache). It
Solves the inconsistency problem.
b. Write back :- The data is updated only in the cache and updated into the memory at a later time. Data is updated in the memory only when the cache line is ready to be replaced (cache line replacement uses policies such as Belady's optimal algorithm, Least Recently Used, FIFO, LIFO, and others depending on the application). Write Back is also known as Write Deferred.
34. Cache Misses : A cache miss is an event in which a system or application makes a request
to retrieve data from a cache, but that specific data is not currently in cache memory. Cache
Miss occurs when data is not available in the Cache Memory. When the CPU detects a miss, it
processes the miss by fetching requested data from main memory.
Types of Cache misses :
The various types of cache misses are as follows.
1. Compulsory Miss –
It is also known as cold start misses or first references misses. These misses occur when
the first access to a block happens. Block must be brought into the cache.
2. Capacity Miss –
These misses occur when the program working set is much larger than the cache
capacity. Since Cache cannot contain all blocks needed for program execution, so
cache discards these blocks.

3. Conflict Miss –
It is also known as collision misses or interference misses. These misses occur when
several blocks are mapped to the same set or block frame. These misses occur in the
set associative or direct mapped block placement strategies.

4. Coherence Miss –
It is also known as an invalidation miss. These misses occur when another external processor or an I/O device updates memory.
35. Cache Optimization :
 The cache is a part of the hierarchy present next to the CPU.
 It is used in storing the frequently used data and instructions. It is generally very
costly i.e., the larger the cache memory, the higher the cost. Hence, it is used in
smaller capacities to minimize costs.
 To make up for its less capacity, it must be ensured that it is used to its full
potential.
 Optimization of cache performance ensures that it is utilized in a very efficient
manner to its full potential.
Cache Optimization Technique :-
1. Larger block size: If the block size is increased, spatial locality can be exploited in an efficient way, which results in a reduction of miss rates. But it may result in an increase in miss penalties. The size can't be extended beyond a certain point, since past that point the miss rate starts to rise again: a larger block size implies a smaller number of blocks in the cache, which results in increased conflict misses.
2. Larger cache size: Increasing the cache size results in a decrease of capacity
misses, thereby decreasing the miss rate. But, they increase the hit time and
power consumption.

3. Higher associativity: Higher associativity results in a decrease in conflict misses.


Thereby, it helps in reducing the miss rate.

36. Methods to reduce Miss Penalty :

1. Multi-Level Caches: If there is only one level of cache, then we need to decide between
keeping the cache size small in order to reduce the hit time or making it larger so that the
miss rate can be reduced. Both of them can be achieved simultaneously by introducing cache
at the next levels.
Suppose, if a two-level cache is considered:
 The first level cache is smaller in size and has faster clock cycles comparable to that of the
CPU.
 Second-level cache is larger than the first-level cache but has faster clock cycles compared
to that of main memory. This large size helps in avoiding much access going to the main
memory. Thereby, it also helps in reducing the miss penalty.
2. Critical word first and Early Restart: Generally, the processor requires one word of the
block at a time. So, there is no need of waiting until the full block is loaded before sending the
requested word. This is achieved using:
 The critical word first: It is also called a requested word first. In this method, the exact
word required is requested from the memory and as soon as it arrives, it is sent to the
processor. In this way, two things are achieved, the processor continues execution, and the
other words in the block are read at the same time.
 Early Restart: In this method, the words are fetched in the normal order. When the
requested word arrives, it is immediately sent to the processor which continues execution
with the requested word.

37. Advanced Cache Optimization :

 Way Prediction to Reduce Hit Time : In way prediction, extra bits are kept in the cache to
predict the way, or block within the set of the next cache access. This prediction means
the multiplexor is set early to select the desired block, and only a single tag comparison is
performed that clock cycle in parallel with reading the cache data. A miss results in
checking the other blocks for matches in the next clock cycle.

 Pipelined Cache Access to Increase Cache Bandwidth : The critical timing path in a cache
hit is the three-step process of addressing the tag memory using the index portion of the
address, comparing the read tag value to the address, and setting the multiplexor to
choose the correct data item if the cache is set associative. This optimization is simply to
pipeline cache access so that the effective latency of a first-level cache hit can be
multiple clock cycles, giving fast clock cycle time and high bandwidth but slow hits.
 Nonblocking Caches to Increase Cache Bandwidth : For pipelined computers that allow out-of-order execution, the processor need not stall on a data cache miss. A nonblocking cache or lockup-free cache escalates the potential benefits by allowing the data cache to continue to supply cache hits during a miss. This "hit under miss" optimization reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the processor.

 Multi-banked Caches to Increase Cache Bandwidth : Rather than treat the cache as a single monolithic block, we can divide it into independent banks (as done in DRAM) that can support simultaneous accesses. To spread the accesses across all the banks, a mapping of addresses to banks that works well is to spread the addresses of the block sequentially across the banks.

 Critical Word First and Early Restart to Reduce Miss Penalty : This technique is based on
the observation that the processor normally needs just one word of the block at a time.
Critical word first: Request the missed word first from memory and send it to the
processor as soon as it arrives; let the processor continue execution while filling the rest
of the words in the block.
Early restart: Fetch the words in normal order, but as soon as the requested
word of the block arrives send it to the processor and let the processor continue
execution.

38. Compiler Optimization : The compiler can easily reorganize the code, without affecting
the correctness of the program. The compiler can profile code, identify conflicting sequences
and do the reorganization accordingly. Reordering the instructions reduced misses by 50% for a
2-KB direct-mapped instruction cache with 4-byte blocks, and by 75% in an 8-KB cache. Another
code optimization aims for better efficiency from long cache blocks. Aligning basic blocks so
that the entry point is at the beginning of a cache block decreases the chance of a cache miss
for sequential code. This improves both spatial and temporal locality of reference.

39. Write Buffer Merging : This is an optimization used to improve the efficiency of write
buffers. Normally, if the write buffer is empty, the data and the full address will be written in
the buffer. The CPU continues working, while the buffer prepares to write the word to the
memory. Now, if the buffer contains other modified blocks, the addresses can be checked to
see if the address of this new data matches the address of a valid write buffer entry. If so, the
new data can be combined with the already available entry, called write merging.
40. NoC :
 A network on a chip or network-on-chip is a network-based communications
subsystem on an integrated circuit ("microchip"), most typically between modules in
a system on a chip (SoC).
 The modules on the IC are typically semiconductor IP cores schematizing various
functions of the computer system, and are designed to be modular in the sense
of network science.
 The network on chip is a router-based packet switching network between SoC modules.
 NoC technology applies the theory and methods of computer networking to on-
chip communication and brings notable improvements over
conventional bus and crossbar communication architectures.
 Networks-on-chip come in many network topologies, many of which are still
experimental as of 2018.
 A common NoC used in contemporary personal computers is a graphics processing
unit (GPU) — commonly used in computer graphics, video
gaming and accelerating artificial intelligence.

41. Topology :
 The topology is the first fundamental aspect of NoC design, and it has a profound effect
on the overall network cost and performance.
 The topology determines the physical layout and connections between nodes and
channels.
 Also, the number of hops a message traverses and each hop's channel length depend on the topology.
 The topology significantly influences the latency and power consumption.
 Since the topology determines the number of alternative paths between nodes, it
affects the network traffic distribution, and hence the network bandwidth and
performance achieved.

42. Routing :
 Data routing networks are used for inter PE data exchange.
 Data routing networks can be static or dynamic.
 In a multicomputer network, data routing is achieved by messages among multiple computer nodes.
 Routing network reduces the time required for data exchange and thus system performance is
enhanced.
 Commonly used data routing functions are shifting, rotation, permutations, broadcast, multicast,
personalized communication, shuffle, etc.
 Routing is the process of selecting a path for traffic in a network or between or across
multiple networks.
 Broadly, routing is performed in many types of networks, including circuit-switched
networks, such as the public switched telephone network (PSTN), and computer
networks, such as the Internet.

43. Flow Control :


 The control flow is the order in which the computer executes statements in a script.
 Flow control is a design issue at the Data Link Layer.
 It is a technique that generally observes the proper flow of data from sender to receiver.
 It is very essential because the sender may transmit data or information at a very fast rate which the receiver cannot receive and process.
 This can happen if the receiver has a very high load of traffic as compared to the sender, or if the receiver has less processing power as compared to the sender.
 Flow control is basically a technique that gives permission to two stations that are working and processing at different speeds to communicate with one another.

Approaches to Flow Control : Flow Control is classified into two categories:


 Feedback-based Flow Control : In this control technique, the sender simply transmits data or frames to the receiver, and the receiver then transmits information back to the sender, allowing the sender to transmit more data or telling the sender how the receiver is doing. This simply means that the sender transmits data or frames only after it has received acknowledgements from the receiver.
 Rate-based Flow Control : In this control technique, when the sender transfers data at a faster speed than the receiver is able to receive it, a built-in mechanism in the protocol limits or restricts the overall rate at which data or information is transmitted by the sender, without any feedback or acknowledgement from the receiver.
44. Input and Output Strategies : The instructions and data to be processed have to be entered into the computer through some input medium, and the results have to be provided to the user through an output medium. The Input/Output structure of the computer provides a method to communicate with the external world and furnishes the operating system with the information it requires to handle I/O activity efficiently.

Input-Output Configuration
As an example configuration, consider a keyboard as the input device and a printer as the output device. The keyboard and printer are the terminals; they send and receive data serially. The data is alphanumeric and 8 bits in size. Input from the keyboard is transferred to the input register INPR. The information for the printer is saved serially in the output register OUTR.

The I/O registers communicate serially with their interfaces (keyboard, printer) and in parallel with the AC.
The transmitter interface receives data from the keyboard and transfers it to INPR.
The receiver interface accepts the data and delivers it to the printer.
The INPR holds the 8-bit alphanumeric input data.
FGI is a 1-bit input flag, which is a flip-flop. When the input device has new information, the flip-flop is set to 1. It is cleared to 0 when the information is accepted by the computer.
The output device sets FGO to 1 after receiving, decoding, and printing the information. FGO in the 0 state denotes that the device is still printing information.
45. Crossbar Switches :
 A Crossbar Switch system consists of a number of crosspoints that are placed at the intersections between processor bus paths and memory module paths.
 Each crosspoint contains a switch that determines the path from a processor to a memory module.
 Each switch point has control logic to set up the transfer path between a memory module and a processor.
 It examines the address placed on the bus to determine whether its particular module is being addressed.
 In addition, it resolves multiple requests for access to the same memory module on a predetermined priority basis.
