ACA - All Unit
MIPS: Millions of Instructions Per Second, a measure of execution rate. MIPS = Instruction count / (Execution time × 10^6) = Clock rate / (CPI × 10^6).
6. IPC: Instructions per cycle (IPC) is the average number of instructions executed per clock cycle; it is the multiplicative inverse of CPI (IPC = 1 / CPI).
7. CPI:
Cycles per instruction (aka clock cycles per instruction, clocks per instruction, or CPI) is
one aspect of a processor's performance: the average number of clock
cycles per instruction for a program or program fragment.
It is the multiplicative inverse of instructions per cycle.
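A small worked example with made-up counts (for illustration only): a program fragment that executes 2,000,000 instructions in 5,000,000 clock cycles has CPI = 5,000,000 / 2,000,000 = 2.5 and IPC = 1 / 2.5 = 0.4. The same arithmetic as a C sketch:

#include <stdio.h>

int main(void) {
    /* Hypothetical counts for a program fragment (illustration only). */
    double cycles = 5000000.0;
    double instructions = 2000000.0;

    double cpi = cycles / instructions;   /* cycles per instruction */
    double ipc = 1.0 / cpi;               /* instructions per cycle */

    printf("CPI = %.2f, IPC = %.2f\n", cpi, ipc);  /* CPI = 2.50, IPC = 0.40 */
    return 0;
}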
8. Amdahl’s Law: the overall speedup obtained from an enhancement is limited by the fraction of execution time during which the enhancement can be used.
i. Fraction enhanced = the fraction of the original execution time that can use the enhancement.
ii. Speedup enhanced = the speedup obtained on that fraction.
Overall speedup = 1 / ((1 - Fraction enhanced) + Fraction enhanced / Speedup enhanced).
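A minimal sketch of the formula with made-up numbers (assume 60% of the run can be enhanced by a factor of 10):

#include <stdio.h>

/* Amdahl's Law: overall speedup given the enhanced fraction and its speedup. */
static double amdahl(double fraction_enhanced, double speedup_enhanced) {
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced);
}

int main(void) {
    /* Hypothetical: 60% of execution time can be sped up 10x. */
    printf("Overall speedup = %.3f\n", amdahl(0.6, 10.0));  /* ~2.174 */
    return 0;
}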
9. Instruction Set: RISC vs CISC
RISC performs arithmetic operations only register to register; CISC can operate register to register, register to memory, or memory to memory.
A RISC instruction executes in a single clock cycle; a CISC instruction can take more than one clock cycle.
A RISC instruction fits in one word; CISC instructions can be larger than one word.
Pipeline processing can be applied not only to the data stream but also to the instruction stream. Most digital computers with complex instruction sets use an instruction pipeline to overlap tasks such as fetching, decoding and executing instructions.
In general, every instruction is processed by the computer in the following order:
1. Fetching the instruction from memory
2. Decoding the obtained instruction
3. Calculating the effective address
4. Fetching the operands from the given memory
5. Execution of the instruction
6. Storing the result in a proper place
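A purely illustrative sketch of this order on a toy machine; the word size, opcodes, addresses and program below are all invented for the example:

#include <stdio.h>

/* Toy machine: 16-bit words, instruction = 4-bit opcode + 12-bit address. */
enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3 };

int main(void) {
    unsigned short mem[4096] = {0};
    unsigned short pc = 0, acc = 0, running = 1;

    /* Tiny program: load mem[100], add mem[101], store to mem[102], halt. */
    mem[0] = (OP_LOAD  << 12) | 100;
    mem[1] = (OP_ADD   << 12) | 101;
    mem[2] = (OP_STORE << 12) | 102;
    mem[3] = (OP_HALT  << 12);
    mem[100] = 7; mem[101] = 5;

    while (running) {
        unsigned short ir = mem[pc++];          /* 1-2. fetch and decode  */
        unsigned short opcode = ir >> 12;
        unsigned short addr = ir & 0x0FFF;      /* 3. effective address   */
        unsigned short operand = mem[addr];     /* 4. fetch the operand   */
        switch (opcode) {                       /* 5. execute             */
        case OP_LOAD:  acc = operand;       break;
        case OP_ADD:   acc += operand;      break;
        case OP_STORE: mem[addr] = acc;     break;   /* 6. store result   */
        case OP_HALT:  running = 0;         break;
        }
    }
    printf("mem[102] = %u\n", mem[102]);  /* prints 12 */
    return 0;
}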
A four-segment instruction pipeline combines some of these steps. The instruction cycle is divided into four segments:
Segment 1
The instruction fetch segment can be implemented using a FIFO (first-in, first-out) buffer.
Segment 2
In the second segment, the instruction is decoded, and the effective address is then determined in a separate arithmetic circuit.
Segment 3
In the third segment, the operands are fetched from memory.
Segment 4
The instruction is finally executed in the last segment of the pipeline organisation.
12. RISC 5-stage pipeline : Early Reduced Instruction Set Computer (RISC) CPUs were designed to execute one instruction per cycle using a pipeline of five stages: Fetch, Decode, Execute, Memory, and Write-back. The simplicity of the operations performed allows every instruction to complete one stage per processor cycle.
Fetch: the instruction is fetched from memory.
Decode: the instruction is decoded and the source operands are fetched.
Execute: the operation specified by the instruction is performed.
Memory: any data that needs to be accessed is read or written in this stage.
Write-back: if the result must be stored in the destination location, it is written during this stage.
A cycle-by-cycle sketch of how instructions overlap in this pipeline is given below.
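A minimal sketch that prints the classic cycle-by-cycle pipeline diagram for a few independent instructions, assuming no stalls or hazards (the number of instructions is arbitrary):

#include <stdio.h>

int main(void) {
    const char *stages[] = { "IF", "ID", "EX", "MEM", "WB" };
    const int num_stages = 5, num_instr = 4;

    /* Instruction i occupies stage s during cycle i + s (0-based). */
    printf("        ");
    for (int c = 0; c < num_instr + num_stages - 1; c++) printf(" C%-3d", c + 1);
    printf("\n");
    for (int i = 0; i < num_instr; i++) {
        printf("instr %d:", i + 1);
        for (int c = 0; c < num_instr + num_stages - 1; c++) {
            int s = c - i;
            if (s >= 0 && s < num_stages) printf(" %-4s", stages[s]);
            else                          printf("     ");
        }
        printf("\n");
    }
    return 0;
}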
13. Pipeline Hazards : Pipeline hazards are conditions that can occur in a pipelined machine
that impede the execution of a subsequent instruction in a particular cycle for a variety of
reasons.
Types:
i). Structural Hazards:
Hardware resource conflicts among the instructions in the pipeline cause structural hazards.
Memory, a GPR register, or an ALU might all be used as resources here.
A resource conflict is said to arise when more than one instruction in the pipe requires access to the very same resource in the same clock cycle.
ii). Data Hazards:
Data hazards in pipelining emerge when the execution of one instruction depends on the result of another instruction that is still being processed in the pipeline.
The order of the READ and WRITE operations on the register is used to classify data hazards into three groups: RAW (read after write), WAR (write after read), and WAW (write after write); see the sketch after this list.
iii). Control Hazards:
Branch hazards are caused by branch instructions and are known as control hazards in
computer architecture.
The flow of program/instruction execution is controlled by branch instructions.
Remember that conditional statements are used in higher-level languages for iterative
loops and condition testing (correlate with while, for, and if case statements). These are
converted into one of the BRANCH instruction variations.
As a result, a control hazard develops when the decision about which instruction to execute next depends on the result of another instruction, such as a conditional branch that examines the value of its condition.
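A minimal illustration of the three data-hazard groups, using C variables r1..r7 to stand in for registers; the instruction sequences are invented for the example:

#include <stdio.h>

int main(void) {
    /* C variables standing in for registers. */
    int r1, r2 = 2, r3 = 3, r4, r5 = 5, r6, r7;

    /* RAW (read after write) - true dependence:
       i2 reads r1, which i1 writes, so i2 must wait for i1's result. */
    r1 = r2 + r3;      /* i1 */
    r4 = r1 + r5;      /* i2: RAW hazard on r1 */

    /* WAR (write after read) - antidependence:
       i4 must not overwrite r2 before i3 has read it. */
    r6 = r2 + r1;      /* i3 */
    r2 = r5 + r5;      /* i4: WAR hazard on r2 */

    /* WAW (write after write) - output dependence:
       i5 and i6 both write r7; the final value must be i6's. */
    r7 = r1 + r2;      /* i5 */
    r7 = r3 + r4;      /* i6: WAW hazard on r7 */

    printf("%d %d %d %d %d\n", r1, r4, r6, r2, r7);
    return 0;
}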
i). Static Branch Prediction Technique : In the static branch prediction technique, the underlying hardware assumes either that the branch is always not taken or that the branch is always taken.
Pipeline scheduling refers to the act of automating parts or all of a data pipeline’s
components at fixed times, dates or intervals.
Pipeline scheduling is not to be confused with data streaming which involves a constant,
real-time feed of data from one or more sources that passes through the processes
specified in the pipeline.
Data pipeline tools make pipeline scheduling easy.
Advantages:
Increases program efficiency.
Reduces loop overhead.
If the statements in a loop are not dependent on each other, they can be executed in parallel (see the loop-unrolling sketch below).
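A minimal loop-unrolling sketch; the array size and unroll factor are arbitrary. The four statements in the unrolled body are independent of each other, so a pipelined or multiple-issue processor can overlap them:

#include <stdio.h>

#define N 1024   /* assume N is a multiple of 4 for this sketch */

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = i;

    /* Original loop:
       for (int i = 0; i < N; i++) a[i] = b[i] * 2.0;                  */

    /* Unrolled by 4: one quarter of the loop overhead (branch, index
       update), and the four statements are independent of each other. */
    for (int i = 0; i < N; i += 4) {
        a[i]     = b[i]     * 2.0;
        a[i + 1] = b[i + 1] * 2.0;
        a[i + 2] = b[i + 2] * 2.0;
        a[i + 3] = b[i + 3] * 2.0;
    }
    printf("a[5] = %.1f\n", a[5]);  /* 10.0 */
    return 0;
}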
Disadvantages:
Advantages of dynamic scheduling:
It handles cases when dependences are unknown at compile time
It simplifies the compiler
It allows code compiled for one pipeline to run efficiently on a different pipeline
Hardware speculation, a technique with significant performance advantages, builds
on dynamic scheduling.
19. Hardware based Speculation :
Hardware-based speculation follows the predicted flow of data values to choose when to
execute instructions.
This method of executing programs is essentially a data-flow execution: operations execute
as soon as their operands are available.
Hardware-based speculation combines three key ideas:
Dynamic branch prediction to choose which instructions to execute,
Speculation to allow the execution of instructions before the control dependences are
resolved and
Dynamic scheduling to deal with the scheduling of different combinations of basic blocks.
Advantages:
It works on legacy code, with no recompilation required.
No "fix-up" code is required.
It maintains precise exceptions, even with speculation.
Hardware speculation is attractive because dynamic branch prediction can be better than static prediction, especially in integer programs.
Advantages of VLIW:
Reduces hardware complexity.
Reduces power consumption because of the reduction in hardware complexity.
Since the compiler takes care of data dependence checking, decoding and instruction issue, the hardware becomes a lot simpler.
Increases the potential clock rate.
Functional units are assigned to the slots of the instruction packet by the compiler.
Disadvantages :
Complex compilers are required which are hard to design.
Increased program code size.
Larger memory bandwidth and register-file bandwidth.
Unscheduled events, for example a cache miss, can stall the entire processor.
Unfilled operation slots in a VLIW waste memory space and instruction bandwidth.
22. Multithreading :
Multithreading is a function of the CPU that permits multiple threads to run
independently while sharing the same process resources.
A thread is a sequence of instructions that can run within the same parent process alongside other threads.
Multithreading allows many parts of a program to run simultaneously.
These parts are referred to as threads, and they are lightweight processes that are
available within the process.
As a result, multithreading increases CPU utilization through multitasking. In
multithreading, a computer may execute and process multiple tasks simultaneously.
Multithreading needs a detailed understanding of these two terms: process and thread.
A process is a running program, and a process can be subdivided into independent units called threads (a minimal threading sketch is given after the lists below).
Advantages
a. Responsive
b. Resource sharing
c. Economy
d. Scalability
e. Better communication
f. Utilization of multiprocessor architecture
g. Minimized system resource usage
Disadvantages
a. It needs more careful synchronization.
b. It can consume a large amount of stack space when many threads are blocked.
c. It requires operating-system support for threads.
d. If a parent process needs several threads for proper functioning, the child processes should also be multithreaded, because they may be required.
e. It imposes context-switching overhead.
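A minimal sketch using POSIX threads (one possible threading API) of two threads running within one process and sharing its data; the work done by each thread is arbitrary:

#include <pthread.h>
#include <stdio.h>

/* Each thread sums one half of the shared array; each result goes to its
   own slot, so no locking is needed for this simple partitioning. */
#define N 1000
static long data[N];
static long partial[2];

static void *worker(void *arg) {
    long id = (long)arg;                 /* 0 or 1 */
    long sum = 0;
    for (long i = id * (N / 2); i < (id + 1) * (N / 2); i++) sum += data[i];
    partial[id] = sum;
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long i = 0; i < N; i++) data[i] = i;

    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < 2; id++)
        pthread_join(t[id], NULL);       /* wait for both threads */

    printf("sum = %ld\n", partial[0] + partial[1]);  /* 499500 */
    return 0;
}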
24. Superscalar :
A superscalar processor fetches and issues more than one instruction per clock cycle, using multiple pipelines or functional units, thus enhancing instruction throughput by keeping more instructions in flight at a time.
The related idea of superpipelining breaks pipeline stages into shorter stages in an attempt to shorten the clock period; each stage still performs one pipeline step per clock cycle.
The more pipe stages there are, the faster the pipeline can be clocked, because each stage is then shorter.
Ideally, a pipeline with five stages should be five times faster than a non-pipelined processor.
Vector architecture adds instruction-set extensions to an ISA to support vector operations, which are deeply pipelined.
Vector operations work on vector registers, which are fixed-length banks of registers. Data is transferred between a vector register and the memory system.
Each vector operation takes vector registers, or a vector register and a scalar value, as input.
Vector architecture is only effective on applications that have significant data-level parallelism (DLP). Vector processing greatly reduces the dynamic instruction bandwidth. Execution time is generally reduced because (1) a single vector instruction specifies a large amount of work, so control overhead and hazard checks are paid once per vector rather than once per element, (2) stalls occur only on the first vector element rather than on each vector element, and (3) the memory access pattern of a vector is known, so memory latency can be amortized over the whole vector. A small sketch of a data-parallel loop that maps naturally to vector instructions follows.
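The array length and vector register length below are assumed values. On a vector machine each strip of 64 elements would become a handful of vector instructions rather than 64 separate scalar instructions:

#include <stdio.h>

#define N   4096
#define VL  64      /* assumed vector register length */

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Strip-mined DAXPY-style loop: on a vector machine each inner strip
       maps to one vector load, one vector multiply-add, one vector store. */
    double a = 3.0;
    for (int i = 0; i < N; i += VL) {
        for (int j = i; j < i + VL; j++)   /* one "vector instruction"    */
            y[j] = a * x[j] + y[j];        /* per strip, not per element  */
    }
    printf("y[0] = %.1f\n", y[0]);  /* 5.0 */
    return 0;
}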
GPUs are designed to perform high-speed parallel computations, for example to render the graphics of games.
CUDA-capable GPUs are widely available: more than 100 million GPUs are already deployed.
For some applications, GPUs provide a 30-100x speed-up over general-purpose microprocessors.
GPUs contain many small Arithmetic Logic Units (ALUs), in contrast to the few, larger ALUs of a CPU. This allows many calculations to be performed in parallel, such as computing the color of each pixel on the screen.
Memory Hierarchy:
1. Registers
A register is usually SRAM (static RAM) inside the processor, used to hold a data word of typically 64 bits or 128 bits. Most processors include a status word register and an accumulator. The accumulator is primarily used to store the data produced by arithmetic operations, and the status word register is primarily used for decision making.
2. Cache Memory
The cache holds chunks of frequently used information from main memory. Cache memory is located in the processor. A single-core processor will rarely have multiple cache levels; current multi-core processors typically have two cache levels private to each core and a third level shared among the cores.
3. Main Memory
Main memory is the memory unit that communicates directly with the CPU. It is the primary storage unit of a computer system: a fast, large memory used for storing programs and data throughout the computer's operation. This type of memory is made up of RAM as well as ROM.
4. Magnetic Disks
Magnetic disks are circular plates fabricated from metal or plastic and coated with a magnetizable material. Both faces of a disk are usually used, and several disks may be stacked on a single spindle, with read/write heads available for every surface. All disks on the spindle rotate together at high speed.
5. Magnetic Tape
Magnetic tape is a magnetic recording medium made of a thin magnetizable coating on a long, narrow strip of plastic film. It is used mainly to back up huge amounts of data. When a computer needs to access data on a tape, it first mounts the tape, accesses the information, and then unmounts it. Access time for magnetic tape is therefore much slower than for the other levels of the hierarchy; accessing data on a tape can take minutes.
31. Locality of Reference :
Locality of reference refers to the phenomenon in which a computer program tends to access the same set of memory locations over a particular period of time.
In other words, locality of reference is the tendency of a computer program to access instructions and data whose addresses are near one another.
The property of locality of reference is mainly shown by loops and subroutine calls in a
program.
Cache Operation:
It is based on the principle of locality of reference. There are two ways in which data or instructions are fetched from main memory and stored in cache memory:
1. Temporal Locality –
Temporal locality means current data or instruction that is being fetched may be needed
soon. So we should store that data or instruction in the cache memory so that we can
avoid again searching in main memory for the same data.
2. Spatial Locality –
Spatial locality means instruction or data near to the current memory location that is being
fetched, may be needed soon in the near future. This is slightly different from the temporal
locality. Here we are talking about nearly located memory locations while in temporal locality
we were talking about the actual memory location that was being fetched.
Cache Performance: When the processor needs to read or write a location in main memory, it
first checks for a corresponding entry in the cache.
If the processor finds that the memory location is in the cache, a cache hit has occurred
and data is read from the cache.
If the processor does not find the memory location in the cache, a cache miss has
occurred. For a cache miss, the cache allocates a new entry and copies in data from main
memory, then the request is fulfilled from the contents of the cache.
The performance of cache memory is frequently measured in terms of a quantity called Hit
ratio.
Hit ratio = hit / (hit + miss) = no. of hits/total accesses
We can improve cache performance by using a larger cache block size and higher associativity, and by reducing the miss rate, the miss penalty, and the time to hit in the cache. A small hit-ratio sketch follows.
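A small sketch of the hit-ratio arithmetic with made-up counts (950 hits out of 1000 accesses):

#include <stdio.h>

int main(void) {
    /* Hypothetical counters gathered during a run. */
    long hits = 950, misses = 50;
    double hit_ratio = (double)hits / (double)(hits + misses);
    printf("hit ratio = %.2f (%.0f%% of accesses served from the cache)\n",
           hit_ratio, hit_ratio * 100.0);   /* 0.95, 95% */
    return 0;
}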
Cache Mapping: There are three different types of mapping used for the purpose of cache
memory:-
A. Direct Mapping
The simplest technique, known as direct mapping, maps each block of main memory into only one possible cache line. In direct mapping, each memory block is assigned to a specific line in the cache; if that line is already occupied when a new block has to be loaded, the old block is evicted. The memory address is split into an index field and a tag field; the cache stores the tag (along with the data), and the index selects the cache line.
B. Associative Mapping
In this type of mapping, the associative memory is used to store content and addresses of the
memory word. Any block can go into any line of the cache. This means that the word id bits
are used to identify which word in the block is needed, but the tag becomes all of the
remaining bits. This enables the placement of any word at any place in the cache memory. It
is considered to be the fastest and the most flexible mapping form. In fully associative mapping there are no index bits; the entire block address serves as the tag.
C. Set-associative Mapping
This form of mapping is an enhanced form of direct mapping in which the drawbacks of direct mapping are removed. Set-associative mapping addresses the problem of possible thrashing in the direct mapping method. It allows two or more blocks of main memory that share the same index to be resident in the cache at the same time, one per way of the set. Set-associative cache mapping combines the best of direct and associative cache mapping techniques. In set-associative mapping the index bits select a set rather than a single line. The sketch below shows how an address is split under these mappings.
33. Write Strategy :
a. Write through :- In write-through, data is updated simultaneously in the cache and in memory. This process is simpler and more reliable. It is used when writes to the cache are infrequent (the number of write operations is low). It helps in data recovery (in case of a power outage or system failure). A data write experiences extra latency (delay) because we have to write to two locations (both memory and cache). It solves the inconsistency problem.
b. Write back :- The data is updated only in the cache and written to memory at a later time. Data is written to memory only when the cache line is about to be replaced (cache line replacement uses a policy such as Least Recently Used, FIFO, or random, depending on the application). Write back is also known as write deferred. A small sketch contrasting the two policies is given below.
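A minimal sketch of the two write policies for a single cache line; the data structures and addresses are invented for the example:

#include <stdbool.h>
#include <stdio.h>

/* One cache line of a toy cache (invented structure, for illustration). */
struct line { unsigned tag; int data; bool valid, dirty; };

static int memory[1024];

/* Write-through: update the cache AND memory on every write. */
static void write_through(struct line *l, unsigned addr, int value) {
    l->data = value;
    memory[addr] = value;          /* memory is always consistent */
}

/* Write-back: update only the cache and mark the line dirty;
   memory is updated later, when the line is evicted.           */
static void write_back(struct line *l, unsigned addr, int value) {
    (void)addr;                    /* address is only needed on eviction */
    l->data = value;
    l->dirty = true;
}

static void evict(struct line *l, unsigned addr) {
    if (l->valid && l->dirty) {    /* the deferred write happens here */
        memory[addr] = l->data;
        l->dirty = false;
    }
    l->valid = false;
}

int main(void) {
    struct line l = { .tag = 0, .data = 0, .valid = true, .dirty = false };
    write_through(&l, 5, 42);
    write_back(&l, 5, 43);         /* memory[5] is still 42 ...       */
    evict(&l, 5);                  /* ... until the line is evicted   */
    printf("memory[5] = %d\n", memory[5]);  /* 43 */
    return 0;
}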
34. Cache Misses : A cache miss is an event in which a system or application makes a request
to retrieve data from a cache, but that specific data is not currently in cache memory. Cache
Miss occurs when data is not available in the Cache Memory. When the CPU detects a miss, it
processes the miss by fetching requested data from main memory.
Types of Cache misses :
These are various types of cache misses as follows below.
1. Compulsory Miss –
It is also known as cold start misses or first references misses. These misses occur when
the first access to a block happens. Block must be brought into the cache.
2. Capacity Miss –
These misses occur when the program working set is much larger than the cache
capacity. Since Cache cannot contain all blocks needed for program execution, so
cache discards these blocks.
3. Conflict Miss –
It is also known as collision misses or interference misses. These misses occur when
several blocks are mapped to the same set or block frame. These misses occur in the
set associative or direct mapped block placement strategies.
4. Coherence Miss –
It is also known as an invalidation miss. These misses occur when another processor or an I/O device updates memory, invalidating the copy held in this cache.
35. Cache Optimization :
The cache is the part of the memory hierarchy closest to the CPU.
It is used to store frequently used data and instructions. Cache is expensive: the larger the cache memory, the higher the cost. Hence it is kept small to minimize cost.
To make up for its small capacity, it must be used to its full potential.
Cache optimization ensures that the cache is utilized efficiently, to its full potential.
Cache Optimization Techniques:
1. Larger block size: if the block size is increased, spatial locality can be exploited more effectively, which reduces the miss rate. But it may also increase the miss penalty, and the block size cannot be increased beyond a certain point, because the miss rate eventually starts to rise again: a larger block size implies fewer blocks in the cache, which increases conflict misses.
2. Larger cache size: increasing the cache size reduces capacity misses, thereby decreasing the miss rate. But it increases the hit time and power consumption.
1. Multi-Level Caches: If there is only one level of cache, then we need to decide between
keeping the cache size small in order to reduce the hit time or making it larger so that the
miss rate can be reduced. Both of them can be achieved simultaneously by introducing cache
at the next levels.
Suppose a two-level cache is considered:
The first-level cache is small, with an access time comparable to the clock cycle of the CPU.
The second-level cache is larger and slower than the first level, but still much faster than main memory. Its large size prevents many accesses from going all the way to main memory, thereby reducing the miss penalty. A small two-level AMAT sketch follows.
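A small sketch using the standard average memory access time relation, AMAT = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2 × MissPenalty_main), with made-up hit times and miss rates:

#include <stdio.h>

int main(void) {
    /* Hypothetical parameters, in clock cycles and fractions. */
    double hit_l1 = 1.0,  miss_rate_l1 = 0.05;
    double hit_l2 = 10.0, miss_rate_l2 = 0.20;   /* local L2 miss rate */
    double miss_penalty_main = 100.0;

    double amat = hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * miss_penalty_main);
    printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.05 * (10 + 0.20*100) = 2.50 */
    return 0;
}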
2. Critical word first and Early Restart: Generally, the processor requires one word of the
block at a time. So, there is no need of waiting until the full block is loaded before sending the
requested word. This is achieved using:
The critical word first: It is also called a requested word first. In this method, the exact
word required is requested from the memory and as soon as it arrives, it is sent to the
processor. In this way, two things are achieved, the processor continues execution, and the
other words in the block are read at the same time.
Early Restart: In this method, the words are fetched in the normal order. When the
requested word arrives, it is immediately sent to the processor which continues execution
with the requested word.
Way Prediction to Reduce Hit Time : In way prediction, extra bits are kept in the cache to
predict the way, or block within the set of the next cache access. This prediction means
the multiplexor is set early to select the desired block, and only a single tag comparison is
performed that clock cycle in parallel with reading the cache data. A miss results in
checking the other blocks for matches in the next clock cycle.
Pipelined Cache Access to Increase Cache Bandwidth : The critical timing path in a cache
hit is the three-step process of addressing the tag memory using the index portion of the
address, comparing the read tag value to the address, and setting the multiplexor to
choose the correct data item if the cache is set associative. This optimization is simply to
pipeline cache access so that the effective latency of a first-level cache hit can be
multiple clock cycles, giving fast clock cycle time and high bandwidth but slow hits.
Nonblocking Caches to Increase Cache Bandwidth : For pipelined computers that allow out-of-order execution, the processor need not stall on a data cache miss. A nonblocking cache or lockup-free cache escalates this potential benefit by allowing the data cache to continue to supply cache hits during a miss. This "hit under miss" optimization reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the processor.
Multi-banked Caches to Increase Cache Bandwidth : Rather than treat the cache as a single monolithic block, we can divide it into independent banks (as done in DRAM) that can support simultaneous accesses. A mapping of addresses to banks that spreads the accesses well across all the banks is sequential interleaving: spread the block addresses sequentially across the banks, as in the sketch below.
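A minimal sketch of sequential interleaving across four banks (the bank count is an assumed value): consecutive block addresses land in different banks, so they can be accessed simultaneously.

#include <stdio.h>

#define NUM_BANKS 4   /* assumed number of cache banks */

int main(void) {
    for (unsigned block_addr = 0; block_addr < 8; block_addr++) {
        unsigned bank = block_addr % NUM_BANKS;   /* sequential interleaving */
        printf("block %u -> bank %u\n", block_addr, bank);
    }
    return 0;
}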
Critical Word First and Early Restart to Reduce Miss Penalty : This technique is based on
the observation that the processor normally needs just one word of the block at a time.
Critical word first: Request the missed word first from memory and send it to the
processor as soon as it arrives; let the processor continue execution while filling the rest
of the words in the block.
Early restart: Fetch the words in normal order, but as soon as the requested
word of the block arrives send it to the processor and let the processor continue
execution.
38. Compiler Optimization : The compiler can easily reorganize the code, without affecting
the correctness of the program. The compiler can profile code, identify conflicting sequences
and do the reorganization accordingly. Reordering the instructions reduced misses by 50% for a
2-KB direct-mapped instruction cache with 4-byte blocks, and by 75% in an 8-KB cache. Another
code optimization aims for better efficiency from long cache blocks. Aligning basic blocks so
that the entry point is at the beginning of a cache block decreases the chance of a cache miss
for sequential code. This improves both spatial and temporal locality of reference.
39. Write Buffer Merging : This is an optimization used to improve the efficiency of write
buffers. Normally, if the write buffer is empty, the data and the full address will be written in
the buffer. The CPU continues working, while the buffer prepares to write the word to the
memory. Now, if the buffer contains other modified blocks, the addresses can be checked to
see if the address of this new data matches the address of a valid write buffer entry. If so, the
new data can be combined with the already available entry, called write merging.
40. NoC :
A network on a chip or network-on-chip is a network-based communications
subsystem on an integrated circuit ("microchip"), most typically between modules in
a system on a chip (SoC).
The modules on the IC are typically semiconductor IP cores schematizing various
functions of the computer system, and are designed to be modular in the sense
of network science.
The network on chip is a router-based packet switching network between SoC modules.
NoC technology applies the theory and methods of computer networking to on-
chip communication and brings notable improvements over
conventional bus and crossbar communication architectures.
Networks-on-chip come in many network topologies, many of which are still
experimental as of 2018.
A common NoC used in contemporary personal computers is a graphics processing
unit (GPU) — commonly used in computer graphics, video
gaming and accelerating artificial intelligence.
41. Topology :
The topology is the first fundamental aspect of NoC design, and it has a profound effect
on the overall network cost and performance.
The topology determines the physical layout and connections between nodes and
channels.
Also, the number of hops a message traverses, and the channel length of each hop, depend on the topology.
The topology significantly influences the latency and power consumption.
Since the topology determines the number of alternative paths between nodes, it
affects the network traffic distribution, and hence the network bandwidth and
performance achieved.
42. Routing :
Data routing networks are used for inter PE data exchange.
Data routing networks can be static or dynamic.
In a multicomputer network, data routing is achieved by messages among multiple computer nodes.
Routing network reduces the time required for data exchange and thus system performance is
enhanced.
Commonly used data routing functions are shifting, rotation, permutation, broadcast, multicast, personalized communication, shuffle, etc.; a small perfect-shuffle sketch is given below.
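A minimal sketch of one such routing function, the perfect shuffle on N = 2^k nodes: the destination of node i is obtained by rotating the k-bit binary address of i left by one position (the node count here is chosen arbitrarily):

#include <stdio.h>

#define K 3                 /* 2^3 = 8 nodes, chosen for the example */
#define N (1u << K)

/* Perfect shuffle: rotate the k-bit node address left by one bit. */
static unsigned shuffle(unsigned i) {
    return ((i << 1) | (i >> (K - 1))) & (N - 1);
}

int main(void) {
    for (unsigned i = 0; i < N; i++)
        printf("node %u -> node %u\n", i, shuffle(i));
    /* 0->0, 1->2, 2->4, 3->6, 4->1, 5->3, 6->5, 7->7 */
    return 0;
}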
Routing is the process of selecting a path for traffic in a network or between or across
multiple networks.
Broadly, routing is performed in many types of networks, including circuit-switched
networks, such as the public switched telephone network (PSTN), and computer
networks, such as the Internet.
Input-Output Configuration
As an illustration, let the input device be a keyboard and the output device a printer; these terminals send and receive data serially.
The data is alphanumeric and 8 bits in size. Input typed on the keyboard is transferred to the input register INPR. Information for the printer is stored serially in the output register OUTR.
The I/O registers communicate serially with the interfaces (keyboard, printer) and in parallel with the AC (accumulator).
The transmitter interface receives data from the keyboard and transfers it to INPR.
The receiver interface takes data from OUTR and sends it to the printer.
INPR holds the 8-bit alphanumeric input data.
FGI is a 1-bit input flag implemented as a flip-flop. It is set to 1 when the input device has new information available, and cleared to 0 when the computer accepts that information.
FGO is the corresponding 1-bit output flag. The output device sets FGO to 1 after it has received, decoded, and printed the information; FGO equal to 0 means the device is still busy printing.
45. Crossbar Switches :
A crossbar switch system consists of a number of crosspoints placed at the intersections between processor buses and memory-module paths.
At each crosspoint, a small switch establishes the path from a processor to a memory module.
Each switch point has control logic to set up the transfer path between a processor and a memory.
It examines the address placed on the bus to determine whether its particular module is being addressed.
In addition, it resolves multiple requests for access to the same memory module on a predetermined priority basis.