ACA - All Unit
MIPS: Millions of Instructions Per Second, a measure of execution rate. MIPS = Instruction count / (Execution time × 10^6) = Clock rate / (CPI × 10^6).
6. IPC: Instructions per cycle (IPC) is the average number of instructions executed per clock cycle; it is the multiplicative inverse of CPI (IPC = 1 / CPI).
7. CPI:
Cycles per instruction (aka clock cycles per instruction, clocks per instruction, or CPI) is
one aspect of a processor's performance: the average number of clock
cycles per instruction for a program or program fragment.
It is the multiplicative inverse of instructions per cycle.
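A small worked example with made-up counts (for illustration only): a program fragment that executes 2,000,000 instructions in 5,000,000 clock cycles has CPI = 5,000,000 / 2,000,000 = 2.5 and IPC = 1 / 2.5 = 0.4. The same arithmetic as a C sketch:

#include <stdio.h>

int main(void) {
    /* Hypothetical counts for a program fragment (illustration only). */
    double cycles = 5000000.0;
    double instructions = 2000000.0;

    double cpi = cycles / instructions;   /* cycles per instruction */
    double ipc = 1.0 / cpi;               /* instructions per cycle */

    printf("CPI = %.2f, IPC = %.2f\n", cpi, ipc);  /* CPI = 2.50, IPC = 0.40 */
    return 0;
}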
8. Amdahl’s Law: the overall speedup obtained from an enhancement is limited by the fraction of execution time during which the enhancement can be used.
i. Fraction enhanced = the fraction of the original execution time that can use the enhancement.
ii. Speedup enhanced = the speedup obtained on that fraction.
Overall speedup = 1 / ((1 - Fraction enhanced) + Fraction enhanced / Speedup enhanced).
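A minimal sketch of the formula with made-up numbers (assume 60% of the run can be enhanced by a factor of 10):

#include <stdio.h>

/* Amdahl's Law: overall speedup given the enhanced fraction and its speedup. */
static double amdahl(double fraction_enhanced, double speedup_enhanced) {
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced);
}

int main(void) {
    /* Hypothetical: 60% of execution time can be sped up 10x. */
    printf("Overall speedup = %.3f\n", amdahl(0.6, 10.0));  /* ~2.174 */
    return 0;
}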
9. Instruction Set: RISC vs CISC
RISC performs arithmetic operations only register to register; CISC can operate register to register, register to memory, or memory to memory.
A RISC instruction executes in a single clock cycle; a CISC instruction can take more than one clock cycle.
A RISC instruction fits in one word; CISC instructions can be larger than one word.
Pipeline processing can be applied not only to the data stream but also to the instruction stream. Most digital computers with complex instruction sets use an instruction pipeline to overlap tasks such as fetching, decoding and executing instructions.
In general, every instruction is processed by the computer in the following order:
1. Fetching the instruction from memory
2. Decoding the obtained instruction
3. Calculating the effective address
4. Fetching the operands from the given memory
5. Execution of the instruction
6. Storing the result in a proper place
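A purely illustrative sketch of this order on a toy machine; the word size, opcodes, addresses and program below are all invented for the example:

#include <stdio.h>

/* Toy machine: 16-bit words, instruction = 4-bit opcode + 12-bit address. */
enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3 };

int main(void) {
    unsigned short mem[4096] = {0};
    unsigned short pc = 0, acc = 0, running = 1;

    /* Tiny program: load mem[100], add mem[101], store to mem[102], halt. */
    mem[0] = (OP_LOAD  << 12) | 100;
    mem[1] = (OP_ADD   << 12) | 101;
    mem[2] = (OP_STORE << 12) | 102;
    mem[3] = (OP_HALT  << 12);
    mem[100] = 7; mem[101] = 5;

    while (running) {
        unsigned short ir = mem[pc++];          /* 1-2. fetch and decode  */
        unsigned short opcode = ir >> 12;
        unsigned short addr = ir & 0x0FFF;      /* 3. effective address   */
        unsigned short operand = mem[addr];     /* 4. fetch the operand   */
        switch (opcode) {                       /* 5. execute             */
        case OP_LOAD:  acc = operand;       break;
        case OP_ADD:   acc += operand;      break;
        case OP_STORE: mem[addr] = acc;     break;   /* 6. store result   */
        case OP_HALT:  running = 0;         break;
        }
    }
    printf("mem[102] = %u\n", mem[102]);  /* prints 12 */
    return 0;
}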
A four-segment instruction pipeline combines some of these steps. The instruction cycle is divided into four segments:
Segment 1
The instruction fetch segment can be implemented using a FIFO (first-in, first-out) buffer.
Segment 2
In the second segment, the instruction is decoded, and the effective address is then determined in a separate arithmetic circuit.
Segment 3
In the third segment, the operands are fetched from memory.
Segment 4
The instruction is finally executed in the last segment of the pipeline organisation.
12. RISC 5-stage pipeline : Early Reduced Instruction Set Computer (RISC) CPUs were designed to execute one instruction per cycle using a pipeline of five stages: Fetch, Decode, Execute, Memory, and Write-back. The simplicity of the operations performed allows every instruction to complete one stage per processor cycle.
Fetch: the instruction is fetched from memory.
Decode: the instruction is decoded and the source operands are fetched.
Execute: the operation specified by the instruction is performed.
Memory: any data that needs to be accessed is read or written in this stage.
Write-back: if the result must be stored in the destination location, it is written during this stage.
A cycle-by-cycle sketch of how instructions overlap in this pipeline is given below.
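A minimal sketch that prints the classic cycle-by-cycle pipeline diagram for a few independent instructions, assuming no stalls or hazards (the number of instructions is arbitrary):

#include <stdio.h>

int main(void) {
    const char *stages[] = { "IF", "ID", "EX", "MEM", "WB" };
    const int num_stages = 5, num_instr = 4;

    /* Instruction i occupies stage s during cycle i + s (0-based). */
    printf("        ");
    for (int c = 0; c < num_instr + num_stages - 1; c++) printf(" C%-3d", c + 1);
    printf("\n");
    for (int i = 0; i < num_instr; i++) {
        printf("instr %d:", i + 1);
        for (int c = 0; c < num_instr + num_stages - 1; c++) {
            int s = c - i;
            if (s >= 0 && s < num_stages) printf(" %-4s", stages[s]);
            else                          printf("     ");
        }
        printf("\n");
    }
    return 0;
}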
13. Pipeline Hazards : Pipeline hazards are conditions that can occur in a pipelined machine
that impede the execution of a subsequent instruction in a particular cycle for a variety of
reasons.
Types:
i). Structural Hazards:
Hardware resource conflicts among the instructions in the pipeline cause structural hazards.
Memory, a GPR register, or an ALU might all be used as resources here.
A resource conflict is said to arise when more than one instruction in the pipe requires access to the very same resource in the same clock cycle.
ii). Data Hazards:
Data hazards in pipelining emerge when the execution of one instruction depends on the result of another instruction that is still being processed in the pipeline.
The order of the READ and WRITE operations on the register is used to classify data hazards into three groups: RAW (read after write), WAR (write after read), and WAW (write after write); see the sketch after this list.
iii). Control Hazards:
Branch hazards are caused by branch instructions and are known as control hazards in
computer architecture.
The flow of program/instruction execution is controlled by branch instructions.
Remember that conditional statements are used in higher-level languages for iterative
loops and condition testing (correlate with while, for, and if case statements). These are
converted into one of the BRANCH instruction variations.
As a result, a control hazard develops when the decision about which instruction to execute next depends on the result of another instruction, such as a conditional branch that examines the value of its condition.
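A minimal illustration of the three data-hazard groups, using C variables r1..r7 to stand in for registers; the instruction sequences are invented for the example:

#include <stdio.h>

int main(void) {
    /* C variables standing in for registers. */
    int r1, r2 = 2, r3 = 3, r4, r5 = 5, r6, r7;

    /* RAW (read after write) - true dependence:
       i2 reads r1, which i1 writes, so i2 must wait for i1's result. */
    r1 = r2 + r3;      /* i1 */
    r4 = r1 + r5;      /* i2: RAW hazard on r1 */

    /* WAR (write after read) - antidependence:
       i4 must not overwrite r2 before i3 has read it. */
    r6 = r2 + r1;      /* i3 */
    r2 = r5 + r5;      /* i4: WAR hazard on r2 */

    /* WAW (write after write) - output dependence:
       i5 and i6 both write r7; the final value must be i6's. */
    r7 = r1 + r2;      /* i5 */
    r7 = r3 + r4;      /* i6: WAW hazard on r7 */

    printf("%d %d %d %d %d\n", r1, r4, r6, r2, r7);
    return 0;
}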
i). Static Branch Prediction Technique : In the static branch prediction technique, the underlying hardware assumes either that the branch is always not taken or that the branch is always taken.
Pipeline scheduling refers to the act of automating parts or all of a data pipeline’s
components at fixed times, dates or intervals.
Pipeline scheduling is not to be confused with data streaming which involves a constant,
real-time feed of data from one or more sources that passes through the processes
specified in the pipeline.
Data pipeline tools make pipeline scheduling easy.
Advantages:
Increases program efficiency.
Reduces loop overhead.
If the statements in a loop are not dependent on each other, they can be executed in parallel (see the loop-unrolling sketch below).
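A minimal loop-unrolling sketch; the array size and unroll factor are arbitrary. The four statements in the unrolled body are independent of each other, so a pipelined or multiple-issue processor can overlap them:

#include <stdio.h>

#define N 1024   /* assume N is a multiple of 4 for this sketch */

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = i;

    /* Original loop:
       for (int i = 0; i < N; i++) a[i] = b[i] * 2.0;                  */

    /* Unrolled by 4: one quarter of the loop overhead (branch, index
       update), and the four statements are independent of each other. */
    for (int i = 0; i < N; i += 4) {
        a[i]     = b[i]     * 2.0;
        a[i + 1] = b[i + 1] * 2.0;
        a[i + 2] = b[i + 2] * 2.0;
        a[i + 3] = b[i + 3] * 2.0;
    }
    printf("a[5] = %.1f\n", a[5]);  /* 10.0 */
    return 0;
}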
Disadvantages:
Advantages of dynamic scheduling:
It handles cases when dependences are unknown at compile time
It simplifies the compiler
It allows code compiled for one pipeline to run efficiently on a different pipeline
Hardware speculation, a technique with significant performance advantages, builds
on dynamic scheduling.
19. Hardware based Speculation :
Hardware-based speculation follows the predicted flow of data values to choose when to
execute instructions.
This method of executing programs is essentially a data-flow execution: operations execute
as soon as their operands are available.
Hardware-based speculation combines three key ideas:
Dynamic branch prediction to choose which instructions to execute,
Speculation to allow the execution of instructions before the control dependences are
resolved and
Dynamic scheduling to deal with the scheduling of different combinations of basic blocks.
Advantages:
It works on legacy code, with no recompilation required.
No "fix-up" code is required.
It maintains precise exceptions, even with speculation.
Hardware speculation is attractive because dynamic branch prediction can be better than static prediction, especially in integer programs.
Advantages of VLIW:
Reduces hardware complexity.
Reduces power consumption because of the reduction in hardware complexity.
Since the compiler takes care of data dependence checking, decoding and instruction issue, the hardware becomes a lot simpler.
Increases the potential clock rate.
Functional units are assigned to the slots of the instruction packet by the compiler.
Disadvantages :
Complex compilers are required which are hard to design.
Increased program code size.
Larger memory bandwidth and register-file bandwidth.
Unscheduled events, for example a cache miss, can stall the entire processor.
Unfilled operation slots in a VLIW waste memory space and instruction bandwidth.
22. Multithreading :
Multithreading is a function of the CPU that permits multiple threads to run
independently while sharing the same process resources.
A thread is a sequence of instructions that can run within the same parent process alongside other threads.
Multithreading allows many parts of a program to run simultaneously.
These parts are referred to as threads, and they are lightweight processes that are
available within the process.
As a result, multithreading increases CPU utilization through multitasking. In
multithreading, a computer may execute and process multiple tasks simultaneously.
Multithreading needs a detailed understanding of these two terms: process and thread.
A process is a running program, and a process can be subdivided into independent units called threads (a minimal threading sketch is given after the lists below).
Advantages
a. Responsive
b. Resource sharing
c. Economy
d. Scalability
e. Better communication
f. Utilization of multiprocessor architecture
g. Minimized system resource usage
Disadvantages
a. It needs more careful synchronization.
b. It can consume a large amount of stack space when many threads are blocked.
c. It requires operating-system support for threads.
d. If a parent process needs several threads for proper functioning, the child processes should also be multithreaded, because they may be required.
e. It imposes context-switching overhead.
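A minimal sketch using POSIX threads (one possible threading API) of two threads running within one process and sharing its data; the work done by each thread is arbitrary:

#include <pthread.h>
#include <stdio.h>

/* Each thread sums one half of the shared array; each result goes to its
   own slot, so no locking is needed for this simple partitioning. */
#define N 1000
static long data[N];
static long partial[2];

static void *worker(void *arg) {
    long id = (long)arg;                 /* 0 or 1 */
    long sum = 0;
    for (long i = id * (N / 2); i < (id + 1) * (N / 2); i++) sum += data[i];
    partial[id] = sum;
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long i = 0; i < N; i++) data[i] = i;

    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < 2; id++)
        pthread_join(t[id], NULL);       /* wait for both threads */

    printf("sum = %ld\n", partial[0] + partial[1]);  /* 499500 */
    return 0;
}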
24. Superscalar :
A superscalar processor fetches and issues more than one instruction per clock cycle, using multiple pipelines or functional units, thus enhancing instruction throughput by keeping more instructions in flight at a time.
The related idea of superpipelining breaks pipeline stages into shorter stages in an attempt to shorten the clock period; each stage still performs one pipeline step per clock cycle.
The more pipe stages there are, the faster the pipeline can be clocked, because each stage is then shorter.
Ideally, a pipeline with five stages should be five times faster than a non-pipelined processor.
Vector architecture adds instruction-set extensions to an ISA to support vector operations, which are deeply pipelined.
Vector operations work on vector registers, which are fixed-length banks of registers. Data is transferred between a vector register and the memory system.
Each vector operation takes vector registers, or a vector register and a scalar value, as input.
Vector architecture is only effective on applications that have significant data-level parallelism (DLP). Vector processing greatly reduces the dynamic instruction bandwidth. Execution time is generally reduced because (1) a single vector instruction specifies a large amount of work, so control overhead and hazard checks are paid once per vector rather than once per element, (2) stalls occur only on the first vector element rather than on each vector element, and (3) the memory access pattern of a vector is known, so memory latency can be amortized over the whole vector. A small sketch of a data-parallel loop that maps naturally to vector instructions follows.
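The array length and vector register length below are assumed values. On a vector machine each strip of 64 elements would become a handful of vector instructions rather than 64 separate scalar instructions:

#include <stdio.h>

#define N   4096
#define VL  64      /* assumed vector register length */

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Strip-mined DAXPY-style loop: on a vector machine each inner strip
       maps to one vector load, one vector multiply-add, one vector store. */
    double a = 3.0;
    for (int i = 0; i < N; i += VL) {
        for (int j = i; j < i + VL; j++)   /* one "vector instruction"    */
            y[j] = a * x[j] + y[j];        /* per strip, not per element  */
    }
    printf("y[0] = %.1f\n", y[0]);  /* 5.0 */
    return 0;
}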
GPUs are designed to perform high-speed parallel computations, for example to render the graphics of games.
CUDA-capable GPUs are widely available: more than 100 million GPUs are already deployed.
For some applications, GPUs provide a 30-100x speed-up over general-purpose microprocessors.
GPUs contain many small Arithmetic Logic Units (ALUs), in contrast to the few, larger ALUs of a CPU. This allows many calculations to be performed in parallel, such as computing the color of each pixel on the screen.
Memory Hierarchy:
1. Registers
A register is usually SRAM (static RAM) inside the processor, used to hold a data word of typically 64 bits or 128 bits. Most processors include a status word register and an accumulator. The accumulator is primarily used to store the data produced by arithmetic operations, and the status word register is primarily used for decision making.
2. Cache Memory
The cache holds chunks of frequently used information from main memory. Cache memory is located in the processor. A single-core processor will rarely have multiple cache levels; current multi-core processors typically have two cache levels private to each core and a third level shared among the cores.
3. Main Memory
Main memory is the memory unit that communicates directly with the CPU. It is the primary storage unit of a computer system: a fast, large memory used for storing programs and data throughout the computer's operation. This type of memory is made up of RAM as well as ROM.
4. Magnetic Disks
Magnetic disks are circular plates fabricated from metal or plastic and coated with a magnetizable material. Both faces of a disk are usually used, and several disks may be stacked on a single spindle, with read/write heads available for every surface. All disks on the spindle rotate together at high speed.
5. Magnetic Tape
Magnetic tape is a magnetic recording medium made of a thin magnetizable coating on a long, narrow strip of plastic film. It is used mainly to back up huge amounts of data. When a computer needs to access data on a tape, it first mounts the tape, accesses the information, and then unmounts it. Access time for magnetic tape is therefore much slower than for the other levels of the hierarchy; accessing data on a tape can take minutes.
31. Locality of Reference :
Locality of reference refers to the phenomenon in which a computer program tends to access the same set of memory locations over a particular period of time.
In other words, locality of reference is the tendency of a computer program to access instructions and data whose addresses are near one another.
The property of locality of reference is mainly shown by loops and subroutine calls in a
program.
Cache Operation:
It is based on the principle of locality of reference. There are two ways in which data or instructions are fetched from main memory and stored in cache memory:
1. Temporal Locality –
Temporal locality means current data or instruction that is being fetched may be needed
soon. So we should store that data or instruction in the cache memory so that we can
avoid again searching in main memory for the same data.
2. Spatial Locality –
Spatial locality means instruction or data near to the current memory location that is being
fetched, may be needed soon in the near future. This is slightly different from the temporal
locality. Here we are talking about nearly located memory locations while in temporal locality
we were talking about the actual memory location that was being fetched.
Cache Performance: When the processor needs to read or write a location in main memory, it
first checks for a corresponding entry in the cache.
If the processor finds that the memory location is in the cache, a cache hit has occurred
and data is read from the cache.
If the processor does not find the memory location in the cache, a cache miss has
occurred. For a cache miss, the cache allocates a new entry and copies in data from main
memory, then the request is fulfilled from the contents of the cache.
The performance of cache memory is frequently measured in terms of a quantity called Hit
ratio.
Hit ratio = hit / (hit + miss) = no. of hits/total accesses
We can improve cache performance by using a larger cache block size and higher associativity, and by reducing the miss rate, the miss penalty, and the time to hit in the cache. A small hit-ratio sketch follows.
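A small sketch of the hit-ratio arithmetic with made-up counts (950 hits out of 1000 accesses):

#include <stdio.h>

int main(void) {
    /* Hypothetical counters gathered during a run. */
    long hits = 950, misses = 50;
    double hit_ratio = (double)hits / (double)(hits + misses);
    printf("hit ratio = %.2f (%.0f%% of accesses served from the cache)\n",
           hit_ratio, hit_ratio * 100.0);   /* 0.95, 95% */
    return 0;
}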
Cache Mapping: There are three different types of mapping used for the purpose of cache
memory:-
A. Direct Mapping
The simplest technique, known as direct mapping, maps each block of main memory into only one possible cache line. In direct mapping, each memory block is assigned to a specific line in the cache; if that line is already occupied when a new block has to be loaded, the old block is evicted. The memory address is split into an index field and a tag field; the cache stores the tag (along with the data), and the index selects the cache line.
B. Associative Mapping
In this type of mapping, the associative memory is used to store content and addresses of the
memory word. Any block can go into any line of the cache. This means that the word id bits
are used to identify which word in the block is needed, but the tag becomes all of the
remaining bits. This enables the placement of any word at any place in the cache memory. It
is considered to be the fastest and the most flexible mapping form. In fully associative mapping there are no index bits; the entire block address serves as the tag.
C. Set-associative Mapping
This form of mapping is an enhanced form of direct mapping in which the drawbacks of direct mapping are removed. Set-associative mapping addresses the problem of possible thrashing in the direct mapping method. It allows two or more blocks of main memory that share the same index to be resident in the cache at the same time, one per way of the set. Set-associative cache mapping combines the best of direct and associative cache mapping techniques. In set-associative mapping the index bits select a set rather than a single line. The sketch below shows how an address is split under these mappings.
33. Write Strategy :
a. Write through :- In write-through, data is updated simultaneously in the cache and in memory. This process is simpler and more reliable. It is used when writes to the cache are infrequent (the number of write operations is low). It helps in data recovery (in case of a power outage or system failure). A data write experiences extra latency (delay) because we have to write to two locations (both memory and cache). It solves the inconsistency problem.
b. Write back :- The data is updated only in the cache and written to memory at a later time. Data is written to memory only when the cache line is about to be replaced (cache line replacement uses a policy such as Least Recently Used, FIFO, or random, depending on the application). Write back is also known as write deferred. A small sketch contrasting the two policies is given below.
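A minimal sketch of the two write policies for a single cache line; the data structures and addresses are invented for the example:

#include <stdbool.h>
#include <stdio.h>

/* One cache line of a toy cache (invented structure, for illustration). */
struct line { unsigned tag; int data; bool valid, dirty; };

static int memory[1024];

/* Write-through: update the cache AND memory on every write. */
static void write_through(struct line *l, unsigned addr, int value) {
    l->data = value;
    memory[addr] = value;          /* memory is always consistent */
}

/* Write-back: update only the cache and mark the line dirty;
   memory is updated later, when the line is evicted.           */
static void write_back(struct line *l, unsigned addr, int value) {
    (void)addr;                    /* address is only needed on eviction */
    l->data = value;
    l->dirty = true;
}

static void evict(struct line *l, unsigned addr) {
    if (l->valid && l->dirty) {    /* the deferred write happens here */
        memory[addr] = l->data;
        l->dirty = false;
    }
    l->valid = false;
}

int main(void) {
    struct line l = { .tag = 0, .data = 0, .valid = true, .dirty = false };
    write_through(&l, 5, 42);
    write_back(&l, 5, 43);         /* memory[5] is still 42 ...       */
    evict(&l, 5);                  /* ... until the line is evicted   */
    printf("memory[5] = %d\n", memory[5]);  /* 43 */
    return 0;
}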
34. Cache Misses : A cache miss is an event in which a system or application makes a request
to retrieve data from a cache, but that specific data is not currently in cache memory. Cache
Miss occurs when data is not available in the Cache Memory. When the CPU detects a miss, it
processes the miss by fetching requested data from main memory.
Types of Cache misses :
These are various types of cache misses as follows below.
1. Compulsory Miss –
It is also known as cold start misses or first references misses. These misses occur when
the first access to a block happens. Block must be brought into the cache.
2. Capacity Miss –
These misses occur when the program working set is much larger than the cache
capacity. Since Cache cannot contain all blocks needed for program execution, so
cache discards these blocks.
3. Conflict Miss –
It is also known as collision misses or interference misses. These misses occur when
several blocks are mapped to the same set or block frame. These misses occur in the
set associative or direct mapped block placement strategies.
4. Coherence Miss –
It is also known as an invalidation miss. These misses occur when another processor or an I/O device updates memory, invalidating the copy held in this cache.
35. Cache Optimization :
The cache is the part of the memory hierarchy closest to the CPU.
It is used to store frequently used data and instructions. Cache is expensive: the larger the cache memory, the higher the cost. Hence it is kept small to minimize cost.
To make up for its small capacity, it must be used to its full potential.
Cache optimization ensures that the cache is utilized efficiently, to its full potential.
Cache Optimization Techniques:
1. Larger block size: if the block size is increased, spatial locality can be exploited more effectively, which reduces the miss rate. But it may also increase the miss penalty, and the block size cannot be increased beyond a certain point, because the miss rate eventually starts to rise again: a larger block size implies fewer blocks in the cache, which increases conflict misses.
2. Larger cache size: increasing the cache size reduces capacity misses, thereby decreasing the miss rate. But it increases the hit time and power consumption.
1. Multi-Level Caches: If there is only one level of cache, then we need to decide between
keeping the cache size small in order to reduce the hit time or making it larger so that the
miss rate can be reduced. Both of them can be achieved simultaneously by introducing cache
at the next levels.
Suppose a two-level cache is considered:
The first-level cache is small, with an access time comparable to the clock cycle of the CPU.
The second-level cache is larger and slower than the first level, but still much faster than main memory. Its large size prevents many accesses from going all the way to main memory, thereby reducing the miss penalty. A small two-level AMAT sketch follows.
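A small sketch using the standard average memory access time relation, AMAT = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2 × MissPenalty_main), with made-up hit times and miss rates:

#include <stdio.h>

int main(void) {
    /* Hypothetical parameters, in clock cycles and fractions. */
    double hit_l1 = 1.0,  miss_rate_l1 = 0.05;
    double hit_l2 = 10.0, miss_rate_l2 = 0.20;   /* local L2 miss rate */
    double miss_penalty_main = 100.0;

    double amat = hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * miss_penalty_main);
    printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.05 * (10 + 0.20*100) = 2.50 */
    return 0;
}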
2. Critical word first and Early Restart: Generally, the processor requires one word of the
block at a time. So, there is no need of waiting until the full block is loaded before sending the
requested word. This is achieved using:
The critical word first: It is also called a requested word first. In this method, the exact
word required is requested from the memory and as soon as it arrives, it is sent to the
processor. In this way, two things are achieved, the processor continues execution, and the
other words in the block are read at the same time.
Early Restart: In this method, the words are fetched in the normal order. When the
requested word arrives, it is immediately sent to the processor which continues execution
with the requested word.
Way Prediction to Reduce Hit Time : In way prediction, extra bits are kept in the cache to
predict the way, or block within the set of the next cache access. This prediction means
the multiplexor is set early to select the desired block, and only a single tag comparison is
performed that clock cycle in parallel with reading the cache data. A miss results in
checking the other blocks for matches in the next clock cycle.
Pipelined Cache Access to Increase Cache Bandwidth : The critical timing path in a cache
hit is the three-step process of addressing the tag memory using the index portion of the
address, comparing the read tag value to the address, and setting the multiplexor to
choose the correct data item if the cache is set associative. This optimization is simply to
pipeline cache access so that the effective latency of a first-level cache hit can be
multiple clock cycles, giving fast clock cycle time and high bandwidth but slow hits.
Nonblocking Caches to Increase Cache Bandwidth : For pipelined computers that allow out-of-order execution, the processor need not stall on a data cache miss. A nonblocking cache or lockup-free cache escalates this potential benefit by allowing the data cache to continue to supply cache hits during a miss. This "hit under miss" optimization reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the processor.
Multi-banked Caches to Increase Cache Bandwidth : Rather than treat the cache as a single monolithic block, we can divide it into independent banks (as done in DRAM) that can support simultaneous accesses. A mapping of addresses to banks that spreads the accesses well across all the banks is sequential interleaving: spread the block addresses sequentially across the banks, as in the sketch below.
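A minimal sketch of sequential interleaving across four banks (the bank count is an assumed value): consecutive block addresses land in different banks, so they can be accessed simultaneously.

#include <stdio.h>

#define NUM_BANKS 4   /* assumed number of cache banks */

int main(void) {
    for (unsigned block_addr = 0; block_addr < 8; block_addr++) {
        unsigned bank = block_addr % NUM_BANKS;   /* sequential interleaving */
        printf("block %u -> bank %u\n", block_addr, bank);
    }
    return 0;
}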
Critical Word First and Early Restart to Reduce Miss Penalty : This technique is based on
the observation that the processor normally needs just one word of the block at a time.
Critical word first: Request the missed word first from memory and send it to the
processor as soon as it arrives; let the processor continue execution while filling the rest
of the words in the block.
Early restart: Fetch the words in normal order, but as soon as the requested
word of the block arrives send it to the processor and let the processor continue
execution.
38. Compiler Optimization : The compiler can easily reorganize the code, without affecting
the correctness of the program. The compiler can profile code, identify conflicting sequences
and do the reorganization accordingly. Reordering the instructions reduced misses by 50% for a
2-KB direct-mapped instruction cache with 4-byte blocks, and by 75% in an 8-KB cache. Another
code optimization aims for better efficiency from long cache blocks. Aligning basic blocks so
that the entry point is at the beginning of a cache block decreases the chance of a cache miss
for sequential code. This improves both spatial and temporal locality of reference.
39. Write Buffer Merging : This is an optimization used to improve the efficiency of write
buffers. Normally, if the write buffer is empty, the data and the full address will be written in
the buffer. The CPU continues working, while the buffer prepares to write the word to the
memory. Now, if the buffer contains other modified blocks, the addresses can be checked to
see if the address of this new data matches the address of a valid write buffer entry. If so, the
new data can be combined with the already available entry, called write merging.
40. NoC :
A network on a chip or network-on-chip is a network-based communications
subsystem on an integrated circuit ("microchip"), most typically between modules in
a system on a chip (SoC).
The modules on the IC are typically semiconductor IP cores schematizing various
functions of the computer system, and are designed to be modular in the sense
of network science.
The network on chip is a router-based packet switching network between SoC modules.
NoC technology applies the theory and methods of computer networking to on-
chip communication and brings notable improvements over
conventional bus and crossbar communication architectures.
Networks-on-chip come in many network topologies, many of which are still
experimental as of 2018.
A common NoC used in contemporary personal computers is a graphics processing
unit (GPU) — commonly used in computer graphics, video
gaming and accelerating artificial intelligence.
41. Topology :
The topology is the first fundamental aspect of NoC design, and it has a profound effect
on the overall network cost and performance.
The topology determines the physical layout and connections between nodes and
channels.
Also, the number of hops a message traverses, and the channel length of each hop, depend on the topology.
The topology significantly influences the latency and power consumption.
Since the topology determines the number of alternative paths between nodes, it
affects the network traffic distribution, and hence the network bandwidth and
performance achieved.
42. Routing :
Data routing networks are used for inter PE data exchange.
Data routing networks can be static or dynamic.
In a multicomputer network, data routing is achieved by messages among multiple computer nodes.
Routing network reduces the time required for data exchange and thus system performance is
enhanced.
Commonly used data routing functions are shifting, rotation, permutation, broadcast, multicast, personalized communication, shuffle, etc.; a small perfect-shuffle sketch is given below.
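A minimal sketch of one such routing function, the perfect shuffle on N = 2^k nodes: the destination of node i is obtained by rotating the k-bit binary address of i left by one position (the node count here is chosen arbitrarily):

#include <stdio.h>

#define K 3                 /* 2^3 = 8 nodes, chosen for the example */
#define N (1u << K)

/* Perfect shuffle: rotate the k-bit node address left by one bit. */
static unsigned shuffle(unsigned i) {
    return ((i << 1) | (i >> (K - 1))) & (N - 1);
}

int main(void) {
    for (unsigned i = 0; i < N; i++)
        printf("node %u -> node %u\n", i, shuffle(i));
    /* 0->0, 1->2, 2->4, 3->6, 4->1, 5->3, 6->5, 7->7 */
    return 0;
}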
Routing is the process of selecting a path for traffic in a network or between or across
multiple networks.
Broadly, routing is performed in many types of networks, including circuit-switched
networks, such as the public switched telephone network (PSTN), and computer
networks, such as the Internet.
Input-Output Configuration
As an illustration, let the input device be a keyboard and the output device a printer; these terminals send and receive data serially.
The data is alphanumeric and 8 bits in size. Input typed on the keyboard is transferred to the input register INPR. Information for the printer is stored serially in the output register OUTR.
The I/O registers communicate serially with the interfaces (keyboard, printer) and in parallel with the AC (accumulator).
The transmitter interface receives data from the keyboard and transfers it to INPR.
The receiver interface takes data from OUTR and sends it to the printer.
INPR holds the 8-bit alphanumeric input data.
FGI is a 1-bit input flag implemented as a flip-flop. It is set to 1 when the input device has new information available, and cleared to 0 when the computer accepts that information.
FGO is the corresponding 1-bit output flag. The output device sets FGO to 1 after it has received, decoded, and printed the information; FGO equal to 0 means the device is still busy printing.
45. Crossbar Switches :
A crossbar switch system consists of a number of crosspoints placed at the intersections between processor buses and memory-module paths.
At each crosspoint, a small switch establishes the path from a processor to a memory module.
Each switch point has control logic to set up the transfer path between a processor and a memory.
It examines the address placed on the bus to determine whether its particular module is being addressed.
In addition, it resolves multiple requests for access to the same memory module on a predetermined priority basis.