CH 1 Intro To Parallel Architecture
In today's rapidly evolving digital landscape, the need for parallel architecture has become
increasingly crucial. Parallel architecture refers to the use of multiple processors working
together simultaneously to perform computational tasks. This approach has gained immense
significance due to the growing demands for enhanced performance, improved efficiency, and
the ability to process large volumes of data swiftly. In this note, we will explore the need for
parallel architecture, providing insights through bullet points, case studies, and examples.
1. Performance Enhancement:
Case Study: The Large Hadron Collider (LHC) at CERN, one of the most significant
scientific experiments, utilizes a massive parallel computing cluster to analyze vast
amounts of data generated from particle collisions. This parallel processing enables the
LHC to make groundbreaking discoveries in particle physics.
Example: Modern gaming consoles, like the Xbox Series X and PlayStation 5, employ
multi-core processors to deliver superior graphics and real-time gameplay.
• Data Analytics: With the exponential growth of data, organizations need to process
and analyze vast datasets swiftly. Parallel architecture is indispensable for distributed
data analytics, allowing organizations to extract valuable insights from their data.
Case Study: Google's BigQuery, a data warehouse that can scan and analyze terabytes
of data within seconds, utilizes parallel processing to handle massive datasets and
deliver real-time queries.
Example: The NVIDIA Tesla V100, a GPU designed for AI and deep learning, employs
thousands of cores to parallelize tasks, significantly reducing the time required to train
complex machine learning models.
Example: Cloud computing platforms like Amazon Web Services (AWS) and Microsoft
Azure offer scalable, parallel computing instances that can be adjusted to meet specific
resource demands.
Case Study: The Human Genome Project used a parallel computing approach to map
the entire human genome. This monumental scientific achievement was made possible
through the parallel processing of DNA sequences.
• Climate Modelling: Climate scientists use parallel architecture to run complex climate
models, simulating various climate scenarios and understanding the impact of climate
change.
Case Study: The European Centre for Medium-Range Weather Forecasts (ECMWF)
employs a supercomputer with parallel architecture to run global weather forecasts and
climate simulations.
Example: The Advanced Encryption Standard (AES), widely used for data encryption,
can be accelerated through parallel processing, making it suitable for securing data
transmissions in real time (see the code sketch after this list).
Example: Intel's Core i9 processors and AMD's Ryzen processors are consumer-grade
CPUs with multiple cores, making parallel processing capabilities readily available for
personal computers.
• Specialized Hardware: The advent of specialized hardware, like GPUs and TPUs, has
enabled parallel processing for specific tasks, such as graphics rendering, AI, and deep
learning.
Example: NVIDIA's GeForce RTX 30 series GPUs are designed for parallel
processing, making them popular choices for gaming, content creation, and AI
applications.
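To make the AES example concrete, here is a minimal Python sketch of parallel AES-CTR
encryption. It assumes the third-party "cryptography" package; the chunk size, the helper
names (encrypt_chunk, parallel_ctr_encrypt), and the use of a thread pool are illustrative
choices rather than any standard API, and whether the threads run truly in parallel depends
on the crypto backend releasing Python's GIL. The point it demonstrates is that in CTR mode
each chunk's starting counter can be computed directly, so independent chunks can be
encrypted concurrently.

    import os
    from concurrent.futures import ThreadPoolExecutor
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    BLOCK = 16            # AES block size in bytes
    CHUNK = 4096 * BLOCK  # bytes per parallel work unit (illustrative choice)

    def encrypt_chunk(key, base_counter, block_offset, chunk):
        # Each chunk gets its own starting counter, so chunks are independent
        # and can be processed by different cores at the same time.
        counter = (base_counter + block_offset) % (1 << 128)
        cipher = Cipher(algorithms.AES(key), modes.CTR(counter.to_bytes(16, "big")))
        return cipher.encryptor().update(chunk)

    def parallel_ctr_encrypt(key, nonce16, data):
        base = int.from_bytes(nonce16, "big")
        pieces = [(i // BLOCK, data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]
        with ThreadPoolExecutor() as pool:
            parts = pool.map(lambda p: encrypt_chunk(key, base, p[0], p[1]), pieces)
        return b"".join(parts)

    if __name__ == "__main__":
        key, nonce = os.urandom(16), os.urandom(16)
        data = os.urandom(1_000_000)
        ct = parallel_ctr_encrypt(key, nonce, data)
        # Sanity check against a single sequential pass over the whole buffer.
        ref = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor().update(data)
        assert ct == ref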
In conclusion, the need for parallel architecture has grown exponentially as our digital world
becomes more data-driven and reliant on high-performance computing. The examples and case
studies provided underscore the critical role of parallel architecture in various domains, from
scientific research to real-time applications, data analytics, and security. With the continuous
evolution of hardware and technology, parallel architecture is poised to remain a fundamental
component of the computing landscape, enabling us to meet the demands of the future
effectively.
❖ Application Trends
• The need for making applications work faster is something we see in every part of
computing. As computer technology gets better, we can do more things with our
applications. But this also means these applications become more demanding and need
even better technology. So, it's like a never-ending cycle where we keep improving
computer technology to make applications run faster.
• This cycle of improvement is what pushes us to make microprocessors perform better
and better. Microprocessors are like the brains of computers, and we keep making them
more powerful. This also puts extra pressure on parallel architecture because it's used
for the most demanding applications that need a lot of computing power.
• To give you an idea of how much better things are getting, if a regular computer gets
50% faster every year, a computer with a hundred processors working together has
roughly the power that a regular computer will have in about ten years. And with a
thousand processors, it is like having the power of a computer almost twenty years in
the future (the short sketch below checks this arithmetic).
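A quick back-of-the-envelope check of that claim in Python, using the 50%-per-year growth
rate assumed in the text:

    import math

    # Years of 50%-per-year single-processor growth needed to match a
    # P-processor machine available today: solve 1.5**years = P.
    for P in (100, 1000):
        years = math.log(P) / math.log(1.5)
        print(f"{P} processors ~ {years:.1f} years of sequential improvement")
    # -> 100 processors ~ 11.4 years; 1000 processors ~ 17.0 years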
• Because different applications need different levels of performance, computer
companies make different types of computers. Some are not very powerful and are used
by most people, while others are super powerful and are used for the most demanding
applications. This creates what we call a "platform pyramid," where most people use
the less powerful computers, and a smaller group uses the super powerful ones. The
pressure to make computers even more powerful is greatest at the top of the pyramid,
where the most demanding applications are.
• Before microprocessors came into the picture, we used fancy technologies and different
ways of organizing computers to make them faster. But nowadays, the best way to make
computers way faster than what we have now is by using multiple processors. The
applications that need the most power are written to run on multiple processors at the
same time. So, the need for better performance is highest for parallel architectures and
the applications that use them.
• Both architects and application developers want to know how using parallelism makes
applications run faster. We can measure this improvement in performance as the
"speedup on p processors":
Speedup (p processors) = Performance (p processors) / Performance (1 processor)
When you have one specific problem to solve, the machine's performance on that
problem is simply the inverse of how long it takes to solve it. So, in this special
situation, we can say the following:
Speedup (p processors) = Time (1 processor) / Time (p processors)
In the business world, high-end computers now often use parallel technology. They
might not need as much parallel power as in scientific work, but they use it a lot.
Multi-processor systems have been the top choice for business computing since the
1960s. In this area, how fast and powerful a computer is directly affects how big of
a business it can support. We check this using tests, like those for online transaction
processing (OLTP) sponsored by the Transaction Processing Performance Council
(TPC). These tests measure how many transactions a computer can handle in a
minute.
❖ Technology Trends
• The need for parallelism to achieve better performance becomes clearer when we
consider technological advancements. Relying solely on faster single processors may
not suffice, making parallel architectures more attractive. Furthermore, the challenges
in parallel computer architecture resemble those in traditional computers, such as
resource allocation and data handling.
• The big technological change is that computer parts are getting smaller and faster. This
means we can fit more of them in the same space. Also, the area where we can put these
parts is getting larger. So, the speed of computers goes up as the parts get smaller, and
we can add more parts because of the bigger space. In the end, using lots of parts at the
same time (parallelism) will likely make computers faster than just making each part
run faster.
• This idea is confirmed when we look at commercial microprocessors. As Figure 1-5
shows, the clock speed of important microprocessors goes up by about 30% every year,
while the number of transistors (tiny on-off switches) goes up by about 40% each year.
If we take a chip's overall capability to be roughly its clock rate times its transistor
count, then over the past two decades the growth in transistor count has contributed
about ten times more to that capability than the growth in clock speed. This is why
microprocessors improve on standard benchmarks at a much higher rate than clock
speed alone would suggest.
• As technology advances, more parts of a computer can fit onto a single chip, including
memory and support for connecting devices (I/O). Modern high-end microprocessors
for computers and GPUs have surpassed billions of transistors, with some flagship
CPUs and GPUs featuring more than 20 billion transistors.
• The processors need data from memory faster. This is achieved through parallelism by
sending more data at once. Designs across computers, from PCs to servers, are adapting
to this requirement by using wider memory paths and better organization. Advanced
DRAM designs transfer numerous bits in parallel within the chip, then move them
quickly through a narrower path. These designs also retain recent data in fast on-chip
buffers, similar to processor caches, to speed up future data access. Utilizing parallelism
and data locality is essential for advancing memory technology.
❖ Architectural Trends
• Technology progress shapes what's possible, while computer architecture turns that
potential into actual performance and capability. Having more transistors (tiny
switches) can boost performance in two ways: parallelism and locality. Parallelism
means doing multiple things at once, reducing the time it takes to complete tasks. But
it needs resources to support all those simultaneous activities. Locality involves keeping
data close to the processor, which speeds things up, but this also requires resources.
The best performance usually comes from a balance between using parallelism and
maintaining locality.
• The early days of microprocessors benefited from an easy form of parallelism: bit-level
parallelism in every operation. The sharp change in microprocessor growth in the
following figure shows that the widespread use of 32-bit operations, along with the use
of caches, made a big difference.
• During the mid-80s to mid-90s, the focus was on making instructions in computers
work faster. They figured out how to do the basic steps of processing instructions (like
understanding what an instruction means, doing math, and finding data) in a single step.
Thanks to caches (a type of high-speed memory), they could also quickly get the
instructions and data they needed most of the time. The RISC approach showed that,
with careful planning, they could organize the steps of instruction processing so that
they could do an instruction almost every cycle, on average.
• In the mid-80s, microprocessor-based computers used separate chips for different tasks.
As technology improved, these tasks were combined into a single chip for better
communication. This single chip handled math, memory, decisions, and floating-point
operations. They also started working on multiple instructions simultaneously, known
as "superscalar execution," which used the chip's resources more effectively. This
approach involved fetching and processing more instructions at once, making
computers faster and more efficient.
• To boost a processor's speed through instruction-level parallelism, it needs a steady
flow of instructions and data. To meet this demand, larger on-chip caches were added,
using more transistors. But having both the processor and cache on the same chip
allowed for quicker data access. Still, as more instructions were processed, delays
caused by control transfers and cache misses became more significant.
• A pipeline system is like a modern-day assembly line in a factory. For example, in a
car manufacturing plant, huge assembly lines are set up, and at each point there is a
robotic arm that performs a certain task; the car then moves on to the next arm.
• Types of pipelining:
It is divided into 2 categories:
1. Arithmetic Pipeline
2. Instruction Pipeline
1. Arithmetic Pipeline
Arithmetic pipelines are found in most computers. They are used for floating-point
operations, multiplication of fixed-point numbers, etc. For example, the input to a
floating-point adder pipeline is:
X = A * 2^a
Y = B * 2^b
Here A and B are mantissas (the significant digits of the floating-point numbers), while
a and b are the exponents.
Floating-point addition and subtraction is done in 4 parts:
a. Compare the exponents.
b. Align the mantissas.
c. Add or subtract the mantissas.
d. Normalize and produce the result.
Registers are used for storing the intermediate results between the above operations.
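Below is a minimal Python sketch of those four stages. It assumes toy (mantissa, exponent)
pairs rather than real IEEE-754 bit fields, and the stage functions stand in for pipeline
segments whose outputs would be latched in the inter-stage registers just mentioned.

    def stage1_compare(A, a, B, b):
        # Stage 1: compare the exponents; the difference tells us how far
        # to shift the smaller operand's mantissa.
        return (A, a, B, b, a - b)

    def stage2_align(A, a, B, b, diff):
        # Stage 2: align the mantissas so both share the larger exponent.
        if diff >= 0:
            return A, B / (2 ** diff), a
        return A / (2 ** -diff), B, b

    def stage3_add(A_aligned, B_aligned, exp):
        # Stage 3: add (or subtract) the aligned mantissas.
        return A_aligned + B_aligned, exp

    def stage4_normalize(mantissa, exp):
        # Stage 4: renormalize so the mantissa lies in [1, 2).
        while abs(mantissa) >= 2:
            mantissa /= 2
            exp += 1
        while 0 < abs(mantissa) < 1:
            mantissa *= 2
            exp -= 1
        return mantissa, exp

    # X = 1.5 * 2**3 = 12, Y = 1.25 * 2**1 = 2.5; X + Y = 14.5 = 1.8125 * 2**3
    m, e = stage4_normalize(*stage3_add(*stage2_align(*stage1_compare(1.5, 3, 1.25, 1))))
    print(m, e, m * 2 ** e)   # -> 1.8125 3 14.5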
2. Instruction Pipeline
An instruction pipeline reads instructions from memory while previous instructions
are being executed in other segments of the pipeline. Thus, we can execute multiple
instructions simultaneously. The pipeline is most efficient when the instruction cycle
is divided into segments of equal duration.
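The effect of this overlap can be seen in a small timing model. The Python sketch below is a
toy illustration, assuming an ideal pipeline with equal-duration stages and no stalls; the
stage names are illustrative. With k stages and n instructions, it shows the familiar result
that the pipeline finishes in k + n - 1 cycles instead of n * k.

    STAGES = ["Fetch", "Decode", "Execute", "Write"]

    def pipeline_chart(n_instructions):
        # Instruction i occupies stage s during cycle i + s (0-indexed).
        k = len(STAGES)
        total = k + n_instructions - 1
        for i in range(n_instructions):
            row = ["  .  "] * total
            for s, name in enumerate(STAGES):
                row[i + s] = f"{name[:4]:^5}"
            print(f"I{i + 1}: " + "|".join(row))
        print(f"{n_instructions} instructions, {k} stages -> {total} cycles "
              f"(vs {n_instructions * k} without pipelining)")

    pipeline_chart(5)   # 5 instructions, 4 stages -> 8 cycles instead of 20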
Advantages of Pipelining
1. The cycle time of the processor is reduced.
2. It increases the throughput of the system.
3. It makes the system more reliable.
Disadvantages of Pipelining
1. The design of a pipelined processor is complex and costly to manufacture.
2. The latency of an individual instruction increases.
❖ Classification of Pipeline Processors
Pipeline processors can be classified on the basis of:
I. Levels of processing
II. Pipeline configuration
III. Types of instruction and data
I. Based on Levels of Processing (Handler's Classification)
According to the level of processing, Handler proposed three classification schemes:
a. Arithmetic pipeline
b. Processor Pipeline
c. Instruction Pipeline
a. Arithmetic Pipeline
An arithmetic pipeline generally breaks an arithmetic operation into multiple
arithmetic steps that can be executed one by one in segments of the Arithmetic Logic
Unit (ALU). In an arithmetic pipeline, the ALU of a computer is segmented for
pipeline operations in various data formats.
For example,
- 4-stage pipeline in Star-100
- 8-stage pipeline in TI-ASC
- 14-stage pipeline in Cray-1
b. Processor Pipeline
In a processor pipeline, the processing of the same data stream is pipelined by a
cascade of processors, with each processor executing a specific task on the stream.
c. Instruction Pipeline
In an instruction pipeline, the execution of a stream of instructions can be pipelined by
overlapping the execution of the current instruction with the fetch, decode and
operand fetch of subsequent instructions. This technique is also known as instruction
lookahead. Example: Almost all high-performance computers today are equipped
with instruction pipeline processors.
II. Based on Pipeline Configuration
Unifunction Pipeline:
A pipeline with a fixed and dedicated function is called a unifunction pipeline, for
example, a floating-point adder.
The Cray-1 has 12 unifunctional pipeline units for various scalar, vector, fixed point
and floating-point operations.
Multifunction Pipeline:
A pipeline that performs different functions, either at different times or at the same
time, by interconnecting different subsets of stages in the pipeline is called a
multifunction pipeline. For example, the TI-ASC has multifunction pipeline
processors.
Static Pipeline:
• A static pipeline can assume only one functional configuration at a time. Static
pipelines are preferred when instructions of the same type are to be executed
continuously.
Dynamic Pipeline:
• A dynamic pipeline permits several functional configurations to exist
simultaneously, allowing feedforward and feedback connections between stages.
Dynamic pipelines are preferred when instructions of different types are to be
executed.
III. Based on Types of Instruction and Data
a. Scalar Pipelines:
This type of pipeline processes scalar operands under the control of a DO loop.
Instructions in a small DO loop are often prefetched into the instruction buffer. The
required scalar operands for repeated scalar instructions are moved into a data cache in
order to continuously supply the pipeline with operands.
Example: IBM-360
b. Vector Pipelines:
This type of pipeline processes vector instructions over vector operands.
Computers with vector instructions are often called vector processors. The
design of a vector pipeline is expanded from that of a scalar pipeline.
Example: STAR-100, Cray-1
A typical instruction pipeline, as shown in the figure, consists of the following stages:
• The fetch stage (F) fetches instructions from a cache memory, ideally one per cycle.
• The decode stage (D) reveals the instruction function to be performed and identifies the
resources needed. Resources include general-purpose registers, buses, and functional
units.
• The issue stage (I) reserves resources. The operands are also read from registers during
the issue stage.
• The instructions are executed in one or several execute stages (E). Three execute stages
are shown in Fig.
• The last writeback stage (W) is used to write results into the registers. Memory load or
store operations are treated as part of execution.
• Figure (b) illustrates the issue of instructions following the original program order. The
shaded boxes correspond to idle cycles when instruction issues are blocked due to
resource latency or conflicts or due to data dependencies.
• The total time required is 17 clock cycles, measured from cycle 4, when the first
instruction starts execution, to cycle 20, when the last instruction starts execution.
• Figure (c) shows an improved timing after the instruction issuing order is changed to
eliminate unnecessary delays due to dependence.
• The idea is to issue all four load operations in the beginning.
• Both the add and multiply instructions are blocked fewer cycles due to this data
prefetching. The reordering should not change the end results.
• The time required is reduced to 11 cycles, measured from cycle 4 to cycle 14.
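The benefit of reordering can be reproduced with a toy in-order, single-issue model in
Python. The instruction sequence, latencies, and dependence rule below are hypothetical
stand-ins, not the exact program from the figure; the model simply refuses to issue an
instruction until every instruction it depends on has written back its result.

    def schedule(program):
        # program: list of (name, result_latency_in_cycles, dependency_names)
        done, prev_issue = {}, 0
        for name, latency, deps in program:
            ready = max((done[d] for d in deps), default=0)
            issue = max(prev_issue + 1, ready)   # in-order, one issue per cycle
            done[name] = issue + latency
            prev_issue = issue
        return max(done.values())

    LOAD, ALU = 2, 1   # hypothetical result latencies in cycles
    original = [("L1", LOAD, []), ("L2", LOAD, []), ("ADD", ALU, ["L1", "L2"]),
                ("L3", LOAD, []), ("L4", LOAD, []), ("MUL", ALU, ["L3", "L4"])]
    reordered = [("L1", LOAD, []), ("L2", LOAD, []), ("L3", LOAD, []),
                 ("L4", LOAD, []), ("ADD", ALU, ["L1", "L2"]),
                 ("MUL", ALU, ["L3", "L4"])]
    # Issuing all four loads first hides their latency behind later issues.
    print(schedule(original), schedule(reordered))   # -> 9 7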
❖ Pipeline Performance Measures
1. Clock Period
The CPU of a digital computer is driven by a clock with a constant cycle time T (in
nanoseconds).
Clock Rate: The inverse of the cycle time is the clock rate:
f = 1/T in megahertz.
2. Speedup
For a k-stage pipeline processing n tasks with clock period tp, the speedup over an
equivalent non-pipelined processor is:
Sk = (n * k * tp) / ((k + n - 1) * tp)
Example
- 4-stage pipeline (k = 4)
- one sub-operation in each stage; tp = 20 ns
- 100 tasks to be executed (n = 100)
- time for 1 task in a non-pipelined system: 4 * 20 = 80 ns
- Pipelined system:
(k + n - 1) * tp = (4 + 99) * 20 = 2060 ns
- Non-pipelined system:
n * k * tp = 100 * 80 = 8000 ns
- Speedup:
Sk = 8000/2060 = 3.88
- A 4-stage pipeline is thus roughly equivalent to a system with 4 identical function
units operating in parallel.
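The same computation in a short Python helper, using the numbers from the example above:

    def pipeline_speedup(k, n, tp):
        pipelined = (k + n - 1) * tp   # first task fills the pipe, then one per cycle
        non_pipelined = n * k * tp     # every task pays the full k stages
        return non_pipelined / pipelined

    print(pipeline_speedup(4, 100, 20))   # -> 8000 / 2060 = 3.883...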
3. Efficiency:
The efficiency of a pipeline can be measured as the ratio of the busy time span to the
total time span, including idle time. Let c be the clock period of the pipeline; then for
an m-stage pipeline processing n tasks, the efficiency E is:
E = (n * m * c) / (m * [m * c + (n - 1) * c]) = n / (m + n - 1)
As n → ∞, E approaches 1.
4. Throughput:
The throughput of a pipeline is the number of results produced per unit time. It can be
denoted as:
T = n / ([m + (n - 1)] * c) = E / c
Throughput denotes the computing power of the pipeline. The maximum speedup,
efficiency, and throughput are ideal values that real pipelines only approach.
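Continuing the earlier worked example (m = 4 stages, n = 100 tasks, c = 20 ns), a minimal
Python check of these two formulas:

    # Efficiency and throughput for an m-stage pipeline processing n tasks
    # with clock period c, per the formulas above.
    m, n, c = 4, 100, 20e-9          # 20 ns clock period

    E = n / (m + n - 1)              # -> 100/103 ~ 0.9709
    T = E / c                        # -> ~4.85e7 results per second

    print(f"E = {E:.4f}, T = {T:.3e} results/s")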