
CHRIST (Deemed to be University)

CS435P – COMPUTER ORGANIZATION AND ARCHITECTURE

Unit IV - PARALLELISM

Dr. Shamanth N.
Assistant Professor, Department of CSE, CHRIST (Deemed to be University), Bangalore
[email protected]

Excellence and Service



Unit IV - PARALLELISM

● Parallel processing challenges – Flynn’s classification – SISD, MIMD, SIMD, SPMD, and Vector Architectures – Hardware multithreading – Multi-core processors and other Shared Memory Multiprocessors – Introduction to Graphics Processing Units, Clusters, Warehouse Scale Computers and other Message-Passing Multiprocessors.


What is Parallelism?
● Doing things simultaneously
○ The same thing or different things
○ Solving one larger problem
● Serial Computing
○ The problem is broken into a stream of instructions that are executed sequentially, one after another, on a single processor.
○ One instruction executes at a time.
● Parallel Computing
○ The problem is divided into parts that can be solved concurrently.
○ Each part is further broken into a stream of instructions.
○ Instructions from different parts execute simultaneously.


Serial computation

● Traditionally, serial computation used only a single computer having a single Central Processing Unit (CPU).
● In serial computation, a large problem is broken into smaller parts, but these sub-parts are executed one by one.
● Only a single instruction may execute at a time, so it takes a lot of time to solve a large problem.


Serial computation (contd.)

[Figure: the problem is decomposed into a single stream of instructions N, N-1, …, 2, 1 that one CPU executes one at a time.]


Parallel Computing

[Figure: the problem is divided into sub-problems; each sub-problem is broken into its own instruction stream and executed on a separate CPU.]


Different forms of parallel computing

● Bit level
● Instruction level
● Data parallelism
● Task parallelism
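The last two forms can be contrasted with a short sketch using Python's standard-library process pool. This is illustrative only; the functions and inputs below are hypothetical, not from the slides.

```python
# Illustrative sketch (not from the slides): data vs. task parallelism
# with Python's standard-library process pool.
from concurrent.futures import ProcessPoolExecutor

def square(x):
    # Data parallelism: the SAME operation applied to many data items.
    return x * x

def count_words(text):
    # Task parallelism: one of several DIFFERENT operations run at once.
    return len(text.split())

def total_chars(text):
    return len(text)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        squares = list(pool.map(square, [1, 2, 3, 4]))   # data parallel
        f1 = pool.submit(count_words, "parallel computing is fun")
        f2 = pool.submit(total_chars, "parallel computing is fun")  # task parallel
        print(squares, f1.result(), f2.result())  # [1, 4, 9, 16] 4 25
```

Data parallelism maps one operation across many items; task parallelism runs different operations concurrently.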


Advantages of Parallel Computing

● Large problems can be solved more easily.
● Saves money and time.
● Data are transmitted quickly.
● Provides concurrency.
● Supports well-defined communication between tasks.
● Gives good performance.
● Makes the best use of the available hardware and software primitives.


Use of Parallel Computing

● Atmosphere, Earth, Environment


● Physics - applied, nuclear, particle, condensed matter, high pressure,
fusion, photonics
● Bioscience, Biotechnology, Genetics
● Chemistry, Molecular Sciences
● Geology, Seismology
● Mechanical Engineering - from prosthetics to spacecraft
● Electrical Engineering, Circuit Design, Microelectronics
● Computer Science, Mathematics


Use of Parallel Computing

● Scientific Computing.
○ Numerically Intensive Simulations
● Database Operations and Information Systems
○ Web based services, Web search engines, Online transaction
processing.
○ Client and inventory database management, Data mining, MIS
○ Geographic information systems, Seismic data Processing
● Artificial intelligence, Machine Learning, Deep Learning
● Real time systems and Control Applications
○ Hardware and Robotic Control, Speech processing, Pattern
Recognition.


Parallel Computer Architectural Model

Parallel architectural models are classified into two categories:
○ Shared memory
○ Distributed memory


Flynn’s & Feng’s Classification Taxonomy

● SISD – Single Instruction, Single Data
● SIMD – Single Instruction, Multiple Data
● MISD – Multiple Instruction, Single Data
● MIMD – Multiple Instruction, Multiple Data
Examples:

● SISD: one instruction stream on one data set.
  ADD R0,R1 with R0=5, R1=6
● SIMD: one instruction stream on multiple data sets.
  ADD R0,R1 with (R0=4, R1=5) and (R0=6, R1=8)
● MISD: multiple instruction streams on one data set.
  ADD R0,R1 and SUB R0,R1 with R0=5, R1=4
● MIMD: multiple instruction streams on multiple data sets.
  ADD R0,R1 and SUB R0,R1 with (R0=5, R1=4) and (R0=6, R1=8)


Vector Processors

● Highly pipelined function units.
● Stream data from/to vector registers (with multiple elements in a vector register) to the units:
○ Data collected from memory into registers
○ Results stored from registers to memory
● Example: the vector extension to MIPS
○ 32 × 64-element registers (64-bit elements)
○ Vector instructions
■ lv, sv: load/store to/from vector registers
■ addv.d: add vectors of double
■ addvs.d: add scalar to each element of vector of double
● Significantly reduces instruction-fetch bandwidth.
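To see what a vector instruction buys, here is a conceptual Python sketch (NOT MIPS code) of a DAXPY-style loop a*x + y: the scalar version repeats fetch/decode/execute per element, while the "vector" version expresses the same loop as a couple of whole-vector operations. A mulvs.d (scalar-vector multiply) is assumed here by analogy with addvs.d.

```python
# Conceptual sketch in Python (NOT MIPS code) of the DAXPY loop a*x + y.
def daxpy_scalar(a, x, y):
    out = []
    for i in range(len(x)):          # one iteration per element:
        out.append(a * x[i] + y[i])  # repeated fetch/decode/execute
    return out

def daxpy_vector(a, x, y):
    ax = [a * xi for xi in x]                    # ~ mulvs.d: scalar * vector
    return [axi + yi for axi, yi in zip(ax, y)]  # ~ addv.d: vector + vector

print(daxpy_scalar(2.0, [1.0, 2.0], [10.0, 20.0]))  # [12.0, 24.0]
print(daxpy_vector(2.0, [1.0, 2.0], [10.0, 20.0]))  # [12.0, 24.0]
```

A single vector instruction thus stands in for an entire loop iteration space, which is why instruction-fetch bandwidth drops so sharply.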


Vector Processors

● In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors, in contrast to scalar processors, whose instructions operate on single data items.
● Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks.
● Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms.
● The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to the vector supercomputer's demise in the late 1990s.


Vector Processors

● An older and, as we shall see, more elegant interpretation of SIMD is called a vector architecture, which has been closely identified with computers designed by Seymour Cray starting in the 1970s.
● It is also a great match for problems with lots of data-level parallelism.
● Rather than having 64 ALUs perform 64 additions simultaneously, like the old array processors, vector architectures pipelined the ALU to get good performance at lower cost.


Vector versus Scalar

Vector instructions have several important properties compared to conventional instruction set architectures, which are called scalar architectures in this context:
● A single vector instruction is equivalent to executing an entire loop. The instruction fetch and decode bandwidth needed is dramatically reduced.
● Hardware does not have to check for data hazards within a vector instruction.
● Vector architectures and their compilers have a reputation for making it much easier than MIMD multiprocessors to write efficient applications that contain data-level parallelism.


Vector versus Scalar

● Hardware need only check for data hazards between two vector instructions once per vector operand.
● The cost of the latency to main memory is seen only once for the entire
vector, rather than once for each word of the vector.
● Control hazards that would normally arise from the loop branch are
non-existent.
● The savings in instruction bandwidth and hazard checking plus the
efficient use of memory bandwidth give vector architectures
advantages in power and energy versus scalar architectures.


Vector Processor Classification

● Memory to memory architecture

● Register to register architecture


Vector Processor Classification

Memory-to-memory architecture

• In memory-to-memory architecture, source operands, intermediate results, and final results are read from and written to the main memory directly.
• For memory-to-memory vector instructions, the base address, the offset, the increment, and the vector length must be specified in order to enable streams of data transfers between the main memory and the pipelines.
• Processors like the TI-ASC, CDC STAR-100, and Cyber-205 have vector instructions in memory-to-memory format.
• The main points about memory-to-memory architecture are:
• There is no limitation on vector size.
• Speed is comparatively slow in this architecture.


Vector Processor Classification

Register-to-register architecture

• In register-to-register architecture, operands and results are retrieved indirectly from the main memory through the use of a large number of vector registers or scalar registers.
• Processors like the Cray-1 and the Fujitsu VP-200 use vector instructions in register-to-register format.
• The main points about register-to-register architecture are:
• Register-to-register architecture has a limited vector size.
• Speed is very high compared to the memory-to-memory architecture.
• The hardware cost is high in this architecture.



Multicores, Multiprocessors, and Clusters
• MULTIPROCESSOR: A computer system with at least two processors, in contrast to a uniprocessor, which has one.
• If a single processor fails in a multiprocessor with n processors, the system can continue to provide service with n – 1 processors. Hence, multiprocessors can also improve availability.
• JOB-LEVEL PARALLELISM (or process-level parallelism): Utilizing multiple processors by running independent programs simultaneously.
• PARALLEL PROCESSING PROGRAM: A single program that runs on multiple processors simultaneously.
• CLUSTER: A set of computers connected over a local area network (LAN) that functions as a single large multiprocessor.
• MULTICORE MICROPROCESSOR: A microprocessor containing multiple processors ("cores") in a single integrated circuit.

Shared Memory Multiprocessors
• SHARED MEMORY MULTIPROCESSOR (SMP) - In a shared-memory
multiprocessor, all processors have access to the same memory.
• Tasks running in different processors can access shared variables in the
memory using the same addresses via load and store instructions.
• An interconnection network enables any processor to access any module that is a
part of the shared memory.
UMA & NUMA Processor
• Single address space multiprocessors come in two styles.
• UNIFORM MEMORY ACCESS (UMA) / Symmetric Multiprocessor (SMP): A multiprocessor in which accesses to main memory take about the same amount of time no matter which processor requests the access and no matter which word is asked for.
• NON-UNIFORM MEMORY ACCESS (NUMA): A type of single address space multiprocessor in which some memory accesses are much faster than others depending on which processor asks for which word, typically because main memory is divided and attached to different microprocessors.
• SYNCHRONIZATION The process of
coordinating the behavior of two or more
processes, which may be running on different
processors.
• LOCK A synchronization device that allows
access to data to only one processor at a time.
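A minimal sketch of these two definitions using Python threads. This is illustrative only (CPython's GIL hides many real races), but the structure of lock-protected access to a shared variable is the same as on a shared-memory multiprocessor.

```python
# Minimal sketch of SYNCHRONIZATION via a LOCK, using Python threads.
import threading

counter = 0                      # shared variable in shared memory
lock = threading.Lock()

def add_many(n):
    global counter
    for _ in range(n):
        with lock:               # only one thread may update at a time
            counter += 1

threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000: the lock serializes every read-modify-write
```

Without the lock, two threads could read the same old value of `counter` and lose an update; the lock grants access to the shared data to only one processor (thread) at a time.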
Clusters and Other Message-Passing Multiprocessors
• MESSAGE PASSING: Communicating between multiple processors by explicitly sending and receiving information.
• SEND MESSAGE ROUTINE: A routine used by a processor in machines with private memories to pass a message to another processor.
• RECEIVE MESSAGE ROUTINE: A routine used by a processor in machines with private memories to accept a message from another processor.
• CLUSTERS: Collections of computers connected via I/O over standard network switches to form a message-passing multiprocessor.
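A sketch of explicit send/receive routines using a process Pipe as a stand-in for the cluster network (names and data are illustrative; each process has its own private memory, so the only way to share data is to send a message).

```python
# Sketch of SEND/RECEIVE message routines over a process Pipe, standing
# in for the network of a message-passing multiprocessor.
from multiprocessing import Pipe, Process

def worker(conn):
    data = conn.recv()           # receive-message routine
    conn.send(sum(data))         # send-message routine: return the result
    conn.close()

if __name__ == "__main__":
    parent, child = Pipe()
    p = Process(target=worker, args=(child,))
    p.start()
    parent.send([1, 2, 3, 4])    # send work to the other "node"
    print(parent.recv())         # 10
    p.join()
```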
SISD, MIMD, SIMD, SPMD, and Vector
• SISD, or Single Instruction stream, Single Data stream: a uniprocessor.
• MIMD, or Multiple Instruction streams, Multiple Data streams: a multiprocessor.
• SPMD, or Single Program, Multiple Data streams: the conventional MIMD programming model, where a single program runs across all processors.
• SIMD, or Single Instruction stream, Multiple Data streams: a multiprocessor in which the same instruction is applied to many data streams, as in a vector processor or array processor.
• Data-level parallelism: parallelism achieved by operating on independent data.
Hardware Multithreading
• HARDWARE MULTITHREADING: In computer architecture, multithreading is the ability of a central processing unit (CPU) (or a single core in a multi-core processor) to provide multiple threads of execution concurrently, supported by the operating system.
• It increases utilization of a processor by switching to another thread when one thread is stalled.
• FINE-GRAINED MULTITHREADING: A version of hardware multithreading that switches between threads after every instruction (round robin after every clock cycle).
• Cycle i: an instruction from thread A is issued.
• Cycle i + 1: an instruction from thread B is issued.
• Cycle i + 2: an instruction from thread C is issued.

Advantages: Instructions from other threads can be executed when one thread stalls. This approach is known as interleaving, and it improves throughput.
Disadvantages: It slows down the execution of the individual threads, since a thread that is ready to execute without stalls will be delayed by instructions from other threads.
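The round-robin issue policy above can be modeled with a toy scheduler. This is a software illustration of the issue order only, not how the hardware is built.

```python
# Toy software model (not hardware) of FINE-GRAINED multithreading:
# one instruction is issued per cycle, round-robin across ready threads.
from collections import deque

def fine_grained(threads):
    """threads: dict mapping thread name -> list of instructions."""
    ready = deque(threads.items())
    issued = []
    while ready:
        name, instrs = ready.popleft()
        issued.append((name, instrs[0]))        # issue one instruction
        if instrs[1:]:
            ready.append((name, instrs[1:]))    # rejoin the round-robin queue
    return issued

order = fine_grained({"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1"]})
print(order)
# [('A', 'a1'), ('B', 'b1'), ('C', 'c1'), ('A', 'a2'), ('B', 'b2')]
```

Note how thread A's two instructions are separated by instructions from B and C: good for hiding stalls, but it delays any single thread.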
Hardware Multithreading (contd..)
• COARSE-GRAINED MULTITHREADING A version of hardware multithreading
that suggests switching between threads only after significant events, such as a cache
miss.
• Cycle i: instruction j from thread A is issued.
• Cycle i + 1: instruction j + 1 from thread A is issued.
• Cycle i + 2: instruction j + 2 from thread A is issued, which is a load instruction that
misses in all caches.
• Cycle i + 3: thread scheduler invoked, switches to thread B.
• Cycle i + 4: instruction k from thread B is issued.
• Cycle i + 5: instruction k + 1 from thread B is issued.

Advantages: relieves the need to have thread switching be extremely fast and is much less likely to
slow down the execution of an individual thread, since instructions from other threads will only
be issued when a thread encounters a costly stall.

Disadvantages: The new thread that begins executing after the stall must fill the pipeline before
instructions will be able to complete. This is called start-up overhead.
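The switch-on-stall policy can be modeled the same way, with "MISS" marking the significant event (e.g. a cache miss) that triggers a thread switch. Again a toy software model, not hardware.

```python
# Toy software model (not hardware) of COARSE-GRAINED multithreading:
# keep issuing from one thread until a costly stall ("MISS"), then switch.
def coarse_grained(threads):
    """threads: list of (name, [instructions]); 'MISS' marks a cache miss."""
    pending = [(name, list(instrs)) for name, instrs in threads]
    issued = []
    i = 0
    while any(instrs for _, instrs in pending):
        name, instrs = pending[i]
        if not instrs:                   # this thread is finished
            i = (i + 1) % len(pending)
            continue
        op = instrs.pop(0)
        issued.append((name, op))
        if op == "MISS":                 # significant event: switch thread
            i = (i + 1) % len(pending)
    return issued

trace = coarse_grained([("A", ["j", "j+1", "MISS", "j+2"]), ("B", ["k", "k+1"])])
print(trace)
# [('A', 'j'), ('A', 'j+1'), ('A', 'MISS'), ('B', 'k'), ('B', 'k+1'), ('A', 'j+2')]
```

Unlike the fine-grained trace, thread A keeps the pipeline to itself until it misses, which matches the cycle-by-cycle example above.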
Hardware Multithreading (contd..)
• SIMULTANEOUS MULTITHREADING (SMT): A version of multithreading that lowers the cost of multithreading by utilizing the resources of a multiple-issue, dynamically scheduled microarchitecture.
• Cycle i: instructions j and j + 1 from thread A and instruction k from thread B are simultaneously issued.
• Cycle i + 1: instruction j + 2 from thread A, instruction k + 1 from thread B, and instruction m from thread C are all simultaneously issued.
Graphics Processing Unit

● A major driving force for improving graphics processing was the computer game industry, both on PCs and in dedicated game consoles such as the Sony PlayStation.
● A GPU consists of hundreds of parallel floating-point units, which makes high-performance computing more accessible.
● The interest in GPU computing blossomed when this potential was combined with a programming language that made GPUs easier to program.

CPU vs GPU
CPU
● At the heart of any and every computer in existence is a central processing
unit or CPU. The CPU handles the core processing tasks in a
computer—the literal computation that drives every single action in a
computer system.
Standard components in CPU
● Core(s): The central architecture of the CPU is the “core,” where all
computation and logic happens. A core typically functions through what is
called the “instruction cycle,” where instructions are pulled from memory
(fetch), decoded into processing language (decode), and executed through
the logical gates of the core (execute). Initially, all CPUs were single-core,
but with the proliferation of multi-core CPUs, we’ve seen an increase in
processing power.


CPU vs GPU
Standard components in CPU …
● Cache: Cache is super-fast memory built either within the CPU or in
CPU-specific motherboards to facilitate quick access to data the CPU is
currently using. Since CPUs work so fast to complete millions of
calculations per second, they require ultra-fast (and expensive) memory to
do it—memory that is much faster than hard drive storage or even the
fastest RAM.
● Memory Management Unit (MMU): The MMU controls data movement
between the CPU and RAM during the instruction cycle.
● CPU Clock and Control Unit: Every CPU works on synchronizing
processing tasks through a clock. The CPU clock determines the
frequency at which the CPU can generate electrical pulses, its primary
way of processing and transmitting data, and how rapidly the CPU can
work. So, the higher the CPU clock rate, the faster it will run and quicker
processor-intensive tasks can be completed.



CPU vs GPU
GPU
● Graphical processing is generally considered one of the more complex processing tasks for the CPU. Solving that complexity has led to technology with applications far beyond graphics.
● The challenge in processing graphics is that graphics call on complex mathematics to render, and those complex mathematics must compute in parallel to work correctly. For example, a graphically intense video game might contain hundreds or thousands of polygons on the screen at any given time, each with its individual movement, color, lighting, and so on. CPUs aren’t made to handle that kind of workload. That’s where graphics processing units (GPUs) come into play.


CPU vs GPU
GPU
● GPUs are similar in function to CPUs: they contain cores, memory, and other components. Instead of emphasizing context switching to manage multiple tasks, GPU acceleration emphasizes parallel data processing through a large number of cores.


CPU vs GPU
Advantages of a CPU
● Flexibility: CPUs are flexible and resilient and can handle a variety of
tasks outside of graphics processing. Because of their serial processing
capabilities, the CPU can multitask across multiple activities in your
computer. Because of this, a strong CPU can provide more speed for
typical computer use than a GPU.
● Contextual Power: In specific situations, the CPU will outperform the
GPU. For example, the CPU is significantly faster when handling several
different types of system operations (random access memory, mid-range
computational operations, managing an operating system, I/O operations).


CPU vs GPU
Advantages of a CPU
● Precision: CPUs can work on mid-range mathematical equations with a
higher level of precision. CPUs can handle the computational depth and
complexity more readily, becoming increasingly crucial for specific
applications.
● Access to Memory: CPUs usually contain significant local cache memory,
which means they can handle a larger set of linear instructions and, hence,
more complex system and computational operations.
● Cost and Availability: CPUs are more readily available, more widely
manufactured, and cost-effective for consumer and enterprise use.
Additionally, hardware manufacturers still create thousands of
motherboard designs to house a wide range of CPUs.


CPU vs GPU
Disadvantages of a CPU
● Parallel Processing: CPUs cannot handle parallel processing like a GPU,
so large tasks that require thousands or millions of identical operations
will choke a CPU’s capacity to process data.
● Slow Evolution: In line with Moore’s Law, developing more powerful
CPUs will eventually slow, which means less improvement year after
year. The expansion of multi-core CPUs has mitigated this somewhat.
● Compatibility: Not every system or software is compatible with every
processor. For example, applications written for x86 Intel Processors will
not run on ARM processors. This is less of a problem as more computer
manufacturers use standard processor sets (see Apple’s move to Intel
processors), but it still presents issues between PCs and mobile devices.


CPU vs GPU
Advantages of a GPU
● High Data Throughput: a GPU consists of hundreds of cores performing the same operation on multiple data items in parallel. Because of that, a GPU can push vast volumes of processed data through a workload, speeding up specific tasks beyond what a CPU can handle.
● Massive Parallel Computing: Whereas CPUs excel in more complex
computations, GPUs excel in extensive calculations with numerous
similar operations, such as computing matrices or modeling complex
systems.


CPU vs GPU
Disadvantages of a GPU

● Multitasking: GPUs aren’t built for multitasking, so they don’t have much
impact in areas like general-purpose computing.
● Cost: While the price of GPUs has fallen somewhat over the years, they
are still significantly more expensive than CPUs. This cost rises more
when talking about a GPU built for specific tasks like mining or analytics.
● Power and Complexity: While a GPU can handle large amounts of parallel
computing and data throughput, they struggle when the processing
requirements become more chaotic. Branching logic paths, sequential
operations, and other approaches to computing impede the effectiveness
of a GPU.



Why Compute Unified Device Architecture (CUDA) and GPU Computing?
● NVIDIA's CUDA (Compute Unified Device Architecture) improves the programmability of both the hardware and the programming language, enabling the programmer to write C programs to execute on GPUs.
● GPU computing: using a GPU for computing via a parallel programming language and API.
● GPGPU: using a GPU for general-purpose computation via a traditional graphics API and graphics pipeline.
● A CUDA program is a unified C/C++ program for a heterogeneous CPU and GPU system. It executes on the CPU and dispatches parallel work to the GPU.
● Work consists of a data transfer from main memory and a thread dispatch.
● The CUDA compiler allocates registers to each thread, under the constraint that the registers per thread times the threads per thread block does not exceed the 8192 registers per multiprocessor.
● CUDA is a scalable parallel programming model and language based on C/C++. It is a parallel programming platform for GPUs and multicore CPUs.
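The CUDA execution model, in which many threads each compute one element selected by block and thread indices, can be sketched in plain Python. This is illustrative only; a real CUDA kernel would be a `__global__` C/C++ function launched over a grid of thread blocks, and the function names here are hypothetical.

```python
# Illustrative Python sketch of the CUDA execution model: a grid of
# thread blocks, where each thread computes one element chosen by its
# block and thread indices. Mimics only the indexing scheme.
def vec_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(a):                           # bounds guard, as in CUDA
        out[i] = a[i] + b[i]

def launch(grid_dim, block_dim, kernel, *args):
    # The GPU would run all these "threads" in parallel; we just loop.
    for blk in range(grid_dim):
        for thr in range(block_dim):
            kernel(blk, thr, block_dim, *args)

a, b = [1, 2, 3, 4, 5], [10, 20, 30, 40, 50]
out = [0] * len(a)
launch(2, 3, vec_add_kernel, a, b, out)      # 2 blocks of 3 threads
print(out)  # [11, 22, 33, 44, 55]
```

The bounds guard matters because the grid (2 × 3 = 6 threads) can be larger than the data (5 elements), exactly as in real CUDA launches.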
NVIDIA GPU Architecture
NVIDIA GPU Memory Structure

Questions

● Which type of machine is suitable for executing a for loop?
● SIMD works best when dealing with arrays in for loops.
● SIMD is at its weakest in case or switch statements, where each execution unit must perform a different operation on its data, depending on what data it has.
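The weakness with case/switch statements comes from predication: every SIMD lane typically executes both sides of a branch, and a per-lane mask selects which result each lane keeps. A toy Python sketch of the idea (illustrative, not real SIMD intrinsics):

```python
# Sketch of why branches hurt SIMD: every lane computes BOTH sides of
# the branch, and a per-lane mask selects which result to keep.
def simd_abs(xs):
    mask = [x < 0 for x in xs]     # per-lane condition
    neg = [-x for x in xs]         # all lanes execute the "then" side...
    pos = [x for x in xs]          # ...and the "else" side
    return [n if m else p for m, n, p in zip(mask, neg, pos)]

print(simd_abs([-3, 1, -2, 5]))  # [3, 1, 2, 5]
```

Both branch bodies are executed for every lane, so divergent branches waste work; arrays in for loops, where all lanes do the same thing, avoid this.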



Asynchronous Activity

● Write notes on Warehouse Scale Computers.
