
COSC 403:

COMPUTER ARCHITECTURE



Parallelism:
Multiprocessing,
Multithreading &
Pipelining

MODULE SIX
PARALLELISM
• Parallelism / Parallel Processing essentially refers to techniques that are employed to enhance the performance of modern computer systems. The fundamental goal of parallelism is to increase the amount of work that can be done (performance) by a computer processor (system) per cycle (or unit time).
• It basically has to do with sets of instructions that do not depend on each other and can therefore be executed simultaneously.
PARALLELISM
• The computing industry has, over the years, witnessed the implementation of different techniques aimed at exploiting and enhancing the capability for parallel processing. Some of these techniques include: Multiprocessing (and Multicore), Multithreading, Pipelining, Superscalar, Out-of-order Execution, Cluster & Grid Processing Systems, etc.
PARALLELISM (contd.)
• Fundamentally, Parallelism could be implemented
at both the Hardware and the Software levels.
• Parallelism in Hardware:
 Parallelism in a Uniprocessor
 Pipelining
 Superscalar
 Very Long Instruction Words (VLIW), etc…
PARALLELISM (contd.)
 Single Instruction, Multiple Data (SIMD) Instructions, Vector Processors, Graphics Processing Units (GPUs), etc.
 Parallelism in Multiprocessors
 Shared-memory Multiprocessors
 Distributed-memory Multiprocessors
 Chip Multiprocessors (a.k.a. Multicores)
 Multi-computers (a.k.a. Cluster Systems)
PARALLELISM (contd.)
• Parallelism in Software:
 Bit-level Parallelism - 1970 to ~1985
 Instruction Level Parallelism – 1985 – mid ‘90s
 Task-level Parallelism – mainstream for modern general
purpose computing
 Data Parallelism
 Transaction Level Parallelism
Processor Architectures that are designed to take advantage of the various benefits of parallelism in both hardware & software can be organized into one of the following known / existing architectures (Flynn's taxonomy):
RELATIONSHIP BETWEEN A
TASK, INSTRUCTION, PROCESS
& THREAD
• A Task is a job that is to be done by the Computer, or a goal to be accomplished
• Instructions are a set of directives or commands given to the Computer towards achieving a specific task
• Recall that a Program (or, Software) is a set of Instructions given to the computer to perform a specific task
• Therefore, we can say: a set of Instructions = a Program
• A Process is a Program in execution
• A Thread is a lightweight unit of execution within a Process (threads are discussed further under Multithreading)
PARALLELISM (contd.)
Single Instruction, Single Data Stream (SISD) Architecture
• Possesses a single processor
• Features a single machine instruction stream
• Data is stored in a single memory
• It is a standard uniprocessor implementation

Single Instruction, Multiple Data Stream (SIMD) Architecture
• Features a single machine instruction stream
• Possesses capabilities to control simultaneous execution
• Multiple (data) processing elements are featured
• Each (data) processing element has its own associated (data) memory
• The same instruction is executed on a different set of data by all resident processors
• Featured in vector, GPU and array processor implementations
PARALLELISM (contd.)
Multiple Instruction, Single Data Stream (MISD) Architecture
• It features a single sequence of data that is transmitted to a set of processors
• Each processor executes a different instruction sequence on the same data stream
• This is still a hypothetical paradigm, as it is yet to feature in an actual implementation

Multiple Instruction, Multiple Data Stream (MIMD) Architecture
• There is a set of processors that simultaneously execute different instruction sequences using different sets of data
• This is a standard multiprocessor implementation
• SMPs, clusters and NUMA systems are actual implementations of this approach to parallelism
PARALLELISM (contd.)
[Figure: Taxonomy of Parallel Processor Architectures]
PARALLELISM (contd.)
[Figures: Single Instruction, Single Data Stream (SISD) and Single Instruction, Multiple Data Stream (SIMD) architecture diagrams]
PARALLELISM (contd.)
[Figures: Multiple Instruction, Single Data Stream (MISD) and Multiple Instruction, Multiple Data Stream (MIMD) architecture diagrams]
PARALLELISM (contd.)
The focus, however, of this particular module for this course will be on Parallelism as it relates specifically to the concepts of Multiprocessing (and Multicore), Multithreading, & Pipelining. All of these are basically methods / techniques for achieving Instruction Level Parallelism in the broadest sense of it…
PARALLELISM:
PIPELINING
WHAT IS PIPELINING?
• Pipelining is not a technique that is native to
processors and computer architecture; it is a
general-purpose efficiency technique that is used in:
Production & Assembly lines, Bucket brigades, Fast
food restaurants, etc.
• Pipelining is used in other CS disciplines:
• Networking
• Server software architecture
• It is a very useful technique for increasing
throughput in the presence of long latency times
WHAT IS PIPELINING? (contd.)
• A technique used in advanced
microprocessors where the microprocessor
begins executing a second instruction before
the first has been completed.
• In modern processors with pipelining, the computer architecture allows the next instructions to be fetched while the processor is performing arithmetic operations, holding them in a buffer close to the processor until each instruction operation can be performed.
PIPELINING PRINCIPLE
• The pipeline is divided into segments
and each segment can execute its
operation concurrently with the other
segments. Once a segment completes
an operation, it passes the result to the
next segment in the pipeline and
fetches the next operations from the
preceding segment.
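The segment idea above can be sketched with Python generators; this is a minimal illustration (not from the slides), and chained generators pass items one at a time rather than truly concurrently, so it models the structure of segments, not parallel hardware.

```python
# Each "segment" transforms an item and passes the result to the next
# segment in the chain, mirroring the pipeline principle described above.

def fetch(items):
    for item in items:
        yield f"fetched({item})"

def decode(stream):
    for item in stream:
        yield f"decoded({item})"

def execute(stream):
    for item in stream:
        yield f"executed({item})"

# Chain the segments: each generator pulls from the preceding one.
pipeline = execute(decode(fetch(["i1", "i2", "i3"])))
for result in pipeline:
    print(result)
# Prints executed(decoded(fetched(i1))), then i2, then i3
```

Each instruction flows through every segment in order, which is exactly the "pass the result to the next segment" behaviour the slide describes.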
PIPELINING IN REAL LIFE (contd.)
• Imagine that three janitors (A, B & C) have to clean up
a flat of three bedrooms that each have to be swept,
washed and mopped successively. Janitor A has the
broom, Janitor B has the automatic floor washer, and
Janitor C has the mop and cleaning rags…
• Imagine that Janitors B & C have to wait first for
Janitor A to finish sweeping all three rooms before
Janitor B would proceed to wash the floors of all three
rooms while Janitor C waits and Janitor A is idle…
• And then when Janitor B is done washing, Janitor C
would begin cleaning all three rooms while Janitors A
& B would become idle…
PIPELINING IN REAL LIFE (contd.)
• Cycle 1: Janitor A is sweeping Room 1; Janitors B & C are idle.
• Cycle 2: Janitor A is sweeping Room 2; Janitors B & C are idle.
• Cycle 3: Janitor A is sweeping Room 3; Janitors B & C are idle.
• Cycle 4: Janitor B is washing Room 1; Janitors A & C are idle.
PIPELINING IN REAL LIFE (contd.)
• Cycle 5: Janitor B is washing Room 2; Janitors A & C are idle.
• Cycle 6: Janitor B is washing Room 3; Janitors A & C are idle.
• Cycle 7: Janitor C is cleaning Room 1; Janitors A & B are idle.
• Cycle 8: Janitor C is cleaning Room 2; Janitors A & B are idle.
• Cycle 9: Janitor C is cleaning Room 3; Janitors A & B are idle.
• Cycle 10: Task completed.
PIPELINING IN REAL LIFE (contd.)
In other words:
• If each of the three janitors takes 1 hour to complete his duty for each of the rooms, the total time taken to complete the task would be 1 x 9 = 9 hours…
• This excludes the idle time of each of the Janitors while they had to wait for the entire preceding duty to be completed by the other janitor(s)
The result is both a waste of time and an inefficient use of resources… This is a non-pipelined scenario…
PIPELINING IN REAL LIFE (contd.)
• Now, imagine that the same three janitors (A, B & C) have
to clean up a flat of three bedrooms that each have to be
swept, washed and mopped successively. Janitor A has the
broom, Janitor B has the automatic floor washer, and
Janitor C has the mop and cleaning rags…
• Imagine that Janitor A finishes sweeping room 1 and
proceeds to room 2, while Janitor B begins washing room
1…
• Then Janitor A proceeds to sweep room 3, while Janitor B
proceeds to wash room 2 and Janitor C proceeds to clean
room 1…
PIPELINING IN REAL LIFE (contd.)
• Cycle 1: Janitor A is sweeping Room 1; Janitors B & C are idle.
• Cycle 2: Janitor B is washing Room 1; Janitor A is sweeping Room 2; Janitor C is idle.
• Cycle 3: Janitor C is cleaning Room 1; Janitor B is washing Room 2; Janitor A is sweeping Room 3.
• Cycle 4: Janitor C is cleaning Room 2; Janitor B is washing Room 3; Janitor A is idle.
• Cycle 5: Janitor C is cleaning Room 3; Janitors A & B are idle.
PIPELINING IN REAL LIFE (contd.)
In other words:
• If each of the three janitors takes 1 hour to complete each duty for each of the rooms, the total time taken to complete the task is now 1 x 5 = 5 hours…
• Even though this still excludes the idle times of the Janitors while they wait, 4 hours have been saved compared to the non-pipelined approach…
The result is a reduced time to clean the entire flat of three rooms, and an efficient use of resources due to a reduction in idle time… This is a pipelined scenario…
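The cycle counts in the two janitor scenarios follow the standard pipeline timing formulas. A quick sketch, assuming one hour per duty per room (the function names are mine, not from the slides):

```python
# Cycle counts for the janitor example: n rooms flow through k stages
# (sweep, wash, mop), each stage taking stage_time hours.

def non_pipelined_time(n_rooms, k_stages, stage_time=1):
    # Each stage must finish all rooms before the next stage starts.
    return n_rooms * k_stages * stage_time

def pipelined_time(n_rooms, k_stages, stage_time=1):
    # The first room takes k cycles; every remaining room finishes
    # exactly one cycle after the previous one.
    return (k_stages + n_rooms - 1) * stage_time

print(non_pipelined_time(3, 3))  # 9 hours: the non-pipelined scenario
print(pipelined_time(3, 3))      # 5 hours: the pipelined scenario
```

The pipelined formula (k + n − 1) is the same one used later for processor pipelines: the pipeline fills for k cycles, then completes one task per cycle.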
PIPELINING FACTS
• Pipelining doesn’t improve the latency of a single task; rather, it improves the throughput of the entire workload
• The pipeline rate is often limited by the slowest pipeline stage
• In Pipelining, multiple tasks operate simultaneously using different resources at any given time
• The potential speedup increases with the number of pipeline stages (it is roughly proportional to the pipeline depth)
• Unbalanced lengths and time frames of the pipeline stages can reduce the achievable speedup
• Cycle times decrease as the clock rate increases
PROCESSOR PIPELINING
• A pipelined processor can be defined as a processor that consists of a sequence of processing circuits, called segments, through which a stream of operands (data) is passed. Each segment performs partial processing of the data stream, and the final output is produced once the stream has passed through all segments of the pipeline.
• Any operation that can be decomposed into a sequence of well-defined subtasks can be implemented using the pipelining concept.
PROCESSOR PIPELINING (contd.)
In modern Microprocessors, pipelines can be characterized
based on whether they are:
1) Hardware or software implemented – i.e. pipelining can
be implemented in either software or hardware.
2) Large or Small Scale – i.e. stations in a pipeline can
range from simplistic to powerful, and a pipeline can
range in length from short to long.
3) Synchronous or asynchronous flow – A synchronous pipeline operates like an assembly line: at a given time, each station is processing some amount of information. An asynchronous pipeline allows a station to forward information at any time or to remain idle.
PROCESSOR PIPELINING (contd.)
4) Buffered or unbuffered flow – One stage of the pipeline either sends data directly to the next, or a buffer is placed between each pair of stages, often to cater for delays.
5) Finite Chunks or Continuous Bit Streams – The digital information that passes through a pipeline can consist of a sequence of small data items or an arbitrarily long bit stream.
6) Automatic Data Feed Or Manual Data Feed – Some
implementations of pipelines use a separate mechanism to
move information, and other implementations require each
stage to participate in moving information.
7) Uni-function or Multifunction – This depends on whether or
not different functions could be performed at different times
through the pipeline segments
PROCESSOR PIPELINING (contd.)
• Executing an instruction in a typical Microprocessor basically includes the following basic stages: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory access (MEM), and Write-back (WB).
PROCESSOR PIPELINING (contd.)
• Implementing this in a typical pipeline structure would look like the diagram below:
PROCESSOR PIPELINING (contd.)
Clock cycle          1    2    3    4    5    6    7    8    9
lw  $t0, 4($sp)      IF   ID   EX   MEM  WB
sub $v0, $a0, $a1         IF   ID   EX   MEM  WB
and $t1, $t2, $t3              IF   ID   EX   MEM  WB
or  $s0, $s1, $s2                   IF   ID   EX   MEM  WB
add $sp, $sp, -4                         IF   ID   EX   MEM  WB
PROCESSOR PIPELINING (contd.)
• The pipeline diagram above shows the execution of a
series of instructions.
• The instruction sequence is shown vertically, from top to bottom.
• Clock cycles are shown horizontally, from left to right.
• Each instruction is divided into its component stages. (We show five
stages for every instruction, which will make the control unit
easier.)
• This clearly indicates the overlapping of instructions. For example, there are three instructions active in the third cycle above.
• The “lw” instruction is in its Execute stage.
• Simultaneously, the “sub” is in its Instruction Decode stage.
• Also, the “and” instruction is just being fetched.
PROCESSOR PIPELINING (contd.)
Clock cycle          1    2    3    4    5    6    7    8    9
lw  $t0, 4($sp)      IF   ID   EX   MEM  WB
sub $v0, $a0, $a1         IF   ID   EX   MEM  WB
and $t1, $t2, $t3              IF   ID   EX   MEM  WB
or  $s0, $s1, $s2                   IF   ID   EX   MEM  WB
add $sp, $sp, -4                         IF   ID   EX   MEM  WB
                     (filling: cycles 1-4; full: cycle 5; emptying: cycles 6-9)
PROCESSOR PIPELINING (contd.)
• The pipeline depth is the number of stages—in this
case, five.
• In the first four cycles here, the pipeline is filling,
since there are unused functional units.
• In cycle 5, the pipeline is full. Five instructions are
being executed simultaneously, so all hardware
units are in use.
• In cycles 6-9, the pipeline is emptying.
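The filling / full / emptying behaviour can be reproduced with a few lines of Python; the function name and layout here are assumptions, not from the slides:

```python
# Which stage of a 5-stage pipeline each instruction occupies per cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_of(instr_index, cycle):
    # Instruction i enters the pipeline at cycle i + 1 (cycles are 1-based).
    s = cycle - 1 - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else None

n_instr = 5
for cycle in range(1, n_instr + len(STAGES)):   # cycles 1..9
    active = [stage_of(i, cycle) for i in range(n_instr)]
    print(cycle, active)
# In cycles 1-4 some slots are None (filling); in cycle 5 all five
# stages are occupied (full); in cycles 6-9 slots empty out again.
```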
CALCULATING THROUGHPUT
• To determine the improvement in throughput of a pipelined against a non-pipelined processor, the following formulae are used (reconstructed here in the usual textbook form; the symbol k for the number of pipeline segments is an addition, as the original diagram is missing):

Total pipelined time:      t = (k + n − 1) × max{t}
Total non-pipelined time:  n × T
Speedup:                   S = (n × T) / t

Where:
n = the number of processes that would be required to complete the task
k = the number of segments (stages) in the pipeline
t = the total time required to complete all processes in the pipeline
max {t} = the highest time required to complete a process in any one segment of the pipeline
T = the time period required to complete a particular task
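As a worked check of the formulae (using the usual textbook formulation with equal stage times, e.g. as in Stallings), a small Python sketch; the function name is mine:

```python
# Speedup of a k-stage pipeline over sequential execution, assuming all
# stages take the same time tau (so max{t} = tau):
#   non-pipelined time = n * k * tau
#   pipelined time     = (k + n - 1) * tau

def speedup(n_tasks, k_stages, tau=1.0):
    t_seq = n_tasks * k_stages * tau
    t_pipe = (k_stages + n_tasks - 1) * tau
    return t_seq / t_pipe

# For a 5-stage pipeline, speedup approaches the pipeline depth (5)
# as the number of tasks grows large.
print(round(speedup(5, 5), 2))
print(round(speedup(1000, 5), 2))
```

This mirrors the later slide's claim that, in the best case, the speedup is equal to the pipeline depth.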
IN OTHER WORDS…
• Pipelining attempts to maximize instruction
throughput by overlapping the execution of
multiple instructions.
• Pipelining offers amazing speedup.
• In the best case, one instruction finishes on every cycle,
and the speedup is equal to the pipeline depth.
• The pipelined datapath is very similar to the single-cycle one, but it also features added pipeline registers
• Each stage needs its own functional units
PIPELINING vs SUPERSCALAR
[Figure: side-by-side comparison of pipelined vs superscalar instruction flow]
PROCESSOR PIPELINING (contd.)
• The ideal pipeline is often described as one in
which every instruction progresses smoothly
down the stages of the pipeline without any lags
(delays) or stalls.
• However, it is often difficult to achieve such a
pipeline in a real world processor implementation
• Several things could cause a pipeline to stall /
wait. These are known as hazards… And there are
basically three types:
PIPELINING HAZARDS
 Procedural dependencies => Control hazards
These occur as a result of dependencies in conditional and unconditional branches and calls/returns; they typically happen when the location of the next instruction to execute depends on a previous instruction that is in another location
 Resource conflicts => Structural hazards
Occurs as a result of the need to use the same resource in
different stages of the pipeline; typically when two
instructions need to access the same resource.
PIPELINING HAZARDS
 Data dependencies => Data hazards
This typically occurs when an instruction is
supposed to use the result of a previous instruction
which result is not yet available for use at the time.
As an example, a data hazard occurs exactly when
an instruction tries to read a register in its ID stage
that an earlier instruction intends to write in its WB
stage. Data Hazards are typically of three types:
 RAW (read after write) [Dependence]
 WAR (write after read) [Anti-Dependence]
 WAW (write after write) [Output Dependence]
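The three dependence types can be classified mechanically; the representation below (each instruction as a destination register plus a set of source registers) is an illustrative assumption:

```python
# Classify the dependence between two instructions, each given as
# (destination_register, {source_registers}).

def classify(first, second):
    dst1, srcs1 = first
    dst2, srcs2 = second
    kinds = []
    if dst1 in srcs2: kinds.append("RAW")   # second reads what first writes
    if dst2 in srcs1: kinds.append("WAR")   # second writes what first reads
    if dst1 == dst2:  kinds.append("WAW")   # both write the same register
    return kinds

# ADD R1, R2, R3  followed by  SUB R4, R1, R5  is a RAW hazard on R1:
print(classify(("R1", {"R2", "R3"}), ("R4", {"R1", "R5"})))  # ['RAW']
```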
EXAMPLE OF A DATA HAZARD
ADD R1, R2, R3    IF   ID   EX   MEM  WB      (R2 & R3 selected for the ALU; the sum is stored in R1 at WB)
SUB R4, R1, R5         IF   ID   EX   MEM  WB (R1 & R5 are selected for the ALU in ID, before ADD has written R1)
MITIGATING PIPELINING
HAZARDS
• Pipelining Hazards are commonly mitigated
through the use of:
 Stalling: This involves halting the flow of
instructions until the required result is ready to be
used. However, note that stalling wastes processor
time by doing nothing while waiting for the result.
This is inefficient…
 The insertion of “nops” (no operation) into the
pipeline stream, typically just to create delays…
MITIGATING PIPELINING
HAZARDS (contd.)
 Forwarding of results early, so that missing / required data items are made available in time through the aid of some internal resources (bypass paths). This often helps to avoid stalls…
 Register Renaming: This technique is used in
solving false data dependences that arise from the
reuse of architectural registers by successive
instructions that do not necessarily have any real
data dependences between them; this is achieved
by renaming their register operands
PIPELINING ADVANTAGES &
DISADVANTAGES
Advantages:
• More efficient use of processor
• Quicker time of execution of large number of
instructions
Disadvantages:
• Pipelining involves adding hardware to the chip
• Inability to continuously run the pipeline at full
speed because of pipeline hazards which disrupt
the smooth execution of the pipeline.
PARALLELISM:
MULTIPROCESSING
MULTIPROCESSING
• Multiprocessing is a feature of modern
computer systems in which two or more
CPUs typically share full access to a
common RAM, and are able to process /
execute instructions at the same time.
• Most modern microarchitectures implement
bus-based multiprocessors, where the
processors communicate with each other
and the memory via a bus line…
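A minimal sketch of multiprocessing using Python's standard-library multiprocessing module; the worker function and queue-based result passing are illustrative choices, not the only design:

```python
# Two worker processes executing simultaneously, sharing their results
# through a Queue. Each process has its own address space.
from multiprocessing import Process, Queue

def worker(name, numbers, results):
    # Runs in its own process with its own memory.
    results.put((name, sum(n * n for n in numbers)))

if __name__ == "__main__":
    results = Queue()
    p1 = Process(target=worker, args=("p1", range(100), results))
    p2 = Process(target=worker, args=("p2", range(100, 200), results))
    p1.start(); p2.start()          # both processes run at the same time
    p1.join(); p2.join()
    print(dict(results.get() for _ in range(2)))
```

Because each process has a separate address space, results must be passed explicitly (here via a Queue), which previews the message-passing vs shared-memory distinction on the next slides.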
MULTIPROCESSING (contd.)
[Figure: Bus-based multiprocessors, with processors and memory connected via a shared bus]
MULTIPROCESSING (contd.)
A multi-processing Operating System can run several processes at the same time
Each process has its own address/memory space
The OS's scheduler decides which process is executed, and when
Only one process actually executes on each processing core at any given instant
However, the system appears to be running several programs simultaneously
MULTIPROCESSOR
ARCHITECTURES
Message-Passing Architectures
• Separate address space for each processor.
• Processors communicate via message passing.

Shared-Memory Architectures
• Single address space shared by all processors.
• Processors communicate by memory read/write.
• SMP or NUMA.
• Cache coherence is an important issue.
MULTIPROCESSOR ARCHITECTURES
(contd.)
[Figures: Message-Passing vs Shared-Memory architecture diagrams]
SHARED-MEMORY ARCHITECTURE:
SMP and NUMA
• SMP = Symmetric Multiprocessor
• All memory is equally close to all processors.
• The typical interconnection network between processors is a shared bus.
• It is easy to program, but difficult to scale (i.e. add more processors); typically 8 – 32 processors.
• Also referred to as UMA (Uniform Memory Access)
• NUMA = Non-Uniform Memory Access
• Each memory is closer to some processors than to others.
• a.k.a. “Distributed Shared Memory”.
• The typical interconnection between processors is a grid or hypercube.
MULTIPROCESSOR
ARCHITECTURES
• Different possible methods exist by which a modern commodity Operating System could harness these multiprocessor architectures for its optimal performance… These include: One-to-One Mapping of OS to CPU, Master-Slave CPU Designations, and Symmetric & Asymmetric Multiprocessing Implementations, etc. … You could read these up for further study.
PARALLELISM:
MULTITHREADING
MULTITHREADING
• Multithreading is basically a software feature of modern commodity operating systems that is used to maximize and take advantage of the parallelism capabilities of the microprocessor hardware by breaking tasks up into threads.
• Threads are lightweight processes that the processor can easily switch between with minimal switching overhead.
• Threads are important because they help to enhance parallel processing; increase the responsiveness of the machine to the user; utilize the idle time of the CPU; and prioritize user tasks based on a priority scheme.
MULTITHREADING (contd.)
• For Example, a simple web server typically
listens for a request and then serves it… If
the web server does not feature a
multithreaded capability, the requests
awaiting processing would be in a queue,
thereby increasing the response time for
requests; also the server might hang /
become deadlocked when a bad request is
encountered.
• However, with a multithreaded environment, the web server is able to serve multiple requests concurrently, each handled by its own thread.
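A hedged sketch of the thread-per-request idea using Python's concurrent.futures; the handler name and delays are hypothetical, and a real web server adds sockets, parsing and error handling on top:

```python
# A pool of worker threads serving requests concurrently, so one slow
# request does not block the requests queued behind it.
from concurrent.futures import ThreadPoolExecutor
import time

def handle_request(request_id, delay):
    # Simulate serving a request; a slow one only delays its own thread.
    time.sleep(delay)
    return f"served request {request_id}"

with ThreadPoolExecutor(max_workers=4) as pool:
    # One slow request (0.2s) and three fast ones run concurrently.
    futures = [pool.submit(handle_request, i, 0.2 if i == 0 else 0.01)
               for i in range(4)]
    for f in futures:
        print(f.result())
```

In a single-threaded server the total time would be the sum of all delays; with the pool it is roughly the longest single delay.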
MULTITHREADING (contd.)
• Synchronization is a feature of
multithreaded environments (such as the
Java Runtime Environment) that help to
prevent data corruption… It allows for only
one thread to perform an operation on a
particular data object at a time… If multiple
threads require an access to a particular
object, synchronization helps in maintaining
consistency…
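A minimal sketch of synchronization with a lock in Python (the counter example is illustrative; Java's synchronized keyword plays the analogous role in the JRE mentioned above):

```python
# A Lock ensures only one thread updates the shared counter at a time,
# preventing lost updates (the "data corruption" the slide refers to).
import threading

counter = 0
lock = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        with lock:              # only one thread inside at any moment
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 40000: no updates lost
```

Without the lock, the read-modify-write on counter could interleave between threads and some increments would be lost.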
WHY MULTITHREADING?
In a single-threaded application, one thread of execution must complete all of the tasks
 If an application has several tasks to perform, those tasks will be performed when the thread can get to them.
 A single task which requires a lot of processing can make the entire application appear to be "sluggish" or unresponsive.
WHY MULTITHREADING? (contd.)
In a multithreaded application, each
task can be performed by a separate
thread
 If one thread is executing a long process, it does not make the entire application wait for it to finish.
If a multithreaded application is being executed on a system that has multiple processors, the OS may execute separate threads simultaneously on separate processors.
HOW MULTITHREADING WORKS?
 Each thread is given its own "context"
 A thread's context includes virtual registers
and its own calling stack
 The "scheduler" decides which thread
executes at any given time
 The VM may use its own scheduler
 Since many OSes now directly support
multithreading, the VM may use the
system's scheduler for scheduling threads
HOW MULTITHREADING WORKS?
(contd.)
 The scheduler maintains a list of ready
threads (the run queue) and a list of threads
waiting for input (the wait queue)
 Each thread has a priority. The scheduler
typically schedules between the highest
priority threads in the run queue
 Note: the programmer cannot make assumptions
about how threads are going to be scheduled.
Typically, threads will be executed differently on
different platforms.
SUMMARY
• Parallelism
• Pipelining, Principle, Facts & Hazards…
• Multiprocessing & Multiprocessor Architectures…
• Multithreading, Reasons for
Multithreading & How it works…
BIBLIOGRAPHY
1. Hennessy, J. L., & Patterson, D. A. (2007). Computer Architecture: A Quantitative Approach (Fourth Edition). San Francisco: Elsevier.
2. Stallings, W. (2010). Computer Organization and Architecture (Eighth Edition). New Jersey: Prentice-Hall.
3. Harris, D. M., & Harris, S. L. (2012). Digital Design and Computer Architecture (Second Edition). San Francisco: Elsevier.
QUESTIONS?

END OF MODULE