Parallelism - Multiprocessing, Multithreading & Pipelining
COMPUTER ARCHITECTURE
MODULE SIX
PARALLELISM
• Parallelism / Parallel Processing
essentially refers to techniques that are
employed to enhance the performance of
modern computer systems. The fundamental
goal of parallelism is to increase the amount
of work that can be done (performance) by a
computer processor (system) per cycle (or
unit time).
• It basically has to do with sets of
instructions that do not depend on each
other and so can be executed simultaneously.
PARALLELISM
• The computing industry over the years has
witnessed the implementation of different
techniques aimed at exploiting and
enhancing the capability for parallel
processing. Some of these techniques
include: Multiprocessing (and
Multicore), Multithreading, Pipelining,
Superscalar, Out-of-order Execution, Cluster
& Grid Processing Systems, etc.
PARALLELISM (contd.)
• Fundamentally, Parallelism could be implemented
at both the Hardware and the Software levels.
• Parallelism in Hardware:
Parallelism in a Uniprocessor
Pipelining
Superscalar
Very Long Instruction Words (VLIW), etc…
PARALLELISM (contd.)
Single Instruction Multiple Data (SIMD) Stream
Instructions, Vector Processors, Graphics
Processing Units (GPUs), etc.
Parallelism in Multiprocessors
Shared-memory Multiprocessors
Distributed-memory Multiprocessors
Chip Multiprocessors (a.k.a. Multicores)
Multi-computers (a.k.a. Cluster Systems)
PARALLELISM (contd.)
• Parallelism in Software:
Bit-level Parallelism - 1970 to ~1985
Instruction Level Parallelism – 1985 – mid ‘90s
Task-level Parallelism – mainstream for modern general
purpose computing
Data Parallelism
Transaction Level Parallelism
Processor Architectures that are designed to take advantage
of the various benefits of parallelism in both hardware &
software can be organized into one of the following known /
existing architectures:
RELATIONSHIP BETWEEN A
TASK, INSTRUCTION, PROCESS
& THREAD
• A Task is a job that is to be done by the Computer,
or a goal to be accomplished
• Instructions are a set of directives or commands
given to the Computer towards achieving a specific
task
• Recall that a Program (or, Software) is a set of
Instructions given to the computer to perform a
specific task
• Therefore, we can say a Program is an ordered set of
Instructions
• A Process is a Program in execution
• A Thread is a lightweight unit of execution within a
Process
PARALLELISM (contd.)
Taxonomy of Parallel Processor Architectures (Flynn's Taxonomy):
• Single Instruction, Single Data Stream (SISD)
Architecture: possesses a single processor that executes a
single instruction stream, operating on data held in a
single memory.
• Single Instruction, Multiple Data Stream (SIMD)
Architecture: features a single machine instruction that
controls the simultaneous execution of a number of
processing elements, each operating on a different data
element.
• Multiple Instruction, Single Data Stream (MISD)
Architecture: a single sequence of data is transmitted to a
set of processors, each of which executes a different
instruction sequence; rarely implemented in practice.
• Multiple Instruction, Multiple Data Stream (MIMD)
Architecture: a set of processors simultaneously execute
different instruction sequences on different data sets.
PARALLELISM (contd.)
The focus, however, of this particular
module for this course would be on
Parallelism as it relates specifically to
the concepts of Multiprocessing (and
Multicore), Multithreading, &
Pipelining. All of these are basically
methods / techniques for achieving
Instruction-Level Parallelism in the
broadest sense of it…
PARALLELISM:
PIPELINING
WHAT IS PIPELINING?
• Pipelining is not a technique that is native to
processors and computer architecture; it is a
general-purpose efficiency technique that is used in:
Production & Assembly lines, Bucket brigades, Fast
food restaurants, etc.
• Pipelining is used in other CS disciplines:
• Networking
• Server software architecture
• It is a very useful technique for increasing
throughput in the presence of long latency times
WHAT IS PIPELINING? (contd.)
• A technique used in advanced
microprocessors where the microprocessor
begins executing a second instruction before
the first has been completed.
• In modern processors with pipelining, the
computer architecture allows the next
instructions to be fetched while the
processor is performing arithmetic
operations, holding them in a buffer close to
the processor until each instruction
operation can be performed.
PIPELINING PRINCIPLE
• The pipeline is divided into segments
and each segment can execute its
operation concurrently with the other
segments. Once a segment completes
an operation, it passes the result to the
next segment in the pipeline and
fetches the next operations from the
preceding segment.
PIPELINING IN REAL LIFE (contd.)
• Imagine that three janitors (A, B & C) have to clean up
a flat of three bedrooms that each have to be swept,
washed and mopped successively. Janitor A has the
broom, Janitor B has the automatic floor washer, and
Janitor C has the mop and cleaning rags…
• Imagine that Janitors B & C have to wait first for
Janitor A to finish sweeping all three rooms before
Janitor B would proceed to wash the floors of all three
rooms while Janitor C waits and Janitor A is idle…
• And then when Janitor B is done washing, Janitor C
would begin cleaning all three rooms while Janitors A
& B would become idle…
PIPELINING IN REAL LIFE (contd.)
TASKS     | ROOM 1                          | ROOM 2                          | ROOM 3
Cycle 1   | Janitor A sweeping (B & C idle) |                                 |
Cycle 2   |                                 | Janitor A sweeping (B & C idle) |
Cycle 3   |                                 |                                 | Janitor A sweeping (B & C idle)
Cycle 4   | Janitor B washing (A & C idle)  |                                 |
PIPELINING IN REAL LIFE (contd.)
TASKS     | ROOM 1                          | ROOM 2                          | ROOM 3
Cycle 7   | Janitor C cleaning (A & B idle) |                                 |
Cycle 8   |                                 | Janitor C cleaning (A & B idle) |
Cycle 9   |                                 |                                 | Janitor C cleaning (A & B idle)
Cycle 10  | Task completed                  |                                 |
(Cycles 4 – 6, not shown, are Janitor B washing rooms 1 – 3 in turn while A & C are idle.)
PIPELINING IN REAL LIFE (contd.)
In other words:
• If each of the three janitors would take 1 hour to
complete each duty for each of the rooms, the total
time that would be taken to complete the task
would be 1 x 9 = 9 hours…
• This is excluding the idle times of each of the
Janitors while they had to wait for the entire
preceding duty to be completed by the other
janitor(s)
The result is both a waste of time and an inefficient
use of resources… This is a non-pipelined scenario…
PIPELINING IN REAL LIFE (contd.)
• Now, imagine that the same three janitors (A, B & C) have
to clean up a flat of three bedrooms that each have to be
swept, washed and mopped successively. Janitor A has the
broom, Janitor B has the automatic floor washer, and
Janitor C has the mop and cleaning rags…
• Imagine that Janitor A finishes sweeping room 1 and
proceeds to room 2, while Janitor B begins washing room
1…
• Then Janitor A proceeds to sweep room 3, while Janitor B
proceeds to wash room 2 and Janitor C proceeds to clean
room 1…
PIPELINING IN REAL LIFE (contd.)
TASKS     | ROOM 1                          | ROOM 2                          | ROOM 3
Cycle 1   | Janitor A sweeping (B & C idle) |                                 |
Cycle 2   | Janitor B washing (C idle)      | Janitor A sweeping              |
Cycle 3   | Janitor C cleaning              | Janitor B washing               | Janitor A sweeping
Cycle 4   |                                 | Janitor C cleaning              | Janitor B washing (A idle)
Cycle 5   |                                 |                                 | Janitor C cleaning (A & B idle)
PIPELINING IN REAL LIFE (contd.)
In other words:
• If each of the three janitors would take 1 hour to complete
each duty for each of the rooms, the total time that would
be taken to complete the task would be 1 x 5 = 5 hours (the
work finishes in the fifth cycle)…
• Even though this also excludes the idle times of each of
the Janitors while they waited for a preceding duty to be
completed by another janitor, 4 hours have been saved
compared with the non-pipelined scenario…
The result is a reduced time to clean the entire flat of three
rooms, and an efficient use of resources due to a reduction in
idle time… This is a pipelined scenario…
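The two scenarios can be checked with a small calculation. The sketch below (not from the slides; the function names are illustrative) counts the cycles needed to push n tasks through k successive stages in each scheme:

```python
def sequential_cycles(n_tasks, n_stages):
    # Non-pipelined: each stage must finish every task before the
    # next stage begins, so there are n_stages full passes.
    return n_tasks * n_stages

def pipelined_cycles(n_tasks, n_stages):
    # Pipelined: after the first task fills the pipeline
    # (n_stages cycles), one task completes per cycle.
    return n_stages + (n_tasks - 1)

# Three rooms, three stages (sweep, wash, mop):
print(sequential_cycles(3, 3))  # 9 hours, non-pipelined
print(pipelined_cycles(3, 3))   # 5 hours, pipelined
```

The gap between the two grows with the number of tasks: for 100 rooms the same three janitors would need 300 hours sequentially but only 102 hours pipelined.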
PIPELINING FACTS
• Pipelining doesn’t improve the latency of a single task; rather, it
improves the throughput of the entire workload
• The pipeline rate is limited by the slowest pipeline stage
• In Pipelining, multiple tasks operate simultaneously using
different resources at any given time
• The potential speedup increases as the number of pipeline
stages increases
• Unbalanced lengths and time frames of the pipeline stages
can reduce the speedup
• The cycle time decreases as the clock rate increases
PROCESSOR PIPELINING
• A pipelined processor can be defined as a
processor that consists of a sequence of processing
circuits called segments (stages). A stream of operands
(data) is passed through the pipeline such that
each segment performs partial processing of the data
stream, and the final output is
produced once the stream has passed through
all the segments of the pipeline.
• Any operation that can be decomposed into a
sequence of well-defined subtasks can be
implemented using the pipelining concept.
PROCESSOR PIPELINING (contd.)
In modern Microprocessors, pipelines can be characterized
based on whether they are:
1) Hardware or software implemented – i.e. pipelining can
be implemented in either software or hardware.
2) Large or Small Scale – i.e. stations in a pipeline can
range from simplistic to powerful, and a pipeline can
range in length from short to long.
3) Synchronous or asynchronous flow – A synchronous
pipeline operates like an assembly line: at a given time,
each station is processing some amount of information.
An asynchronous pipeline allows a station to forward
information at any time or to remain idle.
PROCESSOR PIPELINING (contd.)
4) Buffered or unbuffered flow – One stage of the pipeline
either sends data directly to the next, or a buffer is placed
between each pair of stages, often to cater for delays.
5) Finite Chunks or Continuous Bit Streams – The digital
information that passes through a pipeline can consist of a
sequence of small data items or an arbitrarily long bit stream.
6) Automatic Data Feed or Manual Data Feed – Some
implementations of pipelines use a separate mechanism to
move information, while other implementations require each
stage to participate in moving information.
7) Uni-function or Multifunction – This depends on whether or
not different functions can be performed at different times
through the pipeline segments.
PROCESSOR PIPELINING (contd.)
• Executing an instruction in a typical
Microprocessor basically includes the
following basic stages:
  1. IF – Instruction Fetch
  2. ID – Instruction Decode (and register read)
  3. EX – Execute (ALU operation)
  4. MEM – Memory access
  5. WB – Write-Back (write the result to a register)
PROCESSOR PIPELINING (contd.)
• Implementing these stages in a typical pipeline
structure would look like this:
PROCESSOR PIPELINING (contd.)
Clock cycle          1    2    3    4    5    6    7    8    9
lw   $t0, 4($sp)     IF   ID   EX   MEM  WB
sub  $v0, $a0, $a1        IF   ID   EX   MEM  WB
and  $t1, $t2, $t3             IF   ID   EX   MEM  WB
or   $s0, $s1, $s2                  IF   ID   EX   MEM  WB
add  $sp, $sp, -4                        IF   ID   EX   MEM  WB
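A diagram like the one above can also be generated programmatically. The following is a minimal illustrative sketch (the function name and layout are my own, not from the slides) that starts each instruction one cycle behind its predecessor and walks it through the five stages:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    """Return one row per instruction: its stage in each clock cycle."""
    rows = []
    for i, instr in enumerate(instructions):
        # Instruction i enters the pipeline i cycles after the first,
        # so its row is padded with i empty cells before IF.
        cells = [""] * i + STAGES
        rows.append((instr, cells))
    return rows

for name, cells in pipeline_diagram(["lw", "sub", "and", "or", "add"]):
    print(f"{name:4s} " + " ".join(f"{c:3s}" for c in cells))
```

Note that the last instruction's row is k + n − 1 cells long, which is exactly the total cycle count of the pipelined execution.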
PROCESSOR PIPELINING (contd.)
• The pipeline diagram above shows the execution of a
series of instructions.
• The instruction sequence is shown vertically, from top to bottom.
• Clock cycles are shown horizontally, from left to right.
• Each instruction is divided into its component stages. (We show five
stages for every instruction, which makes the control unit simpler
to design.)
Clock cycle          1    2    3    4    5    6    7    8    9
lw   $t0, 4($sp)     IF   ID   EX   MEM  WB
sub  $v0, $a0, $a1        IF   ID   EX   MEM  WB
and  $t1, $t2, $t3             IF   ID   EX   MEM  WB
or   $s0, $s1, $s2                  IF   ID   EX   MEM  WB
add  $sp, $sp, -4                        IF   ID   EX   MEM  WB
                     |--- filling ---|full|--- emptying ----|
PROCESSOR PIPELINING (contd.)
• The pipeline depth is the number of stages—in this
case, five.
• In the first four cycles here, the pipeline is filling,
since there are unused functional units.
• In cycle 5, the pipeline is full. Five instructions are
being executed simultaneously, so all hardware
units are in use.
• In cycles 6-9, the pipeline is emptying.
CALCULATING THROUGHPUT
• To determine the improvement in throughput of a pipelined processor
over an equivalent non-pipelined one, the speedup can be computed as:

  Speedup = (n × k × t) / ((k + n − 1) × t) = (n × k) / (k + n − 1)

Where:
  n = the number of tasks (instructions)
  k = the number of pipeline stages
  t = the time taken per stage (one clock cycle)
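Assuming the standard speedup relation for a k-stage pipeline processing n tasks (n·k cycles non-pipelined versus k + n − 1 pipelined), a quick calculation can be sketched as follows (the function name is illustrative):

```python
def pipeline_speedup(n_tasks, n_stages):
    # Non-pipelined time: n * k cycles; pipelined: k + (n - 1) cycles.
    # The per-stage time t cancels out of the ratio.
    return (n_tasks * n_stages) / (n_stages + n_tasks - 1)

# 5 instructions through a 5-stage pipeline, as in the diagram above:
print(round(pipeline_speedup(5, 5), 2))       # 25 / 9 ≈ 2.78
# As n grows large, the speedup approaches the number of stages k:
print(round(pipeline_speedup(10**6, 5), 4))
```

This shows why the speedup is bounded by the pipeline depth: the limit of n·k / (k + n − 1) as n grows is k.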
PARALLELISM:
MULTIPROCESSING
[Figure: Bus-based multiprocessors]
MULTIPROCESSING (contd.)
A multi-processing Operating System can
run several processes at the same time:
• Each process has its own address/memory space
• The OS's scheduler decides which process is
executed, and when
• Only one process is actually executing on each
processor core at any given time;
however, the system appears to be running
several programs simultaneously
MULTIPROCESSOR
ARCHITECTURES
Message-Passing Architectures:
• Separate address space for each processor.
• Processors communicate via message passing.
Shared-Memory Architectures:
• Single address space shared by all processors.
• Processors communicate by memory reads/writes.
• SMP or NUMA.
• Cache coherence is an important issue.
MULTIPROCESSOR ARCHITECTURES (contd.)
[Figures: Message-Passing vs Shared-Memory Architectures]
SHARED-MEMORY ARCHITECTURE:
SMP and NUMA
• SMP = Symmetric Multiprocessor
• All memory is equally close to all processors.
• The typical interconnection network between processors is a
shared bus.
• It is easy to program, but difficult to scale (i.e. add more
processors); typically 8 – 32 processors.
• Also referred to as UMA (Uniform Memory Access).
• NUMA = Non-Uniform Memory Access
• Each memory is closer to some processors than to others.
• a.k.a. “Distributed Shared Memory”.
• The typical interconnection between processors is a grid or
hypercube.
MULTIPROCESSOR
ARCHITECTURES
• Different methods exist by which
the modern commodity
Operating System could harness these
multiprocessor architectures for its
optimal performance… These include:
One-to-One Mapping of OS to CPU,
Master-Slave CPU Designations, and
Symmetric & Asymmetric
Multiprocessing Implementations, etc.
… You could read these up for further study…
PARALLELISM:
MULTITHREADING
MULTITHREADING
• Multithreading is basically a software feature of
modern commodity operating systems that is used
to maximize and take advantage of the parallelism
capabilities of the microprocessor hardware by
breaking tasks up into threads.
• Threads are lightweight processes that the
processor can switch between with minimal
overhead.
• Threads are important because they help to
enhance parallel processing; increase the responsiveness
of the machine to the user; utilize the idle time of the
CPU; and prioritize user tasks based on a priority scheme.
MULTITHREADING (contd.)
• For example, a simple web server typically
listens for a request and then serves it… If
the web server does not feature a
multithreaded capability, the requests
awaiting processing would sit in a queue,
thereby increasing the response time for
requests; also, the server might hang /
become deadlocked when a bad request is
encountered.
• However, with a multithreaded environment,
the web server is able to serve multiple
requests simultaneously, improving response
times and resilience.
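The contrast described above can be sketched with a thread pool standing in for the multithreaded server (a simplified illustration, not a real web server; the function names and timings are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def handle_request(request_id):
    # Simulate an I/O-bound request (e.g. a disk read or DB query).
    time.sleep(0.1)
    return f"response to request {request_id}"

start = time.perf_counter()
# A pool of worker threads serves all eight requests concurrently,
# instead of queueing them behind one thread of execution.
with ThreadPoolExecutor(max_workers=8) as pool:
    responses = list(pool.map(handle_request, range(8)))
elapsed = time.perf_counter() - start

print(len(responses), f"requests served in {elapsed:.2f}s")
# Concurrent: roughly 0.1 s total; a single-threaded server
# would need roughly 0.8 s for the same workload.
```

One slow or bad request ties up only its own worker thread, so the remaining requests are still served — the queueing and deadlock problems described above are avoided.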
MULTITHREADING (contd.)
• Synchronization is a feature of
multithreaded environments (such as the
Java Runtime Environment) that helps to
prevent data corruption… It allows only
one thread to perform an operation on a
particular data object at a time… If multiple
threads require access to a particular
object, synchronization helps in maintaining
consistency…
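The same idea can be shown with Python's threading.Lock standing in for Java's synchronized blocks (a minimal sketch; the counter and deposit names are illustrative):

```python
import threading

counter = 0
lock = threading.Lock()

def deposit(times):
    global counter
    for _ in range(times):
        # Only one thread at a time may hold the lock, so the
        # read-modify-write on counter cannot be interleaved.
        with lock:
            counter += 1

threads = [threading.Thread(target=deposit, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000: no updates are lost
```

Without the lock, two threads could read the same old value of `counter` and each write back old + 1, silently losing one of the increments — exactly the data corruption synchronization is meant to prevent.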
WHY MULTITHREADING?
In a single-threaded application, one
thread of execution must complete
all of the tasks:
• If an application has several tasks to
perform, those tasks will be performed
when the thread can get to them.
• A single task which requires a lot of
processing can make the entire
application appear to be "sluggish" or
unresponsive.
WHY MULTITHREADING? (contd.)
In a multithreaded application, each
task can be performed by a separate
thread:
• If one thread is executing a long process, it
does not make the entire application wait for
it to finish.
• If a multithreaded application is being
executed on a system that has multiple
processors, the OS may execute
separate threads simultaneously on
separate processors.
HOW MULTITHREADING WORKS?
Each thread is given its own "context"
A thread's context includes virtual registers
and its own calling stack
The "scheduler" decides which thread
executes at any given time
The VM may use its own scheduler
Since many OSes now directly support
multithreading, the VM may use the
system's scheduler for scheduling threads
HOW MULTITHREADING WORKS?
(contd.)
The scheduler maintains a list of ready
threads (the run queue) and a list of threads
waiting for input (the wait queue)
Each thread has a priority; the scheduler
typically chooses among the highest-priority
threads in the run queue
Note: the programmer cannot make assumptions
about how threads are going to be scheduled.
Typically, threads will be executed differently on
different platforms.
SUMMARY
• Parallelism
• Pipelining, Principle, Facts & Hazards…
• Multiprocessing & Multiprocessor
Architectures…
• Multithreading, Reasons for
Multithreading & How it works…
BIBLIOGRAPHY
1. Hennessy, J. L., & Patterson, D. A. (2007). Computer
Architecture: A Quantitative Approach (Fourth
Edition). San Francisco: Elsevier.
2. Stallings, W. (2010). Computer Organization and
Architecture (Eighth Edition). New Jersey:
Prentice-Hall.
3. Harris, D. M., & Harris, S. L. (2012). Digital Design and
Computer Architecture (Second Edition). San
Francisco: Elsevier.
QUESTIONS?
Image Source: http://iamforkids.org/wp-content/uploads/2013/11/j04278101.jpg - Retrieved Online on January 11, 2016
END OF MODULE