DST4030A Lecture Notes Week 3

The document provides an overview of parallel computing, covering processes, multitasking, threads, instruction-level parallelism (ILP), hardware multithreading, and classifications of parallel hardware. It explains how multitasking operating systems manage processes and threads, and discusses ILP techniques like pipelining and multiple issue. Additionally, it classifies parallel computers using Flynn's taxonomy into SIMD and MIMD systems, detailing their architectures and advantages/disadvantages.


DST4030A: Parallel Computing

Introduction to Parallel Processor Machines

Dr Mike Asiyo

School of Science & Technology



Contents

1 Processes, multitasking, and threads

2 Instruction-level parallelism

3 Hardware multithreading

4 Parallel hardware



Processes, multitasking, and threads



When a user runs a program, the operating system creates a process—an instance of a computer program that is being executed.
A process consists of several entities:
The executable machine language program.
A block of memory, which will include the executable code, a call
stack that keeps track of active functions, a heap that can be used for
memory explicitly allocated by the user program, and some other
memory locations.
Descriptors of resources that the operating system has allocated to
the process, for example, file descriptors.
Security information—for example, information specifying which
hardware and software resources the process can access.
Information about the state of the process, such as whether the
process is ready to run or is waiting on some resource, the content of
the registers, and information about the process’s memory.
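As a minimal POSIX sketch of process creation (an illustration only; the system calls shown are an assumption, not part of the slides), a parent process can ask the OS to create a child process and run another executable in it:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();              /* ask the OS to create a new process */
    if (pid < 0) {
        perror("fork");
        return EXIT_FAILURE;
    }
    if (pid == 0) {                  /* child: replace its image with another program */
        execl("/bin/ls", "ls", "-l", (char *)NULL);
        perror("execl");             /* only reached if execl fails */
        _exit(EXIT_FAILURE);
    }
    waitpid(pid, NULL, 0);           /* parent: wait for the child process to finish */
    return 0;
}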

Most modern operating systems are multitasking. This means that the operating system provides support for the apparent simultaneous execution of multiple programs.
In a multitasking OS, if a process needs to wait for a resource—for
example, it needs to read data from external storage—it will block.
This means that it will stop executing and the operating system can
run another process.
Threading provides a mechanism for programmers to divide their
programs into more or less independent tasks, with the property
that when one thread is blocked, another thread can be run.
Threads are contained within processes, so they can use the same
executable, and they usually share the same memory and the same
I/O devices. In fact, two threads belonging to one process can share
most of the process’s resources.

If a process is the “master” thread of execution and threads are started and stopped by the process, then we can envision the process and its subsidiary threads as lines: when a thread is started, it forks off the process; when a thread terminates, it joins the process.
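A minimal sketch of this fork/join pattern using POSIX threads (pthreads is assumed here purely for illustration; link with -pthread):

#include <pthread.h>
#include <stdio.h>

/* Work carried out by the subsidiary thread. */
void *thread_work(void *arg) {
    printf("hello from the forked thread\n");
    return NULL;
}

int main(void) {
    pthread_t tid;

    /* The process "forks off" a thread ... */
    pthread_create(&tid, NULL, thread_work, NULL);

    /* ... and the thread later "joins" the process. */
    pthread_join(tid, NULL);

    printf("thread has joined; process continues\n");
    return 0;
}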



Instruction-level parallelism



Instruction-level parallelism, or ILP, attempts to improve processor performance by having multiple processor components or functional units simultaneously executing instructions.
There are two main approaches to ILP:
pipelining, in which functional units are arranged in stages; and
multiple issue, in which multiple instructions can be simultaneously
initiated.



Instruction-level parallelism: Pipelining

The principle of pipelining is similar to a factory assembly line: while one team is bolting a car’s engine to the chassis, another team can connect the transmission to the engine and the driveshaft of a car that’s already been processed by the first team, and a third team can bolt the body to the chassis in a car that’s been processed by the first two teams.
Pipelines improve performance by taking individual pieces of
hardware or functional units and connecting them in sequence.
In general, a pipeline with k stages won’t get a k−fold improvement
in performance. For example, if the times required by the various
functional units are different, then the stages will effectively run at
the speed of the slowest functional unit. Furthermore, delays—such
as waiting for an operand to become available—can cause the
pipeline to stall.
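For a rough worked example with assumed stage times: suppose a five-stage pipeline has stage times of 1, 1, 2, 1, and 1 ns. An unpipelined instruction then takes 1 + 1 + 2 + 1 + 1 = 6 ns, while the pipeline completes at most one instruction every 2 ns (the time of the slowest stage), so for a long stream of instructions the speedup is roughly 6/2 = 3, not 5.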
Instruction-level parallelism: Multiple Issue

Multiple issue processors replicate functional units and try to simultaneously execute different instructions in a program. For example, if we have two complete floating point adders, we can approximately halve the time it takes to execute the loop

for (i = 0; i < 1000; i++) {
    z[i] = x[i] + y[i];
}
While the first adder is computing z[0], the second can compute z[1]; while the
first is computing z[2], the second can compute z[3]; and so on.
If the functional units are scheduled at compile time, the multiple issue system
is said to use static multiple issue. If they’re scheduled at run-time, the system
is said to use dynamic multiple issue. A processor that supports dynamic
multiple issue is sometimes said to be superscalar.
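As a sketch of how this parallelism might be exposed for static multiple issue (the unrolled function below is an illustration, not part of the lecture's example):

/* Illustrative only: unrolling by two makes the pair of independent
   additions explicit, so a static multiple issue (VLIW-style) compiler
   can schedule one addition on each floating point adder per cycle. */
void add_arrays(const double x[1000], const double y[1000], double z[1000]) {
    for (int i = 0; i < 1000; i += 2) {
        z[i]   = x[i]   + y[i];     /* candidate for adder 1 */
        z[i+1] = x[i+1] + y[i+1];   /* candidate for adder 2 */
    }
}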



Hardware multithreading



ILP can be very difficult to exploit: a program with a long sequence of dependent statements offers few opportunities. For example, in a direct calculation of the Fibonacci numbers

f[0] = f[1] = 1;
for (i = 2; i <= n; i++) {
    f[i] = f[i-1] + f[i-2];
}
there’s essentially no opportunity for simultaneous execution of instructions.
Thread-level parallelism, or TLP, attempts to provide parallelism through the simultaneous execution of different threads. It therefore provides a coarser-grained parallelism than ILP: the program units that are executed simultaneously (threads) are larger, or coarser, than ILP's finer-grained units, the individual instructions.


Hardware multithreading provides a means for systems to continue doing useful work when the task being currently executed has stalled—for example, if the
current task has to wait for data to be loaded from memory. Instead of looking
for parallelism in the currently executing thread, it may make sense to simply
run another thread. Of course, for this to be useful, the system must support
very rapid switching between threads.
In fine-grained multithreading, the processor switches between threads after
each instruction, skipping threads that are stalled. While this approach has the
potential to avoid wasted machine time due to stalls, it has the drawback that a
thread that’s ready to execute a long sequence of instructions may have to wait
to execute every instruction.
Coarse-grained multithreading attempts to avoid this problem by only
switching threads that are stalled waiting for a time-consuming operation to
complete (e.g., a load from main memory). This has the virtue that switching
threads doesn’t need to be nearly instantaneous. However, the processor can be
idled on shorter stalls, and thread switching will also cause delays.


Simultaneous multithreading, or SMT, is a variation on fine-grained multithreading. It attempts to exploit superscalar
processors by allowing multiple threads to make use of the multiple
functional units. If we designate “preferred” threads—threads that
have many instructions ready to execute—we can somewhat reduce
the problem of thread slowdown.



Parallel hardware


Parallel hardware: Classifications of parallel computers

We employ two independent classifications of parallel computers:
Flynn’s taxonomy classifies a parallel computer according to the number of instruction streams and the number of data streams (or datapaths) it can simultaneously manage.
The alternative classification has to do with how the cores access
memory.



Parallel hardware: Flynn’s Taxonomy

1 SIMD systems
Single instruction, multiple data, or SIMD, systems are parallel systems.
They operate on multiple data streams by applying the same instruction to
multiple data items, so an abstract SIMD system can be thought of as
having a single control unit and multiple datapaths.
An instruction is broadcast from the control unit to the datapaths, and
each datapath either applies the instruction to the current data item, or it
is idle.
SIMD systems are ideal for parallelizing simple loops that operate on large
arrays of data.
Parallelism that’s obtained by dividing data among the processors and
having the processors all apply (more or less) the same instructions to
their subsets of the data is called data-parallelism.
SIMD parallelism can be very efficient on large data parallel problems, but
SIMD systems often don’t do very well on other types of parallel problems.
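A minimal sketch of this data-parallelism for the array addition used earlier, assuming p workers (threads, processes, or datapaths), a rank for each worker, and that p divides n; the block partitioning shown is one common choice, not the only one:

/* Each of the p workers applies the same operation, z[i] = x[i] + y[i],
   to its own contiguous block of the data. */
void add_block(const double *x, const double *y, double *z,
               int n, int p, int my_rank) {
    int block = n / p;            /* size of each worker's block */
    int first = my_rank * block;  /* first index owned by this worker */
    int last  = first + block;
    for (int i = first; i < last; i++)
        z[i] = x[i] + y[i];
}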


Figure 1: SISD and SIMD architecture


2 MIMD systems
Multiple instruction, multiple data, or MIMD, systems support
multiple simultaneous instruction streams operating on multiple data
streams.
MIMD systems typically consist of a collection of fully independent
processing units or cores, each of which has its own control unit and
its own datapath.
MIMD systems are usually asynchronous, that is, the processors can
operate at their own pace.


Figure 2: MIMD Architecture


There are two principal types of MIMD systems: shared-memory systems and distributed-memory systems.
In a shared-memory system a collection of autonomous processors is
connected to a memory system via an interconnection network, and
each processor can access each memory location.
In a shared-memory system, the processors usually communicate
implicitly by accessing shared data structures.
In a distributed-memory system, each processor is paired with its
own private memory, and the processor-memory pairs communicate
over an interconnection network.
So in distributed-memory systems, the processors usually
communicate explicitly by sending messages or by using special
functions that provide access to the memory of another processor.
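As a minimal sketch of such explicit communication (MPI is assumed here only as a representative message-passing library; the slides do not prescribe one), process 0 sends an integer to process 1:

#include <mpi.h>
#include <stdio.h>

/* Run with at least two processes, e.g., mpiexec -n 2 ./a.out */
int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Explicit communication: send the value to process 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}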
Parallel hardware: Cores Access Memory

3 Shared Memory: Involves sharing the same address space, as shown in Figure 3.

Figure 3: Shared Memory Parallel Computer Architecture



Advantages of Shared Memory:
Global address space provides a user-friendly programming perspective to memory
Data sharing between tasks is both fast and uniform due to the
proximity of memory to CPUs
Disadvantages of Shared Memory:
Lack of scalability between memory and CPUs.
Programmer responsibility for synchronization constructs that ensure “correct” access of global memory.
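A minimal pthreads sketch of such a synchronization construct, using a made-up shared counter (an illustration only); without the mutex, concurrent increments of the shared global variable could be lost:

#include <pthread.h>
#include <stdio.h>

long counter = 0;                                 /* shared global memory */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *increment(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                /* programmer-supplied synchronization */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);           /* 200000 with the mutex in place */
    return 0;
}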


4 Distributed Memory: Involves a communication network to connect inter-processor memory, as shown in Figure 4.

Figure 4: Distributed Memory Parallel Computer Architecture


Processors have their own local memory.


Memory addresses in one processor do not map to another
processor, so there is no concept of global address space across all
processors.
Changes a processor makes to its local memory have no effect on the memory of other processors.
It is the task of the programmer to explicitly define how and when data is communicated.
Synchronization between tasks is likewise the programmer’s
responsibility.


Advantages of Distributed Memory:
Scalable memory: memory scales with the number of processors.
Each processor can rapidly access its own memory.
Cost effective: can use commodity, off-the-shelf processors and
networking.
Disadvantages of Distributed Memory:
Programmer is responsible for details associated with data
communication between processors.
It may be difficult to map existing data structures, based on global
memory, to this memory organization.
Non-uniform memory access times: data residing on a remote node takes longer to access than node-local data.


5 Hybrid Distributed-Shared Memory: The largest and fastest computers in the world today employ both shared and distributed memory architectures (Figure 5).

Figure 5: Hybrid Memory Parallel Computer Architecture


Figure 6: Hybrid Systems with Multicore CPUs


The shared memory component can be a shared memory machine and/or graphics processing units (GPUs).
The distributed memory component is the networking of multiple shared memory/GPU machines, which know only about their own memory, not the memory on another machine.
Network communications are required to move data from one
machine to another.
Trends indicate that this type of memory architecture will prevail
and increase at the high end of computing for the foreseeable future.
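One common way to program such hybrid machines (shown here only as an illustrative assumption, not something the slides prescribe) is message passing between nodes combined with shared-memory threading within each node, for example MPI plus OpenMP (compile with, e.g., mpicc -fopenmp):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, provided;
    /* One MPI process per node: distributed memory between nodes. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Several threads per process: shared memory within a node. */
    #pragma omp parallel
    {
        printf("process %d, thread %d\n", rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}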


Advantages & Disadvantages of Hybrid
The hybrid approach shares whatever is common to both shared and distributed memory architectures.
Increased scalability is an important advantage.
Increased programming complexity is a major disadvantage.



Thank You!
