Unit IV CA
Welcome…
19IT202T /
Computer Architecture
Syllabus – Unit IV
UNIT-IV PARALLELISM
Introduction to Multicore processors
and other shared memory
multiprocessors - Flynn's classification:
SISD, MIMD, SIMD, SPMD and
Vector - Hardware multithreading: Fine-
grained, Coarse-grained and
Simultaneous Multithreading (SMT) -
GPU architecture: NVIDIA GPU
Architecture, NVIDIA GPU Memory
Structure
Topics:
• Introduction to Multicore processors
• Other shared memory multiprocessors
• Flynn’s classification:
o SISD,
o MIMD,
o SIMD,
o SPMD and Vector
• Hardware multithreading
• GPU architecture
Introduction to Multicore
processors
Multicore processors
• What is a Processor?
o A single chip package that fits in a socket
o Cores can have functional units, cache, etc.
associated with them
• The main goal of multicore design is to
increase processing power by providing
multiple computing units on a single chip.
• A multicore processor is a single computing
component with two or more “independent”
processors (called "cores").
• Also known as a chip multiprocessor (CMP)
EXAMPLES
dual-core processor with 2 cores
• e.g. AMD Phenom II X2, Intel Core 2 Duo E8500
quad-core processor with 4 cores
• e.g. AMD Phenom II X4, Intel Core i5 2500T
hexa-core processor with 6 cores
• e.g. AMD Phenom II X6, Intel Core i7 Extreme Ed. 980X
octa-core processor with 8 cores
• e.g. AMD FX-8150, Intel Xeon E7-2820
Processor
Single core
Multicore
Number of core types
Homogeneous (symmetric) cores:
• All of the cores in a homogeneous multicore
processor are of the same type; typically the core
processing units are general-purpose central
processing units that run a single operating
system.
• Example: Intel Core 2
Heterogeneous Multicore Processor
• The cores in a heterogeneous multicore processor
are of different types; general-purpose CPU cores
may be combined with specialized cores such as
DSP or GPU cores.
shared memory multiprocessors
Shared Memory Multiprocessors
• A system with multiple CPUs “sharing” the
same main memory is called a multiprocessor.
• In a multiprocessor system all processes on
the various CPUs share a single logical
address space, which is mapped onto a physical
memory that may be distributed among the
processors.
• Each process can read and write a data item
simply using load and store operations, and
process communication is through shared
memory; a minimal sketch follows below.
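A minimal sketch of this idea in C++ host code (illustrative, not from the slides; the variable names are made up): one thread communicates a value to another purely through stores and loads on the shared address space.

#include <atomic>
#include <cstdio>
#include <thread>

// Both threads see the same shared address space.
std::atomic<int>  shared_data(0);
std::atomic<bool> ready(false);

int main() {
    std::thread producer([] {
        shared_data.store(42);    // communicate via a store ...
        ready.store(true);
    });
    std::thread consumer([] {
        while (!ready.load()) {}  // ... and via loads; no message passing needed
        std::printf("consumer read %d\n", shared_data.load());
    });
    producer.join();
    consumer.join();
    return 0;
}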
Shared Memory Multiprocessors
Questions:
• Multicore processor
• Hexacore processor
• Homogeneous Multicore processor
• Heterogeneous Multicore processor
• Multiprocessor
• Shared memory Multiprocessor
• Single address space multiprocessors come in two
styles.
o Uniform Memory Access (UMA)
o Non-Uniform Memory Access (NUMA)
UMA Architecture:
• In the first style, the latency to a word in
memory does not depend on which processor
asks for it. Such machines are called uniform
memory access (UMA) multiprocessors.
NUMA/DSMA Architecture:
• In the second style, some memory accesses
are much faster than others, depending on
which processor asks for which word, typically
because main memory is divided and attached to
different microprocessors or to different memory
controllers on the same chip.
• Such machines are called nonuniform memory
access (NUMA) multiprocessors.
Types:
The shared-memory multiprocessors fall into
two classes, depending on the number of
processors involved, which in turn dictates a
memory organization and interconnect
strategy.
• They are:
1. Centralized shared memory (Uniform Memory
Access)
2. Distributed shared memory (Non-Uniform
Memory Access)
1. Centralized shared memory architecture
2. Distributed shared memory architecture
Flynn’s
classification
Flynn's classification:
SISD
• SISD machines execute a single instruction on
individual data values using a single processor.
• Based on traditional Von Neumann uniprocessor
architecture, instructions are executed
sequentially or serially, one step after the next.
• Until recently, most computers were of the SISD
type.
• Conventional uniprocessor
SISD
SIMD
• An SIMD machine executes a single instruction on
multiple data values simultaneously using many
processors.
• Since there is only one instruction stream, the
processors do not each fetch and decode
instructions. Instead, a single control unit does the
fetching and decoding for all processors.
• SIMD architectures include array processors.
SIMD
• Data level parallelism:
o Parallelism achieved by performing the same operation on
independent data (see the kernel sketch below).
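A minimal sketch of data-level parallelism, written as a CUDA kernel (illustrative; the kernel name and arguments are made up): every thread executes the same add instruction, each on its own element.

// Single instruction (add), multiple data: thread i works on element i.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique element index
    if (i < n)
        c[i] = a[i] + b[i];  // same operation, independent data
}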
MISD
• Each processor executes a different sequence of instructions.
• In MISD computers, multiple processing units operate on
a single data stream.
• This category has few, if any, practical examples; it was included in
the taxonomy for the sake of completeness.
MISD
Questions:
• Uniform Memory Access (UMA)
• Non-Uniform Memory Access (NUMA)
• Centralized shared memory
• Distributed shared memory
• Flynn’s classification:
MIMD
• MIMD machines are usually referred to as
multiprocessors or multicomputers.
• MIMD machines may execute multiple instruction
streams simultaneously, unlike SIMD machines.
• Each processor has its own control unit, and the
processors may work on parts of one task or on
separate tasks.
• It has two subclasses: Shared memory and
distributed memory
MIMD
Analogy of Flynn’s Classifications
• An analogy of Flynn’s classification is the
check-in desk at an airport
SISD: a single desk
SIMD: many desks and a supervisor with
a megaphone giving instructions that
every desk obeys
MIMD: many desks working at their own
pace, synchronized through a central
database
Hardware categorization
Structure of a vector unit containing four lanes
Vector lane
• One or more vector functional units and a portion of the vector
register file.
Questions:
• MIMD
• Examples for Flynn’s classification
Hardware
multithreading
Hardware multithreading
• A thread is a lightweight process with its own
instructions and data.
• Each thread has all the state (instructions, data,
PC, register state, etc.) necessary to allow it to
execute.
• Multithreading (MT) allows multiple threads to
share the functional units of a single processor.
Hardware multithreading
• Multithreading increases the utilization of a
processor by switching to another thread when
one thread is stalled.
• Types of Multithreading:
o Fine-grained Multithreading
• Cycle by cycle
o Coarse-grained Multithreading
• Switch on event (e.g., cache miss)
o Simultaneous Multithreading (SMT)
• Instructions from multiple threads executed concurrently in the
same cycle
4-issue machine
Coarse-grained MT switches threads only
on costly stalls, such as L2 misses.
The processor is not slowed down (by
thread switching), since instructions from
other threads will only be issued when a
thread encounters a costly stall.
Since a CPU with coarse-grained MT issues
instructions from a single thread, when a
stall occurs the pipeline must be emptied.
The new thread must fill the pipeline before
instructions will be able to complete.
Coarse-grained MT switches threads only
on costly stalls, such as L2 misses.
Advantages:
– thread switching does not need to be
essentially free, and it is much less likely to slow down
the execution of an individual thread
Disadvantage:
– limited, due to pipeline start-up costs, in its
ability to overcome throughput loss
Pipeline must be flushed and refilled on
thread switches
Coarse-grained MT
Questions
• Define thread.
• What is meant by hardware multithreading?
• Types of multithreading
[Figure: issue-slot time stamps, comparing single-thread execution with multithreaded execution on a 4-issue machine]
Approaches to use the issue slots.
Amdahl’s law
Speedup
• Speedup measures the improvement in running time
due to parallelism. The number of PEs is given by n.
• Based on running times, S(n) = ts/tp , where
o ts is the execution time on a single processor, using the fastest
known sequential algorithm
o tp is the execution time using a parallel processor.
Speedup in Simplest Terms
Amdahl’s law:
“It states that the potential speedup gained by the parallel execution
of a program is limited by the portion that cannot be parallelized.”
Amdahl’s law
• Assume the execution time before the improvement is 1, in some
unit of time; the resulting formula is given below.
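In symbols (the standard form of the law; F is the fraction of execution time that can be parallelized and n the number of processors):

Execution time after = (1 − F) + F/n
Speedup = 1 / ((1 − F) + F/n)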
Question:
• When parallelizing an application, the ideal speedup is speeding up
by the number of processors. What is the speedup with 8
processors if 60% of the application is parallelizable?
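A worked solution using Amdahl's law, with F = 0.6 and n = 8:

Speedup = 1 / ((1 − 0.6) + 0.6/8) = 1 / (0.4 + 0.075) = 1 / 0.475 ≈ 2.1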
Question:
• When parallelizing an application, the ideal speedup is speeding up
by the number of processors. What is the speedup with 8
processors if 80% of the application is parallelizable?
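A worked solution, with F = 0.8 and n = 8:

Speedup = 1 / ((1 − 0.8) + 0.8/8) = 1 / (0.2 + 0.1) = 1 / 0.3 ≈ 3.33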
Question:
• Suppose that we are considering an enhancement that runs 10
times faster than the original machine but is usable only 40% of the
time. What is the overall speedup gained by incorporating the
enhancement?
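A worked solution; the enhanced mode covers a fraction F = 0.4 of the time and is 10 times faster:

Speedup = 1 / ((1 − 0.4) + 0.4/10) = 1 / (0.6 + 0.04) = 1 / 0.64 ≈ 1.56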
Question
• Suppose you want to achieve a speed-up
of 90 times faster with 100 processors.
What percentage of the original
computation can be sequential?
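A worked solution, solving Amdahl's law for the parallelizable fraction F:

90 = 1 / ((1 − F) + F/100)
(1 − F) + F/100 = 1/90
1 − 0.99 F ≈ 0.0111, so F ≈ 0.9989

The sequential portion can therefore be at most about 0.1% of the original computation.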
Question
• Suppose you want to perform two sums: one is a sum of 10
scalar variables, and one is a matrix sum of a pair of two-
dimensional arrays, with dimensions 10 by 10. For now
let’s assume only the matrix sum is parallelizable. What
speed-up do you get with 10 versus 40 processors?
• Next, calculate the speed-ups assuming the matrices grow
to 20 by 20.
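A worked solution (the usual textbook treatment, counting each addition as one time unit and treating the 10 scalar sums as the sequential part):

Single processor: 10 + 100 = 110 time units.
10 processors: 10 + 100/10 = 20, so speedup = 110/20 = 5.5 (55% of the ideal 10).
40 processors: 10 + 100/40 = 12.5, so speedup = 110/12.5 = 8.8 (22% of the ideal 40).
For 20 by 20 matrices (400 parallelizable additions, 410 total):
10 processors give 410/(10 + 40) = 8.2, and 40 processors give 410/(10 + 10) = 20.5,
much closer to the ideal speedups.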
Graphics
processing unit
(GPU)
Graphics processing unit (GPU)
• It is a processor optimized for 2D/3D graphics, video, visual computing, and display.
• It is a highly parallel, highly multithreaded multiprocessor optimized for visual
computing.
• It provides real-time visual interaction with computed objects via graphics images
and video.
• Heterogeneous systems combine a GPU with a CPU; a sketch follows below.
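A minimal sketch of such a heterogeneous CPU + GPU program (illustrative CUDA code, not from the slides; the kernel name and array size are made up): the CPU (host) allocates and moves data, while the GPU (device) runs the data-parallel kernel.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;                            // data-parallel work on the GPU
}

int main() {
    const int n = 1 << 20;
    float *h = (float *)malloc(n * sizeof(float));   // CPU (host) memory
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));      // GPU (device) memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);     // launch kernel on the GPU
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    std::printf("h[0] = %f\n", h[0]);                // CPU consumes the result
    cudaFree(d);
    free(h);
    return 0;
}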
GPU Hardware
An Introduction to the NVIDIA GPU Architecture
NVIDIA GPU Memory
Structures
Thank you…