
Multiprocessing,

multithreading and
vectorization

Sparsh Mittal
IIT Hyderabad, India

Courtesy for some slides: S. R. Sarangi and others


2
Background

3
Strong and weak scaling

• Strong scaling: how solution time varies with the
  number of cores for a fixed total problem size.
  – Use 2X as many machines for a task => solve it in half
    the time
• Weak scaling: how solution time varies with the
  number of cores for a fixed problem size per core.
  – If the dataset is twice as big, use 2X as many machines
    to solve the task in constant time.
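The two notions can be illustrated with a toy timing model (a sketch using the standard ideal-case speedup formulas; the function names and numbers are illustrative, not from the slides):

```python
# Ideal-case timing model for a perfectly parallel task
# that takes t1 seconds on one core.

def strong_scaling_time(t1, p):
    # Strong scaling: total problem size is fixed,
    # so ideal time shrinks as 1/p.
    return t1 / p

def weak_scaling_time(t1, p):
    # Weak scaling: problem size *per core* is fixed,
    # so total work grows with p and ideal time stays constant.
    return (t1 * p) / p

t1 = 64.0
print(strong_scaling_time(t1, 2))  # 32.0: 2X machines => half the time
print(weak_scaling_time(t1, 2))    # 64.0: 2X data on 2X machines => same time
```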

4
Shared Memory vs Message Passing
 Shared Memory
  All the threads share the virtual address space.
  They can communicate with each other by reading and
   writing values from/to shared memory.
● The application ensures no data corruption (Lock/Unlock)
● Example languages: OpenMP, CUDA
 Message Passing
  Programs communicate with each other by sending and
   receiving messages (e.g., sending emails).
  They do not share memory addresses.
  Example language: MPI
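A minimal sketch of both styles using Python's standard threading and multiprocessing modules (the worker functions and counts are illustrative, not from the slides):

```python
import threading
import multiprocessing as mp

# --- Shared memory: threads share one virtual address space;
# a lock prevents data corruption on the shared counter.
def shared_memory_demo():
    counter = [0]                       # object visible to all threads
    lock = threading.Lock()

    def add_many(n):
        for _ in range(n):
            with lock:                  # application ensures no corruption
                counter[0] += 1

    threads = [threading.Thread(target=add_many, args=(1000,))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]

# --- Message passing: processes do not share memory addresses;
# values travel through an explicit channel (like sending mail).
def square_worker(q, x):
    q.put(x * x)                        # "send" a message to the parent

def message_passing_demo():
    q = mp.Queue()
    procs = [mp.Process(target=square_worker, args=(q, i)) for i in range(4)]
    for p in procs:
        p.start()
    results = sorted(q.get() for _ in procs)
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(shared_memory_demo())         # 4000 = 4 threads x 1000 increments
    print(message_passing_demo())       # [0, 1, 4, 9]
```

Without the lock, the four threads could interleave their read-modify-write sequences on the counter and lose updates, which is exactly the data-corruption problem the slide mentions.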
5
Types of Parallelism

• Instruction Level Parallelism
  – Different instructions within a stream can be executed in parallel
  – Pipelining, out-of-order execution, speculative execution, VLIW
  – Dataflow

• Data Parallelism
  – Different pieces of data can be operated on in parallel
  – SIMD: Vector processing, array processing
  – Systolic arrays, streaming processors

• Task Level Parallelism
  – Different “tasks/threads” can be executed in parallel
  – Multithreading
  – Multiprocessing (multi-core)

7
Flynn's Taxonomy

8
Flynn's Classification

 Instruction stream → Set of instructions that are executed
 Data stream → Data values that the instructions process
 Four types of multiprocessors: SISD, SIMD, MISD, MIMD

9
SISD and SIMD

 SISD → Standard uniprocessor
 SIMD → One instruction operates on
   multiple pieces of data. Vector processors
   have one instruction that operates on
   many pieces of data in parallel. For
   example, one instruction can compute the
   sin⁻¹ (arcsine) of 4 values in parallel.
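A scalar sketch of that example using the standard math module (a real SIMD machine would do all four arcsine computations with one vector instruction; here a list comprehension stands in for the hardware):

```python
import math

# SISD: a scalar machine computes sin^-1 one value at a time.
# SIMD: the same operation is applied to all four lanes at once.
x = [0.0, 0.5, -0.5, 1.0]
y = [math.asin(v) for v in x]   # conceptually: one vector instruction
print(y)
```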

10
MISD

 MISD → Multiple Instruction Single Data
 Very rare in practice
 Consider an aircraft that has a MIPS, an ARM, and
an X86 processor operating on the same data
(multiple instruction streams)
 We have different instructions operating on the
same data
 The final outcome is decided on the basis of a
majority vote.

11
MIMD

 MIMD → Multiple instruction, multiple
   data (two types: SPMD, MPMD)
 SPMD → Single program, multiple data.
   Examples: OpenMP or MPI programs. We
   typically have multiple processes or threads
   executing the same program with different
   inputs.
 MPMD → A master program delegates work to
   multiple worker (slave) programs. The programs
   are different.

12
Summary
• SISD: Single instruction operates on single data element
• SIMD: Single instruction operates on multiple data
elements
– Array processor
– Vector processor
• MISD: Multiple instructions operate on single data
element
– Closest form: systolic array processor, streaming
processor
• MIMD: Multiple instructions operate on multiple data
elements (multiple instruction streams)
– Multiprocessor
– Multithreaded processor
13
Multiprocessing

 The term multiprocessing refers to multiple processors
   working in parallel.
 This is a generic definition; it can refer to multiple
   processors in the same chip, or processors across different
   chips.
 A multicore processor is a specific type of multiprocessor that
   contains all of its constituent processors in the same chip.
   Each such processor is known as a core.

14
Multithreading

15
The Notion of Threads

 We spawn a set of separate threads
 Properties of threads
  A thread shares its address space with other
   threads
  It has its own program counter, set of registers,
   and stack
  A process contains multiple threads
  Threads communicate with each other by
   writing values to memory or via
   synchronization operations

16
Operation of the Program

[Figure: execution timeline. The parent thread performs initialisation,
spawns child threads that run in parallel, waits for them in a thread
join operation, and then executes the sequential section.]
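The parent/child pattern in the figure maps directly onto thread APIs. A sketch with Python's standard threading module (the worker function and values are illustrative):

```python
import threading

results = [0] * 4          # shared memory: one slot per child thread

def child(i):
    results[i] = i * 10    # children write into the shared address space

# Parent thread: initialisation
threads = [threading.Thread(target=child, args=(i,)) for i in range(4)]

# Spawn child threads
for t in threads:
    t.start()

# Thread join operation: parent waits for all children to finish
for t in threads:
    t.join()

# Sequential section: only the parent runs from here on
print(results)             # [0, 10, 20, 30]
```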

17
Multithreading

 Multithreading → A design paradigm that
   proposes to run multiple threads on the
   same pipeline.
 Three types:
 Coarse grained
 Fine grained
 Simultaneous

18
Analogy

• Consider a car that can be shared by 4 people (A,
  B, C, D) in the following ways:
• Option 1: All four people ride the car simultaneously
  almost all the time to go to their destinations.
• Option 2: A uses the car for 15 days, B for 14 days, C
  for 18 days and D for 16 days (and repeat).
• Option 3: A uses the car for 8 hours, B for 9 hours,
  C for 5 hours and D for 7 hours (and repeat).
• Match the above three options to the three types of
  multithreading.

19
Coarse Grained Multithreading

 Assume that we want to run 4 threads on
   a pipeline
 Run thread 1 for n cycles, run thread 2 for
   n cycles, ...

[Figure: the four threads take turns on the pipeline in round-robin
order: 1 → 2 → 3 → 4 → 1 → ...]
20
Implementation

 Steps to minimize the context switch overhead
 For a 4-way coarse grained MT machine:
  4 program counters
  4 register files
  4 flags registers
  A context register that contains a thread id
 Zero overhead context switching → Change the thread
   id in the context register

21
Advantages

 Assume that thread 1 has an L2 miss
 Wait for 200 cycles
 Schedule thread 2
 Now let us say that thread 2 has an L2 miss
 Schedule thread 3
 We can have a sophisticated algorithm that
switches every n cycles, or when there is a long
latency event such as an L2 miss.
 Minimises idle cycles for the entire system

22
Fine Grained Multithreading

 The switching granularity is very small
  1-2 cycles
 Advantage:
 Can take advantage of low latency events such as division,
or L1 cache misses
 Minimise idle cycles to an even greater extent
 Correctness Issues
 We can have instructions of 2 threads simultaneously in the
pipeline.
 We never forward/interlock for instructions across threads

23
Simultaneous Multithreading

 Most modern processors have multiple
   issue slots
  Can issue multiple instructions to the functional
   units
  For example, a 3-issue processor can fetch, decode,
   and execute 3 instructions per cycle
 If a benchmark has low ILP (instruction level
parallelism), then fine and coarse grained
multithreading cannot really help.

24
Simultaneous Multithreading

 Main Idea
 Partition the issue slots across threads
 Scenario: In the same cycle
 Issue 2 instructions for thread 1
 and, issue 1 instruction for thread 2
 and, issue 1 instruction for thread 3

 Support required
 Need smart instruction selection logic.
 Balance fairness and throughput

25
Summary

[Figure: issue slots over time for the three schemes, with threads 1-4
shown in different shades. Coarse grained multithreading gives all issue
slots to one thread for many cycles before switching; fine grained
multithreading switches to a different thread every cycle; simultaneous
multithreading mixes instructions from several threads in the issue
slots of the same cycle.]
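The three schemes can be contrasted with a toy issue-slot model (an illustrative sketch, not from the slides: 2 issue slots per cycle, 4 threads, each list entry showing which thread owns each slot in that cycle):

```python
# Toy model of issue-slot assignment under the three
# multithreading schemes (2 issue slots per cycle).

def coarse_grained(threads, cycles, quantum=4, slots=2):
    # Run one thread for `quantum` cycles, then switch.
    return [[(c // quantum) % len(threads)] * slots for c in range(cycles)]

def fine_grained(threads, cycles, slots=2):
    # Switch to a different thread every cycle.
    return [[c % len(threads)] * slots for c in range(cycles)]

def simultaneous(threads, cycles, slots=2):
    # Fill the slots of a single cycle from different threads.
    return [[(c * slots + s) % len(threads) for s in range(slots)]
            for c in range(cycles)]

ts = [0, 1, 2, 3]
print(coarse_grained(ts, 4))  # [[0, 0], [0, 0], [0, 0], [0, 0]]
print(fine_grained(ts, 4))    # [[0, 0], [1, 1], [2, 2], [3, 3]]
print(simultaneous(ts, 4))    # [[0, 1], [2, 3], [0, 1], [2, 3]]
```

Note how only the simultaneous scheme ever mixes two threads in the same cycle, which is why it can help even when each individual thread has low ILP.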

26
Vectorization (and vector
processor)

27
BIG PICTURE

28
Vectorization

29
(Image source: https://colfaxresearch.com/knl-avx512/)
Some of the SIMD instruction sets used in
industry

Register size    Instruction set
64 bits          MMX
128 bits         SSE1/SSE2 etc.
256 bits         AVX/AVX2
512 bits         AVX-512

30
Vector Processors

 A vector instruction operates on arrays of
   data
 Example: There are vector instructions to add or
   multiply two arrays of data, and produce an array as
   output
 Advantage: Can be used to perform all kinds of
   array, matrix, and linear algebra operations. These
   operations form the core of many scientific
   programs, high intensity graphics, and data analytics
   applications.
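A scalar sketch of what single vector instructions accomplish (plain Python lists standing in for vector registers; a real vector unit performs each elementwise operation in one instruction):

```python
# Four-element "vector registers" and the elementwise operations
# a vector processor performs with single instructions.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

vec_add = [x + y for x, y in zip(a, b)]   # one vector add instruction
vec_mul = [x * y for x, y in zip(a, b)]   # one vector multiply instruction
vec_scale = [2.5 * x for x in a]          # vector-scalar multiply

print(vec_add)    # [11.0, 22.0, 33.0, 44.0]
print(vec_mul)    # [10.0, 40.0, 90.0, 160.0]
print(vec_scale)  # [2.5, 5.0, 7.5, 10.0]
```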

31
Software Interface

 Let us define a vector register
 Example: the 128-bit registers in the SSE instruction set
   → XMM0 … XMM15
 Each can hold 4 floating point values, or 8 2-byte short
   integers
 Addition of vector registers is equivalent to pairwise
   addition of each of the individual elements.
 The result is saved in a vector register of the same
   size.

32
Example of Vector Addition

[Figure: elementwise addition of vector registers: vr3 ← vr1 + vr2, where
each element of vr3 is the sum of the corresponding elements of vr1 and vr2.]

Let us define 8 128-bit vector registers in SimpleRisc: vr0 ... vr7

33
Loading Vector Registers

 There are two options:
 Option 1: We assume that the data elements are
   stored in contiguous locations
 Let us define the v.ld instruction that uses this
   assumption.

Instruction          Semantics
v.ld vr1, 12[r1]     vr1 ← ([r1+12], [r1+16], [r1+20], [r1+24])

 Option 2: Assume that the elements are not saved in
   contiguous locations.
 For this, there are scatter-gather instructions

34
Scatter Gather Operation

 The data is scattered in memory
 The load operation needs to gather the data and
   save it in a vector register.
 Let us define a scatter-gather version of the load
   instruction → v.sg.ld
 It uses another vector register that contains the
   addresses of each of the elements.

Instruction          Semantics
v.sg.ld vr1, vr2     vr1 ← ([vr2[0]], [vr2[1]], [vr2[2]], [vr2[3]])
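The two load flavours can be mimicked with a small memory model (a sketch: `mem` is a hypothetical word-addressed memory, so addresses are simply list indices; the function names echo the SimpleRisc mnemonics):

```python
# Hypothetical word-addressed memory (one element per address).
mem = [100, 101, 102, 103, 104, 105, 106, 107]

def v_ld(base):
    # Contiguous load, like v.ld: gather 4 consecutive
    # elements starting at address `base`.
    return [mem[base + i] for i in range(4)]

def v_sg_ld(addr_vr):
    # Scatter-gather load, like v.sg.ld: the address vector
    # register holds the address of each element.
    return [mem[a] for a in addr_vr]

print(v_ld(2))                # [102, 103, 104, 105]
print(v_sg_ld([7, 0, 5, 1]))  # [107, 100, 105, 101]
```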

35
Vector Store Operation

 We can similarly define two vector store
   operations

Instruction          Semantics
v.sg.st vr1, vr2     [vr2[0]] ← vr1[0]
                     [vr2[1]] ← vr1[1]
                     [vr2[2]] ← vr1[2]
                     [vr2[3]] ← vr1[3]

Instruction          Semantics
v.st vr1, 12[r1]     [r1+12] ← vr1[0]
                     [r1+16] ← vr1[1]
                     [r1+20] ← vr1[2]
                     [r1+24] ← vr1[3]

36
Vector Operations

 We can now define custom operations on vector
   registers
 v.add → Adds two vector registers
 v.mul → Multiplies two vector registers
 We can even have operations that have a vector
   operand and a scalar operand → Multiply a vector
   with a scalar.

37
Design of a Vector Processor

 Salient Points
 We have a vector register file and a scalar register file
 There are scalar and vector functional units
 Unless we are converting a vector to a scalar or vice
   versa, in general we do not forward values between
   vector and scalar instructions
 The memory unit needs support for regular operations,
vector operations, and possibly scatter-gather
operations.

39
References

• S. Mittal et al., “A Survey on Evaluating and Optimizing
  Performance of Intel Xeon Phi,” 2019.

40
