
Parallel Computing and Programming

Lecture 6: Thread Level Parallelism: Data Dependence Solutions

Dr. Rony Kassam
IEF Tishreen Uni
S1 2021

Index
n Von Neumann vs Dataflow Models.
n ISA vs Microarchitecture.
n Single-cycle vs Multi-cycle Microarchitectures.
n Instruction Level Parallelism: Pipelining Intro.
n Instruction Level Parallelism: Issues in Pipeline Design.
n Thread Level Parallelism: Data Dependence Solutions.
n Thread Level Parallelism: Shared Memory and OpenMP.

2
Recall: How to Handle Data Dependences
n Anti and output dependences are easier to handle
q write to the destination in one stage and in program order

n Flow (true) dependences are more interesting (see the C sketch below)

n Five fundamental ways of handling flow dependences


q Detect and wait until value is available in register file
q Detect and forward/bypass data to dependent instruction
q Detect and eliminate the dependence at the software level
n No need for the hardware to detect dependence
q Predict the needed value(s), execute “speculatively”, and verify
q Do something else (fine-grained multithreading)
n No need to detect
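To make the three dependence types concrete, here is a minimal C sketch (the
variable names are illustrative, not taken from the lecture):

/* Flow, anti, and output dependences in one small fragment. */
void deps(void) {
    int a, b = 3, c;
    a = b + 1;   /* writes a, reads b                                      */
    c = a * 2;   /* flow (true/RAW) dependence: reads the a written above  */
    b = 7;       /* anti (WAR) dependence: overwrites the b read above     */
    a = 9;       /* output (WAW) dependence: overwrites a again            */
    (void)c;     /* keep the compiler from warning about the unused value  */
}

Only the flow dependence forces the hardware to wait for (or forward) a value;
the anti and output cases are naming conflicts that in-order, in-stage write-back
already resolves.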
3
How to Handle Control Dependences
n Critical to keep the pipeline full with correct sequence of
dynamic instructions.

n Potential solutions if the instruction is a control-flow instruction:

n Stall the pipeline until we know the next fetch address


n Guess the next fetch address (branch prediction)
n Employ delayed branching (branch delay slot)
n Do something else (fine-grained multithreading)
n Eliminate control-flow instructions (predicated execution; see the sketch below)
n Fetch from both possible paths (if you know the addresses
of both possible paths) (multipath execution)
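As a hedged, source-level illustration of predicated execution (real hardware
predication happens at the ISA level), a branch can sometimes be replaced by a
branch-free select that compilers often lower to a conditional-move instruction:

/* Branching version: the pipeline must predict or stall on the if. */
int max_branch(int x, int y) {
    if (x > y)
        return x;
    return y;
}

/* Branch-free version: both inputs are evaluated and one is selected
   using the predicate, so there is no control-flow instruction to predict. */
int max_select(int x, int y) {
    int take_x = (x > y);                 /* predicate: 0 or 1 */
    return take_x * x + (1 - take_x) * y;
}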
4
Improving Performance
n Increase clock rate fs:
q Reached practical maximum for today’s technology.
q < 5 GHz for general-purpose computers
n Lower CPI (cycles per instruction)
q SIMD, “instruction level parallelism”
n Perform multiple tasks simultaneously
q Multiple CPUs, each executing different program.
q Tasks may be related
n E.g., each CPU performs part of a big matrix multiplication
q or unrelated
n E.g., distribute different web (HTTP) requests over different computers
n Do all of the above:
q High fs, SIMD, multiple parallel tasks

5
Multithreading
n Typical scenario:
q Active thread encounters a cache miss
q Active thread waits ~1000 cycles for data from DRAM → switch out and run a
different thread until the data is available
n Problem
q Must save current thread state and load new thread state
n PC, all registers (could be many, e.g. AVX)
q must perform switch in ≪ 1000 cycles
n Can hardware help?
q Moore’s Law: transistors are plentiful

6
Multithreaded Pipeline Example

q Four copies of PC and Registers inside processor hardware


q Looks like four processors to software (hardware threads 0, 1, 2, 3)
q Hyper-Threading: all threads can be active simultaneously

7
Hyper-Threading

Simultaneous Multithreading (HT): Logical CPUs > Physical CPUs
• Run multiple threads at the same time per core
• Each thread has own architectural state (PC, Registers, etc.)
• Share resources (cache, instruction unit, execution units)
8
Conclusion I
n Logical threads
q ≈ 1% more hardware
q ≈ 10% (?) better performance
n Separate registers
n Share datapath, ALU(s), caches
n Multicore
q => Duplicate Processors
q ≈ 50% more hardware
q ≈ 2X better performance?
n Modern machines do both
q Multiple cores with multiple threads per core

9
Conclusion II
n Thread Level Parallelism
q Thread: sequence of instructions, with own program counter
and processor state (e.g., register file)
q Multicore:
n Physical CPU: one thread (at a time) per CPU; in software, the OS switches
threads, typically in response to I/O events like disk reads/writes
n Logical CPU: Fine-grain thread switching, in hardware, when
thread blocks due to cache miss/memory access
n Hyper-Threading aka Simultaneous Multithreading (SMT): Exploit
superscalar architecture to launch instructions from different
threads at the same time!

10
Conclusion III
n Sequential software execution speed is limited
q Clock rates flat or declining
n Parallelism the only path to higher performance
q SIMD: instruction level parallelism
n Implemented in all high-performance CPUs today (x86, ARM, ...)
n Partially supported by compilers
n 2X width every 3-4 years
q MIMD: thread level parallelism
n Multicore processors
n Supported by Operating Systems (OS)
n Requires programmer intervention to exploit at single program level
n Add 2 cores every 2 years (2, 4, 6, 8, 10, ...)
q Intel Xeon W-3275: 28 Cores, 56 Threads
q SIMD & MIMD for maximum performance
n Key challenge: craft parallel programs with high performance on
multiprocessors as the # of processors increases – i.e., programs that scale
q Scheduling, load balancing, time for synchronization, communication overhead

11
Languages Supporting Parallel Programming I

12
Languages Supporting Parallel Programming II
n Parallel Programming Models and Machines (plus some architecture, e.g., caches)
Algorithm/machine model        Language / Library skills
Shared memory                  OpenMP
PGAS
Distributed memory             MPI
Data parallel                  SPARK
                               CUDA

n Parallelization Strategies for the “Motifs” of Scientific Computing (and Data)


Dense Linear Algebra        Monte Carlo
Sparse Linear Algebra       Spectral Methods
Particle Methods            Graphs
Structured Grids            Sorting
Unstructured Grids          Hashing
n Performance models: Roofline, α-β (latency/bandwidth), LogP
n Cross-cutting: Communication avoiding, load balancing, hierarchical algorithms,
autotuning, Moore’s Law, Amdahl’s Law, Little’s Law
13
Why So Many Parallel Programming Languages?
n Why “intrinsics”? (see the AVX sketch at the end of this slide)
q To Intel: fix your #()&$! compiler, thanks...
n It’s happening ... But
q SIMD features are continually added to compilers
n (Intel, gcc)
q Intense area of research
q Research progress:
n 20+ years to translate C into good (fast!) assembly
n How long to translate C into good (fast!) parallel code?
q General problem is very hard to solve
q Present state: specialized solutions for specific cases
q Your opportunity to become famous!
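To show why programmers still reach for intrinsics when the compiler does not
vectorize well, here is a minimal AVX sketch in C (assumptions: an x86 CPU with
AVX, the <immintrin.h> header, and an array length that is a multiple of 8):

#include <immintrin.h>

/* Add two float arrays 8 elements at a time using AVX intrinsics. */
void vec_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);              /* load 8 floats   */
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&c[i], _mm256_add_ps(va, vb));  /* c[i..i+7] = a+b */
    }
}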

14
Parallel Programming Languages
n The number of choices is an indication of
q No universal solution
n Needs are very problem specific
q E.g.,
n Scientific computing/machine learning (matrix multiply)
n Webserver: handle many unrelated requests simultaneously
n Input / output: it’s all happening simultaneously!
n Specialized languages for different tasks
q Some are easier to use (for some problems)
q None is particularly “easy” to use
n Parallel language examples for high-performance
computing
q OpenMP

15
Parallel Loops
n Serial execution:
for (int i=0; i<100; i++) {
...
}
n Parallel Execution:

n Parallel for in OpenMP


#include <omp.h>
#pragma omp parallel for
for (int i=0; i<100; i++) {
...
}
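A complete, runnable version of the same idea (the vector-add body is a made-up
example; any loop whose iterations are independent would work):

#include <stdio.h>
#include <omp.h>

#define N 100

int main(void) {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Iterations are independent, so OpenMP may split them across threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    printf("c[42] = %f\n", c[42]);
    return 0;
}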
16
OpenMP Example

The call to find the maximum number of threads that are available to do work is
omp_get_max_threads() (from omp.h).
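A minimal sketch of how this call is typically used (not the slide’s original
listing):

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Upper bound on the number of threads the next parallel region may use. */
    printf("up to %d threads available to do work\n", omp_get_max_threads());
    return 0;
}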

17
OpenMP
n C extension: no new language to learn
n Multi-threaded, shared-memory parallelism
q Compiler Directives, #pragma
q Runtime Library Routines, #include <omp.h>
n #pragma
q Ignored by compilers unaware of OpenMP
q Same source for multiple architectures
n E.g., same program for 1 & 16 cores
n Only works with shared memory
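One way the “same source for multiple architectures” property shows up in
practice is the standard _OPENMP macro (a minimal sketch; with GCC/Clang the
OpenMP switch is spelled -fopenmp, and without it the pragmas are ignored):

#include <stdio.h>
#ifdef _OPENMP              /* defined only when the compiler enables OpenMP */
#include <omp.h>
#endif

int main(void) {
#ifdef _OPENMP
    printf("OpenMP enabled: %d processors visible\n", omp_get_num_procs());
#else
    printf("OpenMP pragmas ignored: running serially\n");
#endif
    return 0;
}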

18
OpenMP Programming Model
n Fork - Join Model:

n OpenMP programs begin as a single process (main thread)
q Sequential execution
n When a parallel region is encountered
q Master thread “forks” into a team of parallel threads
q Executed simultaneously
q At the end of the parallel region, the parallel threads “join”, leaving only the
master thread
n The process repeats for each parallel region (a minimal sketch follows)
q Amdahl’s Law: the serial parts between parallel regions limit the overall speedup
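A minimal fork-join sketch (how many threads actually run depends on the
machine and on settings such as the OMP_NUM_THREADS environment variable):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("serial part: master thread only\n");     /* one thread      */

    #pragma omp parallel                             /* fork            */
    {
        printf("parallel region: thread %d\n", omp_get_thread_num());
    }                                                /* implicit join   */

    printf("serial part again: master thread only\n");
    return 0;
}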

19
What Kind of Threads?
n OpenMP threads are operating system (software)
threads
n OS will multiplex requested OpenMP threads onto
available hardware threads
n Hopefully each gets a real hardware thread to run on,
so no OS-level time-multiplexing
n But other tasks on machine compete for hardware
threads!
n Be “careful” (?) when timing results for Projects!
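A hedged sketch of timing with OpenMP’s wall-clock timer, omp_get_wtime();
repeating the measurement and keeping the best time is one way to reduce
interference from other tasks competing for the hardware threads:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double best = 1e30;

    for (int rep = 0; rep < 5; rep++) {              /* repeat to reduce noise */
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;
        double t1 = omp_get_wtime();
        if (t1 - t0 < best) best = t1 - t0;
    }
    printf("best parallel-for time: %f seconds\n", best);
    return 0;
}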

20
