
Parallel Computing and Programming

Lecture 6: Thread Level Parallelism: Data Dependence Solutions

Dr. Rony Kassam
IEF Tishreen Uni
S1 2021

Index
n Von Neumann vs Dataflow Models.
n ISA vs Microarchitecture.
n Single-cycle vs Multi-cycle Microarchitectures.
n Instruction Level Parallelism: Pipelining Intro.
n Instruction Level Parallelism: Issues in Pipeline Design.
n Thread Level Parallelism: Data Dependence Solutions.
n Thread Level Parallelism: Shared Memory and OpenMP.

2
Recall: How to Handle Data Dependences
n Anti and output dependences are easier to handle
q write to the destination in one stage and in program order

n Flow (true) dependences are more interesting (see the C sketch below)

n Five fundamental ways of handling flow dependences


q Detect and wait until value is available in register file
q Detect and forward/bypass data to dependent instruction
q Detect and eliminate the dependence at the software level
n No need for the hardware to detect dependence
q Predict the needed value(s), execute “speculatively”, and verify
q Do something else (fine-grained multithreading)
n No need to detect
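To make the three dependence types concrete, here is a minimal C sketch (the
variable names are illustrative, not taken from the lecture):

/* Flow, anti, and output dependences in one small fragment. */
void deps(void) {
    int a, b = 3, c;
    a = b + 1;   /* writes a, reads b                                      */
    c = a * 2;   /* flow (true/RAW) dependence: reads the a written above  */
    b = 7;       /* anti (WAR) dependence: overwrites the b read above     */
    a = 9;       /* output (WAW) dependence: overwrites a again            */
    (void)c;     /* keep the compiler from warning about the unused value  */
}

Only the flow dependence forces the hardware to wait for (or forward) a value;
the anti and output cases are naming conflicts that in-order, in-stage write-back
already resolves.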
3
How to Handle Control Dependences
n Critical to keep the pipeline full with correct sequence of
dynamic instructions.

n Potential solutions if the instruction is a control-flow instruction:

n Stall the pipeline until we know the next fetch address


n Guess the next fetch address (branch prediction)
n Employ delayed branching (branch delay slot)
n Do something else (fine-grained multithreading)
n Eliminate control-flow instructions (predicated execution; see the sketch below)
n Fetch from both possible paths (if you know the addresses
of both possible paths) (multipath execution)
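As a hedged, source-level illustration of predicated execution (real hardware
predication happens at the ISA level), a branch can sometimes be replaced by a
branch-free select that compilers often lower to a conditional-move instruction:

/* Branching version: the pipeline must predict or stall on the if. */
int max_branch(int x, int y) {
    if (x > y)
        return x;
    return y;
}

/* Branch-free version: both inputs are evaluated and one is selected
   using the predicate, so there is no control-flow instruction to predict. */
int max_select(int x, int y) {
    int take_x = (x > y);                 /* predicate: 0 or 1 */
    return take_x * x + (1 - take_x) * y;
}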
4
Improving Performance
n Increase clock rate fs:
q Reached practical maximum for today’s technology.
q < 5 GHz for general-purpose computers
n Lower CPI (cycles per instruction)
q SIMD, “instruction level parallelism”
n Perform multiple tasks simultaneously
q Multiple CPUs, each executing different program.
q Tasks may be related
n E.g., each CPU performs part of a big matrix multiplication
q or unrelated
n E.g., distribute different web (HTTP) requests over different computers
n Do all of the above:
q High fs, SIMD, multiple parallel tasks

5
Multithreading
n Typical scenario:
q Active thread encounters a cache miss
q Active thread waits ~1000 cycles for data from DRAM → switch out and run a
different thread until the data is available
n Problem
q Must save current thread state and load new thread state
n PC, all registers (could be many, e.g. AVX)
q must perform switch in ≪ 1000 cycles
n Can hardware help?
q Moore’s Law: transistors are plentiful

6
Multithreaded Pipeline Example

q Four copies of PC and Registers inside processor hardware


q Looks like four processors to software (hardware threads 0, 1, 2, 3)
q Hyper-Threading: all threads can be active simultaneously

7
Hyper-Threading

Simultaneous Multithreading (HT): Logical CPUs > Physical CPUs
• Run multiple threads at the same time per core
• Each thread has own architectural state (PC, Registers, etc.)
• Share resources (cache, instruction unit, execution units)
8
Conclusion I
n Logical threads
q ≈ 1% more hardware
q ≈ 10% (?) better performance
n Separate registers
n Share datapath, ALU(s), caches
n Multicore
q => Duplicate Processors
q ≈ 50% more hardware
q ≈ 2X better performance?
n Modern machines do both
q Multiple cores with multiple threads per core

9
Conclusion II
n Thread Level Parallelism
q Thread: sequence of instructions, with own program counter
and processor state (e.g., register file)
q Multicore:
n Physical CPU: one thread (at a time) per CPU; in software, the OS switches
threads, typically in response to I/O events like disk reads/writes
n Logical CPU: Fine-grain thread switching, in hardware, when
thread blocks due to cache miss/memory access
n Hyper-Threading aka Simultaneous Multithreading (SMT): Exploit
superscalar architecture to launch instructions from different
threads at the same time!

10
Conclusion III
n Sequential software execution speed is limited
q Clock rates flat or declining
n Parallelism the only path to higher performance
q SIMD: instruction level parallelism
n Implemented in all high-performance CPUs today (x86, ARM, ...)
n Partially supported by compilers
n 2X width every 3-4 years
q MIMD: thread level parallelism
n Multicore processors
n Supported by Operating Systems (OS)
n Requires programmer intervention to exploit at single program level
n Add 2 cores every 2 years (2, 4, 6, 8, 10, ...)
q Intel Xeon W-3275: 28 Cores, 56 Threads
q SIMD & MIMD for maximum performance
n Key challenge: craft parallel programs with high performance on
multiprocessors as the # of processors increases – i.e., programs that scale
q Scheduling, load balancing, time for synchronization, communication overhead

11
Languages Supporting Parallel Programming I

12
Languages Supporting Parallel Programming II
n Parallel Programming Models and Machines (plus some architecture, e.g., caches)
Algorithm/machine model        Language / Library skills
Shared memory                  OpenMP
PGAS
Distributed memory             MPI
Data parallel                  SPARK
                               CUDA

n Parallelization Strategies for the “Motifs” of Scientific Computing (and Data)


Dense Linear Algebra        Monte Carlo
Sparse Linear Algebra       Spectral Methods
Particle Methods            Graphs
Structured Grids            Sorting
Unstructured Grids          Hashing
n Performance models: Roofline, α-β (latency/bandwidth), LogP
n Cross-cutting: Communication avoiding, load balancing, hierarchical algorithms,
autotuning, Moore’s Law, Amdahl’s Law, Little’s Law
13
Why So Many Parallel Programming Languages?
n Why “intrinsics”? (see the AVX sketch at the end of this slide)
q To Intel: fix your #()&$! compiler, thanks...
n It’s happening ... But
q SIMD features are continually added to compilers
n (Intel, gcc)
q Intense area of research
q Research progress:
n 20+ years to translate C into good (fast!) assembly
n How long to translate C into good (fast!) parallel code?
q General problem is very hard to solve
q Present state: specialized solutions for specific cases
q Your opportunity to become famous!
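To show why programmers still reach for intrinsics when the compiler does not
vectorize well, here is a minimal AVX sketch in C (assumptions: an x86 CPU with
AVX, the <immintrin.h> header, and an array length that is a multiple of 8):

#include <immintrin.h>

/* Add two float arrays 8 elements at a time using AVX intrinsics. */
void vec_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);              /* load 8 floats   */
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&c[i], _mm256_add_ps(va, vb));  /* c[i..i+7] = a+b */
    }
}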

14
Parallel Programming Languages
n The number of choices is an indication of
q No universal solution
n Needs are very problem specific
q E.g.,
n Scientific computing/machine learning (matrix multiply)
n Webserver: handle many unrelated requests simultaneously
n Input / output: it’s all happening simultaneously!
n Specialized languages for different tasks
q Some are easier to use (for some problems)
q None is particularly “easy” to use
n Parallel language examples for high-performance
computing
q OpenMP

15
Parallel Loops
n Serial execution:
for (int i=0; i<100; i++) {
...
}
n Parallel Execution:

n Parallel for in OpenMP


#include <omp.h>
#pragma omp parallel for
for (int i=0; i<100; i++) {
...
}
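A complete, runnable version of the same idea (the vector-add body is a made-up
example; any loop whose iterations are independent would work):

#include <stdio.h>
#include <omp.h>

#define N 100

int main(void) {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Iterations are independent, so OpenMP may split them across threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    printf("c[42] = %f\n", c[42]);
    return 0;
}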
16
OpenMP Example

The call to find the maximum number of threads that are available to do work is
omp_get_max_threads() (from omp.h).
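A minimal sketch of how this call is typically used (not the slide’s original
listing):

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Upper bound on the number of threads the next parallel region may use. */
    printf("up to %d threads available to do work\n", omp_get_max_threads());
    return 0;
}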

17
OpenMP
n C extension: no new language to learn
n Multi-threaded, shared-memory parallelism
q Compiler Directives, #pragma
q Runtime Library Routines, #include <omp.h>
n #pragma
q Ignored by compilers unaware of OpenMP
q Same source for multiple architectures
n E.g., same program for 1 & 16 cores
n Only works with shared memory
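One way the “same source for multiple architectures” property shows up in
practice is the standard _OPENMP macro (a minimal sketch; with GCC/Clang the
OpenMP switch is spelled -fopenmp, and without it the pragmas are ignored):

#include <stdio.h>
#ifdef _OPENMP              /* defined only when the compiler enables OpenMP */
#include <omp.h>
#endif

int main(void) {
#ifdef _OPENMP
    printf("OpenMP enabled: %d processors visible\n", omp_get_num_procs());
#else
    printf("OpenMP pragmas ignored: running serially\n");
#endif
    return 0;
}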

18
OpenMP Programming Model
n Fork - Join Model:

n OpenMP programs begin as a single process (main thread)
q Sequential execution
n When a parallel region is encountered
q Master thread “forks” into a team of parallel threads
q Executed simultaneously
q At the end of the parallel region, the parallel threads “join”, leaving only the
master thread
n The process repeats for each parallel region (a minimal sketch follows)
q Amdahl’s Law: the serial parts between parallel regions limit the overall speedup
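A minimal fork-join sketch (how many threads actually run depends on the
machine and on settings such as the OMP_NUM_THREADS environment variable):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("serial part: master thread only\n");     /* one thread      */

    #pragma omp parallel                             /* fork            */
    {
        printf("parallel region: thread %d\n", omp_get_thread_num());
    }                                                /* implicit join   */

    printf("serial part again: master thread only\n");
    return 0;
}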

19
What Kind of Threads?
n OpenMP threads are operating system (software)
threads
n OS will multiplex requested OpenMP threads onto
available hardware threads
n Hopefully each gets a real hardware thread to run on,
so no OS-level time-multiplexing
n But other tasks on machine compete for hardware
threads!
n Be “careful” (?) when timing results for Projects!
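A hedged sketch of timing with OpenMP’s wall-clock timer, omp_get_wtime();
repeating the measurement and keeping the best time is one way to reduce
interference from other tasks competing for the hardware threads:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double best = 1e30;

    for (int rep = 0; rep < 5; rep++) {              /* repeat to reduce noise */
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;
        double t1 = omp_get_wtime();
        if (t1 - t0 < best) best = t1 - t0;
    }
    printf("best parallel-for time: %f seconds\n", best);
    return 0;
}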

20
