Lec6 - TLP Data Dependence Solutions
Recall: How to Handle Data Dependences
- Anti and output dependences are easier to handle
  - Write to the destination in one stage and in program order
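As a reminder of what those dependences look like in software, here is a minimal C sketch (illustrative, not from the slides; the variable names are made up): the write to a creates an anti (write-after-read) dependence, and the two writes to y create an output (write-after-write) dependence.

    /* Illustrative sketch of anti (WAR) and output (WAW) dependences */
    int dependence_example(int a, int b, int c, int d) {
        int x, y;
        x = a + b;   /* reads a                                         */
        a = c * d;   /* anti dependence: writes a after the read above  */
        y = x + 1;   /* writes y                                        */
        y = x * 2;   /* output dependence: the later write must survive */
        return x + y;
    }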
Multithreading
- Typical scenario:
  - Active thread encounters a cache miss
  - Active thread waits ~1000 cycles for data from DRAM → switch out and run a different thread until the data is available
- Problem
  - Must save current thread state and load new thread state
    - PC, all registers (could be many, e.g. AVX; see the rough estimate below)
  - Must perform the switch in ≪ 1000 cycles
- Can hardware help?
  - Moore's Law: transistors are plentiful
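A rough back-of-envelope of how much register state a switch has to move (illustrative sizes for an x86-64 core with AVX-512; not from the slides):

    /* Illustrative estimate of per-thread register state (x86-64 + AVX-512 assumed) */
    #include <stdio.h>
    int main(void) {
        int gpr_bytes = 16 * 8;    /* 16 general-purpose registers x 8 bytes */
        int zmm_bytes = 32 * 64;   /* 32 AVX-512 ZMM registers x 64 bytes    */
        printf("~%d bytes of register state per thread\n", gpr_bytes + zmm_bytes);
        return 0;
    }

Saving and restoring a couple of kilobytes in software costs a substantial fraction of the ~1000-cycle miss, which is why hardware support (e.g., a separate register set per hardware thread) is what makes the switch cheap.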
Multithreaded Pipeline Example
Hyper-Threading
Conclusion II
- Thread Level Parallelism
  - Thread: sequence of instructions, with its own program counter and processor state (e.g., register file)
  - Multicore:
    - Physical CPU: one thread (at a time) per CPU; in software, the OS switches threads, typically in response to I/O events like disk read/write
    - Logical CPU: fine-grain thread switching, in hardware, when a thread blocks due to a cache miss/memory access
    - Hyper-Threading aka Simultaneous Multithreading (SMT): exploit superscalar architecture to launch instructions from different threads at the same time!
Conclusion III
- Sequential software execution speed is limited
  - Clock rates flat or declining
- Parallelism is the only path to higher performance
  - SIMD: instruction level parallelism
    - Implemented in all high-performance CPUs today (x86, ARM, ...)
    - Partially supported by compilers
    - 2x width every 3-4 years
  - MIMD: thread level parallelism
    - Multicore processors
    - Supported by operating systems (OS)
    - Requires programmer intervention to exploit at the single-program level
    - Add 2 cores every 2 years (2, 4, 6, 8, 10, ...)
      - Intel Xeon W-3275: 28 cores, 56 threads
  - SIMD & MIMD together for maximum performance
- Key challenge: craft parallel programs with high performance on multiprocessors as the number of processors increases, i.e., programs that scale
  - Scheduling, load balancing, time for synchronization, communication overhead
Languages Supporting Parallel Programming I
Languages Supporting Parallel Programming II
- Parallel Programming Models and Machines (plus some architecture, e.g., caches)

  Algorithm/machine model    Language / Library
  Shared memory              OpenMP, PGAS
  Distributed memory         MPI
  Data parallel              Spark, CUDA
Parallel Programming Languages
- The number of choices is an indication that:
  - There is no universal solution
    - Needs are very problem-specific
  - E.g.,
    - Scientific computing/machine learning (matrix multiply)
    - Webserver: handle many unrelated requests simultaneously
    - Input/output: it's all happening simultaneously!
- Specialized languages for different tasks
  - Some are easier to use (for some problems)
  - None is particularly "easy" to use
- Parallel language examples for high-performance computing
  - OpenMP
Parallel Loops
- Serial execution:

    for (int i = 0; i < 100; i++) {
        ...
    }

- Parallel execution:
  - The call to find the maximum number of threads that are available to do work is omp_get_max_threads() (from omp.h). A sketch of one way to run the loop in parallel follows.
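A minimal sketch of the parallel version (illustrative, not the slide's code; work() is a made-up stand-in for the loop body, and the iterations are assumed independent): the parallel-for directive splits the iterations across the team of threads, and omp_get_max_threads() reports how many threads OpenMP could use.

    #include <stdio.h>
    #include <omp.h>

    void work(int i) { (void)i; /* stand-in for the real loop body */ }

    int main(void) {
        printf("max threads available: %d\n", omp_get_max_threads());

        #pragma omp parallel for   /* iterations of i are divided among the threads */
        for (int i = 0; i < 100; i++) {
            work(i);
        }
        return 0;
    }

Build with OpenMP enabled, e.g. gcc -fopenmp.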
OpenMP
- C extension: no new language to learn
- Multi-threaded, shared-memory parallelism
  - Compiler directives: #pragma
  - Runtime library routines: #include <omp.h>
- #pragma
  - Ignored by compilers unaware of OpenMP
  - Same source for multiple architectures
    - E.g., same program for 1 and 16 cores
- Only works with shared memory
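A small sketch of both mechanisms (illustrative; the _OPENMP guard and the fallback functions are one common way to keep a single source file building with or without OpenMP): the #pragma is a compiler directive, and omp_get_thread_num()/omp_get_num_threads() are runtime library routines from omp.h. A compiler that does not know OpenMP simply ignores the pragma and the program runs on one thread.

    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #else
    /* fallbacks so the same source still builds without OpenMP */
    static int omp_get_thread_num(void)  { return 0; }
    static int omp_get_num_threads(void) { return 1; }
    #endif

    int main(void) {
        #pragma omp parallel   /* compiler directive: ignored if the compiler lacks OpenMP */
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(),      /* runtime library routines */
                   omp_get_num_threads());
        }
        return 0;
    }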
OpenMP Programming Model
- Fork-Join model:
  - Threads in a parallel region are executed simultaneously; the master thread forks a team at the start of the region and joins (waits for) it at the end
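A minimal sketch of the fork-join pattern (illustrative code, not from the slides):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        printf("serial part: one thread\n");

        #pragma omp parallel   /* fork: a team of threads runs this block simultaneously */
        {
            printf("parallel part: thread %d\n", omp_get_thread_num());
        }                      /* join: all threads finish before execution continues    */

        printf("serial part again: one thread\n");
        return 0;
    }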
What Kind of Threads?
- OpenMP threads are operating system (software) threads
- The OS will multiplex the requested OpenMP threads onto the available hardware threads
- Hopefully each one gets a real hardware thread to run on, so there is no OS-level time-multiplexing
- But other tasks on the machine compete for hardware threads!
- Be "careful" (?) when timing results for Projects! (a timing sketch follows)
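One way to time a parallel region (a sketch under the assumption that work() stands in for the real computation): omp_get_wtime() gives wall-clock seconds, and recording omp_get_num_threads() shows how many threads actually ran, which helps spot oversubscription or interference from other tasks.

    #include <stdio.h>
    #include <omp.h>

    void work(int i) { (void)i; /* stand-in for the real computation */ }

    int main(void) {
        int threads_used = 1;
        double start = omp_get_wtime();            /* wall-clock time in seconds */

        #pragma omp parallel
        {
            #pragma omp single
            threads_used = omp_get_num_threads();  /* threads that actually ran  */

            #pragma omp for
            for (int i = 0; i < 1000000; i++) {
                work(i);
            }
        }

        double elapsed = omp_get_wtime() - start;
        printf("%d threads, %.6f seconds\n", threads_used, elapsed);
        return 0;
    }

Run it a few times and take the best or median measurement, since other processes can perturb any single run.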