Lecture ParallelArchTLP-DLP
Topics
• Introduction
• Programming on shared memory systems (Chapter 7)
– OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on large-scale systems (Chapter 6)
– MPI (point-to-point and collectives)
– Introduction to PGAS languages, UPC and Chapel
• Analysis of parallel program executions (Chapter 5)
– Performance Metrics for Parallel Systems
• Execution Time, Overhead, Speedup, Efficiency, Cost
– Scalability of Parallel Systems
– Use of performance tools
Topics
• Programming on shared memory systems (Chapter 7)
– Cilk/Cilkplus and OpenMP tasking
– Pthreads, mutual exclusion, locks, synchronization
• Parallel architectures and memory
– Parallel computer architectures
• Thread Level Parallelism
• Data Level Parallelism
• Synchronization
– Memory hierarchy and cache coherency
• Manycore GPU architectures and programming
– GPU architectures
– CUDA programming
– Introduction to the offloading model in OpenMP
Lecture: Parallel Architecture – Thread Level Parallelism
Note:
• Parallelism in hardware
• Not (just) multi-/many-core architecture
Binary Code and Instructions
Pipeline
Pipeline and Superscalar
What do we do with that many transistors?
– Multi-issue processors:
• Allow multiple instructions to start execution per clock cycle
• Superscalar (Intel x86, AMD, …) vs. VLIW architectures
– VLIW/EPIC architectures:
• Allow compilers to indicate independent instructions per issue packet
• Example: Intel Itanium
– Vector units:
• Allow for the efficient expression and execution of vector operations
• Example: SSE - SSE4, AVX instructions
Limitations of optimizing a single instruction stream (II)
• Problem: within a single instruction stream we do not find enough independent instructions to execute simultaneously, due to
– data dependencies
– limitations of speculative execution across multiple branches
– difficulty detecting memory dependencies among instructions (alias analysis)
• Consequence: a significant number of functional units are idling at any given time
• Question: Can we instead execute instructions from another instruction stream?
– Another thread?
– Another process?
Hardware Multi-Threading (SMT)
• Three types of hardware multi-threading (single-core only):
– Coarse-grained MT
– Fine-grained MT
– Simultaneous Multi-threading
[Figure: issue-slot utilization compared for Superscalar, Coarse MT, Fine MT, and SMT]
Thread-level parallelism
• Problems for executing instructions from multiple threads at the same time
– The instructions in each thread might use the same register names
– Each thread has its own program counter
• Virtual memory management allows for the execution of multiple threads and sharing of the main memory
• When to switch between different threads:
– Fine-grain multithreading: switches between every instruction
– Coarse-grain multithreading: switches only on costly stalls (e.g. level 2 cache misses)
Simultaneous Multi-threading
• Converts thread-level parallelism into instruction-level parallelism
• Dynamically scheduled processors already have most hardware mechanisms in place to support SMT (e.g. register renaming)
• Required additional hardware:
– Register file per thread
– Program counter per thread
• Operating system view:
– If a CPU supports n simultaneous threads, the operating system views them as n processors (see the sketch below)
– The OS distributes the most time-consuming threads ‘fairly’ across the n processors that it sees
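On Linux, this view can be checked directly (a minimal sketch; sysconf counts logical CPUs, so an SMT-capable chip reports more processors than it has physical cores):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Logical processors visible to the OS: on an SMT machine this is
       (hardware threads per core) x (number of cores) */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("OS sees %ld processors\n", n);
    return 0;
}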
Example for SMT architectures (I)
• Intel Hyperthreading:
– First released for the Intel Xeon processor family in 2002
– Supports two architectural sets per CPU
– Each architectural set has its own
• General purpose registers
• Control registers
• Interrupt control registers
• Machine state registers
– Adds less than 5% to the relative chip size
Reference: D.T. Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture”, Intel Technology Journal, 6(1), 2002, pp. 4-15. ftp://download.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf
Example for SMT architectures (II)
• IBM Power 5
– Same pipeline as the IBM Power 4 processor, but with SMT support
– Further improvements:
• Increase the associativity of the L1 instruction cache
• Increase the size of the L2 and L3 caches
• Add separate instruction prefetch and buffering units for each SMT thread
• Increase the size of the issue queues
• Increase the number of virtual registers used internally by the processor
Simultaneous Multi-Threading
• Works well if
– The number of compute-intensive threads does not exceed the number of threads supported in SMT
– Threads have highly different characteristics (e.g. one thread doing mostly integer operations, another mainly doing floating point operations)
• Does not work well if
– Threads try to utilize the same functional units
– Assignment problems:
• e.g. a dual-processor system, each processor supporting 2 threads simultaneously (the OS thinks there are 4 processors)
• 2 compute-intensive application processes might end up on the same processor instead of on different processors (the OS does not see the difference between SMT and real processors!); one remedy is pinning threads explicitly, as sketched below
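A minimal sketch of such pinning on Linux (assumes _GNU_SOURCE and the glibc-specific pthread_setaffinity_np; the choice of CPU number is illustrative):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one logical CPU so that two compute-intensive
   threads do not end up co-scheduled on the same physical (SMT) core. */
void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}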
Lecture: Parallel Architecture – Data Level Parallelism
Classification of Parallel Architectures
Flynn’s Taxonomy
• SISD: Single instruction single data
– Classical von Neumann architecture
• SIMD: Single instruction multiple data
– Vector processors, GPUs, etc.
• MISD: Multiple instructions single data
– Nonexistent in practice, listed only for completeness
• MIMD: Multiple instructions multiple data
– Most common and general parallel machine
– Multi-/many- processors/cores/threads/computers
Single Instruction Multiple Data
• Also known as array processors
• A single instruction stream is broadcast to multiple processors, each having its own data stream
– Still used in some graphics cards today
[Figure: control unit broadcasting one instruction stream to an array of processors]
SIMD Instructions
• Originally developed for multimedia applications
• Same operation executed for multiple data items
• Uses a fixed-length register and partitions the carry chain to allow utilizing the same functional unit for multiple operations
– E.g. a 64-bit adder can be utilized for two 32-bit add operations simultaneously (see the sketch after this list)
• Instructions were originally not intended to be used by compilers, but just for handcrafting specific operations in device drivers
• All elements in a register have to be on the same memory page to avoid page faults within the instruction
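A software illustration of the partitioned carry chain (a minimal sketch; real SIMD hardware cuts the carry between lanes inside the adder, here explicit masking does it):

#include <stdint.h>

/* Add two pairs of 32-bit values packed into 64-bit words, without letting
   a carry from the low lane spill into the high lane. */
uint64_t packed_add32(uint64_t a, uint64_t b) {
    uint64_t lo = (a + b) & 0xFFFFFFFFULL;        /* low lane; carry into bit 32 is masked off */
    uint64_t hi = ((a >> 32) + (b >> 32)) << 32;  /* high lane computed independently */
    return hi | lo;
}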
Intel SIMD Instructions
• MMX (Multi-Media Extension) - 1996
– Existing 64-bit floating point registers could be used for eight 8-bit operations or four 16-bit operations
• SSE (Streaming SIMD Extension) - 1999
– Successor to the MMX instructions
– Separate 128-bit registers added for sixteen 8-bit, eight 16-bit, or four 32-bit operations
• SSE2 - 2001, SSE3 - 2004, SSE4 - 2007
– Added support for double precision operations
• AVX (Advanced Vector Extensions) - 2010
– 256-bit registers added (illustrated in the sketch below)
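As an illustration, a minimal AVX sketch in C (vec_add is a hypothetical helper; assumes n is a multiple of 8 and the arrays are 32-byte aligned):

#include <immintrin.h>

void vec_add(float *a, const float *b, const float *c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 vb = _mm256_load_ps(&b[i]);              /* load 8 floats from b */
        __m256 vc = _mm256_load_ps(&c[i]);              /* load 8 floats from c */
        _mm256_store_ps(&a[i], _mm256_add_ps(vb, vc));  /* a[i..i+7] = b[i..i+7] + c[i..i+7] */
    }
}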
Vector Processors
• Vector processors abstract operations on whole vectors, e.g. replace the following loop
for (i=0; i<n; i++) {
    a[i] = b[i] + c[i];
}
by a single vector operation, written in array notation as a[0:n-1] = b[0:n-1] + c[0:n-1]
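On commodity hardware the same idea can be expressed with OpenMP’s simd directive (a sketch; the directive asks, but does not force, the compiler to use vector instructions):

void add(float *a, const float *b, const float *c, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];   /* several iterations map to one vector instruction */
    }
}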
Synchronization between processors
• Required on all levels of multi-threaded programming
– Lock/unlock
– Mutual exclusion
– Barrier synchronization
pthread_mutex_t mymutex = PTHREAD_MUTEX_INITIALIZER;
int *cp;                               // points to the shared counter

void *worker(void *arg) {
    pthread_mutex_lock(&mymutex);
    (*cp)++;                           // read, increment, and write the shared variable
    pthread_mutex_unlock(&mymutex);
    return NULL;
}
Synchronization
• Lock/unlock operations on the hardware level, e.g.
– Lock operation returning 1 if the lock was free/available
– Lock operation returning 0 if the lock is unavailable
• Implementation using atomic exchange (compare and swap); see the sketch below
– A process sets the value of a register/memory location as required by the operation
– Setting the value must not be interrupted, in order to avoid race conditions
– Access by multiple processes/threads is resolved by write serialization
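A minimal C11 sketch of a lock built on atomic exchange (the 0 = free, 1 = held convention is assumed here; try_lock returns 1 when the lock was free, matching the slide):

#include <stdatomic.h>

typedef atomic_int spinlock_t;       /* 0 = free, 1 = held */

int try_lock(spinlock_t *l) {
    /* The exchange is one uninterruptible read-modify-write: an old value
       of 0 means the lock was free and the caller now holds it. */
    return atomic_exchange(l, 1) == 0;
}

void spin_lock(spinlock_t *l)   { while (!try_lock(l)) ; /* spin until acquired */ }
void spin_unlock(spinlock_t *l) { atomic_store(l, 0); }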
Synchronization (II)
• Other synchronization primitives (both sketched in C11 below):
– Test-and-set
– Fetch-and-increment
• Problems with all three primitives (atomic exchange, test-and-set, fetch-and-increment):
– They require a read and a write operation in a single, uninterruptible sequence
– Hardware cannot allow any operations between the read and the write operation
– This complicates cache coherence
– The implementation must not deadlock
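A C11 sketch of both primitives (the names flag and counter are illustrative):

#include <stdatomic.h>

atomic_flag flag = ATOMIC_FLAG_INIT;
atomic_int  counter = 0;

void tas_lock(void) {
    /* Test-and-set: atomically set the flag and return its previous value;
       the lock is held once the previous value was clear. */
    while (atomic_flag_test_and_set(&flag))
        ;  /* spin */
}

int next_ticket(void) {
    /* Fetch-and-increment: each caller atomically obtains a unique value. */
    return atomic_fetch_add(&counter, 1);
}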
Load linked/store conditional
• A pair of instructions where the second instruction returns a value indicating whether the pair was executed as if the instructions were atomic
• Special pair of load and store operations:
– Load linked (LL)
– Store conditional (SC): returns 1 if successful, 0 otherwise
• Store conditional fails (returns 0) if
– The contents of the memory location specified by LL changed before the SC
– The processor executed a context switch in between
Load linked/store conditional (II)
• Assembler code sequence to atomically exchange the contents of register R4 and the memory location specified by R1
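A MIPS-style sketch of the classic sequence (following Hennessy & Patterson; R2 and R3 are scratch registers introduced here):

try:  MOV  R3, R4       ; copy the exchange value into R3
      LL   R2, 0(R1)    ; load linked: read the current memory value
      SC   R3, 0(R1)    ; store conditional: attempt to write R3
      BEQZ R3, try      ; SC left 0 in R3 -> sequence was not atomic, retry
      MOV  R4, R2       ; success: the old memory value is now in R4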
Moore’s Law
• Long-term trend in the number of transistors per integrated circuit
• The number of transistors doubles every ~18 months
Source: http://en.wikipedia.org/wki/Images:Moores_law.svg
The “Future” of Moore’s Law
• The chips are down for Moore’s law
– http://www.nature.com/news/the-chips-are-down-for-moore-s-law-1.19338
• Special Report: 50 Years of Moore's Law
– http://spectrum.ieee.org/static/special-report-50-years-of-moores-law
• Moore’s law really is dead this time
– http://arstechnica.com/information-technology/2016/02/moores-law-really-is-dead-this-time/
• Rebooting the IT Revolution: A Call to Action (SIA/SRC, 2015)
– https://www.semiconductors.org/clientuploads/Resources/RITR%20WEB%20version%20FINAL.pdf