
Lecture: Parallel Architecture – Thread Level Parallelism and Data Level Parallelism

CSCE 569 Parallel Computing


Department of Computer Science and Engineering
Yonghong Yan
[email protected]
http://cse.sc.edu/~yanyh

Topics
• Introduction
• Programming on shared memory system (Chapter 7)
– OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on large scale systems (Chapter 6)
– MPI (point-to-point and collectives)
– Introduction to PGAS languages, UPC and Chapel
• Analysis of parallel program executions (Chapter 5)
– Performance Metrics for Parallel Systems
• Execution Time, Overhead, Speedup, Efficiency, Cost
– Scalability of Parallel Systems
– Use of performance tools
Topics
• Programming on shared memory system (Chapter 7)
– Cilk/Cilkplus and OpenMP Tasking
– Pthreads, mutual exclusion, locks, synchronization
• Parallel architectures and memory
– Parallel computer architectures
• Thread Level Parallelism
• Data Level Parallelism
• Synchronization
– Memory hierarchy and cache coherency
• Manycore GPU architectures and programming
– GPUs architectures
– CUDA programming
– Introduction to the offloading model in OpenMP
Lecture: Parallel Architecture –
Thread Level Parallelism

Note:
• Parallelism in hardware
• Not (just) multi-/many-core architecture

Binary Code and Instructions

• Compile a program with the -save-temps flag to see the generated
assembly code, or disassemble an existing binary with objdump -D,
as in the example below
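
For example (the file name is hypothetical; both commands are standard gcc/binutils usage):

gcc -save-temps -o hello hello.c   # keeps hello.i (preprocessed), hello.s (assembly), hello.o (object)
objdump -D hello                   # disassemble all sections of the resulting binary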
Stages to Execute an Instruction

Pipeline

Pipeline and Superscalar

What do we do with that many transistors?

• Optimizing the execution of a single instruction stream through


– Pipelining
• Overlap the execution of multiple instructions
• Example: all RISC architectures; Intel x86 underneath the
hood
– Out-of-order execution:
• Allow instructions to overtake each other, as long as data
dependencies (RAW, WAW, WAR) are respected
• Example: all commercial processors (Intel, AMD, IBM, SUN)
– Branch prediction and speculative execution:
• Reduce the number of stall cycles due to unresolved
branches
• Example: (nearly) all commercial processors
What do we do with that many transistors? (II)

– Multi-issue processors:
• Allow multiple instructions to start execution per clock cycle
• Superscalar (Intel x86, AMD, …) vs. VLIW architectures
– VLIW/EPIC architectures:
• Allow compilers to indicate independent instructions per
issue packet
• Example: Intel Itanium
– Vector units:
• Allow for the efficient expression and execution of vector
operations
• Example: SSE - SSE4, AVX instructions
Limitations of optimizing a single instruction
stream (II)
• Problem: within a single instruction stream we do not find
enough independent instructions to execute simultaneously, due to
– data dependencies
– limitations of speculative execution across multiple branches
– difficulty detecting memory dependencies among instructions
(alias analysis)
• Consequence: a significant number of functional units are idle at
any given time
• Question: Can we instead execute instructions from another
instruction stream?
– Another thread?
– Another process?
Hardware Multi-Threading (SMT)
• Three types of hardware multi-threading (single-core only):
– Coarse-grained MT
– Fine-grained MT
– Simultaneous Multi-threading
(Figure: issue-slot diagrams comparing Superscalar, Coarse MT, Fine MT, and SMT)
Thread-level parallelism
• Problems for executing instructions from multiple threads
at the same time
– The instructions in each thread might use the same register
names
– Each thread has its own program counter
• Virtual memory management allows for the execution of
multiple threads and sharing of the main memory
• When to switch between different threads:
– Fine-grain multithreading: switches between every instruction
– Coarse-grain multithreading: switches only on costly stalls (e.g.
level-2 cache misses)
Simultaneous Multi-threading
• Convert Thread-level parallelism to instruction-level
parallelism
• Dynamically scheduled processors already have most
hardware mechanisms in place to support SMT (e.g.
register renaming)
• Required additional hardware:
– Register file per thread
– Program counter per thread
• Operating system view:
– If a CPU supports n simultaneous threads, the Operating
System views them as n processors
– OS distributes the most time-consuming threads ‘fairly’ across
the n processors that it sees.
Example for SMT architectures (I)
• Intel Hyperthreading:
– First released for Intel Xeon processor family in 2002
– Supports two architectural state sets per CPU
– Each architectural state has its own
• General purpose registers
• Control registers
• Interrupt control registers
• Machine state registers
– Adds less than 5% to the relative chip size
Reference: D.T. Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture”,
Intel Technology Journal, 6(1), 2002, pp. 4-15.
ftp://download.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_t
echnology.pdf
Example for SMT architectures (II)
• IBM Power 5
– Same pipeline as IBM Power 4 processor but with SMT support
– Further improvements:
• Increase associativity of the L1 instruction cache
• Increase the size of the L2 and L3 caches
• Add separate instruction prefetch and buffering units for
each SMT thread
• Increase the size of issue queues
• Increase the number of virtual registers used internally by
the processor.
Simultaneous Multi-Threading
• Works well if
– Number of compute intensive threads does not exceed the number of
threads supported in SMT
– Threads have highly different characteristics (e.g. one thread doing mostly
integer operations, another mainly doing floating point operations)
• Does not work well if
– Threads try to utilize the same functional units
– Assignment problems:
• e.g. a dual processor system, each processor supporting 2 threads
simultaneously (OS thinks there are 4 processors)
• 2 compute intensive application processes might end up on the same
processor instead of different processors (OS does not see the difference
between SMT and real processors!)
Lecture: Parallel Architecture --
Data Level Parallelism

Classification of Parallel Architectures
Flynn’s Taxonomy
• SISD: Single instruction single data
– Classical von Neumann architecture
• SIMD: Single instruction multiple data
– Vector, GPU, etc
• MISD: Multiple instructions single data
– Nonexistent, listed only for completeness
• MIMD: Multiple instructions multiple data
– Most common and general parallel machine
– Multi-/many- processors/cores/threads/computers
Single Instruction Multiple Data
• Also known as array processors
• A single instruction stream is broadcast to multiple
processors, each having its own data stream
– Still used in some graphics cards today

(Figure: a control unit broadcasting one instruction stream to multiple processors, each with its own data stream)

SIMD In Real
• Hardware for data-level parallelism
for (i=0; i<n; i++) {
a[i] = b[i] + c[i];
}

• Three major implementations
– SIMD extensions to the conventional CPU
– Vector architectures
– GPU variant

SIMD Instructions
• Originally developed for Multimedia applications
• Same operation executed for multiple data items
• Uses a fixed-length register and partitions the carry chain to allow
utilizing the same functional unit for multiple operations
– E.g. a 64 bit adder can be utilized for two 32-bit add
operations simultaneously
• Instructions originally not intended to be used by compilers, but just for
handcrafting specific operations in device drivers
• All elements in a register have to be on the same memory page to
avoid page faults within the instruction
Intel SIMD Instructions
• MMX (Multi-Media Extension) - 1996
– Existing 64-bit floating-point registers could be used for eight
8-bit operations or four 16-bit operations
• SSE (Streaming SIMD Extension) – 1999
– Successor to MMX instructions
– Separate 128-bit registers added for sixteen 8-bit, eight 16-bit,
or four 32-bit operations
• SSE2 – 2001, SSE3 – 2004, SSE4 - 2007
– Added support for double precision operations
• AVX (Advanced Vector Extensions) - 2010
– 256-bit registers added
Vector Processors
• Vector processors abstract operations on vectors, e.g.
replace the following loop
for (i=0; i<n; i++) {
a[i] = b[i] + c[i];
}

by

a = b + c; ADDV.D V10, V8, V6

• Some languages offer high-level support for these operations
(e.g. Fortran 90 or newer)
AVX Instructions
AVX Instruction Description
VADDPD Add four packed double-precision operands
VSUBPD Subtract four packed double-precision operands
VMULPD Multiply four packed double-precision operands
VDIVPD Divide four packed double-precision operands
VFMADDPD Multiply and add four packed double-precision operands
VFMSUBPD Multiply and subtract four packed double-precision operands
VCMPxx Compare four packed double-precision operands for EQ,
NEQ, LT, LTE, GT, GE…
VMOVAPD Move aligned four packed double-precision operands
VBROADCASTSD Broadcast one double-precision operand to four locations in a
256-bit register
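
As an illustration, the loop a[i] = b[i] + c[i] from earlier can be written with AVX intrinsics in C. A minimal sketch, assuming n is a multiple of 4 and 32-byte-aligned arrays (compile with -mavx):

#include <immintrin.h>

/* a[i] = b[i] + c[i] using 256-bit AVX registers, 4 doubles at a time */
void add_avx(double *a, const double *b, const double *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m256d vb = _mm256_load_pd(&b[i]);  /* load 4 packed doubles */
        __m256d vc = _mm256_load_pd(&c[i]);
        __m256d va = _mm256_add_pd(vb, vc);  /* compiles to VADDPD */
        _mm256_store_pd(&a[i], va);          /* aligned store of 4 doubles */
    }
}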
Main concepts
• Advantages of vector instructions
– A single instruction specifies a great deal of work
– Each loop iteration must be free of data dependences on
other loop iterations, therefore:
• No need to check for data hazards between loop iterations
• Only one check required between two vector instructions
• Loop branches are eliminated
Basic vector architecture
• A modern vector processor contains
– Regular, pipelined scalar units
– Regular scalar registers
– Vector units (the inventors of pipelining!)
– Vector register: can hold a fixed number of entries (e.g. 64)
– Vector load-store units
Comparison MIPS code vs. vector code
Example: Y=aX+Y for 64 elements

      L.D    F0, a        /* load scalar a */
      DADDIU R4, Rx, #512 /* last address: 64 elements * 8 bytes */
L:    L.D    F2, 0(Rx)    /* load X(i) */
      MUL.D  F2, F2, F0   /* calc. a times X(i) */
      L.D    F4, 0(Ry)    /* load Y(i) */
      ADD.D  F4, F4, F2   /* a*X(i) + Y(i) */
      S.D    F4, 0(Ry)    /* store Y(i) */
      DADDIU Rx, Rx, #8   /* increment X pointer */
      DADDIU Ry, Ry, #8   /* increment Y pointer */
      DSUBU  R20, R4, Rx  /* compute bound */
      BNEZ   R20, L       /* loop until done */
Comparison MIPS code vs. vector code (II)
Example: Y=aX+Y for 64 elements

      L.D     F0, a      /* load scalar a */
      LV      V1, 0(Rx)  /* load vector X */
      MULVS.D V2, V1, F0 /* vector-scalar multiply */
      LV      V3, 0(Ry)  /* load vector Y */
      ADDV.D  V4, V2, V3 /* vector add */
      SV      V4, 0(Ry)  /* store vector Y */
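
For these 64 elements, the scalar MIPS loop executes 9 instructions per
iteration, roughly 580 dynamic instructions in total (2 setup + 9 * 64 = 578),
while the vector version above needs only 6.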
Vector length control
• What happens if the length is not matching the length of
the vector registers?
• A vector-length register (VLR) contains the number of
elements used within a vector register
• Strip mining: split a large loop into loops less or equal the
maximum vector length (MVL)
Vector length control (II)
low = 0;
VL = (n mod MVL);                /* first piece: the odd-sized remainder */
for (j = 0; j <= n/MVL; j++) {   /* note <=: all n/MVL + 1 pieces run */
  for (i = low; i < low + VL; i++) {
    Y(i) = a * X(i) + Y(i);
  }
  low += VL;                     /* start of the next piece */
  VL = MVL;                      /* remaining pieces have maximum length */
}
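
Worked example: for n = 200 and MVL = 64, n mod MVL = 8 and n/MVL = 3, so the
loop makes 4 passes: the first handles 8 elements, the remaining three handle
64 each (8 + 3 * 64 = 200).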
Vector stride
• Memory is typically organized in multiple banks
– Allow for independent management of different memory
addresses
– Memory-bank busy time is an order of magnitude larger than
the CPU clock cycle
• Example: assume 8 memory banks and 6 cycles of memory
bank time to deliver a data item
– Overlapping of multiple data requests by the hardware
Vector stride (II)
• What happens if the code does not access subsequent
elements of the vector
for (i=0; i<n; i+=2) {
a[i] = b[i] + c[i];
}
– Vector load ‘compacts’ the data items in the vector register
(gather)
• No effect on the execution of the loop
• You might however use only a subset of the memory banks
-> longer load time
• Worst case: stride is a multiple of the number of memory
banks (see the worked example below)
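
Worked example: with 8 banks and a 6-cycle bank busy time, a stride of 8 maps
every access to the same bank, so the load delivers one element every 6 cycles
instead of one per cycle once the pipeline is full.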
Conditional execution
• Consider the following loop
for (i=0; i< N; i++ ) {
if ( A(i) != 0 ) {
A(i) = A(i) – B(i);
}
}
• The loop usually cannot be vectorized because of the
conditional statement
• Vector-mask control: a boolean vector of length MVL controls
whether an instruction is executed or not
– Per element of the vector
Conditional execution (II)
LV V1, Ra /* load vector A into V1 */
LV V2, Rb /* load vector B into V2 */
L.D F0, #0 /* set F0 to zero */
SNEVS.D V1, F0 /* set VM(i)=1 if V1(i)!=F0 */
SUBV.D V1, V1, V2 /* sub using vector mask*/
CVM /* reset vector mask to all 1s */
SV V1, Ra /* store V1 */
Support for sparse matrices
• Access of non-zero elements in a sparse matrix often
described by
A(K(i)) = A(K(i)) + C (M(i))
– K(i) and M(i) describe which elements of A and C are non-zero
– The numbers of non-zero elements have to match; their
locations do not
• Gather-operation: take an index vector and fetch the
according elements using a base-address
– Mapping from a non-contiguous to a contiguous
representation
• Scatter-operation: inverse of the gather operation
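
In C, the scalar loop being vectorized here would look like the following
sketch (the array and index names are illustrative):

for (i = 0; i < n; i++)
    a[k[i]] = a[k[i]] + c[m[i]];  /* k[] and m[] list the non-zero positions */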
Support for sparse matrices (II)
LV Vk, Rm         /* load index vector K into Vk */
LVI Va, (Ra+Vk)   /* load vector indexed A(K(i)) */
LV Vm, Rm         /* load index vector M into Vm */
LVI Vc, (Rc+Vm)   /* load vector indexed C(M(i)) */
ADDV.D Va, Va, Vc /* add the non-zero elements */
SVI Va, (Ra+Vk)   /* store vector indexed A(K(i)) */
• Note:
– Compiler needs the explicit hint, that each element of K is
pointing to a distinct element of A
– Hardware alternative: a hash table keeping track of the
address acquired
• Start of a new vector iteration (convoy) as soon as an
address appears the second time
Lecture: Parallel Architecture –
Synchronization between processors,
cores and HW threads

Synchronization between processors
• Required on all levels of multi-threaded programming
– Lock/unlock
– Mutual exclusion
– Barrier synchronization

• Key hardware capability:
– An uninterruptible instruction capable of atomically retrieving
and changing a value
Race Condition
int count = 0;
int *cp = &count;
….
(*cp)++; /* executed by two threads: read, increment, write */

Note: *cp++ would increment the pointer; (*cp)++ increments the shared
counter, and even then the read-modify-write sequence is not atomic.

Pictures from Wikipedia: http://en.wikipedia.org/wiki/Race_condition


Simple Example (IIIb)

void *thread_func (void *arg) {
    int *cp = (int *) arg;

    pthread_mutex_lock (&mymutex);   // mymutex: a global pthread_mutex_t
    (*cp)++;  // read, increment and write shared variable
    pthread_mutex_unlock (&mymutex);

    return NULL;
}
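
A complete, compilable version of this sketch (the iteration count and thread
count are illustrative assumptions; build with gcc -pthread):

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t mymutex = PTHREAD_MUTEX_INITIALIZER;

void *thread_func(void *arg)
{
    int *cp = (int *) arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&mymutex);
        (*cp)++;                      /* critical section: atomic w.r.t. the other thread */
        pthread_mutex_unlock(&mymutex);
    }
    return NULL;
}

int main(void)
{
    int count = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread_func, &count);
    pthread_create(&t2, NULL, thread_func, &count);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("count = %d\n", count);    /* always 200000 with the mutex; varies without it */
    return 0;
}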
Synchronization
• Lock/unlock operations on the hardware level, e.g.
– Lock returning 1 if lock is free/available
– Lock returning 0 if lock is unavailable
• Implementation using atomic exchange (compare and swap)
– Process sets the value of a register/memory location to the
required operation
– Setting the value must not be interrupted in order to avoid
race conditions
– Access by multiple processes/threads will be resolved by write
serialization
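
As a software-level analogue, a minimal sketch of such a lock using C11
atomics (the names and the 0 = free / 1 = held convention are assumptions,
not the hardware interface described above):

#include <stdatomic.h>

atomic_int lock_var = 0;                    /* 0 = free, 1 = held */

void acquire(atomic_int *l)
{
    /* atomic_exchange returns the old value: 0 means the lock was free */
    while (atomic_exchange(l, 1) != 0)
        ;                                   /* spin until we observe 'free' */
}

void release(atomic_int *l)
{
    atomic_store(l, 0);                     /* mark the lock free again */
}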
Synchronization (II)
• Other synchronization primitives:
– Test-and-set
– Fetch-and-increment
• Problems with all three algorithms:
– Require a read and a write operation in a single,
uninterruptible sequence
– Hardware cannot allow any operations between the read and
the write operation
– Complicates cache coherence
– Must not deadlock
Load linked/store conditional
• A pair of instructions where the second instruction returns a
value indicating whether the pair was executed as if the
instructions were atomic
• Special pair of load and store operations
– Load linked (LL)
– Store conditional (SC): returns 1 if successful, 0 otherwise
• Store conditional fails if
– The contents of the memory location specified by LL changed
before the SC
– The processor executed a context switch between LL and SC
Load linked/store conditional (II)
• Assembler code sequence to atomically exchange the
contents of register R4 and the memory location specified
by R1

try: MOV  R3, R4    /* move exchange value into R3 */
     LL   R2, 0(R1) /* load linked: read memory at R1 */
     SC   R3, 0(R1) /* store conditional: R3 = 1 on success, 0 on failure */
     BEQZ R3, try   /* retry if the store conditional failed */
     MOV  R4, R2    /* return the loaded value in R4 */
Load linked/store conditional (III)
• Implementing fetch-and-increment using load linked and
conditional store
try: LL     R2, 0(R1)  /* load linked: read the current value */
     DADDUI R3, R2, #1 /* increment into R3 */
     SC     R3, 0(R1)  /* store conditional: succeeds only if unchanged */
     BEQZ   R3, try    /* retry if the store conditional failed */
• Implementation of LL/SC by using a special Link Register,
which contains the address of the operation
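
In C, the same primitive is exposed through C11 atomics; on most ISAs the
compiler lowers this to exactly such an LL/SC loop (or a single atomic add).
A minimal sketch:

#include <stdatomic.h>

/* Atomically increments *p and returns the old value,
   just like the LL/SC sequence above. */
int fetch_and_increment(atomic_int *p)
{
    return atomic_fetch_add(p, 1);
}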
Spin locks
• A lock that a processor continuously tries to acquire, spinning around in a
loop until it succeeds.
• Trivial implementation

        DADDUI R2, R0, #1  !load the 'locked' value 1
lockit: EXCH   R2, 0(R1)   !atomic exchange with the lock variable
        BNEZ   R2, lockit  !old value != 0: lock was held, retry
• Since the EXCH operation includes a read and a modify operation
– Value will be loaded into the cache
• Good if only one processor tries to access the lock
• Bad if multiple processors in an SMP try to get the lock (cache coherence)
– EXCH includes a write attempt, which will lead to a write-miss for SMPs
Spin locks (II)
• For cache coherent SMPs, slight modification of the loop
required

lockit: LD     R2, 0(R1)  !load the lock
        BNEZ   R2, lockit !lock available?
        DADDUI R2, R0, #1 !load locked value
        EXCH   R2, 0(R1)  !atomic exchange
        BNEZ   R2, lockit !EXCH successful?
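
The same load-then-exchange (test-and-test-and-set) pattern can be sketched
with C11 atomics, reusing the conventions of the acquire()/release() sketch
shown earlier:

#include <stdatomic.h>

void acquire_tts(atomic_int *l)
{
    for (;;) {
        while (atomic_load(l) != 0)
            ;                             /* spin on a cached read: no write traffic */
        if (atomic_exchange(l, 1) == 0)
            return;                       /* the exchange saw 'free': lock acquired */
    }
}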
Spin locks (III)
• …or using LL/SC
lockit: LL R2, 0(R1) !load the lock
BNEZ R2, lockit !lock available?
DADDUI R2, R0, #1 !load locked value
SC R2, 0(R1) !atomic exchange
BNEZ R2, lockit !SC successful?
Lecture: Parallel Architecture –
Moore’s Law

Moore’s Law
• Long-term trend on the number of transistor per integrated circuit
• Number of transistors doubles every ~18 months

Source: http://en.wikipedia.org/wiki/Images:Moores_law.svg
The “Future” of Moore’s Law
• The chips are down for Moore’s law
– http://www.nature.com/news/the-chips-are-down-for-moore-s-law-1.19338
• Special Report: 50 Years of Moore's Law
– http://spectrum.ieee.org/static/special-report-50-years-of-moores-law
• Moore’s law really is dead this time
– http://arstechnica.com/information-technology/2016/02/moores-law-really-is-dead-this-time/
• Rebooting the IT Revolution: A Call to Action (SIA/SRC,
2015)
– https://www.semiconductors.org/clientuploads/Resources/RITR%20WEB%20version%20FINAL.pdf
