Types of Parallelism
• Implicit Parallelism: The compiler, runtime, or system automatically identifies
and manages parallelism without requiring the programmer to explicitly define it.
Explicit Parallelism
• A programming approach where the developer explicitly instructs the system
where and how to parallelize operations using dedicated language features like
operators, function calls, or directives.
Task Parallelism
• In contrast to data parallelism, task parallelism breaks a problem down into independent tasks that can be executed concurrently on different processors: different operations are performed on different parts of the data.
• Task parallelism is called "coarse grain" parallelism because the computational work
is spread into just a few subtasks.
• More code is run in parallel because the parallelism is implemented at a higher level
than in data parallelism.
• Easier to implement and has less overhead than data parallelism.
• Example:
#pragma omp parallel for
for (int i = 0; i < n; i++)
a[i] = b[i] + c[i];
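• For comparison, a minimal sketch of running two different operations concurrently with OpenMP sections (the functions compute_fft and compute_stats are hypothetical placeholders):
  #pragma omp parallel sections
  {
      #pragma omp section
      compute_fft(signal, n);     /* task 1: one operation on one data set            */
      #pragma omp section
      compute_stats(samples, m);  /* task 2: a different operation, run concurrently  */
  }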
Thread-Level Parallelism
• Deals with running multiple threads within a program so that different parts of the code execute simultaneously. The programmer explicitly creates and manages the threads that run concurrently.
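• A minimal sketch using POSIX threads (the worker functions are hypothetical placeholders; error handling omitted):
  #include <pthread.h>

  void *worker_a(void *arg) { /* ... part A of the work ... */ return NULL; }
  void *worker_b(void *arg) { /* ... part B of the work ... */ return NULL; }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, worker_a, NULL);  /* explicitly create the threads */
      pthread_create(&t2, NULL, worker_b, NULL);
      pthread_join(t1, NULL);                     /* wait for both to finish       */
      pthread_join(t2, NULL);
      return 0;
  }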
Implicit Parallelism
• The compiler, runtime, or system automatically identifies and manages
parallelism without requiring the programmer to explicitly define it.
Instruction Level Parallelism
• Instruction Level Parallelism (ILP) is a set of techniques for executing multiple
instructions at the same time within the same CPU core (Note: ILP has
nothing to do with multicore)
• Basic Idea: Execute several instructions in parallel; i.e., overlap the execution
of instructions to run programs faster (“to improve performance”)
• The Problem: A CPU core has a lot of circuits, but at any given moment many parts remain idle. This wastes processing power.
• The Solution: Different parts of the CPU core work on different instructions simultaneously instead of waiting for one instruction to finish before starting the next. If a CPU can handle 10 operations at once, a program could run up to 10 times faster, though in reality the speedup is usually lower.
Examples of Instruction Level Parallelism
• Pipelining: Start performing an operation on one piece of data while finishing
the same operation on another piece of data – perform different stages of the
same INSTRUCTION on different sets of operands at the same time (like an
assembly line). Example: While adding two numbers, the CPU can start
preparing the next addition before finishing the first.
• Superscalar: The CPU can execute multiple different instructions at the same
time. Example: It can add, multiply, and load data simultaneously.
• Vectorization: The CPU performs the same operation on multiple data points at once instead of processing them one by one. Example: Adding two lists of numbers in one step instead of adding each number separately.
5 Stages in Executing a MIPS Instruction
• IF: Instruction Fetch, Increment Program Counter
• ID: Instruction Decode, Read Registers
• EX: Execution
  • Memory reference: Calculate Address
  • Arithmetic/logical: Perform Operation
• MEM:
  • Load: Read Data from Memory
  • Store: Write Data to Memory
• WB: Write Data Back to Register
Instruction Execution (figure)
Pipelining Execution (figure)
Superscalar Execution (figure)
Functional Units in a CPU (figure)
Super-pipelining: Superscalar + Pipeline (figure)
Vectorization
• A vector register is a register that’s made
up of many individual registers that
simultaneously perform the same
operation on multiple sets of operands,
producing multiple results. In a sense,
vectors are like operation-specific cache.
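• A hedged sketch of a vectorizable loop: the same addition is applied to every element, so the compiler can map it onto vector registers; the OpenMP simd directive is one portable way to request this (array names are illustrative):
  #pragma omp simd
  for (int i = 0; i < n; i++)
      a[i] = b[i] + c[i];   /* several elements handled per vector instruction */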
What determines the degree of ILP?
• Dependencies: property of the program
• Hazards: property of the pipeline
Types of Dependencies
• Data Dependence: When one instruction needs data from a previous
instruction before it can execute. The second instruction must wait for the first to
finish. (True Dependence)
• RAW: Read-After-Write
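• A minimal C illustration of a RAW dependence (variable names are for this sketch only):
  x = a + b;   /* S1 writes x                                  */
  y = x * 2;   /* S2 reads x, so it must wait for S1 to finish */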
Types of Dependencies
• Write-After-Write (WAW or Output Dependence): Occurs when two instructions write to the same variable. The second write cannot execute before the first completes.
• Control Dependence:
• When an instruction depends on a decision (branch/jump) from a previous instruction.
• The second instruction must wait to see if the first condition is true.
What is Dependency Analysis?
• It examines how different parts of a program depend on each other.
• Ensures that parts of the program execute in the correct order.
• Prevents incorrect results due to improper instruction execution.
• A key role of modern compilers, which optimize code execution while respecting
dependencies.
• A data dependency describes how different pieces of data affect each other.
• A control dependency describes how instruction sequences affect each other.
S1: X = A + B
S2: C = X + 1
(S2 is data dependent on S1 through X)
Loop Carried Dependency
FOR i = 2 to MAX
a[i] = a[i-1] + b[i]
END FOR
• There is no way to execute iteration i until after iteration i-1 has completed, so
this loop can’t be parallelized.
Why do we care?
• Loops are very common in many programs.
• Also, it’s easier to optimize loops than more arbitrary sequences of
instructions: when a program does the same thing over and over, it’s easier to
predict what’s likely to happen next.
• Loops are the favorite control structures of High Performance Computing. Both hardware vendors and compiler writers put great effort into optimizing loop performance using instruction-level parallelism (superscalar execution, pipelining, and vectorization).
ILP and Data Dependencies, Hazards
• HW/SW must preserve program order: the code must give the same results as if the instructions were executed sequentially, in the original order of the source program.
• Goal: Exploit parallelism by preserving program order only where it affects the
outcome of the program
Name Dependence and Hazards
• Name Dependence: When two instructions use the same variable name (register or memory location), but there is no flow of data between them associated with that variable name.
• Anti-Dependence: a later instruction S2 writes an operand that an earlier instruction S1 reads (S2 must not write it before S1 reads it)
  S1: A = X + B
  S2: X = C + D
• Output Dependence: a later instruction S2 writes an operand that an earlier instruction S1 also writes (S2 must not write it before S1 does)
  S1: X = A + B
  ...
  S2: X = C + D
• If an anti-dependence causes a hazard in the pipeline, the hazard is called a Write After Read (WAR) hazard.
• If an output dependence causes a hazard in the pipeline, the hazard is called a Write After Write (WAW) hazard.
Control Dependencies
• Every program has a well-defined flow of control that moves from instruction to
instruction. Every instruction is control dependent on some set of branches (if
condition, switch case, function calls, I/O), and, in general, these control
dependencies must be preserved to preserve program order.
IF p1 THEN
S1
IF p2 THEN
S2
• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Speculation
• Types of predictors
• Branch Prediction: Guessing which path (if/else) the program will take.
• Value Prediction: Predicting future values based on past data
• Prefetching (memory access pattern prediction): Predicting which memory locations
will be needed soon and loading them early.
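• A minimal illustration of why branch prediction matters (the function and data are assumptions of this sketch): the loop below typically runs faster when data[] is sorted, because the branch becomes easy to predict.
  long sum_above_threshold(const int *data, int n) {
      long sum = 0;
      for (int i = 0; i < n; i++) {
          if (data[i] >= 128)   /* the hardware must guess this branch every iteration */
              sum += data[i];
      }
      return sum;
  }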
Limits to Pipelining
• Pipelining increases CPU efficiency by overlapping instruction execution, but it
has limitations due to hazards. Hazards prevent the next instruction from
executing in its designated clock cycle.
Why Does Order Matter?
• Dependencies can affect whether we can execute a particular part of the
program in parallel.
• If we cannot execute that part of the program in parallel, then it’ll be SLOW.
Dependencies in Loops
• Dependencies in loops are easy to understand if the loops are unrolled; the dependences are then between statement "instances".
FOR i = 1 to n
  S1: a[i] = b[i] + 1
  S2: c[i] = a[i] + 2
END FOR
• Unrolled view (iterations 1, 2, 3, 4, ...): in every iteration, the instance of S2 depends on the instance of S1 from the same iteration; no dependence crosses iterations.
Dependencies in Loops
• A little more complex example:
FOR i = 1 to n
  S1: a[i] = b[i] + 1
  S2: c[i] = a[i-1] + 2
END FOR
• Unrolled view: the instance of S2 in iteration i depends on the instance of S1 from iteration i-1, so the dependence crosses iteration boundaries (it is loop carried).
Dependencies in Loops
• Even more complex example:
FOR i = 1 to n
  S1: a = b[i] + 1
  S2: c[i] = a + 2
END FOR
• Unrolled view: in each iteration the instance of S2 depends on that iteration's instance of S1 through the scalar a, and because every iteration writes a, the reuse of the name a also creates dependences between iterations.
Branch Dependency
y = 7
IF (x != 0) THEN
y = 1.0/x
END IF
Optimizing with Dependencies
• It is valid to parallelize/vectorize a loop if no dependences cross its iteration
boundaries:
FOR i = 1 to n
  S1: a[i] = b[i] + 1
  S2: c[i] = a[i] + 2
END FOR
• Unrolled view: each iteration's S2 depends only on that same iteration's S1, so the iterations are independent of one another (see the OpenMP sketch below).
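• A hedged sketch of exploiting this: because no dependence crosses iteration boundaries, the loop can be distributed across threads (OpenMP shown; the arrays are assumed to be distinct):
  #pragma omp parallel for
  for (int i = 0; i < n; i++) {
      a[i] = b[i] + 1;   /* S1 */
      c[i] = a[i] + 2;   /* S2 uses only this iteration's a[i] */
  }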
Loop or Branch Dependency?
• Is this a loop carried dependency or a branch dependency?
FOR i = 1 to MAX
IF (x[i] != 0)
y[i] = 1.0/x[i]
END IF
END FOR
Loop or Branch Dependency?
• Is this a loop carried dependency or a branch dependency?
FOR i = 1 to MAX
IF (x[i] != 0)
y[i] = 1.0/x[i]
END IF
END FOR
Answer:
• The given code has a branch dependency due to the conditional (IF x[i] != 0).
• However, it does not have a loop-carried dependency since each iteration
is independent.
Compiler-Driven Parallelism
• Tricks that compilers play for automatic parallelization
• Scalar optimizations
• Loop-level optimizations
Scalar Optimizations
• Copy Propagation
• Constant Folding
• Dead Code Removal
• Strength Reduction
• Common Subexpression Elimination
• Variable Renaming
• Not every compiler does all of these, so it can sometimes be worth doing them by hand.
Copy Propagation
Before:
  x = y
  z = 1 + x

After:
  x = y
  z = 1 + y

• After propagation, no data dependency remains between the two statements.
Constant Folding
• Notice that sum is actually the sum of two constants, so the compiler can
precalculate it, eliminating the addition that otherwise would be performed at
runtime.
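• A minimal illustration (the Before/After code is not preserved on this slide, so the variable names and values below are assumptions):
  Before:
    int add = 100;
    int aug = 200;
    int sum = add + aug;
  After (what the compiler effectively emits, folding 100 + 200 at compile time):
    int sum = 300;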
Dead Code Removal
Before:
  var = 5
  PRINT *, var
  STOP
  PRINT *, var * 2

After:
  var = 5
  PRINT *, var
  STOP
• Since the last statement never executes, the compiler can eliminate it.
Strength Reduction
Before:
  x = y^2.0
  a = c / 2.0

After:
  x = y * y
  a = c * 0.5
• Raising one value to the power of another, or dividing, is more expensive than
multiplying. If the compiler can tell that the power is a small integer, or that the
denominator is a constant, it’ll use multiplication instead.
Common Subexpression Elimination
Before:
  d = c * (a / b)
  e = (a / b) * 2.0

After:
  adivb = a / b
  d = c * adivb
  e = adivb * 2.0

Variable Renaming
Before:
  x = y * z
  q = r + x * 2.0
  x = a + b

After:
  x0 = y * z
  q = r + x0 * 2.0
  x = a + b

• The original code has an output dependency (both the first and last statements write x), while the new code doesn't – but the final value of x is still correct.
Loop Optimizations
• Hoisting Loop Invariant Code
• Unswitching
• Iteration Peeling
• Index Set Splitting
• Loop Interchange
• Unrolling
• Loop Fusion
• Loop Fission
• Inlining
• Not every compiler does all of these, so it can sometimes be worth doing them by hand.
Hoisting Loop Invariant Code
• The code that does not change inside the loop is known as loop invariant. It
doesn’t need to be calculated over and over.
Before:
  FOR i = 1 to n
    a[i] = b[i] + c * d
    e = g(n)
  END FOR

After:
  temp = c * d
  FOR i = 1 to n
    a[i] = b[i] + temp
  END FOR
  e = g(n)
Unswitching
The condition is j-independent, so it can migrate outside the j loop.

Before:
  FOR i = 1 to n
    FOR j = 2 to n
      IF (t[i] > 0) THEN
        a[i][j] = a[i][j] * t[i] + b[j]
      ELSE
        a[i][j] = 0.0
      END IF
    END FOR
  END FOR

After:
  FOR i = 1 to n
    IF (t[i] > 0) THEN
      FOR j = 2 to n
        a[i][j] = a[i][j] * t[i] + b[j]
      END FOR
    ELSE
      FOR j = 2 to n
        a[i][j] = 0.0
      END FOR
    END IF
  END FOR
Iteration Peeling
Before:
  FOR i = 1 to n
    IF ((i == 1) OR (i == n)) THEN
      x[i] = y[i]
    ELSE
      x[i] = y[i + 1] + y[i - 1]
    END IF
  END FOR

After:
  x[1] = y[1]
  FOR i = 2 to n-1
    x[i] = y[i + 1] + y[i - 1]
  END FOR
  x[n] = y[n]
Index Set Splitting
Before:
  FOR i = 1 to n
    a[i] = b[i] + c[i]
    IF (i > 10) THEN
      d[i] = a[i] + b[i - 10]
    END IF
  END FOR

After:
  FOR i = 1 to 10
    a[i] = b[i] + c[i]
  END FOR
  FOR i = 11 to n
    a[i] = b[i] + c[i]
    d[i] = a[i] + b[i - 10]
  END FOR
• Note that this is a generalization of peeling.
Loop Interchange
Before:
  FOR j = 1 to nj
    FOR i = 1 to ni
      a[i][j] = b[i][j]
    END FOR
  END FOR

After:
  FOR i = 1 to ni
    FOR j = 1 to nj
      a[i][j] = b[i][j]
    END FOR
  END FOR

• The array elements a[i][j] and a[i][j+1] are near each other in memory, while a[i+1][j] may be far away, so it makes sense to make the i loop the outer loop.
Unrolling
Before:
  FOR i = 1 to n
    a[i] = a[i] + b[i]
  END FOR

After:
  FOR i = 1, n, 4
    a[i]   = a[i]   + b[i]
    a[i+1] = a[i+1] + b[i+1]
    a[i+2] = a[i+2] + b[i+2]
    a[i+3] = a[i+3] + b[i+3]
  END FOR

• Unrolling creates multiple operations that typically load from the same, or adjacent, cache lines.
• So, an unrolled loop has more operations without increasing the memory accesses by much.
Loop Fusion
Before:
  FOR i = 1 to n
    a[i] = b[i] + 1
  END FOR
  FOR i = 1 to n
    c[i] = a[i] / 2
  END FOR
  FOR i = 1 to n
    d[i] = 1 / c[i]
  END FOR

After:
  FOR i = 1 to n
    a[i] = b[i] + 1
    c[i] = a[i] / 2
    d[i] = 1 / c[i]
  END FOR

• As with unrolling, this has fewer branches. It also has fewer total memory references.
Loop Fission
Before:
  FOR i = 1 to n
    a[i] = b[i] + 1
    c[i] = a[i] / 2
    d[i] = 1 / c[i]
  END FOR

After:
  FOR i = 1 to n
    a[i] = b[i] + 1
  END FOR
  FOR i = 1 to n
    c[i] = a[i] / 2
  END FOR
  FOR i = 1 to n
    d[i] = 1 / c[i]
  END FOR
• Fission reduces the cache footprint and the number of operations per iteration.
To Fuse or to Fizz?
• The question of when to perform fusion versus when to perform fission, like many optimization questions, is highly dependent on the application, the platform, and a lot of other issues that can get very complicated.
• That’s why it’s important to examine the actual behavior of the executable.
Inlining
Before:
  FOR i = 1 to n
    a[i] = func(i)
  END FOR

  FUNCTION func(x)
    func = x * 3
    RETURN func

After:
  FOR i = 1 to n
    a[i] = i * 3
  END FOR
Transforming Programs: Renaming
Before:
  S1: A = X + B
  S2: X = Y + 1
  S3: C = X + B
  S4: X = Z + B
  S5: D = X + 1

After:
  S1: A = X + B
  S2: X1 = Y + 1
  S3: C = X1 + B
  S4: X2 = Z + B
  S5: D = X2 + 1
Transform Programs: Scalar Expansion
Scalar expansion is an optimization technique that transforms a loop containing a scalar dependency into an equivalent
loop using an array version of the scalar variable. This eliminates dependencies and improves parallelization.
Before:
  FOR i = 1 to n
    S1: a = b[i] + 1
    S2: c[i] = a + d[i]
  END FOR

After:
  FOR i = 1 to n
    S1: a1[i] = b[i] + 1
    S2: c[i] = a1[i] + d[i]
  END FOR
  a = a1[n]
Transform Programs: Scalar Expansion
• The rule is simple: it is valid to expand a scalar if it is always assigned before
it is used within the body of the loop and if either
• its final value at the end of the loop is known, or
• it is never used after the loop.
Final Message:
• Implicit parallelism is becoming increasingly valuable as hardware and compiler capabilities advance.
• Try to get the maximum of what a compiler can do for you automatically (LEARN the compiler flags/options/directives). But keep checking whether your BULL (the compiler) is really pulling your cart FAST (not just dancing), and in the RIGHT direction.
• While optimizing code, use profiling to find the HOT SPOTS (bottlenecks) and then focus on them one by one, always ensuring correctness. The 10-90 rule often applies. Algorithm transformation can also be considered.
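• A hedged example of the kind of flags and workflow meant here, using GCC and common Linux tools (the exact options depend on your compiler and code):
  gcc -O3 -march=native -fopt-info-vec prog.c -o prog            # optimize and report which loops were vectorized
  gcc -O2 -pg prog.c -o prog && ./prog && gprof prog gmon.out    # compile for profiling, run, then list the hot spots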