Data Dependence, Parallelization, and Locality Enhancement

Data Dependence

    S1: A = 1.0
    S2: B = A + 2.0
    S3: A = C + D
    S4: A = B/C

We define four types of data dependence.

Flow (true) dependence: a statement Si precedes a statement Sj in execution and Si computes a data value that Sj uses.
It implies that Si must be executed before Sj.

    Si δt Sj    (S1 δt S2 and S2 δt S4)

Anti dependence: a statement Si precedes a statement Sj in execution and Si uses a data value that Sj computes.
It implies that Si must be executed before Sj.

    Si δa Sj    (S2 δa S3)

Output dependence: a statement Si precedes a statement Sj in execution and Si computes a data value that Sj also computes.
It implies that Si must be executed before Sj.

    Si δo Sj    (S1 δo S3 and S3 δo S4)
Input dependence: a statement Si precedes a statement Sj in execution and Si uses a data value that Sj also uses.
Does this imply that Si must execute before Sj? (No; an input dependence imposes no ordering constraint.)

    Si δI Sj    (S3 δI S4)

Data Dependence (continued)

The dependence is said to flow from Si to Sj because Si precedes Sj in execution.
Si is said to be the source of the dependence; Sj is said to be the sink of the dependence.
The only "true" dependence is flow dependence; it represents the flow of data in the program.
The other types of dependence are caused by programming style; they may be eliminated by renaming:

    S1: A  = 1.0
    S2: B  = A + 2.0
    S3: A1 = C + D
    S4: A2 = B/C
[Dependence graph for the original example: nodes S1..S4 with edges S1 δt S2, S1 δo S3, S2 δa S3, S2 δt S4, S3 δo S4, and S3 δI S4.]
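The definitions above can be applied mechanically. The sketch below (not from the original slides) encodes each statement's written variable and read variables as short character strings and reports the flow, anti, and output dependences of the S1..S4 example; input dependences are omitted for brevity, and transitively implied dependences (such as the output dependence from S1 to S4) are reported as well.

    ! A minimal sketch that classifies dependences between statement pairs
    ! from their write variable and read variables (one character per
    ! variable). The data below encodes S1: A = 1.0, S2: B = A + 2.0,
    ! S3: A = C + D, S4: A = B/C.
    program classify_deps
      implicit none
      character(len=1), parameter :: w(4) = (/ 'A', 'B', 'A', 'A' /)     ! variable written by Si
      character(len=2), parameter :: r(4) = (/ '  ', 'A ', 'CD', 'BC' /) ! variables read by Si
      integer :: si, sj

      do si = 1, 4
        do sj = si + 1, 4
          if (index(r(sj), w(si)) > 0) print *, 'S', si, ' flow   S', sj   ! Si writes what Sj reads
          if (index(r(si), w(sj)) > 0) print *, 'S', si, ' anti   S', sj   ! Si reads what Sj writes
          if (w(si) == w(sj))          print *, 'S', si, ' output S', sj   ! both write the same variable
        end do
      end do
    end program classify_deps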
Example 1 - Example 4

[Four worked examples showing small loop nests, their iteration spaces, and the resulting dependences with distance/direction vectors, e.g. S δt(1,-1) S.]

Are you sure you know why it is S2 δa S1 even though S1 appears before S2 in the code?
Problem Formulation

Consider the following perfect nest of depth d:

    do I1 = L1, U1
      do I2 = L2, U2
        ...
        do Id = Ld, Ud
          a(f1(I), f2(I), ..., fm(I)) = ...
          ... = a(g1(I), g2(I), ..., gm(I))
        end do
        ...
      end do
    end do

where I = (I1, I2, ..., Id), L = (L1, L2, ..., Ld), U = (U1, U2, ..., Ud), and L <= U.
Each fk(I) and gk(I) is a subscript function: the subscript expression appearing in subscript position k of the array reference. The subscript expressions are assumed to be linear functions of the loop indices:

    b0 + b1*I1 + b2*I2 + ... + bd*Id

Dependence will exist if there exist two iteration vectors k and j such that L <= k <= j <= U and:

    f1(k) = g1(j)  and  f2(k) = g2(j)  and  ...  and  fm(k) = gm(j)

That is:

    f1(k) - g1(j) = 0
    and
    f2(k) - g2(j) = 0
    and ...
    fm(k) - gm(j) = 0
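For small, constant loop bounds this formulation can be checked by brute force. The sketch below (not from the original slides) enumerates every ordered pair of iterations of a depth-1 nest and tests whether the write subscript equals the read subscript; the bounds 2..4 and the subscripts i and i-1 are assumptions chosen to match the first example that follows.

    ! A brute-force dependence tester for a depth-1 nest, assuming the
    ! write reference a(f(i)) with f(i) = i and the read reference
    ! a(g(i)) with g(i) = i - 1, and loop bounds 2..4. A real tester
    ! would take the subscript coefficients and bounds as input.
    program brute_force_tester
      implicit none
      integer, parameter :: L = 2, U = 4
      integer :: k, j

      do k = L, U        ! iteration performing the write a(f(k))
        do j = k, U      ! same or later iteration performing the read a(g(j))
          if (f(k) == g(j)) then
            print *, 'dependence from iteration', k, 'to', j, '; distance =', j - k
          end if
        end do
      end do

    contains

      integer function f(i)   ! write subscript: a(i)
        integer, intent(in) :: i
        f = i
      end function f

      integer function g(i)   ! read subscript: a(i-1)
        integer, intent(in) :: i
        g = i - 1
      end function g

    end program brute_force_tester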
Problem Formulation - Example

Does there exist two iteration vectors i1 and i2, such that 2 <= i1 <= i2 <= 4 and such that: i1 = i2 - 1?
Answer: yes; i1=2 & i2=3 and i1=3 & i2=4.
Hence, there is dependence!
The dependence distance vector is i2-i1 = 1.
The dependence direction vector is sign(1) = <.

Does there exist two iteration vectors i1 and i2, such that 2 <= i1 <= i2 <= 4 and such that: i1 = i2 + 1?
Answer: yes; i1=3 & i2=2 and i1=4 & i2=3. (But, but!)
Hence, there is dependence!
The dependence distance vector is i2-i1 = -1.
The dependence direction vector is sign(-1) = >.
Is this possible? (It is not: a sink cannot execute before its source. Swapping the roles of source and sink turns this into an anti dependence with distance 1 and direction <.)
Problem Formulation - Example

    do i = 1, 10
      S1: a(2*i) = b(i) + c(i)
      S2: d(i) = a(2*i+1)
    end do

Does there exist two iteration vectors i1 and i2, such that 1 <= i1 <= i2 <= 10 and such that: 2*i1 = 2*i2 + 1?
Answer: no; 2*i1 is even and 2*i2+1 is odd.
Hence, there is no dependence!

Problem Formulation

Dependence testing is equivalent to an integer linear programming (ILP) problem of 2d variables and m+d constraints!
An algorithm that determines if there exist two iteration vectors k and j that satisfy these constraints is called a dependence tester.
The dependence distance vector is given by j - k.
The dependence direction vector is given by sign(j - k).
Dependence testing is NP-complete!
A dependence test that reports dependence only when there is dependence is said to be exact. Otherwise it is in-exact.
Lamport's Test - Example

Lamport's test applies when both references use the same coefficient on the index variable: subscripts of the form b*i + c1 and b*i + c2 can refer to the same element only if (c1 - c2)/b is an integer, and that value (which must also fit within the loop bounds) is the dependence distance.

    do i = 1, n
      do j = 1, n
        S: a(i,j) = a(i-1,j+1)
      end do
    end do

Index i: b = 1; c1 = 0; c2 = -1. (c1 - c2)/b = 1. There is dependence; distance (i) is 1.
Index j: b = 1; c1 = 0; c2 = 1. (c1 - c2)/b = -1. There is dependence; distance (j) is -1.

    S δt(1,-1) S   or, in direction form,   S δt(<,>) S

    do i = 1, n
      do j = 1, n
        S: a(i,2*j) = a(i-1,2*j+1)
      end do
    end do

Index i: b = 1; c1 = 0; c2 = -1. (c1 - c2)/b = 1. There is dependence; distance (i) is 1.
Index j: b = 2; c1 = 0; c2 = 1. (c1 - c2)/b = -1/2, which is not an integer. There is no dependence.

There is no dependence!
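A minimal sketch of the per-dimension check these examples perform (not from the original slides); the values are those of the j dimension of the second example, b = 2, c1 = 0, c2 = 1, and a full test would additionally require the distance to fit within the loop bounds.

    ! The core divisibility check of Lamport's test for one subscript
    ! dimension: b*i1 + c1 (write) and b*i2 + c2 (read) can be equal
    ! only if (c1 - c2) is divisible by b; the quotient is the distance.
    program lamport_check
      implicit none
      integer :: b, c1, c2

      b = 2; c1 = 0; c2 = 1     ! j dimension of a(i,2*j) = a(i-1,2*j+1)

      if (mod(c1 - c2, b) == 0) then
        print *, 'dependence possible; distance =', (c1 - c2) / b
      else
        print *, 'no dependence in this dimension'
      end if
    end program lamport_check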
GCD Test Example

    do i = 1, 10
      S1: a(i) = b(i) + c(i)
      S2: d(i) = a(i-100)
    end do

Does there exist two iteration vectors i1 and i2, such that 1 <= i1 <= i2 <= 10 and such that:
    i1 = i2 - 100?   or   i2 - i1 = 100?

There will be an integer solution if and only if gcd(1,-1) divides 100.
This is the case, and hence, there is dependence! Or is there?
(There is not: the GCD test ignores the loop bounds, and a distance of 100 cannot occur in a loop of only 10 iterations. On this example the test is in-exact.)
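A minimal sketch of the divisibility check behind the GCD test example above (not from the original slides); it deliberately ignores the loop bounds, which is exactly why it reports a dependence here even though none exists.

    ! GCD test for the equation 1*i1 + (-1)*i2 = -100: an integer
    ! solution exists iff gcd(1, -1) divides 100. Loop bounds are not
    ! consulted, so the answer may be a false positive.
    program gcd_test_check
      implicit none
      integer :: a1, a2, c

      a1 = 1; a2 = -1; c = 100

      if (mod(c, gcd(a1, a2)) == 0) then
        print *, 'GCD test: dependence possible'
      else
        print *, 'GCD test: no dependence'
      end if

    contains

      integer function gcd(a, b)
        integer, intent(in) :: a, b
        integer :: x, y, t
        x = abs(a); y = abs(b)
        do while (y /= 0)
          t = mod(x, y)
          x = y
          y = t
        end do
        gcd = x
      end function gcd

    end program gcd_test_check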
Dependence Testing Complications

Unknown loop bounds.

    do i = 1, N
      S1: a(i) = a(i+10)
    end do

What is the relationship between N and 10?

Triangular loops.

    do i = 1, N
      do j = 1, i-1
        S: a(i,j) = a(j,i)
      end do
    end do

Must impose j < i as an additional constraint.

User variables in subscripts.

    do i = 1, 10
      S1: a(i) = a(i+k)
    end do

Same problem as unknown loop bounds, but it can also occur due to some loop transformations (e.g., normalization):

    do i = L, H                      do i = 1, H-L
      S1: a(i) = a(i-1)       ==>      S1: a(i+L) = a(i+L-1)
    end do                           end do

Scalars can be expanded into arrays:

    do i = 1, N                      do i = 1, N
      S1: x = a(i)            ==>      S1: x(i) = a(i)
      S2: b(i) = x                     S2: b(i) = x(i)
    end do                           end do

Induction variables can be replaced by closed-form expressions:

    j = N-1                          do i = 1, N
    do i = 1, N               ==>      S1: a(i) = a(N-i)
      S1: a(i) = a(j)                end do
      S2: j = j - 1
    end do

Sum reductions can be rewritten using an array of partial values:

    sum = 0                          do i = 1, N
    do i = 1, N               ==>      S1: sum(i) = a(i)
      S1: sum = sum + a(i)           end do
    end do                           sum = sum + sum(i), for i = 1, N
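A concrete version of the reduction rewrite above, with hypothetical data and N = 8 (not from the original slides): the loop that fills the array of values carries no dependence, and a separate final loop combines them.

    ! Sketch of the sum-reduction rewrite: the first loop has no
    ! loop-carried dependence (each iteration writes its own element),
    ! so it could run in parallel; the combining step is done afterwards.
    program reduction_rewrite
      implicit none
      integer, parameter :: N = 8
      real :: a(N), partial(N), total
      integer :: i

      a = 1.0                    ! hypothetical input data

      do i = 1, N                ! parallelizable
        partial(i) = a(i)
      end do

      total = 0.0
      do i = 1, N                ! final reduction
        total = total + partial(i)
      end do

      print *, 'sum =', total
    end program reduction_rewrite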
Serious Complications

Aliases.

– Equivalence statements in Fortran:

    real a(10,10), b(10)
    equivalence (a, b)

  makes b the same as the first column of a.

– Common blocks: Fortran's way of having shared/global variables.

    common /shared/ a, b, c
      :
    subroutine foo (…)
    common /shared/ a, b, c

  Another routine may declare the same block with different names,

    common /shared/ x, y, z

  so x, y, z occupy the same storage as a, b, c.
Loop Parallelization

A dependence is said to be carried by a loop if the loop is the outermost loop whose removal eliminates the dependence. If a dependence is not carried by the loop, it is loop-independent.

    do i = 2, n-1
      do j = 2, m-1
        a(i, j) = ...
        ... = a(i, j)        ! δt(=,=)  loop-independent
        b(i, j) = ...
        ... = b(i, j-1)      ! δt(=,<)  carried by the j loop
        c(i, j) = ...
        ... = c(i-1, j)      ! δt(<,=)  carried by the i loop
      end do
    end do

The outermost loop with a non "=" direction carries the dependence!
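A small illustration of this rule (not from the original slides): scan a dependence's direction vector from the outermost loop inward and report the first position that is not '='. The three vectors below correspond to the a, b, and c dependences in the nest above.

    ! "Which loop carries this dependence?": the carrying loop is the
    ! outermost position of the direction vector whose entry is not '=';
    ! if all entries are '=', the dependence is loop-independent (0).
    program carried_loop
      implicit none
      character(len=1), dimension(2) :: da, db, dc

      da = (/ '=', '=' /)   ! ... = a(i,j)   : loop-independent
      db = (/ '=', '<' /)   ! ... = b(i,j-1) : carried by loop 2 (j)
      dc = (/ '<', '=' /)   ! ... = c(i-1,j) : carried by loop 1 (i)

      print *, 'a carried by loop', carrier(da)
      print *, 'b carried by loop', carrier(db)
      print *, 'c carried by loop', carrier(dc)

    contains

      integer function carrier(dir)
        character(len=1), dimension(:), intent(in) :: dir
        integer :: k
        carrier = 0
        do k = 1, size(dir)
          if (dir(k) /= '=') then
            carrier = k
            return
          end if
        end do
      end function carrier

    end program carried_loop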
Loop Parallelization - Example

The iterations of a loop may be executed in parallel with one another if and only if no dependence is carried by that loop.

[Fork/join diagram: for each value of i, the iterations j = 2 ... m-1 are forked off, executed in parallel, and joined before i advances.]

Iterations of loop i must be executed sequentially, but the iterations of loop j may be executed in parallel. Why? (The dependence in this example is carried by the i loop; no dependence is carried by the j loop.)

Loop Interchange

Loop interchange changes the order of the loops to improve the spatial locality of a program.

    do j = 1, n                      do i = 1, n
      do i = 1, n             ==>      do j = 1, n
        ... a(i,j) ...                   ... a(i,j) ...
      end do                           end do
    end do                           end do

[Diagram: the order in which the elements of a(i,j) are touched before and after interchange, and the resulting traffic through the processor / cache / memory (P / C / M) hierarchy.]
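The effect comes from which subscript varies in the innermost loop. A rough way to observe it (not from the original slides; the array size and the use of cpu_time are arbitrary choices) is to time both orders: Fortran stores arrays column-major, so varying the first subscript i in the innermost loop walks memory with stride 1, while varying j innermost walks with stride n.

    ! Timing sketch: sum the same array with i innermost (stride-1
    ! accesses in column-major Fortran) and with j innermost (stride-n
    ! accesses). The stride-1 version typically runs noticeably faster.
    program interchange_timing
      implicit none
      integer, parameter :: n = 2000
      real, allocatable :: a(:,:)
      real :: t0, t1, s
      integer :: i, j

      allocate(a(n,n))
      a = 1.0
      s = 0.0

      call cpu_time(t0)
      do j = 1, n
        do i = 1, n        ! i innermost: stride-1
          s = s + a(i,j)
        end do
      end do
      call cpu_time(t1)
      print *, 'i innermost:', t1 - t0, 'seconds'

      call cpu_time(t0)
      do i = 1, n
        do j = 1, n        ! j innermost: stride-n
          s = s + a(i,j)
        end do
      end do
      call cpu_time(t1)
      print *, 'j innermost:', t1 - t0, 'seconds'

      print *, s           ! use s so the loops are not optimized away
    end program interchange_timing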
Loop Interchange

Loop interchange can improve the granularity of parallelism!

    do i = 1, n                      do j = 1, n
      do j = 1, n             ==>      do i = 1, n
        a(i,j) = b(i,j)                  a(i,j) = b(i,j)
        c(i,j) = a(i-1,j)                c(i,j) = a(i-1,j)
      end do                           end do
    end do                           end do

The flow dependence on a, δt(<,=), is carried by the i loop. Before interchange only the inner j loop can run in parallel (fine-grained); after interchange the parallel j loop is outermost, so each parallel task does more work (coarse-grained).

[Iteration-space diagrams: the δt dependence arrows between points of the (i,j) iteration space, before and after interchange.]

When is loop interchange legal?
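The slides leave the question open here; the usual answer is that interchanging two loops is legal as long as it does not reverse any dependence, i.e. no dependence of the nest has direction vector (<, >), which interchange would turn into the impossible (>, <). A minimal sketch of that check (not from the original slides), with two hypothetical dependences:

    ! Legality of interchanging a pair of adjacent loops: interchange
    ! swaps the two entries of every dependence direction vector, so it
    ! is illegal iff some dependence has direction (<, >).
    program interchange_legal
      implicit none
      character(len=1) :: deps(2,2)   ! deps(:,k) = direction vector of dependence k
      integer :: k
      logical :: legal

      deps(:,1) = (/ '<', '=' /)      ! hypothetical dependence 1
      deps(:,2) = (/ '=', '<' /)      ! hypothetical dependence 2

      legal = .true.
      do k = 1, size(deps, 2)
        if (deps(1,k) == '<' .and. deps(2,k) == '>') legal = .false.
      end do

      print *, 'interchange legal?', legal
    end program interchange_legal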
Loop Interchange

[Iteration-space diagrams: the same δt dependence arrows drawn for both loop orders; interchange is legal only if no arrow is reversed.]

Loop Blocking (Loop Tiling)

Exploits temporal locality in a loop nest.

    do t = 1, T
      do i = 1, n
        do j = 1, n
          ... a(i,j) ...
        end do
      end do
    end do
The blocked version (ic and jc are the control loops; B is the block size):

    do ic = 1, n, B
      do jc = 1, n, B
        do t = 1, T
          do i = 1, B
            do j = 1, B
              ... a(ic+i-1, jc+j-1) ...
            end do
          end do
        end do
      end do
    end do

[Diagram: the n x n array divided into B x B blocks; the blocked nest performs all T time steps on one block while it is resident in the cache before moving on.]
[The following slides animate the blocked execution order: the control loops visit the blocks (ic=1, jc=1), (ic=1, jc=2), ..., (ic=2, jc=1), ..., and the original and blocked versions of the nest are shown side by side for comparison.]
When is loop blocking legal?
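The original slides leave this question open as well. Blocking is strip-mining (always legal) followed by loop interchange, so it is legal exactly when those interchanges are legal, i.e. when no dependence direction is reversed. The sketch below (not from the original slides; n, B, the step count, and the loop body are hypothetical) checks the simplest safe case, a body with no cross-iteration dependences, where the blocked and unblocked orders must produce identical results; n is assumed to be a multiple of B.

    ! Sanity check: with a body that carries no dependences between
    ! different (i,j) iterations, the blocked and unblocked loop orders
    ! compute the same result. n is assumed to be a multiple of B.
    program blocking_check
      implicit none
      integer, parameter :: n = 8, B = 4, nsteps = 3
      real :: a1(n,n), a2(n,n)
      integer :: t, i, j, ic, jc

      a1 = 0.0
      a2 = 0.0

      ! Unblocked order.
      do t = 1, nsteps
        do i = 1, n
          do j = 1, n
            a1(i,j) = a1(i,j) + 1.0
          end do
        end do
      end do

      ! Blocked order: all sweeps over one B x B block, then the next block.
      do ic = 1, n, B
        do jc = 1, n, B
          do t = 1, nsteps
            do i = 1, B
              do j = 1, B
                a2(ic+i-1, jc+j-1) = a2(ic+i-1, jc+j-1) + 1.0
              end do
            end do
          end do
        end do
      end do

      print *, 'maximum difference =', maxval(abs(a1 - a2))   ! expected: 0.0
    end program blocking_check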