ST7 SHP 1.3 ExOptimVectoSIMD 1spp

ST7: High Performance Simulation

Examples of serial optimization and


vectorization with SIMD units

Stéphane Vialle

[email protected]
http://www.metz.supelec.fr/~vialle
HPC programming strategy
Numerical algorithm

Optimized code on one core:


- Optimized compilation
- Serial optimizations
- Vectorization

Parallelized code on one node:


- Multithreading
- Minimal/relaxed synchronization
- Data-Computation localization
- NUMA effect avoidance

Distributed code on a cluster:


- Message passing across processes
- Load balanced computations
- Minimal communications
- Overlapped computations and comms
2
Optimization and vectorization of a
Dense Matrix Product

1 – Strengths and weaknesses of the naive version


2 – 1st solution: evolution of the data storage
3 – 2nd solution: evolution of the loop order

3
Strengths and weaknesses
of the naive version

(figure: C = A×B, with row i of A and column j of B)

Strengths:
The naive version implements directly the mathematical expression
Cij = Σk Aik · Bkj

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k]*B[k][j];

 Code is:
• easy to implement
• easy to maintain

Weaknesses:
1. The internal loop does not access B[k][j] elements in contiguous order
    non-optimal use of the cache memory
2. The internal loop writes to the same variable C[i][j] at each iteration
    limits/stops the vectorization (iterations are not independent)
4
Optimization and vectorization of a
Dense Matrix Product

1 – Strengths and weaknesses of the naive version


2 – 1st solution: evolution of the data storage
3 – 2nd solution: evolution of the loop order

5
1st solution: evolution of the data storage

Identification of « cache misses »

    for i
      for j
        for (int k=0; k<n; k++)
          Cij += Aik * Bkj

Considering successive iterations of the k-loop:
• A: access to successive elements (k, k+1, …) in RAM along row i
   takes advantage of the cache memory mechanism
• B: access to non-contiguous elements in RAM down column j
   many « cache misses », poor usage of the cache memory
6
1st solution: evolution of the data storage

Avoiding « cache misses »

    for i
      for j
        for (int k=0; k<n; k++)
          Cij += Aik * TBjk

Considering successive iterations of the k-loop:
• A: access to successive elements in RAM along row i
   takes advantage of the cache memory mechanism
• TB: access to successive elements in RAM along row j of TB
   takes advantage of the cache memory mechanism
7
1st solution: evolution of the data storage

Avoiding « cache misses »

Source code with new data storage:

    // Dense matrix product: C = A×B
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                C[i][j] += A[i][k]*TB[j][k];
            }
        }

Compilation: gcc -O3 pgm.c -o pgm

This source code version is closer to the architecture of the processor:
• faster
• more complex to understand and to maintain
• requires TB … (store TB instead of B? transpose B when needed?)
8
1st solution: evolution of the data storage

Identification of a vectorization lock

    for i
      for j
        for (int k=0; k<n; k++)
          Cij += Aik * TBjk

At each iteration of the k-loop:
• identical computation (no if-then-else, no divergence)
• access to successive array indexes
• but write/accumulation in the same variable (Cij):
  • iterations are not independent
  • vectorization is limited
9
1st solution: evolution of the data storage

Unlocking the vectorization

If n = 4·q:

    for i
      for j
        double Acc[4] = {0}
        for (k=0; k<n; k+=4)
          Acc[0] += Aik+0 * TBjk+0
          Acc[1] += Aik+1 * TBjk+1
          Acc[2] += Aik+2 * TBjk+2
          Acc[3] += Aik+3 * TBjk+3
        Cij = Acc[0]+Acc[1]+Acc[2]+Acc[3];

• Loop unrolling
• Accumulation in a vector of buffers

Each k-loop iteration:
• includes 4 identical & independent instructions
• reads and writes successive array indexes
 The compiler can vectorize each k-loop iteration
10
1st solution: evolution of the data storage

Unlocking the vectorization

Source code 1: with new data storage & loop unrolling

    // Dense matrix product: C = A×B
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double accu[8] = {0.0};
            for (int k = 0; k < (n/8)*8; k += 8) {
                accu[0] += A[i][k+0]*TB[j][k+0];
                accu[1] += A[i][k+1]*TB[j][k+1];
                ………
                accu[7] += A[i][k+7]*TB[j][k+7];
            }
            for (int k = (n/8)*8; k < n; k++)
                accu[0] += A[i][k]*TB[j][k];
            C[i][j] = accu[0] + … + accu[7];
        }

• Loop unrolling with 8-factor (in case of long AVX units)
• Integer division: (900/8)*8 = 896
• Generic solution: the clean-up loop runs for any value of n
Compilation: gcc -O3 -funroll-loops pgm.c -o pgm
Specific compilation option to improve loop unrolling
11
1st solution: evolution of the data storage

Unlocking the vectorization

Source code 2: with new data storage & loop unrolling

    // Dense matrix product: C = A×B
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < (n/8)*8; k += 8) {
                C[i][j] += A[i][k+0]*TB[j][k+0] +
                           A[i][k+1]*TB[j][k+1] +
                           ………
                           A[i][k+7]*TB[j][k+7];
            }
            for (int k = (n/8)*8; k < n; k++)
                C[i][j] += A[i][k]*TB[j][k];
        }

• Loop unrolling with 8-factor (in case of long AVX units)
• Integer division: (900/8)*8 = 896
• Generic solution: the clean-up loop runs for any value of n
Compilation: gcc -O3 -funroll-loops pgm.c -o pgm

Implements loop unrolling & one big instruction, grouping many identical operations on successive array indexes
 Better or worse than the previous solution: depends on the compiler
12
Optimization and vectorization of a
Dense Matrix Product

1 – Strengths and weaknesses of the naive version


2 – 1st solution: evolution of the data storage
3 – 2nd solution: evolution of the loop order

13
2nd solution: evolution of the loop order

Inversion of j and k loops

    for i
      for k
        for j
          Cij += Aik * Bkj

Considering successive iterations of the j-loop:
• A: access to only one element Aik  no cache-miss problem
• B: access to successive array indexes along row k  right use of the cache memory
• C: access to successive array indexes along row i  right use of the cache memory
+ Independent & identical operations
+ No write conflict
   auto-vectorization
+ No explicit loop unrolling (use only the -funroll-loops option of the compiler)
14
2nd solution: evolution of the loop order

Inversion of j and k loops

Source code with new loop order: « ikj »

    // Dense matrix product: C = A×B
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i][j] += A[i][k]*B[k][j];

Compilation: gcc -O3 -funroll-loops pgm.c -o pgm
Usually faster when compiled with the -funroll-loops option
• Right use of the cache memory
• Suppression of the write conflict during vectorization of the inner loop
 Decreases the number of cache misses
 Enables the auto-vectorization
Elegant and efficient! … but not always so simple! 15
2nd solution: evolution of the loop order

Investigating all possible inner loops

Statement: Cij += Aik * Bkj

Inner loop = i:
• Right use of cache:    Cij: NO! (cache misses)   Aik: NO! (cache misses)   Bkj: OK (access same elt)
• Vectorization enabled: Cij: NO (not contiguous)  Aik: NO (not contiguous)  Bkj: OK (access same elt)

Inner loop = j:
• Right use of cache:    Cij: OK (contiguous)      Aik: OK (access same elt) Bkj: OK (contiguous)
• Vectorization enabled: Cij: OK (contiguous, no W conflict)  Aik: OK (access same elt)  Bkj: OK (contiguous)

Inner loop = k:
• Right use of cache:    Cij: OK (access same elt) Aik: OK (contiguous)      Bkj: NO! (cache misses)
• Vectorization enabled: Cij: NO! (W conflict)     Aik: OK (contiguous)      Bkj: NO (not contiguous)

 inner loop = j-loop: the only right solution
(without changing the data storage) 16
Experiments

17
Exp. on an Intel Xeon Haswell
Experiments (2018):
Dense matrix product: 4096x4096, double precision
Processor: Intel Xeon Haswell E5-2637 v3, 2014 (4 physical cores, 2 threads/core)

  Seq. naive -O0:                              0.12 Gflops   ×1.0
  Seq. naive -O3:                              0.35 Gflops   ×2.9
  -O3 + optimized code + vectorization:        3.10 Gflops   ×25.8    (2 threads/core, best configuration)
  BLAS, monothread (OpenBLAS):                 46.3 Gflops   ×385.8   (1 thread/core, best configuration)

 Use optimized HPC libraries when available
 Optimize your source code when an HPC library does not exist
18
Exp. on 2 Intel Xeon architectures
Experiments of kernel K0 (section 2), 2019:
• Sarah: quad-core at 3.5 GHz (E5-2637 v3 « haswell », 15MB cache, 2014, gcc 5.4.0)
• Kyle: octo-core at 2.1 GHz (Silver 4110 CPU « skylake », 11MB cache, 2017, gcc 7.3.0)

[Figure: GFlops (0 to 8) of the successive K0 versions (naive, +Accu, +TB, +Loop-unroll, +Vect-Accu, +Long op, ikj) on Sarah and Kyle, at -O0 and -O3, for 1024x1024 and 2048x2048 matrices]

 The K0-ikj algorithm appears the best
 Significant parts of small matrices fit in cache: higher K0 performance
19
Exp. on 2 Intel Xeon architectures
Experiments of kernel K1 (OpenBLAS), 2019:
• Sarah: quad-core at 3.5 GHz (E5-2637 v3 « haswell », 15MB cache, 2014, gcc 5.4.0)
• Kyle: octo-core at 2.1 GHz (Silver 4110 CPU « skylake », 11MB cache, 2017, gcc 7.3.0)

[Figure: GFlops (0 to 60) of the best K0 (TP) versus K1 (OpenBLAS) on Sarah and Kyle, for 1024x1024, 2048x2048 and 4096x4096 matrices]

 The "BLAS" are the result of long developments: use this library!
20
Examples of serial optimization and
vectorization with SIMD units

Questions ?

21
