ST7 SHP 1.3 ExOptimVectoSIMD 1spp

ST7: High Performance Simulation

Examples of serial optimization and


vectorization with SIMD units

Stéphane Vialle

[email protected]
http://www.metz.supelec.fr/~vialle
HPC programming strategy
Numerical algorithm

Optimized code on one core:


- Optimized compilation
- Serial optimizations
- Vectorization

Parallelized code on one node:


- Multithreading
- Minimal/relaxed synchronization
- Data-Computation localization
- NUMA effect avoidance

Distributed code on a cluster:


- Message passing across processes
- Load balanced computations
- Minimal communications
- Overlapped computations and comms
2
Optimization and vectorization of a
Dense Matrix Product

1 – Strengths and weaknesses of the naive version


2 – 1st solution: evolution of the data storage
3 – 2nd solution: evolution of the loop order

3
Strengths and weaknesses
of the naive version

(figure: C = A×B, with row i of A and column j of B)

Strengths:
The naive version implements directly the mathematical expression
Cij = Σk Aik · Bkj

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k]*B[k][j];

 Code is:
• easy to implement
• easy to maintain

Weaknesses:
1. The internal loop does not access B[k][j] elements in contiguous order
    non-optimal use of the cache memory
2. The internal loop writes to the same variable C[i][j] at each iteration
    limits/stops the vectorization (iterations are not independent)
4
Optimization and vectorization of a
Dense Matrix Product

1 – Strengths and weaknesses of the naive version


2 – 1st solution: evolution of the data storage
3 – 2nd solution: evolution of the loop order

5
1st solution: evolution of the data storage

Identification of « cache misses »

    for i
      for j
        for (int k=0; k<n; k++)
          Cij += Aik * Bkj

Considering successive iterations of the k-loop:
• A: access to successive elements (k, k+1, …) in RAM along row i
   takes advantage of the cache memory mechanism
• B: access to non-contiguous elements in RAM down column j
   many « cache misses », poor usage of the cache memory
6
1st solution: evolution of the data storage

Avoiding « cache misses »

    for i
      for j
        for (int k=0; k<n; k++)
          Cij += Aik * TBjk

Considering successive iterations of the k-loop:
• A: access to successive elements in RAM along row i
   takes advantage of the cache memory mechanism
• TB: access to successive elements in RAM along row j of TB
   takes advantage of the cache memory mechanism
7
1st solution: evolution of the data storage

Avoiding « cache misses »

Source code with new data storage:

    // Dense matrix product: C = A×B
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                C[i][j] += A[i][k]*TB[j][k];
            }
        }

Compilation: gcc -O3 pgm.c -o pgm

This source code version is closer to the architecture of the processor:
• faster
• more complex to understand and to maintain
• requires TB … (store TB instead of B? transpose B when needed?)
8
1st solution: evolution of the data storage

Identification of a vectorization lock

    for i
      for j
        for (int k=0; k<n; k++)
          Cij += Aik * TBjk

At each iteration of the k-loop:
• identical computation (no if-then-else, no divergence)
• access to successive array indexes
• but write/accumulation in the same variable (Cij):
  • iterations are not independent
  • vectorization is limited
9
1st solution: evolution of the data storage

Unlocking the vectorization

If n = 4·q:

    for i
      for j
        double Acc[4] = {0}
        for (k=0; k<n; k+=4)
          Acc[0] += Aik+0 * TBjk+0
          Acc[1] += Aik+1 * TBjk+1
          Acc[2] += Aik+2 * TBjk+2
          Acc[3] += Aik+3 * TBjk+3
        Cij = Acc[0]+Acc[1]+Acc[2]+Acc[3];

• Loop unrolling
• Accumulation in a vector of buffers

Each k-loop iteration:
• includes 4 identical & independent instructions
• reads and writes successive array indexes
 The compiler can vectorize each k-loop iteration
10
1st solution: evolution of the data storage

Unlocking the vectorization

Source code 1: with new data storage & loop unrolling

    // Dense matrix product: C = A×B
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double accu[8] = {0.0};
            for (int k = 0; k < (n/8)*8; k += 8) {
                accu[0] += A[i][k+0]*TB[j][k+0];
                accu[1] += A[i][k+1]*TB[j][k+1];
                ………
                accu[7] += A[i][k+7]*TB[j][k+7];
            }
            for (int k = (n/8)*8; k < n; k++)
                accu[0] += A[i][k]*TB[j][k];
            C[i][j] = accu[0] + … + accu[7];
        }

• Loop unrolling with 8-factor (in case of long AVX units)
• Integer division: (900/8)*8 = 896
• Generic solution: the clean-up loop runs for any value of n
Compilation: gcc -O3 -funroll-loops pgm.c -o pgm
Specific compilation option to improve loop unrolling
11
1st solution: evolution of the data storage

Unlocking the vectorization

Source code 2: with new data storage & loop unrolling

    // Dense matrix product: C = A×B
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < (n/8)*8; k += 8) {
                C[i][j] += A[i][k+0]*TB[j][k+0] +
                           A[i][k+1]*TB[j][k+1] +
                           ………
                           A[i][k+7]*TB[j][k+7];
            }
            for (int k = (n/8)*8; k < n; k++)
                C[i][j] += A[i][k]*TB[j][k];
        }

• Loop unrolling with 8-factor (in case of long AVX units)
• Integer division: (900/8)*8 = 896
• Generic solution: the clean-up loop runs for any value of n
Compilation: gcc -O3 -funroll-loops pgm.c -o pgm

Implements loop unrolling & one big instruction, grouping many identical operations on successive array indexes
 Better or worse than the previous solution: depends on the compiler
12
Optimization and vectorization of a
Dense Matrix Product

1 – Strengths and weaknesses of the naive version


2 – 1st solution: evolution of the data storage
3 – 2nd solution: evolution of the loop order

13
2nd solution: evolution of the loop order

Inversion of j and k loops

    for i
      for k
        for j
          Cij += Aik * Bkj

Considering successive iterations of the j-loop:
• A: access to only one element Aik  no cache-miss problem
• B: access to successive array indexes along row k  right use of the cache memory
• C: access to successive array indexes along row i  right use of the cache memory
+ Independent & identical operations
+ No write conflict
   auto-vectorization
+ No explicit loop unrolling (use only the -funroll-loops option of the compiler)
14
2nd solution: evolution of the loop order

Inversion of j and k loops

Source code with new loop order: « ikj »

    // Dense matrix product: C = A×B
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i][j] += A[i][k]*B[k][j];

Compilation: gcc -O3 -funroll-loops pgm.c -o pgm
Usually faster when compiled with the -funroll-loops option
• Right use of the cache memory
• Suppression of the write conflict during vectorization of the inner loop
 Decreases the number of cache misses
 Enables the auto-vectorization
Elegant and efficient! … but not always so simple! 15
2nd solution: evolution of the loop order

Investigating all possible inner loops

Statement: Cij += Aik * Bkj

Inner loop = i:
• Right use of cache:    Cij: NO! (cache misses)   Aik: NO! (cache misses)   Bkj: OK (access same elt)
• Vectorization enabled: Cij: NO (not contiguous)  Aik: NO (not contiguous)  Bkj: OK (access same elt)

Inner loop = j:
• Right use of cache:    Cij: OK (contiguous)      Aik: OK (access same elt) Bkj: OK (contiguous)
• Vectorization enabled: Cij: OK (contiguous, no W conflict)  Aik: OK (access same elt)  Bkj: OK (contiguous)

Inner loop = k:
• Right use of cache:    Cij: OK (access same elt) Aik: OK (contiguous)      Bkj: NO! (cache misses)
• Vectorization enabled: Cij: NO! (W conflict)     Aik: OK (contiguous)      Bkj: NO (not contiguous)

 inner loop = j-loop: the only right solution
(without changing the data storage) 16
Experiments

17
Exp. on an Intel Xeon Haswell
Experiments (2018):
Dense matrix product: 4096x4096, double precision
Processor: Intel Xeon Haswell E5-2637 v3, 2014 (4 physical cores, 2 threads/core)

  Seq. naive -O0:                              0.12 Gflops   ×1.0
  Seq. naive -O3:                              0.35 Gflops   ×2.9
  -O3 + optimized code + vectorization:        3.10 Gflops   ×25.8    (2 threads/core, best configuration)
  BLAS, monothread (OpenBLAS):                 46.3 Gflops   ×385.8   (1 thread/core, best configuration)

 Use optimized HPC libraries when available
 Optimize your source code when an HPC library does not exist
18
Exp. on 2 Intel Xeon architectures
Experiments of kernel K0 (section 2), 2019:
• Sarah: quad-core at 3.5 GHz (E5-2637 v3 « haswell », 15MB cache, 2014, gcc 5.4.0)
• Kyle: octo-core at 2.1 GHz (Silver 4110 CPU « skylake », 11MB cache, 2017, gcc 7.3.0)

[Figure: GFlops (0 to 8) of the successive K0 versions (naive, +Accu, +TB, +Loop-unroll, +Vect-Accu, +Long op, ikj) on Sarah and Kyle, at -O0 and -O3, for 1024x1024 and 2048x2048 matrices]

 The K0-ikj algorithm appears the best
 Significant parts of small matrices fit in cache: higher K0 performance
19
Exp. on 2 Intel Xeon architectures
Experiments of kernel K1 (OpenBLAS), 2019:
• Sarah: quad-core at 3.5 GHz (E5-2637 v3 « haswell », 15MB cache, 2014, gcc 5.4.0)
• Kyle: octo-core at 2.1 GHz (Silver 4110 CPU « skylake », 11MB cache, 2017, gcc 7.3.0)

[Figure: GFlops (0 to 60) of the best K0 (TP) versus K1 (OpenBLAS) on Sarah and Kyle, for 1024x1024, 2048x2048 and 4096x4096 matrices]

 The "BLAS" are the result of long developments: use this library!
20
Examples of serial optimization and
vectorization with SIMD units

Questions ?

21
