
Accelerating CFD Simulations with GPUs

Patrice Castonguay
HPC Developer Technology Engineer
December 4th, 2012

Outline
Computational Fluid Dynamics
Why GPUs?

Accelerating ANSYS Fluent (commercial CFD software)
  Background
  Novel parallel algorithms to replace sequential algorithms
  Performance results

Accelerating SD3D (research code developed at Stanford)
  Background
  Performance results

Conclusions
2

Computational Fluid Dynamics

Computational Fluid Dynamics


Simulation of fluid flows
Large number of applications

Computational Fluid Dynamics


Navier-Stokes equations: coupled nonlinear partial differential equations which govern unsteady, compressible, viscous fluid flows

Computational Fluid Dynamics


Partition the domain into a large number of cells, and solve for the fluid properties (density, velocity, pressure) inside each cell

Why GPUs?

Why GPUs
Higher performance

                     Intel CPU                NVIDIA GPU
                     (8-core Sandy Bridge)    (Tesla GK110)
fp32 performance     384 Gflops               3935 Gflops (10x)
fp64 performance     192 Gflops               1311 Gflops (6.8x)
Memory bandwidth     52 GB/s                  250 GB/s (4.8x)
8

Why GPUs
Power efficiency
Traditional CPUs not economically feasible

Jaguar (3rd fastest supercomputer in Nov. 2011): 2.3 petaflops at 7 megawatts
7 megawatts: about the power consumption of 7,000 homes


9

Why GPUs
Power efficiency
Traditional CPUs not economically feasible

Scaled to 100 petaflops

Jaguar at 2.3 petaflops: 7 megawatts (7,000 homes)
At 100 petaflops: about 300 megawatts (300,000 homes, roughly Quebec City and its metropolitan area)
10

Why GPUs
CPU: optimized for serial tasks

GPU accelerator: optimized for many parallel tasks, with higher computational power per watt


11

Why GPUs
Many scientific applications already benefit from GPUs
Relative performance, K20X vs. dual-socket Sandy Bridge E5-2687W @ 3.10 GHz (Dual-CPU = 1.00):

Application   Single-CPU + K20X   Single-CPU + M2090   Dual-CPU
Chroma        10.20               8.00                 1.00
SPECFEM3D     8.85                5.41                 1.00
AMBER         7.17                3.46                 1.00
WS-LSMS       4.40                1.80                 1.00
NAMD          2.73                1.71                 1.00
12

Why GPUs
Algorithms found in computer games are surprisingly similar to the ones found in scientific applications

Thriving gaming market funds GPU development for supercomputers

13

ANSYS Fluent

14

ANSYS Fluent
Largest share of the CFD software market
Finite volume method

Many different options (incompressible/compressible, inviscid/viscous, two-phase, explicit/implicit, ...)
Most popular option: implicit incompressible Navier-Stokes solver

15

Incompressible Navier-Stokes
Coupled non-linear PDEs
Unknowns: u, v, w and p (density is constant)
Mass:

Momentum x:

Momentum y:

Momentum z:
16
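For reference, the incompressible Navier-Stokes equations referred to above, in one standard form (constant density ρ, velocity u = (u, v, w), pressure p, kinematic viscosity ν); the exact form and nondimensionalization used in Fluent may differ:

```latex
% Mass (continuity)
\nabla \cdot \mathbf{u} = 0
% Momentum (x, y and z components, written compactly in vector form)
\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\,\mathbf{u}
  = -\frac{1}{\rho}\,\nabla p + \nu\,\nabla^{2}\mathbf{u}
```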

Incompressible Navier-Stokes
Solution procedure (repeated each outer iteration):
1. Assemble the linear system of equations (~33% of runtime)
2. Solve the linear system Ax = b (~67% of runtime): accelerate this first
3. If converged, stop; otherwise, go back to step 1
17

Incompressible Navier-Stokes
Large sparse system of equations with 4x4 block entries

In practice, millions of rows

A_ij is a 4x4 matrix

x_i is a 4x1 vector ( x_i = [p_i, u_i, v_i, w_i] )


18
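As a concrete illustration of this block structure, here is a small sketch using SciPy's block compressed sparse row (BSR) format with 4x4 blocks; the matrix values are made up for the example and are not from Fluent:

```python
import numpy as np
from scipy.sparse import bsr_matrix

# Made-up 3x3 block system (in practice there are millions of block rows).
block = np.eye(4)                                  # each A_ij is a dense 4x4 block
indptr  = np.array([0, 2, 3, 5])                   # block-row pointers
indices = np.array([0, 1, 1, 1, 2])                # block-column index of each stored block
data    = np.stack([k * block for k in (4.0, -1.0, 3.0, -1.0, 5.0)])  # the 4x4 blocks

A = bsr_matrix((data, indices, indptr), blocksize=(4, 4), shape=(12, 12))

# Each x_i = [p_i, u_i, v_i, w_i]; the full vector stacks all cells.
x = np.ones(12)
b = A @ x          # block sparse matrix-vector product, the core AMG kernel
```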

Multigrid
2D Poisson equation, Jacobi smoother
[figure: initial error and error after 5, 10, and 15 iterations, plus convergence of the error norm]


19

Multigrid
Most simple iterative solvers are efficient at damping high-frequency errors, but inefficient at damping low-frequency errors
Key idea: represent the error on a coarser grid, so that low-frequency errors become high-frequency errors

20

Multigrid

[figure: two-grid cycle, smooth on the fine grid, restrict to the coarse grid, smooth on the coarse grid]

21

Algebraic Multigrid (AMG)


Example: two-level V-cycle
1. Pre-smooth
2. Compute residual
3. Restrict residual
4. Create the coarse-level system and solve it
5. Prolongate correction
6. Post-smooth
22

Algebraic Multigrid (AMG)


Apply recursively: starting from ~10^6 unknowns, alternate pre-smoothing and coarsening down to a level with only 1-2 unknowns, solve directly there, then prolongate the correction and post-smooth on the way back up through each level
23
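A minimal, self-contained sketch of this recursion in Python with dense NumPy matrices; the damped Jacobi smoother is a stand-in for Fluent's smoothers, and the A_levels / R_levels / P_levels hierarchy is assumed to come from a coarsening procedure such as the aggregation described next:

```python
import numpy as np

def jacobi_smooth(A, b, x, iterations, omega=0.8):
    """Damped Jacobi sweeps, standing in for the smoothers used in Fluent/nvAMG."""
    d_inv = 1.0 / np.diag(A)
    for _ in range(iterations):
        x = x + omega * d_inv * (b - A @ x)
    return x

def v_cycle(level, A_levels, R_levels, P_levels, b, x, nu_pre=0, nu_post=3):
    """One AMG V-cycle (0 pre- and 3 post-smoothing sweeps, as in the timing slides later)."""
    A = A_levels[level]
    if level == len(A_levels) - 1:                 # coarsest level: 1-2 unknowns, solve directly
        return np.linalg.solve(A, b)
    x = jacobi_smooth(A, b, x, nu_pre)             # pre-smooth
    r = b - A @ x                                  # compute residual
    r_c = R_levels[level] @ r                      # restrict residual to the coarser level
    e_c = v_cycle(level + 1, A_levels, R_levels, P_levels,
                  r_c, np.zeros_like(r_c), nu_pre, nu_post)
    x = x + P_levels[level] @ e_c                  # prolongate correction
    return jacobi_smooth(A, b, x, nu_post)         # post-smooth
```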

ANSYS Fluent
We have accelerated the AMG solver in Fluent with GPUs

Some parts of the algorithm were easily ported to the GPU


Sparse matrix-vector multiplications
Norm computations

Other operations were inherently sequential


Aggregation procedure to create the restriction operator
Gauss-Seidel and DILU smoothers
Matrix-matrix multiplication

Need to develop novel parallel algorithms!


24

Aggregation
How do we create the coarser levels?
Graph representation of a matrix: each row is a vertex, and each off-diagonal entry A_ij is an edge with weight w_i,j = f(A_i,j, A_j,i)
[figure: example graph with vertices 1-6; edge weight w_3,5 = f(A_3,5, A_5,3)]
25

Aggregation
How do we create the coarser levels?

Edge weight between vertices i and j: w_i,j = f(A_i,j, A_j,i)

26

Aggregation
In aggregation-based AMG, group vertices that are strongly connected to each other

27

Aggregation
New matrix/graph:

28

Aggregation
Want to merge vertices that are strongly connected to each other
Similar to the weighted graph matching problem, with edge weights w_i,j = f(A_i,j, A_j,i)

29

Parallel aggregation
Each vertex extends a hand to its strongest neighbor

30

Parallel aggregation
Each vertex checks if its strongest neighbor extended a hand back

31

Parallel aggregation

32

Parallel aggregation
Repeat with unmatched vertices

33

Parallel aggregation
[figure sequence: the handshake procedure is repeated with the remaining unmatched vertices until every vertex belongs to an aggregate]
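A serial Python sketch of the handshake idea described above, under the assumption of a symmetric, non-negative edge-weight matrix W; this illustrates the concept only and is not the actual GPU kernel used in nvAMG. On the GPU, the two steps of each round map naturally to one thread per vertex.

```python
import numpy as np

def handshake_aggregation(W, max_rounds=10):
    """Pair strongly connected vertices by mutual 'handshakes'.
    W is a symmetric matrix of non-negative edge weights w_ij = f(A_ij, A_ji),
    with W[i, j] > 0 only if vertices i and j are connected."""
    n = W.shape[0]
    aggregate = -np.ones(n, dtype=int)             # -1 = not yet aggregated
    next_id = 0
    for _ in range(max_rounds):
        unmatched = np.where(aggregate < 0)[0]
        if unmatched.size == 0:
            break
        # Step 1: each unmatched vertex extends a hand to its strongest unmatched neighbor.
        hand = {}
        for i in unmatched:
            w = W[i].copy()
            w[aggregate >= 0] = 0.0                # ignore vertices already in an aggregate
            w[i] = 0.0
            j = int(np.argmax(w))
            hand[i] = j if w[j] > 0.0 else -1
        # Step 2: merge mutual handshakes (i -> j and j -> i) into one aggregate.
        for i in unmatched:
            j = hand[i]
            if j >= 0 and hand.get(j, -1) == i and aggregate[i] < 0 and aggregate[j] < 0:
                aggregate[i] = aggregate[j] = next_id
                next_id += 1
    # Leftover vertices become singleton aggregates.
    for i in np.where(aggregate < 0)[0]:
        aggregate[i] = next_id
        next_id += 1
    return aggregate
```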

Smoothers
Smoothing step

M is called the preconditioning matrix
Here x is the solution vector (it includes all unknowns in the system)

39
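The update formula on this slide is missing from the text; the smoothing step it describes is presumably the standard preconditioned relaxation:

```latex
x^{(k+1)} = x^{(k)} + M^{-1}\left(b - A\,x^{(k)}\right)
```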

Jacobi Smoother
For Jacobi, M is block-diagonal

Inherently parallel: each x_i can be updated independently of the others
Maps very well to the GPU
40
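A minimal block-Jacobi sweep in Python for a dense matrix with 4x4 blocks, as a sketch of why the method is embarrassingly parallel (each block row is updated from the previous iterate only); the relaxation factor omega is an assumption, not a Fluent setting:

```python
import numpy as np

def block_jacobi_sweep(A, b, x, block=4, omega=1.0):
    """One block-Jacobi sweep: each block row is updated using only the previous
    iterate, so every update is independent and can run in parallel on the GPU."""
    n_blocks = b.size // block
    r = b - A @ x                                   # residual from the previous iterate
    x_new = x.copy()
    for i in range(n_blocks):                       # independent iterations -> GPU threads
        rows = slice(i * block, (i + 1) * block)
        D_ii = A[rows, rows]                        # diagonal 4x4 block of A
        x_new[rows] = x[rows] + omega * np.linalg.solve(D_ii, r[rows])
    return x_new
```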

Parallel Smoothers
Smoothers available in Fluent (Gauss-Seidel and DILU) are sequential!
For the DILU smoother, M has a factored form in which E is a block diagonal matrix (see the note after this slide)

Can recover ILU(0) for certain matrices
Only requires one extra diagonal of storage
41
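The missing formulas on this slide are presumably the textbook DILU form, with L and U the strictly lower and upper (block) triangular parts of A and E block diagonal; Fluent's exact variant may differ in detail:

```latex
M = (E + L)\,E^{-1}(E + U),
\qquad
E_{ii} = A_{ii} \;-\; \sum_{j<i} A_{ij}\,E_{jj}^{-1}A_{ji}
```

This choice makes the (block) diagonal of M equal to that of A, which is why only one extra diagonal of storage is needed.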

DILU Smoother
Construction of the block diagonal matrix E is sequential (each E_ii depends on the previous ones)

Inversion of M is also sequential (two triangular matrix inversions)

42

Coloring
Use coloring to extract parallelism
Coloring: an assignment of colors to vertices such that no two vertices of the same color are adjacent
[figure: grid of vertices colored so that adjacent vertices never share a color]

43

Coloring
Use coloring to extract parallelism
With m unknowns and p colors, about m/p unknowns can be processed in parallel

44

DILU Smoother
Construction of the block diagonal matrix E is now parallel (all vertices of the same color can be processed simultaneously)

Inversion of M (the two triangular solves) is also parallel

45
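A conceptual Python sketch of the colored construction (not Fluent's code): vertices are processed color by color, and because two vertices of the same color are never adjacent, every E_ii within a color can be computed simultaneously; A_blocks, neighbors and colors are assumed data structures for this illustration:

```python
import numpy as np

def build_dilu_diagonal(A_blocks, neighbors, colors):
    """Build the DILU block diagonal E color by color.
    A_blocks[(i, j)] is the 4x4 block of A, neighbors[i] the adjacent vertex ids,
    colors[i] the color of vertex i (adjacent vertices never share a color)."""
    n = len(colors)
    E = [None] * n
    for c in sorted(set(colors)):                        # colors processed in sequence
        for i in [v for v in range(n) if colors[v] == c]:    # in parallel on the GPU
            E_i = A_blocks[(i, i)].copy()
            for j in neighbors[i]:
                if colors[j] < c:                        # only already-processed colors contribute
                    E_i -= A_blocks[(i, j)] @ np.linalg.inv(E[j]) @ A_blocks[(j, i)]
            E[i] = E_i
    return E
```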

Parallel Graph Coloring Min/Max


How do you color a graph/matrix in parallel?
Parallel graph coloring algorithm of Luby, with a new variant developed at NVIDIA

46

Parallel Graph Coloring Min/Max


Assign a random number to each vertex

[figure: 8x6 grid of vertices, each labeled with its random number]
65 24 66 75 9 1
74 12 71 45 80 75
39 86 64 98 91 79
59 77 57 11 60 50
14 10 25 81 39 7
44 51 54 2 64 69
79 33 88 72 95 10
48 62 11 90 20 44

47

Parallel Graph Coloring Min/Max


Round 1: Each vertex checks if it is a local maximum or minimum. If max, color = dark blue; if min, color = green.
[figure: grid with the round-1 maxima and minima colored]
48

Parallel Graph Coloring Min/Max


Round 2: Each remaining vertex checks if it is a local maximum or minimum among the still-uncolored vertices. If max, color = pink; if min, color = red.
[figure: grid with the round-2 maxima and minima colored]
49

Parallel Graph Coloring Min/Max


Round 3: Each remaining vertex checks if it is a local maximum or minimum among the still-uncolored vertices. If max, color = purple; if min, color = white.
[figure: grid with the round-3 maxima and minima colored]
50
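A compact Python sketch of the min/max rounds shown above, in the spirit of Luby's algorithm; the actual NVIDIA variant may differ in details such as tie handling and how the random numbers are generated. Each round assigns two new colors, and any vertex with no remaining uncolored neighbors is trivially a local extremum, so the loop always terminates.

```python
import numpy as np

def min_max_coloring(adjacency, seed=0):
    """Color a graph in rounds: among still-uncolored vertices, every local maximum
    of the random numbers gets one new color and every local minimum gets another."""
    rng = np.random.default_rng(seed)
    n = len(adjacency)
    rand = rng.random(n)                         # one random number per vertex
    color = -np.ones(n, dtype=int)               # -1 = not yet colored
    next_color = 0
    while (color < 0).any():
        active = color < 0                       # snapshot of uncolored vertices this round
        for i in np.where(active)[0]:            # in parallel on the GPU
            nbrs = [j for j in adjacency[i] if active[j]]
            if all(rand[i] > rand[j] for j in nbrs):
                color[i] = next_color            # local maximum
            elif all(rand[i] < rand[j] for j in nbrs):
                color[i] = next_color + 1        # local minimum
        next_color += 2
    return color
```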

AMG Timings
CPU Fluent solver: AMG(F-cycle, agg8, DILU, 0 pre, 3 post)
GPU nvAMG solver: AMG(V-cycle, agg8, MC-DILU, 0 pre, 3 post)
[bar chart: AMG solve time (lower is better) for the Helix case (hex, 208K cells) and the Helix case (tet, 1173K cells) on K20X, C2090, and a 6-core 3930K]


51

AMG Timings
CPU Fluent solver: AMG(F-cycle, agg8, DILU, 0 pre, 3 post)
GPU nvAMG solver: AMG(V-cycle, agg2, MC-DILU, 0 pre, 3 post)
[bar chart: AMG solve time (lower is better) for the Airfoil case (hex, 784K cells) and the Aircraft case (hex, 1798K cells) on K20X, C2090, and a 6-core 3930K]

52

SD3D

Flux Reconstruction (FR) Method


Proposed by Huynh in 2007; similar to the Spectral Difference and Discontinuous Galerkin methods
High-order method: spatial order of accuracy is > 2
[figure: error vs. computational cost, a high-order method reaches a given error at lower cost than a low-order method]

54

Flux Reconstruction (FR) Method


Solution in each element approximated by a multidimensional polynomial of order N
Order of accuracy: h^(N+1)
Multiple DOFs per element
[figure: solution points within an element for N = 1, 2, 3, 4]

55
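To make the h^(N+1) scaling concrete (my arithmetic, not a result from the slides): for a 5th-order scheme (N = 4), halving the element size h reduces the error by roughly a factor of 32, versus only a factor of 4 for a 2nd-order scheme:

```latex
\|e\| = \mathcal{O}\!\left(h^{N+1}\right)
\quad\Longrightarrow\quad
\frac{\|e(h/2)\|}{\|e(h)\|} \approx 2^{-(N+1)} = \frac{1}{32}
\quad \text{for } N = 4
```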

Flux Reconstruction (FR) Method


Plunging airfoil: zero AOA, Re = 1850, plunging frequency 2.46 rad/s
5th-order accuracy in space, 4th-order accurate RK time stepping

56

Flux Reconstruction (FR) Method


Computations are demanding: millions of DOFs, hundreds of thousands of time steps, and more work per DOF compared to low-order methods

Until recently, high-order simulations over complex 3D geometries were intractable unless you had access to a large cluster

GPUs to the rescue!


57

GPU Implementation
Ported entire code to the GPU

FR and other high-order methods for unstructured grids map well to GPUs:
Large amount of parallelism (millions of DOFs)
More work per DOF compared to low-order methods Cell-local operations benefit from fast user-managed on-chip memory

Required some programming efforts, but was worth while


58

Single-GPU Implementation
[bar chart: speedup vs. order of accuracy (3 to 6) for tets, hexes, and prisms; speedups range from about 5.5x to 11.3x]

Speedup of the single-GPU algorithm (Tesla C2050) relative to a parallel computation on a six-core Xeon x5670 (Westmere) @ 2.9GHz

59

Applications
Unsteady viscous flow over a sphere at Re = 300, Mach = 0.3
28,914 prisms and 258,075 tets, 4th-order accuracy, 6.3 million DOFs
Ran on a desktop machine with 3 C2050 GPUs

60

Applications
3 GPUs: same performance as 21 Xeon x5670 CPUs (126 cores)
3-GPU personal computer: about $10,000, easy to manage

Iso-surfaces of Q-criterion colored by Mach number for flow over sphere at Re=300, M=0.3

61

Multi-GPU Implementation
[plot: speedup relative to 1 GPU vs. number of GPUs (up to 32), for three variants: no overlap, communication overlap, and communication and GPU transfers overlap]
Speedup relative to 1 GPU for a 6th-order accurate simulation running on a mesh with 55,947 tetrahedral elements

62
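The overlap strategies compared in this plot can be sketched with non-blocking MPI, as below; this is a conceptual illustration with assumed data structures (halo_send, halo_recv, compute_elements are hypothetical names), not SD3D's actual implementation. The third variant would additionally overlap the device-host copies of halo data using separate CUDA streams.

```python
from mpi4py import MPI

def residual_with_overlap(interior_ids, boundary_ids, halo_send, halo_recv,
                          neighbor_ranks, compute_elements):
    """One conceptual step: start the halo exchange, compute interior elements while
    messages are in flight, then compute the boundary elements once halos arrive."""
    comm = MPI.COMM_WORLD
    requests = []
    for rank in neighbor_ranks:
        requests.append(comm.Isend(halo_send[rank], dest=rank, tag=0))
        requests.append(comm.Irecv(halo_recv[rank], source=rank, tag=0))

    compute_elements(interior_ids)       # interior work hides the communication time
    MPI.Request.Waitall(requests)        # halo data has now arrived
    compute_elements(boundary_ids)       # elements that need data from neighboring GPUs
```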

Applications
Transitional flow over the SD7003 airfoil, Re = 60,000, Mach = 0.2, AOA = 4 degrees
4th-order accurate solution, 400,000 RK iterations, 21.2 million DOFs

63

Applications

15 hours on 16 C2070s

157 hours ( > 6 days ) on 16 Xeon x5670 CPUs


64

Conclusions
Most scientific applications have a large amount of parallelism

Parallel applications map well to GPUs, which have hundreds of simple, power-efficient cores
Higher performance, higher performance/watt

Presented two successful uses of GPUs in CFD:
Linear solver in ANSYS Fluent (hard to parallelize)
Research-oriented CFD code (SD3D)

65

Conclusions
The future of HPC is CPU + GPU/accelerator

Need to develop new parallel numerical methods to replace inherently sequential algorithms (such as Gauss-Seidel and ILU preconditioners)
The gap between peak flops and memory bandwidth is still growing:
Flops are free
Need to develop numerical methods with a larger flops/byte ratio

66

Questions?
Patrice Castonguay [email protected]
