Accelerating CFD Simulations with GPUs
Patrice Castonguay
HPC Developer Technology Engineer
December 4th, 2012
Outline
Computational Fluid Dynamics
Why GPUs?
Accelerating ANSYS Fluent (commercial CFD software)
  Background
  Novel parallel algorithms to replace sequential algorithms
  Performance results
Accelerating SD3D (research code developed at Stanford)
  Background
  Performance results
Conclusions
Computational Fluid Dynamics
Computational Fluid Dynamics
Simulation of fluid flows
Large number of applications
Computational Fluid Dynamics
Navier-Stokes equations: coupled nonlinear partial differential equations which govern unsteady, compressible, viscous fluid flows
Computational Fluid Dynamics
Partition the domain into a large number of cells, and solve for the fluid properties (density, velocity, pressure) inside each cell
Why GPUs?
Why GPUs
Higher performance
Intel CPU (8-core Sandy Bridge):
  fp32 performance: 384 Gflops
  fp64 performance: 192 Gflops
  Memory bandwidth: 52 GB/s
NVIDIA GPU (Tesla GK110):
  fp32 performance: 3935 Gflops (10x)
  fp64 performance: 1311 Gflops (6.8x)
  Memory bandwidth: 250 GB/s (4.8x)
Why GPUs
Power efficiency
Traditional CPUs not economically feasible
Jaguar (3rd fastest supercomputer in Nov. 2011): 2.3 petaflops @ 7 megawatts
7 megawatts ≈ the power used by 7,000 homes
Why GPUs
Power efficiency
Traditional CPUs not economically feasible
Scaled to 100 petaflops:
Jaguar @ 2.3 petaflops: 7 megawatts (≈ 7,000 homes)
At 100 petaflops: ≈ 300 megawatts (≈ 300,000 homes, roughly Quebec City and its metropolitan area)
Why GPUs
CPU: optimized for serial tasks
GPU accelerator: optimized for many parallel tasks
Higher computational power per watt
Why GPUs
Many scientific applications already benefit from GPUs
Relative performance: K20X vs. dual-socket Sandy Bridge (E5-2687W @ 3.10 GHz), Dual-CPU = 1.00

Application   Single CPU + K20X   Single CPU + M2090   Dual CPU
Chroma        10.20               8.00                 1.00
SPECFEM3D     8.85                5.41                 1.00
AMBER         7.17                3.46                 1.00
WS-LSMS       4.40                1.80                 1.00
NAMD          2.73                1.71                 1.00
Why GPUs
Algorithms found in computer games are surprisingly similar to the ones found in scientific applications
Thriving gaming market funds GPU development for supercomputers
ANSYS Fluent
ANSYS Fluent
Largest share of the CFD software market
Finite volume method
Many different options (incompressible/compressible, inviscid/viscous, two-phase, explicit/implicit, …)
Most popular option: implicit incompressible Navier-Stokes solver
Incompressible Navier-Stokes
Coupled non-linear PDEs
Unknowns: u, v, w and p (density is constant)
Mass: ∂u/∂x + ∂v/∂y + ∂w/∂z = 0
Momentum x: ∂u/∂t + u ∂u/∂x + v ∂u/∂y + w ∂u/∂z = -(1/ρ) ∂p/∂x + ν ∇²u
Momentum y: ∂v/∂t + u ∂v/∂x + v ∂v/∂y + w ∂v/∂z = -(1/ρ) ∂p/∂y + ν ∇²v
Momentum z: ∂w/∂t + u ∂w/∂x + v ∂w/∂y + w ∂w/∂z = -(1/ρ) ∂p/∂z + ν ∇²w
Incompressible Navier-Stokes
Solution procedure:
Assemble the linear system of equations (~33% of runtime)
Solve the linear system of equations, Ax = b (~67% of runtime; accelerate this first)
If converged, stop; otherwise repeat
Incompressible Navier-Stokes
Large sparse system of equations with 4x4 block entries
In practice, millions of rows
Each Aij is a 4x4 matrix
Each xi is a 4x1 vector: xi = [pi, ui, vi, wi]
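To make the block structure concrete, here is a minimal sketch (not Fluent's actual data structures; the tiny 3-cell coupling pattern and values are made up) of a block-CSR (BSR) matrix with 4x4 blocks and its matrix-vector product, written with NumPy/SciPy:

    import numpy as np
    import scipy.sparse as sp

    n_cells = 3        # tiny example; in practice, millions of cells
    bs = 4             # unknowns per cell: [p, u, v, w]

    # BSR storage: block-row pointers, block-column indices, and one dense
    # 4x4 block per stored entry (here each cell is coupled to itself and
    # its neighbours in a 1D chain of cells).
    indptr  = np.array([0, 2, 5, 7])           # block-row i owns blocks indptr[i]:indptr[i+1]
    indices = np.array([0, 1, 0, 1, 2, 1, 2])  # block-column index of each stored block
    block_rows = [0, 0, 1, 1, 1, 2, 2]         # implied by indptr, listed here for clarity
    data = np.empty((len(indices), bs, bs))
    for k, (i, j) in enumerate(zip(block_rows, indices)):
        data[k] = 4.0 * np.eye(bs) if i == j else -np.ones((bs, bs))  # made-up values

    A = sp.bsr_matrix((data, indices, indptr), shape=(n_cells * bs, n_cells * bs))

    x = np.ones(n_cells * bs)   # x stacks [p_i, u_i, v_i, w_i] for every cell
    y = A @ x                   # block sparse matrix-vector product
    print(y.reshape(n_cells, bs))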
Multigrid
2D Poisson equation, Jacobi smoother
Figure: error of the Jacobi iteration on the 2D Poisson problem: initial error, error after 5, 10, and 15 iterations, and convergence of the error norm
Multigrid
Most simple iterative solvers are efficient at damping high-frequency errors but inefficient at damping low-frequency errors
Key idea: represent the error on a coarser grid, so that low-frequency errors become high-frequency errors
Multigrid
Smooth on the fine grid, restrict to the coarse grid, then smooth on the coarse grid
Algebraic Multigrid (AMG)
Example: Two-Level V-cycle
Pre-smooth
Compute residual
Restrict residual
Create coarse-level operator and solve on the coarse level
Prolongate correction
Post-smooth
Algebraic Multigrid (AMG)
Apply recursively
~10^6 unknowns on the finest level
Pre-smooth and coarsen repeatedly until only 1-2 unknowns remain
Solve on the coarsest level
Prolongate the correction and post-smooth on the way back up to the finest level
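To illustrate the recursive cycle structure, here is a minimal NumPy sketch. It only shows the shape of the cycle: Fluent's solver is an algebraic multigrid with aggregation coarsening, whereas this toy coarsens a 1D Poisson problem geometrically, and the smoother and cycle parameters are illustrative rather than Fluent's settings.

    import numpy as np

    def poisson1d(n):                            # 1D Poisson model problem
        return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

    def jacobi(A, x, b, iters, w=2.0 / 3.0):     # weighted-Jacobi smoother
        D = np.diag(A)
        for _ in range(iters):
            x = x + w * (b - A @ x) / D
        return x

    def prolongation(n_fine):                    # linear interpolation, coarse -> fine
        n_coarse = (n_fine - 1) // 2
        P = np.zeros((n_fine, n_coarse))
        for j in range(n_coarse):
            P[2 * j, j] += 0.5
            P[2 * j + 1, j] = 1.0
            P[2 * j + 2, j] += 0.5
        return P

    def vcycle(A, b, x, pre=1, post=3):
        if len(b) <= 3:                          # coarsest level: solve directly
            return np.linalg.solve(A, b)
        x = jacobi(A, x, b, pre)                 # pre-smooth
        r = b - A @ x                            # compute residual
        P = prolongation(len(b))                 # coarse -> fine interpolation
        R = 0.5 * P.T                            # restriction (full weighting)
        e = vcycle(R @ A @ P, R @ r, np.zeros(P.shape[1]), pre, post)  # coarse-level solve
        x = x + P @ e                            # prolongate correction
        return jacobi(A, x, b, post)             # post-smooth

    n = 2**7 - 1                                 # 127 unknowns, so coarsening nests cleanly
    A, rhs, x = poisson1d(n), np.ones(n), np.zeros(n)
    for cycle in range(8):
        x = vcycle(A, rhs, x)
        print(cycle, np.linalg.norm(rhs - A @ x))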
ANSYS Fluent
We have accelerated the AMG solver in Fluent with GPUs
Some parts of the algorithm were easily ported to the GPU
Sparse matrix-vector multiplications
Norm computations
Other operations were inherently sequential
Aggregation procedure to create the restriction operator
Gauss-Seidel, DILU smoothers
Matrix-matrix multiplication
Need to develop novel parallel algorithms!
Aggregation
How do we create the coarser levels?
Graph representation of a matrix: vertices correspond to the unknowns, and each pair of off-diagonal entries Ai,j, Aj,i defines an edge with weight wi,j = f(Ai,j, Aj,i), e.g. w3,5 = f(A3,5, A5,3)
Aggregation
Edge weight between vertices i and j: wi,j = f(Ai,j, Aj,i)
Aggregation
In aggregation-based AMG, group vertices that are strongly connected to each other
Aggregation
Aggregation produces a new, coarser matrix/graph
Aggregation
We want to merge vertices that are strongly connected to each other
Similar to the weighted graph matching problem, with edge weights wi,j = f(Ai,j, Aj,i)
Parallel aggregation
Each vertex extends a hand to its strongest neighbor
Each vertex checks if its strongest neighbor extended a hand back
Repeat with unmatched vertices
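A minimal NumPy sketch of this handshake procedure (a Python loop per round stands in for the one-thread-per-vertex GPU kernel; the dense symmetric weight matrix is a made-up example, and vertices left over after the last round simply become singleton aggregates instead of being merged into a neighbour):

    import numpy as np

    def handshake_aggregation(W, max_rounds=10):
        n = W.shape[0]
        aggregate = -np.ones(n, dtype=int)               # -1 = not yet aggregated
        count = 0                                        # number of aggregates created
        for _ in range(max_rounds):
            free = aggregate < 0
            if not free.any():
                break
            # keep only edges between still-unaggregated vertices
            Wf = np.where(np.outer(free, free), W, -np.inf)
            np.fill_diagonal(Wf, -np.inf)
            hand = Wf.argmax(axis=1)                     # each vertex extends a hand to
                                                         # its strongest free neighbour
            for i in np.flatnonzero(free):               # "for each vertex in parallel"
                j = hand[i]
                if Wf[i, j] == -np.inf:                  # no free neighbour left: singleton
                    aggregate[i] = count; count += 1
                elif hand[j] == i and i < j:             # mutual handshake: merge the pair
                    aggregate[i] = aggregate[j] = count; count += 1
        for i in np.flatnonzero(aggregate < 0):          # anything still unmatched
            aggregate[i] = count; count += 1
        return aggregate

    rng = np.random.default_rng(0)
    W = rng.random((8, 8)); W = 0.5 * (W + W.T)          # toy symmetric edge weights
    print(handshake_aggregation(W))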
Smoothers
Smoothing step: x ← x + M⁻¹ (b − A x)
M is called the preconditioning matrix
Here x is the solution vector (it includes all unknowns in the system)
Jacobi Smoother
For Jacobi, M is block-diagonal
Inherently parallel: each xi can be updated independently of the others
Maps very well to the GPU
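A minimal sketch of one block-Jacobi sweep, x ← x + M⁻¹(b − Ax) with M the block diagonal of A, on a made-up block-tridiagonal test system (the loop over block rows is exactly the part that becomes one GPU thread, or thread block, per row; this is not Fluent's implementation):

    import numpy as np
    import scipy.sparse as sp

    def block_jacobi_sweep(A, x, b, bs=4):
        """One sweep of x <- x + M^-1 (b - A x), with M = block diagonal of A."""
        r = b - A @ x                               # residual, computed once
        x = x.copy()
        for i in range(A.shape[0] // bs):           # every block row is independent
            s = slice(i * bs, (i + 1) * bs)
            D_ii = A[s, s].toarray()                # 4x4 diagonal block
            x[s] += np.linalg.solve(D_ii, r[s])
        return x

    n, bs = 3, 4                                    # 3 cells, 4 unknowns per cell
    A = sp.lil_matrix((n * bs, n * bs))
    for i in range(n):
        A[i*bs:(i+1)*bs, i*bs:(i+1)*bs] = 10.0 * np.eye(bs)           # diagonal blocks
        if i + 1 < n:
            A[i*bs:(i+1)*bs, (i+1)*bs:(i+2)*bs] = -np.ones((bs, bs))  # coupling blocks
            A[(i+1)*bs:(i+2)*bs, i*bs:(i+1)*bs] = -np.ones((bs, bs))
    A = A.tocsr()

    rhs, x = np.ones(n * bs), np.zeros(n * bs)
    for sweep in range(20):
        x = block_jacobi_sweep(A, x, rhs, bs)
    print("residual norm:", np.linalg.norm(rhs - A @ x))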
Parallel Smoothers
Smoothers available in Fluent (Gauss-Seidel & DILU) are sequential!
For the DILU smoother, M has the form M = (E + L) E⁻¹ (E + U), where L and U are the strict lower and upper triangular parts of A
E is a block diagonal matrix chosen such that diag(M) = diag(A)
Can recover ILU(0) for certain matrices
Only requires one extra diagonal of storage
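To make the recurrences concrete, here is a scalar (non-block) sketch of the DILU form above; Fluent uses the 4x4 block version, and the small example matrix is made up. The loops also make visible why, without reordering, both the setup of E and the application of M⁻¹ are sequential: E[i] depends on earlier E[j], and the two triangular solves sweep forward and backward.

    import numpy as np

    def dilu_setup(A):
        """Diagonal E such that diag((E + L) E^-1 (E + U)) = diag(A)."""
        n = A.shape[0]
        E = np.zeros(n)
        for i in range(n):                     # sequential: E[i] depends on E[j], j < i
            E[i] = A[i, i] - sum(A[i, j] * A[j, i] / E[j] for j in range(i))
        return E

    def dilu_apply(A, E, r):
        """z = M^-1 r via a forward and a backward triangular solve."""
        n = len(r)
        y = np.zeros(n)
        for i in range(n):                     # forward solve:  (E + L) y = r
            y[i] = (r[i] - A[i, :i] @ y[:i]) / E[i]
        z = np.zeros(n)
        for i in reversed(range(n)):           # backward solve: (E + U) z = E y
            z[i] = (E[i] * y[i] - A[i, i+1:] @ z[i+1:]) / E[i]
        return z

    A = np.array([[ 4.0, -1.0,  0.0],
                  [-1.0,  4.0, -1.0],
                  [ 0.0, -1.0,  4.0]])
    E = dilu_setup(A)
    print(dilu_apply(A, E, np.ones(3)))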
DILU Smoother
Construction of the block diagonal matrix E is sequential
Inversion of M is also sequential (two triangular matrix inversions)
Coloring
Use coloring to extract parallelism
Coloring: an assignment of colors to the vertices such that no two adjacent vertices have the same color
Coloring
With m unknowns and p colors, m/p unknowns can be processed in parallel
DILU Smoother
Construction of the block diagonal matrix is now parallel
Inversion of M is also parallel
Parallel Graph Coloring Min/Max
How do you color a graph/matrix in parallel?
Parallel graph coloring algorithm of Luby; a new variant developed at NVIDIA
Parallel Graph Coloring Min/Max
Assign a random number to each vertex
Parallel Graph Coloring Min/Max
Round 1: each vertex checks if it is a local maximum or minimum. If max, color = dark blue; if min, color = green
Parallel Graph Coloring Min/Max
Round 2: each remaining vertex checks if it is a local maximum or minimum. If max, color = pink; if min, color = red
Parallel Graph Coloring Min/Max
Round 3: each remaining vertex checks if it is a local maximum or minimum. If max, color = purple; if min, color = white
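A minimal NumPy sketch of these min/max rounds (the per-vertex Python loop stands in for one GPU thread per vertex; the small grid graph and the random seed are made up, and colors are simply integers, two new ones per round):

    import numpy as np

    def minmax_coloring(adj, seed=0):
        n = len(adj)
        rand = np.random.default_rng(seed).permutation(n)  # distinct random value per vertex
        color = -np.ones(n, dtype=int)                      # -1 = uncolored
        next_color = 0
        while (color < 0).any():
            new_color = color.copy()
            for v in range(n):                              # "for each vertex in parallel"
                if color[v] >= 0:
                    continue
                nbr = [u for u in adj[v] if color[u] < 0]   # uncolored neighbours
                if all(rand[v] > rand[u] for u in nbr):     # local maximum
                    new_color[v] = next_color
                elif all(rand[v] < rand[u] for u in nbr):   # local minimum
                    new_color[v] = next_color + 1
            color = new_color
            next_color += 2                                 # two colors assigned per round
        return color

    def grid_adjacency(nx, ny):                             # 2D grid graph, like the figures
        idx = lambda i, j: i * ny + j
        adj = [[] for _ in range(nx * ny)]
        for i in range(nx):
            for j in range(ny):
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    if 0 <= i + di < nx and 0 <= j + dj < ny:
                        adj[idx(i, j)].append(idx(i + di, j + dj))
        return adj

    print(minmax_coloring(grid_adjacency(4, 4)).reshape(4, 4))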
AMG Timings
CPU Fluent solver: AMG(F-cycle, agg8, DILU, 0pre, 3post)
GPU nvAMG solver: AMG(V-cycle, agg8, MC-DILU, 0pre, 3post)
Chart: AMG solve time (lower is better) on the Helix case, hex 208K and tet 1173K meshes, for K20X, C2090, and a six-core 3930K CPU
AMG Timings
CPU Fluent solver: AMG(F-cycle, agg8, DILU, 0pre, 3post)
GPU nvAMG solver: AMG(V-cycle, agg2, MC-DILU, 0pre, 3post)
Chart: AMG solve time (lower is better) on the Airfoil (hex 784K) and Aircraft (hex 1798K) cases, for K20X, C2090, and a six-core 3930K CPU
SD3D
Flux Reconstruction (FR) Method
Proposed by Huynh in 2007; similar to the Spectral Difference and Discontinuous Galerkin methods
High-order method: spatial order of accuracy is > 2
Figure: error vs. computational cost for high-order and low-order methods
Flux Reconstruction (FR) Method
Solution in each element is approximated by a multidimensional polynomial of order N
Order of accuracy: h^(N+1)
Multiple DOFs per element (figure: solution points for N = 1, 2, 3, 4)
Flux Reconstruction (FR) Method
Plunging airfoil: zero AOA, Re = 1850, frequency 2.46 rad/s
5th order accuracy in space, 4th order accurate RK time stepping
Flux Reconstruction (FR) Method
Computations are demanding: millions of DOFs, hundreds of thousands of time steps, and more work per DOF compared to low-order methods
Until recently, high-order simulations over complex 3D geometries were intractable unless you had access to a large cluster
GPUs to the rescue!
GPU Implementation
Ported entire code to the GPU
FR and other high-order methods for unstructured grids map well to GPUs:
Large amount of parallelism (millions of DOFs)
More work per DOF compared to low-order methods
Cell-local operations benefit from fast user-managed on-chip memory
Required some programming effort, but it was worthwhile
Single-GPU Implementation
Chart: speedups of roughly 5.5x to 11.3x for tets, hexes, and prisms at orders of accuracy 3 to 6
Speedup of the single-GPU algorithm (Tesla C2050) relative to a parallel computation on a six-core Xeon X5670 (Westmere) @ 2.9 GHz
Applications
Unsteady viscous flow over a sphere at Reynolds number 300, Mach 0.3
28,914 prisms and 258,075 tets, 4th order accuracy, 6.3 million DOFs
Ran on a desktop machine with 3 C2050 GPUs
Applications
3 GPUs: same performance as 21 Xeon X5670 CPUs (126 cores)
A personal computer with 3 GPUs: $10,000, easy to manage
Iso-surfaces of Q-criterion colored by Mach number for flow over sphere at Re=300, M=0.3
Multi-GPU Implementation
Chart: speedup relative to 1 GPU versus number of GPUs (2 to 32), for three variants: no overlap, communication overlap, and communication and GPU transfers overlap
Speedup relative to 1 GPU for a 6th order accurate simulation running on a mesh with 55,947 tetrahedral elements
Applications
Transitional flow over the SD7003 airfoil: Re = 60,000, Mach = 0.2, AOA = 4 degrees
4th order accurate solution, 400,000 RK iterations, 21.2 million DOFs
Applications
15 hours on 16 C2070s
157 hours (> 6 days) on 16 Xeon X5670 CPUs
Conclusions
Most scientific applications have a large amount of parallelism
Parallel applications map well to GPUs, which have hundreds of simple, power-efficient cores
Higher performance, higher performance/watt
Presented two successful uses of GPUs in CFD
Linear solver in ANSYS Fluent (hard to parallelize)
Research-oriented CFD code (SD3D)
Conclusions
Future of HPC is CPU + GPU/Accelerator
Need to develop new parallel numerical methods to replace inherently sequential algorithms (such as Gauss-Seidel, ILU preconditioners, etc.)
Peak flops vs. memory bandwidth gap is still growing
Flops are free
Need to develop numerical methods that have a larger flops/bytes ratio
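As a rough back-of-the-envelope illustration using the GK110 numbers quoted earlier: 1311 Gflops (fp64) / 250 GB/s ≈ 5.2 flops per byte, i.e. roughly 40 flops for every 8-byte double read from memory. A sparse matrix-vector product does only about 2 flops per stored entry while reading at least 12 bytes (the 8-byte value plus a 4-byte column index), so it runs far below peak; methods with more dense, cell-local work, such as high-order schemes, sit much closer to this machine balance.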
Questions?
Patrice Castonguay [email protected]