Accelerating CFD Simulations with GPUs
Patrice Castonguay
HPC Developer Technology Engineer
December 4th, 2012
Outline
Computational Fluid Dynamics
Why GPUs?
Accelerating ANSYS Fluent (commercial CFD software)
  Background
  Novel parallel algorithms to replace sequential algorithms
  Performance results
Accelerating SD3D (research code developed at Stanford)
  Background
  Performance results
Conclusions
Computational Fluid Dynamics
Computational Fluid Dynamics
Simulation of fluid flows
Large number of applications
Computational Fluid Dynamics
Navier-Stokes equations: coupled nonlinear partial differential equations which govern unsteady, compressible, viscous fluid flows
Computational Fluid Dynamics
Partition the domain into a large number of cells, and solve for the fluid properties (density, velocity, pressure) inside each cell
Why GPUs?
Why GPUs
Higher performance
Intel CPU (8-core Sandy Bridge):
  fp32 performance: 384 Gflops
  fp64 performance: 192 Gflops
  Memory bandwidth: 52 GB/s
NVIDIA GPU (Tesla GK110):
  fp32 performance: 3935 Gflops (10x)
  fp64 performance: 1311 Gflops (6.8x)
  Memory bandwidth: 250 GB/s (4.8x)
Why GPUs
Power efficiency
Traditional CPUs not economically feasible
Jaguar (3rd fastest supercomputer in Nov. 2011): 2.3 petaflops @ 7 megawatts
7 megawatts ≈ the power used by 7,000 homes
Why GPUs
Power efficiency
Traditional CPUs not economically feasible
Scaled to 100 petaflops:
Jaguar @ 2.3 petaflops: 7 megawatts (≈ 7,000 homes)
At 100 petaflops: ≈ 300 megawatts (≈ 300,000 homes, roughly Quebec City and its metropolitan area)
Why GPUs
CPU: optimized for serial tasks
GPU accelerator: optimized for many parallel tasks
Higher computational power per watt
Why GPUs
Many scientific applications already benefit from GPUs
Relative performance: K20X vs. dual-socket Sandy Bridge (E5-2687W @ 3.10 GHz), Dual-CPU = 1.00

Application   Single CPU + K20X   Single CPU + M2090   Dual CPU
Chroma        10.20               8.00                 1.00
SPECFEM3D     8.85                5.41                 1.00
AMBER         7.17                3.46                 1.00
WS-LSMS       4.40                1.80                 1.00
NAMD          2.73                1.71                 1.00
Why GPUs
Algorithms found in computer games are surprisingly similar to the ones found in scientific applications
Thriving gaming market funds GPU development for supercomputers
ANSYS Fluent
ANSYS Fluent
Largest share of the CFD software market
Finite volume method
Many different options (incompressible/compressible, inviscid/viscous, two-phase, explicit/implicit, …)
Most popular option: implicit incompressible Navier-Stokes solver
Incompressible Navier-Stokes
Coupled non-linear PDEs
Unknowns: u, v, w and p (density is constant)
Mass: ∂u/∂x + ∂v/∂y + ∂w/∂z = 0
Momentum x: ∂u/∂t + u ∂u/∂x + v ∂u/∂y + w ∂u/∂z = -(1/ρ) ∂p/∂x + ν ∇²u
Momentum y: ∂v/∂t + u ∂v/∂x + v ∂v/∂y + w ∂v/∂z = -(1/ρ) ∂p/∂y + ν ∇²v
Momentum z: ∂w/∂t + u ∂w/∂x + v ∂w/∂y + w ∂w/∂z = -(1/ρ) ∂p/∂z + ν ∇²w
Incompressible Navier-Stokes
Solution procedure:
Assemble the linear system of equations (~33% of runtime)
Solve the linear system of equations, Ax = b (~67% of runtime; accelerate this first)
If converged, stop; otherwise repeat
Incompressible Navier-Stokes
Large sparse system of equations with 4x4 block entries
In practice, millions of rows
Each Aij is a 4x4 matrix
Each xi is a 4x1 vector: xi = [pi, ui, vi, wi]
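To make the block structure concrete, here is a minimal sketch (not Fluent's actual data structures; the tiny 3-cell coupling pattern and values are made up) of a block-CSR (BSR) matrix with 4x4 blocks and its matrix-vector product, written with NumPy/SciPy:

    import numpy as np
    import scipy.sparse as sp

    n_cells = 3        # tiny example; in practice, millions of cells
    bs = 4             # unknowns per cell: [p, u, v, w]

    # BSR storage: block-row pointers, block-column indices, and one dense
    # 4x4 block per stored entry (here each cell is coupled to itself and
    # its neighbours in a 1D chain of cells).
    indptr  = np.array([0, 2, 5, 7])           # block-row i owns blocks indptr[i]:indptr[i+1]
    indices = np.array([0, 1, 0, 1, 2, 1, 2])  # block-column index of each stored block
    block_rows = [0, 0, 1, 1, 1, 2, 2]         # implied by indptr, listed here for clarity
    data = np.empty((len(indices), bs, bs))
    for k, (i, j) in enumerate(zip(block_rows, indices)):
        data[k] = 4.0 * np.eye(bs) if i == j else -np.ones((bs, bs))  # made-up values

    A = sp.bsr_matrix((data, indices, indptr), shape=(n_cells * bs, n_cells * bs))

    x = np.ones(n_cells * bs)   # x stacks [p_i, u_i, v_i, w_i] for every cell
    y = A @ x                   # block sparse matrix-vector product
    print(y.reshape(n_cells, bs))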
Multigrid
2D Poisson equation, Jacobi smoother
Figure: error of the Jacobi iteration on the 2D Poisson problem: initial error, error after 5, 10, and 15 iterations, and convergence of the error norm
Multigrid
Most simple iterative solvers are efficient at damping high-frequency errors but inefficient at damping low-frequency errors
Key idea: represent the error on a coarser grid, so that low-frequency errors become high-frequency errors
Multigrid
Smooth on the fine grid, restrict to the coarse grid, then smooth on the coarse grid
Algebraic Multigrid (AMG)
Example: Two-Level V-cycle
Pre-smooth
Compute residual
Restrict residual
Create coarse-level operator and solve on the coarse level
Prolongate correction
Post-smooth
Algebraic Multigrid (AMG)
Apply recursively
~10^6 unknowns on the finest level
Pre-smooth and coarsen repeatedly until only 1-2 unknowns remain
Solve on the coarsest level
Prolongate the correction and post-smooth on the way back up to the finest level
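To illustrate the recursive cycle structure, here is a minimal NumPy sketch. It only shows the shape of the cycle: Fluent's solver is an algebraic multigrid with aggregation coarsening, whereas this toy coarsens a 1D Poisson problem geometrically, and the smoother and cycle parameters are illustrative rather than Fluent's settings.

    import numpy as np

    def poisson1d(n):                            # 1D Poisson model problem
        return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

    def jacobi(A, x, b, iters, w=2.0 / 3.0):     # weighted-Jacobi smoother
        D = np.diag(A)
        for _ in range(iters):
            x = x + w * (b - A @ x) / D
        return x

    def prolongation(n_fine):                    # linear interpolation, coarse -> fine
        n_coarse = (n_fine - 1) // 2
        P = np.zeros((n_fine, n_coarse))
        for j in range(n_coarse):
            P[2 * j, j] += 0.5
            P[2 * j + 1, j] = 1.0
            P[2 * j + 2, j] += 0.5
        return P

    def vcycle(A, b, x, pre=1, post=3):
        if len(b) <= 3:                          # coarsest level: solve directly
            return np.linalg.solve(A, b)
        x = jacobi(A, x, b, pre)                 # pre-smooth
        r = b - A @ x                            # compute residual
        P = prolongation(len(b))                 # coarse -> fine interpolation
        R = 0.5 * P.T                            # restriction (full weighting)
        e = vcycle(R @ A @ P, R @ r, np.zeros(P.shape[1]), pre, post)  # coarse-level solve
        x = x + P @ e                            # prolongate correction
        return jacobi(A, x, b, post)             # post-smooth

    n = 2**7 - 1                                 # 127 unknowns, so coarsening nests cleanly
    A, rhs, x = poisson1d(n), np.ones(n), np.zeros(n)
    for cycle in range(8):
        x = vcycle(A, rhs, x)
        print(cycle, np.linalg.norm(rhs - A @ x))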
ANSYS Fluent
We have accelerated the AMG solver in Fluent with GPUs
Some parts of the algorithm were easily ported to the GPU
Sparse matrix-vector multiplications
Norm computations
Other operations were inherently sequential
Aggregation procedure to create the restriction operator
Gauss-Seidel, DILU smoothers
Matrix-matrix multiplication
Need to develop novel parallel algorithms!
Aggregation
How do we create the coarser levels?
Graph representation of a matrix: vertices correspond to the unknowns, and each pair of off-diagonal entries Ai,j, Aj,i defines an edge with weight wi,j = f(Ai,j, Aj,i), e.g. w3,5 = f(A3,5, A5,3)
Aggregation
Edge weight between vertices i and j: wi,j = f(Ai,j, Aj,i)
Aggregation
In aggregation-based AMG, group vertices that are strongly connected to each other
Aggregation
Aggregation produces a new, coarser matrix/graph
Aggregation
We want to merge vertices that are strongly connected to each other
Similar to the weighted graph matching problem, with edge weights wi,j = f(Ai,j, Aj,i)
Parallel aggregation
Each vertex extends a hand to its strongest neighbor
Each vertex checks if its strongest neighbor extended a hand back
Repeat with unmatched vertices
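A minimal NumPy sketch of this handshake procedure (a Python loop per round stands in for the one-thread-per-vertex GPU kernel; the dense symmetric weight matrix is a made-up example, and vertices left over after the last round simply become singleton aggregates instead of being merged into a neighbour):

    import numpy as np

    def handshake_aggregation(W, max_rounds=10):
        n = W.shape[0]
        aggregate = -np.ones(n, dtype=int)               # -1 = not yet aggregated
        count = 0                                        # number of aggregates created
        for _ in range(max_rounds):
            free = aggregate < 0
            if not free.any():
                break
            # keep only edges between still-unaggregated vertices
            Wf = np.where(np.outer(free, free), W, -np.inf)
            np.fill_diagonal(Wf, -np.inf)
            hand = Wf.argmax(axis=1)                     # each vertex extends a hand to
                                                         # its strongest free neighbour
            for i in np.flatnonzero(free):               # "for each vertex in parallel"
                j = hand[i]
                if Wf[i, j] == -np.inf:                  # no free neighbour left: singleton
                    aggregate[i] = count; count += 1
                elif hand[j] == i and i < j:             # mutual handshake: merge the pair
                    aggregate[i] = aggregate[j] = count; count += 1
        for i in np.flatnonzero(aggregate < 0):          # anything still unmatched
            aggregate[i] = count; count += 1
        return aggregate

    rng = np.random.default_rng(0)
    W = rng.random((8, 8)); W = 0.5 * (W + W.T)          # toy symmetric edge weights
    print(handshake_aggregation(W))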
Smoothers
Smoothing step: x ← x + M⁻¹ (b − A x)
M is called the preconditioning matrix
Here x is the solution vector (it includes all unknowns in the system)
Jacobi Smoother
For Jacobi, M is block-diagonal
Inherently parallel: each xi can be updated independently of the others
Maps very well to the GPU
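A minimal sketch of one block-Jacobi sweep, x ← x + M⁻¹(b − Ax) with M the block diagonal of A, on a made-up block-tridiagonal test system (the loop over block rows is exactly the part that becomes one GPU thread, or thread block, per row; this is not Fluent's implementation):

    import numpy as np
    import scipy.sparse as sp

    def block_jacobi_sweep(A, x, b, bs=4):
        """One sweep of x <- x + M^-1 (b - A x), with M = block diagonal of A."""
        r = b - A @ x                               # residual, computed once
        x = x.copy()
        for i in range(A.shape[0] // bs):           # every block row is independent
            s = slice(i * bs, (i + 1) * bs)
            D_ii = A[s, s].toarray()                # 4x4 diagonal block
            x[s] += np.linalg.solve(D_ii, r[s])
        return x

    n, bs = 3, 4                                    # 3 cells, 4 unknowns per cell
    A = sp.lil_matrix((n * bs, n * bs))
    for i in range(n):
        A[i*bs:(i+1)*bs, i*bs:(i+1)*bs] = 10.0 * np.eye(bs)           # diagonal blocks
        if i + 1 < n:
            A[i*bs:(i+1)*bs, (i+1)*bs:(i+2)*bs] = -np.ones((bs, bs))  # coupling blocks
            A[(i+1)*bs:(i+2)*bs, i*bs:(i+1)*bs] = -np.ones((bs, bs))
    A = A.tocsr()

    rhs, x = np.ones(n * bs), np.zeros(n * bs)
    for sweep in range(20):
        x = block_jacobi_sweep(A, x, rhs, bs)
    print("residual norm:", np.linalg.norm(rhs - A @ x))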
Parallel Smoothers
Smoothers available in Fluent (Gauss-Seidel & DILU) are sequential!
For the DILU smoother, M has the form M = (E + L) E⁻¹ (E + U), where L and U are the strict lower and upper triangular parts of A
E is a block diagonal matrix chosen such that diag(M) = diag(A)
Can recover ILU(0) for certain matrices
Only requires one extra diagonal of storage
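To make the recurrences concrete, here is a scalar (non-block) sketch of the DILU form above; Fluent uses the 4x4 block version, and the small example matrix is made up. The loops also make visible why, without reordering, both the setup of E and the application of M⁻¹ are sequential: E[i] depends on earlier E[j], and the two triangular solves sweep forward and backward.

    import numpy as np

    def dilu_setup(A):
        """Diagonal E such that diag((E + L) E^-1 (E + U)) = diag(A)."""
        n = A.shape[0]
        E = np.zeros(n)
        for i in range(n):                     # sequential: E[i] depends on E[j], j < i
            E[i] = A[i, i] - sum(A[i, j] * A[j, i] / E[j] for j in range(i))
        return E

    def dilu_apply(A, E, r):
        """z = M^-1 r via a forward and a backward triangular solve."""
        n = len(r)
        y = np.zeros(n)
        for i in range(n):                     # forward solve:  (E + L) y = r
            y[i] = (r[i] - A[i, :i] @ y[:i]) / E[i]
        z = np.zeros(n)
        for i in reversed(range(n)):           # backward solve: (E + U) z = E y
            z[i] = (E[i] * y[i] - A[i, i+1:] @ z[i+1:]) / E[i]
        return z

    A = np.array([[ 4.0, -1.0,  0.0],
                  [-1.0,  4.0, -1.0],
                  [ 0.0, -1.0,  4.0]])
    E = dilu_setup(A)
    print(dilu_apply(A, E, np.ones(3)))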
DILU Smoother
Construction of the block diagonal matrix E is sequential
Inversion of M is also sequential (two triangular matrix inversions)
Coloring
Use coloring to extract parallelism
Coloring: an assignment of colors to the vertices such that no two adjacent vertices have the same color
Coloring
With m unknowns and p colors, m/p unknowns can be processed in parallel
DILU Smoother
Construction of the block diagonal matrix is now parallel
Inversion of M is also parallel
Parallel Graph Coloring Min/Max
How do you color a graph/matrix in parallel?
Parallel graph coloring algorithm of Luby; a new variant developed at NVIDIA
Parallel Graph Coloring Min/Max
Assign a random number to each vertex
Parallel Graph Coloring Min/Max
Round 1: each vertex checks if it is a local maximum or minimum. If max, color = dark blue; if min, color = green
Parallel Graph Coloring Min/Max
Round 2: each remaining vertex checks if it is a local maximum or minimum. If max, color = pink; if min, color = red
Parallel Graph Coloring Min/Max
Round 3: each remaining vertex checks if it is a local maximum or minimum. If max, color = purple; if min, color = white
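A minimal NumPy sketch of these min/max rounds (the per-vertex Python loop stands in for one GPU thread per vertex; the small grid graph and the random seed are made up, and colors are simply integers, two new ones per round):

    import numpy as np

    def minmax_coloring(adj, seed=0):
        n = len(adj)
        rand = np.random.default_rng(seed).permutation(n)  # distinct random value per vertex
        color = -np.ones(n, dtype=int)                      # -1 = uncolored
        next_color = 0
        while (color < 0).any():
            new_color = color.copy()
            for v in range(n):                              # "for each vertex in parallel"
                if color[v] >= 0:
                    continue
                nbr = [u for u in adj[v] if color[u] < 0]   # uncolored neighbours
                if all(rand[v] > rand[u] for u in nbr):     # local maximum
                    new_color[v] = next_color
                elif all(rand[v] < rand[u] for u in nbr):   # local minimum
                    new_color[v] = next_color + 1
            color = new_color
            next_color += 2                                 # two colors assigned per round
        return color

    def grid_adjacency(nx, ny):                             # 2D grid graph, like the figures
        idx = lambda i, j: i * ny + j
        adj = [[] for _ in range(nx * ny)]
        for i in range(nx):
            for j in range(ny):
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    if 0 <= i + di < nx and 0 <= j + dj < ny:
                        adj[idx(i, j)].append(idx(i + di, j + dj))
        return adj

    print(minmax_coloring(grid_adjacency(4, 4)).reshape(4, 4))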
AMG Timings
CPU Fluent solver: AMG(F-cycle, agg8, DILU, 0pre, 3post)
GPU nvAMG solver: AMG(V-cycle, agg8, MC-DILU, 0pre, 3post)
Chart: AMG solve time (lower is better) on the Helix case, hex 208K and tet 1173K meshes, for K20X, C2090, and a six-core 3930K CPU
AMG Timings
CPU Fluent solver: AMG(F-cycle, agg8, DILU, 0pre, 3post)
GPU nvAMG solver: AMG(V-cycle, agg2, MC-DILU, 0pre, 3post)
Chart: AMG solve time (lower is better) on the Airfoil (hex 784K) and Aircraft (hex 1798K) cases, for K20X, C2090, and a six-core 3930K CPU
SD3D
Flux Reconstruction (FR) Method
Proposed by Huynh in 2007; similar to the Spectral Difference and Discontinuous Galerkin methods
High-order method: spatial order of accuracy is > 2
Figure: error vs. computational cost for high-order and low-order methods
Flux Reconstruction (FR) Method
Solution in each element is approximated by a multidimensional polynomial of order N
Order of accuracy: h^(N+1)
Multiple DOFs per element (figure: solution points for N = 1, 2, 3, 4)
Flux Reconstruction (FR) Method
Plunging airfoil: zero AOA, Re = 1850, frequency 2.46 rad/s
5th order accuracy in space, 4th order accurate RK time stepping
Flux Reconstruction (FR) Method
Computations are demanding: millions of DOFs, hundreds of thousands of time steps, and more work per DOF compared to low-order methods
Until recently, high-order simulations over complex 3D geometries were intractable unless you had access to a large cluster
GPUs to the rescue!
GPU Implementation
Ported entire code to the GPU
FR and other high-order methods for unstructured grids map well to GPUs:
Large amount of parallelism (millions of DOFs)
More work per DOF compared to low-order methods
Cell-local operations benefit from fast user-managed on-chip memory
Required some programming effort, but it was worthwhile
Single-GPU Implementation
Chart: speedups of roughly 5.5x to 11.3x for tets, hexes, and prisms at orders of accuracy 3 to 6
Speedup of the single-GPU algorithm (Tesla C2050) relative to a parallel computation on a six-core Xeon X5670 (Westmere) @ 2.9 GHz
Applications
Unsteady viscous flow over a sphere at Reynolds number 300, Mach 0.3
28,914 prisms and 258,075 tets, 4th order accuracy, 6.3 million DOFs
Ran on a desktop machine with 3 C2050 GPUs
Applications
3 GPUs: same performance as 21 Xeon X5670 CPUs (126 cores)
A personal computer with 3 GPUs: $10,000, easy to manage
Iso-surfaces of Q-criterion colored by Mach number for flow over sphere at Re=300, M=0.3
Multi-GPU Implementation
Chart: speedup relative to 1 GPU versus number of GPUs (2 to 32), for three variants: no overlap, communication overlap, and communication and GPU transfers overlap
Speedup relative to 1 GPU for a 6th order accurate simulation running on a mesh with 55,947 tetrahedral elements
Applications
Transitional flow over the SD7003 airfoil: Re = 60,000, Mach = 0.2, AOA = 4 degrees
4th order accurate solution, 400,000 RK iterations, 21.2 million DOFs
Applications
15 hours on 16 C2070s
157 hours (> 6 days) on 16 Xeon X5670 CPUs
Conclusions
Most scientific applications have a large amount of parallelism
Parallel applications map well to GPUs, which have hundreds of simple, power-efficient cores
Higher performance, higher performance/watt
Presented two successful uses of GPUs in CFD
Linear solver in ANSYS Fluent (hard to parallelize)
Research-oriented CFD code (SD3D)
Conclusions
Future of HPC is CPU + GPU/Accelerator
Need to develop new parallel numerical methods to replace inherently sequential algorithms (such as Gauss-Seidel, ILU preconditioners, etc.)
Peak flops vs. memory bandwidth gap is still growing
Flops are free
Need to develop numerical methods that have a larger flops/bytes ratio
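As a rough back-of-the-envelope illustration using the GK110 numbers quoted earlier: 1311 Gflops (fp64) / 250 GB/s ≈ 5.2 flops per byte, i.e. roughly 40 flops for every 8-byte double read from memory. A sparse matrix-vector product does only about 2 flops per stored entry while reading at least 12 bytes (the 8-byte value plus a 4-byte column index), so it runs far below peak; methods with more dense, cell-local work, such as high-order schemes, sit much closer to this machine balance.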
Questions?
Patrice Castonguay [email protected]