Case Study: CFD Dr. Graham Pullan University of Cambridge: Nvidia Tesla
Thousands of blades
Arranged in rows
Blade row
CFD of a jet engine fan
Blades coloured by pressure
Introduction to CFD
Blade
Flow
Conserve:
• Mass
• Momentum
• Energy
Example: mass conservation
• Evaluate mass fluxes on each face
$F_{\mathrm{mass}} = \frac{A}{4} \sum \rho V_n$
• Sum fluxes on faces to find density change in cell
$\Delta\rho_{\mathrm{cell}} = \frac{\Delta t}{\Delta\mathrm{vol}} \sum F_{\mathrm{mass}}$
• Update density
$\Delta\rho_{\mathrm{node}} = \frac{1}{8} \sum \Delta\rho_{\mathrm{cell}}$
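The three steps of the mass conservation example can be sketched in a few lines of Python. This is an illustrative sketch only, not the author's solver: the function names, the unit-cube cell, the flow values, and the sign convention (flux into the cell counted as positive) are all assumptions made for the example.

```python
def face_mass_flux(rho_vn_corners, area):
    """Step 1: face area times the average of rho*V_n over the 4 corner nodes."""
    assert len(rho_vn_corners) == 4
    return area * sum(rho_vn_corners) / 4.0

def cell_density_change(face_fluxes, dt, vol):
    """Step 2: density change in a cell from the signed sum of its face fluxes.

    Assumed sign convention: flux into the cell is positive.
    """
    return dt / vol * sum(face_fluxes)

def node_density_change(cell_changes):
    """Step 3: node update as the average over the 8 cells sharing the node."""
    assert len(cell_changes) == 8
    return sum(cell_changes) / 8.0

# Hypothetical unit-cube cell with flow in one direction only.
area, vol, dt = 1.0, 1.0, 0.01
inflow = face_mass_flux([1.0 * 10.0] * 4, area)    # rho = 1, V_n = 10 entering
outflow = -face_mass_flux([1.0 * 8.0] * 4, area)   # rho = 1, V_n = 8 leaving
side_faces = [0.0] * 4                             # no flow through the other 4 faces

d_rho_cell = cell_density_change([inflow, outflow] + side_faces, dt, vol)
d_rho_node = node_density_change([d_rho_cell] * 8)  # 8 identical neighbour cells
print(d_rho_cell, d_rho_node)
```

More mass enters than leaves, so the cell density rises; with eight identical neighbouring cells the node receives the same change.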
“Unsteady” approximation: all blades in row
• 1 component (1000 blades): 500 Mcells, 0.1 M CPU hours
• Engine (4000 blades): 2 Gcells, 1 M CPU hours
Peak FLOPs
The purpose of GPUs
Graphics and scientific computing
• GPUs are designed to apply the same shading function to many pixels simultaneously
• GPUs are designed to apply the same function to many data simultaneously
Are GPUs a good fit for CFD?
• Our CFD code is:
– SIMD (same functions applied to all cells in domain)
– Single precision
– Large datasets (c. 10M nodes) fit on one 4GB Tesla card
• (Bandwidth on the card is high, c. 102 GB/s; transfers to/from the card are much slower, c. 8 GB/s; and steps in CFD are “memory bound”)
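A rough calculation shows why it matters that the whole dataset fits on one card. The node count and the two bandwidth figures come from the slide; the number of variables per node, the one-read-one-write access pattern, and 1 GB = 1e9 bytes are assumptions made only for this estimate.

```python
NODES = 10_000_000        # c. 10M nodes (from the slide)
VARS_PER_NODE = 5         # hypothetical: number of stored variables per node
BYTES_PER_VALUE = 4       # single precision
ON_CARD_GBS = 102.0       # on-card memory bandwidth, GB/s (from the slide)
TO_CARD_GBS = 8.0         # host <-> card bandwidth, GB/s (from the slide)

def transfer_time_ms(n_bytes, gb_per_s):
    """Time in milliseconds to move n_bytes at the given bandwidth (1 GB = 1e9 bytes)."""
    return n_bytes / (gb_per_s * 1e9) * 1e3

dataset_bytes = NODES * VARS_PER_NODE * BYTES_PER_VALUE

# One memory-bound sweep on the card: read every value once, write it once.
sweep_ms = transfer_time_ms(2 * dataset_bytes, ON_CARD_GBS)

# One one-way copy of the same dataset between host and card.
copy_ms = transfer_time_ms(dataset_bytes, TO_CARD_GBS)

print(f"dataset: {dataset_bytes / 1e6:.0f} MB")
print(f"on-card sweep: {sweep_ms:.2f} ms, host<->card copy: {copy_ms:.2f} ms")
```

Under these assumptions a single copy across the host link costs several times more than a full sweep on the card, which is why keeping the data resident on the card is attractive.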
CUDA
• Programming GPUs without the graphics abstraction
• Scalar variables (not graphics-type 4-vectors!)
• Extensions to C (not graphics APIs, e.g. OpenGL)
• BUT: porting 15,000 lines of existing FORTRAN CFD code to CUDA is still a lengthy task
Overall strategy
• Divide up domain
– each sub-domain to a thread block
– update nodes in sub-domain with the most efficient stencil operation we can come up with!
– update sub-domain boundaries (MPI if needed)
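The strategy above can be illustrated with a toy 1D example. This is a sketch under stated assumptions, not the author's code: a serial Python loop stands in for the thread blocks, a 3-point averaging stencil stands in for the CFD update, and the function names are hypothetical. Each sub-domain is copied out with one extra "halo" point on each side, updated locally, and only its interior is written back.

```python
def smooth(u):
    """3-point averaging stencil on interior points; the two end values are held fixed."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out

def smooth_by_blocks(u, block_size):
    """Same update, one sub-domain at a time, each with one halo point per side."""
    n = len(u)
    out = list(u)
    for start in range(1, n - 1, block_size):   # each iteration plays one "thread block"
        stop = min(start + block_size, n - 1)
        local = u[start - 1:stop + 1]           # sub-domain plus its two halo points
        updated = smooth(local)                 # stencil applied to the local copy
        out[start:stop] = updated[1:-1]         # write back the interior, drop the halos
    return out

u = [float(i * i) for i in range(10)]
print(smooth_by_blocks(u, 4) == smooth(u))
```

Because every block reads its halo values from the old array, the block-wise result matches the whole-domain update regardless of block size. Across several GPUs the halo values would have to be exchanged between devices, which is where MPI could come in.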
SBLOCK – stencil framework
• SBLOCK framework for stencil operations on structured grids:
– Source-to-source compiler
• Takes in high-level kernel definitions
• Produces optimised kernels in C or CUDA
• Allows new stencils to be implemented quickly
• Allows new stencil optimisation strategies to be deployed on all stencils (without typos!)
SBLOCK
Example SBLOCK definition
kind = "stencil"
bpin = ["a"]
bpout = ["b"]
• 9 minutes on a Tesla S870 (4 GPUs)
• 12 hours on one 2.5GHz CPU core
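The two timings above imply the following speedup. The per-GPU figure assumes the work divides evenly across the four GPUs, which the slides do not state.

```python
cpu_minutes = 12 * 60   # 12 hours on one 2.5GHz CPU core
gpu_minutes = 9         # 9 minutes on a Tesla S870
n_gpus = 4              # the S870 holds 4 GPUs

speedup = cpu_minutes / gpu_minutes   # whole S870 versus one CPU core
per_gpu = speedup / n_gpus            # assumes ideal scaling over the 4 GPUs

print(f"S870 vs one core: {speedup:.0f}x, per GPU (ideal scaling): {per_gpu:.0f}x")
```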
FORTRAN & CUDA comparison
Fortran
CUDA
Impact of GPU-accelerated CFD
• Tesla Personal Supercomputer enables
– Full turbine in 10 minutes (not 12 hours)
– One blade (for design) in 2 minutes
• Tesla cluster enables
– Interactive (seconds) design of blades for the first time
– Use of higher-accuracy methods at an early stage in the design process
Summary
• Many science applications fit the SIMD model used in GPUs
• CUDA enables science developers to access NVIDIA GPUs without cumbersome graphics APIs
• Existing codes have to be analysed and re-coded to best fit the many-core architecture
• The speedups are such that this can be worth doing
• For our application, the step change in capability is revolutionary
More information
www.many-core.group.cam.ac.uk