
NVIDIA® TESLA™

Case study: CFD


Dr. Graham Pullan
University of Cambridge
Outline

•  CFD (for turbomachinery)
•  A good fit for GPUs?
•  Implementation
•  Results
•  Implications
•  Summary

Turbomachinery

Thousands of blades
Arranged in rows
Each blade row has a bespoke blade profile designed with CFD

(Figure: blade row)
CFD of a jet engine fan

(Figure: blades coloured by pressure)
Introduction to CFD

(Figure: blade and surrounding flow)

Divide the volume into cells

Governing equations for each cell

Conserve:
•  Mass
•  Momentum
•  Energy
Example: mass conservation

•  Evaluate mass fluxes on each face:

$$F_{\mathrm{mass}} = \sum_{4} \rho V_n A$$

($V_n$ is the face-normal velocity and $A$ the face area; the sum runs over the four faces shown)


Example: mass conservation

•  Sum fluxes on faces to find the density change in the cell:

$$\Delta\rho_{\mathrm{cell}} = \frac{\Delta t}{\Delta\mathrm{vol}} \sum F_{\mathrm{mass}}$$


Example: mass conservation

•  Update density at the nodes:

$$\Delta\rho_{\mathrm{node}} = \frac{1}{8} \sum \Delta\rho_{\mathrm{cell}}$$

(only 4 of the 8 surrounding cells shown)
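Chaining the three steps gives the complete explicit update for one node (an editorial consolidation of the formulas above, using the same symbols):

$$\rho_{\mathrm{node}}^{\,t+\Delta t} = \rho_{\mathrm{node}}^{\,t} + \frac{1}{8}\sum_{8\ \mathrm{cells}} \frac{\Delta t}{\Delta\mathrm{vol}} \sum_{\mathrm{faces}} \rho V_n A$$

Every term on the right comes from the cells immediately surrounding the node, which is why each step is a local "stencil" operation.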


Similarity of steps

Each step uses data from surrounding nodes – a “stencil” operation


Similarity of equations

•  For each equation (5 in all):
   –  Set relevant flux (mass, momentum, energy)
   –  Sum fluxes
   –  Update nodes
   –  (plus smoothing – also a stencil; boundary conditions – not a stencil)


CPU run times (x86 machines)

“Steady” approximation – one blade per row:

  1 blade                     0.5 Mcells    1 CPU hour
  1 stage (2 blades)          1.0 Mcells    3 CPU hours
  1 component (5 stages)      5.0 Mcells    20 CPU hours

“Unsteady” approximation – all blades in row:

  1 component (1000 blades)   500 Mcells    0.1 M CPU hours
  Engine (4000 blades)        2 Gcells      1 M CPU hours

Peak FLOPs – the purpose of GPUs

(Figure: peak FLOPs comparison chart)

Graphics and scientific computing

GPUs are designed to apply the same shading function to many pixels simultaneously

Graphics and scientific computing

GPUs are designed to apply the same function to many data simultaneously

Are GPUs a good fit for CFD?

•  Our CFD code is:
   –  SIMD (same functions applied to all cells in domain)
   –  Single precision
   –  Large datasets (c. 10M nodes) fit on one 4GB Tesla card

•  (Bandwidth on the card is high, c. 102 GB/s; transfers to/from the card are much slower, c. 8 GB/s; and the steps in CFD are “memory bound”)
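A rough back-of-envelope estimate (added here for illustration; the figures follow from the bandwidths quoted above, assuming the 6-point smoothing stencil shown later with no reuse of neighbour values): each node update moves about 8 single-precision values, i.e. 32 bytes, so on-card bandwidth bounds the update rate at roughly

$$\frac{102\ \mathrm{GB/s}}{32\ \mathrm{B/node}} \approx 3 \times 10^{9}\ \mathrm{nodes/s},$$

while streaming the same traffic over the c. 8 GB/s host link would cap it near $2.5 \times 10^{8}$ nodes/s – hence the importance of keeping the whole dataset resident on the card.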


CUDA

•  Programming GPUs without the graphics abstraction

•  Scalar variables (not graphics-type 4-vectors!)
•  Extensions to C (not graphics APIs, e.g. OpenGL)

•  BUT – porting 15,000 lines of existing FORTRAN CFD code to CUDA still a lengthy task

Overall strategy

•  Divide up the domain:
   –  each sub-domain to a thread block (see the launch sketch below)
   –  update nodes in sub-domain with the most efficient stencil operation we can come up with!
   –  update sub-domain boundaries (MPI if needed)
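As a minimal sketch of what this mapping could look like (hypothetical code: the kernel name, grid dimensions and launch configuration here are illustrative assumptions, not Turbostream's actual source):

#include <cuda_runtime.h>

// Hypothetical kernel: each thread block owns one sub-domain of the
// structured grid; each thread updates the nodes in one (i, j) column.
__global__ void update_subdomain(const float *q_in, float *q_out,
                                 int ni, int nj, int nk)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // node index, i-direction
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // node index, j-direction

    // interior nodes only; sub-domain boundaries are exchanged separately
    if (i > 0 && i < ni - 1 && j > 0 && j < nj - 1) {
        for (int k = 1; k < nk - 1; k++) {
            // ... stencil update of q_out from q_in goes here ...
            q_out[(k * nj + j) * ni + i] = q_in[(k * nj + j) * ni + i];
        }
    }
}

int main(void)
{
    int ni = 128, nj = 64, nk = 64;               // illustrative grid size
    size_t bytes = (size_t)ni * nj * nk * sizeof(float);

    float *q_in, *q_out;                          // whole dataset lives on the card
    cudaMalloc((void **)&q_in, bytes);
    cudaMalloc((void **)&q_out, bytes);

    dim3 block(16, 4);                            // threads per sub-domain
    dim3 grid((ni + block.x - 1) / block.x,       // one block per sub-domain
              (nj + block.y - 1) / block.y);
    update_subdomain<<<grid, block>>>(q_in, q_out, ni, nj, nk);
    cudaDeviceSynchronize();

    cudaFree(q_in);
    cudaFree(q_out);
    return 0;
}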


SBLOCK – stencil framework

•  SBLOCK framework for stencil operations on structured grids:
   –  Source-to-source compiler
      •  Takes in high-level kernel definitions
      •  Produces optimised kernels in C or CUDA

•  Allows new stencils to be implemented quickly
•  Allows new stencil optimisation strategies to be deployed on all stencils (without typos!)

SBLOCK

Example SBLOCK definition

kind = "stencil"
bpin = ["a"]
bpout = ["b"]

calc = {"lvalue": "b",
        "rvalue": """sf1*a[0][0][0] +
                     sfd6*(a[-1][0][0] + a[1][0][0] +
                           a[0][-1][0] + a[0][1][0] +
                           a[0][0][-1] + a[0][0][1])"""}
C implementation

void smooth(float sf, float *a, float *b)
{
  for (int k = 0; k < nk; k++) {
    for (int j = 0; j < nj; j++) {
      for (int i = 0; i < ni; i++) {
        // compute indices i000, im100, etc. (not shown)
        b[i000] = sf1*a[i000] +
                  sfd6*(a[im100] + a[ip100] +
                        a[i0m10] + a[i0p10] +
                        a[i00m1] + a[i00p1]);
      }
    }
  }
}

CUDA strategy (after Datta et al.)

•  Each thread in a block reads sub-domain data from global device memory to SM shared memory (coalesced reads for maximum bandwidth)
•  Synch threads
•  Update nodes in sub-domain using shared memory and output result back to global memory

•  But shared memory and max threads per block are limited, so the best plan is to march through the sub-domain plane-by-plane…

CUDA strategy

(Animation: a sequence of slides stepping plane-by-plane through the sub-domain)

CUDA example

__global__ void smooth_kernel(float sf, float *a_d, float *b_d)
{
  __shared__ float a[16][5][3];  // shared memory array

  a[i][j][0] = a_d[i0m10];       // fetch first three planes
  a[i][j][1] = a_d[i000];
  a[i][j][2] = a_d[i0p10];
  __syncthreads();               // make sure planes are loaded

  // compute the stencil:
  b_d[i000] = sf1*a[i][j][1] +
              sfd6*(a[i-1][j][1] + a[i+1][j][1] +
                    a[i][j][0]   + a[i][j][2]   +
                    a[i][j-1][1] + a[i][j+1][1]);

  // load next "k" plane and repeat
}
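To show how the plane-by-plane march completes, here is one possible fleshed-out version (a hypothetical reconstruction, not the talk's actual source: it assumes a single 16x5-thread block covering a 16x5xnk sub-domain, that sf1 = 1 - sf and sfd6 = sf/6, and an IDX indexing helper in place of the precomputed i000-style indices):

__global__ void smooth_kernel_full(float sf, const float *a_d, float *b_d,
                                   int ni, int nj, int nk)
{
    const float sf1  = 1.0f - sf;   // assumed relation of sf1/sfd6 to sf
    const float sfd6 = sf / 6.0f;

    __shared__ float a[16][5][3];   // three k-planes held in shared memory

    int i = threadIdx.x;            // node index within the sub-domain
    int j = threadIdx.y;

    // linear index into the global arrays (helper, not from the slides)
    #define IDX(ii, jj, kk) (((kk) * nj + (jj)) * ni + (ii))

    // fetch the first three planes (k = 0, 1, 2)
    a[i][j][0] = a_d[IDX(i, j, 0)];
    a[i][j][1] = a_d[IDX(i, j, 1)];
    a[i][j][2] = a_d[IDX(i, j, 2)];

    for (int k = 1; k < nk - 1; k++) {
        __syncthreads();            // make sure the planes are loaded
        if (i > 0 && i < ni - 1 && j > 0 && j < nj - 1) {
            b_d[IDX(i, j, k)] = sf1 * a[i][j][1] +
                sfd6 * (a[i-1][j][1] + a[i+1][j][1] +
                        a[i][j][0]   + a[i][j][2]   +
                        a[i][j-1][1] + a[i][j+1][1]);
        }
        __syncthreads();            // finish reading before planes shift
        if (k + 2 < nk) {           // shift planes down, load the next one
            a[i][j][0] = a[i][j][1];
            a[i][j][1] = a[i][j][2];
            a[i][j][2] = a_d[IDX(i, j, k + 2)];
        }
    }
    #undef IDX
}

Keeping only three k-planes resident is what makes the limited shared-memory budget workable: each plane is loaded from global memory once and reused by every stencil that touches it.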
Turbostream

•  CUDA port of existing FORTRAN code (TBLOCK)
•  15,000 lines of FORTRAN
•  5,000 lines of kernel definitions -> 30,000 lines of CUDA
•  Runs on CPU or multiple GPUs
•  20x speedup on a Tesla C1060 as compared to all cores of a modern Intel Core 2 Quad

Turbostream

(Figures: turbine geometry; flow solution)


Turbostream

•  9 minutes on a Tesla S870 (4 GPUs)
•  12 hours on one 2.5GHz CPU core

FORTRAN & CUDA comparison

(Figure: side-by-side Fortran and CUDA source listings)

Impact of GPU accelerated CFD

•  Tesla Personal Supercomputer enables:
   –  Full turbine in 10 minutes (not 12 hours)
   –  One blade (for design) in 2 minutes

•  Tesla cluster enables:
   –  Interactive (seconds) design of blades for the first time
   –  Use of higher accuracy methods at an early stage in the design process

Summary

•  Many science applications fit the SIMD model used in GPUs
•  CUDA enables science developers to access NVIDIA GPUs without cumbersome graphics APIs
•  Existing codes have to be analysed and re-coded to best fit the many-core architecture
•  The speedups are such that this can be worth doing
•  For our application, the step-change in capability is revolutionary

More information

www.many-core.group.cam.ac.uk
