
NVIDIA® TESLA™

Case study: CFD


Dr. Graham Pullan
University of Cambridge
Outline

•  CFD (for turbomachinery)
•  A good fit for GPUs?
•  Implementation
•  Results
•  Implications
•  Summary

Turbomachinery

Thousands of blades
Arranged in rows
Each blade row has a bespoke blade profile designed with CFD

(Figure: blade row)
CFD of a jet engine fan

(Figure: blades coloured by pressure)
Introduction to CFD

(Figure: blade and surrounding flow)

Divide the volume into cells

Governing equations for each cell

Conserve:
•  Mass
•  Momentum
•  Energy
Example: mass conservation

•  Evaluate mass fluxes on each face:

$$F_{\mathrm{mass}} = \sum_{4} \rho V_n A$$

($V_n$ is the face-normal velocity and $A$ the face area; the sum runs over the four faces shown)


Example: mass conservation

•  Sum fluxes on faces to find the density change in the cell:

$$\Delta\rho_{\mathrm{cell}} = \frac{\Delta t}{\Delta\mathrm{vol}} \sum F_{\mathrm{mass}}$$


Example: mass conservation

•  Update density at the nodes:

$$\Delta\rho_{\mathrm{node}} = \frac{1}{8} \sum \Delta\rho_{\mathrm{cell}}$$

(only 4 of the 8 surrounding cells shown)
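Chaining the three steps gives the complete explicit update for one node (an editorial consolidation of the formulas above, using the same symbols):

$$\rho_{\mathrm{node}}^{\,t+\Delta t} = \rho_{\mathrm{node}}^{\,t} + \frac{1}{8}\sum_{8\ \mathrm{cells}} \frac{\Delta t}{\Delta\mathrm{vol}} \sum_{\mathrm{faces}} \rho V_n A$$

Every term on the right comes from the cells immediately surrounding the node, which is why each step is a local "stencil" operation.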


Similarity of steps

Each step uses data from surrounding nodes – a “stencil” operation


Similarity of equations

•  For each equation (5 in all):
   –  Set relevant flux (mass, momentum, energy)
   –  Sum fluxes
   –  Update nodes
   –  (plus smoothing – also a stencil; boundary conditions – not a stencil)


CPU run times (x86 machines)

“Steady” approximation – one blade per row:

  1 blade                     0.5 Mcells    1 CPU hour
  1 stage (2 blades)          1.0 Mcells    3 CPU hours
  1 component (5 stages)      5.0 Mcells    20 CPU hours

“Unsteady” approximation – all blades in row:

  1 component (1000 blades)   500 Mcells    0.1 M CPU hours
  Engine (4000 blades)        2 Gcells      1 M CPU hours

Peak FLOPs – the purpose of GPUs

(Figure: peak FLOPs comparison chart)

Graphics and scientific computing

GPUs are designed to apply the same shading function to many pixels simultaneously

Graphics and scientific computing

GPUs are designed to apply the same function to many data simultaneously

Are GPUs a good fit for CFD?

•  Our CFD code is:
   –  SIMD (same functions applied to all cells in domain)
   –  Single precision
   –  Large datasets (c. 10M nodes) fit on one 4GB Tesla card

•  (Bandwidth on the card is high, c. 102 GB/s; transfers to/from the card are much slower, c. 8 GB/s; and the steps in CFD are “memory bound”)
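A rough back-of-envelope estimate (added here for illustration; the figures follow from the bandwidths quoted above, assuming the 6-point smoothing stencil shown later with no reuse of neighbour values): each node update moves about 8 single-precision values, i.e. 32 bytes, so on-card bandwidth bounds the update rate at roughly

$$\frac{102\ \mathrm{GB/s}}{32\ \mathrm{B/node}} \approx 3 \times 10^{9}\ \mathrm{nodes/s},$$

while streaming the same traffic over the c. 8 GB/s host link would cap it near $2.5 \times 10^{8}$ nodes/s – hence the importance of keeping the whole dataset resident on the card.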


CUDA

•  Programming GPUs without the graphics abstraction

•  Scalar variables (not graphics-type 4-vectors!)
•  Extensions to C (not graphics APIs, e.g. OpenGL)

•  BUT – porting 15,000 lines of existing FORTRAN CFD code to CUDA still a lengthy task

Overall strategy

•  Divide up the domain:
   –  each sub-domain to a thread block (see the launch sketch below)
   –  update nodes in sub-domain with the most efficient stencil operation we can come up with!
   –  update sub-domain boundaries (MPI if needed)
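As a minimal sketch of what this mapping could look like (hypothetical code: the kernel name, grid dimensions and launch configuration here are illustrative assumptions, not Turbostream's actual source):

#include <cuda_runtime.h>

// Hypothetical kernel: each thread block owns one sub-domain of the
// structured grid; each thread updates the nodes in one (i, j) column.
__global__ void update_subdomain(const float *q_in, float *q_out,
                                 int ni, int nj, int nk)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // node index, i-direction
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // node index, j-direction

    // interior nodes only; sub-domain boundaries are exchanged separately
    if (i > 0 && i < ni - 1 && j > 0 && j < nj - 1) {
        for (int k = 1; k < nk - 1; k++) {
            // ... stencil update of q_out from q_in goes here ...
            q_out[(k * nj + j) * ni + i] = q_in[(k * nj + j) * ni + i];
        }
    }
}

int main(void)
{
    int ni = 128, nj = 64, nk = 64;               // illustrative grid size
    size_t bytes = (size_t)ni * nj * nk * sizeof(float);

    float *q_in, *q_out;                          // whole dataset lives on the card
    cudaMalloc((void **)&q_in, bytes);
    cudaMalloc((void **)&q_out, bytes);

    dim3 block(16, 4);                            // threads per sub-domain
    dim3 grid((ni + block.x - 1) / block.x,       // one block per sub-domain
              (nj + block.y - 1) / block.y);
    update_subdomain<<<grid, block>>>(q_in, q_out, ni, nj, nk);
    cudaDeviceSynchronize();

    cudaFree(q_in);
    cudaFree(q_out);
    return 0;
}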


SBLOCK – stencil framework

•  SBLOCK framework for stencil operations on structured grids:
   –  Source-to-source compiler
      •  Takes in high-level kernel definitions
      •  Produces optimised kernels in C or CUDA

•  Allows new stencils to be implemented quickly
•  Allows new stencil optimisation strategies to be deployed on all stencils (without typos!)

SBLOCK

Example SBLOCK definition

kind = "stencil"
bpin = ["a"]
bpout = ["b"]

calc = {"lvalue": "b",
        "rvalue": """sf1*a[0][0][0] +
                     sfd6*(a[-1][0][0] + a[1][0][0] +
                           a[0][-1][0] + a[0][1][0] +
                           a[0][0][-1] + a[0][0][1])"""}
C implementation

void smooth(float sf, float *a, float *b)
{
  for (int k = 0; k < nk; k++) {
    for (int j = 0; j < nj; j++) {
      for (int i = 0; i < ni; i++) {
        // compute indices i000, im100, etc. (not shown)
        b[i000] = sf1*a[i000] +
                  sfd6*(a[im100] + a[ip100] +
                        a[i0m10] + a[i0p10] +
                        a[i00m1] + a[i00p1]);
      }
    }
  }
}

CUDA strategy (after Datta et al.)

•  Each thread in a block reads sub-domain data from global device memory to SM shared memory (coalesced reads for maximum bandwidth)
•  Synch threads
•  Update nodes in sub-domain using shared memory and output result back to global memory

•  But shared memory and max threads per block are limited, so the best plan is to march through the sub-domain plane-by-plane…

CUDA strategy

(Animation: a sequence of slides stepping plane-by-plane through the sub-domain)

CUDA example

__global__ void smooth_kernel(float sf, float *a_d, float *b_d)
{
  __shared__ float a[16][5][3];  // shared memory array

  a[i][j][0] = a_d[i0m10];       // fetch first three planes
  a[i][j][1] = a_d[i000];
  a[i][j][2] = a_d[i0p10];
  __syncthreads();               // make sure planes are loaded

  // compute the stencil:
  b_d[i000] = sf1*a[i][j][1] +
              sfd6*(a[i-1][j][1] + a[i+1][j][1] +
                    a[i][j][0]   + a[i][j][2]   +
                    a[i][j-1][1] + a[i][j+1][1]);

  // load next "k" plane and repeat
}
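To show how the plane-by-plane march completes, here is one possible fleshed-out version (a hypothetical reconstruction, not the talk's actual source: it assumes a single 16x5-thread block covering a 16x5xnk sub-domain, that sf1 = 1 - sf and sfd6 = sf/6, and an IDX indexing helper in place of the precomputed i000-style indices):

__global__ void smooth_kernel_full(float sf, const float *a_d, float *b_d,
                                   int ni, int nj, int nk)
{
    const float sf1  = 1.0f - sf;   // assumed relation of sf1/sfd6 to sf
    const float sfd6 = sf / 6.0f;

    __shared__ float a[16][5][3];   // three k-planes held in shared memory

    int i = threadIdx.x;            // node index within the sub-domain
    int j = threadIdx.y;

    // linear index into the global arrays (helper, not from the slides)
    #define IDX(ii, jj, kk) (((kk) * nj + (jj)) * ni + (ii))

    // fetch the first three planes (k = 0, 1, 2)
    a[i][j][0] = a_d[IDX(i, j, 0)];
    a[i][j][1] = a_d[IDX(i, j, 1)];
    a[i][j][2] = a_d[IDX(i, j, 2)];

    for (int k = 1; k < nk - 1; k++) {
        __syncthreads();            // make sure the planes are loaded
        if (i > 0 && i < ni - 1 && j > 0 && j < nj - 1) {
            b_d[IDX(i, j, k)] = sf1 * a[i][j][1] +
                sfd6 * (a[i-1][j][1] + a[i+1][j][1] +
                        a[i][j][0]   + a[i][j][2]   +
                        a[i][j-1][1] + a[i][j+1][1]);
        }
        __syncthreads();            // finish reading before planes shift
        if (k + 2 < nk) {           // shift planes down, load the next one
            a[i][j][0] = a[i][j][1];
            a[i][j][1] = a[i][j][2];
            a[i][j][2] = a_d[IDX(i, j, k + 2)];
        }
    }
    #undef IDX
}

Keeping only three k-planes resident is what makes the limited shared-memory budget workable: each plane is loaded from global memory once and reused by every stencil that touches it.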
Turbostream

•  CUDA port of existing FORTRAN code (TBLOCK)
•  15,000 lines of FORTRAN
•  5,000 lines of kernel definitions -> 30,000 lines of CUDA
•  Runs on CPU or multiple GPUs
•  20x speedup on a Tesla C1060 as compared to all cores of a modern Intel Core 2 Quad

Turbostream

(Figures: turbine geometry; flow solution)


Turbostream

•  9 minutes on a Tesla S870 (4 GPUs)
•  12 hours on one 2.5GHz CPU core

FORTRAN & CUDA comparison

(Figure: side-by-side Fortran and CUDA source listings)

Impact of GPU accelerated CFD

•  Tesla Personal Supercomputer enables:
   –  Full turbine in 10 minutes (not 12 hours)
   –  One blade (for design) in 2 minutes

•  Tesla cluster enables:
   –  Interactive (seconds) design of blades for the first time
   –  Use of higher accuracy methods at an early stage in the design process

Summary

•  Many science applications fit the SIMD model used in GPUs
•  CUDA enables science developers to access NVIDIA GPUs without cumbersome graphics APIs
•  Existing codes have to be analysed and re-coded to best fit the many-core architecture
•  The speedups are such that this can be worth doing
•  For our application, the step-change in capability is revolutionary

More information

www.many-core.group.cam.ac.uk
