GPUIntro
Titan supercomputer: 20 petaflops
http://nvidianews.nvidia.com/Releases/NVIDIA-Powers-Titan-World-s-Fastest-Supercomputer-For-Open-Scientific-Research-8a0.aspx#source=pr
No 1: Tianhe-2 (MilkyWay-2) – 3,120,000 cores (Intel Xeon E5-2692 with Intel Xeon Phi coprocessors)
Tesla K20 GPU Computing modules
Kepler architecture. Introduced November 2012
K20
2496 FP32 cores, 832 FP64 cores
Power: 225 watts
GFLOPs:
Single precision: 3519 - 4106
Double precision: 1173
UNC-C CUDA Teaching Center
2010: NVIDIA Corp. selected the UNC-Charlotte Department of Computer Science to be a CUDA Teaching Center, kindly providing GPU equipment and TA support.
Donated C2050 used in coit-grid06
CPU-GPU architecture evolution
1970s - 1980s
Co-processors -- a very old idea that appeared in the 1970s and 1980s -- floating point co-processors attached to microprocessors that did not then have floating point capability. Coprocessors simply executed floating point instructions that were fetched from memory.
Graphics cards -- Around the same time, hardware support for displays appeared, especially with increasing use of graphics and PC games.
This led to graphics processing units (GPUs) attached to the CPU to create the video display.
[Figure: early designs, showing CPU, memory, co-processor, graphics card, and display]
Example -- GeForce 6 Series Architecture (2004-5)
From GPU Gems 2, Copyright 2005 by NVIDIA Corporation
General-Purpose GPU designs
Graphics Processing Units (GPUs)
Brief History
[Figure: timeline, 1993 - 2010]
• 1993: NVIDIA established by Jen-Hsun Huang, Chris Malachowsky, Curtis Priem
• GPUs with programmable shading: Nvidia GeForce 3 (2001)
• NVIDIA's first GPU with general purpose processors: GT 80, GeForce 8800, GeForce 8 series
• GeForce 200 series: GTX260/275/280/285/295
• Fermi: GeForce 400 series, GTX460/465/470/475/480/485; Quadro
• Tesla: C870, S870, C1060, S1070, C2050, …
• GPU Computing: general-purpose computing on graphics processing units (GPGPUs)
NVIDIA GT 80 chip/GeForce 8800 card (2006)
First GPU for high performance computing as well as graphics
• Unified processors that could perform vertex, geometry, pixel, and general computing operations
• Single-instruction multiple-thread (SIMT) programming model
Evolving GPU design:
NVIDIA Fermi architecture
(announced Sept 2009)
• Data parallel single instruction multiple data operation (“Stream” processing)
• Up to 512 cores (“stream processing engines”, SPEs), organized as 16 streaming multiprocessors, each having 32 SPEs
• 3 GB or 6 GB GDDR5 memory
• Many innovations including L1/L2 caches, unified device memory addressing, ECC memory, …
• First implementation: Tesla 20 series (single chip C2050/2070, 4 chip S2050/2070)
3 billion transistor chip?
Number of cores limited by power considerations; C2050 has 448 cores.
* Whitepaper: NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, NVIDIA, 2008
GPU performance gains over CPUs
[Figure: GFLOPs (0 to 1400) plotted against date (9/22/2002 to 7/27/2009). NVIDIA GPU series: NV30, NV40, G70, G80, GT200, T12. Intel CPU series: 3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere.]
Source © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
NVIDIA Kepler architecture and GPUs (2012+)
GK104 chip with 1536 cores
http://www.tomshardware.com/news/Nvidia-Kepler-GK104-GeForce-GTX-670-680,14691.html
NVIDIA GPUs
• 14 streaming multiprocessors (SMs)
• Each streaming multiprocessor has 32 streaming processors (SPs)
• So 448 streaming processors (cores)
NVIDIA K20
(as on coit-grid08)
• 13 streaming multiprocessors (SMXs, extreme)
• Each streaming multiprocessor has 192 streaming processors (SPs)
• So 2496 streaming processors (cores)
CUDA
(Compute Unified Device Architecture)
• Architecture and programming model introduced by NVIDIA in 2007
• Enables GPUs to execute programs written in C.
• Within C programs, call SIMT “kernel” routines that are executed on the GPU.
• A CUDA syntax extension to C identifies a routine as a kernel.
• Very easy to learn, although getting the highest possible execution performance requires understanding the hardware architecture.
• Version 3 introduced 2009
• Version 4 introduced 2011 – significant additions including “unified virtual addressing” – a single address space across GPU and host.
• Most recent version 5.5 introduced July 2013
• We will go into CUDA in detail shortly and have programming
Questions