Emergence of GPU systems for general purpose high performance computing

ITCS 4145/5145 © Barry Wilkinson GPUIntro.ppt Nov 4, 2013


Titan Supercomputer
Oak Ridge National Laboratory, Oak Ridge, Tenn.
World’s fastest computer on the TOP500 list, Nov 2012 – May 2013
Down to No. 2 in June 2013*

18,688 NVIDIA Tesla K20X GPUs (each having 2688 cores)

20 petaflops

Upgraded from the Jaguar supercomputer.

10 times faster and 5 times more energy efficient than the 2.3-petaflops Jaguar system while occupying the same floor space.

http://nvidianews.nvidia.com/Releases/NVIDIA-Powers-Titan-World-s-Fastest-Supercomputer-For-Open-Scientific-Research-8a0.aspx#source=pr
* No. 1: Tianhe-2 (MilkyWay-2) – 3,120,000 cores (Intel Xeon E5-2692 with Intel Xeon Phi coprocessors)
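
As a rough check on the Titan figures quoted above, the arithmetic can be worked through in a few lines of Python. This is only a sketch: the function names are illustrative and not from the slides, and the inputs are simply the numbers stated on the slide.

```python
# Figures quoted on the Titan slide.
GPUS = 18_688            # Tesla K20X GPUs in Titan
CORES_PER_GPU = 2_688    # cores per K20X
TITAN_PFLOPS = 20.0      # quoted Titan performance
JAGUAR_PFLOPS = 2.3      # quoted Jaguar performance


def total_gpu_cores(gpus, cores_per_gpu):
    """Total number of GPU cores across the whole machine."""
    return gpus * cores_per_gpu


def speedup(new_pflops, old_pflops):
    """Ratio of the two quoted performance figures."""
    return new_pflops / old_pflops


if __name__ == "__main__":
    print(f"GPU cores: {total_gpu_cores(GPUS, CORES_PER_GPU):,}")   # 50,233,344
    print(f"Speedup:   {speedup(TITAN_PFLOPS, JAGUAR_PFLOPS):.1f}x")  # 8.7x
```

The ratio of the two quoted petaflops figures comes out at about 8.7, so the "10 times faster" claim is presumably rounded or based on a different measurement than these two numbers.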

Tesla K20 GPU Computing modules
Kepler architecture. Introduced November 2012

K20 – 2496 thread processors (cores)
K20X – 2688 thread processors (cores)

2013: K40 – 2880 thread processors

K20
2496 FP32 cores, 832 FP64 cores
Wattage: 225 watts

GFLOPs:
Single Precision: 3519 - 4106
Double Precision: 1173
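
The K20 GFLOPs figures above can be related to the core counts with a simple peak-rate estimate. The sketch below assumes two things that are not stated on the slide: a core clock of 0.705 GHz, and two floating point operations per core per cycle (one fused multiply-add). The function name is illustrative.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle=2):
    """Estimate peak GFLOPs as cores x clock (GHz) x operations per cycle.

    flops_per_cycle=2 assumes each core retires one fused multiply-add
    (counted as two floating point operations) per clock cycle.
    """
    return cores * clock_ghz * flops_per_cycle


# ASSUMPTION: 0.705 GHz core clock, chosen because it reproduces the slide's figures.
K20_CLOCK_GHZ = 0.705

if __name__ == "__main__":
    print(f"Single precision: {peak_gflops(2496, K20_CLOCK_GHZ):.0f} GFLOPs")  # about 3519
    print(f"Double precision: {peak_gflops(832, K20_CLOCK_GHZ):.0f} GFLOPs")   # about 1173
```

Under these assumptions the estimate matches the lower single precision figure and the double precision figure on the slide; the upper single precision figure (4106) is not derived here.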

UNC-C CUDA Teaching Center
2010: NVIDIA Corp. selected the UNC-Charlotte Department of Computer Science to be a CUDA Teaching Center, kindly providing GPU equipment and TA support. Donated C2050 used in coit-grid06.

2011: NVIDIA kindly provided 50 GTX 480 GPU cards valued at $15,000 as continuing support for the CUDA Teaching Center.
2012: NVIDIA donated a K20, used in cci-grid08.
2013: NVIDIA Teaching Center status renewed.
Our course materials are posted on NVIDIA’s corporate site next to those from Stanford and other top schools.
http://developer.nvidia.com/cuda-training

CPU-GPU architecture evolution
1970s - 1980s
Co-processors -- a very old idea that appeared in the 1970s and 1980s -- floating point co-processors attached to microprocessors that did not then have floating point capability. Coprocessors simply executed floating point instructions that were fetched from memory.
[Diagram: early designs, with a co-processor attached to the CPU and memory]

Graphics cards -- Around the same time, hardware support for displays, especially with increasing use of graphics and PC games. Led to graphics processing units (GPUs) attached to the CPU to create the video display.
[Diagram: CPU, graphics card, display]

2013: Xeon Phi processor with 60 cores is described as a co-processor although connected through a PCIe interface in a similar fashion to recent GPU cards.

Pipelined programmable GPU
Dedicated pipeline (late 1990s-early 2000s)

By late 1990’s, graphics chips needed to support 3-D graphics, especially for games and graphics. APIs such as DirectX and OpenGL.

Generally had a pipeline structure with individual stages performing specialized operations, finally leading to loading frame buffer for display.

Individual stages may have access to graphics memory for storing intermediate computed data.

[Diagram: Input stage, Vertex shader stage, Geometry shader stage, Rasterizer stage, Pixel shading stage, Frame buffer; stages connected to Graphics memory]
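
The idea of a dedicated pipeline, where each stage performs one specialized operation and hands its result to the next, can be sketched in plain Python. This is a toy model under heavy assumptions, not how any real graphics chip or API works: every "primitive" is a single point, the stage functions and their parameters are hypothetical, and the frame buffer is a dictionary.

```python
import math


def vertex_shader(vertices, dx, dy):
    """Transform each vertex; here the transform is just a translation."""
    return [(x + dx, y + dy) for x, y in vertices]


def geometry_shader(vertices):
    """A real stage could add or drop primitives; this toy one passes them through."""
    return vertices


def rasterizer(vertices):
    """Map each vertex to the integer pixel coordinate that contains it."""
    return [(math.floor(x), math.floor(y)) for x, y in vertices]


def pixel_shader(pixels, color):
    """Assign a color to every pixel produced by the rasterizer."""
    return {pixel: color for pixel in pixels}


def render(vertices, dx, dy, color):
    """Run the stages in order and load the result into a frame buffer."""
    frame_buffer = {}
    staged = vertex_shader(vertices, dx, dy)
    staged = geometry_shader(staged)
    pixels = rasterizer(staged)
    frame_buffer.update(pixel_shader(pixels, color))
    return frame_buffer


if __name__ == "__main__":
    print(render([(0.5, 1.5), (2.0, 3.25)], 1.0, 1.0, "red"))
```

Each stage only sees the output of the stage before it, which is what makes the structure a pipeline; in this sketch the intermediate lists stand in for data that hardware stages might keep in graphics memory.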
Example -- GeForce 6 Series Architecture (2004-5)
From GPU Gems 2, Copyright 2005 by NVIDIA Corporation

General-Purpose GPU designs

High performance pipelines call for high-speed (IEEE) floating point operations.

People tried to use GPU cards to speed up scientific computations.

Known as GPGPU (General-purpose computing on graphics processing units) -- difficult to do with specialized graphics pipelines, but possible.

By mid 2000’s, recognized that individual stages of the graphics pipeline could be implemented by a more general purpose processor core (although with a data-parallel paradigm).

Graphics Processing Units (GPUs)
Brief History
[Timeline figure, 1970 to 2010. Entries:]
Atari 8-bit computer text/graphics chip
IBM PC Professional Graphics Controller card
Playstation
S3 graphics cards -- single chip 2D accelerator
Hardware-accelerated 3D graphics
OpenGL graphics API
DirectX graphics API
Nvidia GeForce 3 (2001) with programmable shading
GPUs with programmable shading
GPU Computing: General-purpose computing on graphics processing units (GPGPUs)

Source of information: http://en.wikipedia.org/wiki/Graphics_Processing_Unit

NVIDIA products
NVIDIA Corp. a leader in GPUs for high performance computing:
[Timeline figure, axis years 1993 to 2010. Entries:]
Established by Jen-Hsun Huang, Chris Malachowsky, Curtis Priem
NV1
GeForce 1
GeForce 2 series
GeForce FX series
Quadro
GeForce 8 series
GT 80 / GeForce 8800: NVIDIA's first GPU with general purpose processors
GeForce 200 series: GTX260/275/280/285/295
GeForce 400 series: GTX460/465/470/475/480/485
Tesla: C870, S870, C1060, S1070, C2050, …
Fermi: C2050 GPU has 448 thread processors
Kepler (2011): Tesla Kepler K20 has 2496 thread processors
Maxwell (2013)

NVIDIA GT 80 chip/GeForce 8800 card (2006)
First GPU for high performance computing as well as graphics

Unified processors that could perform vertex, geometry, pixel, and general computing operations

Could now write programs in C rather than graphics APIs.

Single-instruction multiple-thread (SIMT) programming model

Evolving GPU design:
NVIDIA Fermi architecture
(announced Sept 2009)
• Data parallel single instruction multiple data operation (“Stream” processing)
• Up to 512 cores (“stream processing engines”, SPEs, organized as 16 streaming multiprocessors, each having 32 SPEs)
• 3 GB or 6 GB GDDR5 memory
• Many innovations including L1/L2 caches, unified device memory addressing, ECC memory, …
• First implementation: Tesla 20 series (single chip C2050/2070, 4 chip S2050/2070)
3 billion transistor chip?
Number of cores limited by power considerations; C2050 has 448 cores.
* Whitepaper: NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, NVIDIA, 2008

GPU performance gains over CPUs
[Chart: GFLOPs (0 to 1400) against date (9/22/2002 to 7/27/2009). NVIDIA GPU series: NV30, NV40, G70, G80, GT200, T12. Intel CPU series: 3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere.]
Source © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

NVIDIA Kepler architecture and GPUs (2012+)
GK104 chip with 1536 cores

A lot of major new features over earlier Fermi architecture
K10/GK104 1536 cores
K20/GK110 2496 cores
K40/GK180 2880 cores

CUDA Compute Capability 3.0, see next

http://www.tomshardware.com/news/Nvidia-Kepler-GK104-GeForce-GTX-670-680,14691.html

NVIDIA GPUs

Stream processing -- Term used to denote processing of a stream of instructions operating in a data parallel fashion.

Stream Processors (SPs) – the execution cores that will execute the stream. Each stream processor has compute resources such as register file, instruction scheduler, …

Streaming multiprocessors (SMs) -- groups of streaming processors that share control logic and cache.

NVIDIA C2050
(as on coit-grid06.uncc.edu and cci-grid07)

14 streaming multiprocessors (SMs)

Each streaming multiprocessor has 32 streaming processors (SPs)

So 448 streaming processors (cores)

Apparently Fermi was originally intended to have 512 cores (16 SMs) but the design got too hot.

NVIDIA K20
(as on coit-grid08)

13 streaming multiprocessors (SMXs, extreme)

Each streaming multiprocessor has 192 streaming processors (SPs)

So 2496 streaming processors (cores)

Actually 15 SMXs (2880 cores) are fabricated on the chip, with only 13 enabled, to improve yield.
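
The core counts on the C2050 and K20 slides follow from multiplying the number of streaming multiprocessors by the streaming processors in each. The short sketch below works through that arithmetic using only the figures from the slides; the function names are illustrative.

```python
def core_count(multiprocessors, cores_per_multiprocessor):
    """Total streaming processors (cores) on a device."""
    return multiprocessors * cores_per_multiprocessor


def enabled_fraction(enabled_multiprocessors, fabricated_multiprocessors):
    """Fraction of the fabricated multiprocessors that are actually enabled."""
    return enabled_multiprocessors / fabricated_multiprocessors


if __name__ == "__main__":
    # C2050 (Fermi): 14 SMs enabled out of a 16-SM design, 32 SPs each.
    print("C2050 cores:", core_count(14, 32))          # 448
    print("Full Fermi design:", core_count(16, 32))    # 512
    # K20 (Kepler): 13 SMXs enabled out of 15 fabricated, 192 SPs each.
    print("K20 cores:", core_count(13, 192))           # 2496
    print("Full chip:", core_count(15, 192))           # 2880
    print(f"K20 enabled fraction: {enabled_fraction(13, 15):.2f}")  # 0.87
```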

CUDA
(Compute Unified Device Architecture)
• Architecture and programming model introduced by NVIDIA in 2007
• Enables GPUs to execute programs written in C.
• Within C programs, call SIMT “kernel” routines that are executed on the GPU.
• CUDA syntax extension to C identifies a routine as a kernel.
• Very easy to learn, although getting the highest possible execution performance requires understanding of the hardware architecture.
• Version 3 introduced 2009
• Version 4 introduced 2011 – significant additions including “unified virtual addressing” – a single address space across GPU and host.
• Most recent version 5.5 introduced July 2013
• We will go into CUDA in detail shortly and have programming
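
The SIMT "kernel" idea from the list above, where a host program calls a routine that every thread executes on its own piece of data, can be emulated in plain Python. This is only a sketch and is not CUDA syntax: the `launch` helper and the `thread_id` parameter are hypothetical stand-ins, and the sequential loop takes the place of threads that a GPU would run in parallel.

```python
def launch(kernel, num_threads, *args):
    """Emulate a kernel launch by running the same function once per thread index.

    On a GPU these invocations would execute in parallel; here they run
    one after another purely for illustration.
    """
    for thread_id in range(num_threads):
        kernel(thread_id, *args)


def vector_add_kernel(thread_id, a, b, c):
    """Kernel body: each thread handles the single element at its own index."""
    if thread_id < len(c):  # guard in case more threads than elements are launched
        c[thread_id] = a[thread_id] + b[thread_id]


if __name__ == "__main__":
    a = [1, 2, 3, 4]
    b = [10, 20, 30, 40]
    c = [0] * len(a)
    launch(vector_add_kernel, 8, a, b, c)  # 8 threads, only 4 have work to do
    print(c)  # [11, 22, 33, 44]
```

The kernel contains no loop over the data: every thread runs the same instructions and differs only in its index, which is the data-parallel pattern the SIMT model relies on.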
Questions
