
CS6963

Parallel Programming for Graphics Processing Units (GPUs)

Lecture 1: Introduction

Course Information
CS6963: Parallel Programming for GPUs, MW 10:45-12:05, MEB 3105
• Website: http://www.eng.utah.edu/~cs6963/
• Professor: Mary Hall
  MEB 3466, [email protected], 5-1039
  Office hours: 12:20-1:20 PM, Mondays
• Teaching Assistant: Sriram Aananthakrishnan
  MEB 3157, [email protected]
  Office hours: 2-3 PM, Thursdays
Lectures (slides and MP3) will be posted on the website.
Source Materials for Today’s Lecture

• Wen-mei Hwu (UIUC) and David Kirk (NVIDIA):
  http://courses.ece.uiuc.edu/ece498/al1/
• Jim Demmel (UCB) and Kathy Yelick (UCB, NERSC):
  http://www.eecs.berkeley.edu/~yelick/cs267_sp07/lectures
• NVIDIA: http://www.nvidia.com
• Others as noted

Course Objectives
• Learn how to program “graphics” processors for general-purpose multi-core computing applications
  – Learn how to think in parallel and write correct parallel programs
  – Achieve performance and scalability through understanding of architecture and software mapping
• Significant hands-on programming experience
  – Develop real applications on real hardware
• Discuss the current parallel computing context
  – What drivers make this course timely?
  – Contemporary programming models and architectures, and where the field is going

Grading Criteria

• Homeworks and mini-projects: 25%
• Midterm test: 15%
• Project proposal: 10%
• Project design review: 10%
• Project presentation/demo: 15%
• Project final report: 20%
• Class participation: 5%

Primary Grade: Team Projects

• Some logistical issues:
  – 2-3 person teams
  – Projects will start in late February
• Three parts:
  – (1) Proposal; (2) Design review; (3) Final report and demo
• Application code:
  – Most students will work on MPM, a particle-in-cell code.
  – Alternative applications must be approved by me (start early).

Collaboration Policy

• I encourage discussion and exchange of information between students.
• But the final work must be your own.
  – Do not copy code, tests, assignments, or written reports.
  – Do not allow others to copy your code, tests, assignments, or written reports.

Lab Information
Primary lab
• “lab6” in WEB 130
• Windows machines
• Accounts are supposed to be set up for all who were registered as of Friday
• Contact [email protected] with questions
Secondary
• Until we get to timing experiments, assignments can be completed on any machine running CUDA 2.0 (Linux, Windows, Mac OS)
Tertiary
• Tesla S1070 system expected soon

Text and Notes

1. NVIDIA, CUDA Programming Guide, available from
   http://www.nvidia.com/object/cuda_develop.html for CUDA 2.0 and Windows, Linux, or Mac OS.
2. [Recommended] M. Pharr (ed.), GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, Addison-Wesley, 2005.
   http://http.developer.nvidia.com/GPUGems2/gpugems2_part01.html
3. [Additional] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, 2nd Ed., Addison-Wesley, 2003.
4. Additional readings associated with lectures.

Schedule: A Few Make-up Classes

A few make-up classes are needed due to my travel.

Time slot: Friday, 10:45-12:05, MEB 3105
Dates: February 20, March 13, April 3, April 24

Today’s Lecture

• Overview of course (done)
• Important problems require powerful computers …
  – … and powerful computers must be parallel.
  – Increasing importance of educating parallel programmers (you!)
• Why graphics processors?
• Opportunities and limitations
• Developing high-performance parallel applications
  – An optimization perspective
Parallel and Distributed Computing
• Limited to supercomputers?
  – No! Everywhere!
• Scientific applications?
  – These are still important, but many new commercial and consumer applications are also going to emerge.
• Programming tools adequate and established?
  – No! Many new research challenges (my research area).

Why we need powerful computers

Scientific Simulation:
The Third Pillar of Science
• Traditional scientific and engineering paradigm:
  1) Do theory or paper design.
  2) Perform experiments or build system.
• Limitations:
  – Too difficult -- build large wind tunnels.
  – Too expensive -- build a throw-away passenger jet.
  – Too slow -- wait for climate or galactic evolution.
  – Too dangerous -- weapons, drug design, climate experimentation.
• Computational science paradigm:
  3) Use high-performance computer systems to simulate the phenomenon
  • Based on known physical laws and efficient numerical methods.
Slide source: Jim Demmel, UC Berkeley
The quest for increasingly more powerful machines
• Scientific simulation will continue to push on system requirements:
  – To increase the precision of the result
  – To get to an answer sooner (e.g., climate modeling, disaster modeling)
• The U.S. will continue to acquire systems of increasing scale
  – For the above reasons
  – And to maintain competitiveness

A Similar Phenomenon in Commodity Systems
• More capabilities in software
• Integration across software
• Faster response
• More realistic graphics
• …

Why powerful computers must be parallel

Technology Trends: Moore’s Law
[Chart: transistor count is still rising, while clock speed is flattening sharply. Slide from Maurice Herlihy.]

Technology Trends: Power Issues

[Chart from www.electronics-cooling.com/.../jan00_a2f2.jpg]
Power Perspective
[Chart: performance (GigaFlop/s) and power (MegaWatts) on a log scale from 0.001 to 1,000,000,000, plotted versus year from 1960 to 2020. Slide source: Bob Lucas.]

The Multi-Core Paradigm Shift
What to do with all these transistors?
• Key ideas:
  – Movement away from increasingly complex processor designs and faster clocks
  – Replicated functionality (i.e., parallel) is simpler to design
  – Resources are more efficiently utilized
  – Huge power management advantages
All Computers are Parallel Computers.
Who Should Care About Performance Now?
• Everyone! (Almost)
  – Sequential programs will not get faster
    • If individual processors are simplified and compete for shared resources
    • And forget about adding new capabilities to the software!
  – Parallel programs will also get slower
    • Quest for coarse-grain parallelism is at odds with smaller storage structures and limited bandwidth
  – Managing locality is even more important than parallelism!
    • Hierarchies of storage and compute structures
• Small concession: some programs are nevertheless fast enough
Why Massively Parallel Processors?
• A quiet revolution and potential build-up
  – Calculation: 367 GFLOPS vs. 32 GFLOPS
  – Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
  – Until last year, programmed through graphics APIs
[Chart: GFLOPS over time for successive NVIDIA GPUs. G80 = GeForce 8800 GTX; G71 = GeForce 7900 GTX; G70 = GeForce 7800 GTX; NV40 = GeForce 6800 Ultra; NV35 = GeForce FX 5950 Ultra; NV30 = GeForce FX 5800.]
• GPU in every PC and workstation -- massive volume and potential impact
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign.
GeForce 8800
16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to host CPU
[Block diagram: input assembler and thread execution manager feeding an array of processors with parallel data caches, texture units, and load/store units, all sharing global memory.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL1, University of Illinois, Urbana-Champaign.
Concept of GPGPU (General-Purpose Computing on GPUs)
• Idea:
  – Potential for very high performance at low cost
  – Architecture well suited for certain kinds of parallel applications (data parallel)
  – Demonstrations of 30-100X speedup over CPU
• Early challenges:
  – Architectures very customized to graphics problems (e.g., vertex and fragment processors)
  – Programmed using graphics-specific programming models or libraries
• Recent trends:
  – Some convergence between commodity and GPU architectures, and between their associated parallel programming models

See http://gpgpu.org
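To make “data parallel” concrete, here is a minimal CUDA sketch of the canonical first GPGPU program: a vector add in which each thread produces exactly one output element. This is our illustration, not part of the original slides; the array size and launch configuration are arbitrary.

#include <stdio.h>
#include <cuda_runtime.h>

// Each thread computes exactly one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the final partial block
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;                          // 1M elements (illustrative)
    size_t bytes = n * sizeof(float);

    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;                            // device copies
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                              // threads per block
    int blocks = (n + threads - 1) / threads;       // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f (expect 3.0)\n", hC[0]);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

Compile with nvcc (e.g., nvcc vecadd.cu -o vecadd). The same structure -- copy in, launch a grid of fine-grained threads, copy out -- underlies most of the data-parallel applications discussed above.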
Stretching Traditional Architectures
• Traditional parallel architectures cover some super-applications
  – DSP, GPU, network apps, scientific, transactions
• The game is to grow mainstream architectures “out” or domain-specific architectures “in”
  – CUDA is the latter
[Diagram: traditional applications inside current architecture coverage, new applications inside domain-specific architecture coverage, with obstacles between the two.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL, University of Illinois, Urbana-Champaign.
The fastest computer in the world today
• What is its name? RoadRunner
• Where is it located? Los Alamos National Laboratory
• How many processors does it have? ~19,000 processor chips (~129,600 “processors”)
• What kind of processors? AMD Opterons and IBM Cell/BE (as in PlayStations)
• How fast is it? 1.105 Petaflop/second -- over one quadrillion operations per second (about 1.1 x 10^15)

See http://www.top500.org
Parallel Programming Complexity
An Analogy to Preparing Thanksgiving Dinner
• Enough parallelism? (Amdahl’s Law)
  – Suppose you want to just serve turkey
• Granularity
  – How frequently must each assistant report to the chef?
    • After each stroke of a knife? Each step of a recipe? Each dish completed?
• Locality
  – Grab the spices one at a time? Or collect ones that are needed prior to starting a dish?
  – What if you have to go to the grocery store while cooking?
• Load balance
  – Each assistant gets a dish? Preparing stuffing vs. cooking green beans?
• Coordination and Synchronization
  – Person chopping onions for stuffing can also supply green beans
  – Start pie after turkey is out of the oven
All of these things make parallel programming even harder than sequential programming.
Finding Enough Parallelism
• Suppose only part of an application seems parallel
• Amdahl’s law
  – Let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
  – P = number of processors

  Speedup(P) = Time(1)/Time(P) <= 1/(s + (1-s)/P) <= 1/s

• Even if the parallel part speeds up perfectly, performance is limited by the sequential part

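A quick worked example (our addition, not on the original slide): with s = 0.1 and P = 128, Speedup(128) <= 1/(0.1 + 0.9/128) ≈ 9.3; even with unlimited processors the speedup can never exceed 1/s = 10.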
Overhead of Parallelism
• Given enough parallel work, this is the biggest barrier to getting the desired speedup
• Parallelism overheads include:
  – cost of starting a thread or process
  – cost of communicating shared data
  – cost of synchronizing
  – extra (redundant) computation
• Each of these can be in the range of milliseconds (= millions of flops) on some systems
• Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work

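To get a feel for the first of these costs, one can time back-to-back launches of an empty kernel using CUDA events. The following is an illustrative sketch of ours (not from the slides); absolute numbers vary widely across systems.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void empty(void) { }        // no work: we time only the launch path

int main(void) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    empty<<<1, 1>>>();                 // warm-up: first launch pays one-time init costs
    cudaDeviceSynchronize();

    const int iters = 1000;
    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        empty<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);        // wait until all launches complete

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average launch overhead: %.2f microseconds\n", 1000.0f * ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}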
Locality and Parallelism
[Diagram: a conventional storage hierarchy (proc, cache, L2 cache, L3 cache, memory) alongside the host + GPU storage hierarchy: a device grid of thread blocks, each with shared memory and per-thread registers and local memory, plus device-wide global, constant, and texture memories reachable from the host. Courtesy NVIDIA.]
• Large memories are slow, fast memories are small
• Storage hierarchies are large and fast on average
• Algorithm should do most work on nearby data

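As a small illustration of doing work on nearby data (our sketch, not from the slides; the kernel name and tile size are made up), each thread block below stages a tile of the input into fast on-chip shared memory once, then serves all subsequent reads from that tile rather than from slow global memory:

#define TILE 256

// Reverse each TILE-element chunk of 'in' into 'out'.
// The chunk is loaded once into shared memory (fast, per-block storage),
// so the reversed-order reads never touch global memory again.
// For simplicity, the sketch assumes n is a multiple of TILE.
__global__ void reverseTiles(const float *in, float *out, int n) {
    __shared__ float tile[TILE];
    int base = blockIdx.x * TILE;
    int i = base + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];            // coalesced load from global memory
    __syncthreads();                          // wait until the whole tile is loaded

    int j = base + (TILE - 1 - threadIdx.x);  // mirrored position within the chunk
    if (i < n && j < n)
        out[j] = tile[threadIdx.x];           // served from fast shared memory

}

// Launch (illustrative): reverseTiles<<<(n + TILE - 1) / TILE, TILE>>>(dIn, dOut, n);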
Load Imbalance
• Load imbalance is the time that some processors in the system are idle due to
  – insufficient parallelism (during that phase)
  – unequal size tasks
• Examples of the latter
  – different control flow paths on different tasks
  – adapting to “interesting parts of a domain”
  – tree-structured computations
  – fundamentally unstructured problems
• Algorithm needs to balance load

Summary of Lecture
• Technology trends have caused the multi-core paradigm shift in computer architecture
  – Every computer architecture is parallel
• Parallel programming is reaching the masses
  – This course will help prepare you for the future of programming
• We are seeing some convergence of graphics and general-purpose computing
  – Graphics processors can achieve high performance for more general-purpose applications
• GPGPU computing
  – Heterogeneous, suitable for data-parallel applications

Next Time

• Immersion!
– Introduction to CUDA

