
CS/EE 217

GPU Programming and Architecture

Lecture 1: Introduction

Slide credit: Slides adapted from © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012

1
Course Goals
• Learn how to program GPGPU processors and achieve
  – high performance
  – functionality and maintainability
  – scalability across future generations
• Technical subjects
  – principles and patterns of parallel algorithms
  – processor architecture features and constraints
  – programming APIs, tools, and techniques

2
Course Staff
• Professor:
  Nael Abu-Ghazaleh
  WCH-441, (951) 827-2347
  Start your e-mail subject line with "217"
  Office hours: TBD soon; or by appointment
• Teaching Assistants:
  – We will have one (if we can find one)
  – Office hours: TBA
• The class may move in time and space
  – Sorry, I will let you know soon

3
Web Resources
• Course website:
http://www.cs.ucr.edu/~nael/217-f15
– Handouts and lecture slides
– Resources, announcements, projects, …
– Note: While we’ll make an effort to post announcements
on the web, we can’t guarantee it, and won’t make any
allowances for people who miss things in class.
• Piazza for discussions
– Channel for electronic announcements
– Forum for Q&A – course staff read the board, and your
classmates often have answers
• iLearn for submissions and grades
4
Grading
• Exam+Final: 35%

• Labs (Programming assignments): 35%

• Project: 30%, broken down as:
  – Design Document: 25%
  – Project Presentation: 25%
  – Demo/Functionality/Performance/Report: 50%

5
Academic Honesty
• You are allowed and encouraged to discuss
assignments with other students in the class.
Getting verbal advice/help from people who’ve
already taken the course is also fine.
• Any reference to assignments from previous terms
or web postings is unacceptable
• Any copying of non-trivial code is unacceptable
– Non-trivial = more than a line or so
– Includes reading someone else’s code and then going off
to write your own.

6
Academic Honesty (cont.)
• Giving/receiving help on an exam is
unacceptable
• Penalties for academic dishonesty:
– Zero on the assignment for the first occasion
– Automatic failure of the course for repeat
offenses
– UCR academic honesty policy trumps any
instructor policies

7
Team Projects
• Work can be divided up between team
members in any way that works for you
• However, each team member will demo the
final checkpoint of each project individually,
and will get a separate demo grade
– This will include questions on the entire design
– Rationale: if you don’t know enough about the
whole design to answer questions on it, you
aren’t involved enough in the project
8
Text/Notes
1. D. Kirk and W. Hwu, “Programming Massively
Parallel Processors – A Hands-on Approach,
Second Edition”
2. CUDA by Example, Sanders and Kandrot
3. NVIDIA CUDA C Programming Guide
– https://docs.nvidia.com/cuda/cuda-c-programming-guide/
4. Occasional research papers
5. Lecture notes on class website
– Tentative schedule on class website
– Will try to assign reading ahead of time
9
Blue Waters Hardware

• Cray system & storage cabinets: >300
• Compute nodes: >25,000
• Usable storage bandwidth: >1 TB/s
• System memory: >1.5 petabytes
• Memory per core module: 4 GB
• Gemini interconnect topology: 3D torus
• Usable storage: >25 petabytes
• Peak performance: >11.5 petaflops
• Number of AMD Interlagos processors: >49,000
• Number of AMD x86 core modules: >380,000
• Number of NVIDIA Kepler GPUs: >3,000

10
Cray XK7 Compute Node

• Characteristics:
  – AMD Series 6200 (Interlagos) CPU (HT3 links)
  – NVIDIA Kepler GPU (PCIe Gen2 link)
  – Host memory: 32 GB, 1600 MT/s DDR3
  – NVIDIA Tesla X2090 memory: 6 GB GDDR5 capacity
  – Gemini high-speed interconnect (3D torus: X, Y, Z)
• Keplers in final installation

11
CPU and GPU have very different design philosophies

• GPU: throughput-oriented cores
• CPU: latency-oriented cores

[Figure: a GPU chip built from many compute units, each with cache/local memory, registers, SIMD units, and hardware threading, next to a CPU chip with a few large cores, each with a local cache, registers, control logic, and a SIMD unit]
CPUs: Latency Oriented Design

• Large caches
  – Convert long-latency memory accesses to short-latency cache accesses
• Sophisticated control
  – Branch prediction for reduced branch latency
  – Data forwarding for reduced data latency
• Powerful ALU
  – Reduced operation latency

[Figure: CPU with control logic, a few powerful ALUs, a large cache, and DRAM]

13
GPUs: Throughput Oriented Design

• Small caches
  – To boost memory throughput
• Simple control
  – No branch prediction
  – No data forwarding
• Energy-efficient ALUs
  – Many, long latency but heavily pipelined for high throughput
• Require massive number of threads to tolerate latencies

[Figure: GPU with many simple cores and DRAM]

14
Heterogeneous Computing: Use Both CPU and GPU

• CPUs for sequential parts where latency matters
  – CPUs can be 10+X faster than GPUs for sequential code
• GPUs for parallel parts where throughput wins
  – GPUs can be 10+X faster than CPUs for parallel code

15
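To make the division of labor concrete, here is a minimal CUDA vector-add sketch (an illustration added to these notes, not from the original slides; the kernel name and sizes are arbitrary): the latency-oriented CPU handles setup, data movement, and anything sequential, while the throughput-oriented GPU runs one thread per element for the data-parallel part.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Data-parallel part: one GPU thread per output element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Sequential setup runs on the CPU.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Parallel part: launch enough threads to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[123] = %f\n", h_c[123]);  // expect 369.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

The <<<blocks, threads>>> launch is where the massive threading comes from: hundreds of thousands of lightweight threads give the GPU enough work to hide its long memory latencies.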
Heterogeneous parallel computing is catching on

• Application areas: Financial Analysis, Scientific Simulation, Engineering Simulation, Data Intensive Analytics, Medical Imaging, Digital Audio Processing, Digital Video Processing, Computer Vision, Electronic Design Automation, Biomedical Informatics, Statistical Modeling, Ray Tracing Rendering, Interactive Physics, Numerical Methods

• 280 submissions to GPU Computing Gems
  – 110 articles included in two volumes

16
Parallel Programming Work Flow
• Identify compute intensive parts of an
application
• Adopt scalable algorithms
• Optimize data arrangements to maximize
locality
• Performance Tuning
• Pay attention to code portability and
maintainability
Software Dominates System Cost

• SW lines per chip increase at 2x per 10 months
• HW gates per chip increase at 2x per 18 months
• Future systems must minimize software redevelopment
Keys to Software Cost Control

[Figure: the same App running on Core A, then on Core A 2.0, then on multiple Core A cores]

• Scalability
  – The same application runs efficiently on new generations of cores
  – The same application runs efficiently on more of the same cores
Scalability and Portability
• Performance growth with HW generations
– Increasing number of compute units
– Increasing number of threads
– Increasing vector length
– Increasing pipeline depth
– Increasing DRAM burst size
– Increasing number of DRAM channels
– Increasing data movement latency
• Portability across many different HW types
– Multi-core CPUs vs. many-core GPUs
– VLIW vs. SIMD vs. threading
– Shared memory vs. distributed memory

Keys to Software Cost Control

[Figure: the same App running on Core B, Core A, and Core C, and on systems with different organizations]

• Scalability
• Portability
  – The same application runs efficiently on different types of cores
  – The same application runs efficiently on systems with different organizations and interfaces
Parallelism Scalability

Algorithm Complexity and Data Scalability
Why is data scalability important?

• Any algorithm with complexity higher than linear is not data scalable
  – Execution time explodes as data size grows, even for an n*log(n) algorithm
• Processing large data sets is a major motivation for parallel computing
• A sequential algorithm with linear data scalability can outperform a parallel algorithm with n*log(n) complexity
  – log(n) grows to be greater than the degree of HW parallelism, making the parallel algorithm run slower than the sequential algorithm
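As a rough numeric illustration of that last point (added here, not from the slides; the cost model and P = 16 are assumptions chosen only to make the effect visible): take sequential time as n and parallel time as n*log2(n)/P for P-way parallelism, so the parallel algorithm loses once log2(n) exceeds P.

#include <stdio.h>
#include <math.h>

/* Hypothetical cost model: sequential O(n) vs. parallel O(n log n)/P.
   With P = 16, the crossover sits near n = 2^16 (about 65,000). */
int main(void) {
    const double P = 16.0;  /* assumed degree of HW parallelism */
    for (double n = 1e3; n <= 1e12; n *= 1e3) {
        double t_seq = n;               /* c * n, with c = 1 */
        double t_par = n * log2(n) / P; /* c * n * log2(n) / P */
        printf("n=%8.0e  seq=%10.3e  par=%10.3e  -> %s\n",
               n, t_seq, t_par,
               t_par > t_seq ? "sequential wins" : "parallel wins");
    }
    return 0;
}

Under these assumptions the parallel n*log(n) algorithm wins at n = 10^3 but loses from n = 10^6 onward, which is exactly the data-scalability problem: more data makes it worse, and no fixed amount of parallelism rescues it.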
Parallelism cannot overcome complexity for large data sets
A Real Example of Data Scalability
Particle-Mesh Algorithms

Massive Parallelism - Regularity
Load Balance
• The total amount of time to complete a parallel job is limited by the thread that takes the longest to finish

[Figure: thread execution times, balanced (good) vs. imbalanced (bad)]
Global Memory Bandwidth

[Figure: ideal vs. actually achieved memory bandwidth]
Conflicting Data Accesses Cause Serialization and Delays

• Massively parallel execution cannot afford serialization
• Contention in accessing critical data causes serialization
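A classic CUDA instance of this (a sketch added here, not from the slides; the kernel names are made up): many threads doing atomicAdd on a single global counter all conflict on one location and are serialized by the hardware. Privatizing the counter into per-block shared memory confines most of the contention to within a block.

#include <cstdio>
#include <cuda_runtime.h>

// Worst case: every thread updates ONE global location, so the
// conflicting atomics are serialized by the memory system.
__global__ void contended(int *counter) {
    atomicAdd(counter, 1);
}

// Privatized version: threads first accumulate into a per-block
// shared-memory copy; only one update per block touches global memory.
__global__ void privatized(int *counter) {
    __shared__ int local;
    if (threadIdx.x == 0) local = 0;
    __syncthreads();
    atomicAdd(&local, 1);               // contention stays inside the block
    __syncthreads();
    if (threadIdx.x == 0) atomicAdd(counter, local);
}

int main() {
    int *counter;
    cudaMalloc(&counter, sizeof(int));
    cudaMemset(counter, 0, sizeof(int));
    privatized<<<1024, 256>>>(counter);  // same result as contended<<<...>>>
    int result;
    cudaMemcpy(&result, counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("count = %d\n", result);      // expect 1024 * 256 = 262144
    cudaFree(counter);
    return 0;
}

Both kernels compute the same count; the privatized one simply issues 1,024 conflicting global updates instead of 262,144.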
What is at stake?

• Scalable and portable software lasts through many hardware generations

• Scalable algorithms and libraries can be the best legacy we leave behind from this era
QUESTIONS?

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012

35
