
CS 179: GPU Computing

Recitation 1 - 4/1/16
Recap
● Device (GPU) runs CUDA kernel defined in .cu and .cuh files
- C++ code with a few extensions
- Compiled with proprietary NVCC compiler
- Kernel defines the behavior of each GPU thread
● Program control flow managed by host (CPU)
- Uses CUDA API calls to allocate GPU memory, and copy input data from
host RAM to device RAM
- In charge of calling kernel - (almost) like any other function
- Must also copy output data back from device to host
- Executable is ultimately C++ program compiled by G++
- Doesn’t treat object files (.o) produced by NVCC any differently
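- A minimal host-side sketch of that flow (kernel name addOne, sizes, and launch
configuration are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: each thread increments one element.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main() {
    const int n = 1024;
    int host_data[n];
    for (int i = 0; i < n; i++)
        host_data[i] = i;

    // Allocate GPU memory and copy input from host RAM to device RAM.
    int *dev_data;
    cudaMalloc(&dev_data, n * sizeof(int));
    cudaMemcpy(dev_data, host_data, n * sizeof(int), cudaMemcpyHostToDevice);

    // Call the kernel - (almost) like any other function.
    addOne<<<(n + 255) / 256, 256>>>(dev_data, n);

    // Copy output data back from device to host, then free device memory.
    cudaMemcpy(host_data, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_data);

    printf("host_data[0] = %d\n", host_data[0]);   // expect 1
    return 0;
}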
Recap
● GPU hardware abstraction consists of a grid of blocks of threads
- Grid and blocks can have up to three dimensions
- Each block assigned to an independent streaming multiprocessor (SM)
- SM divides blocks into warps of 32 threads
- All threads in a warp execute the same instruction concurrently
- Warp divergence occurs when threads must wait to execute different
instructions
- Individual GPU threads are slow - time spent waiting adds up fast!
● Parallelizable problems can be broken into independent components
- Want to assign one thread per “thing that needs to get done”
- Even better if threads in a warp don’t diverge
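- A classic divergence sketch (branch condition and names illustrative): even and
odd lanes of a warp take different paths, so the warp executes both serially:

if (threadIdx.x % 2 == 0)
    out[i] = a[i] + b[i];    // even lanes execute while odd lanes wait
else
    out[i] = a[i] - b[i];    // then odd lanes execute while even lanes wait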
Parallelizable Problems
● Most obvious example is adding two linear arrays
- CPU code:
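- A minimal sketch of the loop (array names a, b, c and length n assumed):

for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];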

- Need to allocate a, b, c and populate a, b beforehand


- (But you should know how to do that)
● Why is this parallelizable?
- For i ≠ j, operations for c[i] and c[j] don’t have any interdependence
- Could happen in any order
- Thus we can do them all at the same time (!) with the right hardware
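- On the GPU, that means one thread per element; a hedged kernel sketch (name
and launch details illustrative):

__global__ void cudaAddKernel(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n)                                       // guard threads past the end
        c[i] = a[i] + b[i];
}

// launched e.g. as cudaAddKernel<<<(n + 511) / 512, 512>>>(dev_a, dev_b, dev_c, n);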
Non-Parallelizable Problems
● Potentially harder to recognize
● Consider computing a moving average
- Input array x of n data points
- Output array y of n averages
- Two well-known options:
- Simple
- Exponential
● Simple method just weights all data points so far equally
- CPU code:
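- A minimal sketch (each y[i] recomputed from scratch, so outputs stay
independent; names x, y, n assumed):

for (int i = 0; i < n; i++) {
    float sum = 0.0f;
    for (int j = 0; j <= i; j++)
        sum += x[j];              // equal weight on every point so far
    y[i] = sum / (i + 1);         // average of x[0..i]
}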
- Parallelizable? Yes!
- y[i] values are computed independently of one another
Non-Parallelizable Problems
● What about an exponential moving average?
- Uses a recurrence relation to decay each point’s weight by a factor of 1 - c,
where 0 < 1 - c < 1
- Specifically, y[i] = c · x[i] + (1 - c) · y[i - 1]
- Thus y[n] = c · (x[n] + (1 - c) · x[n - 1] + … + (1 - c)^(n - 1) · x[1]) + (1 - c)^n · x[0],
assuming y[0] = x[0]
- CPU code:
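- A minimal sketch (assuming base case y[0] = x[0], matching the expansion above):

y[0] = x[0];
for (int i = 1; i < n; i++)
    y[i] = c * x[i] + (1 - c) * y[i - 1];   // each step needs the previous output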

- Parallelizable? Nope
- Need to know y[i] before calculating y[i + 1]
What Have We Learned?
● Not all problems are parallelizable
- Even similar-looking ones
● Harnessing the GPU’s power requires algorithms whose computations can be
done at the same time
- “Parallel execution”
- Opposite would be “serial execution,” CPU-style
● Output elements should probably not rely on one another
- Would require multiple kernel calls to compute otherwise
- Different blocks of threads can’t wait for each other, more on that
later in the course
- In addition to all the extra instructions, there’s a lot of overhead
Assignment 1: Small-Kernel Convolution
● First assignment involves manipulating an input signal
- In particular, a WAV (.wav) audio file
- We provide an example to test with
- Using audio will require libsndfile
- Installation instructions included in assignment
- Code also includes an option to use random data instead
● C++ and CUDA files provided, your job to fill in TODOs
- Code already includes CPU implementation of desired algorithm
- Your job is to write the equivalent CUDA kernel to parallelize it
- You’re also in charge of memory allocation and host-device data transfers
● Conceptually straightforward, goal is familiarity with integrating CUDA into C++
Some Background on Signals
● A system takes input signal(s), produces output signal(s)
● A signal can be a continuous or discrete stream of data
- Typically characterized by amplitude
- E.g. continuous acoustic input to a microphone
● A continuous signal can also be discretized
- Simply sample it at discrete intervals
- Ideally at regular, periodic intervals
- E.g. the voltage waveform a microphone outputs
● We will consider discrete signals
- Assignment uses two-channel audio
Linear Systems
● Suppose some system takes input signal xi[n] and produces output signal yi[n]
- We denote this as xi[n] → yi[n]
● If the system is linear, then for constants a, b we have:
- a · x1[n] + b · x2[n] → a · y1[n] + b · y2[n]
● Now suppose we want to pick out a single point in the signal
- We can do this with a delta function, δ[n]
- If we treat it as a discrete signal, we can define it as:
- δ[n - k] = 1 if n = k, δ[n - k] = 0 if n ≠ k
- “Zero everywhere with a spike at k”
● This definition means that x[n] · δ[n - k] is zero everywhere except at n = k,
where it equals x[k]
- Note: I was wrong about this in recitation. We use the delta function to
pick out the value of the signal x[n] at the constant point k.
Linear Systems
● Next we can define a system’s response to δ[n - k] as h_k[n]
- I.e. δ[n - k] → h_k[n]
● From linearity we then have x[k] · δ[n - k] → x[k] · h_k[n]
- x[n] is the input signal, δ[n - k] is the delta function signal, and x[k] is a
constant we can scale both sides by
- Note: I was wrong about this in recitation; see the previous slide for
details.
- The response at time k is defined by the response to the delta function
Time-Invariance
● If a system is time-invariant, then it will satisfy:
- x[n] → y[n] ⇒ x[n + m] → y[n + m] for integer m
● Thus given δ[n - k] → h_k[n] and δ[n - l] → h_l[n], we can say that h_k[n] and h_l[n]
are “time-shifted” versions of each other
- Instead of a new response h_k[n] for each k, we can define h[n] such that
δ[n] → h[n], and shift h with k such that δ[n - k] → h[n - k]
- By linearity, we then have x[k] · δ[n - k] → x[k] · h[n - k]
● This lets us rewrite the system’s response x[n] → y[n]:
- x[n] = Σ_k x[k] · δ[n - k] → Σ_k x[k] · h[n - k] = y[n]
- Output must be equivalent to y[n] because x[n] → y[n]
- Note: sum is over all k.
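- Written out step by step: (1) x[n] = Σ_k x[k] · δ[n - k] by the sifting property;
(2) δ[n - k] → h[n - k] by time-invariance; (3) x[k] · δ[n - k] → x[k] · h[n - k]
by linearity, since x[k] is a constant; (4) summing over k gives
x[n] → Σ_k x[k] · h[n - k] = y[n]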
What Have We Learned?
● Linear time-invariant systems have some very convenient properties
- Most importantly, they can be characterized entirely by h[n]
- This allows y[n] to be written entirely in terms of the input samples x[k]
and the delta function response h[n]
● Remember:
- y[n] = Σ_k x[k] · h[n - k]
- x[n] is the input signal to our system
- y[n] is the output signal from our system
- δ[n] is the delta function signal
- h[n] is the impulse response of our system, i.e. its response to δ[n]
Putting It All Together
● Assignment asks you to accelerate convolution of an input signal
- E.g. input x[0..99] and a system whose impulse response h[n] is nonzero only
on h[0..3]
- For finite-duration h such as this, computable with y[n] = Σ_k x[k] · h[n - k]
- y[50] computation, for example, would be:
- y[50] = x[47] · h[3] + x[48] · h[2] + x[49] · h[1] + x[50] · h[0]
- All other h terms are 0
- Here y[50] etc. refer to the signal at that point
● This sum is parallelizable
- Pseudocode:
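- A hedged sketch of one thread’s work (names and indexing illustrative; the
assignment’s skeleton may structure this differently):

__global__ void convolveKernel(const float *x, const float *h,
                               float *y, int n_x, int n_h) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per output sample
    if (n < n_x) {
        float sum = 0.0f;
        // y[n] = Σ h[j] · x[n - j], treating out-of-range x as 0
        for (int j = 0; j < n_h; j++)
            if (n - j >= 0)
                sum += h[j] * x[n - j];
        y[n] = sum;
    }
}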
Assignment Details
● All you need to worry about is the kernel and memory operations
● We provide the skeleton and some useful tools
- CPU implementation - reference this for your GPU version
- Error checking code for your output
- Delta function response h[n] (default is Gaussian impulse response)
- Note: I was wrong in the recitation, saying that h[n] is the response to
any function we wish to convolve. Rather, the system is defined such
that its response to the delta function is the signal we wish to convolve with.
- This derivation is a discrete-time version of
https://en.wikipedia.org/wiki/LTI_system_theory#Impulse_response_and_convolution.
Looking at this will help distinguish when we refer to the signal as a
function and when we refer to a specific point in it.
Assignment Details
● Code can be compiled in one of two modes
- Normal mode (AUDIO_ON defined to be 0)
- Generates random x[n]
- Can test performance on various input lengths
- Can run repeated trials by increasing number of channels
- Audio mode (AUDIO_ON defined to be 1)
- Reads x[n] from input WAV file
- Generates output WAV from y[n]
- Gaussian h[n] is an (imperfect) low-pass filter - high frequencies
should be attenuated
Debugging Tips
● printf() can be useful, but gets messy if all threads print
- Better to only print from certain threads, though your kernel will diverge
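- For example (variable y illustrative):

if (blockIdx.x == 0 && threadIdx.x == 0)
    printf("y[0] = %f\n", y[0]);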
● If you want to check your kernel’s output, copy it back to the host
- More manageable than printing from the kernel and you can write normal
C++ to inspect the data
● Use the gpuErrchk() macro to check CUDA API calls for errors
- Example usage: gpuErrchk(cudaMalloc(&dev_in, length * sizeof(int)));
- Prints error info to stderr and exits
● Use small convolution test cases before trying large arrays or the test WAV
- E.g. 5-element x[n], 3-element h[n]
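- For instance, with x = {1, 2, 3, 4, 5} and h = {1, 1, 1} (a made-up test case,
not from the assignment), the expected output is y = {1, 3, 6, 9, 12}, since
y[n] = x[n] + x[n - 1] + x[n - 2] with out-of-range samples treated as 0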
Any Questions?
