CUDA Tricks

Presented by Damodaran Ramani
Synopsis

 Scan Algorithm

 Applications

 Specialized Libraries

 CUDPP: CUDA Data Parallel Primitives Library

 Thrust: a Template Library for CUDA Applications

 CUDA FFT and BLAS libraries for the GPU

References

 Scan Primitives for GPU Computing – Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens

 Presentation on scan primitives by Gary J. Katz, based on the article above

 Parallel Prefix Sum (Scan) with CUDA – Harris, Sengupta and Owens (GPU Gems 3, Chapter 39)
Introduction

 GPUs are massively parallel processors

 Programmable parts of the graphics pipeline operate on primitives (vertices, fragments)

 These primitive programs spawn a thread for each primitive to keep the parallel processors full

 Stream programming model (particle systems, image processing, grid-based fluid simulations, and dense matrix algebra)

 A fragment program operating on n fragments makes O(n) memory accesses

 Problems arise when the access requirements are complex (e.g. prefix-sum – O(n²))
Prefix-Sum Example
 in:  3 1 7 0 4 1 6 3
 out: 0 3 4 11 11 15 16 22
Trivial Sequential Implementation

// Exclusive scan: out[i] holds the sum of in[0..i-1]
void scan(int* in, int* out, int n)
{
    out[0] = 0;
    for (int i = 1; i < n; i++)
        out[i] = in[i-1] + out[i-1];
}
Scan: An Efficient Parallel Primitive

 Interested in finding efficient solutions to parallel problems in which each output requires global knowledge of the inputs

 Why CUDA? (general load-store memory architecture, on-chip shared memory, thread synchronization)
Threads & Blocks

 GeForce 8800 GTX (16 multiprocessors, 8 processors each)

 CUDA structures GPU programs into parallel thread blocks of up to 512 SIMD-parallel threads

 Programmers specify the number of thread blocks and threads per block, and the hardware and drivers map thread blocks to parallel multiprocessors on the GPU

 Within a thread block, threads can communicate through shared memory and cooperate through synchronization

 Because only threads within the same block can cooperate via shared memory and thread synchronization, programmers must partition computation into multiple blocks (complex programming, but large performance benefits)
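
A minimal launch sketch of this model (my_kernel, d_data, and n are hypothetical, not from the slides):

// Illustrative only: the programmer picks the number of blocks and threads
// per block; the hardware maps each block onto a multiprocessor.
int threadsPerBlock = 256;   // at most 512 threads per block on a GeForce 8800
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
my_kernel<<<numBlocks, threadsPerBlock>>>(d_data, n);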
The Scan Operator

 Definition: the scan operation takes a binary associative operator ⊕ with identity I, and an array of n elements

 [a0, a1, …, an-1]

 and returns the array

 [I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2)]

 Types – inclusive, exclusive, forward, backward
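For comparison with the exclusive scan shown earlier, a minimal sequential sketch of the inclusive variant (the function name is mine, not from the slides):

// Inclusive scan: out[i] = in[0] ⊕ … ⊕ in[i]; unlike the exclusive
// variant, each output includes its own input element.
void inclusive_scan(const int* in, int* out, int n)
{
    out[0] = in[0];
    for (int i = 1; i < n; i++)
        out[i] = out[i-1] + in[i];
}
// For in = {3,1,7,0,4,1,6,3}: out = {3,4,11,11,15,16,22,25}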
Parallel Scan

for (d = 1; d <= log2(n); d++)
    for all k in parallel
        if (k >= 2^(d-1))
            x[out][k] = x[in][k - 2^(d-1)] + x[in][k]
        else
            x[out][k] = x[in][k]
    swap(in, out)

 Complexity: O(n log2 n) additions
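
A single-block CUDA sketch of this naive scan, double-buffered in shared memory (kernel name and launch configuration are illustrative; assumes n is a power of two and at most the block size):

__global__ void naive_inclusive_scan(const float* g_in, float* g_out, int n)
{
    extern __shared__ float temp[];  // 2*n floats: two ping-pong buffers
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    temp[pout*n + tid] = g_in[tid];
    __syncthreads();

    for (int offset = 1; offset < n; offset *= 2) {
        pout = 1 - pout;             // swap ping-pong buffers each pass
        pin  = 1 - pout;
        if (tid >= offset)
            temp[pout*n + tid] = temp[pin*n + tid - offset] + temp[pin*n + tid];
        else
            temp[pout*n + tid] = temp[pin*n + tid];
        __syncthreads();
    }
    g_out[tid] = temp[pout*n + tid];
}

// Launch (illustrative): naive_inclusive_scan<<<1, n, 2*n*sizeof(float)>>>(d_in, d_out, n);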
A Work-Efficient Parallel Scan

 Goal is a parallel scan that is O(n) instead of O(n log2 n)

 Solution:
 Balanced trees: build a binary tree on the input data and sweep it to and from the root

 A binary tree with n leaves has log2 n levels, and each level d has 2^d nodes

 One add is performed per node, therefore O(n) adds on a single traversal of the tree
O(n) Unsegmented Scan

 Reduce/Up-Sweep

for (d = 0; d < log2(n); d++)
    for all k = 0; k < n; k += 2^(d+1) in parallel
        x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

 Down-Sweep

x[n-1] = 0;
for (d = log2(n) - 1; d >= 0; d--)
    for all k = 0; k < n; k += 2^(d+1) in parallel
        t = x[k + 2^d - 1]
        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
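
A single-block CUDA sketch of this work-efficient scan in shared memory, along the lines of GPU Gems Chapter 39 (names and launch configuration are illustrative; assumes n is a power of two that fits in one block, one thread per pair of elements):

__global__ void blelloch_exclusive_scan(const float* g_in, float* g_out, int n)
{
    extern __shared__ float temp[];     // n floats
    int tid = threadIdx.x;
    int offset = 1;

    temp[2*tid]     = g_in[2*tid];      // each thread loads two elements
    temp[2*tid + 1] = g_in[2*tid + 1];

    // Up-sweep (reduce): build partial sums up the tree.
    for (int d = n >> 1; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2*tid + 1) - 1;
            int bi = offset * (2*tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset *= 2;
    }

    if (tid == 0) temp[n - 1] = 0;      // clear the root

    // Down-sweep: pass each node's value to its left child,
    // partial sums to the right.
    for (int d = 1; d < n; d *= 2) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2*tid + 1) - 1;
            int bi = offset * (2*tid + 2) - 1;
            float t  = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();

    g_out[2*tid]     = temp[2*tid];
    g_out[2*tid + 1] = temp[2*tid + 1];
}

// Launch (illustrative): blelloch_exclusive_scan<<<1, n/2, n*sizeof(float)>>>(d_in, d_out, n);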
Tree Analogy

Array contents during the down-sweep (n = 8):

After up-sweep: x0  ∑(x0..x1)  x2  ∑(x0..x3)  x4  ∑(x4..x5)  x6  ∑(x0..x7)
Clear root:     x0  ∑(x0..x1)  x2  ∑(x0..x3)  x4  ∑(x4..x5)  x6  0
d = 2:          x0  ∑(x0..x1)  x2  0          x4  ∑(x4..x5)  x6  ∑(x0..x3)
d = 1:          x0  0          x2  ∑(x0..x1)  x4  ∑(x0..x3)  x6  ∑(x0..x5)
d = 0:          0   x0  ∑(x0..x1)  ∑(x0..x2)  ∑(x0..x3)  ∑(x0..x4)  ∑(x0..x5)  ∑(x0..x6)
O(n) Segmented Scan

 Up-Sweep (figure in original slides)
 Down-Sweep (figure in original slides)
Features of Segmented Scan

 About 3 times slower than unsegmented scan
 Useful for building a broad variety of applications which are not possible with unsegmented scan
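
The slides leave the segment representation implicit; a common layout (following the Sengupta et al. paper, an assumption here) pairs the data array with a head-flags array marking the first element of each segment. A sequential sketch of the exclusive segmented scan:

// Exclusive segmented scan with head flags (flags[i] == 1 marks the start
// of a segment); the function name is illustrative.
void segmented_exclusive_scan(const int* in, const int* flags, int* out, int n)
{
    int running = 0;
    for (int i = 0; i < n; i++) {
        if (flags[i]) running = 0;   // restart the sum at each segment head
        out[i] = running;
        running += in[i];
    }
}
// in    = {3, 1, 7, 0, 4, 1, 6, 3}
// flags = {1, 0, 0, 1, 0, 1, 0, 0}
// out   = {0, 3, 4, 0, 0, 0, 1, 7}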
Primitives Built on Scan

 Enumerate
 enumerate([t f f t f t t]) = [0 1 1 1 2 2 3]
 Exclusive scan of input vector

 Distribute (copy)
 distribute([a b c][d e]) = [a a a][d d]
 Inclusive scan of input vector

 Split and split-and-segment
 Split divides the input vector into two pieces, with all the elements marked false on the left side of the output vector and all the elements marked true on the right
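
A sequential sketch of enumerate, which is just an exclusive +-scan over the 0/1 flags (the function name is mine):

// Enumerate: each element receives the count of true flags to its left.
void enumerate(const int* flags, int* out, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        out[i] = count;          // exclusive: does not include flags[i]
        count += flags[i];
    }
}
// enumerate({1,0,0,1,0,1,1}) = {0,1,1,1,2,2,3}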
Applications

 Quicksort
 Sparse Matrix-Vector Multiply
 Tridiagonal Matrix Solvers and Fluid Simulation
 Radix Sort
 Stream Compaction
 Summed-Area Tables
Quicksort
(figure in original slides)

Sparse Matrix-Vector Multiplication
(figure in original slides)
Stream Compaction

 Definition:
 Extracts the 'interesting' elements from an array of elements and places them contiguously in a new array

 Uses:
 Collision Detection
 Sparse Matrix Compression

 Input:  A B A D D E C F B
 Output: A B A C B
Stream Compaction Example

Input (keep the gray elements):   A B A D D E C F B
Set a '1' for each gray input:    1 1 1 0 0 0 1 0 1
Exclusive scan of the flags:      0 1 2 3 3 3 3 4 4

Scatter the gray inputs to the output, using the scan result as the scatter address:

Output:           A B A C B
Scatter address:  0 1 2 3 4
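
A CUDA sketch of the flag and scatter steps (kernel names and the predicate are illustrative; the address array comes from any exclusive scan, such as the kernel sketched earlier or cudppScan):

__global__ void flag_kernel(const int* in, int* flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        flags[i] = (in[i] % 2 == 0);   // example predicate: keep even values
}

__global__ void scatter_kernel(const int* in, const int* flags,
                               const int* addr, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        out[addr[i]] = in[i];          // addr = exclusive scan of flags
}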
Radix Sort Using Scan

Input array:                 100 111 010 110 011 101 001 000
b = least significant bit:     0   1   0   0   1   1   1   0
e = 1 for false sort keys:     1   0   1   1   0   0   0   1
f = exclusive scan of e:       0   1   1   2   3   3   3   3

Total falses = e[n-1] + f[n-1] = 1 + 3 = 4

t = index - f + total falses:
  0-0+4  1-1+4  2-1+4  3-2+4  4-3+4  5-3+4  6-3+4  7-3+4
   = 4    = 4    = 5    = 5    = 5    = 6    = 7    = 8

d = b ? t : f:                 0   4   1   2   5   6   7   3

Scatter the input using d as the scatter address:
                             100 010 110 000 111 011 101 001
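
A sequential sketch of one pass of this split operation (names are mine); repeating the pass for each bit, least significant first, yields the full radix sort because the scatter is stable:

// One split pass: stable-partition keys by the given bit.
void radix_split(const unsigned* in, unsigned* out, int n, int bit)
{
    int total_falses = 0;
    for (int i = 0; i < n; i++)
        total_falses += !((in[i] >> bit) & 1);

    int f = 0;                                   // exclusive scan of e, on the fly
    for (int i = 0; i < n; i++) {
        int b = (in[i] >> bit) & 1;              // sort key bit
        int d = b ? (i - f + total_falses) : f;  // d = b ? t : f
        out[d] = in[i];
        if (!b) f++;
    }
}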
Specialized Libraries

 CUDPP: CUDA Data Parallel Primitives Library
 CUDPP is a library of data-parallel algorithm primitives such as parallel prefix-sum ("scan"), parallel sort and parallel reduction.

CUDPP_DLL CUDPPResult cudppSparseMatrixVectorMultiply(
    CUDPPHandle sparseMatrixHandle, void* d_y, const void* d_x)

Performs the matrix-vector multiply y = A*x for an arbitrary sparse matrix A and vector x.
CUDPPScanConfig config;
config.direction = CUDPP_SCAN_FORWARD;
config.exclusivity = CUDPP_SCAN_EXCLUSIVE;
config.op = CUDPP_ADD;
config.datatype = CUDPP_FLOAT;
config.maxNumElements = numElements;
config.maxNumRows = 1;
config.rowPitch = 0;
cudppInitializeScan(&config);
cudppScan(d_odata, d_idata, numElements, &config);
CUFFT

 For fewer than 8192 elements, slower than FFTW
 Above 8192 elements, roughly 5x speedup over threaded FFTW and 10x over serial FFTW
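
A minimal CUFFT usage sketch for a 1D complex-to-complex forward transform (assumes d_data already holds n cufftComplex values on the device; error checking omitted):

#include <cufft.h>

void fft_forward(cufftComplex* d_data, int n)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);               // one 1D transform of size n
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); // in-place forward FFT
    cufftDestroy(plan);
}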
CUBLAS

 CUDA Basic Linear Algebra Subprograms (BLAS)
 Saxpy, conjugate gradient, linear solvers
 3D reconstruction of planetary nebulae:
 http://graphics.tu-bs.de/publications/Fernandez08TechReport.pdf
 GPU variant 100 times faster than the CPU version
 Matrix size is limited by graphics card memory and texture size
 Although taking advantage of sparse matrices would help reduce memory consumption, sparse matrix storage is not implemented by CUBLAS
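
A minimal Saxpy (y = alpha*x + y) sketch against the legacy CUBLAS API described in the CUBLAS 2.0 manual linked below; error checking omitted:

#include <cublas.h>

void saxpy_on_gpu(int n, float alpha, const float* h_x, float* h_y)
{
    float *d_x, *d_y;
    cublasInit();
    cublasAlloc(n, sizeof(float), (void**)&d_x);
    cublasAlloc(n, sizeof(float), (void**)&d_y);
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);  // copy host -> device
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);
    cublasSaxpy(n, alpha, d_x, 1, d_y, 1);              // y = alpha*x + y
    cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);  // copy result back
    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
}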
Useful Links

 http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/
 http://developer.download.nvidia.com/compute/cuda/2_0/docs/CUBLAS_Library_2.0.pdf
 http://gpgpu.org/developer/cudpp
 http://gpgpu.org/2009/05/31/thrust