GPGPU Programming With CUDA: Leandro Avila - University of Northern Iowa
Mentor: Dr. Paul Gray Computer Science Department University of Northern Iowa
Outline
Introduction
A shift away from the traditional paradigm of sequential programming, towards parallel processing. Scientific computing needs to change in order to deal with vast amounts of data. Hardware changes have contributed to the move towards parallel processing.
Memory Wall
Discrepancy between memory and CPU performance. Effort put into instruction-level parallelism (ILP) yields diminishing returns. Clock frequency vs. heat dissipation.
Power Wall
Manferdelli, J. (2007) - The Many-Core Inflection Point for Mass Market Computer Systems
Accelerators
In HPC, an accelerator is a hardware component whose role is to speed up some aspect of the computing workload. In the old days (1980s), supercomputers had array processors for vector operations on arrays, as well as floating point accelerators. More recently, Field Programmable Gate Arrays (FPGAs) allow reprogramming deep into the hardware.
Accelerators
Advantages
They make your code run faster.
Disadvantages
More expensive. Harder to program. Code is not portable from one accelerator to another (OpenCL attempts to change this).
Introducing GPGPU
General Purpose Computing on Graphics Processing Units. A great example of the trend of moving away from the traditional model.
Why GPUs?
Graphics Processing Units (GPUs) were originally designed to accelerate graphics tasks like image rendering. They became very popular with videogamers, because they have produced better and better images, lightning fast. And prices have been extremely good, ranging from three figures at the low end to four figures at the high end. Most of what GPUs do is render images.
This is done mostly through floating point arithmetic, the same stuff people use supercomputing for!
Architecture
Architecture Comparison
Nvidia Tesla C1060: 240 processing cores, 4 GB memory, 1.3 GHz clock speed, 102 GB/sec memory bandwidth, 933 GFLOPS single precision, 78 GFLOPS double precision.
Intel i7 975 Extreme: 4 cores, 32 KB/core L1 cache, 256 KB/core L2 cache, 8 MB shared L3 cache, 3.33 GHz clock speed, 25 GB/sec memory bandwidth, 70 GFLOPS double precision.
Components
From http://www.tomshardware.com/reviews/nvidia-cuda-gpu,1954-7.html
Streaming Multiprocessors
Blocks of threads are assigned to SMs. On the Tesla C1060, an SM contains 8 Scalar Processors.
Hardware Hierarchy
The GPU is organized hierarchically: Streaming Multiprocessors, each containing Scalar Processors.
Great! We see that the GPU architecture is different from what we see in the traditional CPU. So... now what? What does this all mean? How do we use it?
Glossary
The HOST is the machine executing the main program. The DEVICE is the card with the GPU. The KERNEL is the routine that runs on the GPU.
A THREAD is the basic execution unit on the GPU. A BLOCK is a group of threads. A GRID is a group of blocks. A WARP is a group of 32 threads.
Recall that threads are organized in BLOCKS, and BLOCKS in turn are organized in a GRID. The GRID can have two dimensions: x and y.
Maximum GRID dimensions: 65535 x 65535 x 1
Maximum BLOCK dimensions: 512 x 512 x 64
(These limits can also be queried at run time; see the sketch below.)
Prior to kernel execution, we need to set it up by specifying the dimensions of the GRID and the dimensions of the BLOCKS.
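Before choosing those dimensions it can help to know the actual limits of the installed card. As a hedged sketch (device 0 and the printed labels are assumptions, not from the original slides), the limits can be queried at run time with cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}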
Scheduling in Hardware
The host launches kernels onto the device: Kernel 1 executes on Grid 1 and Kernel 2 on Grid 2. Each grid consists of blocks (Block (0,0) through Block (2,1) in Grid 1), and each block consists of threads (Thread (0,0) through Thread (4,2) in Block (1,1)).
The grid is launched. Blocks are distributed to the SMs. Each SM initiates processing of warps and schedules the warps that are ready. As warps finish and resources are freed, new warps are scheduled. An SM can take up to 1024 threads.
Memory Layout
Registers and shared memory are the fastest. Local memory is virtual memory. Global memory is the slowest.
Threads access memory as follows:
Registers: read & write
Local Memory: read & write
Shared Memory: read & write (block level)
Global Memory: read & write (grid level)
Constant Memory: read only (grid level)
Remember that Local Memory is implemented as virtual memory from a region that resides in Global Memory.
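As a minimal sketch of these access rules (the kernel and variable names are hypothetical, not from the slides): a thread reads and writes its registers and global memory, reads and writes the shared memory of its block, and only reads constant memory.

__constant__ float scale;                 // constant memory: read-only inside the kernel (grid level)

__global__ void accessDemo(float *g_out, const float *g_in)   // global memory: read & write (grid level)
{
    __shared__ float s_buf[256];          // shared memory: read & write within the block
                                          // (assumes blocks of 256 threads and a grid that exactly covers the arrays)
    int   i = blockIdx.x * blockDim.x + threadIdx.x;
    float r = g_in[i];                    // register: read & write, private to this thread
    s_buf[threadIdx.x] = r * scale;       // write shared memory, read constant memory
    __syncthreads();                      // make the shared write visible to the whole block
    g_out[i] = s_buf[threadIdx.x];        // write global memory
}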
CUDA API
Programming Pattern
The host reads the input and allocates memory on the device. The host copies data to the device. The host invokes a kernel that executes in parallel, using the data and hardware on the device, to do some useful work. The host copies the results back from the device for post processing.
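A minimal, self-contained sketch of this pattern (the kernel, sizes, and array names here are illustrative assumptions, not from the slides):

#include <cuda_runtime.h>

__global__ void addOne(float *d_a, int n)               // trivial kernel used for illustration
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_a[i] += 1.0f;
}

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float h_a[n];                                        // host buffer: input and results
    for (int i = 0; i < n; ++i) h_a[i] = (float)i;       // host prepares the input

    float *d_a;
    cudaMalloc((void**)&d_a, bytes);                     // allocate memory on the device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice); // copy data to the device

    addOne<<< n / 256, 256 >>>(d_a, n);                  // invoke the kernel (4 blocks of 256 threads)

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost); // copy results back for post processing
    cudaFree(d_a);
    return 0;
}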
Kernel Setup
__global__ void myKernel(float *d_b, float *d_a);   // declaration (parameter types shown for illustration)
dim3 dimGrid(2, 2, 1);                              // 2 x 2 grid of blocks
dim3 dimBlock(4, 8, 8);                             // 4 x 8 x 8 threads per block
myKernel<<< dimGrid, dimBlock >>>( d_b, d_a );      // launch
Function Declaration
__device__ float myDeviceFunc()   executes on the Device, callable from the Device
__global__ void myKernel()        executes on the Device, callable from the Host
__host__ float myHostFunc()       executes on the Host, callable from the Host
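A short sketch of how these combine (names are hypothetical): a __global__ kernel launched from the host calls a __device__ helper on the GPU, while a __host__ function is ordinary CPU code.

#include <cstdio>

__device__ float square(float x)              // executes on the device, callable from device code
{
    return x * x;
}

__global__ void squareAll(float *d_v, int n)  // executes on the device, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_v[i] = square(d_v[i]);
}

__host__ void report(float result)            // executes on the host, callable from the host
{
    printf("result = %f\n", result);
}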
Useful Variables
gridDim.(x|y): grid dimensions in x and y
blockDim.(x|y|z): number of threads in a block, per dimension
blockIdx.(x|y): block index within the grid
threadIdx.(x|y|z): thread index within the block
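The most common use of these variables is giving each thread a unique global index; a minimal sketch (the kernel name and arguments are hypothetical):

__global__ void scaleVector(float *d_v, float alpha, int n)
{
    // block offset plus position within the block gives a unique index per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)           // guard: the grid may contain more threads than elements
        d_v[i] *= alpha;
}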
Variable type qualifiers specify the memory location of a variable in the device's memory:
__device__ declares a variable in device (global) memory
__constant__ declares a constant in constant memory
__shared__ declares a variable in shared memory, shared by the threads of a block
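A minimal sketch of the three qualifiers in use (all names are hypothetical); the host can initialize __device__ and __constant__ variables with cudaMemcpyToSymbol:

__device__   float d_scale;                   // variable in device (global) memory
__constant__ float c_table[16];               // constant in constant memory

__global__ void useQualifiers(float *g_out)
{
    __shared__ float s_tile[64];              // shared memory: one copy per block (assumes 64 threads per block)
    s_tile[threadIdx.x] = c_table[threadIdx.x % 16] * d_scale;
    __syncthreads();
    g_out[blockIdx.x * blockDim.x + threadIdx.x] = s_tile[threadIdx.x];
}

// Host-side initialization (sketch):
//   float h_table[16] = { /* coefficients */ };
//   cudaMemcpyToSymbol(c_table, h_table, sizeof(h_table));
//   float h_scale = 2.0f;
//   cudaMemcpyToSymbol(d_scale, &h_scale, sizeof(float));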
Note: all shared memory variables declared as external arrays start at the same address. You must use offsets if multiple variables are to be placed in that shared memory.
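A sketch of managing those offsets for dynamically sized shared memory (the sizes and names are assumptions, not from the slides):

extern __shared__ char s_mem[];               // single dynamic allocation, sized at kernel launch

__global__ void splitShared(int nFloats, int nInts)
{
    // every extern __shared__ variable starts at the same base address,
    // so distinct regions must be carved out with explicit offsets
    float *s_floats = (float*)s_mem;                    // first region: nFloats floats
    int   *s_ints   = (int*)&s_floats[nFloats];         // second region: nInts ints

    if (threadIdx.x < nFloats) s_floats[threadIdx.x] = 0.0f;
    if (threadIdx.x < nInts)   s_ints[threadIdx.x]   = 0;
}

// Launch with enough dynamic shared memory for both regions (third launch parameter):
//   splitShared<<< grid, block, nFloats * sizeof(float) + nInts * sizeof(int) >>>(nFloats, nInts);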