CUDA - Key Concepts

In this chapter, we will learn about a few key concepts related to CUDA: data parallelism, the program
structure of CUDA, and how a CUDA C program is executed.

Data Parallelism

Modern applications process large amounts of data, which incurs significant execution time on sequential
computers. Rendering pixels is one such workload: consider, for example, an application that converts
sRGB pixels to grayscale. To process a 1920x1080 image, the application has to process 2,073,600 pixels.

Processing all those pixels on a traditional uniprocessor CPU takes a very long time, since the execution
is sequential (the time taken is proportional to the number of pixels in the image). It is also wasteful in a
particular sense: the operation performed on each pixel is the same, only the data differs, which is the
single-program, multiple-data (SPMD) pattern. Since processing one pixel is independent of processing any
other pixel, all the pixels can be processed in parallel. If we use 2,073,600 threads ("workers") and each
thread processes one pixel, the task can, in principle, be completed in constant time. Millions of such
threads can be launched on modern GPUs.
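
To make this concrete, here is a minimal sketch of a per-pixel grayscale kernel. The kernel name, the
interleaved RGB pixel layout, and the one-thread-per-pixel mapping are assumptions for illustration, not
part of any particular library.

// Minimal sketch of a data-parallel grayscale conversion (hypothetical example).
// Each thread converts exactly one RGB pixel to a single grayscale value.
__global__ void rgbToGrayscale(const unsigned char *rgb, unsigned char *gray,
                               int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column of this thread's pixel
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row of this thread's pixel

    if (x < width && y < height) {                   // guard against surplus threads
        int i = y * width + x;                       // linear pixel index
        unsigned char r = rgb[3 * i];
        unsigned char g = rgb[3 * i + 1];
        unsigned char b = rgb[3 * i + 2];
        // Standard luminance weights for converting RGB to grayscale.
        gray[i] = (unsigned char)(0.299f * r + 0.587f * g + 0.114f * b);
    }
}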

As explained in the previous chapters, the GPU has traditionally been used for rendering graphics.
Per-pixel lighting, for example, is a highly parallel and data-intensive task, and a GPU is perfect for the job:
we can map each pixel to a thread, and all pixels can be processed in effectively constant time.

Image processing and computer graphics are not the only areas in which we harness data parallelism to
our advantage. Many high-performance algebra libraries today, such as cuBLAS, harness the processing
power of modern GPUs to perform data-intensive algebra operations. One such operation, matrix
multiplication, is explained in the later sections.

Program Structure of CUDA


A typical CUDA program contains code intended for both the GPU and the CPU; by default, a traditional C
program is a CUDA program with only host code. The CPU is referred to as the host, and the GPU is
referred to as the device. Whereas the host code can be compiled by a traditional C compiler such as GCC,
the device code needs a special compiler that understands the CUDA keywords and API functions being
used. For Nvidia GPUs, that compiler is NVCC (the NVIDIA CUDA Compiler).

The device code runs on the GPU, and the host code runs on the CPU. NVCC processes a CUDA
program and separates the host code from the device code. To accomplish this, it looks for special CUDA
keywords: the code intended to run on the GPU (device code) is marked with keywords that label
data-parallel functions, called kernels. The device code is then compiled further by NVCC and
executed on the GPU.
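
For illustration, here is a minimal sketch of a single .cu source file mixing host and device code; the kernel
name and launch configuration are made up for this example.

// Minimal sketch of a .cu file containing both host and device code.
// Compile with NVCC, e.g.: nvcc hello.cu -o hello

#include <cstdio>

// Device code: __global__ marks this function as a kernel that runs on the GPU.
__global__ void helloKernel()
{
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

// Host code: ordinary C/C++ that runs on the CPU and launches the kernel.
int main()
{
    helloKernel<<<1, 4>>>();    // launch 1 block of 4 threads on the device
    cudaDeviceSynchronize();    // wait for the GPU to finish before exiting
    return 0;
}
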
Execution of a CUDA C Program
How does a CUDA program work? While writing a CUDA program, the programmer has explicit control over
the number of threads to launch (this is a carefully chosen number). These threads collectively form an
(up to) three-dimensional grid: threads are grouped into blocks, and blocks are grouped into the grid. Each
thread is given a unique identifier, which it can use to determine which data it should act upon.
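
As a rough sketch of how a thread typically derives that identifier from the built-in blockIdx, blockDim and
threadIdx variables (the kernel name and launch shown here are hypothetical):

// Hypothetical kernel: each thread uses its unique global index to pick
// the one element it is responsible for.
__global__ void scale(float *v, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread identifier
    if (i < n)                                       // surplus threads do nothing
        v[i] *= factor;
}

// Host-side launch: blocks of 256 threads, enough blocks to cover n elements.
// dim3 can also describe two- or three-dimensional blocks and grids.
void launchScale(float *d_v, float factor, int n)
{
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    scale<<<grid, block>>>(d_v, factor, n);
}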

Device Global Memory and Data Transfer

As explained in the previous chapter, a typical GPU comes with its own global memory (DRAM, Dynamic
Random Access Memory); the Nvidia GTX 480, for example, has 1.5 GB of it. From now on, we will call this
memory the device memory.

To execute a kernel on the GPU, the programmer needs to explicitly allocate memory on the device in code.
The CUDA API provides specific functions for accomplishing this. Here is the typical flow sequence:

- After allocating memory on the device, data has to be transferred from the host memory to the
  device memory.
- After the kernel is executed on the device, the result has to be transferred back from the device
  memory to the host memory.
- Finally, the allocated memory on the device has to be freed up.

Note that the host can access the device memory and transfer data to and from it, but not the other
way round.

CUDA provides API functions to accomplish all these steps.
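
Here is a rough sketch of that flow; the kernel, array names, and sizes below are placeholders chosen for
illustration, not part of the CUDA API.

#include <cstdlib>

// Hypothetical kernel: doubles every element of a device array.
__global__ void doubleElements(float *d_a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_a[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);         // host memory
    for (int i = 0; i < n; ++i) h_a[i] = 1.0f;

    float *d_a;
    cudaMalloc((void **)&d_a, bytes);                     // 1. allocate device memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // 2. host -> device transfer

    doubleElements<<<(n + 255) / 256, 256>>>(d_a, n);     // 3. execute the kernel

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // 4. device -> host transfer
    cudaFree(d_a);                                        // 5. free device memory

    free(h_a);
    return 0;
}

In this sketch, cudaMalloc, cudaMemcpy, and cudaFree are the API functions that cover allocation, transfer,
and cleanup; everything else is ordinary host code.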
