Hetero Lecture Slides 002 Lecture 1 Lecture-1-5-Cuda-API

This document provides an introduction to CUDA memory management and data transfer API functions. It explains how to allocate and free device memory using cudaMalloc and cudaFree. It also demonstrates how to transfer data between host and device memory using cudaMemcpy. The document includes code for a vector addition example that allocates device memory, copies the vectors to device, launches the kernel, and copies the result back to host.


Lesson 1. Introduction to CUDA
- Memory Allocation and Data Movement API Functions

Objective

To learn the basic API functions in CUDA host code:
- Device Memory Allocation
- Host-Device Data Transfer

Data Parallelism - Vector Addition Example


[Figure: vectors A and B are added element by element to produce vector C, i.e. C[i] = A[i] + B[i] for i = 0 .. N-1.]

Vector Addition Traditional C Code


// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int i;
    for (i = 0; i < n; i++) h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements

    vecAdd(h_A, h_B, h_C, N);
}

Heterogeneous Computing vecAdd - CUDA Host Code

[Figure: the vecAdd host code runs in three parts between Host Memory (CPU) and Device Memory (GPU). Part 1: transfer the inputs from host memory to device memory. Part 2: the GPU performs the actual computation. Part 3: transfer the result back to host memory.]

#include <cuda.h>
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // 1. Allocate device memory for A, B, and C;
    //    copy A and B to device memory

    // 2. Kernel launch code - the device performs the
    //    actual vector addition

    // 3. Copy C from the device memory;
    //    free device vectors
}

Partial Overview of CUDA Memories


Device code can:
- R/W per-thread registers
- R/W the all-shared global memory

Host code can:
- Transfer data to/from the per-grid global memory

[Figure: a (Device) Grid containing Block (0, 0) and Block (0, 1); each block holds per-thread Registers for Thread (0, 0) and Thread (0, 1); the Host exchanges data with the device's Global Memory.]

We will cover more memory types later.


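To make the two scopes concrete, here is a minimal sketch of a kernel in which the index and the temporary live in per-thread registers while A, B, and C reference global memory. (The course presents its actual kernel in a later lesson; this version is only illustrative.)

__global__ void vecAddKernel(float *A, float *B, float *C, int n)
{
    // i and sum are local variables: each thread gets its own copy,
    // typically held in per-thread registers
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float sum = A[i] + B[i];  // A[i] and B[i] are read from global memory
        C[i] = sum;               // the result is written back to global memory
    }
}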

CUDA Device Memory Management API Functions

cudaMalloc()
- Allocates an object in the device global memory
- Two parameters:
  - Address of a pointer to the allocated object
  - Size of the allocated object in bytes

cudaFree()
- Frees an object from the device global memory
- One parameter: pointer to the freed object

[Figure: the same Grid / Block / Registers / Global Memory diagram as above; cudaMalloc() and cudaFree() operate on the device's Global Memory.]
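As a minimal sketch (assuming n holds the element count, as in the vecAdd code), allocating and then freeing one device vector looks like this:

float *d_A;
int size = n * sizeof(float);      // size is given in bytes, not elements

cudaMalloc((void **) &d_A, size);  // pass the ADDRESS of the pointer
/* ... use d_A on the device ... */
cudaFree(d_A);                     // release the device global memory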

Host-Device Data Transfer API Functions

cudaMemcpy()
- Memory data transfer
- Requires four parameters:
  - Pointer to destination
  - Pointer to source
  - Number of bytes copied
  - Type/direction of transfer
- Transfer to device is asynchronous (the call may return before the copy has completed)

[Figure: the same Grid / Block / Registers / Global Memory diagram; cudaMemcpy() moves data between Host memory and the device's Global Memory.]
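As a small sketch using the same names as the host code below (h_A, d_A, h_C, d_C, size), the direction parameter is one of the cudaMemcpyKind constants:

// Destination first, then source, byte count, and direction constant
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // host -> device
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);  // device -> host
// cudaMemcpyDeviceToDevice and cudaMemcpyHostToHost also exist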

Vector Addition Host Code


void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Part 1: allocate device memory and copy the inputs to the device
    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // Part 2: kernel invocation code to be shown later

    // Part 3: copy the result back to the host and free device memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
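The kernel invocation itself is shown in a later lesson. Purely as a sketch, assuming the one-thread-per-element vecAddKernel outlined earlier, Part 2 could launch it like this (256 threads per block is an illustrative choice, not the course's prescribed value):

// ceil(n / 256) blocks of 256 threads, computed with integer arithmetic
vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);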

In Practice, Check for API Errors in Host Code


cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n",
           cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
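Checking every API call this way is verbose, so host code commonly wraps the check in a macro. This is a sketch of that common pattern, not part of the CUDA API:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Hypothetical convenience macro wrapping the error check above
#define CHECK_CUDA(call)                                             \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            printf("%s in %s at line %d\n",                          \
                   cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Usage: CHECK_CUDA(cudaMalloc((void **) &d_A, size));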


To Learn More, Read


Chapter 3.

Thank you!
