Hetero Lecture Slides 002 Lecture 1 Lecture-1-5-Cuda-API

This document provides an introduction to CUDA memory management and data transfer API functions. It explains how to allocate and free device memory using cudaMalloc and cudaFree. It also demonstrates how to transfer data between host and device memory using cudaMemcpy. The document includes code for a vector addition example that allocates device memory, copies the vectors to device, launches the kernel, and copies the result back to host.


Lesson 1. Introduction to CUDA
- Memory Allocation and Data Movement API Functions

Objective

To learn the basic API functions in CUDA host code:
- Device Memory Allocation
- Host-Device Data Transfer

Data Parallelism - Vector Addition Example


[Figure: vectors A and B are added element by element to produce vector C, i.e. C[i] = A[i] + B[i] for i = 0 .. N-1.]

Vector Addition Traditional C Code


// Compute vector sum C = A + B
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int i;
    for (i = 0; i < n; i++) h_C[i] = h_A[i] + h_B[i];
}

int main()
{
    // Memory allocation for h_A, h_B, and h_C
    // I/O to read h_A and h_B, N elements

    vecAdd(h_A, h_B, h_C, N);
}

Heterogeneous Computing vecAdd - CUDA Host Code

[Figure: the vecAdd host code runs in three parts between Host Memory (CPU) and Device Memory (GPU). Part 1: transfer the inputs from host memory to device memory. Part 2: the GPU performs the actual computation. Part 3: transfer the result back to host memory.]

#include <cuda.h>
void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // 1. Allocate device memory for A, B, and C;
    //    copy A and B to device memory

    // 2. Kernel launch code - the device performs the
    //    actual vector addition

    // 3. Copy C from the device memory;
    //    free device vectors
}

Partial Overview of CUDA Memories


Device code can:
- R/W per-thread registers
- R/W the all-shared global memory

Host code can:
- Transfer data to/from the per-grid global memory

[Figure: a (Device) Grid containing Block (0, 0) and Block (0, 1); each block holds per-thread Registers for Thread (0, 0) and Thread (0, 1); the Host exchanges data with the device's Global Memory.]

We will cover more memory types later.


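To make the two scopes concrete, here is a minimal sketch of a kernel in which the index and the temporary live in per-thread registers while A, B, and C reference global memory. (The course presents its actual kernel in a later lesson; this version is only illustrative.)

__global__ void vecAddKernel(float *A, float *B, float *C, int n)
{
    // i and sum are local variables: each thread gets its own copy,
    // typically held in per-thread registers
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float sum = A[i] + B[i];  // A[i] and B[i] are read from global memory
        C[i] = sum;               // the result is written back to global memory
    }
}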

CUDA Device Memory Management API Functions

cudaMalloc()
- Allocates an object in the device global memory
- Two parameters:
  - Address of a pointer to the allocated object
  - Size of the allocated object in bytes

cudaFree()
- Frees an object from the device global memory
- One parameter: pointer to the freed object

[Figure: the same Grid / Block / Registers / Global Memory diagram as above; cudaMalloc() and cudaFree() operate on the device's Global Memory.]
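As a minimal sketch (assuming n holds the element count, as in the vecAdd code), allocating and then freeing one device vector looks like this:

float *d_A;
int size = n * sizeof(float);      // size is given in bytes, not elements

cudaMalloc((void **) &d_A, size);  // pass the ADDRESS of the pointer
/* ... use d_A on the device ... */
cudaFree(d_A);                     // release the device global memory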

Host-Device Data Transfer API Functions

cudaMemcpy()
- Memory data transfer
- Requires four parameters:
  - Pointer to destination
  - Pointer to source
  - Number of bytes copied
  - Type/direction of transfer
- Transfer to device is asynchronous (the call may return before the copy has completed)

[Figure: the same Grid / Block / Registers / Global Memory diagram; cudaMemcpy() moves data between Host memory and the device's Global Memory.]
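As a small sketch using the same names as the host code below (h_A, d_A, h_C, d_C, size), the direction parameter is one of the cudaMemcpyKind constants:

// Destination first, then source, byte count, and direction constant
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // host -> device
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);  // device -> host
// cudaMemcpyDeviceToDevice and cudaMemcpyHostToHost also exist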

Vector Addition Host Code


void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    // Part 1: allocate device memory and copy the inputs to the device
    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // Part 2: kernel invocation code to be shown later

    // Part 3: copy the result back to the host and free device memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
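The kernel invocation itself is shown in a later lesson. Purely as a sketch, assuming the one-thread-per-element vecAddKernel outlined earlier, Part 2 could launch it like this (256 threads per block is an illustrative choice, not the course's prescribed value):

// ceil(n / 256) blocks of 256 threads, computed with integer arithmetic
vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);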

In Practice, Check for API Errors in Host Code


cudaError_t err = cudaMalloc((void **) &d_A, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n",
           cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}
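Checking every API call this way is verbose, so host code commonly wraps the check in a macro. This is a sketch of that common pattern, not part of the CUDA API:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Hypothetical convenience macro wrapping the error check above
#define CHECK_CUDA(call)                                             \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            printf("%s in %s at line %d\n",                          \
                   cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Usage: CHECK_CUDA(cudaMalloc((void **) &d_A, size));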


To Learn More, Read


Chapter 3.

Thank you!
