GPU Computing 2
LECTURE 02 - CUDA PROGRAMMING
Holger Fröning
[email protected]
Institute of Computer Engineering
Ruprecht-Karls University of Heidelberg
With material from D. Kirk, W. Hwu (“Programming Massively Parallel Processors”)
DIE SHOTS - CPU OR GPU?
[Die shots, annotated: PCIe and QPI links, memory controller, system request queue, caching agents and router on the CPU; command processor and 4x setup pipeline on the GPU]
CUDA & GPU - OVERVIEW
NVIDIA CUDA
  Compute kernel as C program
  Explicit data- and thread-level parallelism
  Computing, not graphics processing
Host communication
Memory hierarchy
  Host memory
  GPU (device) memory
  GPU on-chip memory (later)
More HW details exposed
  Use of pointers
  Load/store architecture
  Barrier synchronization of thread blocks
[Figure: system architecture - CPU socket (96 GFLOPS DP; cores, system request queue) with north bridge and memory interface to host memory (60 GB/s); system and peripheral interfaces (16 GB/s each) via an I/O bridge to the GPU (1,165 GFLOPS DP, 3,494 GFLOPS SP) with its own memory interface (288 GB/s) to GPU memory]
G80 ARCHITECTURE FOR GRAPHICS PROCESSING
[Figure: G80 in graphics mode - host, thread processor, streaming processors (SP), texture units (TF), L1/L2 caches, frame buffer partitions (FB)]
G80 ARCHITECTURE FOR GENERAL-PURPOSE PROCESSING
SM = Streaming Multiprocessor
[Figure: G80 in compute mode - host, input assembler, thread execution manager; 8 SMs, each with a parallel data cache (shared memory); texture units; global memory]
CUDA PROGRAMMING MODEL
PROGRAMMING MODEL
A CUDA program consists of a CPU part and a GPU part
  CPU part: parts of the program with no or little parallelism
  GPU part: highly parallel parts, SPMD-style ("kernels")
Concurrent execution
  Non-blocking kernel execution: the host continues while the GPU computes
  Explicit synchronization (see the sketch below)
C extension with three main abstractions
  1. Hierarchy of threads
  2. Shared memory
  3. Barrier synchronization
Exploiting parallelism
  Inner loops: fine-grain data-level parallelism (DLP) and thread-level parallelism (TLP) -> threads
  Kernels
[Figure: execution timeline alternating between serial CPU phases and parallel GPU kernels]
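A minimal sketch of this host/device split (kernel name, problem size, and launch configuration are illustrative assumptions, not from the slides):

__global__ void scale ( float *v, float a, int n )   // GPU part: SPMD kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n ) v[i] *= a;
}

int main ()                                          // CPU part: serial control flow
{
    int n = 1 << 20;
    float *d_v;
    cudaMalloc ( (void**) &d_v, n * sizeof ( float ) );
    scale <<< ( n + 255 ) / 256, 256 >>> ( d_v, 2.0f, n );  // non-blocking launch
    cudaDeviceSynchronize ();                        // explicit synchronization
    cudaFree ( d_v );
    return 0;
}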
KERNEL LAUNCH
Kernels: N-fold execution by N threads
  __global__
Execution:
  kernel <<< numBlocks, threadsPerBlock >>> ( args )
Unique ID: threadIdx.{x,y,z}
  Control flow for SPMD programs
  Memory access orchestration

__global__ void matAdd ( float A[N][N],
                         float B[N][N],
                         float C[N][N] )
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation
    dim3 dimBlock ( N, N );
    matAdd <<< 1, dimBlock >>> ( A, B, C );
}
KERNEL LAUNCH
Each thread block has up to 3 dimensions
  "Block" in the following
The dimensions of a thread block are limited
  512 x 512 x 64 -> 1024 x 1024 x 64
  GPU-dependent

__global__ void matAdd ( float A[N][N],
                         float B[N][N],
                         float C[N][N] )
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}
KERNEL LAUNCH
Super-fine grained: one thread computes one element
Integer division "/" rounds down, so add (block size - 1) to round up!
  E.g. N=50: grid size = (50+16-1)/16 = 4.0625 => 4

__global__ void matAdd ( float A[N][N],
                         float B[N][N],
                         float C[N][N] )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if ( i < N && j < N )
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation
    dim3 dimBlock ( 16, 16 );                              // block size
    dim3 dimGrid ( ( N + dimBlock.x - 1 ) / dimBlock.x,    // grid size
                   ( N + dimBlock.y - 1 ) / dimBlock.y );
    matAdd <<< dimGrid, dimBlock >>> ( A, B, C );
}
THREAD HIERARCHY
Thread hierarchy
Grid of thread blocks
Blocks of equal size
Given problem size N, how should the parameters threads per block and blocks per grid be chosen?
Recommendations wrt block count
  > 2x number of SMs
  Optimal: 100 - 1000 (max. 64k-1)
Recommendations wrt threads/block
  Required concurrency for latency tolerance vs. resources per thread
  (a configuration sketch follows below)
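A sketch of choosing a configuration along these lines for a 1D problem of size n (device 0, the kernel name, and the block size of 256 are illustrative assumptions):

cudaDeviceProp prop;
cudaGetDeviceProperties ( &prop, 0 );          // query device 0
int threadsPerBlock = 256;                     // a multiple of the warp size
int blocks = ( n + threadsPerBlock - 1 ) / threadsPerBlock;
// recommendation from above: at least 2x the SM count to keep all SMs busy;
// the kernel then needs a bounds check, since excess threads may exist
if ( blocks < 2 * prop.multiProcessorCount )
    blocks = 2 * prop.multiProcessorCount;
kernel <<< blocks, threadsPerBlock >>> ( d_data, n );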
THREAD COMMUNICATION
Only threads within the same block can interact
  Barrier synchronization
Exception: global memory
  Very weak coherence & consistency guarantees
(A minimal interaction sketch follows below.)
[Figure: host and device with grids of thread blocks; Block (1,1) expanded into its threads, indexed (x,y,z) from (0,0,0) to (3,0,1)]
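A minimal sketch of intra-block interaction through shared memory and a barrier (the kernel name and the fixed block size of 256 are illustrative assumptions):

__global__ void reverseWithinBlock ( float *data )
{
    __shared__ float tmp[256];                 // visible to all threads of the block
    int t    = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tmp[t] = data[base + t];
    __syncthreads ();                          // barrier: all writes now visible
    data[base + t] = tmp[blockDim.x - 1 - t];  // safe read of another thread's write
}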
MEMORY HIERARCHY – GLOBAL MEMORY
Global memory
  Communication between host and device
  Accessible from all threads (R/W)
  High latency
  Lifetime exceeds thread lifetime
  Sensitive to fine-grained accesses
Allocation
  cudaMalloc ( &dmem, size );
Deallocation
  cudaFree ( dmem );
Data transfer (blocking)
  cudaMemcpy ( dst, src, size, transfer_type );
  cudaMemcpyAsync ( … )
[Figure: grid of thread blocks, each with shared memory and per-thread registers; host and device global memory]
MEMORY HIERARCHY – GLOBAL MEMORY
// Allocate GPU memory; annotate variable scope (d... device, h... host)!
float *dmem;
cudaMalloc ( (void**) &dmem, N*sizeof ( float ) );

// Allocate CPU memory
float *hmem = (float*) malloc ( N*sizeof ( float ) );

// Transfer data from host to device
cudaMemcpy ( dmem, hmem, N*sizeof ( float ), cudaMemcpyHostToDevice );

// Do calculations; kernels take only references to device memory
kernel1 <<< numBlocks, numThreadsPerBlock >>> ( dmem, N );
...
kernel2 <<< numBlocks, numThreadsPerBlock >>> ( dmem, N );

// Transfer data from device to host
cudaMemcpy ( hmem, dmem, N*sizeof ( float ), cudaMemcpyDeviceToHost );
MEMORY HIERARCHY – SHARED MEMORY
On-chip memory
Lifetime: thread block lifetime
Access costs in the best case equal register access
[Figure: grid of thread blocks, each block with its own shared memory; registers per thread]

Declaration              Memory                            Accessible by
__device__ float var;    global memory (device memory)     device/host program
__constant__ float var;  constant memory (device memory)   device/host program
__shared__ float var;    shared memory (on-chip)           threads of a thread block
texture <float> ref;     texture memory (device memory)    device/host program
Vector types
char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1,
short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3,
uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4,
float1, float2, float3, float4, double2
Derived from basic types (int, float, …)
Dimension type: dim3
  Based on uint3
  Unspecified components are initialized to 1 (small sketch below)
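A small usage sketch (the kernel name is an illustrative assumption):

dim3 grid ( 8, 8 );       // z unspecified -> initialized to 1: 8 x 8 x 1 blocks
dim3 block ( 16, 16 );    // 16 x 16 x 1 threads per block
kernel <<< grid, block >>> ( d_data );
float4 v = make_float4 ( 1.f, 2.f, 3.f, 4.f );  // vector type with fields .x .y .z .w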
FUNCTION DECLARATION

                                    Executed on    Callable from
__device__ float deviceFunc()       device         device
__global__ void  kernelFunc()       device         host
__host__   float hostFunc()         host           host
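A short sketch combining the three qualifiers (function names are illustrative assumptions):

__device__ float square ( float x )            // executed on device, callable from device
{
    return x * x;
}

__global__ void squareAll ( float *v, int n )  // executed on device, callable from host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n ) v[i] = square ( v[i] );
}

__host__ void launch ( float *d_v, int n )     // executed on host, callable from host
{
    squareAll <<< ( n + 255 ) / 256, 256 >>> ( d_v, n );
}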
BRIEF PROPERTY SURVEY (DEVICEQUERY)

                                  GeForce GTX 480        Tesla K20c           RTX 2080 Ti
CC revision                       2.0                    3.5                  7.5
Total global memory [bytes]       1.5G                   5G                   11G
Multiprocessors                   15                     13                   68
Cores                             480                    2496                 4352
Total constant memory [bytes]     64k                    64k                  64k
Shared memory per block [bytes]   48k                    48k                  48k
Registers per block               32k                    64k                  64k
Warp size                         32                     32                   32
Max threads per block             1k                     1k                   1k
Max dimension of a block          1k x 1k x 64           1k x 1k x 64         1k x 1k x 64
Max dimension of a grid           65535 x 65535 x 65535  2G x 65535 x 65535   2G x 65535 x 65535
Max memory pitch [bytes]          2G                     2G                   2G
Clock rate [GHz]                  1.4                    0.7                  1.54
Concurrent copy and execution     Y (1 copy engine)      Y (2 copy engines)   Y (3 copy engines)
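These values come from the deviceQuery sample; a stripped-down sketch of querying a few of them directly:

#include <cstdio>
#include <cuda_runtime.h>

int main ()
{
    int count;
    cudaGetDeviceCount ( &count );
    for ( int d = 0; d < count; ++d ) {
        cudaDeviceProp p;
        cudaGetDeviceProperties ( &p, d );
        printf ( "%s: CC %d.%d, %zu MB global memory, %d SMs, warp size %d, max %d threads/block\n",
                 p.name, p.major, p.minor, p.totalGlobalMem >> 20,
                 p.multiProcessorCount, p.warpSize, p.maxThreadsPerBlock );
    }
    return 0;
}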
CUDA EXAMPLE: SAXPY
SAXPY EXAMPLE
SAXPY EXAMPLE
// kernel function, serial CPU version
void saxpy_serial ( int n, float alpha, float *x, float *y )
{
    int i;
    for ( i = 0; i < n; i++ ) {
        y[i] = alpha * x[i] + y[i];
    }
}
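For comparison, a minimal CUDA version of the same computation (a sketch; the block size of 256 is an illustrative assumption):

// kernel function (GPU): one thread computes one element
__global__ void saxpy_parallel ( int n, float alpha, float *x, float *y )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n )
        y[i] = alpha * x[i] + y[i];
}

// launch with enough blocks to cover all n elements, e.g.:
// saxpy_parallel <<< ( n + 255 ) / 256, 256 >>> ( n, 2.0f, d_x, d_y );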
INITIAL PERFORMANCE
[Chart: SAXPY performance, Tesla K10 (PCIe 2.0 x16) vs. Intel Xeon E5 (single-threaded), for problem sizes up to 64k elements. Huge advantage for the GPU when no data movements are required; pageable vs. pinned host memory makes a substantial difference for transfers.]
PINNED MEMORY
Replace malloc with cudaMallocHost
Significant reduction of data movement costs
Pinned memory is a scarce resource!
float *h_x;
float *h_y;
float *d_x;
float *d_y;
if (USE_PINNED_MEMORY) {
cudaMallocHost ( (void**) &h_x, N*sizeof(float) );
cudaMallocHost ( (void**) &h_y, N*sizeof(float) );
} else {
h_x = (float*) malloc ( N*sizeof(float) );
h_y = (float*) malloc ( N*sizeof(float) );
}
cudaMalloc ( (void**)&d_x, N*sizeof(float) );
cudaMalloc ( (void**)&d_y, N*sizeof(float) );
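The matching cleanup (a sketch): pinned host memory must be released with cudaFreeHost rather than free.

if (USE_PINNED_MEMORY) {
    cudaFreeHost ( h_x );    // counterpart of cudaMallocHost
    cudaFreeHost ( h_y );
} else {
    free ( h_x );
    free ( h_y );
}
cudaFree ( d_x );            // device memory
cudaFree ( d_y );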
COMMON ERRORS
CUDA Error: the launch timed out and was terminated
  -> Stop X11
CUDA Error: unspecified launch failure
  -> Typically a segmentation fault
CUDA Error: invalid configuration argument
  -> Too many threads per block, or too many resources per thread (shared memory, register count)
Compile problem:
  mmult.cu(171): error: identifier "__eh_curr_region" is undefined
  -> Non-static shared memory; use static allocation of shared memory
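None of these runtime errors surface unless return codes are checked. A minimal checking helper (a sketch, not from the slides):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; print location and message on failure.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf ( stderr, "CUDA Error at %s:%d: %s\n",             \
                      __FILE__, __LINE__, cudaGetErrorString(err_) );  \
            exit ( EXIT_FAILURE );                                     \
        }                                                              \
    } while (0)

// usage:
//   CUDA_CHECK ( cudaMalloc ( (void**) &d_x, size ) );
//   kernel <<< grid, block >>> ( d_x );
//   CUDA_CHECK ( cudaGetLastError () );   // catches launch failures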
SUMMARY
Introduction to CUDA
  Pretty unusual concept compared to CPU programming
  Once understood: pretty easy programming model
[Figure: system architecture diagram, as in the overview slide]