HIGH PERFORMANCE COMPUTING ON GPU
Graphics Processing Units
•A graphics processing unit (GPU) is a specialized microprocessor that offloads
and accelerates 3D or 2D graphics rendering.
•A modern GPU's highly parallel structure makes it more effective than a general-purpose
CPU for compute-intensive, data-parallel workloads.
•NVIDIA's Tesla architecture exposes the computational horsepower of NVIDIA GPUs.
The GPU is specialized for compute-intensive, highly parallel computation and is
designed such that more transistors are devoted to data processing rather than to
data caching and flow control.
Physical Memory Layout of NVIDIA GPUs
The device has its own global memory, which all the
cores (thread processors) can access. N multiprocessors
have M cores each. Cores share an instruction unit with the
other cores in their multiprocessor. Each core has its own
local memory (residing in DRAM) and a separate register set, and
all M cores share an on-chip memory called shared
memory. The host can write to the global memory but not
to the shared memory.
TESLA C1060
• The NVIDIA® Tesla™ C1060 is based on the NVIDIA
10-series architecture and has 30 multiprocessors, each
with 8 cores, a double-precision unit, and on-chip
shared memory.
What is CUDA?
CUDA is a scalable parallel programming
model and a software environment for
parallel computing
• Minimal extensions to familiar C/C++ environment
• Heterogeneous serial-parallel programming model
Kernels and Threads
Parallel portions of an application are executed on
the device as kernels
• One kernel is executed at a time
• All the parallel threads execute the same kernel.
• Some higher-end devices can execute more than one
kernel concurrently.
More about threads
• A CUDA kernel is executed by an array of
threads.
• All threads run the same code.
• Each thread has an ID that it uses to compute
memory addresses and make control decisions.
Computation of memory addresses and control decisions will be discussed later; a minimal indexing sketch is shown below.
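As an illustration (not from the original slides; the kernel name and arguments are hypothetical), a kernel in which each thread uses its ID to pick one array element:

__global__ void scale(float *data, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread ID
    if (idx < n)                                      // control decision based on the ID
        data[idx] *= factor;                          // memory address computed from the ID
}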
THREAD BATCHING
A kernel launch creates a grid of thread blocks.
•Threads within a block cooperate via shared memory
•Threads within a block can synchronize (thread
cooperation)
•Threads in different blocks cannot cooperate
(Figure slides: memory access and execution model diagrams.)
CUDA C and Compilation
• CUDA C provides a simple path for users familiar with the C programming language
to easily write programs for execution by the device. It consists of a minimal set of
extensions to the C language and a runtime library.
• CUDA provides the nvcc compiler, which splits CUDA code into PTX code for the device (used at
runtime) and standard C code for the host (passed to the standard C compiler at compile time).
(Figure: the standard C code runs on the CPU; the PTX device code runs on the GPU.)
Managing memory
The GPU's memory can only be managed by the CPU, and the CPU has access only
to the global memory.
The following memory operations apply only to global memory
(not to local or shared memory).
• Allocate/Free memory
– cudaMalloc(void ** pointer, size_t nbytes) // allocates nbytes of device memory
– cudaMemset(void * pointer, int value, size_t count) // sets "count" bytes to
"value"
– cudaFree(void* pointer) // frees memory allocated by cudaMalloc
• HOST <-> DEVICE data transfer
– cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)
// transfers "nbytes" of data from "src" to "dst"; "direction" specifies the source and destination memory types
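As a minimal sketch (the array size and names are our own), putting these calls together to allocate, initialize, transfer, and free global memory:

int n = 1024;
size_t nbytes = n * sizeof(float);
float *a_h = (float *)malloc(nbytes);                   // host array
float *a_d = NULL;                                      // device pointer
cudaMalloc((void **)&a_d, nbytes);                      // allocate global memory on the device
cudaMemset(a_d, 0, nbytes);                             // set every byte of the allocation to 0
cudaMemcpy(a_d, a_h, nbytes, cudaMemcpyHostToDevice);   // host -> device
cudaMemcpy(a_h, a_d, nbytes, cudaMemcpyDeviceToHost);   // device -> host
cudaFree(a_d);                                          // free the device memory
free(a_h);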
CUDA Function Qualifiers
__global__
• Function called from host and executed on device
• Must return void
E.g., kernels
__device__
• Function called from device and run on device.
• Cannot be called from host code
__host__
• Function called from host and executed on host (default)
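A minimal sketch (function names are hypothetical) showing the three qualifiers together:

__device__ float square(float x)               // callable only from device code
{
    return x * x;
}

__global__ void square_all(float *v, int n)    // kernel: called from the host, runs on the device, returns void
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = square(v[i]);
}

__host__ void launch(float *v_d, int n)        // ordinary host function (the default qualifier)
{
    square_all<<<(n + 255) / 256, 256>>>(v_d, n);
}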
Kernel Calls and Unique Thread Index
Kernels are called by the modified syntax:
kernel<<<dim3 dG, dim3 dB>>>(…)
Here dim3 is a vector type with x, y, z as its members.
We can initialize dim3 objects with the constructor:
For 1D grid: dim3 dG(var_x,1,1) or dim3 dG(var)
For 2D grid: dim3 dG(var_x,var_y,1) or dim3 dG(var_x,var_y)
Similarly for blocks:
For 1D block: dim3 dB(var_x,1,1) or dim3 dB(var)
For 2D block: dim3 dB(var_x,var_y,1) or dim3 dB(var_x,var_y)
For 3D block: dim3 dB(var_x,var_y,var_z)
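As an illustrative sketch (the kernel name and image size are hypothetical), launching a 2D grid of 2D blocks:

dim3 dG(1024 / 16, 768 / 16);        // 64 x 48 thread blocks for a 1024 x 768 image
dim3 dB(16, 16);                     // 256 threads per block
process_image<<<dG, dB>>>(img_d);    // img_d is a device pointer obtained from cudaMalloc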
Thread Synchronization
Host synchronization:
cudaThreadSynchronize();
Blocks until all preceding CUDA calls have completed.
Device synchronization:
void __syncthreads();
Synchronizes all threads within a block.
There is no way to synchronize threads across different blocks.
The programmer should be careful to avoid RAW/WAW/WAR
hazards.
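A minimal sketch (ours) of how __syncthreads() avoids a read-after-write hazard on shared memory within a block:

__global__ void reverse_in_block(int *data)
{
    __shared__ int tile[256];                      // assumes 256 threads per block
    int t = threadIdx.x;
    tile[t] = data[blockIdx.x * blockDim.x + t];   // every thread writes one shared element
    __syncthreads();                               // wait until all writes in the block are done
    data[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];  // now reading other threads' writes is safe
}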
Heterogeneous Programming and Synchronization
// copy data from host to device
cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice);
// execute the kernel
inc_gpu<<<ceil(N/(float)blocksize), blocksize>>>(a_d, N);
// run independent CPU code
run_cpu_stuff();
// copy data from device back to host
cudaMemcpy(a_h, a_d, numBytes, cudaMemcpyDeviceToHost);
Error Reporting
All CUDA calls return an error code, but some calls are
asynchronous, so the program should synchronize before
checking for errors.
Example:
cudaThreadSynchronize();
Kernel_Launch<<<config_arguments>>>(arg_list);
cudaThreadSynchronize();
printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
Hardware Implementation
• The CUDA architecture is built around a scalable array of multithreaded
multiprocessors. When a CUDA program on the host CPU invokes a kernel grid, the
blocks of the grid are enumerated and distributed to multiprocessors with
available execution capacity. The threads of a thread block execute concurrently on
one multiprocessor, and multiple thread blocks can execute concurrently on one
multiprocessor. As thread blocks terminate, new blocks are launched on the
vacated multiprocessors. This makes the framework scalable.
• A multiprocessor is designed to execute hundreds of threads concurrently. To
manage such a large number of threads, it employs a unique architecture called
SIMT (Single-Instruction, Multiple-Thread). When a multiprocessor is given one or
more thread blocks to execute, it partitions them into warps of 32 threads. A warp executes one
common instruction at a time, so full efficiency is realized when all 32 threads of a
warp agree on their execution path.
PERFORMANCE OPTIMIZATION
Performance optimization revolves around three
basic strategies:
• Maximizing parallel execution
• Optimizing memory usage
• Optimizing instruction usage to achieve
maximum instruction throughput
Maximizing parallel execution
• Amdahl's law states that the maximum speed-up (S) of a program is
S = 1 / ((1 − P) + P/N)
where P is the fraction of the total serial execution time taken by the portion of code that can be parallelized and N is the number of
processors over which the parallel portion of the code runs. The larger N is (that is, the greater the number of processors), the smaller
the P/N fraction.
It can be simpler to view N as a very large number, which essentially transforms the
equation into
S = 1 / (1 − P)
Now, if ¾ of a program is parallelized, the maximum speed-up over serial code is 1/ (1
– ¾) = 4. So our aim is to increase P by increasing the fraction of parallel code.
Optimizing memory transfers
• To run kernels, data values must be transferred from the host
to the device along the PCI Express (PCIe) bus. It is important
to minimize data transfer between the host and the device,
even if that means running kernels on the GPU that do not
demonstrate any speed-up.
Device<->Device transfer
CUDA provides a function for device-to-device data transfer, which
can only be called from the host code.
The call to cudaMemcpy() is asynchronous, but
• the next kernel won't start until the memory transfer is complete,
• and what if there is a large amount of memory to transfer?
The GPU cores will be idle.
To increase performance we can instead assign the job of copying N
bytes of data to B blocks, each running k threads in parallel. For
best performance N = k * B. (E.g., it takes about 4.5 times less time if we
assign the job of copying 1 MB of data to around 1k threads.)
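A minimal sketch (our own, not from the slides) of such a copy kernel, in which each of the k * B threads copies one element:

__global__ void copy_kernel(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)
        dst[i] = src[i];
}

// e.g. k = 256 threads per block, B = (n + 255) / 256 blocks:
// copy_kernel<<<(n + 255) / 256, 256>>>(src_d, dst_d, n);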
Shared Memory
Each multiprocessor has 16 KB of shared memory
associated with it.
• Provides thread cooperation within a block of
threads.
• Allows sharing of memory accesses (see the sketch below)
• Avoids redundant computations
• Because it is on-chip, shared memory is much
faster than local and global memory.
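As an illustrative sketch (ours), a 1D averaging kernel that stages a block's data in shared memory so neighbouring threads reuse each other's loads instead of re-reading global memory:

__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float s[256 + 2];                        // assumes blockDim.x == 256, plus a 1-element halo on each side
    int g = blockIdx.x * blockDim.x + threadIdx.x;      // global index
    int l = threadIdx.x + 1;                            // index inside the shared tile
    if (g < n) {
        s[l] = in[g];                                   // each thread loads its own element once
        if (threadIdx.x == 0)                           // left halo
            s[0] = (g > 0) ? in[g - 1] : in[g];
        if (threadIdx.x == blockDim.x - 1 || g == n - 1)  // right halo
            s[l + 1] = (g + 1 < n) ? in[g + 1] : in[g];
    }
    __syncthreads();                                    // make all shared-memory writes visible to the block
    if (g < n)
        out[g] = (s[l - 1] + s[l] + s[l + 1]) / 3.0f;   // three reads served from fast on-chip memory
}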
Coalesced Access to Global Memory
• Global memory can be viewed in terms of aligned segments
of 16 and 32 words.
(Figures: coalesced access in which all threads but one access the corresponding
word in a segment; misaligned sequential addresses that fall within two
128-byte segments.)
Choosing thread block sizes as multiples of 16 facilitates
memory accesses by half-warps that are aligned to segments.
But the warp size is 32, so there should be a minimum of 32
threads per block.
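As an illustrative sketch (ours), a coalesced access pattern next to a strided one; in the first kernel the 16 threads of a half-warp read 16 consecutive words from one segment, while the second spreads them over many segments:

__global__ void coalesced_read(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];             // consecutive threads touch consecutive words: coalesced
}

__global__ void strided_read(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];    // consecutive threads touch words far apart: many more memory transactions
}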
Optimizing Instruction Usage
A warp executes one common instruction at a time, so
full efficiency is realized when all 32 threads of a warp
agree on their execution path. Any flow control
instruction (if, switch, do, for, while) can significantly
affect the instruction throughput by causing threads of
the same warp to diverge onto different execution paths. If this
happens, the different execution paths must be
serialized, increasing the total number of instructions
executed for this warp.
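A small sketch (ours) of a branch that diverges within a warp versus one that branches at whole-warp granularity and therefore does not:

__global__ void divergent(float *v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)          // even and odd lanes of the same warp take different paths: serialized
        v[i] *= 2.0f;
    else
        v[i] *= 0.5f;
}

__global__ void uniform(float *v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)   // all 32 threads of a warp take the same path: no divergence
        v[i] *= 2.0f;
    else
        v[i] *= 0.5f;
}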
Parallelizing w.r.t. pixels
This applies when the processing of the pixels is independent,
e.g., conversion from RGB to grey, or conversion from one format to
another.
char *host_img_rgb = (char *)malloc(3*height*width*sizeof(char));
char *host_img_grey = (char *)malloc(height*width*sizeof(char)); //allocating in Host
char *dev_img_rgb, *dev_img_grey; //device pointers
cudaMalloc((void **) &(dev_img_rgb), 3*width*height*sizeof(char));
cudaMalloc((void **) &(dev_img_grey), width*height*sizeof(char)); //allocating in Device
//read image into the HOST memory
//copy that rgb image into the Device memory
cudaMemcpy(dev_img_rgb, host_img_rgb, 3*width*height*sizeof(char), cudaMemcpyHostToDevice);
Kernel<<<(height*width)/256, 256>>>(dev_img_rgb, dev_img_grey);
//copy back to host memory
cudaMemcpy(host_img_grey, dev_img_grey, width*height*sizeof(char), cudaMemcpyDeviceToHost);
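The slides do not show the kernel body; a minimal sketch of what such an RGB-to-grey kernel could look like (the packed-RGB layout and the simple averaging formula are our assumptions):

__global__ void Kernel(const char *rgb, char *grey)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;          // one thread per pixel
    unsigned char r = rgb[3*i], g = rgb[3*i + 1], b = rgb[3*i + 2];
    grey[i] = (char)((r + g + b) / 3);                      // simple average as the grey value
}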
Visualizing the kernel execution
(Figure: each block of 256 threads reads from its part of the RGB image in device
memory and writes to the corresponding part of the grey image.)
• Parallel Execution of Each Block
• Choosing 256 threads per block encourages
coalesced memory access
Improvements
char *host_img_rgb = (char *)malloc(3*height*width*sizeof(char));
char *host_img_grey = (char *)malloc(height*width*sizeof(char)); //allocating in Host
char *dev_img_rgb, *dev_img_grey; //device pointers
cudaMalloc((void **) &(dev_img_rgb), 3*width*height*sizeof(char)); //allocating in Device
//read image into the HOST memory
//copy that rgb image into the Device memory
cudaMemcpy(dev_img_rgb, host_img_rgb, 3*width*height*sizeof(char), cudaMemcpyHostToDevice);
cudaMalloc((void **) &(dev_img_grey), width*height*sizeof(char));
Kernel<<<(height*width)/256, 256>>>(dev_img_rgb, dev_img_grey);
cudaMemcpy(host_img_grey, dev_img_grey, width*height*sizeof(char), cudaMemcpyDeviceToHost);
IMPROVEMENT IN ALLOCATION
cudaMalloc((void **) &(dev_img_rgb), 3*width*height*sizeof(char));
cudaMalloc((void **) &(dev_img_grey), width*height*sizeof(char));
Better way:
char *temp_dev_point;
cudaMalloc((void **)&temp_dev_point, 4*width*height*sizeof(char));
dev_img_rgb = temp_dev_point;
dev_img_grey = temp_dev_point + (3*width*height);
For example, it takes 12 times less time to allocate 6000 bytes at once than to
allocate 4 arrays of 1500 bytes each.
Problems in data transfer and execution
cudaMemcpy(dev_img_rgb, host_img_rgb, 3*width*height*sizeof(char), cudaMemcpyHostToDevice);
Kernel<<<(height*width)/256, 256>>>(dev_img_rgb, dev_img_grey);
• The kernel has to wait for the data transfer,
so the cores are idle.
• Moreover, the Host<->Device transfer is slow.
Page Locked Memory
• CUDA allows the programmer to allocate page-
locked host memory.
• The data transfer rate between page-locked
host memory and the device memory is higher.
• It allows asynchronous data transfer.
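A minimal sketch (ours) of allocating the host image buffers as page-locked memory with cudaMallocHost, so that the cudaMemcpyAsync calls on the next slide can overlap copies with kernel execution:

char *host_img_rgb, *host_img_grey;
cudaMallocHost((void **)&host_img_rgb, 3*width*height*sizeof(char));   // page-locked host allocation
cudaMallocHost((void **)&host_img_grey, width*height*sizeof(char));
// ... fill host_img_rgb, launch the asynchronous transfers and kernels ...
cudaFreeHost(host_img_rgb);                                            // page-locked memory is freed with cudaFreeHost
cudaFreeHost(host_img_grey);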
Concurrency
//Use of Streams
//Creating Streams
cudaStream_t stream[height];
for (int i = 0; i < height; ++i)
    cudaStreamCreate(&stream[i]);
//Specifying sequence of host to device transfers (one image row per stream).
for (int i = 0; i < height; ++i)
    cudaMemcpyAsync(dev_img_rgb + (i * 3 * width), host_img_rgb + (i * 3 * width),
                    3 * width * sizeof(char), cudaMemcpyHostToDevice, stream[i]);
//Specifying sequence of kernel launches
for (int i = 0; i < height; ++i)
    Kernel<<<width/256, 256, 0, stream[i]>>>(dev_img_rgb + i * 3 * width, dev_img_grey + i * width);
//Specifying sequence of device to host transfers.
for (int i = 0; i < height; ++i)
    cudaMemcpyAsync(host_img_grey + (i * width), dev_img_grey + (i * width),
                    width * sizeof(char), cudaMemcpyDeviceToHost, stream[i]);
Comparison of Timelines for non-concurrent
and concurrent execution
(Figure: timelines of Host->Device transfer, kernel execution, and Device->Host transfer
for the non-concurrent and the concurrent case.)
Parallelizing nested loops
E.g., parallelizing w.r.t. the pixels in a patch:
for (int i = 0; i < width/width_of_patch; i++)
    for (int j = 0; j < height/height_of_patch; j++)
        for (int k = 0; k < width_of_patch; k++)
            for (int l = 0; l < height_of_patch; l++)
            { ... }
//We launch a 2-D grid
dim3 grid(width/patch_width,
height/patch_height);
//and 2-D blocks within the grid
dim3 block(patch_width,patch_height);
//launch kernel
Kernel_name<<<grid,block>>>
(arg list…..);
How to find the index of the patch inside the grid?
blockIdx.y * gridDim.x + blockIdx.x
How to find the index of the pixel inside the block?
threadIdx.y * blockDim.x + threadIdx.x
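Putting the two formulas together, a hypothetical kernel (ours, not from the slides) that rearranges the image so each patch's pixels become contiguous in the output:

__global__ void gather_patches(const char *img, char *out, int width)
{
    int patch_idx = blockIdx.y * gridDim.x + blockIdx.x;        // index of the patch within the grid
    int pixel_idx = threadIdx.y * blockDim.x + threadIdx.x;     // index of the pixel within the patch
    int x = blockIdx.x * blockDim.x + threadIdx.x;              // source pixel coordinates, assuming one
    int y = blockIdx.y * blockDim.y + threadIdx.y;              // block per patch and one thread per pixel
    out[patch_idx * blockDim.x * blockDim.y + pixel_idx] = img[y * width + x];
}

// launched as: gather_patches<<<grid, block>>>(dev_img, dev_patches, width);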
How to choose the best configuration arguments?
• CUDA provides an Occupancy Calculator as
an Excel spreadsheet.
• Occupancy is the ratio of the number of
active warps per multiprocessor to the
maximum number of possible active warps.