
SUBJECT CODE : CS8076

Strictly as per Revised Syllabus of


Anna University
Choice Based Credit System (CBCS)
Semester - VIII (CSE) Professional Elective - V

GPU Architecture and Programming

Santosh D. Nikam
MS (SW Systems), MCS
(Technical Project Manager)

Anamitra Deshmukh-Nimbalkar
CTO & Chief Software Trainer, Mentor
(PGDBM, PGDPC, NET, SET, MCS)

TECHNICAL PUBLICATIONS® - Since 1993 : An Up-Thrust for Knowledge

GPU Architecture and Programming
Subject Code : CS8076

Semester - VIII (Computer Science and Engineering) Professional Elective - V

© Copyright with Authors


All publishing rights (printed and ebook version) are reserved with Technical Publications. No part of this book
may be reproduced in any form, electronic, mechanical, photocopy or any information storage and
retrieval system, without prior permission in writing from Technical Publications, Pune.

Published by :
TECHNICAL PUBLICATIONS® - Since 1993 : An Up-Thrust for Knowledge
Amit Residency, Office No. 1, 412, Shaniwar Peth, Pune - 411030, M.S. INDIA
Ph. : +91-020-24495496/97, Email : [email protected]
Website : www.technicalpublications.org

Printer :
Yogiraj Printers & Binders
Sr.No. 10\1A,
Ghule Industrial Estate, Nanded Village Road,
Tal-Haveli, Dist-Pune - 411041.

ISBN 978-93-90450-70-1



Preface
The importance of GPU Architecture and Programming is well known in various
engineering fields. Overwhelming response to our books on various subjects inspired us to
write this book. The book is structured to cover the key aspects of the subject GPU
Architecture and Programming.

The book uses plain, lucid language to explain the fundamentals of this subject. It
provides a logical method of explaining various complicated concepts and stepwise methods
for the important topics. Each chapter is well supported with the necessary illustrations,
practical examples and solved problems. All the chapters in the book are arranged in a
proper sequence that permits each topic to build upon earlier studies. All care has been
taken to make students comfortable in understanding the basic concepts of the subject.

The book not only covers the entire scope of the subject but also explains its philosophy.
This makes the understanding of the subject clearer and more interesting. The book will be
very useful not only to students but also to subject teachers. Students need to omit
nothing and, ideally, need to cover nothing more.

We wish to express our profound thanks to all those who helped in making this book a
reality. Much-needed moral support and encouragement was provided on numerous
occasions by our whole families. We wish to thank the Publisher and the entire team of
Technical Publications, who have taken immense pains to get this book out in time with quality
printing.

Any suggestion for the improvement of the book will be acknowledged and well
appreciated.
Authors
Santosh D. Nikam
Anamitra Deshmukh-Nimbalkar

Dedicated to,
“ My Family, Well-Wishers and God”

- Santosh D. Nikam

“My Discerning Amigo, my Father-In-Law,


Shri. Bhagwan B. Nimbalkar,

Who, in my life, reappraised the meaning of ‘Family’, ‘Understanding’,
‘Compromise’ and ‘Being Selfless’, and brought me up once again as a
‘Daughter’ rather than a ‘Daughter-in-Law’.

- Anamitra Deshmukh-Nimbalkar

Syllabus
GPU Architecture and Programming - CS8076

UNIT I GPU ARCHITECTURE

Evolution of GPU architectures - Understanding Parallelism with GPU - Typical GPU
Architecture - CUDA Hardware Overview - Threads, Blocks, Grids, Warps, Scheduling -
Memory Handling with CUDA : Shared Memory, Global Memory, Constant Memory and
Texture Memory. (Chapter 1)

UNIT II CUDA PROGRAMMING

Using CUDA - Multi GPU - Multi GPU Solutions - Optimizing CUDA Applications : Problem
Decomposition, Memory Considerations, Transfers, Thread Usage, Resource Contentions.
(Chapter 2)

UNIT III PROGRAMMING ISSUES

Common Problems : CUDA Error Handling, Parallel Programming Issues, Synchronization,
Algorithmic Issues, Finding and Avoiding Errors. (Chapter 3)

UNIT IV OPENCL BASICS

OpenCL Standard - Kernels - Host Device Interaction - Execution Environment - Memory
Model - Basic OpenCL Examples. (Chapter 4)

UNIT V ALGORITHMS ON GPU

Parallel Patterns : Convolution, Prefix Sum, Sparse Matrix - Matrix Multiplication -
Programming Heterogeneous Cluster. (Chapter 5)

TABLE OF CONTENTS
UNIT - I
Chapter 1 : GPU Architecture 1 - 1 to 1 - 28

1.1 Evolution of GPU Architectures ................................................................................... 1 - 2


1.1.1 Key Differences between CPU and GPU ...................................................... 1 - 2
1.1.2 Early GPU Architectures and Brief History .................................................... 1 - 2
1.2 Understanding Parallelism with GPU .......................................................................... 1 - 6
1.3 Typical GPU Architecture ............................................................................................ 1 - 8
1.4 CUDA Hardware Overview ........................................................................................ 1 - 10
1.4.1 Threads........................................................................................................ 1 - 13
1.4.2 Blocks .......................................................................................................... 1 - 14
1.4.3 Grids ............................................................................................................ 1 - 15
1.4.4 Warps .......................................................................................................... 1 - 16
1.4.5 Scheduling ................................................................................................... 1 - 17
1.5 Memory Handling with CUDA .................................................................................... 1 - 19
1.5.1 Shared Memory ........................................................................................... 1 - 21
1.5.2 Global Memory ............................................................................................ 1 - 22
1.5.3 Constant Memory ........................................................................................ 1 - 23
1.5.4 Texture Memory........................................................................................... 1 - 24
1.6 Part A : Short Answered Questions [2 Marks Each] ............................................ 1 - 25
1.7 Part B : Long Answered Questions ....................................................................... 1 - 27

UNIT - II
Chapter 2 : CUDA Programming 2 - 1 to 2 - 30

2.1 Using CUDA ................................................................................................................ 2 - 2


2.2 Multi GPU .................................................................................................................. 2 - 12
2.3 Multi GPU Solutions .................................................................................................. 2 - 13
2.4 Optimizing CUDA Applications .................................................................................. 2 - 14
2.4.1 Problem Decomposition .............................................................................. 2 - 17
2.4.2 Memory Considerations ............................................................................... 2 - 18
2.4.3 Transfers...................................................................................................... 2 - 20
2.4.4 Thread Usage .............................................................................................. 2 - 22
2.4.5 Resource Contentions ................................................................................. 2 - 24
2.5 Part A : Short Answered Questions [2 Marks Each] ............................................ 2 - 26
2.6 Part B : Long Answered Questions ....................................................................... 2 - 29

UNIT - III
Chapter 3 : Programming Issues 3 - 1 to 3 - 30

3.1 Common Problems ...................................................................................................... 3 - 2


3.2 CUDA Error Handling .................................................................................................. 3 - 6
3.3 Parallel Programming Issues....................................................................................... 3 - 9
3.4 Synchronization Issues .............................................................................................. 3 - 12
3.5 Algorithmic Issues ..................................................................................................... 3 - 17
3.6 Finding and Avoiding Errors ...................................................................................... 3 - 20
3.7 Part A : Short Answered Questions [2 Marks Each] ............................................ 3 - 27
3.8 Part B : Long Answered Questions ....................................................................... 3 - 30

UNIT - IV
Chapter 4 : OpenCL Basics 4 - 1 to 4 - 86

4.1 OpenCL Standards ...................................................................................................... 4 - 2


4.1.1 OpenCL Introduction ..................................................................................... 4 - 2
4.1.2 OpenCL Standards History............................................................................ 4 - 3
4.1.3 OpenCL Objective ......................................................................................... 4 - 4
4.1.4 OpenCL Components .................................................................................... 4 - 5
4.1.5 OpenCL - Hardware and Software Vendors .................................................. 4 - 5
4.1.5.1 Advanced Micro Devices, Inc. (AMD) ................................................... 4 - 5
4.1.5.2 NVIDIA® ............................................................................................... 4 - 6
4.1.5.3 Intel® .................................................................................................... 4 - 7
4.1.5.4 ARM Mali™ GPUs ................................................................................ 4 - 8
4.2 Kernels and Host Device Interaction ........................................................................... 4 - 9
4.2.1 Kernel ............................................................................................................ 4 - 9

4.2.2 Host to Device Interaction ............................................................................. 4 - 9
4.2.3 C++ for OpenCL Kernel Language .............................................................. 4 - 10
4.2.4 Kernel Language Extensions....................................................................... 4 - 10
4.3 The OpenCL Architecture - OpenCL Programming Models Specification ................ 4 - 11
4.3.1 Platform Model............................................................................................. 4 - 11
4.3.2 Execution Model .......................................................................................... 4 - 12
4.3.3 Kernel Programming Model ......................................................................... 4 - 14
4.3.4 Memory Model ............................................................................................. 4 - 17
4.3.4.1 Basic Concept of Memory Model ....................................................... 4 - 17
4.3.4.2 OpenCL Device Memory Model ......................................................... 4 - 18
4.4 Exploring OpenCL Memory Model - The Memory Objects ....................................... 4 - 21
4.4.1 Various Memory Objects ............................................................................. 4 - 21
4.4.1.1 Memory Object - Buffer ...................................................................... 4 - 21
4.4.1.2 Memory Object - Image ..................................................................... 4 - 23
4.4.1.3 Pipe .................................................................................................... 4 - 58
4.4.2 Various Operations on Memory Objects ..................................................... 4 - 61
4.4.2.1 Retaining and Releasing Memory Objects ........................................ 4 - 61
4.4.2.2 Unmapping Mapped Memory Objects ............................................... 4 - 64
4.4.2.3 Accessing Mapped Regions of a Memory Object .............................. 4 - 65
4.4.2.4 Migrating Memory Objects ................................................................. 4 - 66
4.4.2.5 Memory Object Queries ..................................................................... 4 - 69
4.5 OpenCL Program and OpenCL Programming Examples ......................................... 4 - 72
4.5.1 OpenCL Program......................................................................................... 4 - 72
4.5.2 The Process of Creating a Kernel from Source Code. ................................ 4 - 73
4.5.3 OpenCL Program Flow ................................................................................ 4 - 75
4.5.4 Starting Kernel Execution on a Device ........................................................ 4 - 76
4.5.5 Main Steps to Execute a Simple OpenCL Application ................................ 4 - 77
4.5.6 Example Program - Serial Vector Addition .................................................. 4 - 80
4.6 Part A : Short Answered Questions [2 Marks Each] ............................................ 4 - 84
4.7 Part B : Long Answered Questions ....................................................................... 4 - 85

UNIT - V
Chapter 5 : Algorithms on GPU 5 - 1 to 5 - 54

5.1 Concept of Parallelism ................................................................................................. 5 - 2


5.1.1 Parallel Computations / Concurrency ............................................................ 5 - 2
5.1.2 Types of Parallelism ...................................................................................... 5 - 3
5.1.2.1 Task-based Parallelism ........................................................................ 5 - 3
5.1.2.2 Data-based Parallelism ........................................................................ 5 - 5
5.1.3 Common Parallel Patterns ............................................................................. 5 - 7
5.1.3.1 Loop-based Patterns ............................................................................ 5 - 7
5.1.3.2 Fork / Join Pattern ................................................................................ 5 - 8
5.1.4 Data Parallelism Versus Task Parallelism ................................................... 5 - 10
5.2 GPU Stream Types ................................................................................................... 5 - 11
5.2.1 Vertex Streams ............................................................................................ 5 - 11
5.2.2 Fragment Streams ....................................................................................... 5 - 11
5.2.3 Frame-Buffer Streams ................................................................................. 5 - 12
5.2.4 Texture Streams .......................................................................................... 5 - 12
5.2.5 GPU Kernel Memory Access ....................................................................... 5 - 12
5.3 GPU Parallel Algorithms ........................................................................................... 5 - 13
5.3.1 Convolutions ................................................................................................ 5 - 13
5.3.1.1 Convolutions Fundamentals ............................................................... 5 - 13
5.3.1.2 Mathematical Foundation for Convolution .......................................... 5 - 13
5.3.1.3 Convolution Operation Example......................................................... 5 - 14
5.3.1.4 Applications of Convolution ................................................................ 5 - 15
5.3.1.5 Working of Convolution Algorithm ...................................................... 5 - 15
5.3.1.6 Serial Convolution Code Written in C / C++ and
OpenCL Kernel Code ......................................................................... 5 - 15
5.3.1.7 Source Code for the Image Convolution Host Program ..................... 5 - 20
5.3.1.8 Real World Interpretations of Convolution.......................................... 5 - 25
5.3.2 Prefix Sum ................................................................................................... 5 - 25
5.3.2.1 Prefix Sum Fundamentals .................................................................. 5 - 25
5.3.2.2 Sequential Implementation of Scan on CPU ...................................... 5 - 26

5.3.2.3 Parallel Scan Algorithm - A Solution by Hillis and Steele ................... 5 - 27
5.3.2.4 Doubled Buffer Version Algorithm ...................................................... 5 - 28
5.3.2.5 Hillis and Steele Algorithm - Kernel Function ..................................... 5 - 28
5.3.2.6 Improving Algorithm Efficiency ........................................................... 5 - 29
5.3.2.7 Work - Efficient Prefix Sum ................................................................ 5 - 30
5.3.2.8 Avoiding Bank Conflicts...................................................................... 5 - 30
5.3.2.9 Applications Prefix Sum Algorithm ..................................................... 5 - 30
5.3.3 Sparse Matrix - Matrix Multiplication............................................................ 5 - 31
5.3.3.1 Sparse Matrix Fundamentals ............................................................. 5 - 31
5.3.3.2 Sparse Matrix and Compressed Sparse Row (CSR)
Storage Format .................................................................................. 5 - 32
5.3.3.3 cuSPARSE Library ............................................................................. 5 - 33
5.3.3.4 Load Balancing Problem .................................................................... 5 - 33
5.3.3.5 Parallelizations of SpMM .................................................................... 5 - 34
5.4 Heterogeneous Cluster .............................................................................................. 5 - 36
5.4.1 Concept of Heterogeneous Clusters ........................................................... 5 - 36
5.4.2 MPI (Message Passing Interface) ............................................................... 5 - 37
5.4.2.1 MPI Fundamentals ........................................................................ 5 - 37
5.4.2.2 MPI Working .................................................................................. 5 - 38
5.4.2.3 MPI Programming Example ........................................................... 5 - 39
5.4.2.4 MPI Point to Point Communication Types ..................................... 5 - 43
5.5 Part A : Short Answered Questions [2 Marks Each] ............................................ 5 - 51
5.6 Part B : Long Answered Questions ....................................................................... 5 - 53
 Solved Model Question Paper ............................................................. M - 1 to M - 4

UNIT - I

1 GPU Architecture

Syllabus

Evolution of GPU architectures - Understanding Parallelism with GPU - Typical GPU
Architecture - CUDA Hardware Overview - Threads, Blocks, Grids, Warps, Scheduling -
Memory Handling with CUDA : Shared Memory, Global Memory, Constant Memory and
Texture Memory.

Contents

1.1 Evolution of GPU Architectures

1.2 Understanding Parallelism with GPU

1.3 Typical GPU Architecture

1.4 CUDA Hardware Overview

1.5 Memory Handling with CUDA

1.6 Part A : Short Answered Questions (2 Marks Each)

1.7 Part B : Long Answered Questions


 1.1 Evolution of GPU Architectures


 CPU (Central Processing Unit) and GPU (Graphics Processing Unit) are processor units.
The CPU is the brain of the computer and consists of
(1) the ALU (Arithmetic Logic Unit), which performs arithmetic and logical calculations, and
(2) the CU (Control Unit), which directs the execution of instructions and handles branching.
 The GPU was designed to offload and accelerate 2D or 3D rendering from the CPU. GPUs can
be found in desktops, laptops, mobile phones and supercomputers. GPUs have a parallel
structure with which they implement a number of 2D and 3D graphics primitives in
hardware, making them much faster than CPUs at these operations. The GPU integrates
2D/3D graphics, images and video, which enables window-based OSes, GUIs, video games
and visual imaging applications.

 1.1.1 Key Differences between CPU and GPU


Focus : CPU focuses on low latency. GPU focuses on high throughput.
Cache : CPU needs a large cache to minimize latency. GPU does not need a large cache, as latency is not its primary concern.
Control logic : CPU has complex control logic. GPU has simple control logic.
Memory : CPU consumes more memory than GPU. GPU memory requirements are low.
Speed : CPU speed is less than that of GPU. GPU is faster than CPU.
Cores : CPU cores are powerful, but fewer in number (10s). GPU cores are comparatively less powerful, but present in large numbers (1000s).
Instruction Processing : CPU suits Serial Instruction Processing. GPU suits Parallel Instruction Processing, e.g. graphics, image processing, video processing, scientific computing.

 1.1.2 Early GPU Architectures and Brief History


 Early GPU architectures were specific single core, fixed function hardware pipeline
implementations built solely for graphics. Over time, this has evolved into a set of highly
parallel and programmable cores for more general-purpose computation. The trend in GPU
technology has been to keep adding more programmability and parallelism to the GPU core
architecture, enabling CPU-like general-purpose computation. The section below provides a
summary of the evolution of GPU architectures.

 1. Early days (1970s)


a. In the 1970s, the term GPU referred to a programmable processing unit working
independently of the CPU and responsible for graphics manipulation and output.
b. Early GPUs were based on the concept of a graphics pipeline. The graphics pipeline is a
set of stages, namely vertex generation and processing, primitive generation and
processing, fragment generation and processing, and pixel operations. Graphics data is
sent through these stages, and the pipeline is usually implemented via a combination of
hardware (the GPU cores) and CPU software (like OpenGL, DirectX).
c. Early GPUs implemented only the rendering stage in hardware. This required the CPU to
generate triangles for the GPU to operate on.
d. As GPU technology progressed, more and more stages of the pipeline were
implemented in hardware on the GPU, thereby reducing the workload on the CPU.
 2. 1980s
a. Until the early 1980s, GPUs were just integrated frame buffers. They relied on the CPU
for computation and could only draw wire-frame shapes to raster displays.
b. The NEC 7220 was one of the first implementations of a PC graphics display processor as a
single Large Scale Integration (LSI) integrated circuit chip, enabling the design of low-
cost, high-performance video graphics cards.
c. In 1984, Hitachi released the ARTC HD63484. This was the first major CMOS graphics
processor for the PC. It was capable of displaying up to 4K resolution in monochrome
mode, and it was used in a number of PC graphics cards during that period.
d. By 1987, more features were being added to early GPUs, such as shaded solids, vertex
lighting, rasterization of filled polygons, a pixel depth buffer, and color blending. There
was still considerable reliance on sharing computation with the CPU.
e. In 1987, IBM introduced the Video Graphics Array (VGA) display standard with a maximum
resolution of 640 x 480 pixels. In late 1988, NEC announced its successor, named
Super VGA (SVGA), which enabled graphics display resolutions of up to 800 x 600 pixels.
f. In the late 1980s, Silicon Graphics Inc. (SGI) emerged as a high-performance computer
graphics hardware and software company.
g. In 1992, SGI released the platform-independent 2D/3D application programming
interface (API) named OpenGL.
 3. 1990s
a. In the early 1990s, graphics on desktop computers were handled by a Video Graphics Array
(VGA) controller. A VGA controller is a memory controller attached to DRAM and a
display generator. The main function of a VGA controller is to receive image data, arrange it
properly, and send it to a video device, such as a computer monitor, for display.


b. By 1995, all major PC graphics chip makers had added 2D acceleration support to their
chips.
c. Throughout the 1990s, 2D GUI acceleration continued to evolve. Additional Application
Programming Interfaces (APIs) arrived for a variety of tasks, such as Microsoft's WinG
graphics library and later DirectDraw interface for hardware acceleration of 2D games
within Windows 95 and later.
d. The 3dfx Voodoo (1996) was considered one of the first true 3D game cards. It only
offered 3D acceleration. The CPU still did vertex transformations, while the Voodoo
provided texture mapping, z-buffering, and rasterization.
e. By 1997, VGA controllers had started incorporating some 3D acceleration functions.
f. DirectX became popular among Windows game developers during the late 90s.
g. By the end of the 1990s, various graphics acceleration components were being added to
the VGA controller for rasterizing triangles, texture mapping, and simple shading.
h. In 1999, the first cards to implement the entire pipeline in GPU hardware were made
available to consumers. NVIDIA’s “GeForce 256” and ATI’s Radeon 7500 were among
the first true GPUs.
 4. 2000s
a. This period saw the introduction of the programmable pipeline on the GPU. Each pixel
could now be processed by a short program (called a shader) that could take
additional image textures as inputs. Similarly, each geometric vertex could be processed
by a short program before it was projected onto the screen. Examples were
NVIDIA’s GeForce3, ATI’s Radeon 8500, and Microsoft’s Xbox.
b. In late 2002, fully programmable graphics cards such as NVIDIA’s GeForce FX and
ATI’s Radeon 9700 hit the market. In these cards, separate dedicated hardware was allocated
for pixel and vertex shader processing.
c. In 2003, the first wave of GPU computing started by taking advantage of GPU hardware
programmability for non-graphics computations.
d. In 2004, on the hardware side, manufacturers started introducing support for a) higher-
precision 64-bit doubles, b) multiple rendering buffers, c) increased GPU memory, and d)
texture access. On the software side, early high-level GPU languages such as Brook and Sh
started to appear, offering a) true conditionals, b) loops, and c) dynamic flow control in
shader programs.
e. In 2006, the introduction of NVIDIA’s GeForce 8 series marked the next step in the
evolution of GPUs by exposing the GPU as a massively parallel processor.


f. In 2007, NVIDIA introduced the CUDA platform as a programming model for GPU
computing, to harness the general-purpose power of the GPU.
g. In 2009, the OpenCL (Open Computing Language) standard was defined, allowing
development of code for both GPUs and CPUs with an emphasis on portability.
h. The trend towards a more CPU-like, programmable GPU for general-purpose computing
continued in 2009-2010 with the introduction of NVIDIA’s Fermi architecture.
 5. 2010
a. In recent years, processor instructions and memory hardware have been added to support
general-purpose programming languages. The hardware has evolved to include double-
precision floating-point operations and massively parallel programmable processors.
b. In 2010, NVIDIA collaborated with Audi, using Tegra GPUs to power the cars’
dashboards and enhance their navigation and entertainment systems. These advancements in
in-vehicle graphics cards pushed self-driving technology forward.
c. AMD released the Radeon HD 6000 series cards in 2010, and in 2011 it released the
6000M series of discrete GPUs for use in mobile devices.
d. NVIDIA’s Kepler line of graphics cards came out in 2012 and was used in the
NVIDIA 600 and 700 series cards. Features of this GPU architecture included GPU Boost,
a technology that adjusts the clock speed of a video card, increasing or decreasing it according
to its power draw.
e. The GPU came into existence for graphics purposes. Over time, however, it has
evolved into a device for general-purpose computing, with ever-improving accuracy and performance.
f. The PS4 (PlayStation 4) and Xbox One were released in 2013. They both used GPUs
based on AMD's Radeon HD 7850 and 7790.
g. Pascal is the generation of NVIDIA graphics cards released in 2016, made using a
16 nm manufacturing process. The GeForce 10 series of cards belongs to this generation.
h. In 2018, NVIDIA launched the RTX 20 series GPUs, which added ray-tracing cores to
GPUs, improving their performance on lighting effects.
i. AMD released the Polaris 11 and Polaris 10 GPUs featuring a 14 nm process, which resulted
in a robust increase in the performance per watt of AMD video cards.
j. Many companies have produced GPUs under a number of brand names. In 2009, Intel,
Nvidia and AMD/ATI were the market leaders. Lately, Nvidia and AMD control nearly
100 % of the market share. In addition, S3 Graphics and Matrox produce GPUs. Modern
smartphones mostly use Adreno GPUs from Qualcomm, PowerVR GPUs from
Imagination Technologies and Mali GPUs from ARM.


k. Modern GPU processors are massively parallel, and are fully programmable. This has
opened up a new world of possibilities for high-speed computation. Today, GPUs are
not only for graphics, but they have found their way into fields like machine learning,
oil exploration, scientific image processing, computer vision applications, biomedical
applications, statistics, linear algebra, 3D reconstruction, medical research, stock options
pricing determination, etc.
 1.2 Understanding Parallelism with GPU

Fig. 1.2.1 : Parallel problem overview

1) CPUs use Task Parallelism wherein


a. Multiple tasks map to multiple threads and tasks run different instructions
b. Generally threads are heavyweight
c. Programming is done for the individual thread.
Whereas GPUs use data parallelism wherein
a. Same instruction is executed on different data
b. Generally threads are lightweight
c. Programming is done for batches of threads (e.g. one pixel shader per group of pixels).
2) In Data Parallelism, performance improvement is achieved by applying the same small set
of tasks iteratively over multiple streams of data. It is nothing but a way of performing
parallel execution of an application on multiple processors.
3) In Data Parallelism, the goal is to scale the throughput of processing based on the ability to
decompose the data set into concurrent processing streams, all performing the same set of
operations.


4) The CPU application manages the GPU and uses it to offload specific computations. GPU code
is encapsulated in parallel routines called kernels. The CPU executes the main program, which
prepares the input data for GPU processing, invokes the kernel on the GPU, and then
obtains the results after the kernel terminates. A GPU kernel maintains its own application
state. A GPU kernel is an ordinary sequential function, but it is executed in parallel by
thousands of GPU threads.
5) Data Parallelism is achieved in SIMD (Single Instruction Multiple Data) mode. In SIMD
mode, an instruction is decoded only once and multiple ALUs perform the work on
multiple data elements in parallel. Either a single controller controls the parallel data
operations, or multiple threads work in the same way on individual compute nodes
(SPMD). SIMD parallelism enhances the performance of computationally intensive
applications that execute the same operation on distinct elements of a dataset.

Fig. 1.2.2 : Data parallelism


6) Modern applications process large amounts of data, which incurs significant execution time
on sequential computers.
a. Rendering pixels is one such application, wherein sRGB (standard Red Green Blue)
pixels are converted to grayscale. To process a 1920 × 1080 image, the application has
to process 2073600 pixels. Processing all these pixels on a CPU will take a very long
time, since the execution will be done sequentially; that is, the time taken will be
proportional to the number of pixels in the image. Further, sequential execution is very
inefficient, since the operation performed on each pixel is the same; only the data differs.
Since processing one pixel is independent of the processing of any other pixel, all the
pixels can be processed in parallel. If we use 2073600 threads and each thread processes
one pixel, the task can be reduced to constant time (a kernel sketch for this appears at the
end of this list). Millions of such threads can be launched on modern GPUs.
b. Another example is “customer address standardization process” which iteratively grabs
an address and attempts to transform it into a standard form. This task is adaptable to


data parallelism and can be sped up by a factor of “n” by instantiating “n” address
standardization processes and streaming “1/n” of the address records through each
instantiation.
c. Data parallelism is used to advantage in applications like image processing, computer
graphics, algebra libraries like matrix multiplication, etc.
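As an illustration of the pixel-conversion example in point 6 (a) above, the following is a minimal CUDA kernel sketch; the kernel name, the luminance weights and the launch configuration shown in the comment are illustrative assumptions, not taken from this text. Each thread converts exactly one pixel.

__global__ void rgbToGray(const unsigned char *rgb, unsigned char *gray, int numPixels)
{
    // One thread handles one pixel; the index is computed from built-in variables.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels) {   // guard: the last block may contain surplus threads
        unsigned char r = rgb[3 * i + 0];
        unsigned char g = rgb[3 * i + 1];
        unsigned char b = rgb[3 * i + 2];
        // 0.299 / 0.587 / 0.114 are the common luma weights (an assumption).
        gray[i] = (unsigned char)(0.299f * r + 0.587f * g + 0.114f * b);
    }
}

// Possible launch for a 1920 x 1080 image (2073600 pixels), 256 threads per block :
//   int n = 1920 * 1080;
//   rgbToGray<<<(n + 255) / 256, 256>>>(d_rgb, d_gray, n);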

 1.3 Typical GPU Architecture


 In this section, we will understand the high-level architecture of a GPU. GPU architecture is
mainly driven by the following key factors :
1. Amount of data processed at one time (Parallel processing).
2. Processing speed on each data element (Clock frequency).
3. Amount of data transferred at one time (Memory bandwidth).
4. Time for each data element to be transferred (Memory latency).
 To begin with, let us first look at the main design distinctions between CPU and GPU. CPU
design consists of multicore processors having large cores and large caches, using control
units for optimal serial performance.
 GPU design, in contrast, consists of a large number of threads with small caches and minimized
control units, optimized for execution throughput. The GPU provides much higher instruction
throughput and memory bandwidth than the CPU within a similar price and power
envelope.

Fig. 1.3.1 : CPU design
Fig. 1.3.2 : GPU design


 GPU architecture focuses more on putting the available cores to work and is less focused on
low-latency cache memory access. In a generic many-core GPU, less space is devoted to
control logic and caches, and large numbers of transistors are devoted to supporting parallel
data processing. The following diagram shows the GPU architecture.

Fig. 1.3.3 : GPU architecture


1) The GPU consists of multiple Processor Clusters (PC).
2) Each Processor Cluster (PC) contains multiple Streaming Multiprocessors (SM).
3) Each Streaming Multiprocessor (SM) has a number of Streaming Processors (SPs) (also
known as cores) that share control logic and an L1 (level 1) instruction cache.

4) Each Streaming Multiprocessor (SM) uses a dedicated L1 (level 1) cache and a shared L2
(level 2) cache before pulling data from global memory, i.e. Graphics Double Data Rate
(GDDR) DRAM.
5) The number of Streaming Multiprocessors (SMs) and cores per Streaming
Multiprocessor (SM) varies as per the targeted price and market of the GPU.
6) The global memory of the GPU consists of multiple GBs of DRAM. The growing size of global
memory allows data to be kept longer in global memory, thereby reducing transfers to the
CPU.
7) GPU architecture is tolerant of memory latency. Higher bandwidth makes up for
memory latency.
8) In comparison to CPU, GPU works with fewer and small memory cache layers. This is
because GPU has more transistors dedicated to computations and it worries less about
retrieving data from memory.
9) The memory bus is optimized for bandwidth, allowing it to serve a large number of ALUs
simultaneously.
10) GPU architecture is more optimized for data parallel throughput computations.
11) In order to execute tasks in parallel, tasks are scheduled at Processor Cluster (PC) or
Streaming Multiprocessor (SM) level.
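The SM count, core count and global memory size mentioned in the list above differ from device to device. A hedged sketch of how they can be queried at run time through the CUDA runtime API is shown below (device 0 is assumed).

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // properties of device 0 (assumed)

    printf("Device                     : %s\n", prop.name);
    printf("Streaming Multiprocessors  : %d\n", prop.multiProcessorCount);
    printf("Global memory              : %.1f GB\n",
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("Shared memory per block    : %zu bytes\n", prop.sharedMemPerBlock);
    printf("Warp size                  : %d threads\n", prop.warpSize);
    return 0;
}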

 1.4 CUDA Hardware Overview


 CUDA (Compute Unified Device Architecture) is a scalable parallel computing platform and
programming model for general computing on GPUs and multicore CPUs.
 CUDA was introduced by NVIDIA in 2006.
 CUDA is a data-parallel extension to the C/C++ languages and an API model for parallel
programming.
 The CUDA parallel programming model has three key abstractions :
(1) a hierarchy of thread groups,
(2) shared memories, and
(3) barrier synchronization.
 The programmer or compiler decomposes large computing problems into many small
problems that can be solved in parallel.
 Programs written using CUDA harness the power of GPU and thereby increase the
computing performance.
 In GPU accelerated applications, the sequential part of the workload runs on the CPU (as it
is optimized for single threaded performance) and the compute intensive portion of the
application runs on thousands of GPU cores in parallel.


 Using CUDA, developers can utilize the power of GPUs to perform general computing
tasks like multiplying matrices and performing linear algebra operations (instead of just
doing graphical calculations).
 In CUDA, developers program in popular languages such as C, C++, Fortran, Python,
DirectCompute and MATLAB, and express parallelism through extensions in the form of a
few basic keywords.
 At a high level, a graphics card with a many-core GPU and high-speed graphics device
memory sits inside a standard PC / server with one or two multicore CPUs. The following
Figs. 1.4.1 and 1.4.2 show the hardware view of the motherboard and the graphics card.
In the diagrams, DDR4 is Double Data Rate 4 memory, GDDR5 is an abbreviation for
Graphics Double Data Rate 5 memory, and HBM stands for High Bandwidth Memory.

Fig. 1.4.1 : Hardware view of motherboard and graphics card


 The GPU consists of multiple Streaming Multiprocessors (SMs), and each Streaming
Multiprocessor (SM) has a number of Streaming Processors (SPs), also known as cores.
Each Streaming Multiprocessor (SM) uses a dedicated L1 cache and a shared L2 cache. The
following diagram shows a high-level overview of the GPU hardware.

Fig. 1.4.2 : GPU hardware overview


 CUDA programming model


 With a single processor executing a program, it is easy to visualize what is going on and
where it is happening. With a CUDA device, there are lots of things going on at once in a
lot of different places. The CUDA programming model provides an abstraction of GPU
architecture. It acts as a bridge between an application and its possible implementation on
GPU hardware.
 CUDA organizes a parallel computation using the abstractions of threads, blocks, grids and
warps. Threads, thread blocks and grids are essentially a programmer’s perspective. The
hardware groups threads that execute the same instruction into warps. Several warps
constitute a thread block. Several thread blocks are assigned to a Streaming Multiprocessor
(SM). Several Streaming Multiprocessors (SMs) constitute the whole GPU unit. And, the GPU
unit executes the whole kernel grid.
 In the CUDA programming model, two widely used keywords are host and device. The host is
the CPU available in the system; the system memory associated with the CPU is called
host memory. The GPU is called the device, and GPU memory is called device memory.
The following diagram shows the interaction between the CPU and GPU during the execution of a CUDA
program.

Fig. 1.4.3 : CPU and GPU interaction

 To execute any CUDA program, there are three main steps :


1) Copy the input data from host memory to device memory. This is also known as host-to-
device transfer.
2) Load the GPU program and execute, caching data on chip for performance.
3) Copy the results from device memory to host memory, also called device-to-host
transfer.
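A hedged host-side sketch of these three steps follows; the array size, the toy kernel and the block size are illustrative assumptions, not taken from this text.

#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)            // toy kernel: double each element
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);           // host (CPU) memory
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data;                                    // device (GPU) memory
    cudaMalloc(&d_data, bytes);

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // Step 1 : host-to-device
    scale<<<(n + 255) / 256, 256>>>(d_data, n);                  // Step 2 : run the kernel
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // Step 3 : device-to-host

    cudaFree(d_data);
    free(h_data);
    return 0;
}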

 CUDA program execution model


 Following Fig. 1.4.4 shows the kernel execution and mapping on hardware resources
available in GPU.

Fig. 1.4.4 : Kernel execution and mapping on hardware resources

1) Threads are executed by thread processors / cores.

2) Thread blocks are executed on a Streaming Multiprocessor (SM), and several
concurrent thread blocks can reside on one multiprocessor. The number of concurrent thread
blocks is limited by multiprocessor resources such as shared memory and registers.
3) A kernel is launched as a grid of thread blocks.

 1.4.1 Threads
1) In CUDA, the thread is an abstract entity that represents the execution of the kernel.
2) A kernel is a small program or function that is compiled for and executed on the GPU. Refer
to the following Fig. 1.4.5.


Fig. 1.4.5 : CUDA kernel


3) Multi-threaded applications use many such threads that are running at the same time, to
organize parallel computation.
4) Every thread has an index, which is used for calculating memory address locations and also
for taking control decisions.
5) Each thread uses its index to access elements of an array, such that the collection of all threads
together processes the entire data set.
6) A Streaming Multiprocessor (SM) is organized as a group of Streaming Processors (SPs).
A thread gets executed by, and is handled by, one Streaming Processor (SP).
7) A Streaming Processor (SP) can handle one or more threads of the same block.
8) CUDA gives each thread a unique ThreadID to distinguish threads from each other, even though
the kernel instructions are the same.

Fig. 1.4.6 : Thread is executed by Streaming Processor (SP)/core
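To make the ThreadID idea concrete, here is a minimal sketch (the vector-addition kernel is an assumed example); threadIdx, blockIdx and blockDim are CUDA built-in variables from which each thread computes its unique index.

// Each thread computes one output element, selected by its unique global index.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // unique ThreadID within the grid
    if (tid < n)                                       // surplus threads simply do nothing
        c[tid] = a[tid] + b[tid];
}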

 1.4.2 Blocks
1) A “Thread Block” is a programming abstraction which represents a group of “Threads”.
2) Threads in a block may execute concurrently or serially, and in no particular order.
3) Threads in the same thread block can communicate with each other.
4) CUDA provides functions for coordinating threads, with which a thread can be made to
stop at a certain point in the kernel until all the other threads in its block reach the same
point (a barrier; see the sketch after this list).


5) The GPU unit / chip is organized as a group of Streaming Multiprocessors (SMs). The hardware
schedules thread blocks onto Streaming Multiprocessors (SMs). That is, a thread block gets
executed by, and is handled by, one Streaming Multiprocessor (SM).
6) In general, a Streaming Multiprocessor (SM) can handle one or more blocks of the same grid in
parallel.

7) A thread block generally does not get divided across multiple Streaming Multiprocessors
(SMs).
8) Whenever a Streaming Multiprocessor (SM) executes a thread block, all the threads inside
the thread block are handled on that SM at the same time. Hence, to free the memory of a thread block
inside the Streaming Multiprocessor (SM), it is critical that the entire set of threads has
concluded execution.

Fig. 1.4.7 : Thread block is executed by Streaming Multiprocessor (SM)
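The barrier mentioned in point 4 is exposed in CUDA as __syncthreads(). The sketch below is an assumed example (it expects the array length to be a multiple of the block size); it reverses each block-sized chunk of an array, which is only safe because every thread waits at the barrier until the whole block has finished writing.

#define BLOCK_SIZE 256                       // assumed block size, must match the launch

__global__ void reverseWithinBlock(int *data)
{
    __shared__ int tile[BLOCK_SIZE];         // visible to all threads of this block

    int t    = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];                // every thread writes its own element
    __syncthreads();                         // no thread passes until all writes are done

    data[base + t] = tile[blockDim.x - 1 - t];   // now safe to read another thread's element
}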

 1.4.3 Grids
1) A “Grid” is a programming abstraction which represents a group of “Blocks”. The following
Fig. 1.4.8 explains the grid and block hierarchy in CUDA programming.
2) Multiple thread blocks are combined to form a grid.
3) All the thread blocks in the same grid contain the same number of threads.
4) The thread blocks in a grid must be able to execute independently, because
communication or cooperation between blocks in a grid is not possible.
5) A kernel is executed as a grid of blocks of threads.
6) Each kernel is executed on one device, and CUDA supports running multiple kernels on
a device at one time.
7) A grid gets executed by, and is handled by, the whole GPU unit / chip.


Fig. 1.4.8 : Grid and block hierarchy

Fig. 1.4.9 : Kernel grid is executed by complete GPU Unit
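In code, the grid and block hierarchy shown in Fig. 1.4.8 is expressed with the dim3 type and the <<<grid, block>>> launch syntax. The 2-D image sketch below is illustrative; the kernel, the 16 x 16 block shape and the pixel operation are assumptions.

__global__ void processImage(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column handled by this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row handled by this thread
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];   // e.g. invert the pixel
}

void launchOnImage(unsigned char *d_img, int width, int height)
{
    dim3 block(16, 16);                              // 256 threads per block (assumed)
    dim3 grid((width  + block.x - 1) / block.x,      // enough blocks to cover the image
              (height + block.y - 1) / block.y);
    processImage<<<grid, block>>>(d_img, width, height);
}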

 1.4.4 Warps
1) From the hardware perspective, a thread block is composed of warps. That is, each thread
block is divided into scheduling units known as warps.
2) A warp is a set of 32 threads within a thread block, such that all the threads in a warp
execute the same instruction.
3) Threads thus run in both a parallel and a sequential manner.
4) The warp size is the number of threads running concurrently on a Streaming
Multiprocessor (SM).

5) Following Fig. 1.4.10 illustrates implementation of double warp scheduler.

Fig. 1.4.10 : Double warp scheduler implementation


6) Within a warp, all the threads have sequential indices up to the total number of threads in a
block.
7) Once a thread block is launched on a Streaming Multiprocessor (SM), all of its warps are
resident until their execution finishes. Thus, a new thread block is not launched on a
Streaming Multiprocessor (SM) until a) there is a sufficient number of free registers for all
warps of the new thread block, and b) there is enough free shared memory for the new
thread block.
8) Within a warp, the homogeneity of the threads has an impact on computational throughput. That
is, if all the threads are executing the same instruction, then all the Streaming Processors
(SPs) in the Streaming Multiprocessor (SM) can execute that instruction in parallel.
However, if one or more threads in a warp are executing a different instruction from the
others, then the warp has to be partitioned into groups of threads based on the instructions
being executed, after which the groups are executed one after the other. This serialization
reduces throughput as the threads become more and more divergent and the warp is split into
smaller and smaller groups. Therefore, in order to maximize throughput, it is important to
keep the threads as homogeneous as possible (see the sketch below).
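The sketch below (illustrative, not from the text) contrasts a branch that diverges inside every warp with one whose condition is uniform across each group of 32 threads, so that each warp follows a single path.

// Divergent : within each warp, odd and even lanes take different paths,
// so the two groups are executed one after the other.
__global__ void divergentBranch(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}

// Warp-uniform : the condition is the same for all 32 threads of a warp,
// so every warp executes only one of the two paths.
__global__ void warpUniformBranch(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)
        out[i] = 1.0f;
    else
        out[i] = 2.0f;
}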

 1.4.5 Scheduling
 A warp is a unit of thread scheduling in Streaming Multiprocessors (SMs). That is, thread
block is divided into warps for scheduling purposes.


 A Streaming Multiprocessor (SM) is composed of many SPs (Streaming Processors). These


are the actual CUDA cores. Normally, the number of CUDA cores in a Streaming
Multiprocessor (SM) is less than the total number of threads that are assigned to it. Thus,
there is a need for scheduling.
 Each Streaming Multiprocessor (SM) has access to a hierarchy of memories. In addition to
the device memory, Streaming Multiprocessor (SM) can access registers, shared memory,
constant cache and texture cache. All memories except the device memory are located
within the Streaming Multiprocessor (SM) and thus have very short access latencies.
Memory requests from a warp are handled together. When memory requests from the
threads of a warp are sequential, the memory requests can be combined into fewer
transactions. These kinds of accesses are called coalesced memory accesses. However, if
the memory addresses are scattered, each memory request generates a separate transaction,
called un-coalesced memory accesses.
 While a warp is waiting for the results of a previously executed long-latency operation (like
a data fetch from RAM), a different warp that is not waiting and is ready to be assigned is
selected for execution. The warp scheduler of the Streaming Multiprocessor (SM) decides
which of the warps gets prioritized during issuance of instructions. If any of the threads in
the executing warp stalls (e.g. on an un-cached memory read), the scheduler makes it inactive. If there
are no eligible warps left, then the GPU idles. The Streaming Multiprocessor (SM) uses a warp
scheduling policy. Below is a list of different scheduling policies for warps :
1) Round Robin (RR) – In this policy, warp instructions are fetched in Round Robin (RR)
manner. This policy assigns equal priority to all warps. When sufficient numbers of
blocks are assigned to Streaming Multiprocessor (SM), RR ensures that Streaming
Multiprocessors (SMs) are kept busy without wasting cycles waiting for memory
accesses to complete.
2) Least Recently Fetched (LRF) - In this policy, the warp for which an instruction has not been
fetched for the longest time gets priority in the fetching of an instruction. The LRF policy
tries to ensure that starvation of warps does not occur.
3) Greedy Then Old (GTO) - In this policy, the same warp is selected for as long as possible, and
then the oldest ready warp is selected. That is, the GTO policy always gives a higher priority to
warps that were launched earlier.
4) Fair (FAIR) - In this policy, the scheduler makes sure all warps are given a ‘fair’
opportunity in the number of instructions fetched for them. It fetches an instruction for the
warp for which the minimum number of instructions has been fetched so far. This policy ensures
that all warps progress in a uniform manner. This increases the probability of merging of
memory requests if warps are accessing overlapping regions in memory or regions that
are close to each other.


5) Thread block based - In this policy, emphasis is on improving the execution time of the
thread blocks. It allocates more resources to the warp that shall take the longest time to
execute. By giving priority to the most critical warp, this policy allows thread blocks to
finish faster, such that the resources become available quicker.
 Below is the list of different DRAM scheduling policies :
1) First Come First Serve (FCFS) - In this policy, memory requests are served in the
order they arrive i.e. in a First Come First Serve manner.
2) First Ready First Come First Serve (FRFCFS) - FRFCFS gives higher priority to row
buffer hits. In case of multiple potential row buffer hits (or no row buffer hits), requests
are served in a First Come First Serve manner.
3) First Ready Fair (FRFAIR) - FRFAIR is similar to FRFCFS, except that in case of
multiple potential row buffer hits (or no row buffer hits), request from the warp which
has retired the fewest instructions is served. FRFAIR tries to ensure uniform progress of
warps while continuing to take advantage of row buffer hits. This policy is more suitable
for symmetric applications than asymmetric applications.
4) First-Ready Remaining Instructions (REMINST) - Like FRFCFS and FRFAIR,
REMINST also gives higher priority to row buffer hits. But, in the case of multiple
potential row buffer hits (or no row buffer hits), request from the warp which has the
most instructions remaining to its completion is served. REMINST gives priority to
warps which have longer execution remaining; this ensures that all warps finish at the
same time.
 Most of the fetch and DRAM scheduling policies explained above try to either ensure that
all warps progress uniformly or all warps terminate at the same time. This ensures that the
occupancy of the SM remains high and also ensures that when one warp blocks, other ready
warps from which instructions can be scheduled are available. Also, if threads are accessing
overlapping or neighboring memory regions, this increases the chances of both intra-core
and inter-core merging of memory requests.

 1.5 Memory Handling with CUDA


 In this section, we will understand the different types of memory that are exposed by
the CUDA-capable GPU architecture. Before going into CUDA-specific details, let us
understand the difference between the different types of memory.
1. DRAM stands for Dynamic RAM. DRAM is the most common RAM found in systems
today. It is the slowest and the least expensive one. DRAM is named so because the
information stored in it is lost over time, and it has to be refreshed several times a
second to preserve the data.


2. SRAM stands for Static RAM. SRAM does not need to be refreshed like DRAM, and it
is significantly faster than DRAM. However, SRAM is more costly than DRAM. SRAM
is also called the microprocessor cache RAM.
3. VRAM stands for Video RAM. VRAM can be written to and read from simultaneously.
This property is essential for better video performance. Using it, the video card can read
data from VRAM and send it to the screen without having to wait for the CPU to finish
writing it into global memory. This property is of little use in the rest of the
computer, and therefore VRAM is mostly used in video cards. Also, VRAM is more
expensive than DRAM.
 In a CUDA-capable GPU, the memory hierarchy is Global Memory > Texture Memory >
Constant Memory > Shared Memory. Upper levels of the memory hierarchy provide
larger size but slower access, whereas lower levels provide smaller size and faster
access. The following Fig. 1.5.1 shows which memory resides where.

Fig. 1.5.1 : Memory hierarchy


 How the threads access global memory also affects the throughput. Things go much faster
if the GPU can combine several global addresses into a single burst access over the wide
data bus that goes to the external SDRAM. Conversely, reading / writing separated memory
addresses requires multiple accesses to the SDRAM which slows things down. To help the
GPU combine multiple accesses, the addresses generated by the threads in a warp must be
sequential with respect to the thread indices. Following Fig. 1.5.2 shows how kernel
accesses different memories.

Fig. 1.5.2 : Kernel memory access
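The coalescing point above can be illustrated with a hedged sketch (both kernels are assumed examples): in the first, consecutive threads of a warp touch consecutive addresses, which the hardware can combine; in the second, a large stride scatters the addresses and the requests cannot be merged.

// Coalesced : thread i reads element i, so a warp's 32 loads fall in consecutive
// words and can be served by a few wide memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Un-coalesced : neighbouring threads read addresses that are 'stride' elements
// apart, so each load tends to need its own transaction. (stride is an assumed parameter.)
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * (size_t)stride) % n];
}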

 1.5.1 Shared Memory


1) Shared Memory is a small memory within each Streaming Multiprocessor (SM). It is also
called scratchpad memory.


2) Shared Memory is accessible only by threads within the thread block. That is, it can be
read / written by any thread in a block that is assigned to that Streaming Multiprocessor
(SM).

Fig. 1.5.3 : Per block shared memory

3) Shared Memory can be used for inter-thread communication.


4) Each thread has its own private local memory. This exists only for the lifetime of the thread
and is generally handled automatically by the compiler.

Fig. 1.5.4 : Per thread local memory

5) Since Shared Memory resides on-chip, shared memory has shorter latency and higher
bandwidth than global memory.
6) Shared Memory exists only for the lifetime of the thread block.
7) Shared Memory requires special handling to get maximum performance.
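A minimal sketch of the typical handling involved: the kernel below (an assumed example) stages data in per-block shared memory to compute one partial sum per block; the block size is a compile-time assumption.

#define BLOCK 256   // assumed block size (power of two); the launch must use the same value

__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float buf[BLOCK];                 // on-chip, lives only as long as the block

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    buf[t] = (i < n) ? in[i] : 0.0f;             // load from global into shared memory
    __syncthreads();                             // all loads finished before any reads

    for (int s = blockDim.x / 2; s > 0; s >>= 1) // tree reduction inside the block
    {
        if (t < s) buf[t] += buf[t + s];
        __syncthreads();
    }
    if (t == 0) partial[blockIdx.x] = buf[0];    // one result per thread block
}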

 1.5.2 Global Memory


1) Global Memory is built of SDRAM chips connected to the GPU chip.
2) Global Memory is accessible to all threads and to the host CPU.
3) Any thread from any Streaming Multiprocessor (SM) can read or write to any location in
the Global Memory.
4) Global Memory is allocated and de-allocated by the host CPU.
5) Global Memory is used to initialize the data that the GPU will work on.
6) The host CPU can read / write Global Memory, but it cannot access Shared Memory.
7) Global Memory is also called Device Memory.


Fig. 1.5.5 : Per device global memory

 1.5.3 Constant Memory


1) Constant Memory is a read-only memory within each Streaming Multiprocessor (SM).
2) Constant Memory is accessible by all threads.
3) Constant Memory is used to cache values that are shared by all functional units.
4) Constant Memory is cached on chip.
5) Constant Memory is used to store data that does not change over the course of a kernel
execution.
6) Constant Memory space is cached. As a result of this, a read from constant memory costs
one memory read from device memory only on a cache miss; otherwise, it just costs one
read from the constant cache.
7) The working of Constant Memory is divided into the following steps :
a. A constant memory request for a warp is first split into two requests, one for each half-
warp, that are issued independently.
b. A request is then split into as many separate requests as there are different memory
addresses in the initial request, decreasing throughput by a factor equal to the number of
separate requests.

c. The resulting requests are then serviced at the throughput of the constant cache in case
of a cache hit, or at the throughput of global memory otherwise.
8) Programmers use constant memory in the following cases :
a. When input data does not change during the execution.
b. When all threads access data from the same part of memory.
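As a hedged illustration of these points, the sketch below places a small set of coefficients in constant memory. The symbol name d_coeff, the kernel and the values are assumptions made for the example.

__constant__ float d_coeff[2];       // read-only, cached, visible to all threads

__global__ void scale_and_shift(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = d_coeff[0] * in[i] + d_coeff[1];   // every thread reads the same two constants
}

// Host side : copy values into the constant memory symbol before launching the kernel
float h_coeff[2] = { 2.0f, 1.0f };
cudaMemcpyToSymbol(d_coeff, h_coeff, sizeof(h_coeff));

Because all threads of a warp read the same addresses, each read after the first one is served from the constant cache rather than from device memory.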

 1.5.4 Texture Memory


1) Texture Memory is a memory within each Streaming Multiprocessor (SM) that can be filled
with data from the global memory.
2) Texture Memory acts like a cache and threads running in the Streaming Multiprocessor
(SM) are restricted to read-only access of this memory.
3) Texture Memory is optimized for texturing operations provided by the hardware.
4) Texture Memory is cached on chip.
5) Texture caches are designed for graphics applications where memory access patterns
exhibit a great deal of spatial locality. Spatial locality means that a thread is likely to read
from an address “near” the addresses that nearby threads read. Refer Fig. 1.5.6.

Fig. 1.5.6 : Spatial locality

6) Texture memory space is cached. As a result of this, a read from texture memory costs one
memory read from device memory only on a cache miss; otherwise, it just costs one read
from the texture cache.
7) Programmers should use texture memory in the following cases :
a. If the data is updated rarely but read often.
b. If there tends to be some kind of spatial locality to the read access pattern i.e. nearby
threads access nearby locations in the texture.
c. If the precise read access pattern is difficult to predict.
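The following is a hedged sketch of reading an existing device array through the texture cache using the CUDA runtime texture object API. It is only a fragment : the names d_in, d_out and N are assumptions, and the exact API details can differ across CUDA versions.

__global__ void copy_through_texture(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, i);     // this read goes through the texture cache
}

// Host side : describe the existing device array d_in (N floats) as a linear texture resource
cudaResourceDesc resDesc;
memset(&resDesc, 0, sizeof(resDesc));
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = d_in;
resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
resDesc.res.linear.sizeInBytes = N * sizeof(float);

cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.readMode = cudaReadModeElementType;

cudaTextureObject_t tex = 0;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

copy_through_texture<<<(N + 255) / 256, 256>>>(tex, d_out, N);

cudaDestroyTextureObject(tex);      // release the texture object when it is no longer needed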


 1.6 Part A : Short Answered Questions (2 Marks Each)


Q.1 List down significant milestones in evolution of GPU architecture.
 Answer : Early GPUs implemented only the rendering stage in hardware. This required
the CPU to generate triangles for the GPU to operate on. In 1987, IBM introduced the Video
Graphics Array (VGA) display standard with a maximum resolution of 640 × 480 pixels. The 3dfx
Voodoo (1996) was considered one of the first true 3D game cards. Early 2000 period saw
introduction of the programmable pipeline on the GPU. In 2003, the first wave of GPU
computing started by taking advantage of GPU hardware programmability for non-graphics
computations. Year 2007 saw NVIDIA introducing CUDA platform as programming model
for GPU computing to harness general purpose power of the GPU. Modern GPU processors
are massively parallel, and are fully programmable.
Q.2 What is task parallelism and data parallelism ?
 Answer : Task parallelism means multiple tasks map to multiple threads and tasks run
different instructions. Generally threads are heavyweight and programming is done for the
individual thread. Task parallelism is used in CPUs.
Data parallelism means same instruction is executed on different data. Here threads are
generally lightweight and programming is done for batches of threads. Data parallelism is
used in GPUs.
Q.3 What are different components of GPU architecture ?
 Answer : The GPU consists of multiple Processor Clusters (PC). Each Processor Cluster
(PC) contains multiple Streaming Multiprocessors (SM). Each Streaming Multiprocessor
(SM) has a number of Streaming Processors (SPs) (also known as cores) that share control logic
and L1 (layer 1) instruction cache. One Streaming Multiprocessor (SM) uses dedicated L1
(layer 1) cache and shared L2 (layer 2) cache before pulling data from global memory i.e.
Graphic Double Data Rate (GDDR) DRAM.
Q.4 What are threads and thread blocks ?
 Answer : The thread is an abstract entity that represents the execution of the kernel. A
“Thread Block” is a programming abstraction which represents a group of “Threads”. Threads
in the same thread block can communicate with each other.
Q.5 What are grids and warps ?
 Answer : A “Grid” is a programming abstraction which represents a group of “Blocks”.
Multiple thread blocks are combined to form a grid. A warp is a set of 32 threads within a
thread block such that all the threads in a warp execute the same instruction.


Q.6 List out different scheduling policies.

 Answer : There are 4 scheduling policies : 1) Round Robin (RR) – Warp instructions are
fetched in Round Robin (RR) manner. 2) Least Recently Fetched (LRF) - Warp for which
instruction has not been fetched for the longest time gets priority in the fetching of an
instruction. 3) Fair (FAIR) - The scheduler fetches instruction to a warp for which minimum
number of instructions have been fetched. 4) Thread block based – Scheduler allocates more
resources to the warp that shall take the longest time to execute.
Q.7 What is shared memory ?

 Answer : Shared Memory is a small memory within each Streaming Multiprocessor (SM).
It is also called scratchpad memory. Shared Memory can be used for inter-thread
communication. Shared Memory exists only for the lifetime of the thread block and requires
special handling to get maximum performance.
Q.8 What is global memory ?

 Answer : Global Memory is accessible to all threads and to the host CPU. It is allocated
and de-allocated by the host CPU. Global Memory is used to initialize the data that the GPU
will work on. Global Memory is also called Device Memory.
Q.9 What is constant memory and when to use it ?

 Answer : Constant Memory is a read-only memory within each Streaming Multiprocessor
(SM). Constant Memory is accessible by all threads. Programmers use constant memory when
a) input data does not change during the execution and b) all threads access data from the same
part of memory.
Q.10 What is texture memory and when to use it ?

 Answer : Texture Memory is a memory within each Streaming Multiprocessor (SM) that
can be filled with data from the global memory. Texture Memory acts like a cache and threads
running in the Streaming Multiprocessor (SM) are restricted to read-only access of this
memory. Texture Memory should be used in cases where there tends to be some kind of
spatial locality to the read access pattern i.e. nearby threads access nearby locations in the
texture.


 1.7 Part B : Long Answered Questions


Q.1 What are differences between GPU and CPU ? (Refer section 1.1.1)
Q.2 Describe the evolution of GPU architectures. (Refer section 1.1.2)
Q.3 What is concept of parallelism in GPU ? (Refer section 1.2)
Q.4 Explain GPU architecture and its components. (Refer section 1.3)
Q.5 Describe CUDA hardware components. (Refer section 1.4)
Q.6 Describe thread hierarchy by explaining threads, blocks, grids, and warps.
(Refer sections 1.4.1, 1.4.2, 1.4.3 and 1.4.4)
Q.7 Explain different types of memories provided by CUDA. (Refer section 1.5)






UNIT - II

2 CUDA Programming

Syllabus

Using CUDA - Multi GPU - Multi GPU Solutions - Optimizing CUDA Applications :
Problem Decomposition, Memory Considerations, Transfers, Thread Usage, Resource
Contentions.

Contents

2.1 Using CUDA

2.2 Multi GPU

2.3 Multi GPU Solutions

2.4 Optimizing CUDA Applications

2.5 Part A : Short Answered Questions [2 Marks Each]

2.6 Part B : Long Answered Questions


 2.1 Using CUDA


 CUDA is a parallel computing platform and programming model developed by NVIDIA
for general computing on its own GPUs (Graphics Processing Units). CUDA enables
developers to speed up compute intensive applications by harnessing the power of GPUs
for the parallelizable part of the computation.
 CUDA is a platform and programming model for CUDA enabled GPUs. The platform
exposes GPUs for general purpose computing. CUDA provides C / C++ language extension
and APIs for programming and managing GPUs.
 Using the CUDA Toolkit developers can accelerate their C or C++ applications by updating
the computationally intensive portions of code to run on GPUs. To accelerate applications,
developers can call functions from drop-in libraries as well as develop custom applications
using languages including C, C++, Fortran and Python.
 CUDA was developed with several design goals in mind :
 CUDA provides a small set of extensions to standard programming languages, like C, that
enable a straightforward implementation of parallel algorithms.
1. With CUDA C / C++, programmers can focus on the task of parallelizing the
algorithms rather than spending time on their implementation.
2. CUDA supports heterogeneous computation where applications use both the CPU and
GPU.
3. In CUDA, serial portions of applications are run on the CPU, and parallel portions are
offloaded to the GPU.
4. CUDA can also be incrementally applied to existing applications.
5. In CUDA, the CPU and GPU are treated as separate devices that have their own
memory spaces. This configuration allows simultaneous computation on the CPU and
GPU without contention for memory resources.
6. CUDA capable GPUs have hundreds of cores that can collectively run thousands of
computing threads. These cores have shared resources including a register file and a
shared memory. The on-chip shared memory allows parallel tasks running on these
cores to share data without sending it over the system memory bus.
 While there have been other proposed APIs for GPUs, such as OpenCL, and there are
competitive GPUs from other companies, such as AMD, the combination of CUDA and
NVIDIA GPUs dominates several application areas, including deep learning, and is a
foundation for some of the fastest computers in the world.


 Since its introduction in 2006, CUDA has been widely deployed through thousands of
applications and published research papers, and supported by an installed base of hundreds
of millions of CUDA-enabled GPUs in notebooks, workstations, compute clusters and
supercomputers. Applications used in astronomy, biology, chemistry, physics, data mining,
manufacturing, finance, and other computationally intense fields are increasingly using
CUDA to deliver the benefits of GPU acceleration.
 CUDA Installation
 To use CUDA on your system, you will need the following installed :
1) A CUDA capable GPU
2) A supported version of Microsoft Windows
3) A supported version of Microsoft Visual Studio
4) The NVIDIA CUDA Toolkit
 The CUDA Toolkit is a collection of libraries, debugging and optimization tools, a compiler,
documentation, sample code, and a runtime library to deploy your applications. It has
components that support Deep Learning, Parallel Algorithms, Math Libraries, Image and
Video Libraries, Communication Libraries, and Partner Libraries. In general, CUDA
libraries support all families of NVIDIA GPUs, but perform best on the latest generation.
Using CUDA, developers can deliver dramatically higher performance applications
compared to CPU only alternatives across multiple application domains, from Artificial
Intelligence (AI) to High Performance Computing (HPC).
 CUDA libraries run everywhere from resource-constrained IoT devices, to self-driving
cars, to the largest supercomputers in the world. Because of this, you get highly optimized
implementations of an ever expanding set of algorithms. Whether you are building a new
application or accelerating an existing application, CUDA libraries provide the easiest way
to get started with GPU acceleration. Following is a list of different libraries made available
by CUDA.
1) Deep Learning Libraries - This is a collection of GPU-accelerated libraries for Deep
Learning applications that leverage CUDA and specialized hardware components of
GPUs.
a. NVIDIA cuDNN - GPU - accelerated library of primitives for deep neural networks
b. NVIDIA TensorRT - High performance deep learning inference optimizer and
runtime for production deployment
c. NVIDIA Jarvis - Platform for developing engaging and contextual AI-powered
conversation apps


d. NVIDIA DeepStream SDK - Real-time streaming analytics toolkit for AI-based
video understanding and multi-sensor processing
e. NVIDIA DALI - Portable, open-source library for decoding and augmenting images
and videos to accelerate deep learning applications
2) Parallel Algorithm Libraries - This is a collection of GPU-accelerated libraries of
highly efficient parallel algorithms for several operations in C++ and for use with graphs
when studying relationships in natural sciences, logistics, travel planning, and more.
a. Thrust - GPU-accelerated library of C++ parallel algorithms and data structures
3) Math Libraries - This is a collection of GPU-accelerated math libraries which lay the
foundation for compute intensive applications in areas such as molecular dynamics,
computational fluid dynamics, computational chemistry, medical imaging, and seismic
exploration.
a. cuBLAS - GPU - accelerated basic linear algebra (BLAS) library
b. cuFFT - GPU - accelerated library for Fast Fourier Transforms
c. CUDA Math Library - GPU-accelerated standard mathematical function library
d. cuRAND - GPU-accelerated random number generation (RNG)
e. cuSOLVER - GPU-accelerated dense and sparse direct solvers
f. cuSPARSE - GPU-accelerated BLAS for sparse matrices
g. cuTENSOR - GPU-accelerated tensor linear algebra library
h. AmgX - GPU-accelerated linear solvers for simulations and implicit unstructured
methods
4) Image and Video Libraries - This is a collection of GPU-accelerated libraries for image
and video decoding, encoding, and processing that leverage CUDA and specialized
hardware components of GPUs.
a. nvJPEG - High performance GPU-accelerated library for JPEG decoding
b. NVIDIA Performance Primitives - Provides GPU-accelerated image, video, and
signal processing functions
c. NVIDIA Video Codec SDK - A complete set of APIs, samples, and documentation
for hardware-accelerated video encode and decode on Windows and Linux
d. NVIDIA Optical Flow SDK - Exposes the latest hardware capability of NVIDIA
Turing™ GPUs dedicated to computing the relative motion of pixels between images


5) Communication Libraries - This is a collection of performance-optimized multi-GPU
and multi-node communication primitives.
a. NVSHMEM - OpenSHMEM standard for GPU memory, with extensions for
improved performance on GPUs.
b. NCCL - Open-source library for fast multi-GPU, multi-node communications that
maximize bandwidth while maintaining low latency.
6) Partner Libraries
a. OpenCV - GPU-accelerated open-source library for computer vision, image
processing, and machine learning
b. FFmpeg - Open-source multimedia framework with a library of plugins for audio and
video processing
c. ArrayFire - GPU-accelerated open source library for matrix, signal, and image
processing
d. MAGMA - GPU-accelerated linear algebra routines for heterogeneous architectures,
by Magma
e. IMSL Fortran Numerical Library - GPU-accelerated open-source Fortran library with
functions for math, signal, and image processing, statistics, by RogueWave
 Let us now understand basic instructions needed to install CUDA.
1) You can verify that you have a CUDA-capable GPU through the Display Adapters
section in the Windows Device Manager. Here you will find the vendor name and model
of your graphics card(s). The same should be checked on the NVIDIA website to confirm
that CUDA supports that graphics card.
2) When installing CUDA on Windows, you can choose between the Network Installer and
the Local Installer.
a. The Network Installer is a minimal installer which later downloads the packages required for
installation. Only the packages selected during the selection phase of the installer are
downloaded. This installer is useful for users who want to minimize download time.
b. The Local Installer is a full installer which contains all the components of the CUDA Toolkit
and does not require any further download. This installer is useful for systems which
lack network access and for enterprise deployment.
3) The CUDA installation packages for Windows are available on the NVIDIA’s website.
4) The CUDA Toolkit installs the CUDA driver and tools needed to create, build and run a
CUDA application as well as libraries, header files, CUDA samples source code, and
other resources.

5) For Windows installation, perform the following steps and verify the installation :
a. Launch the downloaded installer package.
b. Read and accept the EULA.
c. Select "next" to download and install all components.
d. Once the download completes, the installation will begin automatically.
e. Once the installation completes, click "next" to acknowledge the Nsight Visual
Studio Edition installation summary.
f. Click "close" to close the installer.
g. Navigate to the CUDA Samples' nbody directory.
h. Open the nbody Visual Studio solution file for the version of Visual Studio you have
installed.
6) Installation instructions for Linux are available on NVIDIA’s website.
7) You can verify if installation was successful by compiling and running some of the
sample programs.
8) All subpackages can be uninstalled through the Windows Control Panel by using the
Programs and Features widget.
9) If you do not have a GPU, you can access one of the thousands of GPUs available from
cloud service providers including Amazon AWS, Microsoft Azure and IBM SoftLayer.
For example, the NVIDIA-maintained CUDA Amazon Machine Image (AMI) on AWS
comes pre-installed with CUDA.
 Sample CUDA Program
 In this section, we will write a CUDA C program and offload computation to a GPU. We will
use different CUDA runtime APIs. In CUDA programming, both CPUs and GPUs are used
for computing. In general, CPU is referred as host and GPU is referred as device. CPUs and
GPUs have their own memory space. Serial workload is run on CPU and parallel
computations are offloaded to GPUs.
 Below is a sample program written in C and CUDA.
C

#include <stdio.h>

void c_welcome(){
    printf("Welcome!\n");
}

int main() {
    c_welcome();
    return 0;
}

CUDA

#include <stdio.h>

__global__ void cuda_welcome(){
    printf("Welcome from GPU!\n");
}

int main() {
    cuda_welcome<<<1,1>>>();
    cudaDeviceSynchronize();    // wait for the kernel so that its printf output is flushed
    return 0;
}

1) The major difference between the C and CUDA implementations is the __global__ specifier
and the <<<...>>> syntax.
2) The __global__ specifier indicates a function that runs on the device i.e. GPU. Such
functions (known as "kernels") can be called from host code, for example from the main()
function in the example.
3) When a kernel is called, its execution configuration is provided through <<<...>>>
syntax, e.g. cuda_welcome<<<1,1>>>(). In CUDA terminology, this is called "kernel
launch". Kernel execution configuration <<<...>>> tells CUDA runtime how many
threads to launch on the GPU. <<<M,T>>> indicates that a kernel launches with a grid of M
thread blocks. And, each thread block has T parallel threads.
4) Compiling a CUDA program is similar to compiling a C program. NVIDIA provides a CUDA
compiler called nvcc in the CUDA Toolkit to compile CUDA code. For example
$> nvcc welcome.cu -o welcome


 The CUDA welcome example above does not do anything significant. To get things into
action, we will look at another example vector addition. Following is an example of vector
addition implemented in C (./vector_add.c). The example computes the addition of two
vectors stored in arrays a and b and puts the result in array out.
#include <stdlib.h>

#define N 10000000

void vector_add(float *out, float *a, float *b, int n) {
    for(int i = 0; i < n; i++){
        out[i] = a[i] + b[i];
    }
}

int main(){
    float *a, *b, *out;

    // Allocate memory
    a   = (float*)malloc(sizeof(float) * N);
    b   = (float*)malloc(sizeof(float) * N);
    out = (float*)malloc(sizeof(float) * N);

    // Initialize array
    for(int i = 0; i < N; i++){
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    // Main function
    vector_add(out, a, b, N);

    // Release host memory
    free(a); free(b); free(out);
    return 0;
}


We will convert vector_add.c to CUDA program vector_add.cu

1. Copy vector_add.c to vector_add.cu

$> cp vector_add.c vector_add.cu

2. Convert vector_add() to GPU kernel

__global__ void vector_add(float *out, float *a, float *b, int n) {
    for(int i = 0; i < n; i++){
        out[i] = a[i] + b[i];
    }
}

3. Change vector_add() call in main() to kernel call

vector_add<<<1,1>>>(out, a, b, N);

4. Compile and run the program

$> nvcc vector_add.cu -o vector_add

$> ./vector_add

5) You will notice that the program does not work correctly. The reason is that the CPU and
GPU are separate entities. Both have their own memory space. The CPU cannot directly
access GPU memory, and vice versa. In CUDA terminology, CPU memory is called host
memory and GPU memory is called device memory. Pointers to CPU and GPU memory
are called host pointer and device pointer, respectively.
6) For data to be accessible by GPU, it must be presented in the device memory. CUDA
provides APIs for allocating device memory and data transfer between host and device
memory. Following is the common workflow of CUDA programs.
a. Allocate host memory and initialize host data
b. Allocate device memory
c. Transfer input data from host to device memory
d. Execute kernels
e. Transfer output from device memory to host
7) So far, we have done step “a” and “d”. We still need to add step “b”, “c”, and “e” to our
vector addition program.


8) Device memory management - CUDA provides several functions for allocating device
memory. The most common ones are cudaMalloc() and cudaFree(). The syntax for both
functions is as follows :
cudaMalloc(void **devPtr, size_t count);

cudaFree(void *devPtr);

cudaMalloc() allocates memory of size count in the device memory and updates the
device pointer devPtr to the allocated memory. cudaFree() de-allocates the region of
device memory that the device pointer devPtr points to. They are comparable to
malloc() and free() in C, respectively.

9) Memory transfer - Transferring data between host and device memory can be done
through cudaMemcpy function, which is similar to memcpy in C. The syntax of
cudaMemcpy is as follows :
cudaMemcpy(void *dst, void *src, size_t count, cudaMemcpyKind kind)

The function copies memory of size count from src to dst. The kind argument
indicates the direction. For typical usage, the value of kind is either
cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost.

10) The updated program now looks as follows :


#include <stdlib.h>

#define N 10000000

__global__ void vector_add(float *out, float *a, float *b, int n) {
    for(int i = 0; i < n; i++){
        out[i] = a[i] + b[i];
    }
}

int main(){
    float *a, *b, *out;            // host pointers
    float *d_a, *d_b, *d_out;      // device pointers

    // Allocate host memory
    a   = (float*)malloc(sizeof(float) * N);
    b   = (float*)malloc(sizeof(float) * N);
    out = (float*)malloc(sizeof(float) * N);

    // Initialize array
    for(int i = 0; i < N; i++){
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    // Allocate device memory
    cudaMalloc((void**)&d_a, sizeof(float) * N);
    cudaMalloc((void**)&d_b, sizeof(float) * N);
    cudaMalloc((void**)&d_out, sizeof(float) * N);

    // Transfer data from host to device memory
    cudaMemcpy(d_a, a, sizeof(float) * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, sizeof(float) * N, cudaMemcpyHostToDevice);

    // Main function
    vector_add<<<1,1>>>(d_out, d_a, d_b, N);

    // Transfer data back from device to host memory
    cudaMemcpy(out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost);

    // Cleanup after kernel execution
    cudaFree(d_a);   free(a);
    cudaFree(d_b);   free(b);
    cudaFree(d_out); free(out);

    return 0;
}
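The program above still launches the kernel with <<<1,1>>>, so the addition runs on a single GPU thread. As a hedged sketch of how the same kernel is typically parallelized, each thread can be given exactly one element to add by using the built-in index variables; the block size of 256 is an illustrative choice.

__global__ void vector_add(float *out, float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global index of this thread
    if (i < n)
        out[i] = a[i] + b[i];                        // each thread adds exactly one element
}

// Launch enough blocks of 256 threads to cover all N elements
int block_size = 256;
int grid_size  = (N + block_size - 1) / block_size;
vector_add<<<grid_size, block_size>>>(d_out, d_a, d_b, N);

Because consecutive threads access consecutive array elements, the global memory accesses of a warp can be coalesced, which matters for the optimizations discussed later in this unit.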


 2.2 Multi GPU


1. CUDA supports working with multiple GPUs. Using multiple GPUs, you can further
accelerate performance for your applications using GPU parallelization. Additionally, you
can handle cases in which a single application instance does not fit into a single GPU’s
memory.
2. There exist different topologies in case of multiple GPUs. Some of them are explained here.
a. Shared system GPUs - This is single system containing multiple GPUs that
communicates through a shared CPU.
b. Distributed GPUs - This is networked group of distributed systems each containing a
single GPU.
c. GPU Parallel Execution - Here, execution takes place across multiple GPUs in parallel
as opposed to parallel execution on a single GPU.
3. Following multi GPU programming Fig. 2.2.1 explains how CPU threads establish contexts
for interacting with GPUs.

Fig. 2.2.1 : Multi GPU programming

a. A host system can have multiple devices. Several host threads can execute device
code on the same device, but by design, a host thread can execute device code on
only one device at any given time. As a consequence, multiple host threads are
required to execute device code on multiple devices.


b. In order to issue work to a GPU, a context is established between a CPU thread and
the GPU. Only one context can be active on GPU at a time.
c. Even though a GPU can execute calls from one context at a time, it can belong to
multiple contexts. For example, it is possible for several CPU threads to establish
contexts with the same GPU.
4. Following are some of the benefits of multi GPU processing
a. The hardware cost of using multiple GPUs is much smaller than buying an
equivalent CPU based machine.
b. The energy consumption is less with GPUs.
c. GPU based machines are easier to upgrade by adding more cards or buying newer
ones.
5. Developers should note that, before exploring multi GPU support to improve application
performance, they should spend sufficient time on single GPU code optimizations, and move
to a multi GPU implementation only if it is still needed.
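As a minimal, hedged sketch of the starting point for any multi GPU program, the CUDA runtime can be asked how many devices are visible and what their properties are; the printed fields are illustrative choices.

int device_count = 0;
cudaGetDeviceCount(&device_count);          // number of CUDA capable GPUs visible to this process

for (int dev = 0; dev < device_count; dev++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);    // query the capabilities of each device
    printf("Device %d : %s, %d multiprocessors\n",
           dev, prop.name, prop.multiProcessorCount);
}

cudaSetDevice(0);   // select the device that subsequent CUDA calls from this host thread will target

Work is then distributed by selecting a device with cudaSetDevice() before the allocations and kernel launches intended for that GPU.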

 2.3 Multi GPU Solutions


CUDA and NVIDIA GPUs have been adopted in many areas that need high computing
performance. Below are some of the areas / solutions where multi GPUs have been / can be
used.
1. Creation of computer clusters for use in highly calculation intensive tasks
a. High performance computing (HPC) clusters or supercomputers.
b. Grid computing - This is a form of distributed computing where many heterogeneous
computers work together to create a virtual computer architecture.
c. Load balancing clusters or server farm.
2. Computational finance - Computational finance is the study of data and algorithms of
computer programs used in finance and the mathematics that realize financial models or
systems. Examples are - algorithmic trading, quantitative investing, high frequency trading.
3. Climate research, weather forecasting, and ocean modeling.
4. Machine learning - Machine learning is the study of computer algorithms that improve
automatically through experience. Machine learning algorithms build a model based on
sample data, known as "training data" in order to make predictions or decisions without
being explicitly programmed to do so. Machine learning algorithms are used where it is
difficult or unfeasible to develop conventional algorithms to perform the needed tasks.
5. Manufacturing / AEC (Architecture, Engineering, and Construction) - CAD and CAE
(including computational fluid dynamics, computational structural mechanics, design and
visualization, and electronic design automation)


6. Media and entertainment - This includes animation, modeling, and accelerated rendering;
color correction and edge enhancement; compositing, finishing, effects, and editing;
accelerated inter-conversion of video file formats and digital distribution; on-air graphics;
and weather graphics.
7. Medical imaging - Medical imaging is the technique and process in which visual
representations of a) the interior of a body and b) the function of some organs or tissues are
created for clinical analysis and medical intervention. Medical imaging also reveals internal
structures hidden by the skin and bones to diagnose and treat disease. Additionally, medical
imaging is used in establishing a database of normal anatomy and physiology to make it
possible to identify abnormalities. Examples are - virtual reality based on CT and MRI scan
images, CT scan reconstruction, face recognition.
8. Research - Higher education and supercomputing. This includes computational chemistry
and biology, numerical analytics, statistical physics, and scientific visualization.
9. Audio signal processing - This includes audio and sound effects processing, analog signal
processing, and speech processing.
10. Cryptography - Cryptography is the practice and study of techniques for constructing
and analyzing protocols that prevent third parties or the public from reading private
messages. Examples are - accelerated encryption, accelerated decryption, accelerated
compression, mining cryptocurrencies, and password cracking.
11. Bioinformatics - Bioinformatics is an interdisciplinary field which develops methods and
software tools for understanding biological data having large and complex data sets.
Bioinformatics combines biology, computer science, information engineering,
mathematics and statistics to analyze and interpret the biological data. Some of the
examples are - DNA sequencing, microscopy and image analysis, gene and protein
expression, and analysis of mutations in cancer.

 2.4 Optimizing CUDA Applications


 This section introduces the Assess, Parallelize, Optimize, and Deploy (APOD) design cycle
for optimizing CUDA applications. This design cycle helps application developers to :
1) identify the portions of their code that would most readily benefit from GPU
acceleration,
2) rapidly realize that benefit, and
3) begin leveraging the resulting speedups in production as early as possible.
 APOD is a cyclic process. In this process, initial speedups can be achieved, tested, and
deployed with only minimal initial investment of time. After this, cycle can begin again by
identifying further optimization opportunities, seeing additional speedups, and then
deploying the even faster versions of the application into production.

Fig. 2.4.1 : APOD process

 Assess
1. For an existing application, the first step is to assess the application to locate the parts of the
code that are responsible for taking the bulk of the execution time.
2. With the knowledge of performance bottlenecks and understanding of current and future
workload, the developer can evaluate these for parallelization and start to investigate GPU
acceleration.
3. By understanding the end user's requirements and constraints, the developer can determine
the upper bound of performance improvement from acceleration of the identified portions
of the application.
4. From list of hotspots identified, those items should be picked up first which can give
maximum benefits with minimal amount of development efforts.
 Parallelize
1. Once hotspots are identified and end user’s expectations are understood, the developer
needs to parallelize the code.
2. The amount of performance benefit an application will realize depends entirely on the
extent to which it can be parallelized. In general, code that cannot be sufficiently
parallelized should run on the host, unless doing so would result in excessive transfers
between the host and the device.
3. Developer needs to keep in mind that to get the maximum benefit from CUDA, focus
should first on finding ways to parallelize sequential code.


4. Depending on the original code, parallelizing can be achieved by a) calling into existing
GPU-optimized libraries such as cuBLAS, cuFFT, or Thrust, or b) simply adding a few
preprocessor directives as hints to a parallelizing compiler.
5. Some applications' designs might require some amount of refactoring to expose parallelism.
6. CPU architectures as well might require exposing parallelism in order to improve or simply
maintain the performance of sequential applications.
7. CUDA supports parallel programming languages like CUDA C++, CUDA Fortran. These
languages aim at making the expression of this parallelism as simple as possible. Also, they
support operation on CUDA capable GPUs designed for maximum parallel throughput.
 Optimize
1. After each round of application parallelization is complete, the developer moves to
optimizing the implementation to further improve the performance.
2. Since there are many possible optimizations that can be considered, having a good
understanding of the needs of the application can help to make the process as smooth as
possible.
3. With APOD as a whole, program optimization is an iterative process i.e. a) identify an
opportunity for optimization, b) apply and test the optimization, c) verify the speedup
achieved, and d) repeat. This means that it is not necessary for a developer to spend large
amounts of time memorizing the bulk of all possible optimization strategies prior to seeing
good speedups. Instead, strategies can be applied incrementally as they are learned.
4. Optimizations can be applied at various levels, from overlapping data transfers with
computation to all the way down to fine tuning floating point operation sequences.
5. CUDA provides profiling tools that are valuable for guiding this process and can help
suggest a next-best course of action for the developer's optimization efforts.
 Deploy
1. Once GPU acceleration of one or more components of the application is completed,
outcome should be compared with the original expectation.
2. Before tackling other hotspots to improve the total speedup, the developer should consider
taking the partially parallelized implementation and carry it through to production.
3. Taking a partially completed implementation to production has a number of benefits : a) it
allows the end user to see the result of their investment as early as possible, and b) it
minimizes risk for the developer and the user by providing incremental changes to the application.


4. Before taking up lower priority recommendations, it is good practice to make sure all
higher priority recommendations that are relevant have already been implemented and
applied in production.

 2.4.1 Problem Decomposition


1) From supercomputers, servers, and desktops to mobile phones, modern processors rely on
parallelism to provide performance.
2) In many applications, a significant portion of the work is accomplished by a relatively
small amount of code. Using a profiler, the developer can identify such hotspots and start to
compile a list of candidates for parallelization. The objective of profiling is to identify the
function or functions in which the application is spending most of its execution time.
3) One of the most important considerations in the profiling activity is to ensure that the
workload is realistic i.e. the workload matches that of production, including real data. If this is
not taken care of, profiling can lead to suboptimal results and wasted effort by causing
developers to concentrate on the wrong functions.
4) CUDA programming involves running code on two different platforms concurrently : a) a host
system with one or more CPUs and b) one or more CUDA-enabled NVIDIA GPU devices.
GPUs have a distinctly different design from the host system. For developers, during problem
decomposition, it is important to understand those differences and how they impact the
performance of CUDA applications in order to use CUDA effectively.
5) NVIDIA provides a number of solutions that can be used to analyse an application's
performance profile. Following is the list of solutions provided :
a. NVIDIA Nsight Systems - It is a system-wide performance analysis tool designed to
visualize application’s algorithm, help developer select the largest opportunities to
optimize, and tune to scale efficiently across any quantity of CPUs and GPUs.
b. NVIDIA Visual Profiler - This is a cross platform performance profiling tool that
provides developers with vital feedback for optimizing CUDA C / C++ applications.
c. TAU Performance System - This is a profiling and tracing toolkit for performance
analysis of hybrid parallel programs written in CUDA, pyCUDA, and OpenACC.
d. VampirTrace - This is a performance monitor that comes with CUDA, and PyCUDA
support to give detailed insight into the runtime behaviour of accelerators. It enables
extensive performance analysis and optimization of hybrid programs.
e. The PAPI CUDA Component - This is a hardware performance counter measurement
technology for the NVIDIA CUDA platform which provides access to the hardware
counters inside the GPU. It provides detailed performance counter information
regarding the execution of GPU kernels.

f. The NVIDIA CUDA Profiling Tools Interface (CUPTI) - This provides performance
analysis tools with detailed information about GPU usage in a system. CUPTI is used by
performance analysis tools such as the NVIDIA Visual Profiler, TAU and VampirTrace.
g. NVIDIA Topology Aware GPU Selection (NVTAGS) - This is a toolset for HPC
applications that enables faster solve times with high GPU communication-to-
application run-time ratios. NVTAGS intelligently and automatically assigns GPUs to
message passing interface (MPI) processes, thereby reducing overall GPU-to-GPU
communication time.
6) Based on the information from the performance profiles, the developer needs to identify
hotspots i.e. the function or functions in which the application is spending most of its
execution time.
7) By understanding how applications can scale it is possible to set expectations and plan an
incremental parallelization strategy. Strong Scaling sets up an upper bound for the speedup
with a fixed problem size, whereas Weak Scaling attains speedup by growing the problem
size. In many applications, a combination of strong and weak scaling is desirable.

 2.4.2 Memory Considerations


Memory optimizations are the most important area for gaining performance improvement.
In memory optimization, one of the key goals is to maximize the use of the hardware by
maximizing bandwidth. Bandwidth is best served by using as much fast memory and as little
slow access memory as possible. This section explains the various kinds of memory on the
host and device and how best to set up data items to use the memory effectively.
1) Following diagram shows different memories available on a CUDA device.

Fig. 2.4.2 : Memory spaces on a CUDA device



2) Coalesced access to global memory - Of all different memory spaces, global memory is
the most plentiful. But, it has the greatest access latency of all. Global memory is located
off the chip. Because of these characteristics, to get optimal performance, global memory
loads and stores by threads of a warp should be coalesced into as few transactions as
possible.
3) L2 cache -
a. L2 cache is on-chip and because of this it provides higher bandwidth and lower latency
accesses to global memory.
b. When a CUDA kernel accesses a data region in the global memory repeatedly, such data
accesses can be considered to be persisting. On the other hand, if the data is only
accessed once, such data accesses can be considered to be streaming.
c. A portion of the L2 cache can be set aside for persistent accesses to a data region in
global memory. If this set-aside portion is not used by persistent accesses, then
streaming or normal data accesses can use it.
4) Shared memory -
a. Shared memory is on-chip and because of this it has much higher bandwidth and lower
latency than local and global memory, provided there are no bank conflicts between the
threads.
b. To achieve high memory bandwidth for concurrent accesses, shared memory is divided
into equally sized memory modules (banks) that can be accessed simultaneously.
c. Any memory load or store of n addresses that spans n distinct memory banks can be
serviced simultaneously. However, if multiple addresses of a memory request map to the
same memory bank, the accesses are serialized (a common padding trick to avoid such
bank conflicts is sketched at the end of this section).
5) Local memory -
a. Local memory is named so because its scope is local to the thread. Local memory is off-
chip. Because of this, access to local memory is as expensive as access to global
memory.
b. Local memory is used only to hold automatic variables. This is done by the nvcc
compiler when it determines that there is insufficient register space to hold the variable.
c. Automatic variables that are likely to be placed in local memory are i) large structures or
arrays that would consume too much register space and ii) arrays that the compiler
determines may be indexed dynamically.


6) Texture memory -
a. Texture memory space is read-only and is cached.
b. The texture cache is optimized for 2D spatial locality, so threads of the same warp that
read texture addresses that are close together will achieve best performance.
c. Texture memory is designed for streaming fetches with a constant latency. That is, a
cache hit reduces DRAM bandwidth demand, but not the fetch latency.
d. In certain addressing situations, reading device memory through texture fetching can be
an advantageous alternative to reading device memory from global or constant memory.
7) Constant memory -
a. Constant memory is present on device and is cached. As accesses to different addresses
by threads within a warp are serialized, cost scales linearly with the number of unique
addresses read by all threads within a warp.
b. Because of this, the constant memory is best when threads in the same warp access only
a few distinct locations. That is - for example if all threads of a warp access the same
location, then constant memory can be as fast as a register access.
c. Developers need to utilize constant memory appropriately to optimize overall
application performance.
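Referring back to point 4 above, the following is a hedged sketch of the widely used padding trick for shared memory tiles. The 32 × 32 tile, the transpose kernel, and the assumption of a square matrix whose dimension is a multiple of 32 (with a 32 × 32 thread block) are illustrative.

__global__ void transpose_tile(float *out, const float *in, int width)
{
    __shared__ float tile[32][33];     // 33 instead of 32 : the extra column is padding so that
                                       // column-wise reads do not all hit the same memory bank
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read from global memory

    __syncthreads();

    x = blockIdx.y * 32 + threadIdx.x;                     // block indices swapped for the output
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];   // conflict-free read thanks to the padding
}

Without the padding column, the read in the last line would serialize within each warp because every element of a tile column falls into the same shared memory bank.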

 2.4.3 Transfers
1) Minimize data transfer - The bandwidth between the device memory and the GPU is
much higher than the bandwidth between host memory and device memory. Therefore, for
best overall application performance, it is important to minimize data transfer between the
host and the device. This means it can be worthwhile to run kernels on the GPU even if they
do not demonstrate any speedup compared with running them on the host CPU, if doing so
avoids one or more transfers between the host and the device.
2) Pinned (or Pagelocked or Non pageable) Memory - Host (CPU) data allocations are
pageable by default. The GPU cannot access data directly from pageable host memory, so
when a data transfer from pageable host memory to device memory is invoked, the CUDA
driver must first allocate a temporary page-locked (pinned) host array, copy the host data to
the pinned array, and then transfer the data from the pinned array to device memory.
This is explained in following Fig. 2.4.3.
As you can see in the Fig. 2.4.3, pinned memory is used as a staging area for transfers from
the host to device. We can avoid the cost of the transfer between pageable and pinned host
arrays by directly allocating our host arrays in pinned memory. You can allocate pinned
host memory in CUDA using cudaMallocHost() or cudaHostAlloc(), and de-allocate it with
cudaFreeHost(). A short sketch combining pinned memory with an asynchronous transfer is
given at the end of this section.


Fig. 2.4.3 : Pageable vs. Pinned data transfer

3) Intermediate data structures - In order to reduce transfers between host and device,
intermediate data structures should be created in device memory, operated on by the device,
and destroyed without ever being mapped by the host or copied to host memory.
4) Batching small transfers - There is overhead associated with each transfer. Therefore
batching many small transfers into one larger transfer performs significantly better than
making each transfer separately. This is recommended even if doing so requires packing
non-contiguous regions of memory into a contiguous buffer and then unpacking after the
transfer. This is easy to do in CUDA by using a temporary array, preferably pinned, and
packing it with the data to be transferred.
5) Asynchronous and overlapping transfers with computation
a. Data transfers between the host and the device using cudaMemcpy() are blocking
transfers. That means, control is returned to the host thread only after the data transfer is
complete. The cudaMemcpyAsync() function is a non-blocking variant of
cudaMemcpy() in which control is returned immediately to the host thread.
b. In contrast with cudaMemcpy(), the asynchronous transfer version requires pinned host
memory, and it contains an additional argument, a stream ID. A stream is simply a
sequence of operations that are performed in order on the device. Operations in different
streams can be interleaved and in some cases overlapped. This property of device can be
used to hide data transfers between the host and the device.
c. Kernel calls that use the default stream begin only after all preceding calls on the device
(in any stream) have completed, and no operation on the device (in any stream)
commences until they are finished.

d. Asynchronous data transfers enables overlap of data transfers with host computations.
e. On devices that are capable of concurrent copy and compute, it is possible to overlap
kernel execution on the device with data transfers between the host and the device. For
this it is important that kernel execution and data transfer must use different non-default
streams and host memory involved in transfer should be pinned memory. Whether a
device has this capability is indicated by the asyncEngineCount field of the
cudaDeviceProp structure.
f. On devices that are capable of concurrent kernel execution, streams can also be used to
execute multiple kernels simultaneously to more fully take advantage of the device's
multiprocessors. Whether a device has this capability is indicated by the
concurrentKernels field of the cudaDeviceProp structure. Non-default streams are
required for concurrent execution.
6) Zero Copy - Zero copy feature enables GPU threads to directly access host memory. For
this purpose, it requires mapped pinned memory and device that supports mapping of host
memory to the device's address space. Zero copy can be used in place of streams because
kernel originated data transfers automatically overlap kernel execution without the
overhead of setting up and determining the optimal number of streams.
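As promised above, the following is a hedged sketch combining pinned host memory (point 2) with an asynchronous, stream-based transfer (point 5). The buffer size, the kernel name my_kernel and its launch configuration (grid, block) are assumptions made for the example.

float *h_a, *d_a;
cudaMallocHost((void**)&h_a, N * sizeof(float));   // pinned (page-locked) host allocation
cudaMalloc((void**)&d_a, N * sizeof(float));       // device allocation

cudaStream_t stream;
cudaStreamCreate(&stream);

// Non-blocking copy : control returns to the host thread immediately
cudaMemcpyAsync(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice, stream);

// A kernel launched into the same stream starts only after the copy completes,
// while the host remains free to do other work in the meantime
my_kernel<<<grid, block, 0, stream>>>(d_a, N);

cudaStreamSynchronize(stream);                     // wait for the copy and the kernel to finish

cudaStreamDestroy(stream);
cudaFree(d_a);
cudaFreeHost(h_a);                                 // pinned memory is released with cudaFreeHost()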

 2.4.4 Thread Usage


1) Accessing a register by a thread consumes zero extra clock cycles per instruction. However,
there is possibility of delays due to register read-after-write dependencies and register
memory bank conflicts. The compiler and hardware thread scheduler tries scheduling
instructions as optimally as possible to avoid register memory bank conflicts.
2) The dimension and size of blocks per grid and the dimension and size of threads per block
are important factors. The multidimensional aspect of these parameters allows easier
mapping of multidimensional problems to CUDA and does not play a role in performance.
However, size does play important role during performance optimization. Latency hiding
and occupancy depend on the number of active warps per multiprocessor, which is
implicitly determined by the execution parameters along with resource (register and shared
memory) constraints. Therefore choosing execution parameters is a matter of striking a
balance between latency hiding (occupancy) and resource utilization.
3) When choosing the first execution configuration parameter the number of blocks per grid,
or grid size, the primary concern is keeping the entire GPU busy. The number of blocks in a
grid should be larger than the number of multiprocessors so that all multiprocessors have at


least one block to execute. Furthermore, there should be multiple active blocks per
multiprocessor so that blocks that are not waiting for a __syncthreads() can keep the
hardware busy. This recommendation is subject to resource availability; therefore, it should
be determined in the context of the second execution parameter the number of threads per
block, or block size as well as shared memory usage. To scale to future devices, the number
of blocks per kernel launch should be in the thousands.
4) There are different factors involved in selecting block size. Below are some of the thumb
rules that should be followed (a minimal configuration sketch applying them is given after this list) :
a. The number of threads per block should be a multiple of 32 threads. This avoids wasting
computation on under-populated warps (i.e. provides optimal computing efficiency) and
facilitates coalescing.
b. A minimum of 64 threads per block should be used.
c. Between 128 and 256 threads per block is a good initial range for experimentation with
different block sizes.
d. Use several smaller thread blocks rather than one large thread block per multiprocessor
if latency affects performance. This is particularly beneficial to kernels that frequently
call __syncthreads().
5) Thread instructions are executed sequentially in CUDA. As a result, executing other warps
when one warp is paused or stalled is the only way to hide latencies and keep the hardware
busy. Occupancy is the ratio of the number of active warps per multiprocessor to the
maximum number of possible active warps. One of the keys to good performance is to keep
the multiprocessors on the device as busy as possible. A device in which work is poorly
balanced across the multiprocessors will deliver suboptimal performance. Hence, it is
important to design your application to use threads and blocks in a way that maximizes
hardware utilization.
6) During performance optimization and parallelization, obtaining the right answer is the
principal goal of all computations. On parallel systems, it is possible to run into difficulties
not typically found in traditional serial oriented programming. These include threading
issues, unexpected values due to the way floating-point values are computed, and
challenges arising from differences in the way CPU and GPU processors operate.
Developers need to keep these points in mind during debugging and performance
optimization.
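Following the thumb rules above, here is a minimal, hedged sketch of choosing an execution configuration; the kernel name process, the problem size and the block size of 256 are illustrative assumptions.

int n = 1 << 20;                 // one million elements to process
int threads_per_block = 256;     // a multiple of 32, inside the suggested 128 - 256 starting range

// Round up so that every element is covered even when n is not a multiple of the block size
int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;

process<<<blocks_per_grid, threads_per_block>>>(d_data, n);

// Optionally, the occupancy API reports how many blocks of this size can be
// resident on one multiprocessor for this kernel, which helps judge utilization
int max_blocks_per_sm = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm, process, threads_per_block, 0);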


 2.4.5 Resource Contentions


1) Global memory access - Memory instructions include any instruction that reads from or
writes to shared, local, or global memory. When accessing un-cached local or global
memory, there are hundreds of clock cycles of memory latency.
As an example, the assignment operator in the following sample code has a high
throughput, but - crucially - there is a latency of hundreds of clock cycles to read data from
global memory.
__shared__ float shared[32];

__device__ float device[32];

shared[threadIdx.x] = device[threadIdx.x];

Much of this global memory latency can be hidden by the thread scheduler if there are
sufficient independent arithmetic instructions that can be issued while waiting for the global
memory access to complete. However, it is best to avoid / minimize use of global memory
whenever possible.
2) Device memory allocation and de-allocation via cudaMalloc() and cudaFree() are
expensive operations. So, device memory should be reused and / or sub-allocated by the
application wherever possible to minimize the impact of allocations on overall
performance.
3) Shared memory is helpful in situations to coalesce or eliminate redundant access to global
memory. However, it also can act as a constraint on occupancy. In many cases, the amount
of shared memory required by a kernel is related to the block size that was chosen, but the
mapping of threads to shared memory elements does not need to be one-to-one. For
example, it may be desirable to use a 64 × 64 element shared memory array in a kernel, but
because the maximum number of threads per block is 1024, it is not possible to launch a
kernel with 64 × 64 threads per block. In such cases, kernels with 32 × 32 or 64 × 16
threads can be launched with each thread processing four elements of the shared memory
array.
4) Register dependencies - Register dependencies arise when an instruction uses a result
stored in a register written by an instruction before it. To hide latency arising from register
dependencies, developer needs to maintain sufficient numbers of active threads per
multiprocessor (i.e. sufficient occupancy).
5) Register pressure - This occurs when there are not enough registers available for a given
task. Even though each multiprocessor contains thousands of 32-bit registers, these are
partitioned among concurrent threads. To prevent the compiler from allocating too many
registers, the developer can use the -maxrregcount=N compiler command-line option or the
__launch_bounds__ kernel definition qualifier to control the maximum number of registers to
be allocated per thread (a hedged sketch of the qualifier is shown after this list).

6) When a thread block allocates more registers than are available on a multiprocessor, or
requests too much shared memory, or requests too many threads, the kernel launch fails.
7) Avoiding multiple contexts -
a. CUDA work occurs within a process space for a particular GPU known as a context.
This context encapsulates kernel launches and memory allocations for that GPU as well
as supporting constructs such as the page tables.
b. With the CUDA Driver API, a CUDA application process can potentially create more
than one context for a given GPU. If multiple CUDA application processes access the
same GPU concurrently, this almost always implies multiple contexts.
c. While multiple contexts can be allocated concurrently on a given GPU, only one of
these contexts can execute work at any given moment on that GPU. That is, contexts
sharing the same GPU are time sliced. Creating additional contexts incurs memory
overhead for per context data and time overhead for context switching.
d. In summary, therefore it is best to avoid multiple contexts per GPU within the same
CUDA application.
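Referring back to point 5 above, the following is a hedged sketch of the __launch_bounds__ qualifier; the kernel, its body and the chosen bounds are illustrative assumptions.

// Tell the compiler the intended launch configuration so that it limits
// register usage per thread accordingly : at most 256 threads per block,
// and a desired minimum of 4 resident blocks per multiprocessor.
__global__ void __launch_bounds__(256, 4)
heavy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * data[i] + 1.0f;
}

Alternatively, a compile-line limit such as nvcc -maxrregcount=32 applies one register limit to every kernel in the file, whereas __launch_bounds__ can be tuned per kernel.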
In summary, it is recommended that developers focus on the following recommendations to
achieve the best performance :
1) Maximize parallel execution to achieve maximum utilization
a. Find opportunities to parallelize sequential code.
b. Maximizing parallel execution starts with structuring the algorithm in a way that
exposes as much parallelism as possible.
c. Once the parallelism of the algorithm has been exposed, it needs to be mapped to the
hardware as efficiently as possible. This is done by carefully choosing the execution
configuration of each kernel launch. Kernel launch configuration should be chosen in
such a way that it maximizes device utilization.
d. The application should also maximize parallel execution at a higher level by explicitly
exposing concurrent execution on the device through streams, as well as maximizing
concurrent execution between the host and the device.
2) Optimize memory usage to achieve maximum memory throughput
a. Optimizing memory usage starts with minimizing data transfers between the host and
the device because those transfers have much lower bandwidth than internal device data
transfers.
b. Kernel access to global memory should also be minimized by maximizing the use of
shared memory on the device. That is, efforts should be made to ensure that global memory
accesses are coalesced and that redundant accesses to global memory are minimized
whenever possible (a sketch contrasting access patterns is given at the end of this section).


c. Sometimes, the best optimization might even be to avoid any data transfer in the first
place by simply re-computing the data whenever it is needed. That is, efforts should be
made to avoid / minimize data transfers between the host and the device.
d. The effective bandwidth can vary by an order of magnitude depending on the access
pattern for each type of memory. The next step in optimizing memory usage is therefore
to organize memory accesses according to the optimal memory access patterns. This
optimization is especially important for global memory accesses, because the latency of
access costs hundreds of clock cycles. Shared memory accesses, in contrast, are
usually worth optimizing only when there exists a high degree of bank conflicts.
3) Optimize instruction usage to achieve maximum instruction throughput
a. It is recommended to avoid the use of arithmetic instructions that have low throughput. This
means trading precision for speed when it does not affect the end result, for example,
using intrinsics instead of regular functions or using single precision instead of double
precision.
b. It is recommended to avoid different execution paths within the same warp.
c. It is also important that attention must be paid to control flow instructions due to the
single instruction multiple thread nature of the device.
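To make recommendation 2) more concrete, the following hedged sketch (the kernel names are illustrative) contrasts a coalesced access pattern with a strided one. In the first kernel, consecutive threads of a warp read consecutive addresses, so the warp's loads combine into a few wide memory transactions; in the second, each thread jumps by stride elements and the same warp generates many separate transactions, reducing effective bandwidth :

__global__ void coalescedCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];     // neighbouring threads touch neighbouring addresses
}

__global__ void stridedCopy(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];     // neighbouring threads are 'stride' elements apart
}

Running both versions through a profiler makes the difference in effective bandwidth visible.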

 2.5 Part A : Short Answered Questions [2 Marks Each]


Q.1 Explain CUDA design goals.
 Answer : Following are some of the CUDA design goals : a) enable a straightforward
implementation of parallel algorithms, b) allow programmers to focus on the task of
parallelization of the algorithms rather than spending time on their implementation, c) support
heterogeneous computation where applications can use both the CPU and GPU, and d) enable
running serial portions of applications on the CPU and parallel portions on the GPU.
Q.2 Explain multi GPU programming and its benefits.
 Answer : Multi GPU programming means leveraging multiple GPUs in parallel to accelerate
application performance. Following are some of the benefits of multi GPU processing :
a. The hardware cost of using multiple GPUs is much smaller than buying an equivalent
CPU based machine.
b. The energy consumption is less with GPUs.
c. GPU based machines are easier to upgrade by adding more cards or buying newer ones.


Q.3 What are different areas where multi GPUs are used ?
 Answer : Multi GPUs are used in the following areas :
a. Creation of computer clusters, b. Computational finance,
c. Machine learning, d. Media and entertainment,
e. Medical imaging, f. Audio signal processing,
g. Cryptography, h. Bioinformatics, etc.
Q.4 What is APOD cycle ?
 Answer : APOD stands for Assess, Parallelize, Optimize, and Deploy (APOD). APOD
design cycle is used for optimizing CUDA applications. In this cycle, application developers
identify the portions of their code that would most readily benefit from GPU acceleration,
implement parallelization, check performance results, and deploy optimized implementation.
Q.5 How should performance problem be identified ?
 Answer : Using a profiler, the developer can identify hotspots and compile a list of
candidates for parallelization. Hotspots are the function or functions in which the application is
spending most of its execution time. CUDA toolkit provides several tools/solutions which can
be used by developers in application performance profiling and identifying hotspots.
Q.6 List down the CUDA tools/solutions that help developers in analyzing an application's
performance.
 Answer : Following is the list of solutions provided :
a. NVIDIA Nsight Systems,
b. NVIDIA Visual Profiler,
c. TAU Performance System,
d. VampirTrace,
e. The PAPI CUDA Component,
f. The NVIDIA CUDA Profiling Tools Interface (CUPTI),
g. NVIDIA Topology Aware GPU Selection (NVTAGS)
Q.7 What are memory considerations that should be thought of while improving
performance ?
 Answer : Following are the memory considerations that developers should keep in mind :

a) Access to global memory should be coalesced.


b) The L2 cache is on-chip; because of this, a data region in global memory that is accessed
repeatedly can be persisted in the L2 cache.
c) Wherever possible shared memory should be used instead of global memory.
d) Local memory is off-chip and because of this access to local memory is as expensive as
access to global memory.
e) Texture cache is optimized for 2D spatial locality.
f) Constant memory is best when threads in the same warp access only a few distinct
locations.
Q.8 What guidelines should be followed while transferring data between host and
device ?
 Answer : Following guidelines should be followed while transferring data between host
and device :
1) minimize data transfer between host and device
2) use pinned memory
3) batch small transfers
4) use asynchronous and overlapping transfers with computation
5) use zero copy.
Q.9 What care should be taken while launching kernel ?
 Answer : The first execution parameter, the number of blocks per grid, should be larger than
the number of multiprocessors so that all multiprocessors have at least one block to execute.
The second execution parameter, the number of threads per block, should be chosen so that
there are multiple active blocks per multiprocessor, allowing blocks that are not waiting for a
__syncthreads() to keep the hardware busy.
Q.10 What are constrained resources that should be kept in mind during writing
applications ?
 Answer : Following are some of the constrained resources that should be kept in mind
during writing applications :
a) access to global memory
b) memory allocation and de-allocation operations on device
c) shared memory
d) registers.

 2.6 Part B : Long Answered Questions


Q.1 What are different libraries provided by CUDA ? (Refer section 2.1)
Q.2 Write simple C and corresponding CUDA program. (Refer section 2.1)
Q.3 Describe APOD cycle in detail. (Refer section 2.4)
Q.4 Explain different memory spaces available on CUDA device along with memory
constraints. (Refer section 2.4.2)
Q.5 List down recommendations that developers should focus for achieving the best
performance. (Refer section 2.4.5)




UNIT - III

3 Programming Issues

Syllabus

Common Problems : CUDA Error Handling, Parallel Programming Issues,


Synchronization, Algorithmic Issues, Finding and Avoiding Errors.

Contents

3.1 Common Problems

3.2 CUDA Error Handling

3.3 Parallel Programming Issues

3.4 Synchronization Issues

3.5 Algorithmic Issues

3.6 Finding and Avoiding Errors

3.7 Part A : Short Answered Questions [2 Marks Each]

3.8 Part B : Long Answered Questions


 3.1 Common Problems


 In this chapter we will look at some of the issues that every CUDA programmer faces and
how programmers can avoid these issues by using some of the best practices. Issues with
CUDA programs often fall into one of the following categories :
1. Errors relating to usage of various CUDA APIs.
2. Errors relating to parallel programming.
3. Errors relating to synchronization.
4. Errors relating to algorithms.
 In this section we will look at some of the common problems faced by CUDA
programmers.
1. Error in using CUDA APIs - This is one of the most common issues for people new
to CUDA. All of the CUDA API functions return an error code. Anything other than
cudaSuccess generally indicates that something went wrong when calling the API.
Many of the CUDA APIs are asynchronous, meaning the error code returned at the point of
the query may relate to something that happened at some distant point in the past. Each
error code can be turned into a semi-useful error string, rather than a number you have to
look up in the API documentation. The error string is somewhat helpful as a first attempt to
identify the potential cause of the problem. However, it relies on the programmer explicitly
checking the return code in the host program.
Some examples are :
cudaErrorMemoryAllocation = This error means the API call failed because it was unable to
allocate enough memory to perform the requested operation.

cudaErrorMissingConfiguration = This error means the device function being invoked via
cudaLaunchKernel() was not previously configured via the cudaConfigureCall() function.

cudaErrorLaunchOutOfResources = This means that a launch did not occur because it did not
have appropriate resources. This error indicates that the user has attempted to pass too many
arguments to the device kernel, or the kernel launch specifies too many threads for the
kernel's register count.


2. Array overrun - Array overrun is one of the most common errors in CUDA. It is
important that you ensure all your kernel invocations start with a check to ensure the
data they will access, both for read and write purposes, is guarded by a condition. For
example -
if (i < num_elements)

array[i] = ...

This condition takes a marginal amount of time, but saves a lot of debugging effort later.
Such a problem is typically observed when the number of data elements is not a
multiple of the thread block size. Suppose we have 256 threads per block and 1024
data elements; this would invoke four blocks of 256 threads, and each thread would
contribute to the result. Now, suppose we had 1025 data elements. You would typically
see two types of errors here. The first is to not invoke a sufficient number of threads,
due to using an integer division, which will usually truncate the number of blocks needed.
Typically people write -
const int num_blocks = num_elements / num_threads;

This will work, but only when the number of elements is an exact multiple of the
number of threads.
In the 1025 elements case we launch 4 X 256 threads, some 1024 threads in total. The
last element remains unprocessed. Some attempt to “get around” this issue by writing
something like -
const int num_blocks = ((float) num_elements / num_threads);

This does not solve the problem. You cannot have 4.1 blocks. The assignment to integer
truncates the number to four blocks. The solution is a simple one. You write the
following instead -
const int num_blocks = (num_elements + (num_threads-1)) / num_threads;

This will ensure you always allocate enough blocks.
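Putting the pieces of this discussion together, a hedged sketch of the complete pattern might look as follows (scaleKernel and launchScale are illustrative names); the block count is computed with the ceiling division and every thread guards its access, so element counts such as 1025 are handled correctly :

__global__ void scaleKernel(float *array, int num_elements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_elements)        // guard against the padding threads
        array[i] *= 2.0f;
}

void launchScale(float *dev_array, int num_elements)
{
    const int num_threads = 256;
    const int num_blocks  = (num_elements + (num_threads - 1)) / num_threads;
    scaleKernel<<<num_blocks, num_threads>>>(dev_array, num_elements);
}

For 1025 elements this launches five blocks (1280 threads); the extra threads fail the bounds check and simply do nothing.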


3. Invalid device handles - This type of error is generally related to incorrect mixing of
handles / pointers. When you allocate memory on the device or on the host, you receive
a pointer to that memory. However, that pointer comes with an implicit requirement that
only the host may access host pointers and only the device may access device pointers.


The standard CUDA runtime checks for this type of incorrect mixing of device and host
pointers. The CUDA API checks the pointer’s origin and will generate a runtime error if
you pass a host pointer to a kernel function without first converting it to a device pointer
to host memory. However, the same cannot be said for the standard C / C++ system
libraries. If you call the standard free function as opposed to the cudaFree function with
a device pointer, the system libraries will try to free that memory on the host, and then
will likely crash. The host libraries have no concept of a memory space they cannot
access.
The other type of invalid handle comes from the usage of a type before it has been
initialized. This is somewhat similar to using a variable before assigning it a value. For
example,
cudaStream_t my_stream;

my_kernel<<<num_blocks, num_threads, dynamic_shared, my_stream>>>(a, b, c);

In this example we are missing the call to cudaStreamCreate and the subsequent
cudaStreamDestroy function. The create call performs some initialization to register the
stream with the CUDA API. The destroy call releases those resources. The correct code is
as follows :
cudaStream_t my_stream;

cudaStreamCreate(&my_stream);

my_kernel<<<num_blocks, num_threads, dynamic_shared, my_stream>>>(a, b, c);

cudaStreamSynchronize(my_stream);

cudaStreamDestroy(my_stream);

Invalid device handles, however, are not simply caused by forgetting to create them.
They can also be caused by destroying them prior to the device finishing usage of them.
Try deleting the cudaStreamSynchronize call from the original code. This will cause the
stream in use by the asynchronous kernel to be destroyed while the kernel is potentially
still running on the device. Due to the asynchronous nature of streams, the
cudaStreamDestroy function will not fail, it will return cudaSuccess. In fact, you will
not get an error until sometime later, from an entirely unrelated call into the CUDA API.
CUDA provides the initcheck tool, which checks for uninitialized device global memory
accesses. This tool can identify when device global memory is accessed without having been
initialized via device side writes, or via CUDA memcpy and memset API calls.

4. Not understanding the compute dependent features and programming


environment - This type of error is related to writing programs without understanding
the programming environment. With each generation of NVIDIA processors, new
features that CUDA can leverage are added to the GPU. Therefore it is important for
programmers to be aware of the device's compute capability and of the version numbers of
the CUDA Runtime and CUDA Driver APIs.
CUDA compute capability describes the features of the hardware and reflects the set of
instructions supported by the device as well as other specifications, such as the
maximum number of threads per block and the number of registers per multiprocessor.
Higher compute capability versions are supersets of lower / earlier versions, so they are
backward compatible. The compute capability of the GPU in the device can be queried
programmatically by calling cudaGetDeviceProperties() and accessing the information
in the structure it returns (a query sketch is given at the end of this section). In particular,
developers should note the number of multiprocessors on the device, the number of registers,
the amount of memory available, and any special capabilities of the device.
Certain hardware features are not described by the compute capability. For example, the
ability to overlap kernel execution with asynchronous data transfers between the host
and the device is available on most but not all GPUs irrespective of the compute
capability. In such cases, call cudaGetDeviceProperties() to determine whether the
device is capable of a certain feature. For example, the asyncEngineCount field of the
device property structure indicates whether overlapping kernel execution and data
transfers is possible (and, if so, how many concurrent transfers are possible); likewise,
the canMapHostMemory field indicates whether zero-copy data transfers can be
performed.
The host runtime component of the CUDA software environment can be used only by
host functions. It provides functions to handle the following : Device management,
Context management, Memory management, Code module management, Execution
control, Texture reference management, Interoperability with OpenGL and Direct3D.
As compared to the lower-level CUDA Driver API, the CUDA Runtime greatly eases
device management by providing implicit initialization, context management, and
device code module management. The C++ host code generated by nvcc utilizes the
CUDA Runtime, so applications that link to this code will depend on the CUDA
Runtime; similarly, any code that uses the cuBLAS, cuFFT, and other CUDA Toolkit
libraries will also depend on the CUDA Runtime, which is used internally by these
libraries.

The CUDA Runtime handles kernel loading and setting up kernel parameters and launch
configuration before the kernel is launched. The implicit driver version checking, code
initialization, CUDA context management, CUDA module management, kernel
configuration, and parameter passing are all performed by the CUDA Runtime.
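As a minimal sketch of the query described in point 4, the following host program prints a few of the cudaDeviceProp fields mentioned above for every device in the system; the exact set of fields printed here is just an illustrative selection :

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev_count = 0;
    cudaGetDeviceCount(&dev_count);

    for (int dev = 0; dev < dev_count; ++dev)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        printf("Device %d : %s\n", dev, prop.name);
        printf("  Compute capability    : %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors       : %d\n", prop.multiProcessorCount);
        printf("  Global memory (bytes) : %zu\n", prop.totalGlobalMem);
        printf("  Async copy engines    : %d\n", prop.asyncEngineCount);
        printf("  Can map host memory   : %d\n", prop.canMapHostMemory);
    }
    return 0;
}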

 3.2 CUDA Error Handling


This section describes the error handling functions provided by CUDA runtime API and
CUDA driver API.
1. cudaGetErrorName (CUDA runtime API)
a. Syntax
i. __host____device__const char* cudaGetErrorName ( cudaError_t error )
b. Parameters
i. error - Error code to convert to string
ii. char* - pointer to a NULL-terminated string
c. Returns a string containing the name of an error code in the enum. If the error code is
not recognized, "unrecognized error code" is returned.
2. cudaGetErrorString (CUDA runtime API)
d. Syntax
i. __host____device__const char* cudaGetErrorString ( cudaError_t error )
e. Parameters
i. error - Error code to convert to string
ii. char* - pointer to a NULL-terminated string
f. Returns the description string for an error code. If the error code is not recognized,
"unrecognized error code" is returned.
3. cudaGetLastError (CUDA runtime API)
g. Syntax
i. __host____device__cudaError_t cudaGetLastError ( void )
h. Returns
cudaSuccess, cudaErrorMissingConfiguration, cudaErrorMemoryAllocation,

cudaErrorInitializationError, cudaErrorLaunchFailure,

cudaErrorLaunchTimeout, cudaErrorLaunchOutOfResources,


cudaErrorInvalidDeviceFunction, cudaErrorInvalidConfiguration,

cudaErrorInvalidDevice, cudaErrorInvalidValue, cudaErrorInvalidPitchValue,

cudaErrorInvalidSymbol, cudaErrorUnmapBufferObjectFailed,

cudaErrorInvalidDevicePointer, cudaErrorInvalidTexture,

cudaErrorInvalidTextureBinding, cudaErrorInvalidChannelDescriptor,

cudaErrorInvalidMemcpyDirection, cudaErrorInvalidFilterSetting,

cudaErrorInvalidNormSetting, cudaErrorUnknown,

cudaErrorInvalidResourceHandle, cudaErrorInsufficientDriver,

cudaErrorNoDevice, cudaErrorSetOnActiveProcess, cudaErrorStartupFailure,

cudaErrorInvalidPtx, cudaErrorUnsupportedPtxVersion,

cudaErrorNoKernelImageForDevice, cudaErrorJitCompilerNotFound

i. Returns the last error that has been produced by any of the runtime calls in the same host
thread and resets it to cudaSuccess. This function may also return error codes from
previous asynchronous launches.
4. cudaPeekAtLastError (CUDA runtime API)
j. Syntax
i. __host____device__cudaError_t cudaPeekAtLastError ( void )
k. Returns
cudaSuccess, cudaErrorMissingConfiguration, cudaErrorMemoryAllocation,

cudaErrorInitializationError, cudaErrorLaunchFailure,

cudaErrorLaunchTimeout, cudaErrorLaunchOutOfResources,

cudaErrorInvalidDeviceFunction, cudaErrorInvalidConfiguration,

cudaErrorInvalidDevice, cudaErrorInvalidValue, cudaErrorInvalidPitchValue,

cudaErrorInvalidSymbol, cudaErrorUnmapBufferObjectFailed,

cudaErrorInvalidDevicePointer, cudaErrorInvalidTexture,

cudaErrorInvalidTextureBinding, cudaErrorInvalidChannelDescriptor,

cudaErrorInvalidMemcpyDirection, cudaErrorInvalidFilterSetting,

cudaErrorInvalidNormSetting, cudaErrorUnknown,

cudaErrorInvalidResourceHandle, cudaErrorInsufficientDriver,


cudaErrorNoDevice, cudaErrorSetOnActiveProcess, cudaErrorStartupFailure,

cudaErrorInvalidPtx, cudaErrorUnsupportedPtxVersion,

cudaErrorNoKernelImageForDevice, cudaErrorJitCompilerNotFound

l. Returns the last error that has been produced by any of the runtime calls in the same host
thread. Note that this call does not reset the error to cudaSuccess like
cudaGetLastError(). This function may also return error codes from previous
asynchronous launches.
5. cuGetErrorName (CUDA driver API)
m. Syntax
i. CUresult cuGetErrorName ( CUresult error, const char** pStr )
n. Parameters
i. error - Error code to convert to string
ii. pStr - Address of the string pointer
o. Returns
CUDA_SUCCESS, CUDA_ERROR_INVALID_VALUE

p. Sets *pStr to the address of a NULL-terminated string representation of the name of the
enum error code error. If the error code is not recognized,
CUDA_ERROR_INVALID_VALUE will be returned and *pStr will be set to the
NULL address.
6. cuGetErrorString (CUDA driver API)
q. Syntax
i. CUresult cuGetErrorString ( CUresult error, const char** pStr )
r. Parameters
i. error - Error code to convert to string
ii. pStr - Address of the string pointer
s. Returns
CUDA_SUCCESS, CUDA_ERROR_INVALID_VALUE

t. Sets *pStr to the address of a NULL-terminated string description of the error code
error. If the error code is not recognized, CUDA_ERROR_INVALID_VALUE will be
returned and *pStr will be set to the NULL address.


 Error Checking
1) All runtime functions return an error code. However for an asynchronous function this error
code cannot possibly report any of the asynchronous errors that could occur on the device
since the function returns before the device has completed the task. The error code only
reports errors that occur on the host prior to executing the task, typically related to
parameter validation; if an asynchronous error occurs, it will be reported by some
subsequent unrelated runtime function call.
2) The only way to check for asynchronous errors just after some asynchronous function call
is therefore to synchronize just after the call by calling cudaDeviceSynchronize() and
checking the error code returned by cudaDeviceSynchronize().
3) The runtime maintains an error variable for each host thread that is initialized to
cudaSuccess and is overwritten by the error code every time an error occurs (be it a
parameter validation error or an asynchronous error). cudaPeekAtLastError() returns this
variable. cudaGetLastError() returns this variable and resets it to cudaSuccess.
4) Kernel launches do not return any error code, so cudaPeekAtLastError() or
cudaGetLastError() must be called just after the kernel launch to retrieve any pre-launch
errors. To ensure that any error returned by cudaPeekAtLastError() or cudaGetLastError()
does not originate from calls prior to the kernel launch, one has to make sure that the
runtime error variable is set to cudaSuccess just before the kernel launch, for example, by
calling cudaGetLastError() just before the kernel launch.
5) Kernel launches are asynchronous, so to check for asynchronous errors, the application
must synchronize in-between the kernel launch and the call to cudaPeekAtLastError() or
cudaGetLastError().
6) Note that cudaErrorNotReady, which may be returned by cudaStreamQuery() and
cudaEventQuery(), is not considered an error and is therefore not reported by
cudaPeekAtLastError() or cudaGetLastError().
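The rules above can be wrapped into a small checking pattern. The following is a hedged sketch; CUDA_CHECK is a hypothetical helper macro (it is not part of the CUDA toolkit) built on cudaGetErrorString(), and the kernel is a trivial placeholder :

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

__global__ void dummyKernel(float *data) { data[threadIdx.x] = 1.0f; }

int main()
{
    float *d_data;
    CUDA_CHECK(cudaMalloc((void**)&d_data, 256 * sizeof(float)));

    CUDA_CHECK(cudaGetLastError());       // clear any earlier error
    dummyKernel<<<1, 256>>>(d_data);
    CUDA_CHECK(cudaGetLastError());       // pre-launch / configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // asynchronous execution errors

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}

Calling cudaGetLastError() immediately before the launch clears any stale error, the call immediately after the launch catches configuration errors, and the checked cudaDeviceSynchronize() surfaces asynchronous execution errors, as described in points 3) to 5) above.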

 3.3 Parallel Programming Issues


 Having understood issues with the usage of APIs, the next set of issues that most CUDA
developers face is related to the nature of parallel software development. In this section, we
look at some of these issues and how they affect application development.
 In a single-thread application, the problem of producer / consumer is quite easy to handle. It
is simply a case of looking at the data flow and seeing if a variable was read before
anything wrote to it. Many of the better compilers highlight such issues. However, even
with this assistance, complex code can suffer from this issue.


 As soon as you introduce threads into the equation, producer / consumer problems become
a real headache if not thought about carefully in advance. The threading mechanism in most
operating systems including CUDA tries to operate to achieve the best overall throughput.
This usually means threads can run in any order and the program must not be sensitive to
this ordering. Consider a loop where iteration i depends on loop iteration i – 1. If we simply
assign a thread to each element of the array and do nothing else, the program will work
only when the processor executes one thread at a time according to the thread ID from low
to high thread numbers. Reverse this order or execute more than one thread in parallel and
the program breaks. However, this is a rather simple example and not all programs break.
Many run and produce the answer correctly sometimes. If you ever find you have a correct
answer on some runs, but the wrong answer on others, it is likely you have a
producer / consumer or race hazard issue.
 Race hazard
 A race hazard, as its name implies, occurs when sections of the program “race” toward a
critical point, such as a memory read / write. Sometimes warp 0 may win the race and the
result is correct. Other times warp 1 might get delayed and warp 3 hits the critical section
first, producing the wrong answer. The major problem with race hazards is they do not
always occur. This makes debugging them and trying to place a breakpoint on the error
difficult. The second feature of race hazards is they are extremely sensitive to timing
disturbances. Thus, adding a breakpoint and single-stepping the code always delays the
thread being observed. This delay often changes the scheduling pattern of other warps,
meaning the particular conditions of the wrong answer may never occur.
 The first question in such a situation is not where in the code is this happening, but requires
you to take a step backward and look at the larger picture. Consider under what
circumstances the answer can change. If there is some assumption about the ordering of
thread or block execution in the design, then we already have the cause of the problem. As
CUDA does not provide any guarantee of block ordering or warp execution ordering, any
such assumption means the design is flawed. For instance, take a simple sum-based
reduction to add all the numbers in a large array. If each run produces a different answer,
then this is likely because the blocks are running in a different order, which is to be
expected. The order should not and must not affect the outcome of the result. In such an
example we can fix the ordering issues by sorting the array and combining values from low
to high in a defined order. We can and should define an order for such problems. However,
the actual execution order in the hardware should be considered as undefined with known
synchronization points.

 CUDA provides racecheck tool to help identify memory access race conditions in CUDA
applications that use shared memory. In CUDA applications, storage declared with the
__shared__ qualifier is placed in on chip shared memory. All threads in a thread block can
access this per block shared memory. Shared memory goes out of scope when the thread
block completes execution. As shared memory is on chip, it is frequently used for inter
thread communication and as a temporary buffer to hold data being processed. As this data
is being accessed by multiple threads in parallel, incorrect program assumptions may result
in data races. Racecheck is a tool built to identify these hazards and help users write
programs free of shared memory races.
 A data access hazard is a case where two threads attempt to access the same location in
memory resulting in nondeterministic behavior, based on the relative order of the two
accesses. These hazards cause data races where the behavior or the output of the application
depends on the order in which all parallel threads are executed by the hardware. The
racecheck tool identifies three types of canonical hazards in a program. These are :
1. Write-After-Write (WAW) hazards - This hazard occurs when two threads attempt to
write data to the same memory location. The resulting value in that location depends on
the relative order of the two accesses.
2. Write-After-Read (WAR) hazards - This hazard occurs when two threads access the
same memory location, with one thread performing a read and another a write. In this
case, the writing thread is ordered before the reading thread and the value returned to the
reading thread is not the original value at the memory location.
3. Read-After-Write (RAW) hazards - This hazard occurs when two threads access the
same memory location, with one thread performing a read and the other a write. In this
case, the reading thread reads the value before the writing thread commits it.
 Atomic operations
 As you know, we cannot rely on, or make any assumption about, ordering to ensure an
output is correct. However, neither can you assume a read / modify / write operation will be
completed synchronously with the other SMs within the device. Consider the scenario of
SM0 and SM1 both performing a read/modify / write. They must perform it in series to
ensure the correct answer is reached. If SM0 and SM1 both read 10 from a memory
address, add 1 to it, and both write 11 back, one of the increments to the counter has been
lost.
 Atomic operations are used where we have many threads that need to write to a common
output. They guarantee that the read/write / modify operation will be performed as an entire

serial operation. They, however, do not guarantee any ordering of the read/write/modify
operation. Thus, if both SM0 and SM1 ask to perform an atomic operation on the same
address, which SM goes first is not defined.
 CUDA provides atomic functions which perform a read-modify-write atomic operation on
one 32-bit or 64-bit word residing in global or shared memory. For example, atomicAdd()
reads a word at some address in global or shared memory, adds a number to it, and writes
the result back to the same address. The operation is atomic in the sense that it is
guaranteed to be performed without interference from other threads; in other words, no
other thread can access this address until the operation is complete. Some of the
other atomic functions are : atomicSub(), atomicExch(), atomicMin(),
atomicMax(), atomicInc(), atomicDec().
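A minimal sketch of atomicAdd() in use (the kernel and its purpose are illustrative assumptions) : every thread that finds a matching element increments a single counter in global memory, and the atomic operation guarantees that no increment is lost regardless of which SM or warp reaches the counter first.

__global__ void countMatches(const int *data, int n, int target, int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] == target)
        atomicAdd(count, 1);    // read-modify-write performed as one serial operation
}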

 3.4 Synchronization Issues


 On both host and device, the CUDA runtime offers an API for launching kernels, for
waiting for launched work to complete, and for tracking dependencies between launches via
streams and events. On the host system, the state of launches and the CUDA primitives
referencing streams and events are shared by all threads within a process; however
processes execute independently and may not share CUDA objects. A similar hierarchy
exists on the device i.e. launched kernels and CUDA objects are visible to all threads in a
thread block, but are independent between thread blocks. This means for example that a
stream may be created by one thread and used by any other thread in the same thread block,
but may not be shared with threads in any other thread block.
 CUDA runtime operations from any thread, including kernel launches, are visible across a
thread block. This means that an invoking thread in the parent grid may perform
synchronization on the grids launched by that thread, by other threads in the thread block,
or on streams created within the same thread block. Execution of a thread block is not
considered complete until all launches by all threads in the block have completed. If all
threads in a block exit before all child launches have completed, a synchronization
operation will automatically be triggered.
1. Thread synchronization - CUDA provides the __syncthreads() function, which is
used to coordinate communication between the threads of the same block (see the sketch
after this list). When some threads within a block access the same addresses in shared or
global memory, there are potential read-after-write, write-after-read, or write-after-write
hazards for some of these memory accesses. These data hazards can be avoided by
synchronizing threads in-between these accesses. That is, this function waits until all
threads in the thread block
have reached this point and all global and shared memory accesses made by these
threads prior to __syncthreads() are visible to all threads in the block.
2. Group synchronization - cooperative_groups::sync(T& group) synchronizes the
threads named in the group. T can be any of the existing group types, as all of them
support synchronization. If the group is a grid_group or a multi_grid_group, the kernel
must have been launched using the appropriate cooperative launch APIs.
3. Grid synchronization - To synchronize across the grid, from within a kernel, you
would simply use the grid.sync() functionality:
grid_group grid = this_grid();

grid.sync();

And when launching the kernel it is necessary to use, instead of the <<<...>>> execution
configuration syntax, the cudaLaunchCooperativeKernel CUDA runtime launch API or the
CUDA driver equivalent.
4. Multi-Device Synchronization - In order to enable synchronization across multiple
devices with Cooperative Groups, programmers need to use the
cudaLaunchCooperativeKernelMultiDevice CUDA API.
a. This API ensures that a launch is atomic, i.e. if the API call succeeds, then the provided
number of thread blocks will launch on all specified devices.
b. The functions launched via this API must be identical. No explicit checks are done by
the driver in this regard because it is largely not feasible. It is up to the application to
ensure this.
c. No two entries in the provided cudaLaunchParams may map to the same device.
d. All devices being targeted by this launch must be of the same compute capability -
major and minor versions.
e. The block size, grid size and amount of shared memory per grid must be the same across
all devices. This means the maximum number of blocks that can be launched per device
will be limited by the device with the least number of SMs.
f. Any user defined __device__, __constant__ or __managed__ device global variables
present in the module that owns the CUfunction being launched are independently
instantiated on every device. The user is responsible for initializing such device global
variables appropriately.
g. Optimal performance in multi-device synchronization is achieved by enabling peer
access via cuCtxEnablePeerAccess or cudaDeviceEnablePeerAccess for all participating
devices.

h. The launch parameters should be defined using an array of structs (one per device), and
launched with cudaLaunchCooperativeKernelMultiDevice.
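As a sketch for point 1 above (thread synchronization), the following illustrative block-level sum reduction keeps its partial results in shared memory; the __syncthreads() calls ensure that every write to the shared array is visible to the whole block before any thread reads it again, avoiding the read-after-write hazards described earlier :

#define BLOCK_SIZE 256   // assumed to be a power of two

__global__ void blockSum(const float *in, float *blockResults, int n)
{
    __shared__ float partial[BLOCK_SIZE];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                        // all loads complete before reducing

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                    // wait for this reduction step
    }

    if (tid == 0)
        blockResults[blockIdx.x] = partial[0];
}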
 You can achieve concurrency in your CUDA programs by using streams. All CUDA calls
are either synchronous or asynchronous with respect to the host. In synchronous calls, work
is added to the queue and the host waits for completion, whereas in asynchronous calls work
is added to the queue and control returns to the host immediately. Kernel launches are
asynchronous in nature.
 A stream is a queue of device work. The host places work in the queue and continues on
immediately. Device schedules work from streams when resources are free. CUDA
operations are placed within a stream e.g. Kernel launches, memory copies. Operations
within the same stream are ordered (FIFO) and cannot overlap. Operations in different
streams are unordered and can overlap.
 Let us understand how to manage streams :
1. cudaStream_t stream – This declares a stream handle
2. cudaStreamCreate(&stream) – This allocates a stream
3. cudaStreamDestroy(stream) – This de-allocates a stream and synchronizes host until
work in stream has completed
4. Placing work in a stream can be achieved by using the 4th launch parameter during kernel
launch, as in kernel<<<blocks, threads, smem, stream>>>(). A stream can also be passed
into some API calls, as in cudaMemcpyAsync(dst, src, size, dir, stream). (A usage sketch
is given after this discussion.)
5. Unless otherwise specified, all calls are placed into the default stream. The default stream
is synchronous with all other streams, and operations in the default stream cannot overlap
operations in other streams.
6. Explicit Synchronization - There are different ways to explicitly synchronize streams
with each other
a. cudaDeviceSynchronize() waits until all preceding commands in all streams of all
host threads have completed.
b. cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding
commands in the given stream have completed. It can be used to synchronize the
host with a specific stream, allowing other streams to continue executing on the
device.
c. cudaStreamWaitEvent() takes a stream and an event as parameters and makes all the
commands added to the given stream delay their execution until the given event has
completed.
d. cudaStreamQuery() provides applications with a way to know if all preceding
commands in a stream have completed.


7. Implicit Synchronization - Two commands from different streams cannot run


concurrently if any one of the following operations is issued in-between them by the
host thread :
a. a page-locked host memory allocation
b. a device memory allocation
c. a device memory set
d. a memory copy between two addresses to the same device memory
e. any CUDA command to the NULL stream
f. a switch between the L1 / shared memory configurations
Therefore, applications should follow these guidelines to improve their potential for
concurrent kernel execution:
a. All independent operations should be issued before dependent operations.
b. Synchronization of any kind should be delayed as long as possible.
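The stream management calls above can be combined into the following minimal sketch (the buffer sizes, the number of streams, and the trivial kernel are illustrative assumptions) : pinned host buffers are allocated with cudaMallocHost, and each stream gets its own asynchronous copy-in, kernel launch, and copy-out, so work in the two streams may overlap :

#include <cuda_runtime.h>

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void runInStreams(int n)                   // n elements per stream (assumption)
{
    float *h_buf[2], *d_buf[2];
    cudaStream_t stream[2];

    for (int s = 0; s < 2; ++s)
    {
        cudaMallocHost((void**)&h_buf[s], n * sizeof(float));  // pinned host memory
        cudaMalloc((void**)&d_buf[s], n * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }

    for (int s = 0; s < 2; ++s)
    {
        cudaMemcpyAsync(d_buf[s], h_buf[s], n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], n);
        cudaMemcpyAsync(h_buf[s], d_buf[s], n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    for (int s = 0; s < 2; ++s)
    {
        cudaStreamSynchronize(stream[s]);  // host waits for each stream's work
        cudaStreamDestroy(stream[s]);
        cudaFree(d_buf[s]);
        cudaFreeHost(h_buf[s]);
    }
}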
NVIDIA Nsight Systems, NVIDIA Visual Profiler, and The NVIDIA CUDA Profiling
Tools Interface are some of the commonly used tools by CUDA programmers for code
profiling and identifying issues with concurrency and synchronization. Following is list of the
most common streaming/synchronization related issues that are seen many CUDA
applications.
1. Using the default stream
a. Symptoms
i. One stream will not overlap other streams.
b. Solution
i. Search for cudaEventRecord(event), cudaMemcpyAsync(), etc. If stream is not
specified it is placed into the default stream. Specify non-default stream.
ii. Search for kernel launches in the default stream <<<a,b>>> and move into a non-
default stream.
2. Memory transfer issues
a. Symptoms
i. Memory copy is synchronous.
ii. Memory copies are not overlapping.
iii. Host does not get ahead and spends excessive time in memory copy API.


b. Solution
i. Use asynchronous memory copies.
ii. Use pinned host memory using cudaMallocHost or cudaHostRegister.
3. Implicit synchronization
a. Symptoms
i. Host does not get ahead.
ii. Host shows excessive time in certain API calls - cudaMalloc, cudaFree,
cudaEventCreate, cudaEventDestroy, cudaStreamCreate, cudaHostRegister,
cudaFuncSetCacheConfig
iii. Allocation and deallocation synchronizes the device.
b. Solution
i. Reuse CUDA memory and objects / data structures including streams and events.
4. Limited by host
a. Symptoms
i. Host is limiting performance and is outside of API calls.
ii. Large gaps in timeline where the host and device are empty.
b. Solution
i. Move more work to the GPU.
ii. Multi-thread host code.
5. Limited by launch overhead
a. Symptoms
i. Host does not get ahead. Host is in cudaLaunch or other APIs. There is not enough
work to cover launch overhead.
ii. Kernels are short, < 30 μs.
iii. Time between successive kernels is > 10 μs.
b. Solution
i. Make longer running kernels by a) fusing nearby kernels together b) batching work
within a single kernel c) solving larger problems.


6. Excessive synchronization
a. Symptoms
i. Host does not get ahead.
ii. Large gaps of idle time in timeline.
iii. Host shows synchronization API calls.
b. Solution
i. Use events to limit the amount of synchronization.
ii. Use cudaStreamWaitEvent to prevent host synchronization.
iii. Use cudaEventSynchronize.
7. Profiler overhead
a. Symptoms
i. Large gaps in timeline.
ii. Timeline shows profiler overhead.
iii. Real code likely does not have the same problem.
b. Solution
i. Avoid cudaDeviceSynchronize() and cudaStreamSynchronize(); use
cudaEventRecord(event, stream) and cudaEventSynchronize(event) instead.
 If we have to summarize common mistakes by programmers relating to synchronization,
then they would be the following :
1. Using the default stream
2. Using synchronous memory copies
3. Not using pinned memory
4. Overuse of synchronization primitives
 CUDA provides the synccheck tool, which can identify whether a CUDA application is
correctly using synchronization primitives, specifically the __syncthreads() and __syncwarp()
intrinsics and their Cooperative Groups API counterparts.

 3.5 Algorithmic Issues


Algorithmic issues are those issues wherein the program runs successfully without any
errors, but does not produce the correct / expected output. These types of problems are a bit
tricky to solve.


1) Back-to-back testing
This is a technique which acknowledges that it is much harder to write code that executes in
parallel than to write a functionally equivalent set of code for a serial processor. With this in
mind, the programmer always develops a serial implementation of the problem along with the
CUDA parallel implementation. The programmer then runs the identical dataset through both
sets of code and compares the output. Any difference in the output points out that you may
have an issue (a sketch follows).
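A hedged sketch of such a back-to-back test is given below; scaleSerial is an illustrative serial reference, and gpuOut is assumed to hold the result already copied back from the corresponding CUDA kernel. The comparison uses a small tolerance to allow for floating-point rounding differences between the CPU and the GPU :

#include <math.h>
#include <stdio.h>

// Serial reference implementation of the same operation as the CUDA kernel.
void scaleSerial(const float *in, float *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * 2.0f;
}

// Compare CPU and GPU results element by element.
int resultsMatch(const float *cpuOut, const float *gpuOut, int n)
{
    for (int i = 0; i < n; ++i)
        if (fabsf(cpuOut[i] - gpuOut[i]) > 1e-5f)
        {
            printf("Mismatch at %d : CPU %f, GPU %f\n", i, cpuOut[i], gpuOut[i]);
            return 0;
        }
    return 1;
}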
2) Memory leaks
Memory leaks are a common problem and something that is not just restricted to the CPU
domain. Memory leaks are device side allocations that have not been freed by the time the
context is destroyed.
a. The most common cause of memory leaks is where a program allocates, or mallocs,
memory space but does not free that space later. If you have ever left a computer / server
on for weeks at a time, sooner or later it will start to slow down. Sometime afterwards it
will start to display out of memory warnings. This is caused by badly written programs
that do not clean up after themselves.
b. Explicit memory management is something you are responsible for within CUDA. If
you allocate memory, you are responsible for deallocating that memory when the
program completes its task. You are also responsible for not using a device handle or
pointer that you previously released back to the CUDA runtime.
c. Several of the CUDA operations, in particular streams and events, require you to create
an instance of that stream. During that initial creation the CUDA runtime may allocate
memory internally. Failing to call cudaStreamDestroy() or cudaEventDestroy() means
that memory, which may be both on the host and on the GPU, stays allocated. Your
program may exit, but without the explicit release of this data by the programmer, the
runtime does not know it should be released.
d. To take care of this problem, programmers can call the cudaDeviceReset() API, which
completely clears all allocations on the device. This should be the last call you make
before exiting the host program. Even if you have released all the resources you think
you have allocated, with a program of a reasonable size, you or a colleague on the team
may have forgotten one or more allocations. It is a simple and easy way to ensure
everything is cleaned up.


e. CUDA-MEMCHECK is a very useful tool designed to detect memory access errors
and thread ordering hazards that are hard to detect and time consuming to debug. This
tool is supported on Windows, Linux, and Android, and can interoperate with CUDA-
GDB on Android and Linux. The tool can be invoked by running the cuda-memcheck
executable as follows :
cuda-memcheck [options] app_name [app_options]

f. The tool will execute your kernel and report appropriate errors like following :
i. Memory access error - Errors on device due to out of bounds or misaligned
accesses to memory by a global, local, shared or global atomic access.
ii. Hardware exception - Errors on device that are reported by the hardware error
reporting mechanism.
iii. Malloc / Free errors - Device errors that occur due to incorrect use of malloc() /
free() in CUDA kernels.
iv. CUDA API errors - Errors on host when a CUDA API call in the application returns
a failure.
v. cudaMalloc memory leaks - Errors on host resulting from allocations of device
memory using cudaMalloc() that have not been freed by the application.
vi. Device Heap Memory Leaks - Errors on device resulting from allocations of device
memory using malloc() in device code that have not been freed by the application.
3) Long kernels
 A kernel taking a long time to execute can cause a number of problems. One of the most
noticeable is slow screen updates when the kernel is executing in the background on a
device that is also used to drive the display. To run a CUDA kernel and at the same time
support a display, the GPU must context switch between the display updates and the
kernel. When the kernels take a short time, the user has little perception of this.
However, when they become longer, it can become quite annoying to the point of the
user not using the program.
 The solution to this issue is to ensure you have small kernels in the first instance.
However, that would likely mean your overall problem execution time would increase
considerably, as the GPU would need to continuously switch between the graphics
context and the CUDA context. There is no easy solution to this particular issue. Users
often prefer slightly slower programs if it means they can still use the machine for other
tasks.

 3.6 Finding and Avoiding Errors


 In this section, we will understand various debugging solutions provided by NVIDIA for
CUDA programmers to troubleshoot CUDA programs and solve errors. Also, we will learn
about different debugging techniques and approaches that can be followed by CUDA
programmers to troubleshoot and solve issues.
 NVIDIA provides a number of solutions that can be used to debug your host and GPU code
to find errors. A debugger provides vital information and assists programmers during
troubleshooting by allowing them to set breakpoints in the program, step into and over
functions, watch program expressions, and inspect memory contents at selected
points during the execution. Following is the list of solutions provided by NVIDIA :
1) Arm Forge Debugger - This is single tool that provides application developers ability
to debug hybrid MPI (Message Passing Interface), OpenMP, CUDA and OpenACC
applications on a single workstation or GPU cluster.
2) TotalView - This is a GUI-based tool that allows programmers to debug one or many
processes/threads with complete control over program execution. It supports basic
debugging operations like stepping through code to concurrent programs that take
advantage of threads, OpenMP, MPI, or GPUs.
3) NVIDIA Nsight - This is a development platform for heterogeneous computing that
comes with powerful debugging and profiling tools.
4) CUDA-GDB - This tool delivers seamless debugging experience that allows
programmers to debug both the CPU and GPU portions of application simultaneously.
CUDA-GDB can be used on Linux or MacOS, from the command line, DDD or
EMACS.
5) CUDA-MEMCHECK - This tool identifies memory access errors in your GPU code
and allows you to locate and resolve problems quickly. CUDA-MEMCHECK also
reports runtime execution errors, identifying situations that could result in an
“unspecified launch failure” error while your application is running.
 Having understood debugging tools, let us now understand debugging techniques and
approaches that CUDA programmers can apply in finding and avoiding errors in CUDA
applications.
1. Employ divide and conquer strategy
The divide-and-conquer approach is a common approach for debugging and is not GPU
specific. If you have thousands of lines of code in which the bug might be hiding, going
through each line one by one would take too long. What you need to do is divide your code
in half and see if the bug is in the first half or the second. Once you have identified which
half contains the problem, repeat the process until you narrow down to exactly where the
problem lies. This approach is useful where your kernel is causing some exception that is
not handled by the runtime. This usually means you get an error message and the program
stops running or, in the worst case, the machine simply hangs.
a) The first approach to this sort of problem should be to run through with the debugger,
stepping over each line at a high level. Sooner or later you will hit the call that triggers
the crash. Start with the host debugger and see at which point the error occurs. It will
most likely be the kernel invocation or the first call into the CUDA API after the kernel
invocation.
b) If you identify the issue as within the kernel, switch to a GPU debugger such as CUDA-
GDB. Then simply repeat the process following a single thread through the kernel
execution process. This should allow you to see the top-level call that triggers the fault.
If not, the cause may be a thread other than the one you are tracking. Typically the
“interesting” threads are threads 0 and 32 within any given block. Most CUDA kernel
errors that are not otherwise detected are either to do with interwarp or interblock
behaviour not working as the programmer imagined they would work.
c) Single step through the code and check that the answer for every calculation is what it is
expected to be. As soon as you have one wrong answer, you simply have to understand
why it’s wrong and often the solution is then clear. What you are attempting to do is a
very high level binary search. By stepping over the code until you hit the failure point,
you are eliminating a single level of functionality. You can then very quickly identify
the problem function / code line.
d) You can also use this approach without a debugger if for whatever reason you have no
access to such a debugger within your environment or the debugger is in some way
interfering with the visibility of the problem. Simply place #if 0 and #endif pre-
processor directives around the code you wish to remove for this run. Compile and run
the kernel and check the results. When the code runs error free, the error is likely to be
somewhere within the section that is removed. Gradually reduce the size of this section
until it breaks again. The point it breaks is a clear indicator of the likely source of the
issue.
e) You may also wish to try the approach of seeing if the program runs with the following:
i) One block of 1 thread , One block of 32 threads, One block of 64 threads


ii) Two blocks of 1 thread, Two blocks of 32 threads, Two blocks of 64 threads
iii) Sixteen blocks of 1 thread, Sixteen blocks of 32 threads, Sixteen blocks of 64
threads
iv) If one or more of these tests fail, it tells you there is some interaction of either the
threads within a warp, threads within a block, or blocks within a kernel launch that
is causing the issue. It provides a pointer as to what to look for in the code.
2. Use defensive programming and assertions
a) Defensive programming is programming that assumes the caller will do something
wrong.
b) char *ptr = malloc(1024);
free(ptr);
The code assumes that malloc will return a valid pointer to 1024 bytes of memory.
Given the small amount of memory we are requesting, in reality it is unlikely to fail. If it
fails, malloc will return a null pointer. For the code to work correctly, the free() function
also needs to handle null pointers. Thus, the start of the free function might be
if (ptr != NULL)

free(ptr);

The free() function needs to consider both receiving a null pointer and also a valid
pointer. A NULL pointer does not point to a valid area of allocated memory.
Typically, if you call free() with a null or an invalid pointer, a function that is written
defensively will not corrupt the heap storage, but will instead do nothing.
c) Defensive programming is about doing nothing erroneous in the case of bad inputs to a
function. However, this has a rather serious side effect. While the user no longer sees the
program crash, neither does the testing person, or the programmer. In fact, the program
now silently fails, despite the programming errors in the caller.
If a function has implicit requirements on the bounds or range of an input, this should be
checked. For example, if a parameter is an index into an array, you should absolutely
check this value to ensure the array access does not generate an out-of bounds access.
Another example is, when writing a function, one should assume worst-case inputs to
that function, i.e. inputs that are too large, too small, or inputs that violate some
property, condition, or invariant; the code should deal with these cases, even if the
programmer doesn't expect them to happen under normal circumstances.

d) When a program fails, having it fail silently is bad practice. It allows bugs to remain in
the code and go undetected. Programmers can use the assert directive to handle such cases.
Assertions help programmers by outputting the error. Thus, if a null pointer is not allowed as
one of the input parameters to the function, then replace the if (ptr != NULL) check with
the following :
// Null pointers not supported

assert(ptr != NULL);

This means we no longer require an additional indent, plus we document in the code the
precondition for entry into the function. Always make sure you place a comment above
the assertion explaining why the assertion is necessary. It will likely fail at some point in
the future and you want the caller of that function to understand as quickly as possible
why their call to the function is invalid. That caller may very often be yourself, so it is in
your own best interests to ensure it is commented. Six months from now you will have
forgotten why this precondition was necessary. You will then have to search around
trying to remember why it was needed. It also helps prevent future programmers from
removing the “incorrect” assertion. Removing assertions should not be done without
entirely understanding why the assertion was put there in the first place. In almost all
cases, removing the assert check will simply mask an error later in the program.
e) When using assertions, be careful not to mix handling of programming errors with valid
failure conditions. For example, this following code is incorrect :
char * ptr = malloc(1024);

assert(ptr != NULL);

It is a valid condition for malloc to return a NULL pointer. It does so when the heap
space is exhausted. This is something the programmer should have a valid error
handling case for, as it is something that will always happen eventually. Assertions
should be reserved for handling an invalid condition, such as index out of bounds,
default switch case when processing enumerations, etc.
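For contrast, a minimal sketch of the corrected pattern, which keeps run-time error handling separate from assertions (the function and index variable are hypothetical), could be :
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

int fill_byte(int idx)
{
    char *ptr = malloc(1024);
    if (ptr == NULL)                 // valid run-time failure : handle it, do not assert
    {
        fprintf(stderr, "out of memory\n");
        return -1;
    }
    // An out-of-range index here would be a programming error, so assert it
    assert(idx >= 0 && idx < 1024);
    ptr[idx] = 0;
    free(ptr);
    return 0;
}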
f) void assert(int expression); - This stops the kernel execution if expression is equal to
zero. If the program is run within a debugger, this triggers a breakpoint and the
debugger can be used to inspect the current state of the device. Otherwise, each thread
for which expression is equal to zero prints a message to stderr after synchronization
with the host via cudaDeviceSynchronize(), cudaStreamSynchronize(), or
cudaEventSynchronize().


The format of assert message is as follows :


<filename>:<line number>:<function>:

block: [blockIdx.x,blockIdx.y,blockIdx.z],

thread: [threadIdx.x,threadIdx.y,threadIdx.z]

Assertion `<expression>` failed.

g) Any subsequent host-side synchronization calls made for the same device will return
cudaErrorAssert. No more commands can be sent to this device until cudaDeviceReset()
is called to reinitialize the device.
h) If expression is different from zero, the kernel execution is unaffected. For example, the
following program from source file test.cu
#include <assert.h>

__global__ void testAssert(void)
{
    int is_one = 1;
    int should_be_one = 0;

    // This will have no effect
    assert(is_one);

    // This will halt kernel execution
    assert(should_be_one);
}

int main(int argc, char* argv[])
{
    testAssert<<<1,1>>>();

    cudaDeviceSynchronize();

    return 0;
}
Above code will output:

test.cu:19: void testAssert(): block: [0,0,0], thread: [0,0,0] Assertion


`should_be_one` failed.

i) Assertions are for debugging purposes. When using defensive programming and
assertions, the processor spends time checking conditions that for the most part will always
be valid. It can do this on each and every function call, loop iteration, etc., depending on
how widespread the use of assertions is. All this can affect performance and it is
therefore recommended to disable them in production code. They can be disabled at
compile time by defining the NDEBUG preprocessor macro before including assert.h.
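For example, assuming the kernel source file is called app.cu, assertions can be compiled out as follows :
// Either define NDEBUG before including assert.h in the source ...
#define NDEBUG
#include <assert.h>

// ... or define it on the command line when building, e.g.
//     nvcc -DNDEBUG -o app app.cu
// With NDEBUG defined, assert(expression) expands to nothing.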
3. Use CUDA-MEMCHECK
 CUDA-MEMCHECK is a functional correctness checking suite included in the CUDA
toolkit. This suite contains multiple tools like memcheck, racecheck, initcheck, and
synccheck.
a) The memcheck tool is capable of precisely detecting and attributing out of bounds
and misaligned memory access errors in CUDA applications. The tool also reports
hardware exceptions encountered by the GPU. The memcheck tool can also be
enabled in integrated mode inside CUDA-GDB.
b) The racecheck tool can report shared memory data access hazards that can cause data
races.
c) The initcheck tool can report cases where the GPU performs uninitialized accesses to
global memory.
d) The synccheck tool can report cases where the application is attempting invalid
usages of synchronization primitives.
 CUDA-MEMCHECK can be run in standalone mode where the user's application is
started under CUDA-MEMCHECK.
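As a usage sketch, assuming the application binary is called my_app, the standalone invocations look like :
cuda-memcheck ./my_app
cuda-memcheck --tool racecheck ./my_app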
4. Use version control
Version control is a key aspect of any professional software development. Version control
is important to keep track of changes and keep every team member working off the latest
version.
a) Consider for a moment that you are required to debug an error in a very large program. You
will realize that this is very hard if the program was not versioned regularly or whenever

major changes were made. Versioning code allows programmers to look at a previous
working version and compare changes with the current code to see where the problem
may have been introduced.
b) Programmers are generally fairly confident that the changes they have implemented in
code will work without problems. However, when things don't quite go to plan,
remembering exactly which changes were made can be very difficult. Without a
working backup of the program it can be difficult (sometimes nearly impossible) to get
back to exactly the working version that existed before the changes.
c) Most programs in the professional world are developed in teams. A colleague can be
extremely helpful in providing a fresh pair of eyes with which to see a problem. If you
have a versioned or baselined copy of the working code it makes it relatively easy to
look simply at the differences and see what is now breaking the previously working
solution. Without these periodic baselines it’s not easy to identify the place where the
error might be, and thus instead of a few hundred lines of code, you may have to look at
a few thousand.
5. Use printf
Programmers can use a lot of printfs for debugging. This allows the programmer to execute the
code in “Release” mode, and see exactly what is going wrong at a fast execution speed. It is
important to make sure that all printfs are disabled through a macro when you want to move
code to production.
a) int printf(const char *format[, arg, ...]);
prints formatted output from a kernel to a host-side output stream.
The in-kernel printf() function behaves in a similar way to the standard C-library printf()
function. In essence, the string passed in as format is output to a stream on the host, with
substitutions made from the argument list wherever a format specifier is encountered.
b) The printf() command is executed as any other device-side function. That is, from a
multi-threaded kernel, this means that a straightforward call to printf() will be executed
by every thread, using that thread's data as specified. It is up to the programmer to limit
the output to a single thread if only a single output string is desired.
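A minimal sketch of limiting the output to a single thread (the kernel name here is hypothetical) is :
#include <stdio.h>

__global__ void debugKernel(const float *data)
{
    // Only thread 0 of block 0 prints, so a multi-threaded launch emits one line
    if (blockIdx.x == 0 && threadIdx.x == 0)
        printf("first element = %f\n", data[0]);
}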
6. Write back output to files
Even with debugging, the data structures you use are hard to check because of the massive
parallelism that is inherent with CUDA. You can try writing out the effects of the intermediate
steps of the algorithm by doing a cudaMemcpy from device to host. Output data can be written
into CSV files or image files and can be checked for any issues. With the data in front of you,
you can visualize it and check for errors in your code.
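A small host-side sketch of this idea (the buffer name, size and file name are placeholders) might be :
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Copy an intermediate device buffer back to the host and dump it as CSV
void dump_to_csv(const float *d_data, int n, const char *filename)
{
    float *h_data = (float *)malloc(n * sizeof(float));
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    FILE *fp = fopen(filename, "w");
    for (int i = 0; i < n; i++)
        fprintf(fp, "%d,%f\n", i, h_data[i]);
    fclose(fp);
    free(h_data);
}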

7. Incremental and bottom-up program development

 One of the most effective ways to localize errors is to develop the program
incrementally, and test it often, after adding each piece of code. It is highly likely that if
there is an error, it occurs in the last piece of code that you wrote. With incremental
program development, the last portion of code is small; the search for bugs is therefore
limited to small code fragments. An added benefit is that small code increments will
likely lead to few errors, so the programmer is not overwhelmed with long lists of
errors.
 Bottom-up development maximizes the benefits of incremental development. With
bottom-up development, once a piece of code has been successfully tested, its behaviour
won't change when more code is incrementally added later. Existing code does not rely
on the new parts being added, so if an error occurs, it must be in the newly added code
(unless the old parts were not tested well enough).

 3.7 Part A : Short Answered Questions [2 Marks Each]


Q.1 What are common problems faced by CUDA programmers ?

 Answer : Common problems faced by CUDA programmers are

1) Using CUDA APIs


2) Not checking bounds

3) Using invalid device handles


4) Not understanding the compute dependent features and programming environment
Q.2 Which are commonly used CUDA error handling APIs ?

 Answer : Following are commonly used CUDA error handling APIs:

1. cudaGetErrorName - Returns a string containing the name of an error code

2. cudaGetErrorString - Returns the description string for an error code


3. cudaGetLastError - Returns the last error that has been produced by any of the runtime
calls in the same host thread and resets it to cudaSuccess
4. cudaPeekAtLastError - This API is same as cudaGetLastError but does not reset the
error to cudaSuccess


Q.3 What is RACE hazard ?


 Answer : A race hazard occurs when sections of the program “race” toward a critical
point, such as a memory read/write. Sometimes warp 0 may win the race and the result is
correct. Other times warp 1 might get delayed and warp 3 hits the critical section first,
producing the wrong answer. The major problem with race hazards is they do not always
occur. This makes debugging them and trying to place a breakpoint on the error difficult. The
second feature of race hazards is they are extremely sensitive to timing disturbances. Thus,
adding a breakpoint and single-stepping the code always delays the thread being observed.
This delay often changes the scheduling pattern of other warps, meaning the particular
conditions of the wrong answer may never occur.
Q.4 Explain common mistakes by programmers relating to synchronization.
 Answer : Following are common mistakes made by programmers relating to
synchronization :
1. Using the default stream
2. Using synchronous memory copies
3. Not using pinned memory
4. Overuse of synchronization primitives
Q.5 What is memory leak and what is common cause of it ?
 Answer : Memory leaks are device side allocations that have not been freed by the time
the context is destroyed. The most common cause of memory leaks is where a program
allocates, or mallocs, memory space but does not free that space later.
Q.6 List CUDA tools/solutions that help developers in debugging and solving
CUDA program errors.
 Answer : Following is the list of debugging solutions available for CUDA programs :
1) Arm Forge Debugger 2) TotalView
3) NVIDIA Nsight 4) CUDA-GDB
5) CUDA-MEMCHECK
Q.7 Describe divide and conquer error finding strategy.
 Answer : The divide-and-conquer approach is a common approach for debugging and is
not GPU specific. If you have thousands of lines of code in which the bug might be hiding,
going through each one line by line would take too long. What you need to do is divide your


code in half and see if the bug is in the first half or the second. Once you’ve identified which
half contains the problem, repeat the process until you narrow down to exactly where the
problem lies. This approach is useful where your kernel is causing some exception that is not
handled by the runtime.
Q.8 What is the CUDA-MEMCHECK tool used for ?
 Answer : CUDA-MEMCHECK is a functional correctness checking suite included in the
CUDA toolkit. This suite contains multiple tools like memcheck, racecheck, initcheck, and
synccheck.
a) The memcheck tool is capable of precisely detecting and attributing out of bounds and
misaligned memory access errors in CUDA applications. The tool also reports hardware
exceptions encountered by the GPU.
b) The racecheck tool can report shared memory data access hazards that can cause data
races.
c) The initcheck tool can report cases where the GPU performs uninitialized accesses to
global memory.
d) The synccheck tool can report cases where the application is attempting invalid usages
of synchronization primitives.
Q.9 Explain defensive programming and assertions technique.
 Answer : Defensive programming is about doing nothing erroneous in the case of bad
inputs to a function. However, this has a rather serious side effect. While the user no longer
sees the program crash, neither does the testing person, or the programmer. In fact, the
program now silently fails, despite the programming errors in the caller. When a program fails,
to have it fail silently is bad practice. It allows bugs to remain in the code and go undetected.
Programmers can use the assert directive to handle such cases. Asserts help programmers by
outputting the error.
Q.10 Why is version control important ?
 Answer : Version control is important to

a) Keep track of changes and keep every team member working off the latest version
b) Look at previous working version and compare changes with current code to see where
possibly problem has got introduced
c) Keep working backup of the program
d) See visual differences between baselined copy of the working code and current code


 3.8 Part B : Long Answered Questions


Q.1 Describe CUDA error handling APIs and explain how they can be used for error
checking. (Refer section 3.2)
Q.2 Describe parallel programming issues. (Refer section 3.3)

Q.3 Explain synchronization problems along with possible solutions. (Refer section 3.4)
Q.4 Describe different algorithmic issues and strategies to tackle them. (Refer section 3.5)

Q.5 What are tools and techniques that you should employ for finding and solving CUDA
errors ? (Refer section 3.6)






UNIT - IV

4 OpenCL Basics

Syllabus

OpenCL Standard - Kernels - Host Device Interaction - Execution Environment - Memory


Model - Basic OpenCL Examples.

Contents

4.1 OpenCL Standards

4.2 Kernels and Host Device Interaction

4.3 The OpenCL Architecture - OpenCL Programming Models Specification

4.4 Exploring OpenCL Memory Model - The Memory Objects

4.5 OpenCL Program and OpenCL Programming Examples

4.6 Part A : Short Answered Questions [2 Marks Each]

4.7 Part B : Long Answered Questions


 4.1 OpenCL Standards

 4.1.1 OpenCL Introduction


 OpenCL (Open Computing Language) provides a common language, programming
interfaces, and hardware abstractions enabling developers to accelerate applications with
task-parallel or data-parallel computations in a heterogeneous computing environment
consisting of the host CPU and any attached OpenCL “devices”. OpenCL devices may or
may not share memory with the host CPU, and typically have a different machine
instruction set, so the OpenCL programming interfaces assume heterogeneity between the
host and all attached devices.

 Open Computing Language is a framework for writing programs that execute across
heterogeneous platforms, consisting for example of CPUs, GPUs, DSPs, and FPGAs.
OpenCL specifies a programming language (based on C99) for programming these devices
and application programming interfaces (APIs) to control the platform and execute
programs on the compute devices. OpenCL provides a standard interface for parallel
computing using task-based and data-based parallelism.
 OpenCL speeds applications by offloading their most computationally intensive code onto
accelerator processors - or devices. OpenCL developers use C or C++-based kernel
languages to code programs that are passed through a device compiler for parallel execution
on accelerator devices.
 The key programming interfaces provided by OpenCL include functions for enumerating
available target devices (CPUs, GPUs, and Accelerators of various types), managing
“contexts” containing the devices to be used, managing memory allocations, performing
host-device memory transfers, compiling OpenCL programs and “kernel” functions to be
executed on target devices, launching kernels on target devices, querying execution
progress, and checking for errors. (A minimal device-enumeration sketch is given after this list.)
 OpenCL provides the industry with the lowest 'close-to-metal' processor-agile execution
layer for accelerating applications, libraries and engines, and also providing a code
generation target for compilers. Unlike 'GPU-only' APIs, such as Vulkan, OpenCL enables
use of a diverse range of accelerators including multi-core CPUs, GPUs, DSPs, FPGAs and
dedicated hardware such as inferencing engines.
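As a minimal enumeration sketch (picking the first available platform and a single GPU device, with error checking omitted for brevity), the interfaces mentioned above are used as follows :
#include <CL/cl.h>
#include <stdlib.h>

cl_device_id pick_first_gpu(void)
{
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, NULL, &num_platforms);            /* query how many platforms exist */

    cl_platform_id *platforms = malloc(num_platforms * sizeof(cl_platform_id));
    clGetPlatformIDs(num_platforms, platforms, NULL);     /* fetch the platform handles */

    cl_device_id device;
    clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    free(platforms);
    return device;
}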


Fig. 4.1.1 : Open source software tools that enable OpenCL kernels to be
executed over multiple target APIs

 Both AMD and NVIDIA have released OpenCL implementations supporting their
respective GPUs. These devices require a large number of OpenCL work-items and work-
groups to fully saturate the hardware and hide latency.
 NVIDIA GPUs use a scalar processor architecture for the individual PEs seen by OpenCL,
enabling them to work with high efficiency on most OpenCL data types. AMD GPUs use a
vector architecture, and typically achieve best performance when OpenCL work-items
operate on 4-element vector types such as float4.
 In many cases, a vectorized OpenCL kernel can be made to perform well on x86 CPUs and
on AMD and NVIDIA GPUs, though the resulting kernel code may be less readable than
the scalar equivalent. Differences in low level GPU architecture including variations on
what memory is cached and what memory access patterns create bank conflicts affect
kernel optimality. Vendor-provided OpenCL literature typically contains low level
optimization guidelines.
 4.1.2 OpenCL Standards History
1. OpenCL was initially proposed by Apple in association with technical teams at AMD, IBM,
Qualcomm, Intel, and NVIDIA, and was submitted to the Khronos Group.


2. The first specification, OpenCL 1.0, was released by the Khronos Group in 2008. OpenCL 1.0 defined
the host application programming interface (API) and the OpenCL C kernel language used
for executing data-parallel programs on different heterogeneous devices.
3. Further releases of OpenCL 1.1 and OpenCL 1.2 enhanced the OpenCL standard with
features such as OpenGL interoperability, additional image formats, synchronization
events, and device partitioning.
4. In November 2013, the Khronos Group announced the ratification and public release of the
finalized OpenCL 2.0 specification.
5. Multiple additional features were added to the OpenCL standard, such as shared virtual
memory, nested parallelism, and generic address spaces. Advanced features have the
potential to simplify parallel application development, and improve the performance
portability of OpenCL applications.
 4.1.3 OpenCL Objective
1. Open programming standards objective is to develop a common set of programming
standards that are acceptable to a range of competing needs and requirements.
2. The Khronos Group has developed an API that is general enough to run on significantly
different architectures while being adaptable enough that each hardware platform can still
achieve high performance.
3. Using the core language and correctly following the specification, any program designed
for one vendor can execute on another vendor’s hardware. OpenCL creates portable,
vendor- and device-independent programs that are capable of being accelerated on many
different hardware platforms.
4. The code that executes on an OpenCL device, which in general is not the same device as
the host central processing unit (CPU), is written in the OpenCL C language. OpenCL C is
a restricted version of the C99 language with extensions appropriate for executing data-
parallel code on a variety of heterogeneous devices.
5. Additionally, the OpenCL C programming language implements a subset of the C11
atomics and synchronization operations. While the OpenCL API itself is a C API, there are
third-party bindings for many languages, including Java, C++, Python, and .NET.
6. Also, popular performance-oriented libraries in domains such as linear algebra and
computer vision have integrated OpenCL to leverage heterogeneous platforms.


 4.1.4 OpenCL Components


1. C Host API : C API used to control the devices. (Ex : Memory transfer, kernel compilation)
2. OpenCL C : Used on the device (Kernel Language)

Fig. 4.1.2 : OpenCL Components

 4.1.5 OpenCL - Hardware and Software Vendors


There are various hardware vendors who support OpenCL. Every OpenCL vendor provides
OpenCL runtime libraries, and these runtimes are capable of running only on their specific
hardware architectures. Across different vendors, as well as within a single vendor, there may be
different types of architectures which might need a different approach towards OpenCL
programming. Below are various hardware vendors who provide an implementation of
OpenCL so as to use their underlying hardware efficiently.

 4.1.5.1 Advanced Micro Devices, Inc. (AMD)


1. With the launch of the AMD A Series APU, one of the industry's first Accelerated Processing
Units (APU), AMD is leading the efforts of integrating both the x86_64 CPU and GPU dies
in one chip.


2. It has four cores of CPU processing power, and also four or five graphics SIMD engines,
depending on which silicon part is purchased.
3. The following Fig. 4.1.3 depicts the block diagram of AMD APU architecture.

Fig. 4.1.3 : AMD APU Architecture

4. An AMD GPU consists of a number of Compute Units (CU) and each CU has 16 ALUs.
Further, each ALU is a VLIW4 SIMD processor and can execute a bundle of four or
five independent instructions.
5. Each CU can be issued a group of 64 work-items, which form the work-group (wavefront).
6. AMD Radeon™ HD 6XXX graphics processors use this design. Starting with the AMD
Radeon HD 7XXX series of graphics processors, there were significant
architectural changes.
7. AMD introduced the new Graphics Core Next (GCN) architecture.

 4.1.5.2 NVIDIA®
1. NVIDIA has a GPU architecture codenamed "Kepler". The GeForce® GTX 680 is one
Kepler architectural silicon part.
2. Each Kepler GPU consists of different configurations of Graphics Processing Clusters
(GPC) and streaming multiprocessors.
3. The GTX 680 consists of four GPCs and eight SMXs as depicted in below Fig. 4.1.4.


Fig. 4.1.4 : GTX 680 – A Kepler architectural silicon part

4. Kepler architecture is part of the GTX 6XX and GTX 7XX family of NVIDIA discrete
cards. Prior to Kepler, NVIDIA had Fermi architecture which was part of the GTX 5XX
family of discrete and mobile graphic processing units.

 4.1.5.3 Intel®
1. The Sandy Bridge and Ivy Bridge processor families support Intel's OpenCL
implementation.
2. This architecture is also analogous to AMD's APU. In these processor architectures
Intel also integrated a GPU into the same silicon as the CPU.
3. Intel changed the design of the L3 cache and allowed the graphics cores to access the
L3, which is also called the last level cache. It is because of this L3 sharing that
graphics performance is good on Intel.
4. Each of the CPU cores, including the graphics execution units, is connected via a ring bus. Also,
each execution unit is a true parallel scalar processor.


5. Sandy Bridge provides the graphics engine HD 2000, with six Execution Units (EU), and
HD 3000 (12 EU), and Ivy Bridge provides HD 2500 (six EU) and HD 4000 (16 EU).
6. The following Fig. 4.1.5 depicts the Sandy bridge architecture
with a ring bus, which acts as an interconnect between the
cores and the HD graphics :

Fig. 4.1.5 : Intel’s Sandy bridge architecture

 4.1.5.4 ARM Mali™ GPUs


1. ARM provides GPUs by the name of Mali Graphics processors. The Mali T6XX series of
processors come with two, four, or eight graphics cores. These graphic engines deliver
graphics compute capability to entry level smartphones, tablets, and Smart TVs.
2. Below Fig. 4.1.6 depicts the Mali T628 graphics processor.

Fig. 4.1.6 : Mali T628 Graphics Processor

3. Mali T628 has eight shader cores or graphics cores. These cores also support Renderscript
APIs besides supporting OpenCL.


4. Besides the four key competitors, companies such as TI (DSP), Altera (FPGA), and Oracle
are providing OpenCL implementations for their respective hardware.

 4.2 Kernels and Host Device Interaction

 4.2.1 Kernel
i. A kernel is a small unit of execution that performs a clearly defined function and that can
be executed in parallel. Such a kernel can be executed on each element of an input stream
(also termed an NDRange) or simply at each point in an arbitrary index space. A kernel is
analogous to, and on some devices identical to, what graphics programmers call a shader
program.
ii. The kernel is not to be compared with an OS kernel, which controls hardware. The most
basic form of an NDRange is simply mapped over input data and produces one output item
for each input tuple. Subsequent extensions of the basic model provide random-access
functionality, variable output counts, and reduction / accumulation operations.
iii. The OpenCL programming model consists of producing complicated task graphs from
data-parallel execution nodes. In a given data-parallel execution, commonly known as a
kernel launch, a computation is defined in terms of a sequence of instructions that executes
at each point in an N-dimensional index space. It is a common, though not required,
formulation of an algorithm that each computation index maps to an element in an input
data set.
iv. The OpenCL data-parallel programming model is hierarchical. The hierarchical subdivision
can be specified in two ways :
o Explicitly - The developer defines the total number of work-items to execute in parallel,
as well as the division of work-items into specific work-groups.
o Implicitly - The developer specifies the total number of work-items to execute in
parallel, and OpenCL manages the division into work-groups.

 4.2.2 Host to Device Interaction


1. An OpenCL application is split into host and device parts with host code written using a
general programming language such as C or C++ and compiled by a conventional compiler
for execution on a host CPU. An OpenCL platform consists of a host connected to one or
more OpenCL devices. The platform model defines the roles of the host and the devices,
and provides an abstract hardware model for devices. A device is divided into one or more
compute units, which are further divided into one or more processing elements.
2. The device compilation phase can be done online, i.e. during execution of an application
using special API calls. It can alternatively be compiled before executing the application

into the machine binary or special portable intermediate representation defined by Khronos
called SPIR-V. There are also domain specific languages and frameworks that can
compile to OpenCL either using source-to-source translations or generating binary / SPIR-
V, for example Halide.

Fig. 4.2.1 : OpenCL System Role Players and Host to device interaction
iv. Application host code is frequently written in C or C++ but bindings for other languages
are also available, such as Python. Kernel programs can be written in a dialect of C
(OpenCL C) or C++ (C++ for OpenCL) that enables a developer to program
computationally intensive parts of their application in a kernel program. All versions of the
OpenCL C language are based on C99. The community driven C++ for OpenCL language
brings together capabilities of OpenCL and C++17.
v. An OpenCL “program” contains one or more “kernels” and any supporting routines that
run on a target device. An OpenCL kernel is the basic unit of parallel code that can be
executed on a target device.
 4.2.3 C++ for OpenCL Kernel Language
The OpenCL working group has transitioned from the original OpenCL C++ kernel
language first defined in OpenCL 2.0 to C++ for OpenCL developed by the open source
community to provide improved features and compatibility with OpenCL C. C++ for OpenCL
is supported by Clang. It enables developers to use most C++17 features in OpenCL kernels. It
is largely backwards compatible with OpenCL C 2.0 enabling it to be used to program
accelerators with OpenCL 2.0 or above with conformant drivers that support SPIR-V.
 4.2.4 Kernel Language Extensions
Some extensions are available to the existing published kernel language standards.
Conformant compilers and drivers may optionally support the extensions and so there is a
mechanism to detect their support at the compile time. Developers should be aware that not all
extensions may be supported across all devices.


 4.3 The OpenCL Architecture - OpenCL Programming Models


Specification
The OpenCL architecture, also known as the OpenCL specification, is defined in four parts, which it
refers to as models. The models are explained below.

 4.3.1 Platform Model


i. It specifies that there is one host processor coordinating execution, and one or more device
processors whose job it is to execute OpenCL C kernels. It also defines an abstract
hardware model for devices.
The OpenCL platform model allows to build a topology of a system with a coordinating
host processor, and one or more devices that will be targeted to execute OpenCL kernels. In
order for the host to request that a kernel be executed on a device, a context must be
configured that enables the host to pass commands and data to the device.
ii. The platform model is key to application development for portability between OpenCL-
capable systems. Even within a single capable system, there could be a number of different
OpenCL platforms which could be targeted by any given application. The platform model’s
API allows an OpenCL application to adapt and choose the desired platform and compute
device for executing its computation. In the API, a platform can be thought of as a common
interface to a vendor-specific OpenCL runtime. The devices that a platform can target are thus
limited to those with which a vendor knows how to interact. For example, if company A’s
platform is chosen, it likely will not be able to communicate with company B’s GPU.
iii. However, platforms are not necessarily vendor exclusive. For example, implementations
from AMD and Intel should be able to create platforms that target each other’s x86 CPUs
as devices.
iv. The platform model also presents an abstract device architecture that programmers target
when writing OpenCL C code. Vendors map this abstract architecture to the physical
hardware. The platform model defines a device as a group of multiple compute units,
where each compute unit is functionally independent. Compute units are further divided
into processing elements. Fig. 4.3.1 illustrates this hierarchical model. As an example, the
AMD Radeon R9 290X graphics card (device) comprises 44 vector processors (compute
units). Each compute unit has four 16-lane SIMD engines, for a total of 64 lanes
(processing elements). Each SIMD lane on the Radeon R9 290X executes a scalar
instruction. This allows the GPU device to execute a total of 44 × 16 × 4 = 2816
instructions at a time.


Fig. 4.3.1 : An OpenCL platform with multiple compute devices


(compute device contains one or more compute units)

 4.3.2 Execution Model


i. OpenCL applications run on the host, which submits work to the compute devices.
1. Work Item - Basic unit of work on a compute device
2. Kernel - The code that runs on a work item (Basically a C function)
3. Program - Collection of kernels and other functions
4. Context - The environment where work-items execute (Devices, their memories and
command queues)
5. Command Queue - Queue used by the host to submit work (kernels, memory copies) to
the device.

Fig. 4.3.2 : OpenCL Execution model


ii. Execution model defines how the OpenCL environment is configured by the host, and how
the host may direct the devices to perform work. This includes defining an environment for
execution on the host, mechanisms for host-device interaction, and a concurrency model
used when configuring kernels. The concurrency model defines how an algorithm is
decomposed into OpenCL work-items and work-groups.
iii. CONTEXTS - In OpenCL, a context is an abstract environment within which coordination
and memory management for kernel execution is valid and well defined. A context
coordinates the mechanisms for host-device interaction, manages the memory objects
available to the devices, and keeps track of the programs and kernels that are created for
each device.

Fig. 4.3.3 : Context


iv. COMMAND - QUEUES - The execution model specifies that devices perform tasks
based on commands. These commands are sent from the host to the device. Actions
specified by commands include executing kernels, performing data transfers, and
performing synchronization. A device can also send specific commands to host. A
command-queue forms the communication mechanism that the host uses to request action
by a device. Once the host has decided which devices to work with and a context has been
created, one command-queue needs to be created per device. Each command-queue is
associated with only one device. A queue per device is necessary, as the host may be
connected to multiple devices and needs to communicate with a specific device.

Fig. 4.3.4 : OpenCL device command execution


v. EVENTS - Events are the objects used to specify dependencies between commands. Many
of clEnqueue API calls have three parameters in common : a pointer to a list of events that
specify dependencies for the current command, the number of events in the wait list, and a
pointer to an event that will represent the execution of the current command. The returned
event can in turn be used to specify a dependency for future events. A wait list is
maintained, which is an array of events used to specify dependencies for a command.
vi. OpenCL defines how a kernel executes at each point of a problem (an N-dimensional
vector); this can also be seen as the decomposition of a task into work-items.
What needs to be defined is (a minimal host-side sketch follows this list) :
o Global work-size : Number of elements of the input vector.
o Global offset
o Work-group size : Size of the compute partition.
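Tying the context and command-queue ideas above together, a minimal host-side sketch (for a single, already selected device) is :
#include <CL/cl.h>

/* Create a context for one device and give it a command-queue; the host will
   later submit kernels and memory copies through this queue. */
cl_command_queue make_queue(cl_device_id device, cl_context *ctx_out)
{
    cl_int err;

    /* The context coordinates host-device interaction and owns the memory objects */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* One command-queue per device, associated with that context */
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    *ctx_out = ctx;
    return queue;
}
A kernel launch over an NDRange using such a queue is shown in the sketch at the end of section 4.3.3.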

Fig. 4.3.5 : OpenCL Framework

 4.3.3 Kernel Programming Model


i. Kernel programming model defines how the concurrency model is mapped to physical
hardware. The execution model API enables an application to manage the execution of
OpenCL commands. The OpenCL commands describe the movement of data and the
execution of kernels that process this data to perform some meaningful task. OpenCL
kernels are the parts of an OpenCL application that actually execute on a device.
ii. Similar to many CPU concurrency models, an OpenCL kernel is syntactically similar to a
standard C function; the key differences are a set of additional keywords and the
concurrency model that OpenCL kernels implement. When developing concurrent
programs for a CPU using operating system threading APIs or OpenMP, for example, the
programmer considers the physical resources available (e.g. CPU cores) and the overhead


of creating and switching between threads when their number substantially exceeds the
resource availability.
iii. A central goal of OpenCL is to represent parallelism programmatically at the finest
granularity possible. The generalization of the OpenCL interface and the low-level kernel
language allows efficient mapping to a wide range of hardware. The following discussion
presents three versions of a function that performs an element-wise vector addition: a
serial C implementation, a threaded C implementation, and an OpenCL C implementation.

Fig. 4.3.6 : OpenCL Programming Model


iv. The devices can run data- and task-parallel work. A kernel can be executed as a function
of multi-dimensional domains of indices. Each element is called a work-item; the total
number of indices is defined as the global work-size. The global work-size can be divided
into sub-domains, called work-groups, and individual work-items within a group can
communicate through global or locally shared memory. Work-items are synchronized
through barrier or fence operations. The host / device architecture with a single platform,
consisting of a GPU and a CPU is depicted by Fig. 4.3.6.
v. First, an OpenCL application is built by querying the runtime to determine
which platforms are present. There can be any number of different OpenCL
implementations installed on a single system. The desired OpenCL platform can be
selected by matching the platform vendor string to the desired vendor name, such as
“Advanced Micro Devices, Inc.” The next step is to create a context. As shown in
Fig. 4.3.6, an OpenCL context has associated with it a number of compute devices (for
example, CPU or GPU devices). Within a context, OpenCL guarantees a relaxed
consistency between these devices. This means that memory objects, such as buffers or
images, are allocated per context; but changes made by one device are only guaranteed to
be visible by another device at well-defined synchronization points. To enforce the correct
order of execution, OpenCL provides events, with the ability to synchronize on a given
event.


vi. Various operations are performed with respect to a given context; there also are many
operations that are specific to a device. For example, program compilation and kernel
execution are done on a per-device basis. Performing work with a device, such as
executing kernels or moving data to and from the device’s local memory, is done using a
corresponding command queue. A command queue is associated with a single device and a
given context; all work for a specific device is done through this interface. Note that while
a single command queue can be associated with only a single device, there is no limit to
the number of command queues that can point to the same device.
vii. Usually, OpenCL programs follow the same pattern : given a specific platform, select a
device or devices to create a context, allocate memory, create device-specific command
queues, and perform data transfers and computations. Generally, the platform is the
gateway to accessing specific devices; given these devices and a corresponding context,
the application is independent of the platform. Given a context, the application can
(a condensed sketch of this pattern appears at the end of this subsection),
o Create one or more command queues.
o Create programs to run on one or more associated devices.
o Create kernels within those programs.
o Allocate memory buffers or images, either on the host or on the device(s). (Memory can
be copied between the host and device.)
o Write data to the device.
o Submit the kernel (with appropriate arguments) to the command queue for execution.
o Read data back to the host from the device.
viii. Synchronization
The two major entities of synchronization in OpenCL are work-items in a single work-
group and command-queue(s) in a single context. Work-group barriers enable
synchronization of work-items in a work-group. Each work-item in a work-group must first
execute the barrier before executing any instruction beyond this barrier. Either all of, or
none of, the work-items in a work-group must encounter the barrier. A barrier or
mem_fence operation does not have global scope, but is relevant only to the local
workgroup on which they operate.
There are two types of synchronization between commands in a command- queue,
o Command-queue barrier - enforces ordering within a single queue. Any resulting
changes to memory are available to the following commands in the queue.


o Events - enforces ordering between, or within, queues. Enqueued commands in OpenCL


return an event identifying the command as well as the memory object updated by it.
This ensures that following commands waiting on that event see the updated memory
objects before they execute.
OpenCL 2.0 provides additional synchronization options as well.
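Putting the pattern of the list above together, a condensed, hedged sketch of an element-wise vector addition (kernel written in OpenCL C, host code in C; the context, device and command-queue are assumed to have been created already, and error checks are omitted to keep the pattern visible) might be :
#include <CL/cl.h>

/* OpenCL C kernel source : one work-item adds one element */
static const char *src =
    "__kernel void vecadd(__global const float *a,   \n"
    "                     __global const float *b,   \n"
    "                     __global float *c)         \n"
    "{                                               \n"
    "    int i = get_global_id(0);                   \n"
    "    c[i] = a[i] + b[i];                         \n"
    "}                                               \n";

void vecadd_host(cl_context ctx, cl_device_id dev, cl_command_queue queue,
                 const float *a, const float *b, float *c, size_t n)
{
    cl_int err;
    size_t bytes = n * sizeof(float);

    /* Allocate memory buffers in the context's global memory */
    cl_mem buf_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  bytes, (void *)a, &err);
    cl_mem buf_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  bytes, (void *)b, &err);
    cl_mem buf_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

    /* Create a program for the context, build it for the device, extract the kernel */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "vecadd", &err);

    /* Submit the kernel with appropriate arguments to the command-queue */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &buf_c);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* Read the result back to the host (blocking read) */
    clEnqueueReadBuffer(queue, buf_c, CL_TRUE, 0, bytes, c, 0, NULL, NULL);

    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseMemObject(buf_a);
    clReleaseMemObject(buf_b);
    clReleaseMemObject(buf_c);
}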

 4.3.4 Memory Model

 4.3.4.1 Basic Concept of Memory Model


i. Memory model defines memory object types, and the abstract memory hierarchy that
kernels use regardless of the actual underlying memory architecture. It also contains
requirements for memory ordering and optional shared virtual memory between the host
and devices.
ii. The OpenCL memory hierarchy, shown below in Fig. 4.3.7, is structured to
“loosely” resemble the physical memory configurations in ATI and NVIDIA hardware.
The mapping is not 1 to 1 since NVIDIA and ATI define their memory hierarchies
differently. However, the basic structure of top-level global memory vs. local memory per work-
group is consistent across both platforms. Furthermore, the lowest level execution unit has
a small private memory space for program registers.

Fig. 4.3.7 : OpenCL Memory Hierarchy


iii. The work-groups can communicate through shared memory and synchronization
primitives; however, their memory access is independent of other work-groups. This is


essentially a data-parallel execution model, where the domain of independent execution
units is closely tied to and defined by the underlying memory access patterns. For these
groups, OpenCL implements a relaxed consistency, shared memory model. There are
exceptions, and some compute devices (notably CPUs) can execute task-parallel compute
kernels; however, the bulk of OpenCL applications on GPGPU hardware will execute
strictly data-parallel workers.
iv. An important issue to take care of when programming OpenCL kernels is that memory
access on the DRAM global and local memory blocks is not protected in any way. This
means that segfaults are not reported when work-items dereference memory outside their
own global storage. As a result, GPU memory set aside for the OS can be clobbered
unintentionally, which can result in behaviors ranging from benign screen flickering up to
frustrating blue screens of death and OS-level crashes.
v. Another memory-related issue is that mode-switches may result in GPU memory
allocated to OpenCL being cannibalized by the operating system. Typically
the OS allocates some portion of the GPU memory to the “primary-surface”, which is a
frame buffer store for the rendering of the OS. If the resolution is changed during OpenCL
execution, and the size of this primary-surface needs to grow, it will use OpenCL memory
space to do so. Luckily these events are caught at the driver level and will cause any call
to the OpenCL runtime to fail and return an invalid context error. Memory fences are
possible within threads in a work-group as well as synchronization barriers for threads at
the work-item level (between individual threads in a processing element) as well as at the
work-group level (for coarse synchronization between work-groups).
vi. On the host side, blocking API functions can perform waits for certain events to complete,
such as all events in the queue to finish, specific events to finish, etc. Using this coarse
event control, the host can decide to run work in parallel across different devices or
sequentially, depending on how markers are placed in the work-queue.
vii. Finally, care should be taken when statically allocating local data (per work-group). One
should check the return conditions from the host API for flags indicating that one is
allocating too much per work-group; however, one should also be aware that sometimes the
kernel will compile anyway and result in a program crash.

 4.3.4.2 OpenCL Device Memory Model


The OpenCL memory model defines the behavior and hierarchy of memory that can be
used by OpenCL applications. This hierarchical representation of memory is common across
all OpenCL implementations, but it is up to individual vendors to define how the OpenCL
memory model maps to specific hardware. This section defines the mapping used by SDAccel.


Fig. 4.3.8 : OpenCL Memory Model

 1. Device memories
 Host memory : The host memory is defined as the region of system memory that is
directly (and only) accessible from the host processor. Any data needed by compute
kernels must be transferred to and from OpenCL device global memory using the
OpenCL API.
 Global Memory : It is shared with all Device, but slow. It is persistent between kernel
calls. The global memory is defined as the region of device memory that is accessible to
both the OpenCL host and device. Global memory permits read/write access to the host
processor as well to all compute units in the device. The host is responsible for the
allocation and de-allocation of buffers in this memory space. There is a handshake
between host and device over control of the data stored in this memory. The host
processor transfers data from the host memory space into the global memory space.
Then, once a kernel is launched to process the data, the host loses access rights to the
buffer in global memory. The device takes over and is capable of reading and writing
from the global memory until the kernel execution is complete. Upon completion of the
operations associated with a kernel, the device turns control of the global memory buffer
back to the host processor. Once it has regained control of a buffer, the host processor
can read and write data to the buffer, transfer data back to the host memory, and
de-allocate the buffer.


 Constant Memory : It is faster than global memory; it is typically used for data such as filter parameters.
Constant global memory is defined as the region of system memory that is accessible
with read and write access for the OpenCL host and with read only access for the
OpenCL device. As the name implies, the typical use for this memory is to transfer
constant data needed by kernel computation from the host to the device.
 Local Memory : It is private to each compute unit, and shared to all processing
elements. Local memory is a region of memory that is local to a single compute unit.
The host processor has no visibility and no control on the operations that occur in this
memory space. This memory space allows read and write operations by all the
processing elements within a compute unit. This level of memory is typically used to
store data that must be shared by multiple work-items. Operations on local memory are
un-ordered between work-items but synchronization and consistency can be achieved
using barrier and fence operations. In SDAccel, the structure of local memory can be
customized to meet the requirements of an algorithm or application.
 Private Memory : It is faster but local to each processing element. Private memory is
the region of memory that is private to an individual work-item executing within an
OpenCL processing element. As with local memory, the host processor has no visibility
into this memory region. This memory space can be read from and written to by all
work-items, but variables defined in one work-item's private memory are not visible to
another work-item. In SDAccel, the structure of private memory can be customized to
meet the requirements of an algorithm or application.
 2. The Constant, Local, and Private memories are scratch space, so their contents
cannot be saved to be used by other kernels.
 3. Device Memory Summary
Memory Type : Description
Private : Specific to a work-item; it is not visible to other work-items.
Local : Specific to a work-group; accessible only by work-items belonging to that work-group.
Global : Accessible to all work-items executing in a context, as well as to the host (read, write, and map commands).
Constant : Read-only region for host-allocated and -initialized objects that are not changed during kernel execution.
Host (CPU) : Host-accessible region for an application's data structures and program data.
PCIe : Part of host (CPU) memory accessible from, and modifiable by, the host program and the GPU compute device. Modifying this memory requires synchronization between the GPU compute device and the CPU.


ii. Memory objects are allocated by host APIs. The host program can provide the runtime
with a pointer to a block of continuous memory to hold the memory object when the object
is created (CL_MEM_USE_HOST_PTR). Alternatively, the physical memory can be
managed by the OpenCL runtime and not be directly accessible to the host program.
iii. Allocation and access to memory objects within the different memory regions varies
between the host and work-items running on a device. This is summarized in the Memory
Regions table, which describes whether the kernel or the host can allocate from a memory
region, the type of allocation (static at compile time vs. dynamic at runtime) and the type
of access allowed (i.e. whether the kernel or the host can read and/or write to a memory
region).

 4.4 Exploring OpenCL Memory Model - The Memory Objects


The contents of global memory are memory objects. A memory object is a handle to a
reference counted region of global memory. Memory objects use the OpenCL type cl_mem
and fall into three distinct classes.

 4.4.1 Various Memory Objects

 4.4.1.1 Memory Object - Buffer


1. A memory object stored as a block of contiguous memory and used as a general purpose
object to hold data used in an OpenCL program. The types of the values within a buffer
may be any of the built in types (such as int, float), vector types, or user-defined structures.
The buffer can be manipulated through pointers much as one would with any block of
memory in C.
2. Buffer objects package any type of data that does not involve images. These are created by
the clCreateBuffer function, whose signature is as follows :
clCreateBuffer(cl_context context, cl_mem_flags options, size_t size,

void *host_ptr, cl_int *error)

This returns a cl_mem that wraps around the data identified by the host_ptr argument. The
options parameter configures many of the object’s characteristics, such as whether the
buffer data is read-only or write-only and the manner in which the data is allocated on the
host. Below Table 4.4.1 lists the six values of the cl_mem_flags enumerated type.


3. Memory object properties (cl_mem_flags)


Table 4.4.1 : Memory Object Properties

Flag value : Meaning
CL_MEM_READ_WRITE : The memory object can be read from and written to.
CL_MEM_WRITE_ONLY : The memory object can only be written to.
CL_MEM_READ_ONLY : The memory object can only be read from.
CL_MEM_USE_HOST_PTR : The memory object will access the memory region specified by the host pointer.
CL_MEM_COPY_HOST_PTR : The memory object will set the memory region specified by the host pointer.
CL_MEM_ALLOC_HOST_PTR : A region in host-accessible memory will be allocated for use in data transfer.
The first three properties determine the buffer object’s accessibility, and they are all easy to
understand. The only point to remember is that they constrain the device’s access to the
buffer object, not the host’s. If a device attempts to modify a buffer object created with the
CL_MEM_READ_ONLY flag, the operation will produce an undefined result.
4. Allocating buffer objects
When one sets the second argument of clCreateBuffer, one commonly provides a
combination of two flags. First, one selects one of the first three flags in Table 4.4.1 above to set
the buffer object's accessibility. Then one selects one or more of the second three to
specify where the buffer object should be allocated. As an example, the following call
creates a buffer object to package vec, an array of 32 floats :
vec_buff = clCreateBuffer(context,

CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,

sizeof(float)*32, vec, &error);

5. Creating subbuffer objects


Just as one can create a substring from a string, one can create a subbuffer object from a
buffer object. One may want to do this if one kernel needs a subset of the data required by
another kernel. Subbuffer objects are created by clCreateSubBuffer, whose signature is as
follows :


clCreateSubBuffer(cl_mem buffer,

cl_mem_flags flags, cl_buffer_create_type type,

const void *info, cl_int *error)
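For example, a subbuffer covering the second half of the earlier 32-float vec_buff could be created as follows (subject to the device's base-address alignment requirement for sub-buffer origins) :
cl_buffer_region region;
region.origin = sizeof(float) * 16;   /* byte offset of the subbuffer into vec_buff */
region.size   = sizeof(float) * 16;   /* byte size of the subbuffer */

cl_int error;
cl_mem sub_buff = clCreateSubBuffer(vec_buff, CL_MEM_READ_ONLY,
                                    CL_BUFFER_CREATE_TYPE_REGION,
                                    &region, &error);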

Fig. 4.4.1 : Creating sub buffer

 4.4.1.2 Memory Object - Image


1. An image memory object holds one, two or three dimensional images. The formats are
based on the standard image formats used in graphics applications. An image is an opaque
data structure managed by functions defined in the OpenCL API. To optimize the
manipulation of images stored in the texture memories found in many GPUs, OpenCL
kernels have traditionally been disallowed from both reading and writing a single image. In
OpenCL 2.0, however, this restriction is relaxed by providing synchronization and fence
operations that let programmers properly synchronize their code to safely allow a kernel to
read and write a single image.
2. Image processing is a major priority in high-performance computing. This is particularly
true for OpenCL, which is one of the few languages capable of targeting graphics cards. For
this reason, OpenCL provides a specific type of memory object for holding pixel data. The
standard refers to them as image objects, but there is no separate data structure for them.
Like buffer objects, image objects are represented by cl_mem structures. Much of the
discussion of buffer objects applies to image objects as well. Image objects are created with
the same configuration flags as those listed in Table 4.4.1, and their allocation properties are
exactly the same.
3. Creating image objects
 A 1D image, 1D image buffer, 1D image array, 2D image, 2D image array or 3D
image object can be created using the following function (a short usage sketch is given at the end of this sub-section).
cl_mem clCreateImage(

cl_context context,


cl_mem_flags flags,

const cl_image_format* image_format,

const cl_image_desc* image_desc,

void* host_ptr,

cl_int* errcode_ret);

 context is a valid OpenCL context on which the image object is to be created.


 flags is a bit-field that is used to specify allocation and usage information about the
image memory object being created and is described in the Memory Flags Table 4.4.1.
 image_format is a pointer to a structure that describes format properties of the image to
be allocated. A 1D image buffer or 2D image can be created from a buffer by specifying
a buffer object in image_desc->mem_object. A 2D image can be created from
another 2D image object by specifying an image object in image_desc->mem_object.
Refer to page 4-29 for a detailed description of the image format descriptor.
 image_desc is a pointer to a structure that describes type and dimensions of the image to
be allocated. Refer page 4-29 for a detailed description of the image descriptor.
 host_ptr is a pointer to the image data that may already be allocated by the application.
It is only used to initialize the image, and can be freed after the call to clCreateImage.
Refer to Table 4.4.2 below for a description of how large the buffer that host_ptr points
to must be.
 For all image types except CL_MEM_OBJECT_IMAGE1D_BUFFER, if value
specified for flags is 0, the default is used which is CL_MEM_READ_WRITE.
 For CL_MEM_OBJECT_IMAGE1D_BUFFER image type, or an image created from
another memory object (image or buffer), if the CL_MEM_READ_WRITE, CL_MEM_
READ_ONLY or CL_MEM_WRITE_ONLY values are not specified in flags, they are
inherited from the corresponding memory access qualifiers associated with mem_object.
 The CL_MEM_USE_HOST_PTR, CL_MEM_ALLOC_HOST_PTR and CL_MEM_
COPY_HOST_PTR values cannot be specified in flags but are inherited from the
corresponding memory access qualifiers associated with mem_object.
 If CL_MEM_COPY_HOST_PTR is specified in the memory access qualifier values
associated with mem_object it does not imply any additional copies when the image is
created from mem_object.
 If the CL_MEM_HOST_WRITE_ONLY, CL_MEM_HOST_READ_ONLY or CL_
MEM_HOST_NO_ACCESS values are not specified in flags, they are inherited from
the corresponding memory access qualifiers associated with mem_object.
Table 4.4.2 : Image Types and Host Pointer Buffer Size

Image Type                              Size of buffer that host_ptr points to

CL_MEM_OBJECT_IMAGE1D                   ≥ image_row_pitch
CL_MEM_OBJECT_IMAGE1D_BUFFER            ≥ image_row_pitch
CL_MEM_OBJECT_IMAGE2D                   ≥ image_row_pitch × image_height
CL_MEM_OBJECT_IMAGE3D                   ≥ image_slice_pitch × image_depth
CL_MEM_OBJECT_IMAGE1D_ARRAY             ≥ image_slice_pitch × image_array_size
CL_MEM_OBJECT_IMAGE2D_ARRAY             ≥ image_slice_pitch × image_array_size
 For a 3D image or 2D image array, the image data specified by host_ptr is stored as a
linear sequence of adjacent 2D image slices or 2D images respectively. Each 2D image
is a linear sequence of adjacent scanlines. Each scanline is a linear sequence of image
elements.
 For a 2D image, the image data specified by host_ptr is stored as a linear sequence of
adjacent scanlines. Each scanline is a linear sequence of image elements.
 For a 1D image array, the image data specified by host_ptr is stored as a linear sequence
of adjacent 1D images. Each 1D image is stored as a single scanline which is a linear
sequence of adjacent elements.
 For 1D image or 1D image buffer, the image data specified by host_ptr is stored as a
single scanline which is a linear sequence of adjacent elements.
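As a worked example of the sizes in Table 4.4.2, consider a hypothetical 640 × 480 2D image with image_channel_order = CL_RGBA and image_channel_data_type = CL_UNORM_INT8 : each element occupies 4 bytes, so with the default pitch image_row_pitch = 640 × 4 = 2560 bytes, and the buffer that host_ptr points to must hold at least image_row_pitch × image_height = 2560 × 480 = 1,228,800 bytes.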
 Image objects come in two major types : two-dimensional and three-dimensional. Two-
dimensional image objects are created by clCreateImage2D. Three-dimensional image
objects, which are essentially successions of two-dimensional images, are created with
clCreateImage3D. (Both functions are deprecated since OpenCL 1.2 in favour of the
generic clCreateImage described above.) Both functions return a cl_mem structure and
their signatures are as follows :
clCreateImage2D (cl_context context, cl_mem_flags opts,
        const cl_image_format *format, size_t width, size_t height,
        size_t row_pitch, void *data, cl_int *error)

clCreateImage3D (cl_context context, cl_mem_flags opts,
        const cl_image_format *format, size_t width, size_t height,
        size_t depth, size_t row_pitch, size_t slice_pitch,
        void *data, cl_int *error)
4. Image Format Descriptor
 The cl_image_format image format descriptor structure describes an image format, and
is defined as :
typedef struct cl_image_format {
        cl_channel_order image_channel_order;
        cl_channel_type image_channel_data_type;
} cl_image_format;
 image_channel_order specifies the number of channels and the channel layout i.e. the
memory layout in which channels are stored in the image. Valid values are described in
the Image Channel Order table.
 image_channel_data_type describes the size of the channel data type. The list of
supported values is described in the Image Channel Data Types table. The number of
bits per element determined by the image_channel_data_type and image_channel_order
must be a power of two.
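For instance, a format descriptor for a conventional 8-bit-per-channel RGBA image could be filled in as sketched below (the variable name fmt is illustrative) :
cl_image_format fmt;
fmt.image_channel_order     = CL_RGBA;        /* four channels : R, G, B, A */
fmt.image_channel_data_type = CL_UNORM_INT8;  /* each channel a normalized unsigned 8-bit value */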
Table 4.4.3 : List of supported Image Channel Order Values
Image Channel Order Description
CL_R, CL_A, Single channel image formats where the single channel
represents a RED or ALPHA component.
CL_DEPTH A single channel image format where the single channel
represents a DEPTH component.
CL_LUMINANCE A single channel image format where the single channel
represents a LUMINANCE value. The LUMINANCE value is
replicated into the RED, GREEN, and BLUE components.
CL_INTENSITY, A single channel image format where the single channel
represents an INTENSITY value. The INTENSITY value is
replicated into the RED, GREEN, BLUE, and ALPHA
components.
CL_RG, CL_RA Two channel image formats. The first channel always represents
a RED component. The second channel represents a GREEN
component or an ALPHA component.
CL_Rx A two channel image format, where the first channel represents
a RED component and the second channel is ignored.
CL_RGB A three channel image format, where the three channels
represent RED, GREEN, and BLUE components.
CL_RGx A three channel image format, where the first two channels
represent RED and GREEN components and the third channel is
ignored.
CL_RGBA, CL_ARGB,        Four channel image formats, where the four channels represent
CL_BGRA, CL_ABGR         RED, GREEN, BLUE, and ALPHA components.
CL_RGBx A four channel image format, where the first three channels
represent RED, GREEN, and BLUE components and the fourth
channel is ignored.
CL_sRGB A three channel image format, where the three channels
represent RED, GREEN, and BLUE components in the sRGB
color space.
CL_sRGBA,                Four channel image formats, where the first three channels
CL_sBGRA                 represent RED, GREEN, and BLUE components in the sRGB
                         color space. The fourth channel represents an ALPHA
                         component.
CL_sRGBx A four channel image format, where the three channels represent
RED, GREEN, and BLUE components in the sRGB color
space. The fourth channel is ignored.
Table 4.4.4 : List of supported Image Channel Data Types
Image Channel Data Type Description
CL_SNORM_INT8 Each channel component is a normalized signed 8-bit
integer value.
CL_SNORM_INT16 Each channel component is a normalized signed 16-bit
integer value.
CL_UNORM_INT8 Each channel component is a normalized unsigned 8-bit
integer value.
CL_UNORM_INT16 Each channel component is a normalized unsigned 16-bit
integer value.
CL_UNORM_SHORT_565 Represents a normalized 5-6-5 3-channel RGB image. The
channel order must be CL_RGB or CL_RGBx.
CL_UNORM_SHORT_555 Represents a normalized x-5-5-5 4-channel xRGB image.
The channel order must be CL_RGB or CL_RGBx.
CL_UNORM_INT_101010 Represents a normalized x-10-10-10 4-channel xRGB
image. The channel order must be CL_RGB or CL_RGBx.
CL_UNORM_INT_101010_2    Represents a normalized 10-10-10-2 four-channel RGBA
                         image. The channel order must be CL_RGBA.
CL_SIGNED_INT8 Each channel component is an unnormalized signed 8-bit
integer value.
CL_SIGNED_INT16 Each channel component is an unnormalized signed 16-bit
integer value.
CL_SIGNED_INT32 Each channel component is an unnormalized signed 32-bit
integer value.
CL_UNSIGNED_INT8 Each channel component is an unnormalized unsigned 8-bit
integer value.
CL_UNSIGNED_INT16 Each channel component is an unnormalized unsigned 16-bit
integer value.
CL_UNSIGNED_INT32 Each channel component is an unnormalized unsigned 32-bit
integer value.
CL_HALF_FLOAT Each channel component is a 16-bit half-float value.
CL_FLOAT Each channel component is a single precision floating-point
value.
 Consider, as an example, a normalized unsigned 8-bit per channel RGBA image :
image_channel_order = CL_RGBA and image_channel_data_type = CL_UNORM_INT8.
The memory layout of this image format is described below :
R G B A …
with the corresponding byte offsets
0 1 2 3 …
Similarly, if image_channel_order = CL_RGBA and image_channel_data_type = CL_
SIGNED_INT16, the memory layout of this image format is described below :
R G B A …
with the corresponding byte offsets
0 2 4 6 …
 image_channel_data_type values of CL_UNORM_SHORT_565, CL_UNORM
_SHORT_555, CL_UNORM_INT_101010, and CL_UNORM_INT_101010_2 are
special cases of packed image formats where the channels of each element are packed
into a single unsigned short or unsigned int. For these special packed image formats, the
channels are normally packed with the first channel in the most significant bits of the
bitfield, and successive channels occupying progressively less significant locations.
 For CL_UNORM_SHORT_565, R is in bits 15:11, G is in bits 10:5 and B is in bits 4:0.
For CL_UNORM_SHORT_555, bit 15 is undefined, R is in bits 14:10, G in bits 9:5 and
B in bits 4:0. For CL_UNORM_INT_101010, bits 31:30 are undefined, R is in bits
29:20, G in bits 19:10 and B in bits 9:0. For CL_UNORM_INT_101010_2, R is in bits
31:22, G in bits 21:12, B in bits 11:2 and A in bits 1:0.
 OpenCL implementations must maintain the minimum precision specified by the
number of bits in image_channel_data_type. If the image format specified by
image_channel_order, and image_channel_data_type cannot be supported by the
OpenCL implementation, then the call to clCreateImage will return a NULL memory
object.
5. Image Descriptor
 The cl_image_desc image descriptor structure describes the type and dimensions of an
image or image array, and is defined as :
typedef struct cl_image_desc {
        cl_mem_object_type image_type;
        size_t image_width;
        size_t image_height;
        size_t image_depth;
        size_t image_array_size;
        size_t image_row_pitch;
        size_t image_slice_pitch;
        cl_uint num_mip_levels;
        cl_uint num_samples;
#ifdef __GNUC__
        __extension__ /* Prevents warnings about anonymous union in -pedantic builds */
#endif
        union {
            cl_mem buffer;
            cl_mem mem_object;
        };
} cl_image_desc;
 image_type describes the image type and must be either CL_MEM_OBJECT_
IMAGE1D, CL_MEM_OBJECT_IMAGE1D_BUFFER, CL_MEM_OBJECT_
IMAGE1D_ARRAY, CL_MEM_OBJECT_IMAGE2D, CL_MEM_OBJECT_
IMAGE2D_ARRAY, or CL_MEM_OBJECT_IMAGE3D.
 image_width is the width of the image in pixels. For a 2D image and image array, the
image width must be a value ≥ 1 and ≤ CL_DEVICE_IMAGE2D_MAX_WIDTH. For a
3D image, the image width must be a value ≥ 1 and ≤ CL_DEVICE_IMAGE3D_MAX_
WIDTH. For a 1D image buffer, the image width must be a value ≥ 1 and ≤ CL_
DEVICE_IMAGE_MAX_BUFFER_SIZE. For a 1D image and 1D image array, the
image width must be a value ≥1 and ≤ CL_DEVICE_IMAGE2D_MAX_WIDTH.
 image_height is the height of the image in pixels. This is only used if the image is a 2D
or 3D image, or a 2D image array. For a 2D image or image array, the image height
must be a value ≥ 1 and ≤ CL_DEVICE_IMAGE2D_MAX_HEIGHT. For a 3D image,
the image height must be a value ≥ 1 and ≤ CL_DEVICE_IMAGE3D_MAX_HEIGHT.
 image_depth is the depth of the image in pixels. This is only used if the image is a 3D
image and must be a value ≥ 1 and ≤ CL_DEVICE_IMAGE3D_MAX_DEPTH.
 image_array_size is the number of images in the image array. This is only used if the
image is a 1D or 2D image array. The values for image_array_size, if specified, must be
a value ≥ 1 and ≤ CL_DEVICE_IMAGE_MAX_ARRAY_SIZE.
 (It should be noted that reading and writing 2D image arrays from a kernel with
image_array_size = 1 may be lower performance than 2D images.)
 image_row_pitch is the scan-line pitch in bytes. This must be 0 if host_ptr is NULL and
can be either 0 or ≥ image_width × size of element in bytes if host_ptr is not NULL. If
host_ptr is not NULL and image_row_pitch = 0, image_row_pitch is calculated as
image_width × size of element in bytes. If image_row_pitch is not 0, it must be a
multiple of the image element size in bytes. For a 2D image created from a buffer, the
pitch specified (or computed if pitch specified is 0) must be a multiple of the maximum
of the CL_DEVICE_IMAGE_PITCH_ALIGNMENT value for all devices in the
context associated with the buffer specified by mem_object that support images.
 image_slice_pitch is the size in bytes of each 2D slice in the 3D image or the size in
bytes of each image in a 1D or 2D image array. This must be 0 if host_ptr is NULL. If
host_ptr is not NULL, image_slice_pitch can be either 0 or ≥ image_row_pitch ×
image_height for a 2D image array or 3D image and can be either 0 or ≥
image_row_pitch for a 1D image array. If host_ptr is not NULL and image_slice_pitch
= 0, image_slice_pitch is calculated as image_row_pitch × image_height for a 2D image
array or 3D image and image_row_pitch for a 1D image array. If image_slice_pitch is
not 0, it must be a multiple of the image_row_pitch.
 num_mip_levels and num_samples must be 0.
 mem_object may refer to a valid buffer or image memory object. mem_object can be a
buffer memory object if image_type is CL_MEM_OBJECT_IMAGE1D_BUFFER or
CL_MEM_OBJECT_IMAGE2D. mem_object can be an image object if image_type is
CL_MEM_OBJECT_IMAGE2D. Otherwise it must be NULL. The image pixels are
taken from the memory object's data store. When the contents of the specified memory
object's data store are modified, those changes are reflected in the contents of the image
object and vice-versa at corresponding synchronization points.
 A 2D image can be created from a buffer object such that the image and the buffer
object share the same data store. Similarly, an image object can be created from another
image object such that the two image objects share the same data store.
 For a 1D image buffer created from a buffer object, the image_width × size of element
in bytes must be ≤ size of the buffer object. The image data in the buffer object is stored
as a single scanline which is a linear sequence of adjacent elements.
 For a 2D image created from a buffer object, the image_row_pitch × image_height must
be ≤ size of the buffer object specified by mem_object. The image data in the buffer
object is stored as a linear sequence of adjacent scanlines. Each scanline is a linear
sequence of image elements padded to image_row_pitch bytes.
 For an image object created from another image object, the values specified in the image
descriptor except for mem_object must match the image descriptor information
associated with mem_object.
 Image elements are stored according to their image format as described in Image Format
Descriptor.
 If the buffer object specified by mem_object was created with CL_MEM_USE_HOST_
PTR, the host_ptr specified to clCreateBuffer must be aligned to the maximum of the
CL_DEVICE_IMAGE_BASE_ADDRESS_ALIGNMENT value for all devices in the
context associated with the buffer specified by mem_object that support images.
 Creating a 2D image object from another 2D image object allows users to create a new
image object that shares the image data store with mem_object but views the pixels in
the image with a different channel order. The restrictions are:
 All the values specified in image_desc except for mem_object must match the image
descriptor information associated with mem_object.
 The image_desc used for creation of mem_object may not be equivalent to the image
descriptor information associated with mem_object. To ensure that the values in
image_desc will match, one can query mem_object for the associated information using
the clGetImageInfo function described in Image Object Queries.
 The channel data type specified in image_format must match the channel data type
associated with mem_object. The channel order values supported are :
Table 4.4.5 : Image Channel Order

image_channel_order specified in image_format        image channel order of mem_object

CL_sBGRA                                              CL_BGRA
CL_BGRA                                               CL_sBGRA
CL_sRGBA                                              CL_RGBA
CL_RGBA                                               CL_sRGBA
CL_sRGB                                               CL_RGB
CL_RGB                                                CL_sRGB
CL_sRGBx                                              CL_RGBx
CL_RGBx                                               CL_sRGBx
CL_DEPTH                                              CL_R
 The channel order specified must have the same number of channels as the channel
order of mem_object. This allows developers to create a sRGB view of the image from a
linear RGB view or vice-versa i.e. the pixels stored in the image can be accessed as
linear RGB or sRGB values.
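Putting the image format descriptor and the image descriptor together, a 2D image object might be created as sketched below. The 640 × 480 dimensions, the context variable and the host pointer host_pixels (640 × 480 × 4 bytes of RGBA data) are assumptions made for illustration only; zeroing the descriptor lets the implementation compute the row pitch and leaves mem_object NULL (memset is declared in <string.h>).
cl_image_format fmt = { CL_RGBA, CL_UNORM_INT8 };

cl_image_desc desc;
memset(&desc, 0, sizeof(desc));            /* zero pitches, mip levels, samples and mem_object */
desc.image_type   = CL_MEM_OBJECT_IMAGE2D;
desc.image_width  = 640;
desc.image_height = 480;

cl_int error;
cl_mem img = clCreateImage(context,
                  CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                  &fmt, &desc, host_pixels, &error);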
6. Querying List of Supported Image Formats
 To get the list of image formats supported by an OpenCL implementation for a specified
context, image type, and allocation information, call the function
cl_int clGetSupportedImageFormats(
        cl_context context,
        cl_mem_flags flags,
        cl_mem_object_type image_type,
        cl_uint num_entries,
        cl_image_format* image_formats,
        cl_uint* num_image_formats);
 context is a valid OpenCL context on which the image object(s) will be created.
 flags is a bit-field that is used to specify usage information about the image formats
being queried and is described in the Memory Flags table. flags may be CL_MEM_
READ_WRITE to query image formats that may be read from and written to by
different kernel instances when correctly ordered by event dependencies, or CL_MEM_
READ_ONLY to query image formats that may be read from by a kernel, or CL_
MEM_WRITE_ONLY to query image formats that may be written to by a kernel, or
CL_MEM_KERNEL_READ_AND_WRITE to query image formats that may be both
read from and written to by the same kernel instance. Please see Image Format Mapping
for clarification.
 image_type describes the image type and must be either CL_MEM_OBJECT_
IMAGE1D, CL_MEM_OBJECT_IMAGE1D_BUFFER, CL_MEM_OBJECT_
IMAGE2D, CL_MEM_OBJECT_IMAGE3D, CL_MEM_OBJECT_IMAGE1D_
ARRAY, or CL_MEM_OBJECT_IMAGE2D_ARRAY.
 num_entries specifies the number of entries that can be returned in the memory location
given by image_formats.
 image_formats is a pointer to a memory location where the list of supported image
formats are returned. Each entry describes a cl_image_format structure supported by the
OpenCL implementation. If image_formats is NULL, it is ignored.
 num_image_formats is the actual number of supported image formats for a specific
context and values specified by flags. If num_image_formats is NULL, it is ignored.
 clGetSupportedImageFormats returns a union of image formats supported by all devices
in the context.
 clGetSupportedImageFormats returns CL_SUCCESS if the function is executed
successfully. Otherwise, it returns one of the following errors:
 CL_INVALID_CONTEXT if context is not a valid context.
 CL_INVALID_VALUE if flags or image_type are not valid, or if num_entries is 0 and
image_formats is not NULL.
 CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the
OpenCL implementation on the device.
 CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by
the OpenCL implementation on the host.
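A typical usage pattern, sketched below, is to call the function twice : once to obtain the number of supported formats and once to retrieve them. The variable names are illustrative, context is assumed to be a valid context, and malloc/free are declared in <stdlib.h>.
cl_uint num_formats = 0;
clGetSupportedImageFormats(context, CL_MEM_READ_ONLY,
                           CL_MEM_OBJECT_IMAGE2D, 0, NULL, &num_formats);

cl_image_format *formats = (cl_image_format *) malloc(num_formats * sizeof(cl_image_format));
clGetSupportedImageFormats(context, CL_MEM_READ_ONLY,
                           CL_MEM_OBJECT_IMAGE2D, num_formats, formats, NULL);
/* ... inspect formats[i].image_channel_order and formats[i].image_channel_data_type ... */
free(formats);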
7. Minimum List of Supported Image Formats
 For 1D, 1D image from buffer, 2D, 3D image objects, 1D and 2D image array objects,
the mandated minimum list of image formats that can be read from and written to by
different kernel instances when correctly ordered by event dependencies and that must
be supported by all devices that support images is described in the Supported Formats -
Kernel Read Or Write table.
Table 4.4.6 : Minimum list of required image formats : kernel read or write

num_channels    channel_order    channel_data_type

1               CL_R             CL_UNORM_INT8
                                 CL_UNORM_INT16
                                 CL_SNORM_INT8
                                 CL_SNORM_INT16
                                 CL_SIGNED_INT8
                                 CL_SIGNED_INT16
                                 CL_SIGNED_INT32
                                 CL_UNSIGNED_INT8
                                 CL_UNSIGNED_INT16
                                 CL_UNSIGNED_INT32
                                 CL_HALF_FLOAT
                                 CL_FLOAT
1               CL_DEPTH         CL_UNORM_INT16
                                 CL_FLOAT
2               CL_RG            CL_UNORM_INT8
                                 CL_UNORM_INT16
                                 CL_SNORM_INT8
                                 CL_SNORM_INT16
                                 CL_SIGNED_INT8
                                 CL_SIGNED_INT16
                                 CL_SIGNED_INT32
                                 CL_UNSIGNED_INT8
                                 CL_UNSIGNED_INT16
                                 CL_UNSIGNED_INT32
                                 CL_HALF_FLOAT
                                 CL_FLOAT
4               CL_RGBA          CL_UNORM_INT8
                                 CL_UNORM_INT16
                                 CL_SNORM_INT8
                                 CL_SNORM_INT16
                                 CL_SIGNED_INT8
                                 CL_SIGNED_INT16
                                 CL_SIGNED_INT32
                                 CL_UNSIGNED_INT8
                                 CL_UNSIGNED_INT16
                                 CL_UNSIGNED_INT32
                                 CL_HALF_FLOAT
                                 CL_FLOAT
4               CL_BGRA          CL_UNORM_INT8
4               CL_sRGBA         CL_UNORM_INT8
 It should be noted that CL_DEPTH channel order is supported only for 2D image and
2D image array objects. sRGB channel order support is not required for 1D image
buffers. Writes to images with sRGB channel orders require device support of the
cl_khr_srgb_image_writes extension.
 For 1D, 1D image from buffer, 2D, 3D image objects, 1D and 2D image array objects,
the mandated minimum list of image formats that can be read from and written to by the
same kernel instance and that must be supported by all devices that support images is
described in the Supported Formats - Kernel Read And Write Table 4.4.7.
Table 4.4.7 : Minimum list of required image formats: kernel read and write
num_channels channel_order channel_data_type
1 CL_R CL_UNORM_INT8
CL_SIGNED_INT8
CL_SIGNED_INT16
CL_SIGNED_INT32
CL_UNSIGNED_INT8
CL_UNSIGNED_INT16
CL_UNSIGNED_INT32
CL_HALF_FLOAT
CL_FLOAT
4 CL_RGBA CL_UNORM_INT8
CL_SIGNED_INT8
CL_SIGNED_INT16
CL_SIGNED_INT32
CL_UNSIGNED_INT8
CL_UNSIGNED_INT16
CL_UNSIGNED_INT32
CL_HALF_FLOAT
CL_FLOAT
8. Image format mapping to OpenCL kernel language image access qualifiers
 Image arguments to kernels may have the read_only, write_only or read_write qualifier.
Not all image formats supported by the device and platform are valid to be passed to all
of these access qualifiers. For each access qualifier, only images whose format is in the
list of formats returned by clGetSupportedImageFormats with the given flag arguments
in the Image Format Mapping table are permitted. It is not valid to pass an image
supporting writing as both a read_only image and a write_only image parameter, or to a
read_write image parameter and any other image parameter.
Table 4.4.8 : Mapping from format flags passed to clGetSupportedImageFormats to
OpenCL kernel language image access qualifiers

Access Qualifier        cl_mem_flags

read_only               CL_MEM_READ_ONLY,
                        CL_MEM_READ_WRITE,
                        CL_MEM_KERNEL_READ_AND_WRITE
write_only              CL_MEM_WRITE_ONLY,
                        CL_MEM_READ_WRITE,
                        CL_MEM_KERNEL_READ_AND_WRITE
read_write              CL_MEM_KERNEL_READ_AND_WRITE
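On the kernel side, these access qualifiers appear on the image parameters themselves. The minimal copy kernel sketched below (illustrative only) samples a read_only image and writes the result to a write_only image; it relies on the samplerless read_imagef overload available from OpenCL C 1.2 onwards.
__kernel void copy_image(read_only image2d_t src, write_only image2d_t dst)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    float4 pixel = read_imagef(src, pos);   /* src may only be read */
    write_imagef(dst, pos, pixel);          /* dst may only be written */
}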
9. Reading, Writing and Copying Image Objects
 The following functions enqueue commands to read from an image or image array
object to host memory or write to an image or image array object from host memory.
cl_int clEnqueueReadImage(
        cl_command_queue command_queue,
        cl_mem image,
        cl_bool blocking_read,
        const size_t* origin,
        const size_t* region,
        size_t row_pitch,
        size_t slice_pitch,
        void* ptr,
        cl_uint num_events_in_wait_list,
        const cl_event* event_wait_list,
        cl_event* event);

cl_int clEnqueueWriteImage(
        cl_command_queue command_queue,
        cl_mem image,
        cl_bool blocking_write,
        const size_t* origin,
        const size_t* region,
        size_t input_row_pitch,
        size_t input_slice_pitch,
        const void* ptr,
        cl_uint num_events_in_wait_list,
        const cl_event* event_wait_list,
        cl_event* event);
 command_queue refers to the host command-queue in which the read / write command
will be queued. command_queue and image must be created with the same OpenCL
context.
 image refers to a valid image or image array object.
 blocking_read and blocking_write indicate if the read and write operations are blocking
or non-blocking.
 origin defines the (x, y, z) offset in pixels in the 1D, 2D or 3D image, the (x, y) offset
and the image index in the 2D image array or the (x) offset and the image index in the
1D image array. If image is a 2D image object, origin[2] must be 0. If image is a 1D
image or 1D image buffer object, origin[1] and origin[2] must be 0. If image is a 1D
image array object, origin[2] must be 0. If image is a 1D image array object, origin[1]
describes the image index in the 1D image array. If image is a 2D image array object,
origin[2] describes the image index in the 2D image array.
 region defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the
(width, height) in pixels of the 2D rectangle and the number of images of a 2D image
array or the (width) in pixels of the 1D rectangle and the number of images of a 1D
image array. If image is a 2D image object, region[2] must be 1. If image is a 1D image
or 1D image buffer object, region[1] and region[2] must be 1. If image is a 1D image
array object, region[2] must be 1. The values in region cannot be 0.
 row_pitch in clEnqueueReadImage and input_row_pitch in clEnqueueWriteImage is the
length of each row in bytes. This value must be greater than or equal to the element size
in bytes × width. If row_pitch (or input_row_pitch) is set to 0, the appropriate row pitch
is calculated based on the size of each element in bytes multiplied by width.
 slice_pitch in clEnqueueReadImage and input_slice_pitch in clEnqueueWriteImage is
the size in bytes of the 2D slice of the 3D region of a 3D image or each image of a 1D or
2D image array being read or written respectively. This must be 0 if image is a 1D or 2D
image. Otherwise this value must be greater than or equal to row_pitch × height. If
slice_pitch (or input_slice_pitch) is set to 0, the appropriate slice pitch is calculated
based on the row_pitch × height.
 ptr is the pointer to a buffer in host memory where image data is to be read from or to be
written to. The alignment requirements for ptr are specified in Alignment of Application
Data Types.
 event_wait_list and num_events_in_wait_list specify events that need to complete
before this particular command can be executed. If event_wait_list is NULL, then this
particular command does not wait on any event to complete. If event_wait_list is NULL,
num_events_in_wait_list must be 0. If event_wait_list is not NULL, the list of events
pointed to by event_wait_list must be valid and num_events_in_wait_list must be
greater than 0. The events specified in event_wait_list act as synchronization points. The
context associated with events in event_wait_list and command_queue must be the
same. The memory associated with event_wait_list can be reused or freed after the
function returns.
 event returns an event object that identifies this particular read / write command and can
be used to query or queue a wait for this particular command to complete. event can be
NULL in which case it will not be possible for the application to query the status of this
command or queue a wait for this command to complete. If the event_wait_list and the
event arguments are not NULL, the event argument should not refer to an element of the
event_wait_list array.
 If blocking_read is CL_TRUE i.e. the read command is blocking, clEnqueueReadImage
does not return until the buffer data has been read and copied into memory pointed to by
ptr.
 If blocking_read is CL_FALSE i.e. the read command is non-blocking,
clEnqueueReadImage queues a non-blocking read command and returns. The contents
of the buffer that ptr points to cannot be used until the read command has completed.
The event argument returns an event object which can be used to query the execution
status of the read command. When the read command has completed, the contents of the
buffer that ptr points to can be used by the application.
 If blocking_write is CL_TRUE, the write command is blocking and does not return until
the command is complete, including transfer of the data. The memory pointed to by ptr
can be reused by the application after the clEnqueueWriteImage call returns.
 If blocking_write is CL_FALSE, the OpenCL implementation will use ptr to perform a
non-blocking write. As the write is non-blocking the implementation can return
immediately. The memory pointed to by ptr cannot be reused by the application after the
call returns. The event argument returns an event object which can be used to query the
execution status of the write command. When the write command has completed, the
memory pointed to by ptr can then be reused by the application.
 clEnqueueReadImage and clEnqueueWriteImage return CL_SUCCESS if the function is
executed successfully. Otherwise, it returns one of the following errors :
 CL_INVALID_COMMAND_QUEUE if command_queue is not a valid host command-
queue.
 CL_INVALID_CONTEXT if the context associated with command_queue and image
are not the same or if the context associated with command_queue and events in
event_wait_list are not the same.
 CL_INVALID_MEM_OBJECT if image is not a valid image object.
 CL_INVALID_VALUE if the region being read or written specified by origin and
region is out of bounds or if ptr is a NULL value.
 CL_INVALID_VALUE if values in origin and region do not follow rules described in
the argument description for origin and region.
 CL_INVALID_EVENT_WAIT_LIST if event_wait_list is NULL and
num_events_in_wait_list > 0, or event_wait_list is not NULL and
num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events.
 CL_INVALID_IMAGE_SIZE if image dimensions (image width, height, specified or
compute row and/or slice pitch) for image are not supported by device associated with
queue.
 CL_IMAGE_FORMAT_NOT_SUPPORTED if image format (image channel order and
data type) for image are not supported by device associated with queue.
 CL_MEM_OBJECT_ALLOCATION_FAILURE if there is a failure to allocate memory
for data store associated with image.
 CL_INVALID_OPERATION if the device associated with command_queue does not
support images (i.e. CL_DEVICE_IMAGE_SUPPORT is CL_FALSE).
 CL_INVALID_OPERATION if clEnqueueReadImage is called on image which has
been created with CL_MEM_HOST_WRITE_ONLY or CL_MEM_HOST_NO_
ACCESS.
 CL_INVALID_OPERATION if clEnqueueWriteImage is called on image which has
been created with CL_MEM_HOST_READ_ONLY or CL_MEM_HOST_NO_
ACCESS.
 CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST if the read and write
operations are blocking and the execution status of any of the events in event_wait_list
is a negative integer value.
 CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the
OpenCL implementation on the device.
 CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by
the OpenCL implementation on the host.
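As an illustration of the read call described above, the sketch below performs a blocking read of a full 640 × 480 2D image into host memory; queue, img and host_pixels are assumed to exist, and passing 0 for both pitches lets the implementation compute them.
size_t origin[3] = {0, 0, 0};
size_t region[3] = {640, 480, 1};   /* region[2] must be 1 for a 2D image */

cl_int err = clEnqueueReadImage(queue, img, CL_TRUE,
                  origin, region,
                  0, 0,             /* row_pitch and slice_pitch computed by the implementation */
                  host_pixels,
                  0, NULL, NULL);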
To enqueue a command to copy image objects, call the function
cl_int clEnqueueCopyImage(
        cl_command_queue command_queue,
        cl_mem src_image,
        cl_mem dst_image,
        const size_t* src_origin,
        const size_t* dst_origin,
        const size_t* region,
        cl_uint num_events_in_wait_list,
        const cl_event* event_wait_list,
        cl_event* event);
 src_image and dst_image can be 1D, 2D, 3D image or a 1D, 2D image array objects. It
is possible to copy subregions between any combinations of source and destination
types, provided that the dimensions of the subregions are the same e.g., one can copy a
rectangular region from a 2D image to a slice of a 3D image.
 command_queue refers to the host command-queue in which the copy command will be
queued. The OpenCL context associated with command_queue, src_image and
dst_image must be the same.
 src_origin defines the (x, y, z) offset in pixels in the 1D, 2D or 3D image, the (x, y)
offset and the image index in the 2D image array or the (x) offset and the image index in
the 1D image array. If src_image is a 2D image object, src_origin[2] must be 0. If src_image
is a 1D image object, src_origin[1] and src_origin[2] must be 0. If src_image is a 1D
image array object, src_origin[2] must be 0. If src_image is a 1D image array object,
src_origin[1] describes the image index in the 1D image array. If src_image is a 2D
image array object, src_origin[2] describes the image index in the 2D image array.

 dst_origin defines the (x, y, z) offset in pixels in the 1D, 2D or 3D image, the (x, y)
offset and the image index in the 2D image array or the (x) offset and the image index in
the 1D image array. If dst_image is a 2D image object, dst_origin[2] must be 0. If
dst_image is a 1D image or 1D image buffer object, dst_origin[1] and dst_origin[2]
must be 0. If dst_image is a 1D image array object, dst_origin[2] must be 0. If
dst_image is a 1D image array object, dst_origin[1] describes the image index in the 1D
image array. If dst_image is a 2D image array object, dst_origin[2] describes the image
index in the 2D image array.

 region defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the
(width, height) in pixels of the 2D rectangle and the number of images of a 2D image
array or the (width) in pixels of the 1D rectangle and the number of images of a 1D
image array. If src_image or dst_image is a 2D image object, region[2] must be 1. If
src_image or dst_image is a 1D image or 1D image buffer object, region[1] and
region[2] must be 1. If src_image or dst_image is a 1D image array object, region[2]
must be 1. The values in region cannot be 0.
 event_wait_list and num_events_in_wait_list specify events that need to complete
before this particular command can be executed. If event_wait_list is NULL, then this
particular command does not wait on any event to complete. If event_wait_list is NULL,
num_events_in_wait_list must be 0. If event_wait_list is not NULL, the list of events
pointed to by event_wait_list must be valid and num_events_in_wait_list must be
greater than 0. The events specified in event_wait_list act as synchronization points. The
context associated with events in event_wait_list and command_queue must be the
same. The memory associated with event_wait_list can be reused or freed after the
function returns.
 event returns an event object that identifies this particular copy command and can be
used to query or queue a wait for this particular command to complete. event can be
NULL in which case it will not be possible for the application to query the status of this
command or queue a wait for this command to complete.
clEnqueueBarrierWithWaitList can be used instead. If the event_wait_list and the event
arguments are not NULL, the event argument should not refer to an element of the
event_wait_list array.
 It is currently a requirement that the src_image and dst_image image memory objects
for clEnqueueCopyImage must have the exact same image format (i.e. the
cl_image_format descriptor specified when src_image and dst_image are created must
match).
 clEnqueueCopyImage returns CL_SUCCESS if the function is executed successfully.
Otherwise, it returns one of the following errors :
o CL_INVALID_COMMAND_QUEUE if command_queue is not a valid host
command-queue.
o CL_INVALID_CONTEXT if the context associated with command_queue,
src_image and dst_image are not the same or if the context associated with
command_queue and events in event_wait_list are not the same.
o CL_INVALID_MEM_OBJECT if src_image and dst_image are not valid image
objects.
o CL_IMAGE_FORMAT_MISMATCH if src_image and dst_image do not use the
same image format.
o CL_INVALID_VALUE if the 2D or 3D rectangular region specified by src_origin
and src_origin + region refers to a region outside src_image, or if the 2D or 3D
rectangular region specified by dst_origin and dst_origin + region refers to a region
outside dst_image.
o CL_INVALID_VALUE if values in src_origin, dst_origin and region do not follow
rules described in the argument description for src_origin, dst_origin and region.
o CL_INVALID_EVENT_WAIT_LIST if event_wait_list is NULL and
num_events_in_wait_list > 0, or event_wait_list is not NULL and
num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid
events.
o CL_INVALID_IMAGE_SIZE if image dimensions (image width, height, specified
or compute row and/or slice pitch) for src_image or dst_image are not supported by
device associated with queue.
o CL_IMAGE_FORMAT_NOT_SUPPORTED if image format (image channel order
and data type) for src_image or dst_image are not supported by device associated
with queue.
o CL_MEM_OBJECT_ALLOCATION_FAILURE if there is a failure to allocate
memory for data store associated with src_image or dst_image.
o CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the
OpenCL implementation on the device.
o CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required
by the OpenCL implementation on the host.
o CL_INVALID_OPERATION if the device associated with command_queue does
not support images (i.e. CL_DEVICE_IMAGE_SUPPORT specified in the Device
Queries table is CL_FALSE).
o CL_MEM_COPY_OVERLAP if src_image and dst_image are the same image
object and the source and destination regions overlap.
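A minimal sketch of the copy call follows; it copies a 64 × 64 tile from position (0, 0) of src_image to position (128, 32) of dst_image, assuming both are 2D images created with the same image format and that queue is a valid command-queue.
size_t src_origin[3] = {0, 0, 0};
size_t dst_origin[3] = {128, 32, 0};
size_t region[3]     = {64, 64, 1};

cl_int err = clEnqueueCopyImage(queue, src_image, dst_image,
                  src_origin, dst_origin, region,
                  0, NULL, NULL);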
10. Filling Image Objects
 To enqueue a command to fill an image object with a specified color, call the function
cl_int clEnqueueFillImage(
        cl_command_queue command_queue,
        cl_mem image,
        const void* fill_color,
        const size_t* origin,
        const size_t* region,
        cl_uint num_events_in_wait_list,
        const cl_event* event_wait_list,
        cl_event* event);
 command_queue refers to the host command-queue in which the fill command will be
queued. The OpenCL context associated with command_queue and image must be the
same.
 image is a valid image object.
 fill_color is the color used to fill the image. The fill color is a single floating point value
if the channel order is CL_DEPTH. Otherwise, the fill color is a four component RGBA
floating-point color value if the image channel data type is not an unnormalized signed
or unsigned integer type, is a four component signed integer value if the image channel
data type is an unnormalized signed integer type and is a four component unsigned
integer value if the image channel data type is an unnormalized unsigned integer type.
The fill color will be converted to the appropriate image channel format and order
associated with image.
 origin defines the (x, y, z) offset in pixels in the 1D, 2D or 3D image, the (x, y) offset
and the image index in the 2D image array or the (x) offset and the image index in the
1D image array. If image is a 2D image object, origin[2] must be 0. If image is a 1D
image or 1D image buffer object, origin[1] and origin[2] must be 0. If image is a 1D
image array object, origin[2] must be 0. If image is a 1D image array object, origin[1]
describes the image index in the 1D image array. If image is a 2D image array object,
origin[2] describes the image index in the 2D image array.
 region defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the
(width, height) in pixels of the 2D rectangle and the number of images of a 2D image
array or the (width) in pixels of the 1D rectangle and the number of images of a 1D
image array. If image is a 2D image object, region[2] must be 1. If image is a 1D image
or 1D image buffer object, region[1] and region[2] must be 1. If image is a 1D image
array object, region[2] must be 1. The values in region cannot be 0.
 event_wait_list and num_events_in_wait_list specify events that need to complete
before this particular command can be executed. If event_wait_list is NULL, then this
particular command does not wait on any event to complete. If event_wait_list is NULL,
num_events_in_wait_list must be 0. If event_wait_list is not NULL, the list of events
pointed to by event_wait_list must be valid and num_events_in_wait_list must be
greater than 0. The events specified in event_wait_list act as synchronization points. The
context associated with events in event_wait_list and command_queue must be the
same. The memory associated with event_wait_list can be reused or freed after the
function returns.
 event returns an event object that identifies this particular command and can be used to
query or queue a wait for this particular command to complete. event can be NULL in
which case it will not be possible for the application to query the status of this command
or queue a wait for this command to complete. clEnqueueBarrierWithWaitList can be
used instead. If the event_wait_list and the event arguments are not NULL, the event
argument should not refer to an element of the event_wait_list array.
 The usage information which indicates whether the memory object can be read or
written by a kernel and/or the host and is given by the cl_mem_flags argument value
specified when image is created is ignored by clEnqueueFillImage.
 clEnqueueFillImage returns CL_SUCCESS if the function is executed successfully.
Otherwise, it returns one of the following errors :
o CL_INVALID_COMMAND_QUEUE if command_queue is not a valid host
command-queue.
o CL_INVALID_CONTEXT if the context associated with command_queue and
image are not the same or if the context associated with command_queue and events
in event_wait_list are not the same.
o CL_INVALID_MEM_OBJECT if image is not a valid image object.
o CL_INVALID_VALUE if fill_color is NULL.
o CL_INVALID_VALUE if the region being filled as specified by origin and region
is out of bounds or if ptr is a NULL value.
o CL_INVALID_VALUE if values in origin and region do not follow rules described
in the argument description for origin and region.
o CL_INVALID_EVENT_WAIT_LIST if event_wait_list is NULL and
num_events_in_wait_list > 0, or event_wait_list is not NULL and
num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid
events.
o CL_INVALID_IMAGE_SIZE if image dimensions (image width, height, specified
or compute row and/or slice pitch) for image are not supported by device associated
with queue.
o CL_IMAGE_FORMAT_NOT_SUPPORTED if image format (image channel
order and data type) for image are not supported by device associated with queue.
o CL_MEM_OBJECT_ALLOCATION_FAILURE if there is a failure to allocate
memory for data store associated with image.
o CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by
the OpenCL implementation on the device.
o CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required
by the OpenCL implementation on the host.
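A minimal sketch of the fill call follows; it fills the whole of the hypothetical 640 × 480 RGBA image used in the earlier examples with opaque red. Because CL_UNORM_INT8 is not an unnormalized integer type, the fill colour is supplied as four floats.
float red[4] = {1.0f, 0.0f, 0.0f, 1.0f};
size_t origin[3] = {0, 0, 0};
size_t region[3] = {640, 480, 1};

cl_int err = clEnqueueFillImage(queue, img, red,
                  origin, region,
                  0, NULL, NULL);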
11. Copying between Image and Buffer Objects
 To enqueue a command to copy an image object to a buffer object, call the function
cl_int clEnqueueCopyImageToBuffer(
        cl_command_queue command_queue,
        cl_mem src_image,
        cl_mem dst_buffer,
        const size_t* src_origin,
        const size_t* region,
        size_t dst_offset,
        cl_uint num_events_in_wait_list,
        const cl_event* event_wait_list,
        cl_event* event);
 command_queue must be a valid host command-queue. The OpenCL context associated
with command_queue, src_image and dst_buffer must be the same.
 src_image is a valid image object.
 dst_buffer is a valid buffer object.
 src_origin defines the (x, y, z) offset in pixels in the 1D, 2D or 3D image, the (x, y)
offset and the image index in the 2D image array or the (x) offset and the image index in
the 1D image array. If src_image is a 2D image object, src_origin[2] must be 0. If
src_image is a 1D image or 1D image buffer object, src_origin[1] and src_origin[2]
must be 0. If src_image is a 1D image array object, src_origin[2] must be 0. If
src_image is a 1D image array object, src_origin[1] describes the image index in the 1D
image array. If src_image is a 2D image array object, src_origin[2] describes the image
index in the 2D image array.
 region defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the
(width, height) in pixels of the 2D rectangle and the number of images of a 2D image
array or the (width) in pixels of the 1D rectangle and the number of images of a 1D
image array. If src_image is a 2D image object, region[2] must be 1. If src_image is a
1D image or 1D image buffer object, region[1] and region[2] must be 1. If src_image is
a 1D image array object, region[2] must be 1. The values in region cannot be 0.
 dst_offset refers to the offset where to begin copying data into dst_buffer. The size in
bytes of the region to be copied referred to as dst_cb is computed as width × height ×
depth × bytes/image element if src_image is a 3D image object, is computed as width ×
height × bytes/image element if src_image is a 2D image, is computed as width × height
× arraysize × bytes/image element if src_image is a 2D image array object, is computed
as width × bytes/image element if src_image is a 1D image or 1D image buffer object
and is computed as width × arraysize × bytes/image element if src_image is a 1D image
array object.
 event_wait_list and num_events_in_wait_list specify events that need to complete
before this particular command can be executed. If event_wait_list is NULL, then this
particular command does not wait on any event to complete. If event_wait_list is NULL,
num_events_in_wait_list must be 0. If event_wait_list is not NULL, the list of events
pointed to by event_wait_list must be valid and num_events_in_wait_list must be
greater than 0. The events specified in event_wait_list act as synchronization points. The
context associated with events in event_wait_list and command_queue must be the
same. The memory associated with event_wait_list can be reused or freed after the
function returns.
 event returns an event object that identifies this particular copy command and can be
used to query or queue a wait for this particular command to complete. event can be
NULL in which case it will not be possible for the application to query the status of this
command or queue a wait for this command to complete.
clEnqueueBarrierWithWaitList can be used instead. If the event_wait_list and the event
arguments are not NULL, the event argument should not refer to an element of the
event_wait_list array.
 clEnqueueCopyImageToBuffer returns CL_SUCCESS if the function is executed
successfully. Otherwise, it returns one of the following errors :
o CL_INVALID_COMMAND_QUEUE if command_queue is not a valid host
command-queue.
o CL_INVALID_CONTEXT if the context associated with command_queue,
src_image and dst_buffer are not the same or if the context associated with
command_queue and events in event_wait_list are not the same.
o CL_INVALID_MEM_OBJECT if src_image is not a valid image object or
dst_buffer is not a valid buffer object or if src_image is a 1D image buffer object
created from dst_buffer.
o CL_INVALID_VALUE if the 1D, 2D or 3D rectangular region specified by
src_origin and src_origin + region refers to a region outside src_image, or if the
region specified by dst_offset and dst_offset + dst_cb to a region outside dst_buffer.
o CL_INVALID_VALUE if values in src_origin and region do not follow rules
described in the argument description for src_origin and region.
o CL_INVALID_EVENT_WAIT_LIST if event_wait_list is NULL and
num_events_in_wait_list > 0, or event_wait_list is not NULL and
num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid
events.
o CL_MISALIGNED_SUB_BUFFER_OFFSET if dst_buffer is a sub-buffer object
and offset specified when the sub-buffer object is created is not aligned to
CL_DEVICE_MEM_BASE_ADDR_ALIGN value for device associated with
queue.
o CL_INVALID_IMAGE_SIZE if image dimensions (image width, height, specified
or compute row and/or slice pitch) for src_image are not supported by device
associated with queue.
o CL_IMAGE_FORMAT_NOT_SUPPORTED if image format (image channel
order and data type) for src_image are not supported by device associated with
queue.
o CL_MEM_OBJECT_ALLOCATION_FAILURE if there is a failure to allocate
memory for data store associated with src_image or dst_buffer.
o CL_INVALID_OPERATION if the device associated with command_queue does
not support images (i.e. CL_DEVICE_IMAGE_SUPPORT specified in the Device
Queries table is CL_FALSE).
o CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by
the OpenCL implementation on the device.
o CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required
by the OpenCL implementation on the host.
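A minimal sketch of this call follows; it copies the whole of a 640 × 480 CL_RGBA / CL_UNORM_INT8 image into dst_buffer starting at byte offset 0, so dst_buffer is assumed to be at least 640 × 480 × 4 bytes in size.
size_t src_origin[3] = {0, 0, 0};
size_t region[3]     = {640, 480, 1};

cl_int err = clEnqueueCopyImageToBuffer(queue, img, dst_buffer,
                  src_origin, region,
                  0,                /* dst_offset */
                  0, NULL, NULL);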
 To enqueue a command to copy a buffer object to an image object, call the function
cl_int clEnqueueCopyBufferToImage(
        cl_command_queue command_queue,
        cl_mem src_buffer,
        cl_mem dst_image,
        size_t src_offset,
        const size_t* dst_origin,
        const size_t* region,
        cl_uint num_events_in_wait_list,
        const cl_event* event_wait_list,
        cl_event* event);
 command_queue must be a valid host command-queue. The OpenCL context associated
with command_queue, src_buffer and dst_image must be the same.
 src_buffer is a valid buffer object.
 dst_image is a valid image object.
 src_offset refers to the offset where to begin copying data from src_buffer.
 dst_origin defines the (x, y, z) offset in pixels in the 1D, 2D or 3D image, the (x, y)
offset and the image index in the 2D image array or the (x) offset and the image index in
the 1D image array. If dst_image is a 2D image object, dst_origin[2] must be 0. If
dst_image is a 1D image or 1D image buffer object, dst_origin[1] and dst_origin[2]
must be 0. If dst_image is a 1D image array object, dst_origin[2] must be 0. If
dst_image is a 1D image array object, dst_origin[1] describes the image index in the 1D
image array. If dst_image is a 2D image array object, dst_origin[2] describes the image
index in the 2D image array.
 region defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the
(width, height) in pixels of the 2D rectangle and the number of images of a 2D image
array or the (width) in pixels of the 1D rectangle and the number of images of a 1D
image array. If dst_image is a 2D image object, region[2] must be 1. If dst_image is a
1D image or 1D image buffer object, region[1] and region[2] must be 1. If dst_image is
a 1D image array object, region[2] must be 1. The values in region cannot be 0.
 event_wait_list and num_events_in_wait_list specify events that need to complete
before this particular command can be executed. If event_wait_list is NULL, then this
particular command does not wait on any event to complete. If event_wait_list is NULL,
num_events_in_wait_list must be 0. If event_wait_list is not NULL, the list of events
pointed to by event_wait_list must be valid and num_events_in_wait_list must be
greater than 0. The events specified in event_wait_list act as synchronization points. The
context associated with events in event_wait_list and command_queue must be the
same. The memory associated with event_wait_list can be reused or freed after the
function returns.
 event returns an event object that identifies this particular copy command and can be
used to query or queue a wait for this particular command to complete. event can be
NULL in which case it will not be possible for the application to query the status of this
command or queue a wait for this command to complete.
clEnqueueBarrierWithWaitList can be used instead. If the event_wait_list and the
event arguments are not NULL, the event argument should not refer to an element of the
event_wait_list array.
The size in bytes of the region to be copied from src_buffer referred to as src_cb is
computed as width × height × depth × bytes/image element if dst_image is a 3D image
object, is computed as width × height × bytes/image element if dst_image is a 2D image,
is computed as width × height × arraysize × bytes/image element if dst_image is a 2D
image array object, is computed as width × bytes/image element if dst_image is a 1D
image or 1D image buffer object and is computed as width × arraysize × bytes/image
element if dst_image is a 1D image array object.
 clEnqueueCopyBufferToImage returns CL_SUCCESS if the function is executed
successfully. Otherwise, it returns one of the following errors:
o CL_INVALID_COMMAND_QUEUE if command_queue is not a valid host
command-queue.
o CL_INVALID_CONTEXT if the context associated with command_queue,
src_buffer and dst_image are not the same or if the context associated with
command_queue and events in event_wait_list are not the same.
o CL_INVALID_MEM_OBJECT if src_buffer is not a valid buffer object or
dst_image is not a valid image object or if dst_image is a 1D image buffer object
created from src_buffer.
o CL_INVALID_VALUE if the 1D, 2D or 3D rectangular region specified by
dst_origin and dst_origin + region refer to a region outside dst_image, or if the
region specified by src_offset and src_offset + src_cb refer to a region outside
src_buffer.
o CL_INVALID_VALUE if values in dst_origin and region do not follow rules
described in the argument description for dst_origin and region.
o CL_INVALID_EVENT_WAIT_LIST if event_wait_list is NULL and
num_events_in_wait_list > 0, or event_wait_list is not NULL and
num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid
events.
o CL_MISALIGNED_SUB_BUFFER_OFFSET if src_buffer is a sub-buffer object
and offset specified when the sub-buffer object is created is not aligned to
CL_DEVICE_MEM_BASE_ADDR_ALIGN value for device associated with
queue.
o CL_INVALID_IMAGE_SIZE if image dimensions (image width, height, specified
or compute row and/or slice pitch) for dst_image are not supported by device
associated with queue.
o CL_IMAGE_FORMAT_NOT_SUPPORTED if image format (image channel
order and data type) for dst_image are not supported by device associated with
queue.
o CL_MEM_OBJECT_ALLOCATION_FAILURE if there is a failure to allocate
memory for data store associated with src_buffer or dst_image.
o CL_INVALID_OPERATION if the device associated with command_queue does
not support images (i.e. CL_DEVICE_IMAGE_SUPPORT specified in the Device
Queries table is CL_FALSE).
o CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by
the OpenCL implementation on the device.
o CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required
by the OpenCL implementation on the host.
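The reverse direction is sketched below : pixel data is copied from src_buffer (starting at byte offset 0) into the whole of the same hypothetical 640 × 480 image.
size_t dst_origin[3] = {0, 0, 0};
size_t region[3]     = {640, 480, 1};

cl_int err = clEnqueueCopyBufferToImage(queue, src_buffer, img,
                  0,                /* src_offset */
                  dst_origin, region,
                  0, NULL, NULL);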
12. Mapping Image Objects
 To enqueue a command to map a region in the image object given by image into the
host address space and returns a pointer to this mapped region, call the function.
void* clEnqueueMapImage(
    cl_command_queue command_queue,
    cl_mem image,
    cl_bool blocking_map,
    cl_map_flags map_flags,
    const size_t* origin,
    const size_t* region,
    size_t* image_row_pitch,
    size_t* image_slice_pitch,
    cl_uint num_events_in_wait_list,
    const cl_event* event_wait_list,
    cl_event* event,
    cl_int* errcode_ret);
 command_queue must be a valid host command-queue.
 image is a valid image object. The OpenCL context associated with command_queue
and image must be the same.
 blocking_map indicates if the map operation is blocking or non-blocking.
 map_flags is a bit-field that specifies how the region is to be mapped, for example CL_MAP_READ, CL_MAP_WRITE or CL_MAP_WRITE_INVALIDATE_REGION.
 origin defines the (x, y, z) offset in pixels in the 1D, 2D or 3D image, the (x, y) offset
and the image index in the 2D image array or the (x) offset and the image index in the
1D image array. If image is a 2D image object, origin[2] must be 0. If image is a 1D
image or 1D image buffer object, origin[1] and origin[2] must be 0. If image is a 1D
image array object, origin[2] must be 0. If image is a 1D image array object, origin[1]
describes the image index in the 1D image array. If image is a 2D image array object,
origin[2] describes the image index in the 2D image array.
 region defines the (width, height, depth) in pixels of the 1D, 2D or 3D rectangle, the
(width, height) in pixels of the 2D rectangle and the number of images of a 2D image
array or the (width) in pixels of the 1D rectangle and the number of images of a 1D
image array. If image is a 2D image object, region[2] must be 1. If image is a 1D image
or 1D image buffer object, region[1] and region[2] must be 1. If image is a 1D image
array object, region[2] must be 1. The values in region cannot be 0.
 image_row_pitch returns the scan-line pitch in bytes for the mapped region. This must
be a non-NULL value.
 image_slice_pitch returns the size in bytes of each 2D slice of a 3D image or the size of
each 1D or 2D image in a 1D or 2D image array for the mapped region. For a 1D and
2D image, zero is returned if this argument is not NULL. For a 3D image, 1D and 2D
image array, image_slice_pitch must be a non-NULL value.
 event_wait_list and num_events_in_wait_list specify events that need to complete
before clEnqueueMapImage can be executed. If event_wait_list is NULL, then
clEnqueueMapImage does not wait on any event to complete. If event_wait_list is
NULL, num_events_in_wait_list must be 0. If event_wait_list is not NULL, the list of
events pointed to by event_wait_list must be valid and num_events_in_wait_list must be
greater than 0. The events specified in event_wait_list act as synchronization points. The
context associated with events in event_wait_list and command_queue must be the
same. The memory associated with event_wait_list can be reused or freed after the
function returns.
 event returns an event object that identifies this particular command and can be used to
query or queue a wait for this particular command to complete. event can be NULL in
which case it will not be possible for the application to query the status of this command
or queue a wait for this command to complete. If the event_wait_list and the event
arguments are not NULL, the event argument should not refer to an element of the
event_wait_list array.
 errcode_ret will return an appropriate error code. If errcode_ret is NULL, no error code
is returned.
 If blocking_map is CL_TRUE, clEnqueueMapImage does not return until the
specified region in image is mapped into the host address space and the application can
access the contents of the mapped region using the pointer returned by
clEnqueueMapImage.
 If blocking_map is CL_FALSE i.e. map operation is non-blocking, the pointer to the
mapped region returned by clEnqueueMapImage cannot be used until the map command
has completed. The event argument returns an event object which can be used to query
the execution status of the map command. When the map command is completed, the
application can access the contents of the mapped region using the pointer returned by
clEnqueueMapImage.
 clEnqueueMapImage will return a pointer to the mapped region. The errcode_ret is set
to CL_SUCCESS.
 A NULL pointer is returned otherwise with one of the following error values returned in
errcode_ret :
o CL_INVALID_COMMAND_QUEUE if command_queue is not a valid host
command-queue.
o CL_INVALID_CONTEXT if context associated with command_queue and image
are not the same or if context associated with command_queue and events in
event_wait_list are not the same.
o CL_INVALID_MEM_OBJECT if image is not a valid image object.
o CL_INVALID_VALUE if region being mapped given by (origin, origin+region) is
out of bounds or if values specified in map_flags are not valid.
o CL_INVALID_VALUE if values in origin and region do not follow rules described
in the argument description for origin and region.
o CL_INVALID_VALUE if image_row_pitch is NULL.
o CL_INVALID_VALUE if image is a 3D image, 1D or 2D image array object and
image_slice_pitch is NULL.
o CL_INVALID_EVENT_WAIT_LIST if event_wait_list is NULL and
num_events_in_wait_list > 0, or event_wait_list is not NULL and
num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid
events.
o CL_INVALID_IMAGE_SIZE if image dimensions (image width, height, specified
or compute row and/or slice pitch) for image are not supported by device associated
with queue.
o CL_IMAGE_FORMAT_NOT_SUPPORTED if image format (image channel
order and data type) for image are not supported by device associated with queue.
o CL_MAP_FAILURE if there is a failure to map the requested region into the host
address space. This error cannot occur for image objects created with CL_MEM_
USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR.
o CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST if the map
operation is blocking and the execution status of any of the events in event_wait_list
is a negative integer value.
o CL_MEM_OBJECT_ALLOCATION_FAILURE if there is a failure to allocate
memory for data store associated with image.
o CL_INVALID_OPERATION if the device associated with command_queue does
not support images (i.e. CL_DEVICE_IMAGE_SUPPORT is CL_FALSE).
o CL_INVALID_OPERATION if image has been created with CL_MEM_HOST_
WRITE_ONLY or CL_MEM_HOST_NO_ACCESS and CL_MAP_READ is set in
map_flags or if image has been created with CL_MEM_HOST_READ_ONLY or
CL_MEM_HOST_NO_ACCESS and CL_MAP_WRITE or CL_MAP_WRITE_
INVALIDATE_REGION is set in map_flags.
o CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by
the OpenCL implementation on the device.
o CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required
by the OpenCL implementation on the host.
o CL_INVALID_OPERATION if mapping would lead to overlapping regions being
mapped for writing.
 The pointer returned maps a 1D, 2D or 3D region starting at origin and is at least
region[0] pixels in size for a 1D image, 1D image buffer or 1D image array,
(image_row_pitch × region[1]) pixels in size for a 2D image or 2D image array, and
(image_slice_pitch × region[2]) pixels in size for a 3D image. The result of a memory
access outside this region is undefined.
 If the image object is created with CL_MEM_USE_HOST_PTR set in mem_flags, the
following will be true :
o The host_ptr specified in clCreateImage is guaranteed to contain the latest bits in
the region being mapped when the clEnqueueMapImage command has completed.
o The pointer value returned by clEnqueueMapImage will be derived from the
host_ptr specified when the image object is created.
o Mapped image objects are unmapped using clEnqueueUnmapMemObject. This is
described in Unmapping Mapped Memory Objects.
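A minimal sketch of a blocking, read-only map of the first row of a 2D image follows; queue, img and the width value are assumed, and error handling is reduced to a single check.
// Sketch only : map one row of a 2D image for reading (assumed handles).
size_t origin[3] = {0, 0, 0};
size_t region[3] = {width, 1, 1};          // width pixels, one row, depth 1
size_t row_pitch = 0, slice_pitch = 0;
cl_int err;
void *ptr = clEnqueueMapImage(queue, img, CL_TRUE, CL_MAP_READ,
                              origin, region, &row_pitch, &slice_pitch,
                              0, NULL, NULL, &err);
if (err == CL_SUCCESS) {
    // ... read pixels through ptr; row_pitch gives the scan-line size in bytes ...
    clEnqueueUnmapMemObject(queue, img, ptr, 0, NULL, NULL);
}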
13. Image Object Queries
 To get information that is common to all memory objects, use the
clGetMemObjectInfo function described in Memory Object Queries.
 To get information specific to an image object created with clCreateImage, call the
function.
cl_int clGetImageInfo(
    cl_mem image,
    cl_image_info param_name,
    size_t param_value_size,
    void* param_value,
    size_t* param_value_size_ret);
 image specifies the image object being queried.
 param_name specifies the information to query. The list of supported param_name values
and the information returned in param_value by clGetImageInfo is given in Table 4.4.9.
 param_value is a pointer to memory where the appropriate result being queried is
returned. If param_value is NULL, it is ignored.
 param_value_size is used to specify the size in bytes of memory pointed to by
param_value. This size must be ≥ size of return type.
 param_value_size_ret returns the actual size in bytes of data being queried by
param_name. If param_value_size_ret is NULL, it is ignored.
Table 4.4.9 : List of supported param_names by clGetImageInfo

cl_image_info | Return type | Information returned in param_value
CL_IMAGE_FORMAT | cl_image_format | Return image format descriptor specified when image is created with clCreateImage.
CL_IMAGE_ELEMENT_SIZE | size_t | Return size of each element of the image memory object given by image in bytes. An element is made up of n channels. The value of n is given in the cl_image_format descriptor.
CL_IMAGE_ROW_PITCH | size_t | Return calculated row pitch in bytes of a row of elements of the image object given by image.
CL_IMAGE_SLICE_PITCH | size_t | Return calculated slice pitch in bytes of a 2D slice for the 3D image object or size of each image in a 1D or 2D image array given by image. For a 1D image, 1D image buffer and 2D image object return 0.
CL_IMAGE_WIDTH | size_t | Return width of the image in pixels.
CL_IMAGE_HEIGHT | size_t | Return height of the image in pixels. For a 1D image, 1D image buffer and 1D image array object, height = 0.
CL_IMAGE_DEPTH | size_t | Return depth of the image in pixels. For a 1D image, 1D image buffer, 2D image or 1D and 2D image array object, depth = 0.
CL_IMAGE_ARRAY_SIZE | size_t | Return number of images in the image array. If image is not an image array, 0 is returned.
CL_IMAGE_NUM_MIP_LEVELS | cl_uint | Return num_mip_levels associated with image.
CL_IMAGE_NUM_SAMPLES | cl_uint | Return num_samples associated with image.
 clGetImageInfo returns CL_SUCCESS if the function is executed successfully.
Otherwise, it returns one of the following errors :
o CL_INVALID_VALUE if param_name is not valid, or if size in bytes specified by
param_value_size is < size of return type and param_value is not NULL.
o CL_INVALID_MEM_OBJECT if image is a not a valid image object.
o CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by
the OpenCL implementation on the device.
o CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required
by the OpenCL implementation on the host.
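For example, the dimensions of an existing image object can be queried as in the sketch below (img is an assumed image handle).
// Sketch only : query image dimensions (img is an assumed image object).
size_t w = 0, h = 0;
clGetImageInfo(img, CL_IMAGE_WIDTH, sizeof(size_t), &w, NULL);
clGetImageInfo(img, CL_IMAGE_HEIGHT, sizeof(size_t), &h, NULL);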
 4.4.1.3 Pipe
1. The pipe memory object conceptually is an ordered sequence of data items. A pipe has two
endpoints : a write endpoint into which data items are inserted, and a read endpoint from
which data items are removed. At any one time, only one kernel instance may write into a
pipe, and only one kernel instance may read from a pipe. To support the producer consumer
design pattern, one kernel instance connects to the write endpoint (the producer) while
another kernel instance connects to the reading endpoint (the consumer).
2. A pipe is a memory object that stores data organized as a FIFO. Pipe objects can only be
accessed using built-in functions that read from and write to a pipe (a kernel-side sketch
follows the list below); they are not accessible from the host. A pipe object encapsulates the
following information :
 Packet size in bytes
 Maximum capacity in packets
 Information about the number of packets currently in the pipe
 Data packets
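As a hedged illustration of the producer-consumer pattern, the OpenCL C (2.0 or later) kernels below use the built-in write_pipe and read_pipe functions; the int packet type and the kernel names are assumptions for this sketch only.
// Sketch only (OpenCL C 2.0+) : one kernel writes packets, another reads them.
__kernel void producer(__write_only pipe int p, __global const int *src)
{
    int value = src[get_global_id(0)];
    write_pipe(p, &value);             // returns 0 on success, non-zero if the pipe is full
}

__kernel void consumer(__read_only pipe int p, __global int *dst)
{
    int value;
    if (read_pipe(p, &value) == 0)     // 0 means a packet was successfully read
        dst[get_global_id(0)] = value;
}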
3. Creating Pipe Objects
 To create a pipe object, call the function
cl_mem clCreatePipe(
    cl_context context,
    cl_mem_flags flags,
    cl_uint pipe_packet_size,
    cl_uint pipe_max_packets,
    const cl_pipe_properties* properties,
    cl_int* errcode_ret);
 context is a valid OpenCL context used to create the pipe object.
 flags is a bit-field that is used to specify allocation and usage information such as the
memory arena that should be used to allocate the pipe object and how it will be used.
The Memory Flags table describes the possible values for flags. Only CL_MEM_
READ_WRITE and CL_MEM_HOST_NO_ACCESS can be specified when creating a
pipe object. If the value specified for flags is 0, the default is used which is
CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS.
 pipe_packet_size is the size in bytes of a pipe packet.
 pipe_max_packets specifies the pipe capacity by specifying the maximum number of
packets the pipe can hold.
 properties specifies a list of properties for the pipe and their corresponding values. Each
property name is immediately followed by the corresponding desired value. The list is
terminated with 0. In OpenCL 2.2, properties must be NULL.
 errcode_ret will return an appropriate error code. If errcode_ret is NULL, no error code
is returned.
 clCreatePipe returns a valid non-zero pipe object and errcode_ret is set to
CL_SUCCESS if the pipe object is created successfully. Otherwise, it returns a NULL
value with one of the following error values returned in errcode_ret :
o CL_INVALID_CONTEXT if context is not a valid context.
o CL_INVALID_VALUE if values specified in flags are not as defined above.
o CL_INVALID_VALUE if properties is not NULL.
o CL_INVALID_PIPE_SIZE if pipe_packet_size is 0 or the pipe_packet_size exceeds
CL_DEVICE_PIPE_MAX_PACKET_SIZE value for all devices in context or if
pipe_max_packets is 0.
o CL_MEM_OBJECT_ALLOCATION_FAILURE if there is a failure to allocate
memory for the pipe object.
o CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the
OpenCL implementation on the device.
o CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required
by the OpenCL implementation on the host.
 Pipes follow the same memory consistency model as defined for buffer and image
objects. The pipe state i.e. contents of the pipe across kernel-instances (on the same or
different devices) is enforced at a synchronization point.
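A minimal host-side sketch of pipe creation follows; the context handle ctx and the choice of an int-sized packet are assumptions made for this example.
// Sketch only : create a pipe holding up to 1024 int-sized packets (ctx is assumed).
cl_int err;
cl_mem pkt_pipe = clCreatePipe(ctx,
        CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS,   // the only flags permitted
        sizeof(cl_int),    // pipe_packet_size in bytes
        1024,              // pipe_max_packets
        NULL,              // properties must be NULL
        &err);
// The pipe is passed to kernels with clSetKernelArg like any other memory object;
// the host itself can never read or write its contents.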
4. Pipe Object Queries
 To get information that is common to all memory objects, use the
clGetMemObjectInfo function described in Memory Object Queries.
 To get information specific to a pipe object created with clCreatePipe, call the function
cl_int clGetPipeInfo(
    cl_mem pipe,
    cl_pipe_info param_name,
    size_t param_value_size,
    void* param_value,
    size_t* param_value_size_ret);
 pipe specifies the pipe object being queried.
 param_name specifies the information to query. The list of supported param_name values
and the information returned in param_value by clGetPipeInfo is given in Table 4.4.10.
 param_value is a pointer to memory where the appropriate result being queried is
returned. If param_value is NULL, it is ignored.
 param_value_size is used to specify the size in bytes of memory pointed to by
param_value. This size must be ≥ size of return type.
 param_value_size_ret returns the actual size in bytes of data being queried by
param_name. If param_value_size_ret is NULL, it is ignored.
 clGetPipeInfo returns CL_SUCCESS if the function is executed successfully.
Otherwise, it returns one of the following errors :
o CL_INVALID_VALUE if param_name is not valid, or if size in bytes specified by
param_value_size is < size of return type as described in the Pipe Object Queries
table and param_value is not NULL.
o CL_INVALID_MEM_OBJECT if pipe is a not a valid pipe object.
o CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by
the OpenCL implementation on the device.
o CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required
by the OpenCL implementation on the host.
Table 4.4.10 : List of supported param_names by clGetPipeInfo

cl_pipe_info | Return type | Information returned in param_value
CL_PIPE_PACKET_SIZE | cl_uint | Return pipe packet size specified when pipe is created with clCreatePipe.
CL_PIPE_MAX_PACKETS | cl_uint | Return max. number of packets specified when pipe is created with clCreatePipe.
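For instance, the packet size of an existing pipe object (pkt_pipe is an assumed handle) can be read back as follows.
// Sketch only : query the packet size of a pipe object (assumed handle).
cl_uint packet_size = 0;
clGetPipeInfo(pkt_pipe, CL_PIPE_PACKET_SIZE, sizeof(cl_uint), &packet_size, NULL);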

 4.4.2 Various Operations on Memory Objects

 4.4.2.1 Retaining and Releasing Memory Objects


 To retain a memory object, the following function is used :
cl_int clRetainMemObject(
    cl_mem memobj);
 memobj specifies the memory object to be retained.
 The memobj reference count is incremented.
 clRetainMemObject returns CL_SUCCESS if the function is executed successfully.
Otherwise, it returns one of the following errors :
o CL_INVALID_MEM_OBJECT if memobj is not a valid memory object (buffer or
image object).
o CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by
the OpenCL implementation on the device.
o CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required
by the OpenCL implementation on the host.
 clCreateBuffer, clCreateSubBuffer, clCreateImage and clCreatePipe perform an
implicit retain.
 To release a memory object, call the function
cl_int clReleaseMemObject(
    cl_mem memobj);
 memobj specifies the memory object to be released.
 The memobj reference count is decremented.
 clReleaseMemObject returns CL_SUCCESS if the function is executed successfully.
Otherwise, it returns one of the following errors :
o CL_INVALID_MEM_OBJECT if memobj is not a valid memory object.
o CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by
the OpenCL implementation on the device.
o CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required
by the OpenCL implementation on the host.
 After the memobj reference count becomes zero and commands queued for execution on
a command-queue(s) that use memobj have finished, the memory object is deleted. If
memobj is a buffer object, memobj cannot be deleted until all sub-buffer objects
associated with memobj are deleted. Using this function to release a reference that was
not obtained by creating the object or by calling clRetainMemObject causes undefined
behavior.
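The reference-counting behaviour can be illustrated with the short, hedged sketch below (buf is an assumed buffer object created earlier).
// Sketch only : reference counting on a memory object (buf is assumed).
clRetainMemObject(buf);     // count goes from 1 (set at creation) to 2
clReleaseMemObject(buf);    // back to 1; the object is still alive
clReleaseMemObject(buf);    // reaches 0; deleted once queued commands using it finish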
 To register a user callback function with a memory object, the following function is called :
cl_int clSetMemObjectDestructorCallback(
    cl_mem memobj,
    void (CL_CALLBACK* pfn_notify)(cl_mem memobj, void* user_data),
    void* user_data);
 memobj is a valid memory object.
 pfn_notify is the callback function that can be registered by the application. This
callback function may be called asynchronously by the OpenCL implementation. It is
the application's responsibility to ensure that the callback function is thread-safe. The
parameters to this callback function are :
o memobj is the memory object being deleted. When the user callback is called by the
implementation, this memory object is no longer valid. memobj is only provided for
reference purposes.
o user_data is a pointer to user supplied data.
 user_data will be passed as the user_data argument when pfn_notify is called. user_data
can be NULL.
 Each call to clSetMemObjectDestructorCallback registers the specified user callback
function on a callback stack associated with memobj. The registered user callback
functions are called in the reverse order in which they were registered. The user callback
functions are called and then the memory object's resources are freed and the memory
object is deleted. This provides a mechanism for the application (and libraries) using
memobj to be notified when the memory referenced by host_ptr, specified when the
memory object is created and used as the storage bits for the memory object, can be
reused or freed.
 clSetMemObjectDestructorCallback returns CL_SUCCESS if the function is
executed successfully. Otherwise, it returns one of the following errors :
o CL_INVALID_MEM_OBJECT if memobj is not a valid memory object.
o CL_INVALID_VALUE if pfn_notify is NULL.
o CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the
OpenCL implementation on the device.
o CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required
by the OpenCL implementation on the host.
 When the user callback function is called by the implementation, the contents of the
memory region pointed to by host_ptr (if the memory object is created with
CL_MEM_USE_HOST_PTR) are undefined. The callback function is typically used
by the application to either free or reuse the memory region pointed to by host_ptr.
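A hedged sketch of registering such a callback follows; the function name free_host_ptr, the handle buf and the malloc'ed host_ptr are assumptions for this example.
// Sketch only : free a malloc'ed host_ptr when the memory object is destroyed.
void CL_CALLBACK free_host_ptr(cl_mem memobj, void *user_data)
{
    free(user_data);            // user_data is assumed to be the original host_ptr
}
// Registration, inside the host set-up code (buf and host_ptr are assumed) :
//     clSetMemObjectDestructorCallback(buf, free_host_ptr, host_ptr);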
 The behavior of calling expensive system routines, OpenCL API calls to create contexts
or command-queues, or any of the blocking OpenCL operations listed below from within a
callback is undefined.
o clFinish,
o clWaitForEvents,
o blocking calls to clEnqueueReadBuffer, clEnqueueReadBufferRect,
clEnqueueWriteBuffer, clEnqueueWriteBufferRect,
o blocking calls to clEnqueueReadImage and clEnqueueWriteImage,
o blocking calls to clEnqueueMapBuffer, clEnqueueMapImage,
o blocking calls to clBuildProgram, clCompileProgram or clLinkProgram
 If an application needs to wait for completion of a routine from the above list in a
callback, please use the non-blocking form of the function, and assign a completion
callback to it to do the remainder of the work. Note that when a callback (or other code)
enqueues commands to a command-queue, the commands are not required to begin
execution until the queue is flushed. In standard usage, blocking enqueue calls serve this
role by implicitly flushing the queue. Since blocking calls are not permitted in callbacks,
those callbacks that enqueue commands on a command queue should either call clFlush
on the queue before returning or arrange for clFlush to be called later on another thread.
 The user callback function may not call OpenCL APIs with the memory object for
which the callback function is invoked and for such cases the behavior of OpenCL APIs
is considered to be undefined.

 4.4.2.2 Unmapping Mapped Memory Objects


 To enqueue a command to unmap a previously mapped region of a memory object, the following
function is used :
cl_int clEnqueueUnmapMemObject(
    cl_command_queue command_queue,
    cl_mem memobj,
    void* mapped_ptr,
    cl_uint num_events_in_wait_list,
    const cl_event* event_wait_list,
    cl_event* event);
 command_queue must be a valid host command-queue.
 memobj is a valid memory (buffer or image) object. The OpenCL context associated
with command_queue and memobj must be the same.
 mapped_ptr is the host address returned by a previous call to clEnqueueMapBuffer, or
clEnqueueMapImage for memobj.
 event_wait_list and num_events_in_wait_list specify events that need to complete
before clEnqueueUnmapMemObject can be executed. If event_wait_list is NULL,
then clEnqueueUnmapMemObject does not wait on any event to complete. If
event_wait_list is NULL, num_events_in_wait_list must be 0. If event_wait_list is not
NULL, the list of events pointed to by event_wait_list must be valid and
num_events_in_wait_list must be greater than 0. The events specified in event_wait_list
act as synchronization points. The context associated with events in event_wait_list and
command_queue must be the same. The memory associated with event_wait_list can be
reused or freed after the function returns.
 event returns an event object that identifies this particular command and can be used to
query or queue a wait for this particular command to complete. event can be NULL in
which case it will not be possible for the application to query the status of this command
or queue a wait for this command to complete. clEnqueueBarrierWithWaitList can be
used instead. If the event_wait_list and the event arguments are not NULL, the event
argument should not refer to an element of the event_wait_list array.
 Reads or writes from the host using the pointer returned by clEnqueueMapBuffer or
clEnqueueMapImage are considered to be complete.
 clEnqueueMapBuffer and clEnqueueMapImage increment the mapped count of the
memory object. The initial mapped count value of the memory object is zero. Multiple
calls to clEnqueueMapBuffer, or clEnqueueMapImage on the same memory object
will increment this mapped count by appropriate number of calls.
clEnqueueUnmapMemObject decrements the mapped count of the memory object.
 clEnqueueMapBuffer, and clEnqueueMapImage act as synchronization points for a
region of the buffer object being mapped.
 clEnqueueUnmapMemObject returns CL_SUCCESS if the function is executed
successfully. Otherwise, it returns one of the following errors :
o CL_INVALID_COMMAND_QUEUE if command_queue is not a valid host
command-queue.
o CL_INVALID_MEM_OBJECT if memobj is not a valid memory object or is a
pipe object.
o CL_INVALID_VALUE if mapped_ptr is not a valid pointer returned by
clEnqueueMapBuffer or clEnqueueMapImage for memobj.
o CL_INVALID_EVENT_WAIT_LIST if event_wait_list is NULL and
num_events_in_wait_list > 0, or if event_wait_list is not NULL and
num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid
events.
o CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by
the OpenCL implementation on the device.
o CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required
by the OpenCL implementation on the host.
o CL_INVALID_CONTEXT if context associated with command_queue and
memobj are not the same or if the context associated with command_queue and
events in event_wait_list are not the same.
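The typical map/update/unmap round trip on a buffer looks like the hedged sketch below; queue, buf and size are assumed to exist.
// Sketch only : map a buffer, update it on the host, then unmap it.
cl_int err;
void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                               0, size, 0, NULL, NULL, &err);
if (err == CL_SUCCESS) {
    memset(ptr, 0, size);                               // host-side update
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
}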

 4.4.2.3 Accessing Mapped Regions of a Memory Object


1. The contents of the region of a memory object and associated memory objects (sub-buffer
objects or 1D image buffer objects that overlap this region) mapped for writing (i.e. CL_
MAP_WRITE or CL_MAP_WRITE_INVALIDATE_REGION is set in map_flags
argument to clEnqueueMapBuffer, or clEnqueueMapImage) are considered to be undefined
until this region is unmapped.
2. Multiple commands in command-queues can map a region or overlapping regions of a
memory object and associated memory objects (sub-buffer objects or 1D image buffer
objects that overlap this region) for reading (i.e. map_flags = CL_MAP_READ). The
contents of the regions of a memory object mapped for reading can also be read by kernels
and other OpenCL commands (such as clEnqueueCopyBuffer) executing on a device(s).
3. Mapping (and unmapping) overlapped regions in a memory object and / or associated
memory objects (sub-buffer objects or 1D image buffer objects that overlap this region) for
writing is an error and will result in CL_INVALID_OPERATION error returned by
clEnqueueMapBuffer, or clEnqueueMapImage.
4. If a memory object is currently mapped for writing, the application must ensure that the
memory object is unmapped before any enqueued kernels or commands that read from or
write to this memory object or any of its associated memory objects (sub-buffer or 1D
image buffer objects) or its parent object (if the memory object is a sub-buffer or 1D image
buffer object) begin execution; otherwise the behavior is undefined.
5. If a memory object is currently mapped for reading, the application must ensure that the
memory object is unmapped before any enqueued kernels or commands that write to this
memory object or any of its associated memory objects (sub-buffer or 1D image buffer
objects) or its parent object (if the memory object is a sub-buffer or 1D image buffer object)
begin execution; otherwise the behavior is undefined.
6. A memory object is considered as mapped if there are one or more active mappings for the
memory object irrespective of whether the mapped regions span the entire memory object.
7. Accessing the contents of the memory region referred to by the mapped pointer that has
been unmapped is undefined.
8. The mapped pointer returned by clEnqueueMapBuffer or clEnqueueMapImage can be
used as the ptr argument value to clEnqueueReadBuffer, clEnqueueWriteBuffer,
clEnqueueReadBufferRect, clEnqueueWriteBufferRect, clEnqueueReadImage, or
clEnqueueWriteImage provided the rules described above are obeyed.

 4.4.2.4 Migrating Memory Objects


1. A user may wish to have more explicit control over the location of their memory objects on
creation. This could be used to :
 Ensure that an object is allocated on a specific device prior to usage.
 Preemptively migrate an object from one device to another.
2. To enqueue a command to indicate which device a set of memory objects should be
associated with, the following function is called :
cl_int clEnqueueMigrateMemObjects(
    cl_command_queue command_queue,
    cl_uint num_mem_objects,
    const cl_mem* mem_objects,
    cl_mem_migration_flags flags,
    cl_uint num_events_in_wait_list,
    const cl_event* event_wait_list,
    cl_event* event);
 command_queue is a valid host command-queue. The specified set of memory objects
in mem_objects will be migrated to the OpenCL device associated with
command_queue or to the host if the CL_MIGRATE_MEM_OBJECT_HOST has
been specified.
 num_mem_objects is the number of memory objects specified in mem_objects.
 mem_objects is a pointer to a list of memory objects.
 flags is a bit-field that is used to specify migration options. The Memory Migration
Flags describes the possible values for flags.
 event_wait_list and num_events_in_wait_list specify events that need to complete
before this particular command can be executed. If event_wait_list is NULL, then this
particular command does not wait on any event to complete. If event_wait_list is NULL,
num_events_in_wait_list must be 0. If event_wait_list is not NULL, the list of events
pointed to by event_wait_list must be valid and num_events_in_wait_list must be
greater than 0. The events specified in event_wait_list act as synchronization points. The
context associated with events in event_wait_list and command_queue must be the
same. The memory associated with event_wait_list can be reused or freed after the
function returns.
 event returns an event object that identifies this particular command and can be used to
query or queue a wait for this particular command to complete. event can be NULL in
which case it will not be possible for the application to query the status of this command
or queue a wait for this command to complete. If the event_wait_list and the event
arguments are not NULL, the event argument should not refer to an element of the
event_wait_list array.
Table 4.4.11 : Supported values for cl_mem_migration_flags

cl_mem_migration_flags | Description
CL_MIGRATE_MEM_OBJECT_HOST | This flag indicates that the specified set of memory objects are to be migrated to the host, regardless of the target command-queue.
CL_MIGRATE_MEM_OBJECT_CONTENT_UNDEFINED | This flag indicates that the contents of the set of memory objects are undefined after migration. The specified set of memory objects are migrated to the device associated with command_queue without incurring the overhead of migrating their contents.
3. Typically, memory objects are implicitly migrated to a device for which enqueued
commands, using the memory object, are targeted. clEnqueueMigrateMemObjects allows
this migration to be explicitly performed ahead of the dependent commands. This allows a
user to preemptively change the association of a memory object, through regular command
queue scheduling, in order to prepare for another upcoming command. This also permits an
application to overlap the placement of memory objects with other unrelated operations
before these memory objects are needed potentially hiding transfer latencies. Once the
event, returned from clEnqueueMigrateMemObjects, has been marked CL_COMPLETE
the memory objects specified in mem_objects have been successfully migrated to the
device associated with command_queue. The migrated memory object shall remain resident
on the device until another command is enqueued that either implicitly or explicitly
migrates it away.
clEnqueueMigrateMemObjects can also be used to direct the initial placement of a
memory object, after creation, possibly avoiding the initial overhead of instantiating the
object on the first enqueued command to use it.
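For example, a buffer can be pre-placed on the device behind a given queue as in the hedged sketch below (queue and buf are assumed handles).
// Sketch only : migrate a buffer to the device associated with queue ahead of use.
cl_mem objs[1] = { buf };
clEnqueueMigrateMemObjects(queue, 1, objs,
                           0,                // flags = 0 : migrate to queue's device
                           0, NULL, NULL);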
4. The user is responsible for managing the event dependencies, associated with this
command, in order to avoid overlapping access to memory objects. Improperly specified
event dependencies passed to clEnqueueMigrateMemObjects could result in undefined
results.
clEnqueueMigrateMemObjects returns CL_SUCCESS if the function is executed
successfully. Otherwise, it returns one of the following errors :
 CL_INVALID_COMMAND_QUEUE if command_queue is not a valid host
command-queue.
 CL_INVALID_CONTEXT if the context associated with command_queue and
memory objects in mem_objects are not the same or if the context associated with
command_queue and events in event_wait_list are not the same.
 CL_INVALID_MEM_OBJECT if any of the memory objects in mem_objects is not a
valid memory object.
 CL_INVALID_VALUE if num_mem_objects is zero or if mem_objects is NULL.
 CL_INVALID_VALUE if flags is not 0 or is not any of the values described in Table 4.4.11.
 CL_INVALID_EVENT_WAIT_LIST if event_wait_list is NULL and
num_events_in_wait_list > 0, or event_wait_list is not NULL and
num_events_in_wait_list is 0, or if event objects in event_wait_list are not valid events.
 CL_MEM_OBJECT_ALLOCATION_FAILURE if there is a failure to allocate
memory for the specified set of memory objects in mem_objects.
 CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the
OpenCL implementation on the device.
 CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by
the OpenCL implementation on the host.

 4.4.2.5 Memory Object Queries


1. To get information that is common to all memory objects (buffer and image objects), the
following function is used :
cl_int clGetMemObjectInfo(
    cl_mem memobj,
    cl_mem_info param_name,
    size_t param_value_size,
    void* param_value,
    size_t* param_value_size_ret);
 memobj specifies the memory object being queried.
 param_name specifies the information to query. The list of supported param_name values
and the information returned in param_value by clGetMemObjectInfo is given in Table 4.4.12.
 param_value is a pointer to memory where the appropriate result being queried is
returned. If param_value is NULL, it is ignored.
 param_value_size is used to specify the size in bytes of memory pointed to by
param_value. This size must be ≥ size of return type as described in Table 4.4.12.
 param_value_size_ret returns the actual size in bytes of data being queried by
param_name. If param_value_size_ret is NULL, it is ignored.
Table 4.4.12 : List of supported param_names by clGetMemObjectInfo

cl_mem_info | Return type | Information returned in param_value
CL_MEM_TYPE | cl_mem_object_type | Returns one of the following values : CL_MEM_OBJECT_BUFFER if memobj is created with clCreateBuffer or clCreateSubBuffer; the cl_image_desc.image_type argument value if memobj is created with clCreateImage; CL_MEM_OBJECT_PIPE if memobj is created with clCreatePipe.
CL_MEM_FLAGS | cl_mem_flags | Return the flags argument value specified when memobj is created with clCreateBuffer, clCreateSubBuffer, clCreateImage or clCreatePipe. If memobj is a sub-buffer, the memory access qualifiers inherited from the parent buffer are also returned.
CL_MEM_SIZE | size_t | Return actual size of the data store associated with memobj in bytes.
CL_MEM_HOST_PTR | void * | If memobj is created with clCreateBuffer or clCreateImage and CL_MEM_USE_HOST_PTR is specified in mem_flags, return the host_ptr argument value specified when memobj is created. If memobj is created with clCreateSubBuffer, return the host_ptr + origin value, where host_ptr is the argument value specified to clCreateBuffer and CL_MEM_USE_HOST_PTR is specified in mem_flags for the memory object from which memobj is created. Otherwise a NULL value is returned.
CL_MEM_MAP_COUNT | cl_uint | Map count.
CL_MEM_REFERENCE_COUNT | cl_uint | Return memobj reference count.
CL_MEM_CONTEXT | cl_context | Return context specified when memory object is created. If memobj is created using clCreateSubBuffer, the context associated with the memory object specified as the buffer argument to clCreateSubBuffer is returned.
CL_MEM_ASSOCIATED_MEMOBJECT | cl_mem | Return memory object from which memobj is created. This returns the memory object specified as the buffer argument to clCreateSubBuffer if memobj is a sub-buffer object created using clCreateSubBuffer, or the mem_object specified in cl_image_desc if memobj is an image object. Otherwise a NULL value is returned.
CL_MEM_OFFSET | size_t | Return offset if memobj is a sub-buffer object created using clCreateSubBuffer. This returns 0 if memobj is not a sub-buffer object.
CL_MEM_USES_SVM_POINTER | cl_bool | Return CL_TRUE if memobj is a buffer object that was created with CL_MEM_USE_HOST_PTR, or is a sub-buffer object of a buffer object that was created with CL_MEM_USE_HOST_PTR, and the host_ptr specified when the buffer object was created is an SVM pointer; otherwise returns CL_FALSE.
2. The map count returned should be considered immediately stale. It is unsuitable for general
use in applications. This feature is provided for debugging. The reference count returned
should be considered immediately stale. It is unsuitable for general use in applications. This
feature is provided for identifying memory leaks.
3. clGetMemObjectInfo returns CL_SUCCESS if the function is executed successfully.
Otherwise, it returns one of the following errors :
 CL_INVALID_VALUE if param_name is not valid, or if size in bytes specified by
param_value_size is < size of return type as described in the Memory Object Info table
and param_value is not NULL.
 CL_INVALID_MEM_OBJECT if memobj is a not a valid memory object.
 CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the
OpenCL implementation on the device.
 CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by
the OpenCL implementation on the host.
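For instance, the size and reference count of a memory object (memobj is an assumed handle) can be queried as follows.
// Sketch only : common memory-object queries (memobj is assumed).
size_t mem_size = 0;
cl_uint ref_count = 0;
clGetMemObjectInfo(memobj, CL_MEM_SIZE, sizeof(size_t), &mem_size, NULL);
clGetMemObjectInfo(memobj, CL_MEM_REFERENCE_COUNT, sizeof(cl_uint), &ref_count, NULL);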

 4.5 OpenCL Program and OpenCL Programming Examples

 4.5.1 OpenCL Program


1. An OpenCL program is a collection of OpenCL C kernels, functions called by the kernel,
and constant data. For example, an algebraic solver application could contain a vector
addition kernel, a matrix multiplication kernel, and a matrix transpose kernel within the
same OpenCL program.
2. A person involved in OpenCL programming is expected to be proficient in C programming,
and prior experience with any parallel programming tool is an added advantage. The
programmer should be able to break a large problem apart and identify the data-parallel and
task-parallel regions of the code to be accelerated using OpenCL, and should know the
underlying architecture being targeted. If an existing parallel code is being ported to
OpenCL, then one only needs to learn the OpenCL programming architecture. Besides this, a
programmer should also know basic system-software details, such as compiling the code and
linking it against the appropriate 32-bit or 64-bit library, setting the system path on
Windows to the correct DLLs, or setting the LD_LIBRARY_PATH environment variable on Linux
to the correct shared libraries.
3. An OpenCL code consists of the host code and the device code. The OpenCL kernel code is
compiled at run time and runs on the selected device.
4. OpenCL source code is compiled at runtime through a series of API calls. Runtime
compilation gives the system an opportunity to optimize OpenCL kernels for a specific
compute device. Runtime compilation also enables OpenCL kernel source code to run on a
previously unknown OpenCL compatible compute device.
5. There is no need for an OpenCL application to have been prebuilt against the AMD,
NVIDIA, or Intel runtimes if, for example, it is to run on compute devices produced by all
of these vendors. OpenCL software links only to a common runtime layer called the
installable client driver (ICD); all platform-specific activity is delegated to the
respective vendor runtime through a dynamic library interface.

 4.5.2 The Process of Creating a Kernel from Source Code


i. The OpenCL C source code is stored in a character array. If the source code is stored in a
file on a disk, it must be read into memory and stored as a character array.
ii. The source code is turned into a program object, cl_program, by calling
clCreateProgramWithSource().
iii. The program object is then compiled, for one or more OpenCL devices, with
clBuildProgram(). If there are compile errors, they will be reported here.
iv. A kernel object, cl_kernel, is then created by calling clCreateKernel and specifying the
program object and kernel name.
v. The final step of obtaining a cl_kernel object is similar to obtaining an exported function
from a dynamic library. The name of the kernel that the program exports is used to request
it from the compiled program object. The name of the kernel is passed to clCreateKernel(),
along with the program object, and the kernel object will be returned if the program object
was valid and the particular kernel is found. The relationship between an OpenCL
program and OpenCL kernels is shown in below Fig. 4.5.1. The Fig. 4.5.1 shows the
OpenCL runtime shown denotes an OpenCL context with two compute devices (a CPU
device and a GPU device). Each compute device has its own command-queues. Host-side
and device-side command-queues are shown. The device-side queues are visible only from
kernels executing on the compute device. The memory objects have been defined within
the memory model. In such as case multiple kernels can be extracted from an OpenCL
program. Each context can have multiple OpenCL programs that have been generated
fromOpenCL source code.
cl_kernel clCreateKernel(
    cl_program program,
    const char *kernel_name,
    cl_int *errcode_ret);
Fig. 4.5.1 : OpenCL runtime showing an OpenCL context with two compute devices, a CPU and a GPU

6. The precise binary representation of an OpenCL kernel object is vendor specific. In the
AMD runtime, there are two main classes of devices: x86 CPUs and GPUs. For x86 CPUs,
clBuildProgram() generates x86 instructions that can be directly executed on the device.
For the GPUs, it will create AMD’s GPU intermediate language, a high-level intermediate
language that will be just-in-time compiled for a specific GPU’s architecture later,
generating what is often known as instruction set architecture (ISA) code. NVIDIA uses a
similar approach, calling its intermediate representation parallel thread execution (PTX).
The advantage of using such an intermediate language is to allow the GPU ISA to change
from one device or generation to another in what is still a very rapidly developing
architectural space.
7. An additional feature of the build process is the ability to generate both the final binary
format and various intermediate representations and serialize them (e.g. write them out to
disk). As with most objects, OpenCL provides a function to return information about
program objects, clGetProgramInfo(). One of the flags to this function is
CL_PROGRAM_BINARIES, which returns a vendor-specific set of binary objects
generated by clBuildProgram(). In addition to clCreateProgramWithSource(), OpenCL
provides clCreateProgramWithBinary(), which takes a list of binaries that matches its
device list. The binaries are previously created using clGetProgramInfo().
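A hedged single-device sketch of retrieving the binary of a built program follows; program is assumed to have been built with clBuildProgram() for exactly one device, and error checking is omitted.
// Sketch only : fetch the binary of a built program for a single device.
size_t bin_size = 0;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &bin_size, NULL);
unsigned char *binary = (unsigned char *)malloc(bin_size);
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(unsigned char *), &binary, NULL);
// binary could later be saved to disk and handed to clCreateProgramWithBinary().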
8. Using a binary representation of OpenCL kernels allows OpenCL programs to be
distributed without exposing kernel source code as plain text. Unlike invoking functions in
C programs, one cannot simply call a kernel with a list of arguments. Executing a kernel
requires dispatching it through an enqueue function. Owing to the syntax of C and the fact
that kernel arguments are persistent (and hence there is no need to repeatedly set them to
construct the argument list for such a dispatch), one must specify each kernel argument
individually using clSetKernelArg(). This function takes a kernel object, an index
specifying the argument number, the size of the argument, and a pointer to the argument.
The type information in the kernel parameter list is then used by the runtime to unbox
(similar to casting) the data to its appropriate type as shown below.
cl_int clSetKernelArg(
    cl_kernel kernel,
    cl_uint arg_index,
    size_t arg_size,
    const void *arg_value);
 4.5.3 OpenCL Program Flow


Every OpenCL code consists of the host-side code and the device code. The host code
coordinates and queues the data transfer and kernel execution commands. The device code
executes the kernel code in an array of threads called an NDRange. An OpenCL C host code
performs the following steps :
1. Allocates memory for host buffers and initializes them.
2. Gets platform and device information.
3. Sets up the platform.
4. Gets the devices list and choose the type of device one want to run on.
5. Creates an OpenCL context for the device.
6. Creates a command queue.
7. Creates memory buffers on the device for each vector.
8. Copies the buffers A and B to the device.
9. Creates a program from the kernel source.
10. Builds the program and creates the OpenCL kernel.
11. Sets the arguments of the kernel.
12. Executes the OpenCL kernel on the device.
13. Reads back the memory from the device to the host buffer. This step is optional, one may
want to keep the data resident in the device for further processing.
14. Cleans up and waits for all the commands to complete.
15. Finally releases all OpenCL allocated objects and host buffers.
 4.5.4 Starting Kernel Execution on a Device


i. Enqueuing a command to a device to begin kernel execution is done with a call to
clEnqueueNDRangeKernel().
ii. A command-queue must be specified so the target device is known. The kernel object
identifies the code to be executed.
iii. Four fields are then related to work-item creation. The work_dim parameter specifies the
number of dimensions (one, two, or three) in which work-items will be created.
iv. The global_work_size parameter specifies the number of work-items in each dimension of
the NDRange, and local_work_size specifies the number of work-items in each dimension
of the work-groups.
v. The parameter global_work_offset can be used to provide an offset so that the global IDs
of the work-items do not start at zero.
cl_int clEnqueueNDRangeKernel(
    cl_command_queue command_queue,
    cl_kernel kernel,
    cl_uint work_dim,
    const size_t *global_work_offset,
    const size_t *global_work_size,
    const size_t *local_work_size,
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event);
vi. As with all clEnqueue API calls, an event_wait_list is provided, and for non-NULL values
the runtime will guarantee that all corresponding events will have completed before the
kernel begins execution. Similarly, clEnqueueNDRangeKernel() is asynchronous : it will
return immediately after the command is enqueued in the command-queue and likely
before the kernel has even started execution. If the host needs to wait until the kernel
completes execution, an API call such as clWaitForEvents() or clFinish() can be used to
block host execution until then.
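When the data size is not a multiple of the chosen work-group size, a common pattern is to round the global size up and guard inside the kernel; the sketch below assumes an nElems variable and the cmdQueue and kernel handles used in the example that follows.
// Sketch only : round the global size up to a multiple of the local size.
size_t local = 64;                                     // assumed work-group size
size_t global = ((nElems + local - 1) / local) * local;
cl_int err = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL,
                                    &global, &local, 0, NULL, NULL);
// The kernel would then check get_global_id(0) < nElems before writing.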
 4.5.5 Main Steps to Execute a Simple OpenCL Application


1. Discovering the platform and devices - Before a host can request that a kernel be executed
on a device, a platform and a device or devices must be discovered.
cl_int status; // Used for error checking

// Retrieve the number of platforms
cl_uint numPlatforms = 0;
status = clGetPlatformIDs(0, NULL, &numPlatforms);

// Allocate enough space for each platform
cl_platform_id *platforms = NULL;
platforms = (cl_platform_id*)malloc(numPlatforms*sizeof(cl_platform_id));

// Fill in the platforms
status = clGetPlatformIDs(numPlatforms, platforms, NULL);

// Retrieve the number of devices
cl_uint numDevices = 0;
status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);

// Allocate enough space for each device
cl_device_id *devices;
devices = (cl_device_id*)malloc(numDevices*sizeof(cl_device_id));

// Fill in the devices
status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, numDevices, devices, NULL);
In the complete program listing that follows, it is assumed that the first platform and the
first device found are used, which reduces the number of function calls required and helps
provide clarity and brevity when viewing the source code.
2. Creating a context - Once the device or devices have been discovered, the context can be
configured on the host.
// Create a context that includes all devices
cl_context context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &status);
3. Creating a command-queue per device.
Once the host has decided which devices to work with and a context has been created, one
command-queue needs to be created per device (i.e. each command-queue is associated
with only one device). The host will ask the device to perform work by submitting
commands to the command-queue.
// Only create a command-queue for the first device
cl_command_queue cmdQueue = clCreateCommandQueueWithProperties(context, devices[0], 0, &status);
4. Creating memory objects (buffers) to hold data - Creating a buffer requires supplying the
size of the buffer and a context in which the buffer will be allocated; it is visible to all
devices associated with the context. Optionally, the caller can supply flags that specify that
the data is read only, write only, or read-write. By passing NULL as the fourth argument,
buffer is not initialized at this step.
// Allocate 2 input and one output buffer for the three vectors in the vector addition
cl_mem bufA = clCreateBuffer(context, CL_MEM_READ_ONLY, datasize, NULL, &status);
cl_mem bufB = clCreateBuffer(context, CL_MEM_READ_ONLY, datasize, NULL, &status);
cl_mem bufC = clCreateBuffer(context, CL_MEM_WRITE_ONLY, datasize, NULL, &status);
5. Copying the input data onto the device - The next step is to copy data from a host pointer
to a buffer. The API call takes a command-queue argument, so data will likely be copied
directly to the device. By setting the third argument to CL_TRUE, one can ensure that data
is copied before the API call returns.
// Write data from the input arrays to the buffers
status = clEnqueueWriteBuffer(cmdQueue, bufA, CL_TRUE, 0, datasize, A, 0, NULL, NULL);
status = clEnqueueWriteBuffer(cmdQueue, bufB, CL_TRUE, 0, datasize, B, 0, NULL, NULL);
6. Creating and compiling a program from the OpenCL C source code - The vector addition
kernel source shown later is stored in a character array, programSource, and is used to
create a program object which is then compiled. When the program is compiled, the devices
that the program may target are provided.
// Create a program with source code
cl_program program = clCreateProgramWithSource(context, 1, (const char**)&programSource, NULL, &status);

// Build (compile) the program for the device
status = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);
7. Extracting the kernel from the program - The kernel is created by selecting the desired
function from within the program.
// Create the vector addition kernel
cl_kernel kernel = clCreateKernel(program, "vecadd", &status);
8. Executing the kernel - Once the kernel has been created and data has been initialized, the
buffers are set as arguments to the kernel. A command to execute the kernel can now be
enqueued into the command-queue. Along with the kernel, the command requires
specification of the NDRange configuration.
// Set the kernel arguments
status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
status = clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);

// Define an index space of work-items for execution.
// A work-group size is not required, but can be used.
size_t indexSpaceSize[1], workGroupSize[1];
indexSpaceSize[0] = datasize/sizeof(int);
workGroupSize[0] = 256;

// Enqueue the kernel for execution
status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, indexSpaceSize, workGroupSize, 0, NULL, NULL);
9. Copying output data back to the host - This step reads data back to a pointer on the host.
// Read the device output buffer to the host output array
status = clEnqueueReadBuffer(cmdQueue, bufC, CL_TRUE, 0, datasize, C, 0, NULL, NULL);
10. Releasing the OpenCL resources - Once the kernel has completed execution and the
resulting output has been retrieved from the device, the OpenCL resources that were
allocated can be freed. This is similar to any C or C++ program where memory
allocations, file handles, and other resources are explicitly released by the developer. As
shown below, each OpenCL object has its own API calls to release its resources. The
OpenCL context should be released last since all OpenCL objects such as buffers and
command-queues are bound to a context. This is similar to deleting objects in C++, where
member arrays must be freed before the object itself is freed.
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(cmdQueue);
clReleaseMemObject(bufA);
clReleaseMemObject(bufB);
clReleaseMemObject(bufC);
clReleaseContext(context);
 4.5.6 Example Program - Serial Vector Addition


 The code for a serial C implementation of the vector addition is shown below; it executes
a loop with as many iterations as there are elements to compute. In each iteration of the
loop the corresponding locations in the input arrays are added together and the result is
stored into the output array.
// Perform an element-wise addition of V1 and V2 and store the result in V3.
// There are Nos elements per array.
void vectoradd1(int *V1, int *V2, int *V3, int Nos)
{
    for (int i = 0; i < Nos; ++i)
        V3[i] = V1[i] + V2[i];
}
 For a simple multicore device, either a low-level coarse-grained threading API is used, such
as Win32 or POSIX threads, or a data-parallel model such as OpenMP is used. Writing a
coarse-grained multithreaded version of the same function would require dividing the work
(i.e. loop iterations) between the threads. Because there may be a large number of loop
iterations and the work per iteration is small, it is required to chunk the loop iterations into
a larger granularity, a technique called strip mining. The code for the multithreaded version may take the following form.
// Perform an element-wise addition of V1 and V2 and store the result in V3.
// There are NT elements per array, divided equally among NP threads.
/* Vector addition chunked for coarse-grained parallelism (e.g., POSIX
   threads on a CPU). The input vectors are partitioned among the
   available cores. */
void vectoradd2(int *V3, int *V1, int *V2, int NT, int NP, int tid)
{
    int cnt = NT / NP;   // elements per thread
    for (int i = tid * cnt; i < (tid + 1) * cnt; ++i)
        V3[i] = V1[i] + V2[i];
}

 Analysis of Serial Vector Addition and Points to Note


1. The unit of concurrent execution in OpenCL C is a work-item. Each work-item executes the kernel function body. In the OpenCL version, instead of manually strip-mining the loop, a single iteration of the loop is mapped to a work-item.
2. The OpenCL runtime is instructed to generate as many work-items as there are elements in the input and output arrays, and is allowed to map those work-items to the underlying hardware (CPU or GPU cores) in whatever way it deems appropriate.
3. Conceptually, this is very similar to the parallelism inherent in a functional “map”
operation or a data-parallel for loop in OpenMP.


4. When an OpenCL device begins executing a kernel, it provides intrinsic functions that
allow a work-item to identify itself. In the following code, the call to get_global_id(0)
allows the programmer to make use of the position of the current work-item to access a
unique element in the array.
5. The parameter “0” to the get_global_id() function assumes that a one-dimensional configuration of work-items is specified, and therefore each work-item only needs its ID in the first dimension.
 The Kernel code
// Perform an element-wise addition of V1 and V2 and store the result in V3.
// Nitms work-items will be created to execute this kernel.
// OpenCL vector addition kernel.
__kernel
void vectoradd3(__global int *V3, __global int *V1, __global int *V2)
{
    int thrid = get_global_id(0);   // OpenCL intrinsic function
    V3[thrid] = V1[thrid] + V2[thrid];
}

 Kernel Execution - Points To Note


1. Given that OpenCL describes execution in terms of fine-grained work-items and can dispatch vast numbers of work-items on architectures with hardware support for fine-grained threading, it is natural to have concerns about scalability.
2. The hierarchical concurrency model implemented by OpenCL ensures that scalable
execution can be achieved even while supporting a large number of work-items.
3. When a kernel is executed, the programmer specifies the number of work-items that
should be created as an n-dimensional range (NDRange).
4. An NDRange is a one-, two-, or three-dimensional index space of work-items that will
often map to the dimensions of either the input or the output data.
5. The dimensions of the NDRange are specified as an N-element array of type size_t, where N represents the number of dimensions used to describe the work-items being created.

6. In the vector addition example, data will be one-dimensional and, assuming that there are
1024 elements, the size can be specified by an array of one, two, or three values.
7. The host code to specify a one-dimensional NDRange for 1024 elements may look like the
following :
size_t indexSpace[3] = {1024, 1, 1};
8. Achieving scalability comes from dividing the work-items of an NDRange into smaller,
equally sized work-groups (Fig. 4.5.1). An index space with N dimensions requires work-
groups to be specified using the same N dimensions; thus, a three dimensional index space
requires three-dimensional work-groups.
9. Work-items within a work-group have a special relationship with one another - they can
perform barrier operations to synchronize and they have access to a shared memory
address space.
10. A work-group’s size is fixed per dispatch, and so communication costs between work
items do not increase for a larger dispatch.
11. The fact that the communication cost between work-items is not dependent on the size of
the dispatch allows OpenCL implementations to maintain scalability for larger dispatches.

Fig. 4.5.1 : Hierarchical model for creating an NDRange of work-items grouped into work-groups

12. For the vector addition example, the work-group size might be specified as
size_t workgroupSize[3] = {64, 1, 1};
13. If the total number of work-items per array is 1024, this results in the creation of 16 work-groups (1024 work-items / (64 work-items per work-group) = 16 work-groups).
14. For hardware efficiency, the work-group size is usually fixed to a favorable size.


15. In previous versions of the OpenCL specification, the index space dimensions would have to be rounded up to be a multiple of the work-group dimensions. The kernel code then has to specify that the extra work-items in each dimension simply return immediately without outputting any data.
16. However, the OpenCL 2.0 specification allows each dimension of the index space that is
not evenly divisible by the work-group size to be divided into two regions - one region
where the number of work-items per work-group is as specified by the programmer, and
another region of remainder work-groups which have fewer work-items.
17. As the work-group sizes can be non-uniform in multiple dimensions, there are up to four different sizes possible for a two-dimensional NDRange, and up to eight different sizes for a three-dimensional NDRange.
18. For programs such as vector addition in which work-items behave independently (even within a work-group), OpenCL allows the work-group size to be ignored by the programmer altogether and generated automatically by the implementation; in such cases, a NULL value can be passed in place of the work-group size, as shown below.
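A minimal sketch of this, reusing the status, cmdQueue, kernel, and indexSpaceSize names from the vector-addition host code earlier in this chapter :
// Passing NULL for the local work-group size lets the OpenCL runtime
// choose the work-group dimensions itself.
status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL,
                                indexSpaceSize, NULL, 0, NULL, NULL);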

 4.6 Part A : Short Answered Questions [2 Marks Each]


Q.1 What is OpenCL ?
 Answer : Open Computing Language is a framework for writing programs that execute
across heterogeneous platforms, consisting of CPUs, GPUs, DSPs, and FPGAs. OpenCL
specifies a programming language (based on C99) for programming these devices and
application programming interfaces (APIs) to control the platform and execute programs on
the compute devices. OpenCL provides a standard interface for parallel computing using task-
based and data-based parallelism.
Q.2 What is kernel ?
 Answer : A kernel is a small unit of execution that performs a clearly defined function and
that can be executed in parallel. Such a kernel can be executed on each element of an input
stream (also termed as NDRange) or simply at each point in an arbitrary index space. A kernel
is analogous to, and on some devices identical to, what graphics programmers call a shader program.
Q.3 What is context in OpenCL ?
 Answer : In OpenCL, a context is an abstract environment within which coordination and
memory management for kernel execution is valid and well defined. A context coordinates the
mechanisms for host-device interaction, manages the memory objects available to the devices,
and keeps track of the programs and kernels that are created for each device.


Q.4 What is global memory ?


 Answer : Global memory is shared by the host and all devices, but it is slow. It is persistent between kernel calls. Global memory is defined as the region of device memory that is accessible to both the OpenCL host and device. It permits read / write access to the host processor as well as to all compute units in the device. The host is responsible for the allocation
and de-allocation of buffers in this memory space. There is a handshake between host and
device over control of the data stored in this memory. The host processor transfers data from
the host memory space into the global memory space. Then, once a kernel is launched to
process the data, the host loses access rights to the buffer in global memory. The device takes
over and is capable of reading and writing from the global memory until the kernel execution
is complete. Upon completion of the operations associated with a kernel, the device turns
control of the global memory buffer back to the host processor.
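A minimal host-side sketch of this handshake (context, queue, datasize, and hostData are assumed to exist already; the names are illustrative) :
// Allocate a buffer in the device's global memory and copy host data into it
// before a kernel is launched.
cl_int err;
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, datasize, NULL, &err);
err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, datasize, hostData,
                           0, NULL, NULL);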
Q.5 What is pipe memory object ?
 Answer : The pipe memory object conceptually is an ordered sequence of data items. A
pipe has two endpoints: a write endpoint into which data items are inserted, and a read
endpoint from which data items are removed. At any one time, only one kernel instance may
write into a pipe, and only one kernel instance may read from a pipe. To support the producer-consumer design pattern, one kernel instance connects to the write endpoint (the producer) while another kernel instance connects to the read endpoint (the consumer).
A pipe is a memory object that stores data organized as a FIFO. Pipe objects can only be
accessed using built-in functions that read from and write to a pipe. Pipe objects are not
accessible from the host. A pipe object encapsulates the following information :
o Packet size in bytes
o Maximum capacity in packets
o Information about the number of packets currently in the pipe
o Data packets
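A minimal OpenCL 2.0 kernel-side sketch of this producer-consumer usage (the kernel names, the int packet type, and the buffer names are illustrative only; the pipe itself is created on the host with clCreatePipe) :
// Producer : each work-item inserts one packet at the pipe's write endpoint.
__kernel void producer(__global const int *src, write_only pipe int p)
{
    int gid = get_global_id(0);
    int value = src[gid];
    write_pipe(p, &value);
}

// Consumer : each work-item removes one packet from the pipe's read endpoint.
__kernel void consumer(read_only pipe int p, __global int *dst)
{
    int gid = get_global_id(0);
    int value;
    if (read_pipe(p, &value) == 0)   // 0 indicates a successful read
        dst[gid] = value;
}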

 4.7 Part B : Long Answered Questions


Q.1 What are various OpenCL components ? (Refer section 4.1.4)
Q.2 Explain OpenCL architecture. (Refer section 4.3)
Q.3 Explain kernel programming model. (Refer section 4.3.3)
Q.4 In brief explain OpenCL memory hierarchy. (Refer section 4.3.4)
Q.5 Explain memory object ‘Buffer’. (Refer section 4.4.1)




UNIT - V

5 Algorithms on GPU

Syllabus

Parallel Patterns : Convolution, Prefix Sum, Sparse Matrix - Matrix Multiplication -


Programming Heterogeneous Cluster.

Contents

5.1 Concept of Parallelism

5.2 GPU Stream Types

5.3 GPU Parallel Algorithms

5.4 Heterogeneous Cluster

5.5 Part A : Short Answered Questions [2 Marks Each]

5.6 Part B : Long Answered Questions


 5.1 Concept of Parallelism

 5.1.1 Parallel Computations / Concurrency


1. The foremost aspect of concurrency is to think about the particular problem, without regard
for any implementation, and consider what aspects of it could run in parallel. If possible
one should think of a formula that represents each output point as some function of the
input data. This is a lengthy process for some algorithms, for example, those algorithms
with loops. For these, consider each step or iteration individually. Can the data points for
the step be represented as a transformation of the input dataset ? If so, then one can simply
have a set of kernels (steps) that run in sequence. These can simply be pushed into a queue
(or stream) that the hardware will schedule sequentially.
2. A significant number of problems are known as “embarrassingly parallel,” a term that rather underplays what is being achieved. If one can construct a formula in which each output data point can be computed without reference to any other (for example, a matrix multiplication), then the problem is very satisfying to parallelize.
3. These types of problems can be implemented extremely well on GPUs and are easy to code. It is also helpful if one or more steps of the algorithm can be represented in this way but perhaps one stage cannot. That single stage may turn out to be a bottleneck and may require a little thought, but the rest of the problem will usually be quite easy to code on a GPU.
4. If the problem requires every data point to know about the values of its surrounding neighbors, then the speedup will ultimately be limited. In such cases, adding more processors helps only up to a point; beyond it, the computation slows down because the processors (or threads) spend more time sharing data than doing any useful work. The point at which this happens depends largely on the amount and cost of the communication overhead.
5. Consider a military analogy : to perform some action, central command (the kernel / host program) must provide some action plus some data. Each soldier (thread) works on his or her individual part of the problem. Threads may from time to time swap data with one another under the coordination of either the sergeant (the warp) or the lieutenant (the block). However, any coordination with other units (blocks) has to be performed by central command (the kernel / host program).
6. The emergence of “massively parallel” many-core processors has inspired interest in
algorithms with abundant parallelism. Modern GPU architectures, which accommodate tens
of thousands of concurrent threads, are at the forefront of this trend towards massively
parallel throughput-oriented execution. While such architectures offer higher absolute
performance, in terms of theoretical peak FLOPs and bandwidth, than contemporary

(latency-oriented) CPUs, existing algorithms need to be reformulated to make effective use


of the GPU.
7. Modern GPUs are organized into tens of multiprocessors, each of which is capable of
executing hundreds of hardware-scheduled threads. Warps of threads represent the finest
granularity of scheduled computational units on each multiprocessor with the number of
threads per warp defined by the underlying hardware. Execution across a warp of threads
follows a data parallel SIMD (Single Instruction, Multiple Data) model and performance
penalties occur when this model is violated as happens when threads within a warp follow
separate streams of execution (i.e., divergence) or when atomic operations are executed in
order ( i.e., serialization). Warps within each multiprocessor are grouped into a hierarchy of
fixed-size execution units known as blocks or Cooperative Thread Arrays (CTAs);
intra-CTA computation and communication may be routed through a shared memory region
accessible by all threads within the CTA.
8. At the next level in the hierarchy CTAs are grouped into grids and grids are launched by a
host thread with instructions encapsulated in a specialized GPU programming construct
known as a kernel.
9. GPUs sacrifice serial performance of single thread tasks to increase the overall throughput
of parallel workloads. Effective use of the GPU depends on four key features :
(i) An abundance of fine-grained parallelism
(ii) Uniform work distribution
(iii) High arithmetic intensity
(iv) Regularly-structured memory access patterns.
10. Workloads that do not have these characteristics often do not fully utilize the available
computational resources and represent an opportunity for further optimization.

 5.1.2 Types of Parallelism

 5.1.2.1 Task-based Parallelism


1. If a typical operating system is considered, it can be seen that it exploits a type of parallelism called task parallelism.
2. The processes are diverse and unrelated. A user might be reading an article on a website
while playing music from his or her music library in the background. More than one CPU
core can be exploited by running each application on a different core.
3. In terms of parallel programming, this can be exploited by writing a program as a number
of sections that “pipe” (send via messages) the information from one application to another.
The Linux pipe operator (the | symbol) does just this, via the operating system. The output


of one program, such as grep, is the input of the next, such as sort. Thus, a set of input files
can be easily scanned for a certain set of characters (the grep program) and that output set
then sorted (the sort program). Each program can be scheduled to a separate CPU core.
This pattern of parallelism is known as pipeline parallelism.
4. In pipeline parallelism the output of one program provides the input for the next. With a
diverse set of components, such as the various text-based tools in Linux, a huge variety of
useful functions can be performed by the user. As the programmer cannot know at the
outset everyone’s needs, by providing components that operate together and can be
connected easily, the programmer can target a very wide and diverse user base.
5. This type of parallelism is very much geared toward coarse-grained parallelism. That is,
there are a number of powerful processors, each of which can perform a significant chunk
of work. In terms of GPUs one can see coarse-grained parallelism only in terms of a GPU
card and the execution of GPU kernels. GPUs support the pipeline parallelism pattern in
two ways.
6. First, kernels can be pushed into a single stream and separate streams executed
concurrently. Second, multiple GPUs can work together directly through either passing data
via the host or passing data via messages directly to one another over the PCI-E bus.
7. The second approach, the peer-to-peer (P2P) mechanism, was introduced in the CUDA 4.x
SDK and requires certain OS/hardware/driver-level support.
8. One of the issues with a pipeline-based pattern is, like any production line, it can only run
as fast as the slowest component. Thus, if the pipeline consists of five elements, each of
which takes one second, one can produce one output per second. However, if just one of
these elements takes two seconds, the throughput of the entire pipeline is reduced to one
output every two seconds.
9. The approach to solving this is twofold. The production line analogy can be considered for
a moment. Anil’s station takes two seconds because his task is complex. If Anil is provided
with an assistant, Sachin, and split his task in half with Sachin, then task is back to one
second per stage. Now there are six stages instead of five, but the throughput of the pipeline
is now again one widget per second.
10. One can put four GPUs into a desktop PC with some thought and care about the design.
Thus, if there is a single GPU and it’s taking too long to process a particular workflow,
one can simply add another one and increase the overall processing power of the node.
However, in such a case one has to think about the division of work between the two
GPUs.


11. There may not be an easy 50/50 split. If one can only extract a 70 / 30 split, clearly the maximum benefit will be 7/10 (70 %) of the existing runtime. If another GPU is introduced and another task, which occupied say 20 % of the time, is moved onto it, the result is a 50/30/20 split. Again the time compared to one GPU would be 1/2, or 50 % of the original; the worst-case stage still dominates the overall execution time for the process.
12. A similar problem arises when a single CPU / GPU combination is used. If 80 % of the work is moved off the CPU and onto the GPU, with the GPU computing this in just 10 % of the original time, then the CPU takes 20 % of the original time and the GPU 10 % of the original time, but in parallel. Thus, the dominating factor is still the CPU. As the GPU runs in parallel and consumes less time than the CPU fraction, its time can be discounted entirely. Thus, the maximum speedup is one divided by the fraction of the program that takes the longest time to execute.
This is known as Amdahl's law and is often quoted as the limiting factor in any speedup. It allows one to know at the outset what the maximum achievable speedup is, without writing a single line of code. Ultimately, there will always be serial operations. Even if everything is moved onto the GPU, the CPU still has to be used to load and store data to and from storage devices, and the data needs to be transferred to and from the GPU to facilitate input and output (I/O). Thus, the maximum theoretical speedup is determined by the fraction of the program that performs the computation / algorithmic part, plus the remaining serial fraction of the execution; a small sketch of this reasoning follows.
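A minimal sketch of Amdahl's law as described above (the parallel fraction p and processor count n are illustrative values only) :
#include <stdio.h>

/* Amdahl's law : if a fraction p of the runtime can be parallelized across
   n processors, the overall speedup is 1 / ((1 - p) + p / n). */
int main(void)
{
    double p = 0.80;   /* fraction of the runtime moved to the GPU */
    double n = 8.0;    /* effective number of parallel processors  */
    double speedup = 1.0 / ((1.0 - p) + p / n);
    printf("Speedup : %.2fx (upper limit as n grows : %.2fx)\n",
           speedup, 1.0 / (1.0 - p));
    return 0;
}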

 5.1.2.2 Data-based Parallelism


1. Computational power has grown drastically, and there are now teraflop-capable GPUs in the industry. Although compute capability has progressed rapidly, data access times have not improved at anywhere near the same rate.
2. Data-based parallelism concentrates on the transformation of the data rather than merely on accessing it, whereas task-based parallelism fits more naturally with coarse-grained parallelism approaches.
3. Consider an example of performing four different transformations on four separate,
unrelated, and similarly sized data arrays. Suppose there are four CPU cores, and a GPU
with four SMs. In a task-based decomposition of the problem, one would assign one array
to each of the CPU cores or SMs in the GPU. The parallel decomposition of the problem is
driven by thinking about the tasks or transformations, not the data. On the CPU side one
could create four threads or processes to achieve this. On the GPU side one would need to


use four blocks and pass the address of every array to every block. On the newer Fermi and
Kepler devices, one could also create four separate kernels, one to process each array and
run it concurrently. A data-based decomposition would instead split the first array into four
blocks and assign one CPU core or one GPU SM to each section of the array. Once
completed, the remaining three arrays would be processed in a similar way. In terms of the
GPU implementation, this would be four kernels, each of which contained four or more
blocks. The parallel decomposition here is driven by thinking about the data first and the
transformations second. As the considered CPU has only four cores, it makes a lot of sense
to decompose the data into four blocks. There can be thread 0 process element 0, thread 1
process element 1, thread 2 process element 2, thread 3 process element 3, and so on.
Alternatively, the array could be split into four parts and each thread could start processing
its section of the array. In the first case, thread 0 fetches element 0. As CPUs contain
multiple levels of cache, this brings the data into the device. Typically the L3 cache is
shared by all cores. Thus, the memory access from the first fetch is distributed to all cores
in the CPU. By contrast in the second case, four separate memory fetches are needed and
four separate L3 cache lines are utilized. The second approach is often better where the
CPU cores need to write data back to memory. Interleaving the data elements by core
means the cache has to coordinate and combine the writes from different cores, which is
usually a bad idea.
4. If the algorithm permits, one can exploit a certain type of data parallelism, the SIMD (Single Instruction, Multiple Data) model. This makes use of special SIMD instructions such as MMX, SSE, and AVX present in many x86-based CPUs. Thus, thread 0 could actually fetch multiple adjacent elements and process them with a single SIMD instruction (a minimal SSE sketch is given after this list).
5. If the same problem is considered on the GPU, each array needs to have a separate transformation performed on it. This naturally maps such that one transformation equates to a single GPU kernel (or program). Each SM, unlike a CPU core, is designed to run multiple blocks of data, with each block split into multiple threads. Thus, a further level of decomposition is needed to use the GPU efficiently.
6. One can typically allocate, at least initially, a combination of blocks and threads such that a single thread processes a single element of data. As with the CPU, there are benefits from processing multiple elements per thread. This is somewhat limited on GPUs, as only load / store / move explicit SIMD primitives are supported, but it in turn allows for enhanced levels of Instruction-Level Parallelism (ILP), where instructions can be split to run concurrently.
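A minimal sketch of the SIMD idea from point 4, using SSE2 intrinsics (the function name is illustrative; n is assumed to be a multiple of 4) :
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Add four adjacent 32-bit integers with each SIMD instruction. */
void vectoradd_sse(const int *a, const int *b, int *c, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        __m128i va = _mm_loadu_si128((const __m128i *)&a[i]);
        __m128i vb = _mm_loadu_si128((const __m128i *)&b[i]);
        _mm_storeu_si128((__m128i *)&c[i], _mm_add_epi32(va, vb));
    }
}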

 5.1.3 Common Parallel Patterns


There are various parallel problems that can be thought of as patterns. Thinking in terms of patterns allows one to broadly deconstruct or abstract a problem, and therefore to think more easily about how to solve it.

 5.1.3.1 Loop-based Patterns


1. Loops are familiar iterative programming constructs that vary primarily in terms of entry and exit conditions (for, do-while, while), and in whether or not they create dependencies between loop iterations.
2. A loop-based iteration dependency is where one iteration of the loop depends on one or
more previous iterations. One needs to remove these if at all possible as they make
implementing parallel algorithms more difficult. If in fact this can not be done, the loop is
typically broken into a number of blocks that are executed in parallel. The result from block
0 is then retrospectively applied to block 1, then to block 2, and so on.
3. Loop-based iteration is one of the easiest patterns to parallelize. With inter-loop
dependencies removed, it is then simply a matter of deciding how to split, or partition, the
work between the available processors. This should be done with a view to minimizing
communication between processors and maximizing the use of on-chip resources (registers
and shared memory on a GPU; L1 / L2 / L3 cache on a CPU). Communication overhead
typically scales badly and is often the bottleneck in poorly designed systems.
4. The macro-level decomposition should be based on the number of logical processing units
available. For the CPU, this is simply the number of logical hardware threads available. For
the GPU, this is the number of SMs multiplied by the maximum load that can be given to
each SM, 1 to 16 blocks depending on resource usage and GPU model. Notice that the term used here is logical, not physical, hardware thread.
5. Some Intel CPUs in particular support more than one logical thread per physical CPU core,
so-called hyper-threading. GPUs run multiple blocks on a single SM, so one has to at least
multiply the number of SMs by the maximum number of blocks each SM can support.
6. Using more than one thread per physical device maximizes the throughput of such devices,
in terms of giving them something to do while they may be waiting on either a memory
fetch or I/O-type operation. Selecting some multiple of this minimum number can also be
useful in terms of load balancing on the GPU and allows for improvements when new
GPUs are released. This is particularly the case when the partition of the data would
generate an uneven workload, where some blocks take much longer than others. In this
case, using many times the number of SMs as the basis of the partitioning of the data allows
slack SMs to take work from a pool of available blocks.


7. However, on the CPU side, over subscribing the number of threads tends to lead to poor
performance. This is largely due to context switching being performed in software by the
OS. Increased contention for the cache and memory bandwidth also contributes
significantly if one tries to run too many threads. Thus, an existing multicore CPU solution,
taken as is, typically has far too large a granularity for a GPU.
8. There is almost always a need to have repartitioning of the data into many smaller blocks to
solve the same problem on the GPU.
9. When considering loop parallelism and porting an existing serial implementation, be critically aware of hidden dependencies. Look carefully at the loop to ensure one iteration does not calculate a value used later. Be wary of loops that count down, as opposed to the standard zero-to-max construct, which is the most common type of loop found; ask why the loop counts backward. It is likely that there is some dependency in the loop, and parallelizing it without understanding the dependencies will likely break it.
10. Consider also loops that contain an inner loop and one or more outer loops. For concurrent execution on a CPU the approach would be to parallelize only the outer loop, as there are only a limited number of threads. This works well, but as discussed it depends on there being no loop iteration dependencies.
11. On the GPU the inner loop, provided it is small, is typically implemented by threads within a single block. As the loop iterations are grouped, adjacent threads usually access adjacent memory locations. This often allows one to exploit locality, something very important in CUDA programming. Any outer loop(s) are then implemented as blocks of threads.
12. It should also be considered that most loops can be flattened, reducing an inner and an outer loop to a single loop. Consider an example of an image processing algorithm that iterates along the X pixel axis in the inner loop and the Y pixel axis in the outer loop. It is possible to flatten this loop by treating all pixels as a single one-dimensional array, as sketched below. This requires a little more thought on the programming side, but it is useful if one or more loops contain a very small number of iterations. Such small loops present considerable loop overhead compared to the work done per iteration and are thus typically not efficient.
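A minimal sketch of flattening the nested X/Y pixel loops from point 12 into a single loop (here simply inverting an 8-bit grayscale image; the names are illustrative) :
/* Flattened loop : a single index covers every pixel of the image. */
void invert_image(unsigned char *pixels, int width, int height)
{
    for (int idx = 0; idx < width * height; ++idx)
    {
        int x = idx % width;   /* column, if still needed */
        int y = idx / width;   /* row, if still needed    */
        pixels[y * width + x] = 255 - pixels[y * width + x];
    }
}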

 5.1.3.2 Fork / Join Pattern


1. The fork / join pattern is a common pattern in serial programming where there are synchronization points and only certain aspects of the program are parallel. The serial code runs and at some point reaches a section where the work can be distributed to P processors in some manner. It then “forks” or spawns N threads / processes that perform the calculation in parallel. These threads then execute independently and finally converge, or join, once all the calculations are complete.

2. This is typically the approach found in OpenMP, where one can define a parallel region with pragma statements. The code then splits into N threads and later converges to a single thread again (a minimal sketch is given after this list).
3. The fork / join pattern is typically implemented with static partitioning of the data. That is,
the serial code will launch N threads and divide the dataset equally between the N threads.
If each packet of data takes the same time to process, then this works well. However, as the
overall time to execute is the time of the slowest thread, giving one thread too much work
means it becomes the single factor determining the total time.
4. Systems such as OpenMP also have dynamic scheduling allocation, which mirrors the
approach taken by GPUs. Here a thread pool is created (a block pool for GPUs) and only
once one task is completed is more work allocated. Thus, if 1 task takes 10x time and 20
tasks take just 1x time each, they are allocated only to free cores. With a dual-core CPU,
core 1 gets the big 10x task and five of the smaller 1x tasks. Core 2 gets 15 of the smaller
1x tasks, and therefore both CPU core 1 and 2 complete around the same time.
5. Consider an example in which only three threads are forked even though there are six data items in the queue; note that one thread per data item is not created. The reality is that in most problems there can actually be millions of data items, and attempting to fork a million threads will cause almost all OSs to fail in one way or another.
6. Typically an OS will apply a “fair” scheduling policy. Thus, each of the million threads
would need to be processed in turn by one of perhaps four available processor cores. Each
thread also requires its own memory space. In Windows a thread can come with a 1 MB
stack allocation, meaning that one can rapidly run out of memory prior to being able to fork
enough threads.
7. Therefore on CPUs, programmers and many multithreaded libraries will typically use the number of logical processor threads available as the number of processes to fork. As CPU threads are also expensive to create and destroy, and to limit maximum utilization, a thread pool of workers is often used that then fetches work from a queue of possible tasks.
8. On GPUs there is the opposite problem : thousands or tens of thousands of threads are needed. The thread pool concept found in more advanced CPU schedulers applies here as well, except it is more like a block pool than a thread pool.
9. The GPU has an upper limit on the number of concurrent blocks it can execute. Each block
contains a number of threads. Both the number of threads per block and the overall number
of concurrently running blocks vary by GPU generation.

10. The fork / join pattern is often used when there is an unknown amount of concurrency in a
problem. Traversing a tree structure or a path exploration type algorithm may spawn
(fork) additional threads when it encounters another node or path. When the path has been
fully explored these threads may then join back into the pool of threads or simply
complete to be re-spawned later.
11. This pattern is not natively supported on a GPU, as it uses a fixed number of
blocks / threads at kernel launch time. Additional blocks cannot be launched by the kernel,
only the host program. Thus, such algorithms on the GPU side are typically implemented
as a series of GPU kernel launches, each of which needs to generate the next state. An
alternative is to coordinate or signal the host and have it launch additional, concurrent
kernels. Neither solution works particularly well, as GPUs are designed for a static
amount of concurrency. Kepler introduces a concept, dynamic parallelism, which
addresses this issue.
12. Within a block of threads on a GPU there are a number of methods to communicate between threads and to coordinate a certain amount of problem growth or varying levels of concurrency within a kernel.
13. For example, if there is an 8 * 8 matrix one might have many places where just 64 threads
are active. However, there may be others where 256 threads can be used. One can launch
256 threads and leave most of them idle until such time as needed. Such idle threads
occupy resources and may limit the overall throughput, but do not consume any execution
time on the GPU whilst idle. This allows the use of shared memory, fast memory close to
the processor, rather than creating a number of distinct steps that need to be synchronized
by using the much slower global memory and multiple kernel launches.
14. Ultimately it should be noted that the later-generation GPUs support fast atomic
operations and synchronization primitives that communicate data between threads in
addition to simply synchronizing.
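A minimal fork / join sketch using OpenMP, as referred to in point 2 of this list (the loop bound and reduction are illustrative only) :
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int sum = 0;

    /* Fork : the parallel region splits execution into a team of threads. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000; ++i)
        sum += i;
    /* Join : all threads synchronize at the end of the parallel region. */

    printf("sum = %d\n", sum);
    return 0;
}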

 5.1.4 Data Parallelism Versus Task Parallelism


1. Data parallelism is a way of performing parallel execution of an application on multiple
processors. It focuses on distributing data across different nodes in the parallel execution
environment and enabling simultaneous sub-computations on these distributed data across
the different compute nodes.
2. This is typically achieved in SIMD mode (Single Instruction, Multiple Data mode) and can
either have a single controller controlling the parallel data operations or multiple threads
working in the same way on the individual compute nodes (SPMD).
3. In contrast, task parallelism focuses on distributing parallel execution threads across parallel computing nodes. These threads may execute the same or different code. The threads exchange messages either through shared memory or explicit communication

messages, as per the parallel algorithm. In the most general case, each of the threads of a
Task-Parallel system can be doing completely different tasks but coordinating to solve a
specific problem. In the most simplistic case, all threads can be executing the same program
and differentiating based on their node-id's to perform any variation in task-responsibility.
Most common Task-Parallel algorithms follow the Master-Worker model, where there is a
single master and multiple workers. The master distributes the computation to different
workers based on scheduling rules and other task-allocation strategies.
4. MapReduce falls under the category of data parallel SPMD architectures.

 5.2 GPU Stream Types


GPU memory has a number of usage restrictions and is accessible only via the abstractions
of a graphics programming interface. Each of these abstractions can be thought of as a
different stream type with its own set of access rules. The three types of streams visible to the
GPU programmer are vertex streams, frame-buffer streams, and texture streams. A fourth
stream type, fragment streams, is produced and consumed entirely within the GPU.

 5.2.1 Vertex Streams


1. Vertex streams are specified as vertex buffers via the graphics API. These streams hold
vertex positions and a variety of per-vertex attributes. These attributes have traditionally
been used for texture coordinates, colors, normals, and so on, but they can be used for
arbitrary input stream data for vertex programs.
2. Vertex programs are not allowed to randomly index into their input vertices. Until recently,
vertex streams could be updated only by transferring data from the CPU to the GPU. The
GPU was not allowed to write to vertex streams.
3. Recent API enhancements, however, have made it possible for the GPU to write to vertex
streams. This is accomplished by either "copy-to-vertex-buffer" or "render-to-vertex-
buffer." In the “copy-to-vertex-buffer” technique, rendering results are copied from a frame
buffer to a vertex buffer; in the “render-to-vertex-buffer” technique, the rendering results
are written directly to a vertex buffer.
4. The new addition of GPU-writable vertex streams enables GPUs, for the first time, to loop
stream results from the end to the beginning of the pipeline.

 5.2.2 Fragment Streams


1. Fragment streams are generated by the rasterizer and consumed by the fragment processor.
They are the stream inputs to fragment programs, but they are not directly accessible to
programmers because they are created and consumed entirely within the graphics
processor.


2. Fragment stream values include all of the interpolated outputs from the vertex processor :
position, color, texture coordinates, and so on. As with per-vertex stream attributes, the
per-fragment values that have traditionally been used for texture coordinates may now be
used for any stream value required by the fragment program.
3. Fragment streams cannot be randomly accessed by fragment programs. Permitting random
access to the fragment stream would create a dependency between fragment stream
elements, thus breaking the data-parallel guarantee of the programming model. If random
access to fragment streams is needed by algorithms, the stream must first be saved to
memory and converted to a texture stream.
 5.2.3 Frame-Buffer Streams
1. Frame-buffer streams are written by the fragment processor. They have traditionally been
used to hold pixels for display to the screen. Streaming GPU computation, however, uses
frame buffers to hold the results of intermediate computation stages.
2. In addition, modern GPUs are able to write to multiple frame-buffer surfaces (that is,
multiple RGBA buffers) simultaneously. Current GPUs can write up to 16 floating-point
scalar values per render pass (this value is expected to increase in future hardware).
3. Frame-buffer streams cannot be randomly accessed by either fragment or vertex programs.
They can, however, be directly read from or written to by the CPU via the graphics API.
Lastly, recent API enhancements have begun to blur the distinction between frame buffers,
vertex buffers, and textures by allowing a render pass to be directly written to any of these
stream types.

 5.2.4 Texture Streams


1. Textures are the only GPU memory that is randomly accessible by fragment programs and,
for Vertex Shader 3.0 GPUs, vertex programs. If programmers need to randomly index into
a vertex, fragment, or frame-buffer stream, they must first convert it to a texture.
2. Textures can be read from and written to by either the CPU or the GPU. The GPU writes to
textures either by rendering directly to them instead of to a frame buffer or by copying data
from a frame buffer to texture memory.
3. Textures are declared as 1D, 2D, or 3D streams and addressed with a 1D, 2D, or 3D
address, respectively. A texture can also be declared as a cube map, which can be treated as
an array of six 2D textures.

 5.2.5 GPU Kernel Memory Access


Vertex and fragment programs (kernels) are the workhorses of modern GPUs. Vertex
programs operate on vertex stream elements and send output to the rasterizer. Fragment
programs operate on fragment streams and write output to frame buffers. The capabilities of

these programs are defined by the arithmetic operations they can perform and the memory they
are permitted to access. The variety of available arithmetic operations permitted in GPU
kernels is approaching those available on CPUs, yet there are numerous memory access
restrictions. As described previously, many of these restrictions are in place to preserve the
parallelism required for GPUs to maintain their speed advantage. Other restrictions, however,
are artifacts of the evolving GPU architecture and will almost certainly be relaxed in future
generations.

 5.3 GPU Parallel Algorithms

 5.3.1 Convolutions

 5.3.1.1 Convolutions Fundamentals


1. The convolution operation is a mathematical operation which depicts a rule of how to
combine two functions or pieces of information to form a third function. The feature map
(or input data) and the kernel are combined to form a transformed feature map. The
convolution algorithm is often interpreted as a filter, where the kernel filters the feature
map for certain information. A kernel, for example, might filter for edges and discard other
information. The inverse of the convolution operation is called deconvolution.
2. In image processing, convolution is a commonly used algorithm that modifies the value of
each pixel in an image by using information from neighboring pixels. A convolution kernel,
or filter, describes how each pixel will be influenced by its neighbors. For example, a
blurring filter will take the weighted average of neighboring pixels so that large differences
between pixel values are reduced. By using the same source image and changing only the
filter, one can produce effects such as sharpening, blurring, edge enhancing, and
embossing.

 5.3.1.2 Mathematical Foundation for Convolution


1. The mathematical definition of the convolution of two functions f and x over a variable t is :

y(t) = (f * x)(t) = ∫_{–∞}^{+∞} f(k) x(t – k) dk

where the symbol * denotes convolution.


2. Linear time-invariant (LTI) systems are widely used in applications related to signal
processing. LTI systems are both linear (output for a combination of inputs is the same as a
combination of the outputs for the individual inputs) and time invariant (output is not
dependent on the time when an input is applied). For an LTI system, the output signal is the
convolution of the input signal with the impulse response function of the system.

3. Convolution of two functions is an important mathematical operation that has found heavy application in signal processing. In the computer graphics and image processing fields, discrete functions (e.g. an image) are used and a discrete form of the convolution is applied to remove high-frequency noise, sharpen details, detect edges, or otherwise modulate the frequency domain of the image. A general 2D convolution has a high bandwidth requirement, as the final value of a given pixel is determined by several neighboring pixels. Fig. 5.3.1 depicts the blurring effect of convolution; a serial sketch of discrete 1-D convolution follows the figure.

Fig. 5.3.1 : Convolution - Blurred effect
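For discrete signals the integral above becomes a sum. A minimal serial C sketch of discrete 1-D convolution (array and parameter names are illustrative; out-of-range input samples are treated as zero) :
/* y[n] = sum over k of f[k] * x[n - k], for n = 0 .. xLen + fLen - 2. */
void convolve1d(const float *x, int xLen,
                const float *f, int fLen,
                float *y /* length xLen + fLen - 1 */)
{
    for (int n = 0; n < xLen + fLen - 1; ++n)
    {
        float sum = 0.0f;
        for (int k = 0; k < fLen; ++k)
        {
            int idx = n - k;
            if (idx >= 0 && idx < xLen)
                sum += f[k] * x[idx];
        }
        y[n] = sum;
    }
}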

 5.3.1.3 Convolution Operation Example

Fig. 5.3.2 : Convolution using a 3 × 3 kernel


Fig. 5.3.2 above depicts convolution using a small 3 × 3 kernel. The filter is defined as a matrix, where the central item weights the center pixel and the other items define the weights of the neighbor pixels. It can be said that the radius of the 3 × 3 kernel is 1, since only the one-ring neighborhood is considered during the convolution. The convolution's behavior also has to be defined at the border of the image, where the kernel maps to undefined values outside the image. Generally, the filtered values outside the image boundaries are either treated as zeros or clamped to the border pixels of the image. The design of the convolution filter requires a careful selection of kernel weights to achieve the desired effect. A small worked example follows.
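As a small worked example (values chosen purely for illustration), applying a 3 × 3 averaging filter, with every weight equal to 1/9, at a pixel whose one-ring neighborhood holds the values 1 to 9 replaces the centre value with the neighborhood mean, (1 + 2 + ... + 9) / 9 = 5 :
#include <stdio.h>

int main(void)
{
    float neighborhood[3][3] = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
    float filter[3][3] = { {1.0f/9, 1.0f/9, 1.0f/9},
                           {1.0f/9, 1.0f/9, 1.0f/9},
                           {1.0f/9, 1.0f/9, 1.0f/9} };
    float newPixel = 0.0f;

    for (int k = 0; k < 3; ++k)
        for (int l = 0; l < 3; ++l)
            newPixel += neighborhood[k][l] * filter[k][l];

    printf("New centre pixel value : %.1f\n", newPixel);   /* prints 5.0 */
    return 0;
}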
 5.3.1.4 Applications of Convolution
Applications of convolution include those in digital signal processing, image processing,
language modeling and natural language processing, probability theory, statistics, physics, and
electrical engineering. A convolutional neural network is a class of artificial neural
network that uses convolutional layers to filter inputs for useful information, and has
applications in a number of image and speech processing systems.

 5.3.1.5 Working of Convolution Algorithm


1. A convolution algorithm works by iterating over each pixel in the source image. For each source pixel, the filter is centered over the pixel, and the values of the filter multiply the pixel values that they overlay. A sum of the products is then taken to produce a new pixel value.
2. Fourier Transforms in Convolution :
Convolution is important in physics and mathematics as it defines a bridge between the spatial and time domains (a pixel with intensity 147 at position (0, 30)) and the frequency domain (an amplitude of 0.3 at 30 Hz with a 60-degree phase) through the convolution theorem. This bridge is defined by the use of Fourier transforms : when the Fourier transform is applied to both the kernel and the feature map, the convolution operation is simplified significantly (integration becomes mere multiplication). Convolution in the frequency domain can be faster than in the time domain by using the Fast Fourier Transform (FFT) algorithm. Some of the fastest GPU implementations of convolutions (for example, some implementations in the NVIDIA cuDNN library) currently make use of Fourier transforms.

 5.3.1.6 Serial Convolution Code Written in C / C++ and OpenCL Kernel Code
1. Serial Convolution
The two outer loops iterate over pixels in the source image, selecting the next source pixel.
At each source pixel, the filter is applied to the neighboring pixels. The filter can try to access
pixels that are out-of-bounds. To deal with this situation, four explicit checks have been
provided within the innermost loop to set the out-of-bounds coordinate to the nearest border
pixel.

Serial Convolution

/* Iterate over the rows of the source image */
for (int i = 0; i < rows; i++)
{
    /* Iterate over the columns of the source image */
    for (int j = 0; j < cols; j++)
    {
        /* Reset sum for the new source pixel */
        int sum = 0;

        /* Apply the filter to the neighbourhood */
        for (int k = -halfFilterWidth; k <= halfFilterWidth; k++)
        {
            for (int l = -halfFilterWidth; l <= halfFilterWidth; l++)
            {
                /* Indices used to access the image */
                int r = i + k;
                int c = j + l;

                /* Handle out-of-bounds locations by clamping to
                   the border pixel */
                r = (r < 0) ? 0 : r;
                c = (c < 0) ? 0 : c;
                r = (r >= rows) ? rows - 1 : r;
                c = (c >= cols) ? cols - 1 : c;

                sum += Image[r][c] *
                       Filter[k + halfFilterWidth][l + halfFilterWidth];
            }
        }

        /* Write the new pixel value */
        resultImage[i][j] = sum;
    }
}

2. The OpenCL Kernel Code


 Using image memory objects to implement a convolution has a few advantages over an
implementation using buffers. Image sampling comes with options to automatically
handle out-of-bounds accesses and also provides optimized caching of two-dimensional
data.
 The OpenCL implementation of the convolution kernel -
It is written similarly to the C version. In the OpenCL version, one work-item is created per output pixel, using parallelism to remove the two outer loops of the serial implementation discussed above. The task of each work-item is to execute the two innermost loops, which perform the filtering operation. As in the previous example, reading input from the source image must be performed using an OpenCL construct that is specific to the data type.
 For this example, read_imagef() is used again. Accesses to an image always return a four-element vector (one per channel). In the serial code, .x is appended to the image access function to return the first component. In the current implementation, both pixel (the value returned by the image access) and sum (the resultant data that is copied to the output image) are declared as type float4. The convolution filter is a perfect candidate for constant memory in this example because all work-items access the same element in each iteration.
 The OpenCL kernel code
/* Adding the keyword __constant in the signature of the function places the
   filter in constant memory */
__kernel
void convolution(__read_only image2d_t inputImage,
                 __write_only image2d_t outputImage,
                 int rows, int cols,
                 __constant float *filter, int filterWidth,
                 sampler_t sampler)
{
    /* Store each work-item's unique row and column */
    int column = get_global_id(0);
    int row    = get_global_id(1);

    /* Half the width of the filter is needed for indexing memory later */
    int halfWidth = (int)(filterWidth / 2);

    /* All accesses to images return data as four-element vectors
       (i.e. float4), although only the x component will contain
       meaningful data in this code */
    float4 sum = {0.0f, 0.0f, 0.0f, 0.0f};

    /* Iterator for the filter */
    int filterIdx = 0;

    /* Each work-item iterates around its local area on the basis of the
       size of the filter */
    int2 coords;   /* Coordinates for accessing the image */

    /* Iterate over the filter rows */
    for (int i = -halfWidth; i <= halfWidth; i++)
    {
        coords.y = row + i;

        /* Iterate over the filter columns */
        for (int j = -halfWidth; j <= halfWidth; j++)
        {
            coords.x = column + j;

            /* Read a pixel from the image. A single-channel image stores
               the pixel in the x coordinate of the returned vector. */
            float4 pixel;
            pixel = read_imagef(inputImage, sampler, coords);

            /* The x-component is used when accumulating the filtered
               pixel value */
            sum.x += pixel.x * filter[filterIdx++];
        }
    }

    /* Copy the data to the output image */
    coords.x = column;
    coords.y = row;
    write_imagef(outputImage, coords, sum);
}

 In this example, the sampler is created using the host API and passed to the kernel as an argument. For this example, the C++ API is used (the C++ sampler constructor has identical parameters). The host API signature to create a sampler in C is as follows.
cl_sampler clCreateSampler(cl_context context, cl_bool normalized_coords,
                           cl_addressing_mode addressing_mode,
                           cl_filter_mode filter_mode,
                           cl_int *errcode_ret)

/* The corresponding C++ API signature */
cl::Sampler::Sampler(const Context& context, cl_bool normalized_coords,
                     cl_addressing_mode addressing_mode,
                     cl_filter_mode filter_mode,
                     cl_int *err = NULL)

/* The sampler utilized by the kernel */
cl::Sampler sampler = cl::Sampler(context, CL_FALSE,
                                  CL_ADDRESS_CLAMP_TO_EDGE, CL_FILTER_NEAREST);

 The sampler uses unnormalized coordinates. Two different options are shown for the remaining sampler parameters : the filtering mode returns the nearest pixel without interpolation (CL_FILTER_NEAREST), and the addressing mode for out-of-bounds accesses returns the nearest border pixel (CL_ADDRESS_CLAMP_TO_EDGE). With the C++ API, a two-dimensional image is created using the Image2D class, which requires an ImageFormat object as an argument. Unlike with the C API, an image descriptor is not required.

 The signatures for the Image2D and ImageFormat constructors are as follows.
cl::Image2D::Image2D(Context& context, cl_mem_flags flags, ImageFormat
format, ::size_t width, ::size_t height, ::size_t row_pitch = 0,

void * host_ptr = NULL, cl_int * err = NULL)

cl::ImageFormat::ImageFormat(cl_channel_order order, cl_channel_type type)

 The input and output images used for the convolution can be created using calls such as,
cl::ImageFormat imageFormat = cl::ImageFormat(CL_R, CL_FLOAT);

cl::Image2D inputImage = cl::Image2D(context, CL_MEM_READ_ONLY,


imageFormat, imageCols, imageRows);

cl::Image2D outputImage = cl::Image2D(context, CL_MEM_WRITE_ONLY,


imageFormat, imageCols, imageRows);

 5.3.1.7 Source Code for the Image Convolution Host Program


/* Complete source code using the C++ API for image convolution. In the host
   program, a 5 x 5 Gaussian blurring filter is used for the convolution. */

#define __CL_ENABLE_EXCEPTIONS

#include <CL/cl.hpp>
#include <fstream>
#include <iostream>
#include <vector>

#include "utils.h"
#include "bmp-utils.h"

static const char* inputImagePath = "../../Images/baloon.bmp";

static float gaussianBlurFilter[25] = {
    1.0f/273.0f,  4.0f/273.0f,  7.0f/273.0f,  4.0f/273.0f, 1.0f/273.0f,
    4.0f/273.0f, 16.0f/273.0f, 26.0f/273.0f, 16.0f/273.0f, 4.0f/273.0f,
    7.0f/273.0f, 26.0f/273.0f, 41.0f/273.0f, 26.0f/273.0f, 7.0f/273.0f,
    4.0f/273.0f, 16.0f/273.0f, 26.0f/273.0f, 16.0f/273.0f, 4.0f/273.0f,
    1.0f/273.0f,  4.0f/273.0f,  7.0f/273.0f,  4.0f/273.0f, 1.0f/273.0f};
static const int gaussianBlurFilterWidth = 5;

int main()
{
    float *hInputImage;
    float *hOutputImage;

    int imageRows;
    int imageCols;

    /* Set the filter here */
    int filterWidth = gaussianBlurFilterWidth;
    float *filter = gaussianBlurFilter;

    /* Read in the BMP image */
    hInputImage = readBmpFloat(inputImagePath, &imageRows, &imageCols);

    /* Allocate space for the output image */
    hOutputImage = new float[imageRows * imageCols];

    try
    {
        /* Query for platforms */
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);

        /* Get a list of devices on this platform */
        std::vector<cl::Device> devices;
        platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);

        /* Create a context for the devices */
        cl::Context context(devices);

        /* Create a command-queue for the first device */
        cl::CommandQueue queue = cl::CommandQueue(context, devices[0]);

        /* Create the images */
        cl::ImageFormat imageFormat = cl::ImageFormat(CL_R, CL_FLOAT);
        cl::Image2D inputImage = cl::Image2D(context, CL_MEM_READ_ONLY,
            imageFormat, imageCols, imageRows);
        cl::Image2D outputImage = cl::Image2D(context, CL_MEM_WRITE_ONLY,
            imageFormat, imageCols, imageRows);

        /* Create a buffer for the filter */
        cl::Buffer filterBuffer = cl::Buffer(context, CL_MEM_READ_ONLY,
            filterWidth * filterWidth * sizeof(float));

        /* Copy the input data to the input image */
        cl::size_t<3> origin;
        origin[0] = 0;
        origin[1] = 0;
        origin[2] = 0;
        cl::size_t<3> region;
        region[0] = imageCols;
        region[1] = imageRows;
        region[2] = 1;
        queue.enqueueWriteImage(inputImage, CL_TRUE, origin, region, 0, 0,
            hInputImage);

        /* Copy the filter to the buffer */
        queue.enqueueWriteBuffer(filterBuffer, CL_TRUE, 0,
            filterWidth * filterWidth * sizeof(float), filter);

        /* Create the sampler */
        cl::Sampler sampler = cl::Sampler(context, CL_FALSE,
            CL_ADDRESS_CLAMP_TO_EDGE, CL_FILTER_NEAREST);

        /* Read the program source */
        std::ifstream sourceFile("image-convolution.cl");
        std::string sourceCode(
            std::istreambuf_iterator<char>(sourceFile),
            (std::istreambuf_iterator<char>()));
        cl::Program::Sources source(1,
            std::make_pair(sourceCode.c_str(), sourceCode.length() + 1));

        /* Make program from the source code */
        cl::Program program = cl::Program(context, source);

        /* Build the program for the devices */
        program.build(devices);

        /* Create the kernel */
        cl::Kernel kernel(program, "convolution");

        /* Set the kernel arguments */
        kernel.setArg(0, inputImage);
        kernel.setArg(1, outputImage);
        kernel.setArg(2, filterBuffer);
        kernel.setArg(3, filterWidth);
        kernel.setArg(4, sampler);

        /* Execute the kernel */
        cl::NDRange global(imageCols, imageRows);
        cl::NDRange local(8, 8);
        queue.enqueueNDRangeKernel(kernel, cl::NullRange, global, local);

        /* Copy the output data back to the host */
        queue.enqueueReadImage(outputImage, CL_TRUE, origin, region, 0, 0,
            hOutputImage);

        /* Save the output BMP image */
        writeBmpFloat(hOutputImage, "cat-filtered.bmp", imageRows, imageCols,
            inputImagePath);
    }
    catch (cl::Error error)
    {
        std::cout << error.what() << "(" << error.err() << ")" << std::endl;
    }

    free(hInputImage);
    delete[] hOutputImage;

    return 0;
}
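
 The kernel file image-convolution.cl is not reproduced in the listing above. A possible kernel matching the argument order set by the host program is shown below; this is a hedged sketch (the exact kernel body used by the authors is not shown in the text), with the signature inferred from the setArg() calls and the single-channel CL_R / CL_FLOAT image format.

__kernel
void convolution(__read_only image2d_t inputImage,
                 __write_only image2d_t outputImage,
                 __constant float *filter,
                 int filterWidth,
                 sampler_t sampler)
{
    /* One work-item per output pixel. */
    int column = get_global_id(0);
    int row    = get_global_id(1);

    int halfWidth = filterWidth / 2;
    float sum = 0.0f;
    int filterIdx = 0;

    /* Iterate over the filter window centred on (column, row). */
    for (int i = -halfWidth; i <= halfWidth; i++) {
        for (int j = -halfWidth; j <= halfWidth; j++) {
            int2 coord = (int2)(column + j, row + i);
            /* Out-of-bounds reads are clamped to the edge by the sampler. */
            float4 pixel = read_imagef(inputImage, sampler, coord);
            sum += pixel.x * filter[filterIdx++];
        }
    }

    /* Write the single-channel result. */
    write_imagef(outputImage, (int2)(column, row), (float4)(sum, 0.0f, 0.0f, 1.0f));
}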

 5.3.1.8 Real World Interpretations of Convolution


 Convolution can describe the diffusion of information, for example, the diffusion
that takes place if one puts milk into black tea and does not stir (pixels diffuse
towards contours in an image). In quantum mechanics, it describes the probability of a
quantum particle being in a certain place when one measures the particle’s position
(average probability for a pixel’s position is highest at contours). In probability theory, it
describes cross-correlation, which is the amount of overlap or degree of similarity for two
sequences (similarity high if the pixels of a feature (e.g. nose) overlap in an image (e.g.
face)). In statistics, it describes a weighted moving average over a normalized sequence of
input (large weights for contours, small weights for everything else). Many other
interpretations exist.
 While it is unknown which interpretation is correct for deep learning, the cross-correlation
interpretation is currently the most useful: convolutional filters can be interpreted as feature
detectors, that is, the input (feature map) is filtered for a certain feature (the kernel) and the
output is large if the feature is detected in the image.

 5.3.2 Prefix Sum

 5.3.2.1 Prefix Sum Fundamentals


1. Parallel prefix sum (PPS) is a popular data-parallel algorithm. PPS is often used in
problems such as stream compaction, sorting, Eulerian tours of a graph, computation of
cumulative distribution functions, etc. The all-prefix-sums operation on an array of data is
commonly known as scan.


2. The scan just defined is an exclusive scan, because each element j of the result is the sum of
all elements up to but not including j in the input array. In an inclusive scan, all elements
including j are summed. An exclusive scan can be generated from an inclusive scan by
shifting the resulting array right by one element and inserting the identity. Likewise, an
inclusive scan can be generated from an exclusive scan by shifting the resulting array left
and inserting at the end the sum of the last element of the scan and the last element of the
input array (Blelloch 1990).
3. Blelloch (1990) describes all-prefix-sums as a good example of a computation that seems
inherently sequential, but for which there is an efficient parallel algorithm. Blelloch defines
the all-prefix-sums operation as follows,
The all-prefix-sums operation takes a binary associative operator ⊕ with identity I, and an
array of n elements
{ a0, a1, …, a(n–1) }
and returns the array
[ I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ a(n–2)) ]
For example, if ⊕ is addition, then the all-prefix-sums operation on the array
[4 2 8 0 5 2 8 4]
would result in,
[0 4 6 14 14 19 21 29]
Given an input array X = [x0, x1, ..., x(n–1)] and an associative binary operation ⊕, an inclusive
prefix sum computes an array of prefixes [x0, x0 ⊕ x1, x0 ⊕ x1 ⊕ x2, ..., x0 ⊕ ... ⊕ x(n–1)]. If the
binary operation is addition, the prefix of the i-th element is simply the sum of all preceding
elements plus the i-th element itself if one wants an inclusive prefix sum. If the element itself
is not included, then it is an exclusive prefix sum. Table 5.3.1 shows examples of an inclusive and
exclusive scan.
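
The conversion between the two variants described above can be written as a few lines of C (a small sketch added for illustration; for the array [4 2 8 0 5 2 8 4] the inclusive scan is [4 6 14 14 19 21 29 33]).

/* Shift an inclusive scan right by one and insert the identity (0 for addition)
   to obtain the corresponding exclusive scan. */
void inclusive_to_exclusive(const int *inclusive, int *exclusive, int n)
{
    exclusive[0] = 0;                      /* identity element */
    for (int i = 1; i < n; ++i)
        exclusive[i] = inclusive[i - 1];   /* shift right by one */
}
/* For the array above this produces [0 4 6 14 14 19 21 29],
   matching the exclusive scan shown earlier. */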

 5.3.2.2 Sequential Implementation of Scan on CPU


void scan(float* scanned, float* input, int length)
{
    scanned[0] = 0;
    for (int i = 1; i < length; ++i)
        scanned[i] = scanned[i-1] + input[i-1];
}


It iterates over all input elements and cumulatively computes the prefixes. It just adds each
element to the sum of the elements before (prior to) it. It is trivial, but sequential. Exactly n – 1
add operations are performed. It is optimal in terms of work efficiency.

 5.3.2.3 Parallel Scan Algorithm - A Solution by Hillis and Steele


i. This implementation of the algorithm requires two buffers of length n (in Fig. 5.3.3 the case
shown is n = 8 = 2^3). Here it is assumed that the number n of elements is a power of 2 :
n = 2^M.

Fig. 5.3.3 : Parallel Scan - Prefix Sum

ii. In this algorithm, the first iteration goes with stride 1 = 2^0. It starts at x[2^M] and applies this
stride to all the array elements before x[2^M] to find the mate of each of them. When looking
for the mate, the stride should not land before the beginning of the array. The sum replaces
the element of higher index. This means that there are 2^M – 2^0 additions.
The second iteration goes with stride 2 = 2^1. It starts at x[2^M] and applies this stride to all the
array elements before x[2^M] to find the mate of each of them. When looking for the mate,
the stride should not land before the beginning of the array. The sum replaces the element
of higher index. This results in 2^M – 2^1 additions.
The third iteration goes with stride 4 = 2^2. It starts at x[2^M] and applies this stride to all the
array elements before x[2^M] to find the mate of each of them. When looking for the mate,
the stride should not land before the beginning of the array. The sum replaces the element
of higher index. This results in 2^M – 2^2 additions.
Consider the k-th iteration (k is some arbitrary valid integer), which goes with stride 2^(k–1). It
starts at x[2^M] and applies this stride to all the array elements before x[2^M] to find the mate of
each of them. When looking for the mate, the stride should not land before the beginning of
the array. The sum replaces the element of higher index. This results in 2^M – 2^(k–1) additions.
The algorithm is as follows.

 5.3.2.4 Doubled Buffer Version Algorithm


for d := 0 to M – 1 do
    for all k in parallel do
        if k – 2^d >= 0 then
            x[out][k] := x[in][k] + x[in][k – 2^d]
        else
            x[out][k] := x[in][k]
    end forall
    swap(in, out)
end for

i. For the above algorithm the number of operations would be,
   (2^M – 2^0) + (2^M – 2^1) + … + (2^M – 2^k) + … + (2^M – 2^(M–1))
ii. The final operation count would be,
   M·2^M – (2^0 + … + 2^(M–1)) = M·2^M – 2^M + 1 = n·(log2(n) – 1) + 1
iii. Thus this algorithm has time complexity O(n·log(n)). This scan algorithm is not work
efficient : the sequential scan algorithm does only n – 1 adds. A factor of log(n) might hurt :
about 20x more work for 10^6 elements. A parallel algorithm can be slow when execution
resources are saturated due to low algorithm efficiency.
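
As a quick check of the count above (a worked example added for illustration), take n = 8, i.e. M = 3 :
(2^3 – 2^0) + (2^3 – 2^1) + (2^3 – 2^2) = 7 + 6 + 4 = 17, and n·(log2(n) – 1) + 1 = 8·(3 – 1) + 1 = 17.
Both expressions agree, while the sequential scan would need only n – 1 = 7 additions.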

 5.3.2.5 Hillis and Steele Algorithm - Kernel Function


__global__ void scan(float *g_odata, float *g_idata, int n)
{
    extern __shared__ float temp[];   // 2*n floats, allocated on invocation
    int thid = threadIdx.x;
    int pout = 0, pin = 1;

    // Load input into shared memory.
    // Exclusive scan: shift right by one and set the first element to 0.
    temp[pout*n + thid] = (thid > 0) ? g_idata[thid - 1] : 0;
    __syncthreads();

    for (int offset = 1; offset < n; offset <<= 1)
    {
        pout = 1 - pout;              // swap double buffer indices
        pin  = 1 - pout;
        if (thid >= offset)
            temp[pout*n + thid] = temp[pin*n + thid] + temp[pin*n + thid - offset];
        else
            temp[pout*n + thid] = temp[pin*n + thid];
        __syncthreads();
    }

    g_odata[thid] = temp[pout*n + thid];   // write output
}

The kernel only works when the entire array is processed by one block. A CUDA block is
limited to 512 threads on older devices (1024 on newer ones), which bounds the number of
elements a single block can scan with this kernel. This needs to be improved upon.
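
A minimal launch of this kernel might look as follows (a sketch with assumed device pointers d_in and d_out already allocated and filled with cudaMalloc / cudaMemcpy; the third launch parameter reserves the 2·n floats of shared memory used by the double buffer, and only a single block is launched because of the restriction above).

    int n = 512;   // must fit within one block
    scan<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);
    cudaDeviceSynchronize();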

 5.3.2.6 Improving Algorithm Efficiency


 A common parallel algorithm pattern uses balanced trees : build a balanced binary tree on
the input data and sweep it to and then from the root. The tree is not an actual data structure,
but a concept to determine what each thread does at each step.
 For the scan algorithm one can traverse from the leaves towards the root building partial
sums at internal nodes in the tree. The root holds the sum of all leaves (this is a reduction
algorithm !). Then traverse back from the root towards the leaves building the scan from the
partial sums.
 The reduction algorithm steps are,
for k = 0 to M – 1
    offset = 2^k
    for j = 1 to 2^(M – k – 1) in parallel do
        x[j·2^(k+1) – 1] = x[j·2^(k+1) – 1] + x[j·2^(k+1) – 2^k – 1]
    end for
end for
 The step count for the above reduced version of the scan algorithm is,
Σ (k = 0 to M – 1) 2^(M – k – 1) = 2^M – 1 = n – 1


 5.3.2.7 Work - Efficient Prefix Sum


 If the work efficiency of the above algorithm is carefully analyzed, it is found that it performs
O(n log2 n) addition operations, which is a factor of log2 n more than the linear sequential scan.
Thus, this version is obviously not work-efficient, which can significantly slow down the
computation, especially in the case of large arrays. Therefore it would be efficient to use an
algorithm that removes the logarithmic term and performs only O(n) additions.
 An algorithm that is optimal in terms of the number of addition operations was presented by
Blelloch in 1989. The main idea is to perform the PPS in two hierarchical sweeps that
manipulate data in a tree-like manner. In the first up-sweep (also called reduce) phase the
algorithm traverses the virtual tree from the leaves towards the root and computes prefix sums
only for some elements (i.e. the inner nodes of the tree). In fact, it computes a parallel
reduction with interleaved addressing. The second phase, called down-sweep, is responsible
for adding and propagating the intermediate results from the inner nodes back to the leaves. In
order to obtain the correct result one needs to overwrite the root (the result of the reduction
phase) with zero. Then with a simple descent through the tree the values of the child nodes
are computed as follows : the right child gets the sum of the current node value and the
former left child value, whereas the left child gets the current node value.
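
The two sweeps can be sketched in CUDA as follows. This is an illustration under simplifying assumptions (a single block, n a power of two, each thread handling two elements, no bank-conflict padding), not the authors' implementation; it would be launched as blelloch_scan<<<1, n/2, n * sizeof(float)>>>(d_out, d_in, n).

// Work-efficient (Blelloch) exclusive scan for one block; n must be a power of two.
__global__ void blelloch_scan(float *g_odata, const float *g_idata, int n)
{
    extern __shared__ float temp[];          // n floats, allocated at launch
    int thid = threadIdx.x;

    // Each thread loads two elements.
    temp[2*thid]     = g_idata[2*thid];
    temp[2*thid + 1] = g_idata[2*thid + 1];

    // Up-sweep (reduce) phase : build partial sums at internal tree nodes.
    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1)
    {
        __syncthreads();
        if (thid < d)
        {
            int ai = offset * (2*thid + 1) - 1;
            int bi = offset * (2*thid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }

    // Overwrite the root with zero before the down-sweep.
    if (thid == 0) temp[n - 1] = 0.0f;

    // Down-sweep phase : propagate partial sums back to the leaves.
    for (int d = 1; d < n; d <<= 1)
    {
        offset >>= 1;
        __syncthreads();
        if (thid < d)
        {
            int ai = offset * (2*thid + 1) - 1;
            int bi = offset * (2*thid + 2) - 1;
            float t  = temp[ai];
            temp[ai] = temp[bi];     // left child gets the current node value
            temp[bi] += t;           // right child gets current value + former left child
        }
    }
    __syncthreads();

    g_odata[2*thid]     = temp[2*thid];
    g_odata[2*thid + 1] = temp[2*thid + 1];
}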

 5.3.2.8 Avoiding Bank Conflicts


There can be bank conflicts that occur if different threads from the same work-group access
(different) data in the same bank. This will definitely penalize the performance so one should
try to achieve a conflict-free addressing of the individual elements. Notice that as the stride
between two elements is getting bigger, more and more threads will access the same bank as
one proceeds further. An easy solution is to reserve a bit more local memory and add some
padding. One can wrap the address (offset) to the shared memory with a macro to conveniently
experiment with different paddings. A simple conflict-free access can be achieved by adding 1
to the address for every multiple of the number of banks. By doing this it will ensure that
elements originally falling into the same bank will be shifted to different banks. One can use
the profiler to make sure there are no bank conflicts.
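
The padding idea can be implemented with a small offset macro (a sketch assuming 32 shared-memory banks; the macro and constant names are illustrative, not from the original text).

#define NUM_BANKS 32
#define LOG_NUM_BANKS 5
/* Add one element of padding for every NUM_BANKS elements of an index. */
#define CONFLICT_FREE_OFFSET(idx) ((idx) >> LOG_NUM_BANKS)

/* Usage : address shared memory as temp[ai + CONFLICT_FREE_OFFSET(ai)]
   instead of temp[ai], and reserve correspondingly more shared memory
   at kernel launch. */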

 5.3.2.9 Applications Prefix Sum Algorithm


 In general, all-prefix-sums can be used to convert certain sequential computations into
equivalent, but parallel, computations. There are many uses for scan, which include :
1. Sorting (radix sort, quick sort).
2. Lexical analysis.


3. String comparison.
4. Polynomial evaluation.
5. Stream compaction.
6. Building histograms.
7. Building data structures (like graphs, trees) and performing operation on them in
parallel.
8. Solving recurrence.
 On the GPU, the first published scan work was Horn's 2005 implementation (Horn 2005).
Horn's scan was used as a building block for a nonuniform stream compaction operation,
which was then used in a collision-detection application. Horn's scan implementation had
O(n log n) work complexity. Hensley et al. (2005) used scan for summed-area-table
generation later that year, improving the overall efficiency of Horn's implementation by
pruning unnecessary work. Like Horn's, however, the overall work complexity of Hensley
et al.'s technique was also O(n log n).

 5.3.3 Sparse Matrix - Matrix Multiplication

 5.3.3.1 Sparse Matrix Fundamentals


1. In a sparse matrix, the vast majority of the elements are zeros. Storing and processing these
zero elements are wasteful in terms of memory, time, and energy.
2. Operations on sparse data structures abound in all areas of information and physical
science. In particular, the sparse matrix-matrix multiplication (SpMM) is a fundamental
operation that arises in many practical contexts, including graph contractions, multi-source
breadth-first search, matching, and algebraic multigrid (AMG) methods.
3. A sparse matrix is a matrix where the majority of the elements are zero. Sparse matrices
arise in many science, engineering, and financial modeling problems. Matrices are often
used to represent the coefficients in a linear system of equations. Each row of the matrix
represents one equation of the linear system. In many science and engineering problems,
there are a large number of variables and the equations involved are loosely coupled. That
is, each equation only involves a small number of variables.
4. Sparse matrix multiplication is an important algorithm in a wide variety of problems,
including graph algorithms, simulations and linear solving to name a few. Yet, there are but
a few works related to acceleration of sparse matrix multiplication on a GPU.


5. Many algorithms in machine learning, data analysis, and graph analysis can be organized
such that the bulk of the computation is structured as sparse matrix-dense matrix
multiplication (SpMM).
6. Modern GPUs are throughput-oriented many-core processors that rely on large-scale
multithreading to attain high computational throughput and hide memory access time. The
latest generation of NVIDIA GPUs has up to 80 "streaming multiprocessors" (SMs), each
with up to hundreds of arithmetic logic units (ALUs). GPU programs are called kernels,
which run a large number of threads in parallel in a single-program, multiple-data (SPMD)
fashion. The underlying hardware runs an instruction on each SM on each clock cycle on a
warp of 32 threads in lockstep. The largest parallel unit that can be synchronized within a
GPU kernel is called a cooperative thread array (CTA), which is composed of warps. For
problems that require irregular data access, a successful GPU implementation needs to
ensure coalesced memory access to external memory and efficiently use the memory
hierarchy, minimize thread divergence within a warp, and maintain high occupancy, which
is a measure of how many threads are available to run on the GPU.

 5.3.3.2 Sparse Matrix and Compressed Sparse Row (CSR) Storage Format
i. An m × n matrix is often called sparse if its number of non-zeroes is small enough compared
to O(mn) such that it makes sense to take advantage of sparsity. The compressed sparse row
(CSR) format stores only the column indices and values of non-zeroes within a row. The
start and end of each row are then stored, as offsets into the column-index and value arrays,
in a row offsets (or row pointers) array. Hence, CSR requires only about m + 2·nnz memory for
storage, where nnz is the number of non-zeroes.
ii. A dense matrix is in row-major order when successive elements in the same row are
contiguous in memory. Similarly, it is in column-major order when successive elements in
the same column are contiguous in memory.
iii. Sparse matrices are stored in a format that avoids storing zero elements. The discussion starts
with the Compressed Sparse Row (CSR) storage format.
iv. The Compressed Sparse Row (CSR) format is a popular, general-purpose sparse matrix
representation. CSR stores a sparse matrix via three arrays :
(1) the array AA contains all the nonzero entries of A.
(2) the array JA contains column indices of the nonzero entries stored in AA.
(3) entries of the array IA point to the first entry of subsequent rows of A in the arrays AA
and JA.


This process is illustrated in the figure below.

For example, the 6 × 6 matrix

        | 4 1 0 1 0 0 |
        | 1 4 1 0 1 0 |
    A = | 0 1 4 0 0 1 |
        | 1 0 0 4 1 0 |
        | 0 1 0 1 4 1 |
        | 0 0 1 0 1 4 |

is stored in the CSR format by

AA : [4 1 1 1 4 1 1 1 4 1 1 4 1 1 1 4 1 1 1 4] ,

JA : [0 1 3 0 1 2 4 1 2 5 0 3 4 1 3 4 5 2 4 5] ,

IA : [0 3 7 10 13 17 20].

Fig. 5.3.4 : CSR Format
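
The same example can be written directly as C arrays (a small sketch, not from the original text; the array names mirror AA, JA and IA above).

/* CSR representation of the 6 x 6 example matrix A above. */
#define NUM_ROWS 6
#define NNZ      20

float AA[NNZ] = {4,1,1, 1,4,1,1, 1,4,1, 1,4,1, 1,1,4,1, 1,1,4};   /* non-zero values   */
int   JA[NNZ] = {0,1,3, 0,1,2,4, 1,2,5, 0,3,4, 1,3,4,5, 2,4,5};   /* column indices    */
int   IA[NUM_ROWS + 1] = {0, 3, 7, 10, 13, 17, 20};               /* row start offsets */

/* Row i occupies entries IA[i] .. IA[i+1]-1 of AA and JA. */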

 5.3.3.3 cuSPARSE Library


The vendor-shipped cuSPARSE library provides two functions, csrmm and csrmm2,
for SpMM on CSR-format input matrices. The former expects a column-major input dense
matrix and generates column-major output, while the latter expects row-major input and
generates column-major output. Among the many efforts to define and characterize alternate
matrix formats for SpMM are a variant of ELLPACK called ELLPACK-R and a variant of
Sliced ELLPACK called SELL-P. Hong et al. perform dynamic load-balancing by separating
the sparse matrix into heavy and light rows. The heavy rows are processed by CSR and the
light rows by Doubly Compressed Sparse Row (DCSR) in order to take advantage of tiling.
However, there is a real cost to deviating from the standard CSR encoding. Firstly, the rest of
the computation pipeline will need to convert from CSR to another format to run SpMM and
convert back. This process may take longer than the SpMM operation itself. Secondly, the
pipeline will need to reserve valuable memory to store multiple copies of the same matrix -
one in CSR format, another in the format used for SpMM.

 5.3.3.4 Load Balancing Problem


 In the context of SpMM, the load-balancing problem has two aspects,
1. Load imbalance across warps ("Type 1"). Some CTAs or warps may be assigned less work than
others, which may lead to these less-loaded computation units being idle while the more
loaded ones continue to do useful work.
2. Load imbalance within a warp, in two ways, which are collectively called "Type 2"
load imbalance. Some warps may not have enough work to occupy all 32 threads in the
warp. In this case, thread processors are idle, and performance is lowered. Some warps
may assign different tasks to different threads.

In this case, SIMD execution within a warp means that some threads are idle while other
threads are running; moreover, the divergence in execution across the warp means memory
accesses across the entire warp are unlikely to be coalesced.

 5.3.3.5 Parallelizations of SpMM


1. Row split - Assigns an equal number of rows to each processor.
2. Merge based - Performs two-phase decomposition.
Phase 1 - the kernel divides work evenly amongst CTAs.
Phase 2 - Kernel processes the work as follows,
(a) Nonzero split - Assign an equal number of nonzeroes per processor.
Then do a 1-D (1-dimensional) binary search on row offsets to determine at which
row to start.
(b) Merge path - Assign an equal number of nonzeroes and rows per processor. This is
done by doing a 2-D binary search (i.e., on the diagonal line) over row offsets and
nonzero indices of matrix A.
While row split focuses primarily on ILP (Instruction-Level Parallelism) and TLP (Thread-
Level Parallelism), nonzero split and merge path focus on load-balancing as well. Consider
nonzero split and merge path to be explicit load-balancing methods, as they rearrange the
distribution of work such that each thread must perform T independent instructions; if T > 1,
then explicit load-balancing creates ILP where there was previously little or none. Thus load-
balance is closely linked with ILP, because if each thread is guaranteed T > 1 units of
independent work (ILP), then each thread is doing the same amount of work (i.e., is load-
balanced).
 Algorithm I : Row-splitting SpMM
Row split aims to assign each row to a different thread, warp, or CTA. The typical SpMV
row split is the special case in which only the left-most column of matrix B is used; it gives
independent instructions but uncoalesced, random accesses into the vector (a simplified SpMM
kernel sketch is given after the design decisions below). Although row-split is a well-known
method for SpMM, there are three important design decisions,
1. Granularity - Should each row be assigned to a thread, warp, or CTA? Each row is
assigned to a warp compared to the alternatives of assigning a thread and a CTA per
row. This leads to the simplest design out of the three options, since it gives us
coalesced memory accesses into B. For matrices with few non-zeroes per row, the
thread-per-matrix-row work assignment may be more efficient.


2. Memory access pattern - How should work be divided in fetching B ? What is the
impact on ILP and TLP ?
3. Shared memory - Can shared memory be used for performance gain ? This design
decision had the greatest impact on performance.
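
To make the row-split idea concrete, the following CUDA sketch (an illustration with an illustrative kernel name; it assigns one thread per row rather than the warp-per-row design favoured above, and assumes row-major dense matrices B and C) multiplies a CSR matrix A by a dense matrix B.

/* C = A * B, where A is m x k in CSR, B is k x n dense (row-major),
   C is m x n dense (row-major). One thread per row of A. */
__global__ void spmm_row_split(int m, int n,
                               const int *rowPtr, const int *colInd,
                               const float *vals,
                               const float *B, float *C)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;

    for (int j = 0; j < n; ++j) {                      // one output column at a time
        float sum = 0.0f;
        for (int idx = rowPtr[row]; idx < rowPtr[row + 1]; ++idx)
            sum += vals[idx] * B[colInd[idx] * n + j]; // random access into B
        C[row * n + j] = sum;
    }
}

A warp-per-row variant would instead let the 32 threads of a warp cooperate on one row so that accesses to B can be coalesced, which is the design choice favoured in the text above.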
 Algorithm II : Merge-based SpMM
 The essence of merge-based algorithms is to explicitly and evenly distribute the non-zeroes
across parallel processors. It does so by doing a two-phase decomposition: In the first phase
(Partition SpMM), it divides the work between threads so that T work is assigned per
thread, and based on this assignment deduces the starting indices of each CTA.
 Once coordinated, work is done in the second phase. In theory, this approach should
eliminate both Type 1 and Type 2 load imbalances.
mk kn
Input : Sparse matrix in CSR A  R and dense matrix B  R .
mn
Output : C  R such that C  AB.
1. procedure SPMMMERGE (A, B)
2. limits []  PARTITIONSPMM (A, blockDim.x) // Phase 1 : Divide work and run
// binary-search
3. for each CTA i in parallel do // Phase 2 : Do computation
4. num_rows  limits [i + 1] – limits[i]
5. shared.csr  GLOBALTOSHARED (A.row_ptr + limits[i], num_rows)
// Read A and store to shared memory
6. end  min(blockDim.x, A.nnz – blockIdx.x  blockDim.x)
7. if row_ind < end then
8. col_ind  A.col_ind[row_ind] // Read A if matrix not finished
9. valA  A.values[row_ind]
10. else
11. col_ind  0 // Otherwise do nothing
12. valA  0
13. end if
14. for each thread j in parallel do
15. for j = 0, 1, …, 31 do // Unroll this loop


16. new_ind[j]  Broadcast(col_ind, j) // Each thread broadcasts


17. new_val[j]  Broadcast(valA, j) // col_ind and valA
18. valB[j]  B[col_ind] [j]  new_val[j] // Read B
19. end for
20. end for
21. terms  PREPARESPMM (shared.csr) // Flatten CSR-to-COO
22. carryout[i]  REDUCETOGLOBALSPMM (C, valB, valB) // Compute partial
of C and save carry-outs
23. end for
24. FIXCARRYOUT (C, limits, carryout) // Carry-out fix-up (rows spanning
// across blocks)
25. return C
26. end procedure

 5.4 Heterogeneous Cluster

 5.4.1 Concept of Heterogeneous Clusters


1. One of the biggest advantages of distributed systems over standalone computers is the
ability to share the workload between computers, processors, and cores. Clusters (networks
of computers configured to work as one computer), grids, and cloud computing are among
the most progressive branches in the field of parallel computing and data processing
nowadays, and have been identified as important new technologies that may be used to
solve complex scientific and engineering problems as well as to tackle many projects in
commerce and industry.
2. A broad spectrum of parallel computing activities and scientific projects are currently carried
out. A new model for parallel computing that relies on the use of CPU and GPU units to
solve general purpose scientific and engineering problems has revolutionized data computation
over the last few years.
3. The tasks that can be divided up into large numbers of independent parts are good
candidates. GPU-enabled calculations seem to be very promising in data analysis,
optimization, simulation, etc. Using CUDA or OpenCL, and graphical processing units
many real-world applications can be easily implemented and run significantly faster than in
multi-processor or multi-core systems.

4. A heterogeneous computing system refers to a system that contains different types of


computational units, such as multicore CPUs, GPUs, DSPs, FPGAs, and ASICs. The
computational units in a heterogeneous system typically include a general-purpose
processor that runs an operating system.
5. In High-Performance Computing (HPC), various applications require the aggregate
computing power of a cluster of computing nodes. Many of the HPC clusters today have
one or more hosts and one or more devices in each node. Since early days, these clusters
have been programmed predominantly with the Message Passing Interface (MPI). MPI
helps to scale heterogeneous applications to multiple nodes in a cluster environment; by
carrying out domain partitioning, point-to-point communication, and collective
communication, it can help a kernel scale up, that is, improve efficiency.

 5.4.2 MPI (Message Passing Interface)

 5.4.2.1 MPI Fundamentals


1. The most widely used programming interface for computing clusters today is MPI
[Gropp1999], which is a set of API functions for communication between processes
running in a computing cluster. MPI assumes a distributed memory model where processes
exchange information by sending messages to each other.
2. When an application uses API communication functions, it does not need to deal with the
details of the interconnect network. The MPI implementation allows the processes to
address each other using logical numbers, much the same way as using phone numbers in a
telephone system - telephone users can dial each other using phone numbers without
knowing exactly where the called person is and how the call is routed.
3. Message Passing Interface, is a standard API for communicating data via messages between
distributed processes that is commonly used in HPC to build applications that can scale to
multi-node computer clusters. As such, MPI is fully compatible with CUDA, which is
designed for parallel computing on a single computer or node. There are many reasons for
wanting to combine the two parallel programming approaches of MPI and CUDA. A
common reason is to enable solving problems with a data size too large to fit into the
memory of a single GPU, or that would require an unreasonably long compute time on a
single node. Another reason is to accelerate an existing MPI application with GPUs or to
enable an existing single-node multi-GPU application to scale across multiple nodes. With
CUDA-aware MPI these goals can be achieved easily and efficiently.


4. In a typical MPI application, data and work are partitioned among processes. As shown in
below Fig. 5.4.1, each node can contain one or more processes, shown as clouds within
nodes. As these processes progress, they may need data from each other. This need is
satisfied by sending and receiving messages. In some cases, the processes also need to
synchronize with each other and generate collective results when collaborating on a large
task. This is done with collective communication API functions.

Fig. 5.4.1 : MPI applications running

 5.4.2.2 MPI Working


1. Similar to CUDA, MPI programs are based on the SPMD(Single Program, Multiple Data)
parallel execution model. All MPI processes execute the same program. The MPI system
provides a set of API functions to establish communication systems that allow the processes
to communicate with each other.
2. MPI process is usually called a “rank”. The processes involved in an MPI program have
private address spaces, which allows an MPI program to run on a system with a distributed
memory space, such as a cluster. The MPI standard defines a message-passing API which
covers point-to-point messages as well as collective operations like reductions.
3. Below are five essential API functions that set up and shut down communication systems
for an MPI application.
1. int MPI_Init (int*argc, char***argv) - Initialize MPI.
2. int MPI_Comm_rank (MPI_Comm comm, int *rank) - Rank of the calling process in
group of comm.
3. int MPI_Comm_size (MPI_Comm comm, int *size) - Number of processes in the group
of comm.

4. int MPI_Abort (MPI_Comm comm, int errorcode) - Terminate the MPI communication
connection with an error flag.

5. int MPI_Finalize ( ) - End an MPI application, close all resources.

 5.4.2.3 MPI Programming Example


 MPI Programming Example 1
/* A simple MPI program in C language - Program sends the message "Hi, I am
   there" from process 0 to process 1. */

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char message[20];
    int myrank, tag = 99;
    MPI_Status status;

    /* Initialize the MPI library */
    MPI_Init(&argc, &argv);

    /* Determine the unique id of the calling process among all processes
       participating in this MPI program. This id is usually called the MPI rank. */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        strcpy(message, "Hi, I am there");
        /* Send the message "Hi, I am there" from the process with rank 0 to the
           process with rank 1. */
        MPI_Send(message, strlen(message)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    } else {
        /* Receive a message with a maximum length of 20 characters from the
           process with rank 0. */
        MPI_Recv(message, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
        printf("received %s\n", message);
    }

    /* Finalize the MPI library to free resources acquired by it. */
    MPI_Finalize();
    return 0;
}

 Programming Example 2
/* A simple MPI program in C language - Program uses MPI library APIs to create
   processes with specified constraints on the number of processes to create. */

#include <stdio.h>
#include "mpi.h"

/* compute_process() and data_server() are application functions defined elsewhere. */
void compute_process(int dimx, int dimy, int dimz, int nreps);
void data_server(int dimx, int dimy, int dimz);

int main(int argc, char *argv[])
{
    int pad = 0, dimx = 480 + pad, dimy = 480, dimz = 400, nreps = 100;
    int procid = -1, nump = -1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &procid);
    MPI_Comm_size(MPI_COMM_WORLD, &nump);

    if (nump < 3) {
        if (0 == procid)
            printf("Needed 3 or more processes.\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
        return 1;
    }

    if (procid < nump - 1)
        compute_process(dimx, dimy, dimz / (nump - 1), nreps);
    else
        data_server(dimx, dimy, dimz);

    MPI_Finalize();
    return 0;
}

 The above program is a simple MPI program that uses these API functions. A user needs to
supply the executable file of the program to the mpirun command or the mpiexec command
in a cluster.
 Each process starts by initializing the MPI runtime with a MPI_Init() call. This initializes
the communication system for all the processes running the application.
 Once the MPI runtime is initialized, each process calls two functions to prepare for
communication.
 The first function MPI_Comm_rank() - returns a unique number to call each process,
called an MPI rank or process ID. The numbers received by the processes vary from 0 to
the number of processes minus 1.
 MPI rank for a process is equivalent to the expression blockIdx.x * blockDim.x +
threadIdx.x for a CUDA thread. It uniquely identifies the process in a communication,
similar to the phone number in a telephone system.
 The MPI_Comm_rank() has two parameters. The first one is an MPI built-in type
MPI_Comm that specifies the scope of the request. Values of MPI_Comm are
commonly referred to as communicators.
 MPI_Comm and other MPI built-in types are defined in a mpi.h header file that should be
included in all C program files that use MPI. This is similar to the cuda.h header file for
CUDA programs.
 An MPI application can create one or more intracommunicators. Members of each
intracommunicator are MPI processes.


 MPI_Comm_rank() assigns a unique ID to each process in an intracommunicator. In above


program 2, the parameter value passed is MPI_COMM_WORLD, which means that the
intracommunicator includes all MPI processes running the application. The second
parameter to the MPI_Comm_rank() function is a pointer to an integer variable into which
the function will deposit the returned rank value.
 In above program, a variable procid is declared for this purpose.
 After the MPI_Comm_rank() returns, the procid variable will contain the unique ID for the
calling process.
 The second API function is MPI_Comm_size(), which returns the total number of MPI
processes running in the intracommunicator. The MPI_Comm_size() function takes two
parameters. The first one is an MPI built-in type MPI_Comm that gives the scope of the
request. In the above program 2 the scope is MPI_COMM_WORLD. Since
MPI_COMM_WORLD is used, the returned value is the number of MPI processes running
the application. This is requested by a user when the application is submitted using the
mpirun command or the mpiexec command. However, the user may not have requested a
sufficient number of processes. Also, the system may or may not be able to create all the
processes requested. Therefore, it is good practice for an MPI application program to check
the actual number of processes running. The second parameter is a pointer to an integer
variable into which the MPI_Comm_size() function will deposit the return value. In above
program 2, a variable nump is declared for this purpose. After the function returns, the
variable nump contains the number of MPI processes running the application.
 In program 2, it is assumed that the application requires at least three MPI processes.
Hence, it checks if the number of processes is at least three. If not, it calls the
MPI_Abort() function to terminate the communication connections and return with
an error flag value of 1. Program 2 also shows a common pattern for reporting errors or
other chores. There are multiple MPI processes but the error needs to be reported only once;
the application code designates the process with procid == 0 to do the reporting.
 The MPI_Abort() function takes two parameters. The first is the scope of the request. In
program 2, the scope is all MPI processes running the application. The second
parameter is a code for the type of error that caused the abort. Any number other than 0
indicates that an error has happened. If the number of processes satisfies the requirement,
the application program goes on to perform the calculation. In program 2, the application
uses nump-1 processes (procid from 0 to nump-2) to perform the calculation and one
process (the last one, whose procid is nump-1) to perform an input / output (I/O)
service for the other processes. The process that performs the I/O service is referred to as the
data-server and the processes that perform the calculation as compute-processes.


 In program 2, if the procid of a process is within the range from 0 to nump-2, it is a
compute process and calls the compute_process() function. If the procid is nump-1, it
is the data server and calls the data_server() function.
 After the application completes its computation, it notifies the MPI runtime with a call to
MPI_Finalize(), which frees all MPI communication resources allocated to the
application. The application can then exit with a return value 0, which indicates that no
error occurred.

 5.4.2.4 MPI Point to Point Communication Types


 Point-to-Point Communication
 The most elementary form of message-passing communication involves two nodes, one
passing a message to the other. Although there are several ways that this might happen in
hardware, logically the communication is point-to-point: one node calls a send routine and
the other calls a receive.
 A message sent from a sender contains two parts: data (message content) and the message
envelope. The data part of the message consists of a sequence of successive items of the
type indicated by the variable datatype. MPI supports all the basic C datatypes and allows
a more elaborate application to construct new datatypes at runtime. The basic MPI
datatypes for C are MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_COMPLEX,
MPI_CHAR. The message envelope contains information such as the source (sender),
destination (receiver), tag and communicator.
 As with most existing message-passing systems today, MPI provides blocking send and
receive as well as nonblocking send and receive.
 Blocking Send and Receive
1. Blocking Send Operation :
Below is the syntax of the blocking send operation :
MPI_Send(void* buf, int count, MPI_Datatype datatype,

int dest, int tag, MPI_Comm comm);

where buf is the address of the send buffer, count is the number of items in the send buffer,
datatype (MPI_INT, MPI_FLOAT, MPI_CHAR, etc.) describes these items' datatype, dest
is the rank of the destination processor, tag is the message type identifier and comm is the
communicator to be used (see MPI advanced topics for more information on
communicators).


The blocking send call does not return until the message has been safely stored away so that
the sender can freely reuse the send buffer.
2. Blocking Receive Operation :
The MPI blocking receive call has the form:
MPI_Recv(void* buf, int count, MPI_Datatype datatype,

int source, int tag, MPI_Comm comm,

MPI_Status *status);

where all the arguments have the same meaning as for the blocking send except for source
and status. Source is the rank of the node sent the message and status is an MPI-defined
integer array of size MPI_STATUS_SIZE. The information carried by status can be used in
other MPI routines.
The blocking receive does not return until the message has been stored in the receive
buffer.
3. Order :
Messages are non-overtaking :
If a sender sends two messages in succession to the same destination and both match the
same receive, then this operation cannot receive the second message while the first is still
pending. If a receiver posts two receives in succession and both match the same message,
then this message cannot satisfy the second receive operation, as long as the first one is still
pending. This requirement facilitates matching sends to receives. It guarantees that
message-passing code is deterministic if processes are single-threaded and the wildcard
MPI_ANY_SOURCE is not used in receives.
4. Progress :
If a pair of matching send and receives have been initiated on two processes, then at least
one of these two operations will complete, independent of other action in the system. The
send operation will complete unless the receive is satisfied and completed by another
message. The receive operation will complete unless the message sent is consumed by
another matching receive that was posted at the same destination process.
5. Avoid a Deadlock :
It is possible to get into a deadlock situation if one uses blocking send and receive. Here is a
fragment of code to illustrate the deadlock situation :
MPI_Comm_rank(comm, &rank);
if (rank == 0) {
    MPI_Recv(recvbuf, count, MPI_REAL, 1, tag, comm, &status);
    MPI_Send(sendbuf, count, MPI_REAL, 1, tag, comm);
} else if (rank == 1) {
    MPI_Recv(recvbuf, count, MPI_REAL, 0, tag, comm, &status);
    MPI_Send(sendbuf, count, MPI_REAL, 0, tag, comm);
}

The receive operation of the first process must complete before its send, and can complete
only if the matching send of the second process is executed. The receive operation of the
second process must complete before its send and can complete only if the matching send
of the first process is executed. This program will always deadlock. To avoid deadlock, one
can use one of the following two examples :
MPI_Comm_rank(comm, &rank);
if (rank == 0) {
    MPI_Send(sendbuf, count, MPI_REAL, 1, tag, comm);
    MPI_Recv(recvbuf, count, MPI_REAL, 1, tag, comm, &status);
} else if (rank == 1) {
    MPI_Recv(recvbuf, count, MPI_REAL, 0, tag, comm, &status);
    MPI_Send(sendbuf, count, MPI_REAL, 0, tag, comm);
}

or

MPI_Comm_rank(comm, &rank);
if (rank == 0) {
    MPI_Recv(recvbuf, count, MPI_REAL, 1, tag, comm, &status);
    MPI_Send(sendbuf, count, MPI_REAL, 1, tag, comm);
} else if (rank == 1) {
    MPI_Send(sendbuf, count, MPI_REAL, 0, tag, comm);
    MPI_Recv(recvbuf, count, MPI_REAL, 0, tag, comm, &status);
}

 Summary - A Minimum Set of MPI Calls


Summary of the six functions discussed above :

Function          Explanation
-----------------------------------------------------
MPI_Init          Initiate MPI
MPI_Comm_size     Find out how many processes there are
MPI_Comm_rank     Determine the rank of the calling process
MPI_Send          Send a message
MPI_Recv          Receive a message
MPI_Finalize      Terminate MPI

With these six calls, one can write a vast number of useful and efficient programs.
Other functions in MPI not yet introduced all add flexibility, robustness, efficiency,
modularity, and convenience. It is a good practice, however, for those who are just beginning
to learn message-passing programming to ignore other more advanced MPI functions and
concepts and to concentrate first on these six basic functions.
 Nonblocking Send and Receive - Overlapping Communication and Computation
One can improve performance on many systems by overlapping communication and
computation. One way to achieve that is to use nonblocking communication. MPI includes
nonblocking send and receive calls, described below.
1. Nonblocking Send Operation
A nonblocking send call initiates the send operation, but does not complete it. The send
start call will return before the message is copied out of the send buffer. A separate send
complete call is needed to complete the communication, i.e., to verify that the data have
been copied out of the send buffer. Here is the syntax of the nonblocking send operation :
MPI_Isend(void* buf, int count, MPI_Datatype datatype,

int dest, int tag, MPI_Comm comm,

MPI_Request *request);


where the first six arguments have the same meaning as in the blocking send. Request is a
communication request handle that one can use later to query the status of the
communication or to wait for its completion. Calling the nonblocking send indicates that
the system may start copying data out of the send buffer. The sender should not access any
part of the send buffer after a nonblocking send operation is called until the send completes.
2. Nonblocking Receive Operation
A nonblocking receive start call initiates the receive operation, but does not complete it.
The call will return before a message is stored into the receive buffer. An MPI_Wait call is
needed to complete the receive operation. The format of the nonblocking receive is :
MPI_Irecv(void* buf, int count, MPI_Datatype datatype,

int source, int tag, MPI_Comm comm,

MPI_Request *request);

where source is the rank of the processor that sent the message and the remaining
arguments are as previously discussed.
The receive buffer stores count consecutive elements of the type specified by datatype,
starting at the address in buf. The length of the received message must be less than or equal
to the length of the receive buffer. An overflow error occurs if all incoming data does not
fit, without truncation, into the receive buffer.
3. Check for Completion
To check the status of a nonblocking send or receive, one can call
MPI_Test(MPI_Request *request, int *flag,

MPI_Status *status);

where request is a communication request (from a call to MPI_Isend, for example), flag is
set to TRUE if the operation is completed, and status is a status object containing
information on the transaction returned by the call. The MPI_Test call returns immediately.
A call to MPI_Wait returns when the operation identified by request is complete.
MPI_Wait(MPI_Request *request, MPI_Status *status);

where all arguments are the same as for MPI_Test.


4. Code Fragments Using Nonblocking Send and Receive
Order : Nonblocking communication operations occur in the same order as the execution of
the calls that initiate the communication. In both blocking and nonblocking communication,
operations are non-overtaking. An example of a nonblocking send and receive with an
MPI_Wait call is given below,

MPI_Comm_rank(comm, &rank);
if (rank == 0) {
    MPI_Isend(sendbuf, count, MPI_REAL, 1, tag, comm, &request);
    /* ... do some computation to mask latency ... */
    MPI_Wait(&request, &status);
} else if (rank == 1) {
    MPI_Irecv(recvbuf, count, MPI_REAL, 0, tag, comm, &request);
    /* ... do some computation to mask latency ... */
    MPI_Wait(&request, &status);
}

A request object can be deadlocked without waiting for the associated communication to
complete. One way to avoid any deadlock caused by a request object is to use a different
request for each pair of sends and receives. For example :
MPI_Comm_rank(comm, &rank);
if (rank == 0) {
    MPI_Isend(sbuf1, count, MPI_REAL, 1, tag, comm, &req1);
    MPI_Isend(sbuf2, count, MPI_REAL, 1, tag, comm, &req2);
} else if (rank == 1) {
    MPI_Irecv(rbuf1, count, MPI_REAL, 0, tag, comm, &req1);
    MPI_Irecv(rbuf2, count, MPI_REAL, 0, tag, comm, &req2);
}
MPI_Wait(&req1, &status);
MPI_Wait(&req2, &status);
5. Progress
A call to MPI_Wait that completes a receive will eventually terminate and return if a
matching send has been started, unless the send is satisfied by another receive. In particular,
if the matching send is nonblocking, then the receive should complete even if no call is
executed by the sender to complete the send. Similarly, a call to MPI_Wait that completes a
send will eventually return if a matching receive has been started, unless the receive is
satisfied by another send, even if no call is executed to complete the receive.

 Collective Communication and Synchronization Points


1. Essentially collective communication implies a synchronization point among processes.
This means that all processes must reach a point in their code before they can all begin
executing again.
2. Communication and computation is coordinated among a group of processes in a
communicator. Groups and communicators can be constructed “by hand” or using topology
routines. Non-blocking versions of collective operations added in MPI-3.
 MPI Synchronization using Barrier Call
MPI has a special function that is dedicated to synchronizing processes,
MPI_Barrier(MPI_Comm communicator)

The name of the function is quite descriptive - the function forms a barrier, and no
processes in the communicator can pass the barrier until all of them call the function.
Following figure illustrates the barrier usage. In the figure horizontal axis represents execution
of the program and the circles represent different processes.

Fig. 5.4.2 : MPI communication using barrier

As seen in Fig. 5.4.2, process zero first calls MPI_Barrier at the first time snapshot (T1).
While process zero is hung up at the barrier, processes one and three eventually make it (T2).
When process two finally makes it to the barrier (T3), all of the processes then begin
execution again (T4). MPI_Barrier can be useful for many things. One of the primary uses of
MPI_Barrier is to synchronize a program so that portions of the parallel code can be timed
accurately.


MPI_Barrier(comm) blocks until all processes in the group of the communicator comm
call it. It is almost never required for correctness in a parallel program, but it is occasionally
useful in measuring performance and load balancing. In unusual cases it can increase
performance by reducing network contention. Note that it does not guarantee that processes
exit the barrier at the same (or even close to the same) time.
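
The timing use case mentioned above can be sketched as follows (an illustration, not from the original text; my_rank is assumed to have been obtained earlier with MPI_Comm_rank).

/* Time a parallel section consistently across ranks using a barrier. */
MPI_Barrier(MPI_COMM_WORLD);          /* make sure every rank starts together */
double t0 = MPI_Wtime();

/* ... parallel work to be timed ... */

MPI_Barrier(MPI_COMM_WORLD);          /* wait until every rank has finished   */
double elapsed = MPI_Wtime() - t0;
if (my_rank == 0)
    printf("elapsed time: %f seconds\n", elapsed);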
 MPI Broadcasting with MPI_Bcast
 A broadcast is one of the standard collective communication techniques. During a
broadcast, one process sends the same data to all processes in a communicator. One of the
main uses of broadcasting is to send out user input to a parallel program, or send out
configuration parameters to all processes. The communication pattern of a broadcast is
shown in below Fig. 5.4.3.

Fig. 5.4.3 : MPI communication - broadcast

 In this example, process zero is the root process, and it has the initial copy of data. All of
the other processes receive the copy of data.
 In MPI, broadcasting can be accomplished by using MPI_Bcast. The function prototype is,

MPI_Bcast(void* data, int count, MPI_Datatype datatype, int root,
          MPI_Comm communicator)

 Although the root process and receiver processes do different jobs, they all call the same
MPI_Bcast function. When the root process (as shown in above figure, it was process zero)
calls MPI_Bcast, the data variable will be sent to all other processes. When all of the
receiver processes call MPI_Bcast, the data variable will be filled in with the data from the
root process.
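
A typical use is broadcasting a configuration parameter read by the root process (a sketch, not from the original text; my_rank is assumed to have been obtained with MPI_Comm_rank).

int niterations = 0;
if (my_rank == 0)
    niterations = 1000;        /* e.g. parsed from user input on the root */

/* Every rank, including the root, calls MPI_Bcast with the same arguments. */
MPI_Bcast(&niterations, 1, MPI_INT, 0, MPI_COMM_WORLD);

/* After the call, niterations holds the root's value on every rank. */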


 5.5 Part A : Short Answered Questions [2 Marks Each]


Q.1 What are common parallel execution patterns ?
 Answer : Parallel patterns are recurring structures of computation that can run in parallel. Some common parallel patterns are :

1. Loop-based patterns -Loops are familiar iterative programming constructs those vary
primarily in terms of entry and exit conditions (for, do-while, while), and whether they
create dependencies between loop iterations or not. Loop-based iteration is one of the
easiest patterns to parallelize. With inter-loop dependencies removed, it’s then simply a
matter of deciding how to split, or partition, the work between the available processors.
This should be done with a view to minimizing communication between processors and
maximizing the use of on-chip resources (registers and shared memory on a GPU;
L1/L2/L3 cache on a CPU). Communication overhead typically scales badly and is often
the bottleneck in poorly designed systems.
2. The fork/join pattern is a common pattern in serial programming where there are
synchronization situations and only certain aspects of the program are parallel. The
serial code runs and at some point reaches a section where the work can be distributed to
P processors in some manner. It then “forks” or spawns N threads/processes that
perform the calculation in parallel. These threads then execute independently and finally
converge or join once all the calculations are complete. The fork/join pattern is typically
implemented with static partitioning of the data. That is, the serial code will launch N
threads and divide the dataset equally between the N threads. If each packet of data
takes the same time to process, then this works well. However, as the overall time to
execute is the time of the slowest thread, giving one thread too much work means it
becomes the single factor determining the total time.
Q.2 What are GPU streams ?
 Answer : GPU memory has a number of usage restrictions and is accessible only via the
abstractions of a graphics programming interface. Each of these abstractions can be thought of
as a different stream type with its own set of access rules. The three types of streams visible to
the GPU programmer are vertex streams, frame-buffer streams, and texture streams. A fourth
stream type, fragment streams, is produced and consumed entirely within the GPU.
 Vertex streams are specified as vertex buffers via the graphics API. These streams hold
vertex positions and a variety of per-vertex attributes. These attributes have traditionally
been used for texture coordinates, colors, normals, and so on, but they can be used for
arbitrary input stream data for vertex programs.


 Fragment streams are generated by the rasterizer and consumed by the fragment processor.
They are the stream inputs to fragment programs, but they are not directly accessible to
programmers because they are created and consumed entirely within the graphics
processor.
 Frame-buffer streams are written by the fragment processor. They have traditionally been
used to hold pixels for display to the screen. Streaming GPU computation, however, uses
frame buffers to hold the results of intermediate computation stages.
 Textures are the only GPU memory that is randomly accessible by fragment programs and,
for Vertex Shader 3.0 GPUs, vertex programs. If programmers need to randomly index into
a vertex, fragment, or frame-buffer stream, they must first convert it to a texture.
Q.3 What is convolution ?
 Answer : The convolution operation is a mathematical operation which depicts a rule of
how to combine two functions or pieces of information to form a third function. The feature
map (or input data) and the kernel are combined to form a transformed feature map. The
convolution algorithm is often interpreted as a filter, where the kernel filters the feature map
for certain information. A kernel, for example, might filter for edges and discard other
information. The inverse of the convolution operation is called deconvolution.
 In image processing, convolution is a commonly used algorithm that modifies the value of
each pixel in an image by using information from neighboring pixels. A convolution kernel,
or filter, describes how each pixel will be influenced by its neighbors. For example, a
blurring filter will take the weighted average of neighboring pixels so that large differences
between pixel values are reduced. By using the same source image and changing only the
filter, one can produce effects such as sharpening, blurring, edge enhancing, and
embossing.
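A minimal CUDA kernel for such a filter is sketched below; the 3 x 3 filter size, the zero
padding at the image borders and all names are illustrative assumptions, not code from the text.

__global__ void convolve2D(const float *image, const float *filter,
                           float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // output pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // output pixel row
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    // Each output pixel is a weighted sum of its 3 x 3 neighbourhood.
    for (int fy = -1; fy <= 1; fy++) {
        for (int fx = -1; fx <= 1; fx++) {
            int ix = x + fx;
            int iy = y + fy;
            // Neighbours outside the image are treated as zero (zero padding).
            if (ix >= 0 && ix < width && iy >= 0 && iy < height)
                sum += image[iy * width + ix] * filter[(fy + 1) * 3 + (fx + 1)];
        }
    }
    out[y * width + x] = sum;
}

Changing only the nine filter weights turns the same kernel into a blurring, sharpening,
edge-enhancing or embossing filter.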
Q.4 Explain parallel prefix sum operation.
 Answer : Parallel prefix sum (PPS) is one of the popular data-parallel algorithms. PPS is often
used in problems such as stream compaction, sorting, Eulerian tours of a graph, computation of
cumulative distribution functions, etc. The all-prefix-sums operation on an array of data is
commonly known as scan.
 A scan is an exclusive scan when each element j of the result is the sum of all elements up to
but not including j in the input array. In an inclusive scan, all elements including j are
summed. An exclusive scan can be generated from an inclusive scan by
shifting the resulting array right by one element and inserting the identity. Likewise, an


inclusive scan can be generated from an exclusive scan by shifting the resulting array left
and inserting at the end the sum of the last element of the scan and the last element of the
input array (Blelloch 1990).
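The following is a minimal single-block CUDA sketch that computes an inclusive scan
(Hillis-Steele style) and then derives the exclusive scan by the shift described above. The
fixed size of eight elements, the sample data and all names are illustrative assumptions.

#include <cuda_runtime.h>
#include <stdio.h>

#define N 8   // one block of N threads; scanning large arrays needs additional passes

__global__ void scan_block(const int *in, int *inclusive, int *exclusive)
{
    __shared__ int temp[N];
    int tid = threadIdx.x;

    temp[tid] = in[tid];
    __syncthreads();

    // Hillis-Steele inclusive scan : at each step add the value 'offset' positions to the left.
    for (int offset = 1; offset < N; offset *= 2) {
        int val = (tid >= offset) ? temp[tid - offset] : 0;
        __syncthreads();
        temp[tid] += val;
        __syncthreads();
    }

    inclusive[tid] = temp[tid];
    exclusive[tid] = (tid == 0) ? 0 : temp[tid - 1];   // shift right and insert the identity (0)
}

int main(void)
{
    int h_in[N] = {3, 1, 7, 0, 4, 1, 6, 3};
    int h_inc[N], h_exc[N];
    int *d_in, *d_inc, *d_exc;

    cudaMalloc(&d_in,  N * sizeof(int));
    cudaMalloc(&d_inc, N * sizeof(int));
    cudaMalloc(&d_exc, N * sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    scan_block<<<1, N>>>(d_in, d_inc, d_exc);

    cudaMemcpy(h_inc, d_inc, sizeof(h_inc), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_exc, d_exc, sizeof(h_exc), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
        printf("%d : inclusive = %d, exclusive = %d\n", i, h_inc[i], h_exc[i]);

    cudaFree(d_in); cudaFree(d_inc); cudaFree(d_exc);
    return 0;
}

For the input {3, 1, 7, 0, 4, 1, 6, 3} this prints the inclusive scan {3, 4, 11, 11, 15, 16,
22, 25} and the exclusive scan {0, 3, 4, 11, 11, 15, 16, 22}.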
Q.5 Explain the term MPI ?
 Answer : MPI [Gropp1999] is the message passing interface for computing clusters : a set of
API functions for communication between processes running in a computing cluster.
MPI assumes a distributed memory model where processes exchange information by sending
messages to each other.
 When an application uses API communication functions, it does not need to deal with the
details of the interconnect network. The MPI implementation allows the processes to
address each other using logical numbers, much the same way as using phone numbers in a
telephone system - telephone users can dial each other using phone numbers without
knowing exactly where the called person is and how the call is routed.
 The Message Passing Interface is a standard API for communicating data via messages between
distributed processes; it is commonly used in HPC to build applications that can scale to
multi-node computer clusters. MPI is fully compatible with CUDA, which is designed for parallel
computing on a single computer or node, and there are many reasons for wanting to combine the
two parallel programming approaches of MPI and CUDA. A minimal sketch of such a combination
is given below.
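The following is a minimal sketch of combining MPI with CUDA, compiled, for example, with nvcc
together with an MPI library. It assumes exactly two ranks and a non-CUDA-aware MPI
implementation, so device data is staged through host memory; all names and sizes are
assumptions.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define N 1024

__global__ void fill(float *buf, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) buf[i] = value;          // stand-in for real GPU work on rank 0
}

int main(int argc, char **argv)
{
    int rank;
    float h_buf[N];
    float *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc(&d_buf, N * sizeof(float));

    if (rank == 0) {
        // Rank 0 computes on its GPU, copies the result to the host and sends it.
        fill<<<(N + 255) / 256, 256>>>(d_buf, 42.0f);
        cudaMemcpy(h_buf, d_buf, sizeof(h_buf), cudaMemcpyDeviceToHost);
        MPI_Send(h_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // Rank 1 receives the message and uploads the data to its own GPU.
        MPI_Recv(h_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(d_buf, h_buf, sizeof(h_buf), cudaMemcpyHostToDevice);
        printf("rank 1 received %.1f ...\n", h_buf[0]);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

A CUDA-aware MPI implementation would allow the device pointer to be passed to MPI_Send /
MPI_Recv directly, removing the explicit staging copies.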

 5.6 Part B : Long Answered Questions


Q.1 What are task parallelism and data parallelism ? (Refer section 5.1.2)

Q.2 Explain convolution parallel algorithm. (Refer section 5.3.1)

Q.3 In brief discuss CSR format for sparse matrix. (Refer section 5.3.3)

Q.4 Explain sparse matrix multiplication algorithm in detail. (Refer section 5.3.3)

Q.5 Explain MPI communication on GPU. (Refer section 5.4)






SOLVED MODEL QUESTION PAPER
(As Per New Syllabus)
GPU Architecture and Programming
Semester – VIII (CSE) Professional Elective-V

Time : Three Hours] [Maximum Marks : 100


Answer All Questions

Part - A (10 × 2 = 20 Marks)


Q.1 List out different scheduling policies.
 Ans. : There are 4 scheduling policies :
1) Round Robin (RR) : Warp instructions are fetched in round robin manner.
2) Least Recently Fetched (LRF) : The warp for which an instruction has not been fetched for the
longest time gets priority in the fetching of an instruction.
3) Fair (FAIR) : The scheduler fetches an instruction for the warp for which the minimum number
of instructions have been fetched.
4) Thread block based : The scheduler allocates more resources to the warp that shall take the
longest time to execute.
Q.2 What are threads and threadblocks?
 Ans. : The thread is an abstract entity that represents the execution of the kernel. A
“Thread Block” is a programming abstraction which represents a group of “Threads”. Threads
in the same threadblock can communicate with each other.
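A small illustrative kernel is sketched below; the block size of 256, the assumption that the
input length is a multiple of the block size, and all names are not from the text. It shows how
a thread identifies itself and how the threads of one thread block communicate through shared
memory.

__global__ void block_sum(const int *in, int *block_totals)
{
    __shared__ int partial[256];                       // visible only to this thread block
    int tid = threadIdx.x;                             // index of the thread within its block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index

    partial[tid] = in[gid];
    __syncthreads();                                   // wait for every thread of the block

    // Tree reduction : threads of the same block exchange data via shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        block_totals[blockIdx.x] = partial[0];         // one result per thread block
}

// Possible launch : block_sum<<<numBlocks, 256>>>(d_in, d_block_totals);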
Q.3 Explain CUDA design goals
 Ans. : Following are some of the CUDA design goals :
a) Enable a straightforward implementation of parallel algorithms.
b) Allow programmers to focus on the task of parallelization of the algorithms rather than
spending time on their implementation.
c) Support heterogeneous computation where applications can use both the CPU and GPU.
d) Enable running serial portions of applications on the CPU, and parallel portions on the GPU.
Q.4 How is the performance problem identified?
 Ans. : Using a profiler, the developer can identify hotspots and compile a list of
candidates for parallelization. Hotspots are the function or functions in which the application is
spending most of its execution time. CUDA toolkit provides several tools/solutions which can
be used by developers in application performance profiling and identifying hotspots.
Q.5 What is RACE hazard?
 Ans. : A race hazard occurs when sections of the program “race” toward a critical point,
such as a memory read/write. Sometimes warp 0 may win the race and the result is correct.
Other times warp 0 might get delayed and another warp hits the critical section first, producing the
wrong answer. The major problem with race hazards is they do not always occur. This makes
debugging them and trying to place a breakpoint on the error difficult. The second feature of
race hazards is they are extremely sensitive to timing disturbances. Thus, adding a breakpoint
and single-stepping the code always delays the thread being observed. This delay often
changes the scheduling pattern of other warps, meaning the particular conditions of the wrong
answer may never occur.
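As an illustration, the sketch below (all names are assumptions) shows a typical race hazard,
where the final result depends on the order in which warps reach the update, and one possible
fix using an atomic operation.

__global__ void racy_count(int *counter, const int *flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        (*counter)++;            // unsynchronized read-modify-write : updates can be lost
}

__global__ void safe_count(int *counter, const int *flags, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        atomicAdd(counter, 1);   // the hardware serializes the update, so no count is lost
}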
Q.6 Describe divide and conquer error finding strategy.
 Ans. : The divide-and-conquer approach is a common approach for debugging and is not
GPU specific. If you have thousands of lines of code in which the bug might be hiding, going
through each one line by line would take too long. What you need to do is divide your code in
half and see if the bug is in the first half or the second. Once you’ve identified which half
contains the problem, repeat the process until you narrow down to exactly where the problem
lies. This approach is useful where your kernel is causing some exception that is not handled
by the runtime.
Q.7 What is kernel?
 Ans. : A kernel is a small unit of execution that performs a clearly defined function and
that can be executed in parallel. Such a kernel can be executed on each element of an input
stream (also termed as NDRange) or simply at each point in an arbitrary index space. A kernel
is analogous and, on some devices identical, to what graphics programmers call a shader
program.
Q.8 What is pipe memory object?
 Ans. : The pipe memory object conceptually is an ordered sequence of data items. A pipe
has two endpoints: a write endpoint into which data items are inserted, and a read endpoint
from which data items are removed. At any one time, only one kernel instance may write into a
pipe, and only one kernel instance may read from a pipe. To support the producer-consumer
design pattern, one kernel instance connects to the write endpoint (the producer) while another
kernel instance connects to the read endpoint (the consumer).
A pipe is a memory object that stores data organized as a FIFO. Pipe objects can only be
accessed using built-in functions that read from and write to a pipe. Pipe objects are not
accessible from the host. A pipe object encapsulates the following information:
 Packet size in bytes
 Maximum capacity in packets
 Information about the number of packets currently in the pipe
 Data packets

Q.9 What is convolution?


 Ans. : The convolution operation is a mathematical operation which depicts a rule of how
to combine two functions or pieces of information to form a third function. The feature map
(or input data) and the kernel are combined to form a transformed feature map. The
convolution algorithm is often interpreted as a filter, where the kernel filters the feature map
for certain information. A kernel, for example, might filter for edges and discard other
information. The inverse of the convolution operation is called deconvolution.
In image processing, convolution is a commonly used algorithm that modifies the value of
each pixel in an image by using information from neighboring pixels. A convolution kernel, or
filter, describes how each pixel will be influenced by its neighbors. For example, a blurring
filter will take the weighted average of neighboring pixels so that large differences between
pixel values are reduced. By using the same source image and changing only the filter, one can
produce effects such as sharpening, blurring, edge enhancing, and embossing.
Q.10 Explain the term MPI?
 Ans. : MPI [Gropp1999] is the message passing interface for computing clusters : a set of API
functions for communication between processes running in a computing cluster.
MPI assumes a distributed memory model where processes exchange information by sending
messages to each other.
When an application uses API communication functions, it does not need to deal with the
details of the interconnect network. The MPI implementation allows the processes to address
each other using logical numbers, much the same way as using phone numbers in a telephone
system - telephone users can dial each other using phone numbers without knowing exactly
where the called person is and how the call is routed.
The Message Passing Interface is a standard API for communicating data via messages between
distributed processes; it is commonly used in HPC to build applications that can scale to
multi-node computer clusters. MPI is fully compatible with CUDA, which is designed for parallel
computing on a single computer or node, and there are many reasons for wanting to combine the
two parallel programming approaches of MPI and CUDA.

Part - B (5 × 13 = 65 Marks)
Q.11 (a) (i) What are differences between GPU and CPU? (Refer section 1.1) [5]

(ii) Explain various types of memory to be dealt with in CUDA.


(Refer section 1.5) [8]


OR
(b) (i) Describe CUDA hardware in detail. (Refer section 1.4) [8]
(ii) Explain parallelism with GPU? (Refer section 1.3) [5]
Q.12 (a) (i) How is optimization carried out for a CUDA application?
(Refer section 2.4) [9]
(ii) Explain multi GPU solutions. (Refer section 2.3) [4]
OR
(b) (i) Explain the structure of multi GPU? (Refer section 2.4) [9]
(ii) Explain resource contention. (Refer section 2.4) [4]
Q.13 (a) (i) Describe parallel programming issues. (Refer section 3.3) [9]
(ii) What are tools and techniques that you should employ for finding and solving
CUDA errors? (Refer section 3.2) [4]
OR
(b) (i) How to find and avoid errors in CUDA? (Refer section 2.6) [8]
(ii) What are synchronization issues? (Refer section 1.4) [5]
Q.14 (a) (i) Explain kernel programming model. (Refer section 4.3.3) [6]
(ii) Explain memory object ‘Buffer’. (Refer section 4.1.1) [7]
OR
(b) (i) Explain features of OpenCL in detail. (Refer section 4.1) [8]
(ii) In brief explain OpenCL memory hierarchy. (Refer section 4.3.4) [5]
Q.15 (a) (i) In brief discuss CSR format for sparse matrix. (Refer section 5.3.3) [9]
(ii) Explain the working of the prefix sum algorithm. (Refer section 5.3.3) [4]
OR

(b) (i) Explain MPI communication on GPU. (Refer section 5.4) [9]
(ii) What are heterogeneous clusters? (Refer section 5.4.1) [4]
Part - C (1 × 15 = 15 Marks)
Q.16 (a) (i) What are the components of OpenCL? (Refer section 4.1.4) [10]
(ii) Brief about MPI point to point communication. (Refer section 5.4.2) [5]
OR
(b) (i) Explain OpenCL architecture? (Refer section 4.3) [10]
(ii) What are task parallelism and data parallelism? (Refer section 5.1.2) [5]

