
MODULE-III
Getting Started with CUDA: Installation, Driver, SDK, Toolkit
 CUDA® is a parallel computing platform and programming model invented by NVIDIA.
 It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).
 CUDA was developed with several design goals in mind:
 Provide a small set of extensions to standard programming languages, like C, that enable a straightforward implementation of parallel algorithms. With CUDA C/C++, programmers can focus on parallelizing their algorithms rather than spending time on low-level implementation details.
 Support heterogeneous computation, where applications use both the CPU and GPU. Serial portions of applications run on the CPU, and parallel portions are offloaded to the GPU. As such, CUDA can be applied incrementally to existing applications.
 The CPU and GPU are treated as separate devices that have their own memory spaces. This configuration also allows simultaneous computation on the CPU and GPU without contention for memory resources.
 CUDA-capable GPUs have hundreds of cores that can collectively run thousands of computing threads. These cores have shared resources, including a register file and a shared memory. The on-chip shared memory allows parallel tasks running on these cores to share data without sending it over the system memory bus.
This guide will show you how to install and check the correct operation
of the CUDA development tools.
1.1. System Requirements
To use CUDA on your system, you will need the following installed:
‣ CUDA-capable GPU
‣ A supported version of Linux with a gcc compiler and toolchain
‣ NVIDIA CUDA Toolkit (available at http://developer.nvidia.com/cuda-downloads)
The CUDA development environment relies on tight integration with the
host development environment, including the host compiler and C
runtime libraries.

Ubuntu
1. Perform the pre-installation actions.
2. Install the repository metadata. When using a proxy server with aptitude, ensure that wget is set up to use the same proxy settings before installing the cuda-repo package.
$ sudo dpkg -i cuda-repo-<distro>_<version>_<architecture>.deb
3. Update the Apt repository cache:
$ sudo apt-get update
4. Install CUDA:
$ sudo apt-get install cuda
5. Perform the post-installation actions.
POST-INSTALLATION ACTIONS
Some actions must be taken after installing the CUDA Toolkit and driver before they can be used completely:
‣ Set up environment variables.
‣ Install a writable copy of the CUDA Samples.
‣ Verify the installation.
6.1. Environment Setup
The PATH variable needs to include /usr/local/cuda-6.5/bin.
The LD_LIBRARY_PATH variable needs to contain /usr/local/cuda-6.5/lib64 on a 64-bit system, and /usr/local/cuda-6.5/lib on a 32-bit ARM system.
‣ To change the environment variables for 64-bit operating systems:
$ export PATH=/usr/local/cuda-6.5/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64:$LD_LIBRARY_PATH
‣ To change the environment variables for 32-bit ARM operating systems:
$ export PATH=/usr/local/cuda-6.5/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib:$LD_LIBRARY_PATH
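These export commands only affect the current shell. A common convenience (an assumption, not part of the official installation steps) is to append them to the shell startup file so they persist across logins:
$ echo 'export PATH=/usr/local/cuda-6.5/bin:$PATH' >> ~/.bashrc
$ echo 'export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
$ source ~/.bashrc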

6.2. (Optional) Install Writable Samples
In order to modify, compile, and run the samples, the samples must be installed with write permissions. A convenience installation script is provided:
$ cuda-install-samples-6.5.sh <dir>
This script is installed with the cuda-samples-6-5 package. The cuda-samples-6-5 package installs only a read-only copy in /usr/local/cuda-6.5/samples.
6.3. Verify the Installation
Before continuing, it is important to verify that the CUDA toolkit can find
and communicate correctly with the CUDA-capable hardware. To do this,
you need to compile and run some of the included sample programs.
Ensure the PATH and LD_LIBRARY_PATH variables are set correctly.
6.3.1. Verify the Driver Version
If you installed the driver, verify that the correct version of it is installed.
This can be done through your System Properties (or equivalent) or by
executing the command
$ cat /proc/driver/nvidia/version
Note that this command will not work on an iGPU/dGPU system.
6.3.2. Compiling the Examples
The version of the CUDA Toolkit can be checked by running nvcc -V in a terminal window. The nvcc command runs the compiler driver that compiles CUDA programs. It calls the gcc compiler for C code and the NVIDIA PTX compiler for the CUDA code.
The NVIDIA CUDA Toolkit includes sample programs in source form. You should compile them by changing to ~/NVIDIA_CUDA-6.5_Samples and typing make. The resulting binaries will be placed under ~/NVIDIA_CUDA-6.5_Samples/bin.
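For example, the deviceQuery sample can be built and run to confirm that the toolkit can see the GPU (the binary path below assumes the default samples layout for this toolkit version):
$ cd ~/NVIDIA_CUDA-6.5_Samples
$ make
$ ./bin/x86_64/linux/release/deviceQuery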
3.8. Additional Package Manager Capabilities
The recommended installation package is the cuda package. This package will install the full set of other CUDA packages required for native development and should cover most scenarios.
The cuda package installs all the packages available for native development, including the compiler, the debugger, the profiler, and the math libraries. For x86_64 platforms, this also includes Nsight Eclipse Edition and the Visual Profiler. It also includes the NVIDIA driver package.
On supported platforms, the cuda-cross-armhf package installs all the packages required for cross-platform development on ARMv7. The libraries and header files of the ARMv7 display driver package are also installed to enable cross compilation of ARMv7 applications. The cuda-cross-armhf package does not install the native display driver.
The packages installed by the packages above can also be installed individually by specifying their names explicitly. The list of available packages can be obtained with:
$ cat /var/lib/apt/lists/*cuda*Packages | grep "Package:" # Ubuntu
3.8.2. Package Upgrades
The cuda package points to the latest stable release of the CUDA Toolkit. When a new version is available, use the following command to upgrade the toolkit and driver:
$ sudo apt-get install cuda # Ubuntu
The cuda-cross-armhf package can also be upgraded in the same manner.
The cuda-drivers package points to the latest driver release available in the CUDA repository. When a new version is available, use the following command to upgrade the driver:
$ sudo apt-get install cuda-drivers # Ubuntu

SDK Installation
Eclipse IDE (Nsight Eclipse Edition) serves as the software development kit for CUDA programs. Open a terminal and run the command nsight. If Eclipse is already installed, this opens it; otherwise the shell suggests the command to install it, such as:
$ sudo apt install nvidia-nsight
After authentication, the package manager downloads the files from the configured source and installs the Eclipse-based IDE.
This post is the first in a series on CUDA C and C++, which is the C/C++ interface to the CUDA parallel computing platform. This series of posts assumes familiarity with programming in C. We will be running a parallel series of posts about CUDA Fortran targeted at Fortran programmers. These two series will cover the basic concepts of parallel computing on the CUDA platform. From here on, unless I state otherwise, I will use the term "CUDA C" as shorthand for "CUDA C and C++". CUDA C is essentially C/C++ with a few extensions that allow one to execute functions on the GPU using many threads in parallel.
CUDA Programming Model Basics
Before we jump into CUDA C code, those new to CUDA will benefit from a basic description of the CUDA programming model and some of the terminology used.
The CUDA programming model is a heterogeneous model in which both the CPU and GPU are used. In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host can manage memory on both the host and device, and also launches kernels, which are functions executed on the device. These kernels are executed by many GPU threads in parallel.

Given the heterogeneous nature of the CUDA programming model, a typical sequence of operations for a CUDA C program is:
 Declare and allocate host and device memory.
 Initialize host data.
 Transfer data from the host to the device.
 Execute one or more kernels.
 Transfer results from the device to the host.
Keeping this sequence of operations in mind, let's look at a CUDA C example.

A First CUDA C Program
In a recent post, I illustrated Six Ways to SAXPY, which includes a CUDA C version. SAXPY stands for "Single-precision A*X Plus Y", and is a good "hello world" example for parallel computation. In this post I will dissect a more complete version of the CUDA C SAXPY, explaining in detail what is done and why. The complete SAXPY code is:
Basic Programming Concepts
#include <stdio.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}
The function saxpy is the kernel that runs in parallel on the GPU, and the
main function is the host code. Let’s begin our discussion of this program
with the host code.

Host Code
The main function declares two pairs of arrays.

float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));

The pointers x and y point to the host arrays, allocated with malloc in the typical fashion, and the d_x and d_y arrays point to device arrays allocated with the cudaMalloc function from the CUDA runtime API. The host and device in CUDA have separate memory spaces, both of which can be managed from host code (CUDA C kernels can also allocate device memory on devices that support it).

The host code then initializes the host arrays. Here we set x to an array of ones, and y to an array of twos.

for (int i = 0; i < N; i++) {
  x[i] = 1.0f;
  y[i] = 2.0f;
}

To initialize the device arrays, we simply copy the data from x and y to the corresponding device arrays d_x and d_y using cudaMemcpy, which works just like the standard C memcpy function, except that it takes a fourth argument which specifies the direction of the copy. In this case we use cudaMemcpyHostToDevice to specify that the first (destination) argument is a device pointer and the second (source) argument is a host pointer.

cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

After running the kernel, to get the results back to the host, we copy from the device array pointed to by d_y to the host array pointed to by y using cudaMemcpy with cudaMemcpyDeviceToHost.

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);


Launching a Kernel
The saxpy kernel is launched by the statement:

saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

The information between the triple chevrons is the execution configuration, which dictates how many device threads execute the kernel in parallel. In CUDA there is a hierarchy of threads in software which mimics how thread processors are grouped on the GPU. In the CUDA programming model we speak of launching a kernel with a grid of thread blocks. The first argument in the execution configuration specifies the number of thread blocks in the grid, and the second specifies the number of threads in a thread block.

Thread blocks and grids can be made one-, two- or three-dimensional by passing dim3 (a simple struct defined by CUDA with x, y, and z members) values for these arguments, but for this simple example we only need one dimension so we pass integers instead. In this case we launch the kernel with thread blocks containing 256 threads, and use integer arithmetic to determine the number of thread blocks required to process all N elements of the arrays ((N+255)/256).
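As an illustration only (matAdd2D, width, height, and the d_a/d_b/d_c device arrays are hypothetical names, not part of the SAXPY example), a two-dimensional configuration could look like this:

__global__ void matAdd2D(int width, int height, const float *a, const float *b, float *c)
{
  int col = blockIdx.x*blockDim.x + threadIdx.x;
  int row = blockIdx.y*blockDim.y + threadIdx.y;
  if (col < width && row < height)                 // bounds check in both dimensions
    c[row*width + col] = a[row*width + col] + b[row*width + col];
}

// 16 x 16 = 256 threads per block; enough blocks to cover the whole width x height matrix
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((width + 15)/16, (height + 15)/16);
matAdd2D<<<numBlocks, threadsPerBlock>>>(width, height, d_a, d_b, d_c);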

For cases where the number of elements in the arrays is not evenly
divisible by the thread block size, the kernel code must check for out-of-
bounds memory accesses.

Cleaning Up
After we are finished, we should free any allocated memory. For device memory allocated with cudaMalloc(), simply call cudaFree(). For host memory, use free() as usual.

cudaFree(d_x);
cudaFree(d_y);
free(x);
free(y);

Device Code
We now move on to the kernel code.

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
In CUDA, we define kernels such as saxpy using the __global__
declaration specifier. Variables defined within device code do not need to
be specified as device variables because they are assumed to reside on the
device. In this case the n, a and i variables will be stored by each thread
in a register, and the pointers x and y must be pointers to the device
memory address space. This is indeed true because we passed d_x and
d_y to the kernel when we launched it from the host code. The first two
arguments, n and a, however, were not explicitly transferred to the device
in host code. Because function arguments are passed by value by default
in C/C++, the CUDA runtime can automatically handle the transfer of
these values to the device. This feature of the CUDA Runtime API makes
launching kernels on the GPU very natural and easy—it is almost the same
as calling a C function.

There are only two lines in our saxpy kernel. As mentioned earlier, the
kernel is executed by multiple threads in parallel. If we want each thread
to process an element of the resultant array, then we need a means of
distinguishing and identifying each thread. CUDA defines the variables
blockDim, blockIdx, and threadIdx. These predefined variables are of
type dim3, analogous to the execution configuration parameters in host
code. The predefined variable blockDim contains the dimensions of each
thread block as specified in the second execution configuration parameter
for the kernel launch. The predefined variables threadIdx and blockIdx
contain the index of the thread within its thread block and the thread block
within the grid, respectively. The expression:

int i = blockDim.x * blockIdx.x + threadIdx.x


generates a global index that is used to access elements of the arrays. We
didn’t use it in this example, but there is also gridDim which contains the
dimensions of the grid as specified in the first execution configuration
parameter to the launch.

Before this index is used to access array elements, its value is checked
against the number of elements, n, to ensure there are no out-of-bounds
memory accesses. This check is required for cases where the number of
elements in an array is not evenly divisible by the thread block size, and
as a result the number of threads launched by the kernel is larger than the
array size. The second line of the kernel performs the element-wise work
of the SAXPY, and other than the bounds check, it is identical to the inner
loop of a host implementation of SAXPY.

if (i < n) y[i] = a*x[i] + y[i];
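As a side note (this variant is not part of the original example; it is a common idiom shown here only as a sketch), gridDim can be combined with the same bounds logic in a grid-stride loop, which lets a fixed-size grid process arrays of any length:

__global__ void saxpy_gridstride(int n, float a, float *x, float *y)
{
  // Each thread starts at its global index and advances by the total number of
  // threads in the grid until the entire array has been processed.
  for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < n; i += gridDim.x*blockDim.x)
    y[i] = a*x[i] + y[i];
}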

Compiling and Running the Code
The CUDA C compiler, nvcc, is part of the NVIDIA CUDA Toolkit. To compile our SAXPY example, we save the code in a file with a .cu extension, say saxpy.cu. We can then compile it with nvcc:

nvcc -o saxpy saxpy.cu

We can then run the code:

% ./saxpy
Max error: 0.000000
Summary and Conclusions
With this walkthrough of a simple CUDA C implementation of SAXPY, you now know the basics of programming CUDA C. There are only a few extensions to C required to "port" a C code to CUDA C: the __global__ declaration specifier for device kernel functions; the execution configuration used when launching a kernel; and the built-in device variables blockDim, blockIdx, and threadIdx used to identify and differentiate GPU threads that execute the kernel in parallel.
One advantage of the heterogeneous CUDA programming model is that porting an existing code from C to CUDA C can be done incrementally, one kernel at a time.

Modes of Parallel Programming
Parallel Processing and Data Transfer Modes in a Computer System
Instead of processing each instruction sequentially, a parallel processing system provides concurrent data processing to reduce the overall execution time.
In such a system there may be two or more ALUs, so that two or more instructions can be executed at the same time. The purpose of parallel processing is to speed up the computer's processing capability and increase its throughput.
NOTE: Throughput is the number of instructions that can be executed in a unit of time.
Parallel processing can be viewed from various levels of complexity. At the lowest level, we distinguish between parallel and serial operations by the type of registers used. At a higher level of complexity, parallel processing can be achieved by using multiple functional units that perform many operations simultaneously.

Data Transfer Modes of a Computer System
According to the data transfer mode, computers can be divided into 4 major groups:

1. SISD
2. SIMD
3. MISD
4. MIMD

SISD (Single Instruction Stream, Single Data Stream)
It represents the organization of a single computer containing a control unit, a processor unit and a memory unit. Instructions are executed sequentially. Limited parallelism can be achieved by pipelining or by using multiple functional units.

SIMD (Single Instruction Stream, Multiple Data Stream)
It represents an organization that includes multiple processing units under the control of a common control unit. All processors receive the same instruction from the control unit but operate on different parts of the data. They are highly specialized computers. They are basically used for numerical problems that are expressed in the form of vectors or matrices, but they are not suitable for other types of computations.

MISD (Multiple Instruction Stream, Single Data Stream)
It consists of a single computer containing multiple processors connected with multiple control units and a common memory unit. It is capable of processing several instructions over a single data stream simultaneously. The MISD structure is only of theoretical interest, since no practical system has been constructed using this organization.

MIMD (Multiple Instruction Stream, Multiple Data Stream)
It represents an organization which is capable of processing several programs at the same time. It is the organization of a single computer containing multiple processors connected with multiple control units and a shared memory unit. The shared memory unit contains multiple modules to communicate with all processors simultaneously. Multiprocessors and multicomputers are examples of MIMD. It fulfills the demand of large-scale computations.

CUDA Programming Model
(The original notes present this topic through a series of diagrams: GPU Programming Model, GPGPU Programming Model, and CUDA Programming Model.)

KERNEL
A kernel is the central part of an operating system. It manages the operations of the computer and the hardware, most notably memory and CPU time. There are two types of kernels: a microkernel, which contains only basic functionality, and a monolithic kernel, which contains many device drivers.

Calling Kernel on Device
Splitting the Kernel. In a Unix system, several concurrent processes attend to different tasks. Each process asks for system resources, be it computing power, memory, network connectivity, or some other resource. The kernel is the big chunk of executable code in charge of handling all such requests. Although the distinction between the different kernel tasks isn't always clearly marked, the kernel's role can be split (as shown in Figure 1-1) into the following parts:

Process management
The kernel is in charge of creating and destroying processes and handling
their connection to the outside world (input and output). Communication
among different processes (through signals, pipes, or interprocess
communication primitives) is basic to the overall system functionality and
is also handled by the kernel. In addition, the scheduler, which controls
how processes share the CPU, is part of process management. More
generally, the kernel’s process management activity implements the
abstraction of several processes on top of a single CPU or a few of them.

Memory Management
The computer's memory is a major resource, and the policy used to deal with it is critical for system performance. The kernel builds up a virtual addressing space for any and all processes on top of the limited available resources. The different parts of the kernel interact with the memory-management subsystem through a set of function calls, ranging from the simple malloc/free pair to much more complex functionalities.

Filesystems
Unix is heavily based on the filesystem concept; almost everything in
Unix can be treated as a file. The kernel builds a structured filesystem on
top of unstructured hardware, and the resulting file abstraction is heavily
used throughout the whole system. In addition, Linux supports multiple
filesystem types, that is, different ways of organizing data on the physical
medium. For example, disks may be formatted with the Linux-standard
ext3 filesystem, the commonly used FAT filesystem or several others.

Device control
Almost every system operation eventually maps to a physical device.
With the exception of the processor, memory, and a very few other
entities, any and all device control operations are performed by code that
is specific to the device being addressed. That code is called a device
driver. The kernel must have embedded in it a device driver for every
peripheral present on a system, from the hard drive to the keyboard and
the tape drive. This aspect of the kernel’s functions is our primary interest
in this book.

Networking
Networking must be managed by the operating system, because most network operations are not specific to a process: incoming packets are asynchronous events. The packets must be collected, identified, and dispatched before a process takes care of them. The system is in charge of delivering data packets across program and network interfaces, and it must control the execution of programs according to their network activity. Additionally, all the routing and address resolution issues are implemented within the kernel.

Loadable Modules
One of the good features of Linux is the ability to extend at runtime the
set of features offered by the kernel. This means that you can add
functionality to the kernel (and remove functionality as well) while the
system is up and running. Each piece of code that can be added to the
kernel at runtime is called a module. The Linux kernel offers support for
quite a few different types (or classes) of modules, including, but not
limited to, device drivers. Each module is made up of object code (not
linked into a complete executable) that can be dynamically linked to the
running kernel by the insmod program and can be unlinked by the rmmod
program. Figure 1-1 identifies different classes of modules in charge of
specific tasks—a module is said to belong to a specific class according to
the functionality it offers. The placement of modules in Figure 1-1 covers
the most important classes, but is far from complete because more and
more functionality in Linux is being modularized.
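As a minimal illustration (the module file hello.ko is a hypothetical example), loading and unloading a module from the command line looks like this:
$ sudo insmod ./hello.ko
$ lsmod | grep hello
$ sudo rmmod hello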

Classes of Devices and Modules
The Linux way of looking at devices distinguishes between three fundamental device types. Each module usually implements one of these types, and thus is classifiable as a char module, a block module, or a network module. This division of modules into different types, or classes, is not a rigid one; the programmer can choose to build huge modules implementing different drivers in a single chunk of code. Good programmers, nonetheless, usually create a different module for each new functionality they implement, because decomposition is a key element of scalability and extendability.

The three classes are:

Character devices
A character (char) device is one that can be accessed as a stream of bytes (like a file); a char driver is in charge of implementing this behavior. Such a driver usually implements at least the open, close, read, and write system calls. The text console (/dev/console) and the serial ports (/dev/ttyS0 and friends) are examples of char devices, as they are well represented by the stream abstraction. Char devices are accessed by means of filesystem nodes, such as /dev/tty1 and /dev/lp0. The only relevant difference between a char device and a regular file is that you can always move back and forth in the regular file, whereas most char devices are just data channels, which you can only access sequentially. There exist, nonetheless, char devices that look like data areas, and you can move back and forth in them; for instance, this usually applies to frame grabbers, where the applications can access the whole acquired image using mmap or lseek.

Block devices
Like char devices, block devices are accessed by filesystem nodes in the /dev directory. A block device is a device (e.g., a disk) that can host a filesystem. In most Unix systems, a block device can only handle I/O operations that transfer one or more whole blocks, which are usually 512 bytes (or a larger power of two) in length. Linux, instead, allows the application to read and write a block device like a char device—it permits the transfer of any number of bytes at a time. As a result, block and char devices differ only in the way data is managed internally by the kernel, and thus in the kernel/driver software interface. Like a char device, each block device is accessed through a filesystem node, and the difference between them is transparent to the user. Block drivers have a completely different interface to the kernel than char drivers.

Network interfaces
Any network transaction is made through an interface, that is, a device that is able to exchange data with other hosts. Usually, an interface is a hardware device, but it might also be a pure software device, like the loopback interface. A network interface is in charge of sending and receiving data packets, driven by the network subsystem of the kernel, without knowing how individual transactions map to the actual packets being transmitted. Many network connections (especially those using TCP) are stream-oriented, but network devices are, usually, designed around the transmission and receipt of packets. A network driver knows nothing about individual connections; it only handles packets.
Not being a stream-oriented device, a network interface isn't easily mapped to a node in the filesystem, as /dev/tty1 is. The Unix way to provide access to interfaces is still by assigning a unique name to them (such as eth0), but that name doesn't have a corresponding entry in the filesystem. Communication between the kernel and a network device driver is completely different from that used with char and block drivers. Instead of read and write, the kernel calls functions related to packet transmission.
There are other ways of classifying driver modules that are orthogonal to
the above device types. In general, some types of drivers work with
additional layers of kernel support functions for a given type of device.
For example, one can talk of universal serial bus (USB) modules, serial
modules, SCSI modules, and so on. Every USB device is driven by a USB
module that works with the USB subsystem, but the device itself shows
up in the system as a char device (a USB serial port, say), a block device
(a USB memory card reader), or a network device (a USB Ethernet
interface).
Other classes of device drivers have been added to the kernel in recent
times, including FireWire drivers and I2O drivers. In the same way that
they handled USB and SCSI drivers, kernel developers collected class-
wide features and exported them to driver implementers to avoid
duplicating work and bugs, thus simplifying and strengthening the process
of writing such drivers.

In addition to device drivers, other functionalities, both hardware and software, are modularized in the kernel. One common example is filesystems. A filesystem type determines how information is organized on a block device in order to represent a tree of directories and files. Such an entity is not a device driver, in that there's no explicit device associated with the way the information is laid down; the filesystem type is instead a software driver, because it maps the low-level data structures to high-level data structures. It is the filesystem that determines how long a filename can be and what information about each file is stored in a directory entry.
The filesystem module must implement the lowest level of the system calls that access directories and files, by mapping filenames and paths (as well as other information, such as access modes) to data structures stored in data blocks. Such an interface is completely independent of the actual data transfer to and from the disk (or other medium), which is accomplished by a block device driver. If you think of how strongly a Unix system depends on the underlying filesystem, you'll realize that such a software concept is vital to system operation. The ability to decode filesystem information stays at the lowest level of the kernel hierarchy and is of utmost importance; even if you write a block driver for your new CD-ROM, it is useless if you are not able to run ls or cp on the data it hosts. Linux supports the concept of a filesystem module, whose software interface declares the different operations that can be performed on a filesystem inode, directory, file, and superblock. It's quite unusual for a programmer to actually need to write a filesystem module, because the official kernel already includes code for the most important filesystem types.

Shared Memory Usage in CUDA
Shared memory is much faster than local and global memory. In fact, shared memory latency is roughly 100x lower than uncached global memory latency (provided that there are no bank conflicts between the threads, which we will examine later in this post). Shared memory is allocated per thread block, so all threads in the block have access to the same shared memory. Threads can access data in shared memory loaded from global memory by other threads within the same thread block. This capability (combined with thread synchronization) has a number of uses, such as user-managed data caches, high-performance cooperative parallel algorithms (parallel reductions, for example), and facilitating global memory coalescing in cases where it would otherwise not be possible.
Declare shared memory in CUDA C/C++ device code using the __shared__ variable declaration specifier. There are multiple ways to declare shared memory inside a kernel, depending on whether the amount of memory is known at compile time or at run time. The following complete code (available on GitHub) illustrates various methods of using shared memory.
#include <stdio.h>

__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[64];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}

__global__ void dynamicReverse(int *d, int n)
{
  extern __shared__ int s[];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}

int main(void)
{
  const int n = 64;
  int a[n], r[n], d[n];

  for (int i = 0; i < n; i++) {
    a[i] = i;
    r[i] = n-i-1;
    d[i] = 0;
  }

  int *d_d;
  cudaMalloc(&d_d, n * sizeof(int));

  // run version with static shared memory
  cudaMemcpy(d_d, a, n*sizeof(int), cudaMemcpyHostToDevice);
  staticReverse<<<1,n>>>(d_d, n);
  cudaMemcpy(d, d_d, n*sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++)
    if (d[i] != r[i]) printf("Error: d[%d]!=r[%d] (%d, %d)\n", i, i, d[i], r[i]);

  // run dynamic shared memory version
  cudaMemcpy(d_d, a, n*sizeof(int), cudaMemcpyHostToDevice);
  dynamicReverse<<<1,n,n*sizeof(int)>>>(d_d, n);
  cudaMemcpy(d, d_d, n * sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++)
    if (d[i] != r[i]) printf("Error: d[%d]!=r[%d] (%d, %d)\n", i, i, d[i], r[i]);
}
This code reverses the data in a 64-element array using shared
memory. The two kernels are very similar, differing only in how the
shared memory arrays are declared and how the kernels are
invoked.

Static Shared Memory
If the shared memory array size is known at compile time, as in the staticReverse kernel, then we can explicitly declare an array of that size, as we do with the array s.

__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[64];
  int t = threadIdx.x;
  int tr = n-t-1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}
In this kernel, t and tr are the two indices representing the original and reverse order, respectively. Threads copy the data from global memory to shared memory with the statement s[t] = d[t], and the reversal is done two lines later with the statement d[t] = s[tr]. But before executing this final line, in which each thread accesses data in shared memory that was written by another thread, remember that we need to make sure all threads have completed the loads to shared memory, by calling __syncthreads().
The reason shared memory is used in this example is to facilitate global memory coalescing on older CUDA devices (Compute Capability 1.1 or earlier). Optimal global memory coalescing is achieved for both reads and writes because global memory is always accessed through the linear, aligned index t. The reversed index tr is only used to access shared memory, which does not have the sequential access restrictions of global memory for optimal performance. The only performance issue with shared memory is bank conflicts, which we will discuss later. (Note that on devices of Compute Capability 1.2 or later, the memory system can fully coalesce even the reversed index stores to global memory. But this technique is still useful for other access patterns, as I'll show in the next post.)

Dynamic Shared Memory
The dynamicReverse kernel uses dynamically allocated shared memory, which can be used when the amount of shared memory is not known at compile time. In this case the shared memory allocation size per thread block must be specified (in bytes) using an optional third execution configuration parameter, as in the following excerpt.

dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);

The dynamic shared memory kernel, dynamicReverse(), declares the shared memory array using an unsized extern array syntax, extern __shared__ int s[] (note the empty brackets and use of the extern specifier). The size is implicitly determined from the third execution configuration parameter when the kernel is launched. The remainder of the kernel code is identical to the staticReverse() kernel.
What if you need multiple dynamically sized arrays in a single kernel? You must declare a single extern unsized array as before, and use pointers into it to divide it into multiple arrays, as in the following excerpt.

extern __shared__ int s[];
int   *integerData = s;                          // nI ints
float *floatData   = (float*)&integerData[nI];   // nF floats
char  *charData    = (char*)&floatData[nF];      // nC chars

In the kernel launch, specify the total shared memory needed, as in the following.

myKernel<<<gridSize, blockSize, nI*sizeof(int)+nF*sizeof(float)+nC*sizeof(char)>>>(...);

Shared memory bank conflicts
To achieve high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously. Therefore, any memory load or store of n addresses that spans b distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is b times as high as the bandwidth of a single bank.
However, if multiple threads' requested addresses map to the same memory bank, the accesses are serialized. The hardware splits a conflicting memory request into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of colliding memory requests. An exception is the case where all threads in a warp address the same shared memory address, resulting in a broadcast. Devices of compute capability 2.0 and higher have the additional ability to multicast shared memory accesses, meaning that multiple accesses to the same location by any number of threads within a warp are served simultaneously.
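As an illustrative sketch only (not taken from the original post; it assumes a device with 32 banks), the difference between a conflict-free and a conflicting access pattern can be seen in a kernel like this:

__global__ void bankAccessDemo(float *out)
{
  __shared__ float s[512];
  int t = threadIdx.x;
  s[t] = (float)t;
  __syncthreads();
  // Stride-1 read: consecutive threads in a warp hit consecutive banks (no conflict).
  float a = s[t];
  // Stride-2 read: with 32 banks, threads t and t+16 of a warp hit the same bank,
  // a two-way conflict, so the warp's request is serviced in two passes.
  float b = s[(2*t) % 512];
  out[t] = a + b;
}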
To minimize bank conflicts, it is important to understand how memory addresses map to memory banks. Shared memory banks are organized such that successive 32-bit words are assigned to successive banks and the bandwidth is 32 bits per bank per clock cycle. For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16. A shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. Note that no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads.
For devices of compute capability 2.0, the warp size is 32 threads
and the number of banks is also 32. A shared memory request for
a warp is not split as with devices of compute capability 1.x,
meaning that bank conflicts can occur between threads in the first
half of a warp and threads in the second half of the same warp.
Devices of compute capability 3.x have configurable bank size,
which can be set using cudaDeviceSetSharedMemConfig() to
either four bytes (cudaSharedMemBankSizeFourByte, the default)
or eight bytes (cudaSharedMemBankSizeEightByte). Setting
the bank size to eight bytes can help avoid shared memory bank
conflicts when accessing double precision data.
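For example, a single host-side call (error checking omitted) selects eight-byte banks before launching kernels that operate on double precision data:

// Select 8-byte shared memory banks (useful for double precision access patterns).
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);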

Configuring the amount of shared memory
On devices of compute capability 2.x and 3.x, each multiprocessor has 64KB of on-chip memory that can be partitioned between L1 cache and shared memory. For devices of compute capability 2.x, there are two settings, 48KB shared memory / 16KB L1 cache, and 16KB shared memory / 48KB L1 cache. By default the 48KB shared memory setting is used. This can be configured at run time from the host for all kernels using cudaDeviceSetCacheConfig() or on a per-kernel basis using cudaFuncSetCacheConfig(). These accept one of three options: cudaFuncCachePreferNone, cudaFuncCachePreferShared, and cudaFuncCachePreferL1. The driver will honor the specified preference except when a kernel requires more shared memory per thread block than available in the specified configuration. Devices of compute capability 3.x allow a third setting of 32KB shared memory / 32KB L1 cache, which can be obtained using the option cudaFuncCachePreferEqual.
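As a brief sketch of both calls in host code (using the saxpy kernel from earlier in these notes purely for illustration):

// Prefer the larger shared memory partition for all subsequent kernels on this device.
cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

// Or set the preference for a single kernel only.
cudaFuncSetCacheConfig(saxpy, cudaFuncCachePreferL1);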

Summary
Shared memory is a powerful feature for writing well optimized
CUDA code. Access to shared memory is much faster than global
memory access because it is located on chip. Because shared
memory is shared by threads in a thread block, it provides a
mechanism for threads to cooperate. One way to use shared
memory that leverages such thread cooperation is to enable global
memory coalescing, as demonstrated by the array reversal in this
post. By reversing the array using shared memory we are able to
have all global memory reads and writes performed with unit stride,
achieving full coalescing on any CUDA GPU.

CUDA-C On the GPU
CUDA is a parallel programming model and software environment developed by NVIDIA. It provides programmers with a set of instructions that enable GPU acceleration for data-parallel computations. The computing performance of many applications can be dramatically increased by using CUDA directly or by linking to GPU-accelerated libraries.

Setting up your environment
To link and run applications using CUDA you will need to make some changes to your path and environment. Load the appropriate version of CUDA:

module load cuda/8.0

The list of available versions of CUDA can be obtained by executing the module avail cuda command.

Compiling a simple CUDA C/C++ program
Consider the following simple CUDA program gpu_info.cu that prints out information about GPUs installed on the system:
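The original gpu_info.cu source is provided as a download; the listing below is only a minimal sketch of what such a program might look like, based on the standard cudaGetDeviceCount() and cudaGetDeviceProperties() runtime calls:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
  int count = 0;
  cudaGetDeviceCount(&count);
  printf("Found %d CUDA device(s)\n", count);

  for (int i = 0; i < count; i++) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    printf("Device %d: %s, compute capability %d.%d, %zu MB global memory\n",
           i, prop.name, prop.major, prop.minor, prop.totalGlobalMem >> 20);
  }
  return 0;
}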

Download the source code of gpu_info.cu and transfer it to the directory where you are working on the SCC.
Then execute the following command to compile gpu_info.cu:

scc1% nvcc -o gpu_info gpu_info.cu

Running a CUDA program interactively on a GPU-enabled node
To execute a CUDA code, you have to log in via interactive batch to a GPU-enabled node on the SCC. To request an interactive session with access to 1 GPU:

scc1% qrsh -l gpus=1

To run a CUDA program interactively, you then type in the name of the program at the command prompt:

gpunode% gpu_info

Submit a CUDA program Batch Job
The following line shows how to submit the gpu_info program to run in batch mode on a single CPU with access to a single GPU:

scc1% qsub -l gpus=1 -b y gpu_info

where the -l gpus=# option indicates the number of GPUs requested for each processor (possibly a fraction). To learn about all options that could be used for submitting a job, please visit the running jobs page.

CUDA Libraries
Several scientific libraries that make use of CUDA are available:

 cuBLAS – Linear Algebra Subroutines. A GPU-accelerated version of the complete standard BLAS library.
 cuFFT – Fast Fourier Transform library. Provides a simple interface for computing FFTs up to 10x faster.
 cuRAND – Random Number Generation library. Delivers high-performance random number generation.
 cuSPARSE – Sparse Matrix library. Provides a collection of basic linear algebra subroutines used for sparse matrices.
 NPP – Performance Primitives library. A collection of image and signal processing primitives.

Architecture specific options
There are currently 3 sets of nodes that incorporate GPUs and are available to SCC users.

The first set includes 18 nodes. Each of these nodes has an Intel Xeon X5675 CPU with 12 cores running at 3.07 GHz and 48 GB of memory. Each node also has 8 NVIDIA Tesla M2070 GPU cards with 6 GB of memory each and compute capability 2.0.

The second set includes 2 nodes. Each of these nodes has E5-2650v2 processors with 16 cores running at 2.6 GHz and 128 GB of memory. Each node also has 2 NVIDIA Tesla K40m GPU cards with 12 GB of memory each and compute capability 3.5.

The third set includes 4 nodes. Each of these nodes has E5-2680v4 processors with 28 cores running at 2.4 GHz and 256 GB of memory. Each node also has 2 NVIDIA Tesla P100 GPU cards with 12 GB of memory each and compute capability 6.0.

CUDA-Accelerated Applications
Applications in the following domains have been accelerated using the CUDA parallel computing architecture of NVIDIA Tesla GPUs:

GOVERNMENT & DEFENSE
MOLECULAR DYNAMICS, COMPUTATIONAL CHEMISTRY
LIFE SCIENCES, BIO-INFORMATICS
ELECTRODYNAMICS AND ELECTROMAGNETICS
MEDICAL IMAGING, CT, MRI
OIL & GAS
FINANCIAL COMPUTING AND OPTIONS PRICING
MATLAB, LABVIEW, MATHEMATICA, R
ELECTRONIC DESIGN AUTOMATION
WEATHER AND OCEAN MODELING
VIDEO, IMAGING, AND VISION APPLICATIONS
