ACA Unit3 Revised
MODULE-III
Getting Started with CUDA: Installation, Driver, SDK, Toolkit
CUDA® is a parallel computing platform and programming model
invented by NVIDIA.
It enables dramatic increases in computing performance by
harnessing the power of the graphics processing unit (GPU).
CUDA was developed with several design goals in mind:
‣ Provide a small set of extensions to standard programming
languages, like C, that enable a straightforward implementation of
parallel algorithms. With CUDA C/C++, programmers can focus on
the task of parallelizing the algorithms rather than spending time on
their implementation.
‣ Support heterogeneous computation where applications use both
the CPU and GPU. Serial portions of applications are run on the
CPU, and parallel portions are offloaded to the GPU. As such,
CUDA can be incrementally applied to existing applications.
The CPU and GPU are treated as separate devices that have their
own memory spaces. This configuration also allows simultaneous
computation on the CPU and GPU without contention for memory
resources.
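A minimal sketch of this division of labor is shown below; the kernel name scale and all variable names are illustrative placeholders, not part of the CUDA API. The serial portion (allocation and initialization) runs on the CPU, and the data-parallel loop is offloaded to the GPU:
#include <stdio.h>
// Parallel portion: runs on the GPU, one thread per array element
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}
int main(void)
{
    const int n = 1024;
    float h[n];                                   // host array (CPU memory)
    for (int i = 0; i < n; i++) h[i] = 1.0f;      // serial portion on the CPU
    float *d;                                     // device array (GPU memory)
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);  // parallel portion offloaded to the GPU
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);                  // prints 2.000000
    cudaFree(d);
    return 0;
}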
CUDA-capable GPUs have hundreds of cores that can collectively
run thousands of computing threads. These cores have shared
resources, including a register file and a shared memory. The on-chip
shared memory allows parallel tasks running on these cores to share
data without sending it over the system memory bus.
Ubuntu
1. Perform the pre-installation actions.
2. Install repository meta-data
When using a proxy server with aptitude, ensure that wget is set up to use
the same proxy settings before installing the cuda-repo package.
$ sudo dpkg -i cuda-repo-<distro>_<version>_<architecture>.deb
3. Update the Apt repository cache
$ sudo apt-get update
4. Install CUDA
$ sudo apt-get install cuda
SDK installation
Eclipse IDE (Nsight Eclipse Edition) is the software development kit for CUDA programs.
Open a terminal and run the command nsight. If Eclipse is already installed, it opens;
otherwise the shell suggests an installation command such as
$ sudo apt install nvidia-nsight
After authentication, it downloads the required files from the Internet and installs Eclipse.
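After installation, the toolkit and the IDE can be checked from the terminal; saxpy.cu below is only an illustrative file name:
$ nvcc --version # prints the installed CUDA compiler (toolkit) version
$ nvcc saxpy.cu -o saxpy # compiles a CUDA C source file
$ nsight & # launches Nsight Eclipse Edition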
This post is the first in a series on CUDA C and C++, which is the C/C++
interface to the CUDA parallel computing platform. This series of posts
assumes familiarity with programming in C. We will be running a parallel
series of posts about CUDA Fortran targeted at Fortran programmers.
These two series will cover the basic concepts of parallel computing on
the CUDA platform. From here on, unless I state otherwise, I will use the
term “CUDA C” as shorthand for “CUDA C and C++”. CUDA C is
essentially C/C++ with a few extensions that allow one to execute
functions on the GPU using many threads in parallel.
The host (main) portion of the complete SAXPY program is listed below; the saxpy kernel itself is discussed in the Device Code section.
#include <stdio.h>
#include <math.h>
// Note: the __global__ saxpy kernel (see Device Code) must be defined above main().
int main(void)
{
  int N = 1<<20;                               // 1M elements
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));         // host arrays
  y = (float*)malloc(N*sizeof(float));
  cudaMalloc(&d_x, N*sizeof(float));           // device arrays
  cudaMalloc(&d_y, N*sizeof(float));
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));  // expected result is 4.0f
  printf("Max error: %f\n", maxError);
  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}
The function saxpy is the kernel that runs in parallel on the GPU, and the
main function is the host code. Let’s begin our discussion of this program
with the host code.
Host Code
The main function declares two pairs of arrays, one pair for the host and one for the device:
float *x, *y, *d_x, *d_y;
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));
The pointers x and y point to the host arrays, allocated with malloc in the
typical fashion, and the d_x and d_y arrays point to device arrays allocated
with the cudaMalloc function from the CUDA runtime API. The host and
device in CUDA have separate memory spaces, both of which can be
managed from host code (CUDA C kernels can also allocate device
memory on devices that support it).
The host code then initializes the host arrays. Here we set x to an array
of ones, and y to an array of twos.
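In the listing above, this initialization is the simple loop:
for (int i = 0; i < N; i++) {
  x[i] = 1.0f;
  y[i] = 2.0f;
}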
To initialize the device arrays, we simply copy the data from x and y to
the corresponding device arrays d_x and d_y using cudaMemcpy, which
works just like the standard C memcpy function, except that it takes a
fourth argument which specifies the direction of the copy. In this case we
use cudaMemcpyHostToDevice to specify that the first (destination)
argument is a device pointer and the second (source) argument is a host
pointer.
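These are the two host-to-device copies from the listing:
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);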
After the data transfers, the saxpy kernel is launched by the statement
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y). The information between
the triple angle brackets is the execution configuration, which dictates
how many device threads execute the kernel in parallel. The first argument
specifies the number of thread blocks in the grid, and the second specifies
the number of threads in a thread block.
For cases where the number of elements in the arrays is not evenly
divisible by the thread block size, the kernel code must check for out-of-
bounds memory accesses.
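For example, with a block size of 256 as in the launch above, the grid size is rounded up so that every element gets a thread, and the extra threads are discarded by the bounds check inside the kernel:
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;    // round up; equivalent to (N+255)/256
saxpy<<<numBlocks, blockSize>>>(N, 2.0f, d_x, d_y);
// In the kernel: if (i < n) y[i] = a*x[i] + y[i];  // out-of-range threads do nothing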
Cleaning Up
After we are finished, we should free any allocated memory. For device
memory allocated with cudaMalloc(), simply call cudaFree(). For host
memory, use free() as usual.
cudaFree(d_x);
cudaFree(d_y);
free(x);
free(y);
Device Code
We now move on to the kernel code.
__global__
void saxpy(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
}
In CUDA, we define kernels such as saxpy using the __global__
declaration specifier. Variables defined within device code do not need to
be specified as device variables because they are assumed to reside on the
device. In this case the n, a and i variables will be stored by each thread
in a register, and the pointers x and y must be pointers to the device
memory address space. This is indeed true because we passed d_x and
d_y to the kernel when we launched it from the host code. The first two
arguments, n and a, however, were not explicitly transferred to the device
in host code. Because function arguments are passed by value by default
in C/C++, the CUDA runtime can automatically handle the transfer of
these values to the device. This feature of the CUDA Runtime API makes
launching kernels on the GPU very natural and easy—it is almost the same
as calling a C function.
There are only two lines in our saxpy kernel. As mentioned earlier, the
kernel is executed by multiple threads in parallel. If we want each thread
to process an element of the resultant array, then we need a means of
distinguishing and identifying each thread. CUDA defines the variables
blockDim, blockIdx, and threadIdx. These predefined variables are of
type dim3, analogous to the execution configuration parameters in host
code. The predefined variable blockDim contains the dimensions of each
thread block as specified in the second execution configuration parameter
for the kernel launch. The predefined variables threadIdx and blockIdx
contain the index of the thread within its thread block and the thread block
within the grid, respectively. The expression int i = blockIdx.x*blockDim.x + threadIdx.x; generates a global index that is used to access elements of the arrays.
Before this index is used to access array elements, its value is checked
against the number of elements, n, to ensure there are no out-of-bounds
memory accesses. This check is required for cases where the number of
elements in an array is not evenly divisible by the thread block size, and
as a result the number of threads launched by the kernel is larger than the
array size. The second line of the kernel performs the element-wise work
of the SAXPY, and other than the bounds check, it is identical to the inner
loop of a host implementation of SAXPY.
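The program is compiled with the NVIDIA compiler driver nvcc; assuming the source was saved as saxpy.cu (an illustrative file name), the compile and run steps look like:
% nvcc -o saxpy saxpy.cu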
% ./saxpy
Max error: 0.000000
Summary and Conclusions
With this walkthrough of a simple CUDA C implementation of SAXPY,
you now know the basics of programming CUDA C. There are only a few
extensions to C required to “port” a C code to CUDA C: the __global__
declaration specifier for device kernel functions; the execution
configuration used when launching a kernel; and the built-in device
variables blockDim, blockIdx, and threadIdx used to identify and
differentiate GPU threads that execute the kernel in parallel.
Flynn’s classification of computer architectures:
1. SISD (Single Instruction stream, Single Data stream)
2. SIMD (Single Instruction stream, Multiple Data streams)
3. MISD (Multiple Instruction streams, Single Data stream)
4. MIMD (Multiple Instruction streams, Multiple Data streams)
KERNEL
A kernel is the central part of an operating system. It manages the
operations of the computer and the hardware - most notably memory and
CPU time. There are two types of kernels: a microkernel, which contains
only basic functionality, and a monolithic kernel, which also contains
many device drivers.
Process management
The kernel is in charge of creating and destroying processes and handling
their connection to the outside world (input and output). Communication
among different processes (through signals, pipes, or interprocess
communication primitives) is basic to the overall system functionality and
is also handled by the kernel. In addition, the scheduler, which controls
how processes share the CPU, is part of process management. More
generally, the kernel’s process management activity implements the
abstraction of several processes on top of a single CPU or a few of them.
Memory Management
The computer’s memory is a major resource, and the policy used to deal
with it is a critical one for system performance. The kernel builds up a
virtual addressing space for any and all processes on top of the limited
available resources. The different parts of the kernel interact with the
memory-management subsystem through a set of function calls, ranging
from the simple malloc/free pair to much more complex functionalities.
Filesystems
Unix is heavily based on the filesystem concept; almost everything in
Unix can be treated as a file. The kernel builds a structured filesystem on
top of unstructured hardware, and the resulting file abstraction is heavily
used throughout the whole system. In addition, Linux supports multiple
filesystem types, that is, different ways of organizing data on the physical
medium. For example, disks may be formatted with the Linux-standard
ext3 filesystem, the commonly used FAT filesystem or several others.
Device control
Almost every system operation eventually maps to a physical device.
With the exception of the processor, memory, and a very few other
entities, any and all device control operations are performed by code that
is specific to the device being addressed. That code is called a device
driver. The kernel must have embedded in it a device driver for every
peripheral present on a system, from the hard drive to the keyboard and
the tape drive. This aspect of the kernel’s functions is our primary interest
here.
Networking
Networking must be managed by the operating system, because most
network operations are not specific to a process: incoming packets are
asynchronous events. The packets must be collected, identified, and
dispatched before a process takes care of them. The system is in charge of
delivering data packets across program and network interfaces, and it
must control the execution of programs according to their network
activity. Additionally, all the routing and address resolution issues are
implemented within the kernel.
Loadable Modules
One of the good features of Linux is the ability to extend at runtime the
set of features offered by the kernel. This means that you can add
functionality to the kernel (and remove functionality as well) while the
system is up and running. Each piece of code that can be added to the
kernel at runtime is called a module. The Linux kernel offers support for
quite a few different types (or classes) of modules, including, but not
limited to, device drivers. Each module is made up of object code (not
linked into a complete executable) that can be dynamically linked to the
running kernel by the insmod program and can be unlinked by the rmmod
program. Figure 1-1 identifies different classes of modules in charge of
specific tasks—a module is said to belong to a specific class according to
the functionality it offers. The placement of modules in Figure 1-1 covers
the most important classes, but is far from complete because more and
more functionality in Linux is being modularized.
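For example, hello.ko below is a hypothetical module file used only for illustration; a module is linked into and removed from the running kernel as follows:
$ sudo insmod ./hello.ko # dynamically link the module into the running kernel
$ lsmod | grep hello # list loaded modules to confirm it is present
$ sudo rmmod hello # unlink and remove the module again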
Char devices
A char (character) device is one that can be accessed as a stream of bytes,
like a file; a char driver is in charge of implementing this behavior. The
text console (/dev/console) and the serial ports (/dev/ttyS0 and
friends) are examples of char devices, as they are well represented by the
stream abstraction. Char devices are accessed by means of filesystem
nodes, such as /dev/tty1 and /dev/lp0. The only relevant difference
between a char device and a regular file is that you can always move back
and forth in the regular file, whereas most char devices are just data
channels, which you can only access sequentially. There exist,
nonetheless, char devices that look like data areas, and you can move back
and forth in them; for instance, this usually applies to frame grabbers,
where the applications can access the whole acquired image using mmap
or lseek.
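As a small illustration, the following C program reads from /dev/urandom, a char device present on most Linux systems, through its filesystem node just as if it were an ordinary file:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
int main(void)
{
    int fd = open("/dev/urandom", O_RDONLY);   // open the char device node
    unsigned char buf[8];
    if (fd < 0) return 1;
    ssize_t n = read(fd, buf, sizeof buf);     // read a sequential stream of bytes
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}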
Block devices
Like char devices, block devices are accessed by filesystem nodes in the
/dev directory. A block device is a device (e.g., a disk) that can host a
filesystem. In most Unix systems, a block device can only handle I/O
operations that transfer one or more whole blocks, which are usually 512
bytes (or a larger power of two) in length. Linux, instead, allows the
application to read and write a block device like a char device—it permits
the transfer of any number of bytes at a time. As a result, block and char
devices differ only in the way data is managed internally by the kernel,
and thus in the kernel/driver software interface. Like a char device, each
block device is accessed through a filesystem node, and the difference
between them is transparent to the user. Block drivers have a completely
different interface to the kernel than char drivers.
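The two device types can be told apart in a directory listing; device names such as /dev/sda vary between systems and are used here only as an example:
$ ls -l /dev/sda /dev/tty1 # a leading 'b' marks a block device, a leading 'c' a char device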
Network interfaces
Any network transaction is made through an interface, that is, a device
that is able to exchange data with other hosts. Usually, an interface is a
hardware device, but it might also be a pure software device, like the
loopback interface. A network interface is in charge of sending and
receiving data packets, driven by the network subsystem of the kernel.
Shared Memory
The following example reverses a 64-element array within a single thread block using dynamically allocated shared memory:
#include <stdio.h>

__global__ void dynamicReverse(int *d, int n)
{
  extern __shared__ int s[];  // dynamically allocated shared memory
  int t = threadIdx.x;
  int tr = n-t-1;             // mirrored index
  s[t] = d[t];                // load the array into shared memory
  __syncthreads();            // wait until the whole block has loaded
  d[t] = s[tr];               // write back in reversed order
}

int main(void)
{
  const int n = 64;
  int a[n], r[n], d[n];
  for (int i = 0; i < n; i++) { a[i] = i; r[i] = n-i-1; d[i] = 0; }
  int *d_d;
  cudaMalloc(&d_d, n * sizeof(int));
  cudaMemcpy(d_d, a, n * sizeof(int), cudaMemcpyHostToDevice);
  // The third launch parameter gives the dynamic shared memory size in bytes
  dynamicReverse<<<1, n, n * sizeof(int)>>>(d_d, n);
  cudaMemcpy(d, d_d, n * sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++)
    if (d[i] != r[i]) printf("Error at %d: %d != %d\n", i, d[i], r[i]);
  cudaFree(d_d);
}
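When the array size is known at compile time, the shared memory array can instead be declared statically inside the kernel; staticReverse below is an illustrative variant of the same reversal:
__global__ void staticReverse(int *d, int n)
{
  __shared__ int s[64];   // statically sized shared memory (requires n <= 64)
  int t = threadIdx.x;
  int tr = n - t - 1;
  s[t] = d[t];
  __syncthreads();
  d[t] = s[tr];
}
// Launched without the third execution configuration parameter: staticReverse<<<1, n>>>(d_d, n);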
Summary
Shared memory is a powerful feature for writing well optimized
CUDA code. Access to shared memory is much faster than global
memory access because it is located on chip. Because shared
memory is shared by threads in a thread block, it provides a
mechanism for threads to cooperate. One way to use shared
memory that leverages such thread cooperation is to enable global
memory coalescing, as demonstrated by the array reversal in this
post. By reversing the array using shared memory we are able to
have all global memory reads and writes performed with unit stride,
achieving full coalescing on any CUDA GPU.
To link and run applications using CUDA you will need to make
some changes to your path and environment. Load the
appropriate version of cuda:
gpunode% gpu_info
CUDA Libraries
The first set includes 18 nodes. Each of these nodes has an Intel
Xeon X5675 CPU with 12 cores running at 3.07 GHz and 48 GB of
memory. Each node also has 8 NVIDIA Tesla M2070 GPU cards
with 6 GB of memory each and compute capability 2.0.
The second set includes 2 nodes. Each of these nodes has E5-
2650v2 processors with 16 cores running at 2.6 GHz and 128 GB
of memory. Each node also has 2 NVIDIA Tesla K40m GPU
cards with 12 GB of memory each and compute capability 3.5.
The third set includes 4 nodes. Each of these nodes has E5-
2680v4 processors with 28 cores running at 2.4 GHz and 256 GB
of memory. Each node also has 2 NVIDIA Tesla P100 GPU cards
with 12 GB of memory each and compute capability 6.0.
CUDA-Accelerated Applications
The following applications have been accelerated using the
CUDA parallel computing architecture of NVIDIA Tesla GPUs.