As we may have already seen, NVIDIA GPUs use CUDA cores to perform calculations. In this segment, we
will use Python to run CUDA programs.
Compared with C, Python offers many more quality-of-life conveniences, especially for data visualization
and scientific computing.
We will demonstrate one of the simplest applications of parallel programming: adding two matrices.
Each element of the result is handled by its own thread, so all elements are added in parallel, conceptually
in a single step, rather than one after another in a classical for loop. We will start with two 2D matrices
of arbitrary size in numpy and then proceed to add them.
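To make the contrast with the classical for loop concrete, here is a small CPU-only sketch in numpy. It does not use the GPU; it only shows that each output element depends on one pair of inputs, which is what makes the work easy to parallelize. The matrix sizes and values are arbitrary choices for illustration.

```python
import numpy as np

# Two small matrices of an arbitrary size (3x4 here, chosen for illustration).
a = np.arange(12, dtype=np.float32).reshape(3, 4)
b = np.full((3, 4), 10, dtype=np.float32)

# Classical approach: visit every element one after another.
looped = np.empty_like(a)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        looped[i, j] = a[i, j] + b[i, j]

# Elementwise approach: no explicit loop, every element is independent.
vectorized = a + b

print(np.array_equal(looped, vectorized))
```

Because no element needs the result of any other element, the same addition can be handed to one GPU thread per element.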
Getting started
To start our virtual workspace, we will use Google Colaboratory, address at:
Google’s cloud service provides this to us for free (note that you will need a Google account).
Since pycuda is not preinstalled in Colab, we need an additional line to install it before importing the libraries.
Run the code segment before proceeding by clicking the play button at its left.
Wait for a bit while pycuda is being installed. After the build finishes, we are ready to proceed.
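As a rough sketch of this step: the install line is assumed here to be `!pip install pycuda`, which is the usual way to add a package in a Colab cell. The helper `pycuda_available` below is a hypothetical name, added only to check whether the install worked.

```python
import importlib.util

# In a Colab cell, the assumed install line is:
#   !pip install pycuda
# (the leading "!" is notebook syntax, so it is kept in a comment here)

def pycuda_available():
    """Return True if the pycuda package can be found in this environment."""
    return importlib.util.find_spec("pycuda") is not None

if pycuda_available():
    print("pycuda is installed; ready to import it")
else:
    print("pycuda not found; run the install line first")
```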
Of course, we are free to declare the matrices as we wish; here we create a matrix “a” of standard normal
random numbers and repeat for a “b” matrix. Note that numpy produces double precision (float64) values by
default, while GPUs are typically much faster in single precision, so to avoid difficulties, we pre-convert
to float32 or single precision.
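A minimal sketch of this declaration, assuming 5x5 matrices (the size used in the indexing discussion later in this segment):

```python
import numpy as np

# Standard normal random matrices; randn returns float64 by default.
a = np.random.randn(5, 5)
b = np.random.randn(5, 5)

# Convert to single precision before sending anything to the GPU.
a = a.astype(np.float32)
b = b.astype(np.float32)

print(a.dtype, a.shape)
```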
GPU details
As we may have seen earlier, GPUs cannot operate directly on variables that live in host memory; that is to
say, we can’t write something like
> accessing GPU
> GPU variable a
Instead, we need to allocate memory on the device, copy our data from the host into it, access it via
pointers and then run calculations on that copy.
We do this as follows:
htod stands for “host to device”, that is, from your system to the GPU. We will use the
reverse of this command (dtoh, “device to host”) to extract data from our GPU.
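The allocation and copy described above might look like the following sketch. It uses `mem_alloc` and `memcpy_htod` from `pycuda.driver`; the helper name `copy_to_gpu` and the `a_gpu`/`b_gpu` variable names are assumptions for illustration. The GPU part is wrapped in a try block so the cell still runs where no GPU or pycuda is present.

```python
import numpy as np

def copy_to_gpu(host_array):
    """Allocate device memory and copy a host array into it (needs pycuda and a CUDA GPU)."""
    import pycuda.autoinit  # importing this sets up a CUDA context
    import pycuda.driver as cuda

    device_ptr = cuda.mem_alloc(host_array.nbytes)  # raw allocation on the GPU
    cuda.memcpy_htod(device_ptr, host_array)        # host -> device copy
    return device_ptr

a = np.random.randn(5, 5).astype(np.float32)
b = np.random.randn(5, 5).astype(np.float32)

try:
    a_gpu = copy_to_gpu(a)
    b_gpu = copy_to_gpu(b)
    print("copied", a.nbytes, "bytes per matrix to the GPU")
except Exception as exc:  # pycuda missing or no GPU available
    a_gpu = b_gpu = None
    print("GPU copy skipped:", exc)
```

Note that `a_gpu` is only a pointer to device memory; it cannot be printed or indexed like a numpy array.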
Writing the function
Interestingly enough, the function body has to be written in C. Every a[idx] is processed simultaneously,
one per thread.
threadIdx.x refers to x and threadIdx.y refers to y. blockDim.x refers to the dimension in x, which in our
case is 5. Hence, we can create an equivalent 1D array of size 25x1 with the following index notation:
a+5        a+(5+1)      a+(5+2)    a+(5+3)    a+(5+4)
a+(5*2)    a+(5*2+1)    . . .
. . . . .
. . . .    a+(5*4+4) = a+24
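The index notation above can be checked with a short sketch. The kernel source below is an assumption about what such a function could look like: the name `add_matrices` and the choice to store the sum back into `a` are hypothetical. The Python helper `flat_index` mirrors the same arithmetic on the CPU so the mapping can be inspected without a GPU.

```python
# CUDA C source kept as a Python string (hypothetical kernel, for illustration).
KERNEL_SOURCE = """
__global__ void add_matrices(float *a, float *b)
{
    // Flatten the 2D thread position into a 1D offset.
    int idx = threadIdx.x + threadIdx.y * blockDim.x;
    a[idx] += b[idx];
}
"""

def flat_index(x, y, block_dim_x=5):
    """CPU mirror of the kernel's index: threadIdx.x + threadIdx.y * blockDim.x."""
    return x + y * block_dim_x

# One row per y value, one column per x value.
indices = [[flat_index(x, y) for x in range(5)] for y in range(5)]
for row in indices:
    print(row)
```

The last entry, `flat_index(4, 4)`, corresponds to a+(5*4+4) = a+24 in the grid above.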
Calling the function
We have to create a separate variable to get the function, as follows. This will extract the function from
“SourceModule” and execute it on the GPU copies of a and b.
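A hedged sketch of this call, using `SourceModule` and `get_function` from pycuda. The kernel name `add_matrices`, the helper `launch_add` and the block shape `(5, 5, 1)` are assumptions chosen to match a 5x5 matrix; the launch is wrapped in a try block so the cell still runs without a GPU.

```python
import numpy as np

# Hypothetical kernel: adds b into a, one thread per element.
KERNEL_SOURCE = """
__global__ void add_matrices(float *a, float *b)
{
    int idx = threadIdx.x + threadIdx.y * blockDim.x;
    a[idx] += b[idx];
}
"""

BLOCK = (5, 5, 1)  # 5x5 threads, one per matrix element

def launch_add(a, b):
    """Copy a and b to the GPU, launch the kernel, return the device pointers."""
    import pycuda.autoinit  # sets up a CUDA context
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    a_gpu = cuda.mem_alloc(a.nbytes)
    b_gpu = cuda.mem_alloc(b.nbytes)
    cuda.memcpy_htod(a_gpu, a)
    cuda.memcpy_htod(b_gpu, b)

    mod = SourceModule(KERNEL_SOURCE)        # compile the CUDA C source
    func = mod.get_function("add_matrices")  # the separate variable holding the function
    func(a_gpu, b_gpu, block=BLOCK)          # execute on the GPU copies of a and b
    return a_gpu, b_gpu

a = np.random.randn(5, 5).astype(np.float32)
b = np.random.randn(5, 5).astype(np.float32)

try:
    a_gpu, b_gpu = launch_add(a, b)
    print("kernel launched")
except Exception as exc:  # pycuda missing or no GPU available
    a_gpu = b_gpu = None
    print("GPU launch skipped:", exc)
```

The block shape determines how many threads run; here 5 * 5 * 1 = 25 threads match the 25 elements.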
Extracting the
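The reverse copy, device to host, can be sketched with `memcpy_dtoh` from `pycuda.driver`. This example only performs a round trip (host to device and back) to show the call; the helper name `round_trip` is hypothetical. In a full run, the same `memcpy_dtoh` call would be made after the kernel has modified the device buffer.

```python
import numpy as np

def round_trip(host_array):
    """Copy an array to the GPU and back again (needs pycuda and a CUDA GPU)."""
    import pycuda.autoinit  # sets up a CUDA context
    import pycuda.driver as cuda

    device_ptr = cuda.mem_alloc(host_array.nbytes)
    cuda.memcpy_htod(device_ptr, host_array)   # host -> device

    result = np.empty_like(host_array)         # host buffer to receive the data
    cuda.memcpy_dtoh(result, device_ptr)       # device -> host
    return result

a = np.random.randn(5, 5).astype(np.float32)

try:
    a_back = round_trip(a)
    print("round trip matches:", np.array_equal(a_back, a))
except Exception as exc:  # pycuda missing or no GPU available
    a_back = None
    print("GPU round trip skipped:", exc)
```

The receiving array must already exist on the host with the right size and dtype, which is why `np.empty_like` is used before the copy.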
Note: As we may already have seen, parallel programming needs to be applied in very specific
circumstances. More specifically, it is best suited for work where a computation is not affected by its next