FIGURE 2.1
Conversion of a color image to a grayscale image.
The actual allowable mixtures of these three colors vary across industry-specified
color spaces. Here, the valid combinations of the three colors in the AdobeRGB
color space are shown as the interior of the triangle. The vertical coordinate
(y value) and horizontal coordinate (x value) of each mixture show the fraction of
the pixel intensity that should be G and R. The remaining fraction (1-y-x) of the
pixel intensity should be assigned to B. To render an image, the r, g, b values of
each pixel are used to calculate both the total intensity (luminance) of the pixel
as well as the mixture coefficients (x, y, 1-y-x).
FIGURE 2.2
Data parallelism in image-to-grayscale conversion. Pixels can be calculated independently
of each other.
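As a concrete illustration of this independence, the sketch below converts each pixel with a commonly used luminance formula, L = 0.21r + 0.72g + 0.07b. The function name and the flat, row-major, interleaved RGB layout are assumptions made for this example, not code from the text.

// Sequential sketch of color-to-grayscale conversion (hypothetical helper).
// Each output pixel depends only on its own input pixel, so every
// iteration of this loop could be computed independently in parallel.
void colorToGrayscale(unsigned char* gray, const unsigned char* rgb,
                      int width, int height) {
    for (int i = 0; i < width * height; ++i) {
        unsigned char r = rgb[3*i];
        unsigned char g = rgb[3*i + 1];
        unsigned char b = rgb[3*i + 2];
        gray[i] = (unsigned char)(0.21f*r + 0.72f*g + 0.07f*b);
    }
}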
FIGURE 2.3
Execution of a CUDA program.
1
There has been a steady movement for CUDA C to adopt C++ features. We will be using some
of these C++ features in our programming examples.
Note that Fig. 2.3 shows a simplified model in which the CPU execution and
the GPU execution do not overlap. Many heterogeneous computing applications
manage overlapped CPU and GPU execution to take advantage of both CPUs
and GPUs.
Launching a grid typically generates many threads to exploit data parallelism.
In the color-to-grayscale conversion example, each thread could be used to com-
pute one pixel of the output array O. In this case, the number of threads that
ought to be generated by the grid launch is equal to the number of pixels in the
image. For large images, a large number of threads will be generated. CUDA pro-
grammers can assume that these threads take very few clock cycles to generate
and schedule, owing to efficient hardware support. This assumption contrasts with
traditional CPU threads, which typically take thousands of clock cycles to gener-
ate and schedule. In the next chapter we will show how to implement color-to-
grayscale conversion and image blur kernels. In the rest of this chapter we will
use vector addition as a running example for simplicity.
Threads
A thread is a simplified view of how a processor executes a sequential pro-
gram in modern computers. A thread consists of the code of the program,
the point in the code that is being executed, and the values of its variables
and data structures. The execution of a thread is sequential as far as a
user is concerned. One can use a source-level debugger to monitor the
progress of a thread by executing one statement at a time, looking at the
statement that will be executed next and checking the values of the vari-
ables and data structures as the execution progresses.
Threads have been used in programming for many years. If a program-
mer wants to start parallel execution in an application, he/she creates and
manages multiple threads using thread libraries or special languages. In
CUDA, the execution of each thread is sequential as well. A CUDA pro-
gram initiates parallel execution by calling kernel functions, which causes
the underlying runtime mechanisms to launch a grid of threads that pro-
cess different parts of the data in parallel.
FIGURE 2.4
A simple traditional vector addition C code example.
Fig. 2.4 shows a simple traditional C program that consists of a main function and a vector addition function. In all
our examples, whenever there is a need to distinguish between host and device
data, we will suffix the names of variables that are used by the host with “_h”
and those of variables that are used by a device with “_d” to remind ourselves of
the intended usage of these variables. Since we have only host code in Fig. 2.4,
we see only variables suffixed with “_h”.
Assume that the vectors to be added are stored in arrays A and B that are allo-
cated and initialized in the main program. The output vector is in array C, which is
also allocated in the main program. For brevity we do not show the details of how
A, B, and C are allocated or initialized in the main function. The pointers to these
arrays are passed to the vecAdd function, along with the variable N that contains
the length of the vectors. Note that the parameters of the vecAdd function are suf-
fixed with “_h” to emphasize that they are used by the host. This naming conven-
tion will be helpful when we introduce device code in the next few steps.
The vecAdd function in Fig. 2.4 uses a for-loop to iterate through the vector
elements. In the ith iteration, output element C_h[i] receives the sum of A_h[i]
and B_h[i]. The vector length parameter n is used to control the loop so that
the number of iterations matches the length of the vectors. The function reads the
elements of A and B and writes the elements of C through the pointers A_h, B_h,
and C_h, respectively. When the vecAdd function returns, the subsequent state-
ments in the main function can access the new contents of C.
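For readers without access to the figure, a minimal sketch of such a sequential program follows. The figure omits the allocation and initialization details; a simple malloc-based allocation and an arbitrary length are added here only to make the sketch self-contained.

#include <stdlib.h>

// Compute vector sum C_h = A_h + B_h sequentially on the host.
void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
    for (int i = 0; i < n; ++i) {
        C_h[i] = A_h[i] + B_h[i];
    }
}

int main(void) {
    int N = 1000;   // example length (assumption made for this sketch)
    float* A_h = (float*)malloc(N * sizeof(float));
    float* B_h = (float*)malloc(N * sizeof(float));
    float* C_h = (float*)malloc(N * sizeof(float));
    // ... initialize A_h and B_h (omitted for brevity, as in the text) ...
    vecAdd(A_h, B_h, C_h, N);
    free(A_h); free(B_h); free(C_h);
    return 0;
}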
A straightforward way to execute vector addition in parallel is to modify the
vecAdd function and move its calculations to a device. The structure of such a
modified vecAdd function is shown in Fig. 2.5. Part 1 of the function allocates
space in the device (GPU) memory to hold copies of the A, B, and C vectors and
copies the A and B vectors from the host memory to the device memory. Part 2
calls the actual vector addition kernel to launch a grid of threads on the device.
Part 3 copies the sum vector C from the device memory to the host memory and
deallocates the three arrays from the device memory.
FIGURE 2.5
Outline of a revised vecAdd function that moves the work to a device.
Note that the revised vecAdd function is essentially an outsourcing agent that
ships input data to a device, activates the calculation on the device, and collects
the results from the device. The agent does so in such a way that the main pro-
gram does not need to even be aware that the vector addition is now actually
done on a device. In practice, such a “transparent” outsourcing model can be very
inefficient because of all the copying of data back and forth. One would often
keep large and important data structures on the device and simply invoke device
functions on them from the host code. For now, however, we will use the simpli-
fied transparent model to introduce the basic CUDA C program structure. The
details of the revised function, as well as the way to compose the kernel function,
will be the topic of the rest of this chapter.
FIGURE 2.6
CUDA API functions for managing device global memory.
The cudaMalloc function can be called from the host code to allocate a piece
of device global memory for an object. The reader should notice the striking simi-
larity between cudaMalloc and the standard C runtime library malloc function. This
is intentional; CUDA C is C with minimal extensions. CUDA C uses the standard
C runtime library malloc function to manage the host memory2 and adds
cudaMalloc as an extension to the C runtime library. By keeping the interface as
close to the original C runtime libraries as possible, CUDA C minimizes the time
that a C programmer spends relearning the use of these extensions.
The first parameter to the cudaMalloc function is the address of a pointer vari-
able that will be set to point to the allocated object. The address of the pointer vari-
able should be cast to (void**) because the function expects a generic pointer; the
memory allocation function is a generic function that is not restricted to any particu-
lar type of objects.3 This parameter allows the cudaMalloc function to write the
address of the allocated memory into the provided pointer variable regardless of its
type.4 The host code that calls kernels passes this pointer value to the kernels that
need to access the allocated memory object. The second parameter to the cudaMalloc
function gives the size of the data to be allocated, in number of bytes. The usage of
this second parameter is consistent with the size parameter to the C malloc function.
We now use the following simple code example to illustrate the use of
cudaMalloc and cudaFree:
float *A_d;
int size=n*sizeof(float);
cudaMalloc((void**)&A_d, size);
...
cudaFree(A_d);
2
CUDA C also has more advanced library functions for allocating space in the host memory. We
will discuss them in Chapter 20, Programming a Heterogeneous Computing Cluster.
3
The fact that cudaMalloc returns a generic object makes the use of dynamically allocated mul-
tidimensional arrays more complex. We will address this issue in Section 3.2.
4
Note that cudaMalloc has a different format from the C malloc function. The C malloc func-
tion returns a pointer to the allocated object. It takes only one parameter that specifies the size of
the allocated object. The cudaMalloc function writes to the pointer variable whose address is
given as the first parameter. As a result, the cudaMalloc function takes two parameters. The two-
parameter format of cudaMalloc allows it to use the return value to report any errors in the same
way as other CUDA API functions.
FIGURE 2.7
CUDA API function for data transfer between host and device.
The vecAdd function calls the cudaMemcpy function to copy the A_h and B_h
vectors from the host memory to A_d and B_d in the device memory before adding
them and to copy the C_d vector from the device memory to C_h in the host mem-
ory after the addition has been done. Assuming that the values of A_h, B_h, A_d,
B_d, and size have already been set as we discussed before, the three cudaMemcpy
calls are shown below. The two symbolic constants, cudaMemcpyHostToDevice and
cudaMemcpyDeviceToHost, are recognized, predefined constants of the CUDA pro-
gramming environment. Note that the same function can be used to transfer data in
both directions by properly ordering the source and destination pointers and using
the appropriate constant for the transfer type.
cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);
...
cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
To summarize, the main program in Fig. 2.4 calls vecAdd, which is also exe-
cuted on the host. The vecAdd function, outlined in Fig. 2.5, allocates space in
device global memory, requests data transfers, and calls the kernel that performs
the actual vector addition. We refer to this type of host code as a stub for calling
a kernel. We show a more complete version of the vecAdd function in Fig. 2.8.
FIGURE 2.8
A more complete version of vecAdd().
Compared to Fig. 2.5, the vecAdd function in Fig. 2.8 is complete for Part 1 and
Part 3. Part 1 allocates device global memory for A_d, B_d, and C_d and transfers
A_h to A_d and B_h to B_d. This is done by calling the cudaMalloc and cudaMemcpy
functions. The readers are encouraged to write their own function calls with the
appropriate parameter values and compare their code with that shown in Fig. 2.8.
Part 2 calls the kernel and will be described in the following subsection. Part 3 cop-
ies the vector sum data from the device to the host so that the values will be avail-
able in the main function. This is accomplished with a call to the cudaMemcpy
function. It then frees the memory for A_d, B_d, and C_d from the device global
memory, which is done by calls to the cudaFree function (Fig. 2.9).
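A sketch of such a revised vecAdd function is shown below. It follows the three-part structure described above, with the kernel call of Part 2 deferred to the next subsection; the exact code in Fig. 2.8 may differ in minor details.

void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // Part 1: allocate device global memory and copy the inputs to the device
    cudaMalloc((void**)&A_d, size);
    cudaMalloc((void**)&B_d, size);
    cudaMalloc((void**)&C_d, size);
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    // Part 2: call the kernel to launch a grid of threads (see Section 2.5)

    // Part 3: copy the result back to the host and free device memory
    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}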
In practice, host code should check the error code returned by each CUDA
API call; for example, cudaMalloc returns cudaSuccess only if the allocation
succeeded. This way, if the system is out of device memory, the user will be
informed about the situation. This can save many hours of debugging time.
One could define a C macro to make the checking code more concise
in the source.
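For example, one might check the value returned by cudaMalloc as follows. The message format and the decision to exit are choices made for this sketch, not code from the text.

cudaError_t err = cudaMalloc((void**)&A_d, size);
if (err != cudaSuccess) {
    // Report where the failure occurred and stop, rather than continuing
    // with an invalid device pointer.
    printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}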
A kernel function specifies the code to be executed by all threads during a parallel phase. Since all these
threads execute the same code, CUDA C programming is an instance of the well-
known single-program multiple-data (SPMD) (Atallah, 1998) parallel program-
ming style, a popular programming style for parallel computing systems.5
When a program’s host code calls a kernel, the CUDA runtime system
launches a grid of threads that are organized into a two-level hierarchy. Each grid
is organized as an array of thread blocks, which we will refer to as blocks for
brevity. All blocks of a grid are of the same size; each block can contain up to
1024 threads on current systems.6 Fig. 2.9 shows an example in which each block
consists of 256 threads. Each thread is represented by a curly arrow stemming
from a box that is labeled with the thread’s index number in the block.
Built-in Variables
Many programming languages have built-in variables. These variables
have special meaning and purpose. The values of these variables are often
pre-initialized by the runtime system and are typically read-only in the
program. The programmers should refrain from redefining these variables
for any other purposes.
The total number of threads in each thread block is specified by the host code
when a kernel is called. The same kernel can be called with different numbers
of threads at different parts of the host code. For a given grid of threads, the
number of threads in a block is available in a built-in variable named blockDim.
The blockDim variable is a struct with three unsigned integer fields (x, y, and z)
that help the programmer to organize the threads into a one-, two-, or three-
dimensional array. For a one-dimensional organization, only the x field is used.
For a two-dimensional organization, the x and y fields are used. For a three-
dimensional structure, all three x, y, and z fields are used. The choice of
dimensionality for organizing threads usually reflects the dimensionality of the
data. This makes sense because the threads are created to process data in parallel,
so it is only natural that the organization of the threads reflects the organization
of the data. In Fig. 2.9, each thread block is organized as a one-dimensional array
of threads because the data are one-dimensional vectors. The value of the
blockDim.x variable indicates the total number of threads in each block, which is
256 in Fig. 2.9. In general, it is recommended that the number of threads in each
dimension of a thread block be a multiple of 32 for hardware efficiency reasons.
We will revisit this later.
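For instance, block dimensions are typically specified with CUDA's dim3 type when the kernel is called; the particular sizes below are arbitrary examples chosen for illustration.

dim3 block1D(256, 1, 1);   // 256 threads arranged in one dimension
dim3 block2D(16, 16, 1);   // 256 threads arranged as a 16 x 16 two-dimensional array
dim3 block3D(8, 8, 4);     // 256 threads arranged as an 8 x 8 x 4 three-dimensional array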
5
Note that SPMD is not the same as SIMD (single instruction multiple data) (Flynn, 1972). In an
SPMD system the parallel processing units execute the same program on multiple parts of the data.
However, these processing units do not need to be executing the same instruction at the same time.
In an SIMD system, all processing units are executing the same instruction at any instant.
6
Each thread block can have up to 1024 threads in CUDA 3.0 and beyond. Some earlier CUDA
versions allow only up to 512 threads in a block.
FIGURE 2.9
All threads in a grid execute the same kernel code.
CUDA kernels have access to two more built-in variables (threadIdx and
blockIdx) that allow threads to distinguish themselves from each other and to
determine the area of data each thread is to work on. The threadIdx variable
gives each thread a unique coordinate within a block. In Fig. 2.9, since we are
using a one-dimensional thread organization, only threadIdx.x is used. The
threadIdx.x value for each thread is shown in the small shaded box of each
thread in Fig. 2.9. The first thread in each block has value 0 in its threadIdx.x
variable, the second thread has value 1, the third thread has value 2, and so on.
Hierarchical Organizations
Like CUDA threads, many real-world systems are organized hierar-
chically. The U.S. telephone system is a good example. At the top level,
the telephone system consists of “areas” each of which corresponds to a
geographical area. All telephone lines within the same area have the same
3-digit “area code”. A telephone area is sometimes larger than a city. For
example, many counties and cities of central Illinois are within the same
telephone area and share the same area code 217. Within an area, each
phone line has a seven-digit local phone number, which allows each area
to have a maximum of about ten million numbers.
One can think of each phone line as a CUDA thread, with the area code
as the value of blockIdx and the seven-digit local number as the value
of threadIdx. This hierarchical organization allows the system to have a
very large number of phone lines while preserving “locality” for calling the
same area. That is, when dialing a phone line in the same area, a caller
only needs to dial the local number. As long as we make most of our calls
within the local area, we seldom need to dial the area code. If we occasion-
ally need to call a phone line in another area, we dial 1 and the area code,
followed by the local number. (This is the reason why no local number in
any area should start with a 1.) The hierarchical organization of CUDA
threads also offers a form of locality. We will study this locality soon.
The blockIdx variable gives all threads in a block a common block coordi-
nate. In Fig. 2.9, all threads in the first block have value 0 in their blockIdx.x
variables, those in the second block have value 1, and so on. Using an analogy
with the telephone system, one can think of threadIdx.x as the local phone number
and blockIdx.x as the area code. The two together give each telephone line in the
whole country a unique phone number. Similarly, each thread can combine its
threadIdx and blockIdx values to create a unique global index for itself within
the entire grid.
In Fig. 2.9 a unique global index i is calculated as i=blockIdx.x*blockDim.x + threadIdx.x. Recall that blockDim.x is 256 in our example. The i values of
threads in block 0 range from 0 to 255. The i values of threads in block 1 range
from 256 to 511. The i values of threads in block 2 range from 512 to 767. That
is, the i values of the threads in these three blocks form a continuous coverage of
the values from 0 to 767. Since each thread uses i to access A, B, and C, these
threads cover the first 768 iterations of the original loop. By launching a grid
with a larger number of blocks, one can process larger vectors. By launching a
grid with n or more threads, one can process vectors of length n.
Fig. 2.10 shows a kernel function for vector addition. Note that we do not use
the “_h” and “_d” convention in kernels, since there is no potential confusion. We
will not have any access to the host memory in our examples. The syntax of a
kernel is ANSI C with some notable extensions. First, there is a CUDA-C-
specific keyword “__global__” in front of the declaration of the vecAddKernel
function. This keyword indicates that the function is a kernel and that it can be
called to generate a grid of threads on a device.
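A sketch of such a kernel is shown below. It follows the structure described in this section; Fig. 2.10 may differ slightly in formatting.

// Each thread computes one element of C, selected by its global index i.
__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}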
In general, CUDA C extends the C language with three qualifier keywords
that can be used in function declarations. The meaning of these keywords is sum-
marized in Fig. 2.11. The “__global__” keyword indicates that the function being
declared is a CUDA C kernel function. Note that there are two underscore charac-
ters on each side of the word “global.” Such a kernel function is executed on the
device and can be called from the host. In CUDA systems that support dynamic
parallelism, it can also be called from the device, as we will see in Chapter 21,
FIGURE 2.10
A vector addition kernel function.
FIGURE 2.11
CUDA C keywords for function declaration.
CUDA Dynamic Parallelism. The important feature is that calling such a kernel
function results in a new grid of threads being launched on the device.
The “__device__” keyword indicates that the function being declared is a
CUDA device function. A device function executes on a CUDA device and can
be called only from a kernel function or another device function. The device func-
tion is executed by the device thread that calls it and does not result in any new
device threads being launched.7
The “__host__” keyword indicates that the function being declared is a
CUDA host function. A host function is simply a traditional C function that exe-
cutes on the host and can be called only from another host function. By default,
all functions in a CUDA program are host functions if they do not have any of
the CUDA keywords in their declaration. This makes sense, since many CUDA
applications are ported from CPU-only execution environments. The programmer
would add kernel functions and device functions during the porting process. The
original functions remain as host functions. Having all functions default to host
functions spares the programmer the tedious work of changing all original
function declarations.
Note that one can use both “__host__” and “__device__” in a function decla-
ration. This combination tells the compilation system to generate two versions of
object code for the same function. One is executed on the host and can be called
only from a host function. The other is executed on the device and can be called
only from a device or kernel function. This supports a common use case when the
same function source code can be recompiled to generate a device version. Many
user library functions will likely fall into this category.
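A minimal sketch of such a dual-compiled function follows; the function itself is a made-up example chosen for this illustration, not one from the text.

// Compiled into both a host version and a device version: callable from
// ordinary host code as well as from kernels and device functions.
__host__ __device__ float clampToUnit(float x) {
    if (x < 0.0f) return 0.0f;
    if (x > 1.0f) return 1.0f;
    return x;
}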
The second notable extension to C, in Fig. 2.10, is the built-in variables
“threadIdx,” “blockIdx,” and “blockDim.” Recall that all threads execute the same
kernel code and there needs to be a way for them to distinguish themselves from
each other and direct each thread toward a particular part of the data. These built-in
variables are the means for threads to access hardware registers that provide the
7
We will explain the rules for using indirect function calls and recursions in different generations
of CUDA later. In general, one should avoid the use of recursion and indirect function calls in their
device functions and kernel functions to allow maximal portability.
identifying coordinates to threads. Different threads will see different values in their
threadIdx.x, blockIdx.x, and blockDim.x variables. For readability we will sometimes
refer to a thread as thread(blockIdx.x, threadIdx.x) in our discussions.
There is an automatic (local) variable i in Fig. 2.10. In a CUDA kernel func-
tion, automatic variables are private to each thread. That is, a version of i will be
generated for every thread. If the grid is launched with 10,000 threads, there will
be 10,000 versions of i, one for each thread. The value assigned by a thread to its
i variable is not visible to other threads. We will discuss these automatic variables
in more details in Chapter 5, Memory Architecture and Data Locality.
A quick comparison between Fig. 2.4 and Fig. 2.10 reveals an important
insight into CUDA kernels. The kernel function in Fig. 2.10 does not have a loop
that corresponds to the one in Fig. 2.4. The reader should ask where the loop
went. The answer is that the loop is now replaced with the grid of threads. The
entire grid forms the equivalent of the loop. Each thread in the grid corresponds
to one iteration of the original loop. This is sometimes referred to as loop paral-
lelism, in which iterations of the original sequential code are executed by threads
in parallel.
Note that there is an if (i < n) statement in vecAddKernel in Fig. 2.10. This is
because not all vector lengths can be expressed as multiples of the block size. For
example, let’s assume that the vector length is 100. The smallest efficient thread
block dimension is 32. Assume that we picked 32 as block size. One would need
to launch four thread blocks to process all the 100 vector elements. However, the
four thread blocks would have 128 threads. We need to disable the last 28 threads
in thread block 3 from doing work not expected by the original program. Since
all threads are to execute the same code, all will test their i values against n,
which is 100. With the if (i < n) statement, the first 100 threads will perform the
addition, whereas the last 28 will not. This allows the kernel to be called to pro-
cess vectors of arbitrary lengths.
FIGURE 2.12
A vector addition kernel call statement.
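The call statement in Fig. 2.12 takes roughly the following form: the first execution configuration parameter (between <<< and >>>) gives the number of blocks in the grid, and the second gives the number of threads in each block. The block size of 256 is an arbitrary choice for this example.

vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);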
FIGURE 2.13
A complete version of the host code in the vecAdd function.
8
While we use an arbitrary block size 256 in this example, the block size should be determined by
a number of factors that will be introduced later.
The same application code runs at lower speed on small GPUs and at higher speed on larger GPUs. We will
revisit this point in Chapter 4, Compute Architecture and Scheduling.
It is important to point out again that the vector addition example is used for
its simplicity. In practice, the overhead of allocating device memory, input data
transfer from host to device, output data transfer from device to host, and deallo-
cating device memory will likely make the resulting code slower than the original
sequential code in Fig. 2.4. This is because the amount of calculation that is done
by the kernel is small relative to the amount of data processed or transferred.
Only one addition is performed for two floating-point input operands and one
floating-point output operand. Real applications typically have kernels in which
much more work is needed relative to the amount of data processed, which makes
the additional overhead worthwhile. Real applications also tend to keep the data
in the device memory across multiple kernel invocations so that the overhead can
be amortized. We will present several examples of such applications.
2.7 Compilation
We have seen that implementing CUDA C kernels requires using various exten-
sions that are not part of C. Once these extensions have been used in the code, it
is no longer acceptable to a traditional C compiler. The code needs to be com-
piled by a compiler that recognizes and understands these extensions, such as
NVCC (NVIDIA C compiler). As is shown at the top of Fig. 2.14, NVCC
processes a CUDA C program by using the CUDA keywords to separate the host
code from the device code.
FIGURE 2.14
Overview of the compilation process of a CUDA C program.
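For example, a CUDA C source file is typically compiled from the command line with something like the following; the file and program names are placeholders.

nvcc vecadd.cu -o vecadd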
2.8 Summary
This chapter provided a quick, simplified overview of the CUDA C programming
model. CUDA C extends the C language to support parallel computing. We dis-
cussed an essential subset of these extensions in this chapter. For your convenience
we summarize the extensions that we have discussed in this chapter as follows:
- Function declarations: the __global__, __device__, and __host__ keywords mark functions as kernels, device functions, and host functions, respectively.
- Kernel call: host code launches a grid of threads by calling a kernel function with an execution configuration that specifies the number of blocks in the grid and the number of threads in each block.
- Built-in variables: threadIdx, blockIdx, and blockDim give each thread the coordinates it needs to identify its area of data to work on. We discussed the threadIdx, blockDim, and blockIdx variables in this chapter. In Chapter 3, Multidimensional Grids and Data, we will discuss more details of using these variables.
- Runtime API: functions such as cudaMalloc, cudaFree, and cudaMemcpy manage device global memory and data transfer between the host and the device.
Exercises
1. If we want to use each thread in a grid to calculate one output element of a
vector addition, what would be the expression for mapping the thread/block
indices to the data index (i)?
(A) i=threadIdx.x + threadIdx.y;
(B) i=blockIdx.x + threadIdx.x;
(C) i=blockIdx.x*blockDim.x + threadIdx.x;
(D) i=blockIdx.x*threadIdx.x;
2. Assume that we want to use each thread to calculate two adjacent elements of
a vector addition. What would be the expression for mapping the thread/block
indices to the data index (i) of the first element to be processed by a thread?
(A) i=blockIdx.x*blockDim.x + threadIdx.x + 2;
(B) i=blockIdx.x*threadIdx.x*2;
(C) i=(blockIdx.x*blockDim.x + threadIdx.x)*2;
(D) i=blockIdx.x*blockDim.x*2 + threadIdx.x;
3. We want to use each thread to calculate two elements of a vector addition.
Each thread block processes 2*blockDim.x consecutive elements that form
two sections. All threads in each block will process a section first, each
processing one element. They will then all move to the next section, each
processing one element. Assume that variable i should be the index for the
first element to be processed by a thread. What would be the expression for
mapping the thread/block indices to data index of the first element?
(A) i=blockIdx.x*blockDim.x + threadIdx.x + 2;
(B) i=blockIdx.x*threadIdx.x*2;
(C) i=(blockIdx.x*blockDim.x + threadIdx.x)*2;
(D) i=blockIdx.x*blockDim.x*2 + threadIdx.x;
4. For a vector addition, assume that the vector length is 8000, each thread
calculates one output element, and the thread block size is 1024 threads. The
programmer configures the kernel call to have a minimum number of thread
blocks to cover all output elements. How many threads will be in the grid?
(A) 8000
(B) 8196
(C) 8192
(D) 8200
5. If we want to allocate an array of v integer elements in the CUDA device
global memory, what would be an appropriate expression for the second
argument of the cudaMalloc call?
(A) n
(B) v
(C) n*sizeof(int)
(D) v*sizeof(int)
6. If we want to allocate an array of n floating-point elements and have a
floating-point pointer variable A_d to point to the allocated memory, what
would be an appropriate expression for the first argument of the cudaMalloc
() call?
(A) n
(B) (void*) A_d
(C) A_d
(D) (void**) &A_d
7. If we want to copy 3000 bytes of data from host array A_h (A_h is a pointer
to element 0 of the source array) to device array A_d (A_d is a pointer to
element 0 of the destination array), what would be an appropriate API call
for this data copy in CUDA?
(A) cudaMemcpy(3000, A_h, A_d, cudaMemcpyHostToDevice);
(B) cudaMemcpy(A_h, A_d, 3000, cudaMemcpyDeviceToHost);
(C) cudaMemcpy(A_d, A_h, 3000, cudaMemcpyHostToDevice);
(D) cudaMemcpy(3000, A_d, A_h, cudaMemcpyHostToDevice);
8. How would one declare a variable err that can appropriately receive the
returned value of a CUDA API call?
(A) int err;
(B) cudaError err;
(C) cudaError_t err;
(D) cudaSuccess_t err;
9. Consider the following CUDA kernel and the corresponding host function
that calls it:
__global__ void foo_kernel(float* a, float* b, unsigned int N) {
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N) {
        b[i] = 2.7f*a[i] - 4.3f;
    }
}

void foo(float* a_d, float* b_d) {
    unsigned int N = 200000;
    foo_kernel<<<(N + 128 - 1)/128, 128>>>(a_d, b_d, N);
}
References
Atallah, M.J. (Ed.), 1998. Algorithms and Theory of Computation Handbook. CRC Press.
Flynn, M., 1972. Some computer organizations and their effectiveness. IEEE Trans.
Comput. C-21, 948–960.
NVIDIA Corporation, March 2021. NVIDIA CUDA C Programming Guide.
Patt, Y.N., Patel, S.J., 2020. Introduction to Computing Systems: From Bits and Gates
to C and Beyond. McGraw-Hill.