
A very biased introduction to C programming and very very short introduction to CUDA programming

March 26, 2013

Dr Dennis Deng

Department of Electronic Engineering, La Trobe University

Contents

1 Aim

2 C language examples and exercises
   2.1 The simplest one
      2.1.1 Exercises
   2.2 Variables and the if-else command
      2.2.1 Exercises
   2.3 For-loop
      2.3.1 Exercises
   2.4 Array
      2.4.1 Exercises
   2.5 2-D array
      2.5.1 Exercises
   2.6 Pointers for arrays
   2.7 Command line input
      2.7.1 Exercises
   2.8 File input and output
      2.8.1 Exercises
   2.9 Functions
      2.9.1 Exercises
   2.10 Function using pointers
      2.10.1 Exercises
   2.11 An image processing example
      2.11.1 Exercises
   2.12 Make it run faster: optimization
      2.12.1 Exercises

3 A very very brief introduction to CUDA
   3.1 Motivation
   3.2 Terms and definitions
   3.3 Programming model
   3.4 CUDA specific functions
   3.5 Image processing example
      3.5.1 Main parts of the program
      3.5.2 Details of the GPU code
      3.5.3 The Program
      3.5.4 Compile and run
   3.6 Exercises

4 Linear index of 2D arrays
   4.1 C language
   4.2 Matlab

5 Run CUDA code in Matlab
   5.1 Basic steps
   5.2 Dealing with matrix element's index in CUDA kernel
   5.3 Using C language in Matlab and dealing with matrix element's index in the C kernel
      5.3.1 Example 1
      5.3.2 Example 2
      5.3.3 Example 3

6 Summary
1 Aim
This is a very biased introduction to C programming. It is biased because it does not cover everything. It
does not even cover the details of the programming language. The purpose of this introduction is to show
you what is required (the format) to quickly write your own C programs. I trust you can learn the details
by yourself. There are many different ways to learn a new language. I think one of the easiest is to copy
some sample code and study what it does. That is the way little kids learn to speak. Therefore, I will
present a series of code examples. Each one will illustrate certain points, and each one will be a little
more complicated than the previous example.

CUDA is a relatively new development. It is basically an extension of the standard C language. The
extension allows people with C programming knowledge to quickly learn how to program NVIDIA graphics
processors, which have massive parallel computing capabilities. An excellent web site for learning CUDA is:
http://courses.ece.illinois.edu/ece498/al/. NVIDIA's CUDA web page has links to many tutorials
and technical training materials: www.nvidia.com/cuda.

My plan is to use 4 lectures and 4 hours of lab time to cover the basics of C and CUDA programming.
Obviously, there is not enough time. The purpose of this introduction is to show you what you can do with
this new technology. I hope you will be motivated to quickly learn all the tools and tricks by yourself. I
also plan to use CUDA in ELE5IPC (ELE4IPC) in second semester¹.

Source code for all the examples can be found on the ELE4ASP web page.

2 C language examples and exercises

2.1 The simplest one

This is perhaps the simplest C program. You can also use it as a template.

#include <stdio.h>

int main()
{   // you can add your code here
    printf("Hello World\n");
    return 0;
}

You can use a text editor, such as the Matlab editor or Notepad, to write your own C code and save it in
your own directory as, in this case, Hello.c.

2.1.1 Exercises

How do we run it? We will use Microsoft Visual Studio 2005. Here are the steps

• In Windows, click “Start” -> “Program” -> “Microsoft Visual Studio 2005” -> “Visual Studio Tools”
-> “Visual Studio 2005 Command Prompt”. You will have a command window.
¹ ELE4IPC is not offered in 2012.

• Suppose your code is saved in a directory c:\abc. In the command window, you need to change the
directory by using: cd c:\abc. You can make sure your C program is in the current directory by listing
the files: dir *.c. If you can see the file Hello.c listed, then you can go on to the next step.

• To compile it, use: cl Hello.c -o Hello. Take a careful look at the result. There is a warning about
the -o option. We will keep this option anyway, because in another popular compiler, GCC, it is a
required option. There will be error messages if there is something wrong in your code.

• To run the program, simply type Hello. What do you see? Hello World!

2.2 Variables and the if-else command

Make a copy of Hello.c and rename it vie.c. To do this, you can type in the command window: copy
Hello.c vie.c. Open vie.c in the text editor and edit it so that it looks like the following.

#include <stdio.h>

int main()
{   int a = 10, b = 2;
    float c, PI = 3.14;
    if (a > b)
        c = (float)(a * a) * PI;
    else
        c = (float)(b * b) * PI;
    printf("The area of the larger circle is %f\n", c);
    return 0;
}

In this example, we introduce three new features: declaring variables [e.g., float c, PI = 3.14], converting a
number to a different type [e.g., (float)(a * a)] and the if-else structure.

2.2.1 Exercises

Given three different numbers a, b and c (you can assume they are integers), write a program to calculate
and print out the average and median value of these three numbers. The average is given by

    x = (a + b + c) / 3

and the median is the number that is less than the maximum of the three and greater than the minimum of
the three. You can use vie.c as a template.

2.3 For-loop

In this example, we show how to use the for-loop. We also show how to define a constant [#define PI 3.14].

#include <stdio.h>

#define PI 3.14

int main()
{   int n, N = 10;
    float s;
    for (n = 1; n < N; n++)
    {   s = (float)(n * n) * PI;
        printf("The area of the circle of radius %d is %f\n", n, s);
    }
    return 0;
}

2.3.1 Exercises

Modify this program such that it converts Australian dollars to US dollars. Your program should print
out the corresponding US dollar values for the following Australian dollar amounts: 10, 20, 30, ..., 100.
Hint: you can define the exchange rate as a constant, and instead of using n++ in the for-loop, use n = n + 10.

2.4 Array

In this example, we introduce three new features. When we use mathematical functions such as exp, we
should include the math library math.h. A 1-d array is a vector; you should declare arrays before using
them. In the statement b[n] = exp( -(float)(n)/2.f ) we use "2.f" to tell the compiler that this is
a floating point number. An important issue is the dimension of the array. When we declare float
a[10]; we define a vector of 10 elements. The first element is a[0] and the last is a[9]. What would happen
if you try something like a[-2] = -2 or c[0] = a[100] + PI? Un-comment the three lines after the line
// stupid errors, then compile and run the modified program. Unexpected results can happen. Be careful.

// array.c
#include <stdio.h>
#include <math.h>
#define PI 3.14

int main()
{   int n, N = 10;
    float a[10], b[10], c[10], s = 0;

    for (n = 0; n < N; n++)
    {   a[n] = (float)(n - N) * PI;
        b[n] = exp(-(float) n / 2.f);
        printf("%d %f %f\n", n, a[n], b[n]);
    }

    for (n = 0; n < N; n++)
    {   c[n] = a[n] * b[n];
        s = s + a[n] * a[n];
    }

    printf("The length of vector a is %f\n", s);

    // stupid errors
    // a[-2] = -2.f;
    // c[0] = a[100] + PI;
    // printf("Stupid errors a[-2]= %f and a[100] + PI= %f\n", a[-2], c[0]);

    return 0;
}

2.4.1 Exercises

Create two sinusoidal vectors

    s[n] = sin(ωn)    and    c[n] = cos(ωn)

where n = 0, 1, ..., N, N = 1023, and ω = 0.2π. Calculate the inner product of these two vectors. You can
use array.c as a template.

2.5 2-D array

This example illustrates how to deal with a 2-d array. As you can see from the print out, storage is
row-based: the elements of one row are stored next to each other. This has an important implication: we
can use a linear index for the elements of a 2-d array. Suppose we have defined a 2-d array a[NRow][MCol],
which has NRow rows and MCol columns. What is the linear index of a particular element a[n][m] at the
nth row and mth column? It is given by n * MCol + m. With this linear indexing scheme, we do not need
to deal with a 2-d array; we only need to deal with a 1-d array. It is interesting to point out that in
Matlab, a 2-d array is stored in a column-based way, so the linear index of element (n, m) of a 2-d array in
Matlab is (m - 1) * NRow + n.

// array2d.c
#include <stdio.h>
#include <math.h>
#define PI 3.14

int main()
{   int m, M, n, N;
    int a[3][4], b[3][4], c[4], d[4];

    M = 4;
    N = 3;

    for (n = 0; n < N; n++)
        for (m = 0; m < M; m++)
        {   a[n][m] = n * M + m;  // define the matrix a of 3 rows and 4 columns
            b[n][m] = n;          // define the matrix b
        }

    for (n = 0; n < N; n++)
    {   for (m = 0; m < M; m++)
        {   printf("%d ", a[n][m]); }
        printf("\n");
    }
    printf("\n");
    for (n = 0; n < N; n++)
    {   for (m = 0; m < M; m++)
        {   printf("%d ", b[n][m]); }
        printf("\n");
    }
    return 0;
}

2.5.1 Exercises

• Modify the code array2d.c such that it calculates the element-wise addition and multiplication of the
two matrices a and b. Store the results in two new matrices d and e (you will need to declare them with
the right sizes).

• Use linear indexing to perform the above tasks.

• Define a vector f[4] = [1 2 3 4]^T. Modify the code to calculate the matrix-vector multiplication g = a * f.

2.6 Pointers for arrays

Pointers are closely related to arrays. In this example, we show how to use pointers to allocate a 1-d and
a 2-d array and how to use pointers to access the elements of the arrays.

// pointer_array.c

#include <stdio.h>
#include <math.h>
#include <stdlib.h>  // defines malloc() and free()

int main()
{   int *a;        // integer pointer
    float *b, *c;  // float pointers
    int Length, NRow, MCol, n, m, k;

    Length = 5;
    NRow = 10;
    MCol = 5;

    // allocate memory for the arrays
    a = (int *) malloc(Length * sizeof(int));
    b = (float *) malloc(NRow * MCol * sizeof(float));
    c = (float *) malloc(NRow * sizeof(float));  // c = b * a has NRow elements

    for (k = 0; k < Length; k++)
        a[k] = k;

    for (n = 0; n < NRow; n++)
        for (m = 0; m < MCol; m++)
        {   k = n * MCol + m;
            b[k] = k;  // define the matrix b of NRow rows and MCol columns
        }

    // print a
    printf("Printing a\n");
    for (k = 0; k < Length; k++)
        printf("%d\n", a[k]);

    printf("\n");
    // print b
    printf("Printing b\n");
    for (n = 0; n < NRow; n++)
    {   for (m = 0; m < MCol; m++)
        {   printf("%f ", b[n * MCol + m]); }
        printf("\n");
    }

    // calculate c = b * a
    for (n = 0; n < NRow; n++)
    {   c[n] = 0;
        for (m = 0; m < MCol; m++)
            c[n] = c[n] + (float) a[m] * b[n * MCol + m];
    }

    printf("Printing c\n");
    // print c
    for (k = 0; k < NRow; k++)
        printf("%f\n", c[k]);

    // finished using the memory, free it
    free(a);
    free(b);
    free(c);
    return 0;
}

Here is another example, min_max.c, which fills a matrix with random numbers and finds its minimum and maximum:
// min_max.c

#include <stdio.h>
#include <math.h>
#include <stdlib.h>  // defines rand()
#include <time.h>

int main()
{
    float *a, max, min;
    int NRow, MCol, n, m, k;

    NRow = 4;
    MCol = 5;

    // allocate memory for the array
    a = (float *) malloc(NRow * MCol * sizeof(float));

    // initialize random seed:
    srand(time(NULL));

    for (n = 0; n < NRow; n++)
        for (m = 0; m < MCol; m++)
        {   k = n * MCol + m;
            a[k] = (float) rand();
        }

    // print a
    printf("Printing a\n");
    for (n = 0; n < NRow; n++)
    {
        for (m = 0; m < MCol; m++)
            printf("%f ", a[n * MCol + m]);
        printf("\n");
    }

    max = -1e12;
    min = 1e12;

    for (n = 0; n < NRow; n++)
        for (m = 0; m < MCol; m++)
        {   k = n * MCol + m;
            if (a[k] >= max)
                max = a[k];
            if (a[k] <= min)
                min = a[k];
        }

    printf("Max = %f Min = %f\n", max, min);

    // finished using the memory, free it
    free(a);
    return 0;
}

2.7 Command line input

In the previous example, we had to define the dimensions of the array in the program. Once the program
is compiled, the dimensions are fixed. If we want to change them, we need to change the program, then
compile and run it again. There is a better way. What is new in the following example is that the inputs
are collected from the command line.

The function main() is declared as main(int argc, char **argv). This is a fixed format. The variable
argc counts the number of inputs and the character pointers in argv store the inputs. For example,
argv[1] stores the first input. There is an important point here. The program name is counted as the
first input (with index 0), so the first parameter is at index 1 and so on. If two inputs are expected, then
argc is 3. In the following example, we test whether the number of inputs satisfies the requirement. If not,
we print an error message and terminate the program. This is an important issue, because we want the
program to run as it is intended to run.

// min_max_ci.c

#include <stdio.h>
#include <math.h>
#include <stdlib.h>  // defines rand(), atoi() and exit()
#include <time.h>

int main(int argc, char **argv)
{
    float *a, max, min;
    int NRow, MCol, n, m, k;

    if (argc != 3)  // number of inputs + 1
    {
        printf("Two inputs required\n");
        printf("Usage: min_max_ci NRow MCol\n");
        exit(0);
    }
    else
    {
        NRow = atoi(argv[1]);  // get the 1st input and convert to integer
        MCol = atoi(argv[2]);  // get the 2nd input and convert to integer
    }

    if (NRow < 0 || MCol < 0)
    {   printf("Non-negative inputs required\n");
        printf("Usage: min_max_ci NRow MCol\n");
        exit(0);
    }

    printf("NRow = %i MCol = %i\n", NRow, MCol);

    // allocate memory for the array
    a = (float *) malloc(NRow * MCol * sizeof(float));

    // initialize random seed:
    srand(time(NULL));

    for (n = 0; n < NRow; n++)
        for (m = 0; m < MCol; m++)
        {   k = n * MCol + m;
            a[k] = (float) rand();
        }

    // print a
    printf("Printing a\n");
    for (n = 0; n < NRow; n++)
    {
        for (m = 0; m < MCol; m++)
            printf("%f ", a[n * MCol + m]);
        printf("\n");
    }

    max = -1e12;
    min = 1e12;

    for (n = 0; n < NRow; n++)
        for (m = 0; m < MCol; m++)
        {   k = n * MCol + m;
            if (a[k] >= max)
                max = a[k];
            if (a[k] <= min)
                min = a[k];
        }

    printf("Max = %f Min = %f\n", max, min);

    // finished using the memory, free it
    free(a);
    return 0;
}

2.7.1 Exercises

• Compile the program by using: cl min_max_ci.c -o min_max_ci and run the program by typing:
min_max_ci 5 6. What do you see?

• You can do a test: min_max_ci 5 (supply only one parameter) or min_max_ci 5 6 7 (supply more
than two parameters). What are the results?

• You can do another test: min_max_ci 5 -6 (supply a negative number). What do you see?

• Obviously, a negative number is not allowed. Modify the code such that it can detect a negative
number, print an error message and then terminate.

2.8 File input and output

Here is another application of command line input. We want to process certain data, such as an image
stored in a file. We need to read it into an array (1-d or 2-d), process it and write the output to a file. This
example shows you how to do it. The line "FILE *fpr, *fpw;" declares pointers to the files you want to
read and write. You then use command line input to obtain the file names. Next you use "fopen" to
associate each pointer with a file name (e.g., argv[1]) and to say whether the file is intended for read
(e.g., "rb") or write (e.g., "wb") operations.

You also need to declare pointers to store the data. In this case, the type is unsigned char, which is 8 bits,
because we assume the image data is 8 bits/pixel. You should be able to understand the rest of the code.

// fileIO.c
// read an image and write it out to another file
#include <stdio.h>
#include <stdlib.h>  // defines malloc(), atoi() and exit()
#include <time.h>

int main(int argc, char **argv)
{
    FILE *fpr, *fpw;
    unsigned char *in, *out;
    int NRow, MCol, n, m, k;

    if (argc != 5)  // number of inputs + 1
    {
        printf("Four inputs required\n");
        printf("Usage: fileIO input output NRow MCol\n");
        exit(0);
    }
    else
    {   fpr = fopen(argv[1], "rb");  // get the pointer to read the input image
        fpw = fopen(argv[2], "wb");  // get the pointer to write the output image
        NRow = atoi(argv[3]);        // size of the image
        MCol = atoi(argv[4]);        // size of the image
    }

    // if something is still wrong!
    if (fpr == 0 || fpw == 0 || NRow < 0 || MCol < 0)
    {   printf("Usage: fileIO input output NRow MCol\n");
        exit(0);
    }

    // allocate memory for the two arrays
    in  = (unsigned char *) malloc(NRow * MCol * sizeof(unsigned char));
    out = (unsigned char *) malloc(NRow * MCol * sizeof(unsigned char));

    // read image
    fread(in, sizeof(unsigned char), NRow * MCol, fpr);

    // the output is the same as the input
    for (n = 0; n < NRow; n++)
        for (m = 0; m < MCol; m++)
        {   k = n * MCol + m;  // use linear indexing
            out[k] = in[k];
        }

    // write data to a file
    fwrite(out, sizeof(unsigned char), NRow * MCol, fpw);

    // finished using the memory, free it
    free(in);
    free(out);

    // close the files
    fclose(fpr);
    fclose(fpw);

    printf("%d\n", CLOCKS_PER_SEC);
    return 0;
}

2.8.1 Exercises

• Download the image file “airplane”. Compile the program and run it using: fileIO airplane out 512 512.
We assume the image size is 512 × 512. You can compare the two files by using: comp airplane out. If
the two files are the same, you should have a message like: “Files compare OK”.

• Modify the code fileIO.c such that it reads an input image and determines the minimum and
maximum pixel values of the image.

2.9 Functions

A typical function takes inputs from the main function, performs certain calculations and then returns
the result to the main function. Here is an example. The main function passes two numbers to the function,
which calculates the sum and returns the result.

#include <stdio.h>
int add(int x, int y);

int main()
{   int a = 1, b = 2;
    int c;

    c = add(a, b);
    printf("%d + %d = %d\n", a, b, c);
    return 0;
}

int add(int x, int y)
{   int sum = 0;
    sum = x + y;
    return sum;
}

2.9.1 Exercises

• Modify the program to allow a user to input the two integers from the command line.

• Write a function that returns the larger number of the two input numbers.

2.10 Function using pointers

Here is a very simple example which shows you how to define a function for adding two vectors. The function
accepts pointers to the two input arrays and to the output array. These pointers are best understood as
containing the addresses of particular arrays. For example, when the function is called as
"vectorAdd(c, a, b, Length);", the address of a[0] is passed to the parameter x, so x[0] is the same
element as a[0]. Similarly, z receives the address of c[0], so when the function writes to z[n] it is writing
directly into c[n].

// define a function

#include <stdio.h>
#include <stdlib.h>  // defines malloc() and free()

void vectorAdd(int *z, int *x, int *y, int Length);

int main()
{   int *a, *b, *c;
    int Length, n;

    Length = 10;

    // memory allocation
    a = (int *) malloc(Length * sizeof(int));
    b = (int *) malloc(Length * sizeof(int));
    c = (int *) malloc(Length * sizeof(int));

    // make two vectors
    for (n = 0; n < Length; n++)
    {   a[n] = n;
        b[n] = Length - n;
    }

    // call the function
    vectorAdd(c, a, b, Length);

    // print result
    for (n = 0; n < Length; n++)
        printf("c[%d] = %d\n", n, c[n]);

    free(a);
    free(b);
    free(c);
    return 0;
}

void vectorAdd(int *z, int *x, int *y, int Length)
{   int n;

    for (n = 0; n < Length; n++)
        z[n] = x[n] + y[n];
}

2.10.1 Exercises

• Compile and run the program vectorAdd.c. Does the result make sense? Change Length to 20.
Can you predict the print out? Compile and run the modified program.

• Modify the program such that it calculates the inner product of two vectors: s = Σ_n a[n] * b[n]. Hint:
the function now returns an integer instead of returning the result through a pointer argument. You
can refer to section 2.9 for an example.

2.11 An image processing example

An interesting image processing algorithm is called gamma correction. Let an image be stored in a 1-d array
in[n] (we use linear indexing, see section 2.6). The processed image is given by

    out[n] = 255 * (in[n] / 255)^γ

Here we assume the image is 8 bits/pixel.


The following program involves almost all the programming knowledge we have reviewed: command line
input, file input/output, pointers, linear indexing for 2-d arrays, for-loops, etc. This program is a slightly
modified version of the program fileIO.c. We have added the line "gamma = atof(argv[5]);" to get the
gamma value. Do your homework to find out what "atof" means. We use the "clock" function to measure
how many clock ticks the program takes to process the image. We should know that measuring the
running time of a program is not a simple matter. The clock function only provides a very rough indication
of the running time. You should also pay attention to the line "out[k] = (unsigned char)(255.f
* pow((float)in[k]/255.f, gamma));", where we have used 255.f to indicate a floating point constant
and we have made suitable type conversions, e.g., (float)in[k].

// imagePow.c
// read an image, apply gamma correction, and write it out to another file
#include <stdio.h>
#include <math.h>
#include <stdlib.h>  // defines malloc(), atoi(), atof() and exit()
#include <time.h>

int main(int argc, char **argv)
{
    FILE *fpr, *fpw;
    unsigned char *in, *out;
    int NRow, MCol, n, m, k;
    clock_t t;
    float gamma;

    if (argc != 6)  // number of inputs + 1
    {
        printf("Five inputs required\n");
        printf("Usage: imagePow input output NRow MCol gamma\n");
        exit(0);
    }
    else
    {   fpr = fopen(argv[1], "rb");  // get the pointer to read the input image
        fpw = fopen(argv[2], "wb");  // get the pointer to write the output image
        NRow = atoi(argv[3]);        // size of the image
        MCol = atoi(argv[4]);        // size of the image
        gamma = atof(argv[5]);       // gamma
    }

    // if something is still wrong!
    if (fpr == 0 || fpw == 0 || NRow < 0 || MCol < 0)
    {   printf("Usage: imagePow input output NRow MCol gamma\n");
        exit(0);
    }

    // allocate memory for the two arrays
    in  = (unsigned char *) malloc(NRow * MCol * sizeof(unsigned char));
    out = (unsigned char *) malloc(NRow * MCol * sizeof(unsigned char));

    // read image
    fread(in, sizeof(unsigned char), NRow * MCol, fpr);

    t = clock();
    // gamma-correct every pixel
    for (n = 0; n < NRow; n++)
        for (m = 0; m < MCol; m++)
        {   k = n * MCol + m;  // use linear indexing
            out[k] = (unsigned char)(255.f * pow((float) in[k] / 255.f, gamma));
        }
    t = clock() - t;

    // write data to a file
    fwrite(out, sizeof(unsigned char), NRow * MCol, fpw);

    // finished using the memory, free it
    free(in);
    free(out);

    // close the files
    fclose(fpr);
    fclose(fpw);

    printf("Number of clock ticks used in processing the image = %ld\n", (long) t);
    return 0;
}

2.11.1 Exercises

• Compile the program: "cl imagePow.c -o imagePow" and run it: "imagePow airplane out0.5 512 512 0.5".

• In Matlab you can read the image and display it:


a = fopen('airplane', 'r');
b = fread(a, [512 512]);
fclose(a);
a = fopen('out0.5', 'r');
c = fread(a, [512 512]);
fclose(a);
imshow(double([b c])/256)

• You can experiment with a number of settings for gamma (0.1, 0.5, 1.5, 2) and display the resulting
images.

• Modify the C program imagePow.c such that the image processing task (the double for-loop) is
performed in a function. You can use the program in section 2.10 as an example. The function is given
below. You should NOT look at it unless you really do not know how to write your own.

void gammaCorrection(unsigned char *x, unsigned char *y, int N, int M, float g)
{   int n, m, k;

    // gamma correction: x is the input image, y is the output image
    for (n = 0; n < N; n++)
        for (m = 0; m < M; m++)
        {   k = n * M + m;  // use linear indexing
            y[k] = (unsigned char)(255.f * pow((float) x[k] / 255.f, g));
        }
}

You need to replace the double for-loops with the following:


gammaCorrection(in, out, NRow, MCol, gamma);

2.12 Make it run faster: optimization


Now we can write C programs to perform quite complicated signal processing tasks. Can we make our
programs run faster? There are many tricks and tools for writing computationally efficient C programs. One
of the easiest ways is to write the program in a "standard" way and let the compiler do the hard work of
optimizing it. In the command window, type cl /? and you will see a lot of options.

Some are related to optimization for faster running speed. For example, the option /O2 will maximize
running speed. The option /fp:fast will use less accurate but faster floating point calculations. The
option /arch:SSE2 will enable the use of instructions available on SSE2-enabled CPUs. By default, all
optimization options are disabled!

According to Wikipedia, Streaming SIMD Extensions (SSE) is a SIMD instruction set extension to the x86
architecture, designed by Intel and introduced in 1999 in their Pentium III series processors as a reply to
AMD's 3DNow!.

2.12.1 Exercises

• Compile and run the program imagePow.c using different combinations of optimization options and
observe the running time. For example, you can compile it without any optimization: cl imagePow.c
-o imagePow. Then you can use:

– (a) cl /O2 imagePow.c -o imagePow1,


– (b) cl /O2 /fp:fast imagePow.c -o imagePow2
– (c) cl /O2 /fp:fast /arch:SSE2 imagePow.c -o imagePow3

• Perform a literature search to understand the benefit of using SSE2 or other processor-specific instruction sets.

• Perform a literature search to understand the optimization options of the GCC and Intel C compiler.

3 A very very brief introduction to CUDA

3.1 Motivation
From a scientific computation point of view, the processor takes data from the memory, performs certain
calculations and writes the result back to the memory. The program (software) tells the processor what to

do. The processed result will be written into a file and stored in a hard disk. For signal/image processing
applications, it is usually the case that the same set of operations will be performed on many data. For
example, in the gamma correction case, the following calculation is performed on all pixels:
out[k] = 255 * (in[k] / 255)^γ

where for a 512 × 512 image k = 0, 1, ..., 262143. For non-parallel programs, one pixel is processed at a time
and in a sequential way. So using a for-loop is a natural way to go

N = 512 * 512;
for ( k = 0 ; k < N ; k ++)
out[k] = (unsigned char) floor( 255.f * pow( (float) in[k] / 255.f, gamma ) );

Here is an interesting point. The for-loop index k is actually used as an index for the location of the
pixel. Suppose the processor can finish the calculation for one pixel in tc seconds. Then it takes roughly
N × tc seconds to process the image.
Now we have graphics processors (GPU) which have hundreds of processing units in them. It makes sense to
utilize this computing power. In the above example, we can ask each processing unit in the GPU to process
one pixel simultaneously. Suppose we have M processing units and it takes tg seconds to process one pixel.
Then it takes roughly N × tg /M seconds to process the image. The increase in processing speed can be
calculated as (not accurate!)
(N × tc) / (N × tg / M) = M × (tc / tg)

Obviously, more and faster processing units in a GPU will lead to a higher speed-up ratio.
An interesting point is that just as in the for-loop we can use the loop index as the location index of the
pixel, in CUDA we can use the thread ID as the location index.

3.2 Terms and definitions

The following information is collected from a number of sources: including the UIUC lecture notes and the
CUDA technical training material from NVIDIA.
A typical GPU (the G80 chip from NVIDIA) has 16 streaming multiprocessors (SM). Each SM has 8 streaming
processors (SP), giving 128 SPs in total. Each SP, running at 1.35 GHz, has a multiply-add unit and a multiply
unit. Each SM supports up to 768 threads. The following is a list of terms related to GPU computing.
Host: the computer
Device: the graphics processor
Kernel: a computational task such as the calculation of the gamma correction (the line after the for-loop in
the previous example)
Thread: a single instance of the kernel, working on one piece of data (e.g., one pixel)
Grid: organization of threads into blocks
Block: a 3D array of threads

Block ID: blockIdx.x and blockIdx.y
Thread ID: threadIdx.x, threadIdx.y, and threadIdx.z
In general terms, a thread is related to a processing task (for example the processing of one pixel in the
gamma correction case). When a kernel function is launched, it is executed as a grid of parallel threads.
This can be regarded as a way the run-time system allocates processor resources for the parallel running
of the kernel function. The thread grid is organized into a two-level hierarchy. On the top level, we define
the thread blocks. Each block has a 2-D ID given by the CUDA built-in variables blockIdx.x and blockIdx.y. All
thread blocks must have the same number of threads organized in the same way. A thread block is organized
as a 3D array of threads with a total of up to 512 threads. The ID for a thread in a block is given by the three
indices: threadIdx.x, threadIdx.y, and threadIdx.z.
For example, we can define a 4 × 5 grid of blocks and define each block as a 6 × 7 × 1 array. Then the total
number of threads in this grid is 4 × 5 × 6 × 7 = 840. A particular thread in the block (1, 2) and at the
location (3, 4, 0) of the array has the following IDs
blockIdx.x = 1
blockIdx.y = 2
threadIdx.x = 3
threadIdx.y = 4
threadIdx.z = 0

In fact, these are built-in device variables (declared as follows) which are accessible by all __global__
and __device__ functions
dim3 gridDim
dim3 blockDim
dim3 blockIdx
dim3 threadIdx

3.3 Programming model

The CUDA programming model is simple. A C program which runs in the host is responsible for setting
up the variables for both host and device, data input and output, copying data from host to device, calling
the kernel function which runs in the device, and copying data from the device back to the host once the
processing is finished. It is then possible to invoke another kernel function.
The syntax for invoking a kernel function is as follows

dim3 dimGrid(grid_size_x, grid_size_y);
dim3 dimBlock(block_size_x, block_size_y, block_size_z);
Kernel_Function <<< dimGrid, dimBlock >>> (parameters of the kernel function);

where Kernel_Function is the name of the kernel, dimGrid and dimBlock are the definitions of the thread
grid, and the last items are parameters that are passed from the host to the kernel function.

For a 2-d thread block (block_size_z = 1), the 2-d thread ID for a particular block with indices (blockIdx.x,
blockIdx.y) and the thread indices within the block (threadIdx.x, threadIdx.y) are given as follows
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
Note that this is a common way to “address” a 2D thread within a grid. You should contrast this to the
double for-loops in the C program. Roughly speaking, in the C program you use the double for-loops to
tell the CPU to process each pixel of the image in a sequential way. In CUDA, the runtime system will
look after the calculation of the indices idx and idy and distribute the associated processing tasks to as
many processors as possible at the same time. As such, we do not know which pixel is processed by which
processor.

3.4 CUDA specific functions

There are a number of CUDA-specific functions, such as cudaMalloc, cudaMemcpy, cudaFree, and cudaThreadSynchronize, which are used in the example below.

3.5 Image processing example

3.5.1 Main parts of the program

We will use gamma-correction as an example. The C program will call a function that runs on the CPU and
another function that runs on the GPU. The following is a list of key steps in the program

1. Define the GPU and CPU functions

2. Declare variables

3. Take care of command line input

4. Memory allocation for arrays in host and device

5. Read image

6. Run GPU code

(a) Define the grid and block


(b) Copy image data to device and check errors
(c) Warm up - running cudaThreadSynchronize();
(d) Run GPU code
(e) Wait until all tasks finished and check running error
(f) Copy data from device to host and check error
(g) Free device memory

7. Write data to a file out0.5

8. Run CPU code

9. Calculate the maximum error and the sum of absolute errors between the results from the CPU code and
the GPU code. The results are not necessarily exactly the same!

10. Free host memory.

3.5.2 Details of the GPU code

• Block and grid definition

In this program, we define a 16 × 16 thread block. This is done through the two macros
//define 16x16 thread block
#define BLOCKDIM_X 16
#define BLOCKDIM_Y 16
There are 256 threads in one block, which is within the limit of 512 threads per block.
The grid dimension is calculated by using the function
int iDivUp(int a, int b)
{ return ((a % b) != 0) ? (a / b + 1) : (a / b); }
This function returns a/b if a is divisible by b; otherwise it returns a/b + 1 (integer division). The grid
dimension is defined as
dim3 dimGrid(iDivUp(MCol, BLOCKDIM_X), iDivUp(NRow, BLOCKDIM_Y));
It calculates how many blocks are sufficient to cover all pixels of the image.

• Memory allocation and data transfer

This is through the function cudaMalloc and cudaMemcpy


//allocate memory for device
cudaMalloc( (void **) &d_in, NRow * MCol * sizeof(unsigned char));
cudaMalloc( (void **) &d_out, NRow * MCol * sizeof(unsigned char));
//copy data to device
cudaMemcpy( d_in, in, NRow * MCol * sizeof(unsigned char), cudaMemcpyHostToDevice );

• The kernel function

The definition of a kernel function is similar to that of a normal C function except for the keyword
__global__.
Since we are dealing with an image, we use a 2-D thread ID to “address” each pixel. This is similar to the
double for-loops in normal CPU code, where we use for-loop indices to address pixels. The difference is that
these threads are allocated to many SPs and the kernel is executed simultaneously. The calculation of the 2-D
thread ID (idx and idy) follows a common pattern. You can use it as a template. Note also that in this code
we use the linear index method.

__global__ void gammaCorrectionGPU(unsigned char *d_in, unsigned char *d_out,
                                   const int NROW, const int NCOL, const float gamma)
{
    float x0;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int k;
    if (idx < NCOL && idy < NROW)
    {
        k = idy * NCOL + idx; // linear index
        x0 = (float) d_in[k];
        d_out[k] = (unsigned char) floor(255.f * pow(x0 / 255.f, gamma));
    }
}
There is some interesting detailed information on what happens when the kernel function is invoked and
threads (tasks) are assigned to SPs in D. Kirk and W-M Hwu’s lecture notes (Chapter 3). There are also
limitations on how many blocks can be assigned to an SM and thus how many blocks can run on a particular
processor. There are also issues related to thread scheduling and hardware resources such as the number
of registers used by a thread. These issues must be considered when defining the dimensions of the grid.
Due to time limitations, we will skip these details.

• The rest of the code

The rest of the code is quite easy to understand


//run GPU code
gammaCorrectionGPU <<< dimGrid, dimBlock >>> ( d_in, d_out, NRow, MCol, gamma);
// wait until the device has completed all computations
cudaThreadSynchronize();
// check if kernel execution generated an error
checkCUDAError("kernel execution");
// device to host copy
cudaMemcpy( out_GPU, d_out, NRow * MCol * sizeof(unsigned char), cudaMemcpyDeviceToHost );
// Check for any CUDA errors
checkCUDAError("cudaMemcpy");

• Running time

We use the function clock() from the library “time.h”. This function returns the number of clock ticks elapsed
since the program was launched. We use it to measure the running time of the CPU and the GPU code.

3.5.3 The Program

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b)) // define a max macro
#define BLOCKDIM_X 16
#define BLOCKDIM_Y 16

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
        exit(-1);
    }
}

int iDivUp(int a, int b)
{
    return ((a % b) != 0) ? (a / b + 1) : (a / b);
}

// GPU function
__global__ void gammaCorrectionGPU(unsigned char *d_in, unsigned char *d_out,
                                   const int NROW, const int NCOL, const float gamma)
{
    float x0;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int k;

    if (idx < NCOL && idy < NROW)
    {
        k = idy * NCOL + idx;
        x0 = (float) d_in[k];
        d_out[k] = (unsigned char) floor(255.f * pow(x0 / 255.f, gamma));
    }
}

// CPU function
void gammaCorrectionCPU(unsigned char *x, unsigned char *y, int N, int M, float g)
{
    int n, m, k;

    // gamma correction
    for (n = 0; n < N; n++)
        for (m = 0; m < M; m++)
        {
            k = n * M + m; // use linear indexing
            y[k] = (unsigned char) floor(255.f * pow((float) x[k] / 255.f, g));
        }
}

/////////////////////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    FILE *fpr, *fpw;
    int NRow, MCol, max_diff, sum, t;
    int i, j, k;
    float gamma;

    // pointers for host memory
    unsigned char *in, *out, *out_GPU;
    // pointers for device memory
    unsigned char *d_in, *d_out;

    if (argc != 6) // number of inputs + 1
    {
        printf("Five inputs required\n");
        printf("Usage: gammaCorrection input output NRow Mcol gamma\n");
        exit(0);
    }
    else
    {
        fpr = fopen(argv[1], "rb"); // file pointer to read input image
        fpw = fopen(argv[2], "wb"); // file pointer to write output image
        NRow = atoi(argv[3]);       // size of the image
        MCol = atoi(argv[4]);       // size of the image
        gamma = atof(argv[5]);      // gamma
    }

    // if something is still wrong!
    if (fpr == 0 || fpw == 0 || NRow < 0 || MCol < 0)
    {
        printf("Usage: gammaCorrection input output NRow Mcol\n");
        exit(0);
    }

    // allocate memory for host
    in = (unsigned char *) malloc(NRow * MCol * sizeof(unsigned char));
    out = (unsigned char *) malloc(NRow * MCol * sizeof(unsigned char));
    out_GPU = (unsigned char *) malloc(NRow * MCol * sizeof(unsigned char));

    // allocate memory for device
    cudaMalloc((void **) &d_in, NRow * MCol * sizeof(unsigned char));
    cudaMalloc((void **) &d_out, NRow * MCol * sizeof(unsigned char));

    // read image
    fread(in, sizeof(unsigned char), NRow * MCol, fpr);
    fclose(fpr);

    // define grid and block size for GPU
    dim3 threads(BLOCKDIM_X, BLOCKDIM_Y);
    dim3 grid(iDivUp(MCol, BLOCKDIM_X), iDivUp(NRow, BLOCKDIM_Y));
    // dim3 grid(64, 64);

    // starting time for GPU
    t = clock();

    // run the GPU code repeatedly for 20 times
    for (k = 0; k < 20; k++)
    {
        // copy data to device
        cudaMemcpy(d_in, in, NRow * MCol * sizeof(unsigned char), cudaMemcpyHostToDevice);
        cudaMemcpy(d_out, in, NRow * MCol * sizeof(unsigned char), cudaMemcpyHostToDevice);

        // warm up
        cudaThreadSynchronize();

        // run GPU code
        gammaCorrectionGPU<<<grid, threads>>>(d_in, d_out, NRow, MCol, gamma);

        // wait until the device has completed all computations
        cudaThreadSynchronize();

        // check if kernel execution generated an error
        checkCUDAError("kernel execution");

        // device to host copy
        cudaMemcpy(out_GPU, d_out, NRow * MCol * sizeof(unsigned char), cudaMemcpyDeviceToHost);

        // check for any CUDA errors
        checkCUDAError("cudaMemcpy");
    }

    // free device memory
    cudaFree(d_in);
    cudaFree(d_out);

    // finish time for GPU
    t = clock() - t;
    printf("It took %d clock ticks to process the signal using GPU\n", t);

    // write data to file
    fwrite(out_GPU, sizeof(char), NRow * MCol, fpw);
    fclose(fpw);

    // CPU version
    t = clock();

    // run CPU code 20 times
    for (k = 0; k < 20; k++)
        gammaCorrectionCPU(in, out, NRow, MCol, gamma);

    t = clock() - t;
    printf("It took %d clock ticks to process the signal using CPU\n", t);

    // see if there is a difference between the two results
    sum = 0;
    max_diff = 0;
    for (i = 0; i < NRow; ++i)
        for (j = 0; j < MCol; ++j)
        {
            k = i * MCol + j;
            max_diff = MAX(max_diff, abs(out_GPU[k] - out[k]));
            sum = sum + abs(out_GPU[k] - out[k]);
        }

    printf("Max_diff and sum of abs diff are %d %d\n", max_diff, sum);

    // free host memory
    free(out_GPU);
    free(in);
    free(out);

    return 0;
}

3.5.4 Compile and run

To compile this program, which is called “gammaCorrection.cu”, you need to use the compiler nvcc in the
Visual Studio Command Prompt window.

• In Windows, click “Start” -> “Programs” -> “Microsoft Visual Studio 2010” -> “Visual Studio Tools”
-> “Visual Studio 2010 Command Prompt”. You will have a command window.

• Suppose your code is saved in a directory z:\abc. In the command window, you need to change the
directory by using: cd z:\abc. You can make sure your program is in the current directory by listing
the files: dir *.cu . If you can see the file gammaCorrection.cu listed, then you can go on to the next
step. Otherwise, you need to download the source code from the LMS web page and store it in the
current directory.

• The simplest way to compile the program is as follows:


nvcc gammaCorrection.cu -o gammaCorrection

• In the command window, you can run the program by:


gammaCorrection airplane out0.5 512 512 0.5
You should have a new file called out0.5 in your current directory. Running the program gammaCorrection,
you should see the number of clock ticks that the GPU and the CPU take to finish the processing.
On my desktop machine with an Intel Core 2 Duo 2.4 GHz processor and an NVIDIA 9800GT GPU, the
program reports 0 and 47 clock ticks for the GPU and CPU, respectively.

• You can load result and the original image into Matlab and display it by using the following commands:
a = fopen(’out0.5’,’r’);
b = fread(a, [512 512]);
imshow(b/256, []).

• To make the program run faster, we can turn on the optimization options for both the CPU and the GPU. We
have discussed the CPU options in 2.12. To pass a C compiler option, we need to follow the format:
-Xcompiler /option. For the GPU, we can use the option -use_fast_math. Another useful option is
-Xptxas=-v. It tells us some information about the hardware usage, e.g., the number of registers used
and the number of bytes of constant and shared memory used.

• Here is an example of turning on the O2 optimization of the C compiler:

nvcc -Xcompiler /O2 gammaCorrection.cu -o gammaCorrection

• A more complicated example is as follows:

nvcc -use_fast_math -Xptxas=-v -Xcompiler /O2 -Xcompiler /fp:fast gammaCorrection.cu
-o gammaCorrection
By turning on the optimization options, I was able to reduce the running time for the CPU code from 47
clock ticks to about 32.

3.6 Exercises

• Identify the GPU and CPU sections of the code in the main function. How many times are the GPU
and CPU codes repeatedly run?

• Compile and run the program gammaCorrection.cu using: nvcc gammaCorrection.cu -o gammaCorrection.
How many clock ticks does it take for the GPU and the CPU to process the image? Load the
result into Matlab and display the processed and original images side-by-side.

• Use some compiler options:


nvcc -use_fast_math -Xptxas=-v -Xcompiler /O2 -Xcompiler /fp:fast gammaCorrection.cu -o gammaCorrection.
How many registers does the kernel use? How many bytes of shared memory [smem] and constant memory [cmem]
does the kernel use? How many clock ticks does it take for the GPU and the CPU to process the image?

• Modify the code for the GPU and CPU sections such that both the GPU and CPU processing code is
repeatedly run 100 times. (Hint: you need to change the for-loops.)

• Compile and run the program again. Based on the timing numbers for the GPU and the CPU, estimate
how many times the processing speed of the GPU is faster than that of the CPU.

4 Linear index of 2D arrays

4.1 C language

When we say we have an array a[3][4], we mean the array has 3 rows and 4 columns. Generally, the size of
the array is represented by something like “NRow” and “MCol”, where the former is the number of rows and the
latter the number of columns. In the C language, matrix data is stored row by row (row-major order). Thus the linear
address of a particular element a[m][n] is calculated as: id = m * MCol + n. For example, the linear
index of the element at the 3rd row and 2nd column, a[2][1], is calculated as: id = 2*4+1 = 9.
This point is important, as the same notation is used in calculating the thread index. For example, we define
a (3 × 5) grid. Each block has (6 × 7) threads. The two variables: blockDim.x and blockDim.y are then
given by
blockDim.x = 7
blockDim.y = 6
The horizontal index of a thread, defined as idx, is calculated as
idx = blockIdx.x * blockDim.x + threadIdx.x;
which is actually the column coordinate of the thread. Similarly, the vertical index of the thread, defined as
idy, is calculated as
idy = blockIdx.y * blockDim.y + threadIdx.y;
which is actually the row coordinate of the thread. For an image with SizeCol columns, the linear index is then
given by
id = idy * SizeCol + idx.
Here is an example to illustrate the above points. Suppose we have two images of the size (480 × 720) stored
in two 1-D arrays: A and B. Suppose we want to blend the two images using C = 0.3 * A + 0.7 * B. In C
language, we have
SizeCol = 720; // number of columns
SizeRow = 480; // number of rows
for (int row = 0; row < SizeRow; ++row)
for (int col = 0; col < SizeCol; ++col)
{ id = row * SizeCol + col;
C[id] = 0.3 * A[id] + 0.7 * B[id];
}
In GPU code, we have
SizeCol = 720;
SizeRow = 480;
idx = blockIdx.x * blockDim.x + threadIdx.x; // col index
idy = blockIdx.y * blockDim.y + threadIdx.y; // row index
if ( idx < SizeCol && idy < SizeRow )
{ id = idy * SizeCol + idx;
C[id] = 0.3 * A[id] + 0.7 * B[id];
}
The important points are

1. In C the first array index refers to the row while the second refers to the column.

2. In CUDA the x-related thread index refers to column index while the y-related index refers to row
index.

3. To avoid confusion, linear indexing is always row-based.

4.2 Matlab

However, in Matlab the storage of a matrix is column-based (column-major). For a matrix of size (M × N), the linear index
of its element at the mth row and nth column is calculated as: id = (n-1) * M + m. Therefore, we must be
careful when programming in mixed C and Matlab. When the matrix data stored as a variable A in Matlab
is passed onto a pointer variable B (declared as float *B) in a C program, the linear index of the element
at the mth row and nth column (0-based) in the C program is given by
idx = n*NROW + m;
where NROW is the number of rows of the matrix. This will be further discussed in the next section.

5 Run CUDA code in Matlab

5.1 Basic steps

The Matlab Parallel Computing Toolbox allows us to run certain Matlab functions in GPU. There are three
methods:

1. Load the data into GPU by using the function gpuArray, then perform the operations
%x is a matrix, y is another matrix
x_d = gpuArray(x);
y_d = gpuArray(y);
z_d = exp(x_d) * cos(y_d) + log(abs(x_d)); % operations running in GPU
z = gather(z_d); %copy the result back to workspace, z_d is in GPU memory

2. Using the function arrayfun. See the Matlab documentation for details.

3. Compile the CUDA kernel and run it in Matlab. The following is an example of a CUDA program
stored as GC.cu.

//GPU function
__global__ void gammaCorrectionGPU(unsigned char *d_in, const int NROW,
                                   const int NCOL, const float gamma)
{
    float x0;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int k;

    if (idx < NROW && idy < NCOL)
    {
        k = idy * NROW + idx;
        x0 = (float) d_in[k];
        d_in[k] = (unsigned char) floor(255.f * pow(x0 / 255.f, gamma));
    }
}

It performs a simple gamma correction operation. To run it in Matlab, we need to follow these steps:

(a) Compile the kernel into ptx code:


nvcc -ptx GC.cu
(b) Make the Matlab kernel:
k = parallel.gpu.CUDAKernel(’GC.ptx’,’GC.cu’).
To see the properties of the kernel, just type k and press return in the command window.
The following message shows up
k =
parallel.gpu.CUDAKernel handle Package: parallel.gpu

Properties:
ThreadBlockSize: [1 1 1]
MaxThreadsPerBlock: 1024
GridSize: [1 1]
SharedMemorySize: 0
EntryPoint: ’_Z18gammaCorrectionGPUPhiif’
MaxNumLHSArguments: 1
NumRHSArguments: 4 ArgumentTypes: {1x4 cell}
Methods, Events, Superclasses
The ThreadBlockSize is set to [1 1 1] and the GridSize is set to [1 1]. This means only one thread
block having one thread is set up by the function parallel.gpu.CUDAKernel. So we need to
set up the threads ourselves. For an image of N rows and M columns, we need N × M threads. Suppose
we set the block size as [16 16], which has 16×16 threads per block. The grid size should then be
[ceil(N/16), ceil(M/16)]. For example, we use the following code to set up the threads
[N, M] = size(x);
k.ThreadBlockSize = [16 16];
k.GridSize = [ceil(N/16) ceil(M/16)];
After setting up the threads, we can run the CUDA kernel by using
y = feval(k, x, N, M, 0.1); % y is in GPU card memory
y = gather(y); %copy it to computer memory

(c) Let us come back to the kernel function. How do we let Matlab know which one is the input and
which one is the output? For example, if the C kernel within a CU file has the following signature:
__global__ void simpleExample( float * pInOut, float c )
The corresponding kernel object (k) in MATLAB has the following properties:
MaxNumLHSArguments: 1
NumRHSArguments: 2
ArgumentTypes: {’inout single vector’ ’in single scalar’}
Therefore, to use the kernel object from this code with the function feval, we need to provide
feval with two input arguments (in addition to the kernel object), and we can use one output argument:
y = feval(k, x, C); The input values x and C correspond to pInOut and c in the C
function prototype. The output argument y corresponds to the value of pInOut in the C function
prototype after the C kernel has executed. This is what happens in the program GC.cu in which
the pointer unsigned char *d_in acts as the input and the output. What do we do if we have
three matrix inputs and two matrix outputs? The rule is that if we declare the pointer to be a
constant, then it will not be used as the output. Here is an example that shows a combination of
const and non-const pointers:
__global__ void complicatedExample( const float * pIn, float * pInOut1, float * pInOut2)
The corresponding kernel object in MATLAB then has the properties:
MaxNumLHSArguments: 2
NumRHSArguments: 4
ArgumentTypes: {’in single vector’ ’inout single vector’ ’inout single vector’}

You can use feval on this code’s kernel (k) with the syntax:
[y1, y2] = feval(k, x1, x2, x3)
The three input arguments x1, x2, and x3, correspond to the three arguments that are passed
into the C function. The output arguments y1 and y2, correspond to the values of pInOut1 and
pInOut2 after the C kernel has executed.

5.2 Dealing with matrix element’s index in CUDA kernel

Here is an example of a CUDA kernel function which is used to calculate the sum of two matrices.

__global__ void addMtx(float *d_out, const float *d_in, const int NROW, const int NCOL)
{
    int k;
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // row idx
    int idy = blockIdx.y * blockDim.y + threadIdx.y; // col idx

    if (idx < NROW && idy < NCOL)
    {
        k = idy * NROW + idx;
        d_out[k] = d_in[k] + d_out[k];
    }
}

Here is another example showing how the indexing works. This example demonstrates how to use a
GPU kernel to calculate a 2-D average filter with a fixed window size.

//GPU function
__global__ void weightedAverageFilter(unsigned char *d_out, const unsigned char *d_in,
                                      const int NROW, const int NCOL,
                                      const int padNum, const int winSize)
{
    float x0;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int k, m, jj, kk, NROW_A;

    // idx must be linked to the row index and idy linked to the col index
    if (idx < NROW && idy < NCOL)
    {
        k = idy * NROW + idx;
        NROW_A = NROW + 2 * padNum;
        x0 = 0.f;
        /* column-wise, a little bit more multiplications */
        for (jj = idx; jj <= (idx + 2 * padNum); jj++)     // row index
            for (kk = idy; kk <= (idy + 2 * padNum); kk++) // col index
            {
                x0 = x0 + ((float) d_in[kk * NROW_A + jj]);
            }

        /* row-wise, one more register used, fewer multiplications */
        /*
         * for (kk = idy; kk <= (idy + 2 * padNum); kk++)    // col index
         * { m = kk * NROW_A;
         *   for (jj = idx; jj <= (idx + 2 * padNum); jj++)  // row index
         *   { x0 = x0 + ((float) d_in[m + jj]); }
         * }
         */
        d_out[k] = (unsigned char) (x0 / (winSize * winSize));
    }
}

We notice that, compared with a C program that is directly compiled and run, the above CUDA kernel
functions differ in the following aspects:

1. The index idx is used to represent the ROW index, while in the C program it is used to represent
the COLUMN index. Similarly the index idy is used to represent the COLUMN index, while in the C
program it is used to represent the ROW index. This is the case, because when the matrix data stored
in a variable is passed on to the CUDA kernel function, the matrix is stored in a column-wise way, i.e.,
it is stored as a 1-D array by stacking one column after another column. For example, a (2 × 3) matrix
is given by

A = [ 1 3 5
      2 4 6 ]

In Matlab, it is stored as [1 2 3 4 5 6]. The 1st two numbers are from the 1st column and the next
two numbers are from the 2nd column, etc. When this 1-D array is passed onto the CUDA kernel, the
data is copied to the device memory. To find out the index of the element at the mth row and nth
column, we perform the calculation: id = n*NROW + m. For example, the element located at row-2
and column-3 in Matlab is A(2,3) = 6. The linear index of this element in Matlab is (3-1)*2+2 = 6.
This is because in Matlab the index starts from 1. However, in C the index starts from 0. So
the same element would be located at row-1 and column-2 and the linear index is id = 2*2+1 = 5, which
is the index of the last element of the 1-D vector, which has 6 elements with indices from 0 to 5.

2. The for-loop can be implemented in the column-wise way or the row-wise way.

5.3 Using C language in Matlab and dealing with matrix element’s index in the
C kernel

Similar to the CUDA kernel, we can write a C program and compile it to run in Matlab using the mex
function. Although Matlab is very powerful in signal and image processing, for some applications it is
desirable to use the C language, which runs faster. To use C in Matlab, you need a C program which contains
two major parts: the actual program for the image processing and a function for the interface between C and
Matlab. Let me use an example to show you how to program it.

The interface function has a fixed format:

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])

We need to define the input and output variables in this function. The four arguments have fixed meanings:

nlhs -- number of variables on the left-hand side (output variables)
nrhs -- number of variables on the right-hand side (input variables)
plhs -- array of pointers for creating the output matrices
prhs -- array of pointers to the input matrices
Once you have the code, you compile it in Matlab. Matlab comes with a built-in compiler called “lcc”; however, this compiler does not produce fast-running programs. A better choice is another compiler such as gcc, the Intel C compiler, or the C compiler of Microsoft Visual Studio. The function you use to compile your C code is mex.
The mex function allows you to select a compiler if you have multiple compilers installed on your system. To do that, run: mex -setup. A very small gcc-based compiler you can install and use is called gnumex. See http://gnumex.sourceforge.net/ for download and installation information. Detailed information and more examples of using the C language in Matlab can be found at http://cnx.org/content/m12348/latest/
and http://www.cs.yale.edu/homes/spielman/ECC/cMatlab.html

5.3.1 Example 1

Here is a simple program taken and modified from http://cnx.org/content/m12348/latest/. The purposes of this code are (1) to show how to write the interface function, and (2) more importantly, to show how indices are treated in Matlab and in C. The code calculates the average value of each column and of each row of a matrix. I save this code in a file called “mySimpleC.c”. To run this function, I first compile it using: mex mySimpleC.c. Then I test it by entering the following in the Matlab command window:

a = [1 2 3 4; 5 6 7 8];
mySimpleC(a);

I have the following results


The average of column 0 is 3.000000
The average of column 1 is 4.000000
The average of column 2 is 5.000000
The average of column 3 is 6.000000
The average of row 0 is 2.500000
The average of row 1 is 6.500000

There are two important points. (1) Just as in the CUDA kernel case, you must keep the storage convention in mind when using the two functions mxGetM and mxGetN: mxGetM returns the number of rows and mxGetN the number of columns of the Matlab matrix. The possible confusion comes from the two different ways of storing a matrix: Matlab stores it column-wise, while in the C language a 2-D array is stored row-wise, so the "rows" a C programmer instinctively loops over do not correspond to the memory layout Matlab uses. (2) Just as in the CUDA kernel, the linear index of a matrix element in the C function is

idxLin = idxCol * MROW + idxRow.

// mySimpleC.c
#include "math.h"
#include "mex.h"   // This one is required

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{   // Declarations
    const mxArray *xData;   // const, because prhs[] is const
    double *xValues;
    int idxRow, idxCol, MROW, NCOL, idxLin;
    double avg;

    // Copy input pointer x
    xData = prhs[0];

    // Get matrix x
    xValues = mxGetPr(xData);
    NCOL = mxGetN(xData);   // this is the number of columns NOT rows
    MROW = mxGetM(xData);   // this is the number of rows NOT columns

    // Print the average of each column to the Matlab console
    for (idxCol = 0; idxCol < NCOL; idxCol++)
    {   avg = 0;
        for (idxRow = 0; idxRow < MROW; idxRow++)
        {   idxLin = idxCol * MROW + idxRow;
            avg = avg + xValues[idxLin];
        }
        avg = avg / MROW;
        printf("The average of column %d is %f\n", idxCol, avg);
    }

    // Print the average of each row to the Matlab console
    for (idxRow = 0; idxRow < MROW; idxRow++)
    {   avg = 0;
        for (idxCol = 0; idxCol < NCOL; idxCol++)
        {   idxLin = idxCol * MROW + idxRow;
            avg = avg + xValues[idxLin];
        }
        avg = avg / NCOL;
        printf("The average of row %d is %f\n", idxRow, avg);
    }
}

5.3.2 Example 2

As a more interesting example, suppose I want to write a program that shrinks an image by a factor of 2. This can be implemented as follows. I divide the image into (2 × 2) blocks; then, for each block, I take the average value of its 4 pixels as the corresponding pixel value of the output image. The code is shown below. In addition to the two important points above, you can see how the linear indices of the input and output pixels are calculated. To test this program, I run the following commands in the Matlab command window:

mex imageShrink2.c
x = double(imread('cameraman.tif')+1)/257;  % read image data into x
y = imageShrink2(x);
imtool([y])

#include <stdio.h>
#include "mex.h"

void imageShrink2(double *imOut, double *imIn, int MROW, int NCOL)
{   int idxRow, idxCol, idxLin_imIn, idxLin_imOut;
    int r, c, M, N;
    double tmp;

    // determine the size of the shrunk image (assumes even dimensions)
    M = MROW / 2;
    N = NCOL / 2;

    for (idxRow = 0; idxRow < MROW; idxRow = idxRow + 2)
        for (idxCol = 0; idxCol < NCOL; idxCol = idxCol + 2)
        {   tmp = 0;
            for (r = idxRow; r <= idxRow + 1; r++)
                for (c = idxCol; c <= idxCol + 1; c++)
                {   idxLin_imIn = c * MROW + r;
                    tmp = tmp + imIn[idxLin_imIn];
                }
            idxLin_imOut = M * idxCol / 2 + idxRow / 2;
            imOut[idxLin_imOut] = tmp / 4;
        }
    return;
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{   double *x, *y;
    int MROW, NCOL;
    int M, N;   // output image size

    /* Check for proper number of arguments. */
    if (nrhs != 1)
    {   mexErrMsgTxt("Usage: y = imageShrink2(x);"); }
    else if (nlhs > 1)
    {   mexErrMsgTxt("Too many output arguments"); }

    // get the image and its size
    x = mxGetPr(prhs[0]);
    NCOL = (int) mxGetN(prhs[0]);   // number of columns NOT rows
    MROW = (int) mxGetM(prhs[0]);   // number of rows NOT columns

    // determine the size of the shrunk image
    M = MROW / 2;
    N = NCOL / 2;

    /* Create matrix for the return argument. */
    plhs[0] = mxCreateDoubleMatrix(M, N, mxREAL);
    /* Assign pointer to the output. */
    y = mxGetPr(plhs[0]);

    /* Call the subroutine. */
    imageShrink2(y, x, MROW, NCOL);
}

5.3.3 Example 3

In this example, I show you how to pass an integer from the interface function to the C program. Here I extend the functionality of imageShrink2.c so that it can handle a user-specified shrink factor, obtained with

shrinkFactor = (int) mxGetScalar(prhs[1]);

To test this program, I run the following commands in the Matlab command window:

mex imageShrink.c
x = double(imread('cameraman.tif')+1)/257;  % read image data into x
y = imageShrink(x,3);  % shrink by a factor of 3
imtool([y])

40
#include <stdio.h>
#include "mex.h"
#include <math.h>

void imageShrink(double *imOut, double *imIn, int MROW, int NCOL, int shrinkFactor)
{   int idxRow, idxCol, idxLin_imIn, idxLin_imOut;
    int r, c, M, N, count;
    double tmp;

    // determine the size of the shrunk image
    M = (int) ceil((double) MROW / (double) shrinkFactor);
    N = (int) ceil((double) NCOL / (double) shrinkFactor);

    for (idxRow = 0; idxRow < MROW; idxRow = idxRow + shrinkFactor)
        for (idxCol = 0; idxCol < NCOL; idxCol = idxCol + shrinkFactor)
        {   tmp = 0;
            count = 0;
            // clamp the block to the image border so we never read past
            // the end when the size is not a multiple of shrinkFactor
            for (r = idxRow; r <= idxRow + shrinkFactor - 1 && r < MROW; r++)
                for (c = idxCol; c <= idxCol + shrinkFactor - 1 && c < NCOL; c++)
                {   idxLin_imIn = c * MROW + r;
                    tmp = tmp + imIn[idxLin_imIn];
                    count = count + 1;
                }
            idxLin_imOut = M * (idxCol / shrinkFactor) + idxRow / shrinkFactor;
            imOut[idxLin_imOut] = tmp / count;   // average over the pixels actually summed
        }
    return;
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{   double *x, *y;
    int shrinkFactor;
    int MROW, NCOL;
    int M, N;   // output image size

    if (nrhs != 2 || nlhs > 1)
    {   mexErrMsgTxt("Usage: y = imageShrink(x, shrinkFactor);"); }

    // get the image and its size
    x = mxGetPr(prhs[0]);
    NCOL = (int) mxGetN(prhs[0]);   // number of columns NOT rows
    MROW = (int) mxGetM(prhs[0]);   // number of rows NOT columns

    // get an integer -- the shrink factor
    shrinkFactor = (int) mxGetScalar(prhs[1]);

    // determine the size of the shrunk image
    M = (int) ceil((double) MROW / (double) shrinkFactor);
    N = (int) ceil((double) NCOL / (double) shrinkFactor);

    /* Create matrix for the return argument. */
    plhs[0] = mxCreateDoubleMatrix(M, N, mxREAL);
    /* Assign pointer to the output. */
    y = mxGetPr(plhs[0]);

    /* Call the subroutine. */
    imageShrink(y, x, MROW, NCOL, shrinkFactor);
}

6 Summary
This has been a very short review of certain key elements of C programming and a very, very short introduction to programming in CUDA. I hope that we can learn more about programming in CUDA in another unit, ELE5IPC (ELE4ICP), in which we can do more image processing.
To learn more about CUDA, NVIDIA's CUDA web page contains a lot of useful information and links to tutorials and technical training materials. UIUC has a one-semester course in CUDA programming whose web page has detailed lecture notes.

