
A very biased introduction to C programming and very very short introduction to CUDA programming

March 26, 2013

Dr Dennis Deng

Department of Electronic Engineering, La Trobe University

Contents

1 Aim

2 C language examples and exercises
   2.1 The simplest one
      2.1.1 Exercises
   2.2 Variables and the if-else command
      2.2.1 Exercises
   2.3 For-loop
      2.3.1 Exercises
   2.4 Array
      2.4.1 Exercises
   2.5 2-D array
      2.5.1 Exercises
   2.6 Pointers for arrays
   2.7 Command line input
      2.7.1 Exercises
   2.8 File input and output
      2.8.1 Exercises
   2.9 Functions
      2.9.1 Exercises
   2.10 Function using pointers
      2.10.1 Exercises
   2.11 An image processing example
      2.11.1 Exercises
   2.12 Make it run faster: optimization
      2.12.1 Exercises

3 A very very brief introduction to CUDA
   3.1 Motivation
   3.2 Terms and definitions
   3.3 Programming model
   3.4 CUDA specific functions
   3.5 Image processing example
      3.5.1 Main parts of the program
      3.5.2 Details of the GPU code
      3.5.3 The Program
      3.5.4 Compile and run
   3.6 Exercises

4 Linear index of 2D arrays
   4.1 C language
   4.2 Matlab

5 Run CUDA code in Matlab
   5.1 Basic steps
   5.2 Dealing with matrix element's index in CUDA kernel
   5.3 Using C language in Matlab and dealing with matrix element's index in the C kernel
      5.3.1 Example 1
      5.3.2 Example 2
      5.3.3 Example 3

6 Summary
1 Aim
This is a very biased introduction to C programming. It is biased because it does not cover everything. It
does not even cover the details of the programming language. The purpose of this introduction is to show
you what is required (the format) to quickly write your own C programs. I trust you can learn the details
by yourself. There are many different ways to learn a new language. I think one of the easiest is to copy
some sample code and study what it does. That is the way little kids learn to speak. Therefore, I will
present a series of code examples. Each one will illustrate certain points, and each one will be a little
more complicated than the previous example.

CUDA is a relatively new development. It is basically an extension of the standard C language. The
extension allows people with C programming knowledge to quickly learn how to program NVIDIA graphics
processors, which have massive parallel computing capabilities. An excellent web site for learning CUDA is:
http://courses.ece.illinois.edu/ece498/al/. NVIDIA's CUDA web page has links to many tutorials
and technical training materials: www.nvidia.com/cuda.

My plan is to use 4 lectures and 4 hours of lab time to cover the basics of C and CUDA programming.
Obviously, there is not enough time. The purpose of this introduction is to show you what you can do with
this new technology. I hope you will be motivated to quickly learn all the tools and tricks by yourself. I
also plan to use CUDA in ELE5IPC (ELE4IPC) in second semester¹.

Source code for all the examples can be found on the ELE4ASP web page.

2 C language examples and exercises

2.1 The simplest one

This is perhaps the simplest C program. You can also use it as a template.

#include <stdio.h>

int main()
{   // you can add your code here
    printf("Hello World\n");
    return 0;
}

You can use a text editor, such as the Matlab editor or Notepad, to write your own C code and save it in
your own directory as, in this case, Hello.c.

2.1.1 Exercises

How do we run it? We will use Microsoft Visual Studio 2005. Here are the steps

• In Windows, click “Start” -> “Program” -> “Microsoft Visual Studio 2005” -> “Visual Studio Tools”
-> “Visual Studio 2005 Command Prompt”. You will have a command window.
¹ ELE4IPC is not offered in 2012.

• Suppose your code is saved in a directory c:\abc. In the command window, you need to change the
directory by using: cd c:\abc. You can make sure your C program is in the current directory by listing
the files: dir *.c. If you can see the file Hello.c listed, then you can go on to the next step.

• To compile it, use: cl Hello.c -o Hello. Take a careful look at the result. There is a warning about
the -o option. We will keep this option anyway, because in another popular compiler, GCC, it is a
required option. There will be error messages if there is something wrong in your code.

• To run the program, simply type Hello. What do you see? Hello World!

2.2 Variables and the if-else command

Make a copy of Hello.c and rename it vie.c. To do this, you can type in the command window: copy
Hello.c vie.c. Open vie.c in the text editor and edit it so that it looks like the following.

#include <stdio.h>

int main()
{   int a = 10, b = 2;
    float c, PI = 3.14;
    if (a > b)
        c = (float)(a * a) * PI;
    else
        c = (float)(b * b) * PI;
    printf("The area of the larger circle is %f\n", c);
    return 0;
}

In this example, we introduce three new features: declaring variables [e.g., float c, PI = 3.14], converting a
number to a different type [e.g., (float)(a * a)] and the if-else structure.

2.2.1 Exercises

Given three different numbers a, b and c (you can assume they are integers), write a program to calculate
and print out the average and median value of these three numbers. The average is given by

    x = (a + b + c) / 3

and the median is the number that is less than the maximum of the three and greater than the minimum of
the three. You can use vie.c as a template.

2.3 For-loop

In this example, we show how to use the for-loop. We also show how to define a constant [#define PI 3.14].

#include <stdio.h>

#define PI 3.14

int main()
{   int n, N = 10;
    float s;
    for (n = 1; n < N; n++)
    {   s = (float)(n * n) * PI;
        printf("The area of the circle of radius %d is %f\n", n, s);
    }
    return 0;
}

2.3.1 Exercises

Modify this program such that it converts Australian dollars to US dollars. Your program should print
out the corresponding US dollar values for the following Australian dollar amounts: 10, 20, 30, ..., 100.
Hint: you can define the exchange rate as a constant, and instead of using n++ in the for-loop, use n = n + 10.

2.4 Array

In this example, we introduce three new features. When we use mathematical functions such as exp, we
should include the math library math.h. A 1-d array is a vector; you should declare arrays before using
them. In the statement b[n] = exp( -(float)(n)/2.f ) we use "2.f" to tell the compiler that this is
a floating point number. An important issue is the dimension of the array. When we declare float
a[10]; we define a vector of 10 elements. The first element is a[0] and the last is a[9]. What would happen
if you try something like a[-2] = -2 or c[0] = a[100] + PI? Un-comment the three lines after the line
// stupid errors, then compile and run the modified program. Unexpected results can happen. Be careful.

// array.c
#include <stdio.h>
#include <math.h>
#define PI 3.14

int main()
{   int n, N = 10;
    float a[10], b[10], c[10], s = 0;

    for (n = 0; n < N; n++)
    {   a[n] = (float)(n - N) * PI;
        b[n] = exp(-(float) n / 2.f);
        printf("%d %f %f\n", n, a[n], b[n]);
    }

    for (n = 0; n < N; n++)
    {   c[n] = a[n] * b[n];
        s = s + a[n] * a[n];
    }

    printf("The length of vector a is %f\n", s);

    // stupid errors
    // a[-2] = -2.f;
    // c[0] = a[100] + PI;
    // printf("Stupid errors a[-2]= %f and a[100] + PI= %f\n", a[-2], c[0]);

    return 0;
}

2.4.1 Exercises

Create two sinusoidal vectors

    s[n] = sin(ωn)    and    c[n] = cos(ωn)

where n = 0, 1, ..., N, N = 1023, and ω = 0.2π. Calculate the inner product of these two vectors. You can
use array.c as a template.

2.5 2-D array

This example illustrates how to deal with a 2-d array. As you can see from the print out, storage is
row-based: the elements of one row are stored next to each other. This has an important implication: we
can use a linear index for the elements of a 2-d array. Suppose we have defined a 2-d array a[NRow][MCol],
which has NRow rows and MCol columns. What is the linear index of a particular element a[n][m] at the
nth row and mth column? It is given by n * MCol + m. With this linear indexing scheme, we do not need
to deal with a 2-d array; we only need to deal with a 1-d array. It is interesting to point out that in
Matlab, a 2-d array is stored in a column-based way, so the linear index of element (n, m) of a 2-d array in
Matlab is (m - 1) * NRow + n.

// array2d.c
#include <stdio.h>
#include <math.h>
#define PI 3.14

int main()
{   int m, M, n, N;
    int a[3][4], b[3][4], c[4], d[4];

    M = 4;
    N = 3;

    for (n = 0; n < N; n++)
        for (m = 0; m < M; m++)
        {   a[n][m] = n * M + m;  // define the matrix a of 3 rows and 4 columns
            b[n][m] = n;          // define the matrix b
        }

    for (n = 0; n < N; n++)
    {   for (m = 0; m < M; m++)
        {   printf("%d ", a[n][m]); }
        printf("\n");
    }
    printf("\n");
    for (n = 0; n < N; n++)
    {   for (m = 0; m < M; m++)
        {   printf("%d ", b[n][m]); }
        printf("\n");
    }
    return 0;
}

2.5.1 Exercises

• Modify the code array2d.c such that it calculates the element-wise addition and multiplication of the
two matrices a and b. Store the results in two new matrices d and e (you will need to declare them with
the right sizes).

• Use linear indexing to perform the above tasks.

• Define a vector f[4] = [1 2 3 4]^T. Modify the code to calculate the matrix-vector multiplication g = a * f.

2.6 Pointers for arrays

Pointers are closely related to arrays. In this example, we show how to use pointers to allocate a 1-d and
a 2-d array and how to use pointers to access the elements of the arrays.

// pointer_array.c

#include <stdio.h>
#include <math.h>
#include <stdlib.h>  // defines malloc() and free()

int main()
{   int *a;        // integer pointer
    float *b, *c;  // float pointers
    int Length, NRow, MCol, n, m, k;

    Length = 5;
    NRow = 10;
    MCol = 5;

    // allocate memory for the arrays
    a = (int *) malloc(Length * sizeof(int));
    b = (float *) malloc(NRow * MCol * sizeof(float));
    c = (float *) malloc(NRow * sizeof(float));  // c = b * a has NRow elements

    for (k = 0; k < Length; k++)
        a[k] = k;

    for (n = 0; n < NRow; n++)
        for (m = 0; m < MCol; m++)
        {   k = n * MCol + m;
            b[k] = k;  // define the matrix b of NRow rows and MCol columns
        }

    // print a
    printf("Printing a\n");
    for (k = 0; k < Length; k++)
        printf("%d\n", a[k]);

    printf("\n");
    // print b
    printf("Printing b\n");
    for (n = 0; n < NRow; n++)
    {   for (m = 0; m < MCol; m++)
        {   printf("%f ", b[n * MCol + m]); }
        printf("\n");
    }

    // calculate c = b * a
    for (n = 0; n < NRow; n++)
    {   c[n] = 0;
        for (m = 0; m < MCol; m++)
            c[n] = c[n] + (float) a[m] * b[n * MCol + m];
    }

    printf("Printing c\n");
    // print c
    for (k = 0; k < NRow; k++)
        printf("%f\n", c[k]);

    // finished using the memory, free it
    free(a);
    free(b);
    free(c);
    return 0;
}

Here is another example, min_max.c, which fills a matrix with random numbers and finds its minimum and maximum:
// min_max.c

#include <stdio.h>
#include <math.h>
#include <stdlib.h>  // defines rand()
#include <time.h>

int main()
{
    float *a, max, min;
    int NRow, MCol, n, m, k;

    NRow = 4;
    MCol = 5;

    // allocate memory for the array
    a = (float *) malloc(NRow * MCol * sizeof(float));

    // initialize random seed:
    srand(time(NULL));

    for (n = 0; n < NRow; n++)
        for (m = 0; m < MCol; m++)
        {   k = n * MCol + m;
            a[k] = (float) rand();
        }

    // print a
    printf("Printing a\n");
    for (n = 0; n < NRow; n++)
    {
        for (m = 0; m < MCol; m++)
            printf("%f ", a[n * MCol + m]);
        printf("\n");
    }

    max = -1e12;
    min = 1e12;

    for (n = 0; n < NRow; n++)
        for (m = 0; m < MCol; m++)
        {   k = n * MCol + m;
            if (a[k] >= max)
                max = a[k];
            if (a[k] <= min)
                min = a[k];
        }

    printf("Max = %f Min = %f\n", max, min);

    // finished using the memory, free it
    free(a);
    return 0;
}

2.7 Command line input

In the previous example, we had to define the dimensions of the array in the program. Once the program
is compiled, the dimensions are fixed. If we want to change them, we need to change the program, then
compile and run it again. There is a better way. What is new in the following example is that the inputs
are collected from the command line.

The function main() is declared as main(int argc, char **argv). This is a fixed format. The variable
argc counts the number of inputs and the character pointers in argv store the inputs. For example,
argv[1] stores the first input. There is an important point here. The program name is counted as the
first input (with index 0), so the first parameter is at index 1 and so on. If two inputs are expected, then
argc is 3. In the following example, we test whether the number of inputs satisfies the requirement. If not,
we print an error message and terminate the program. This is an important issue, because we want the
program to run as it is intended to run.

// min_max_ci.c

#include <stdio.h>
#include <math.h>
#include <stdlib.h>  // defines rand(), atoi() and exit()
#include <time.h>

int main(int argc, char **argv)
{
    float *a, max, min;
    int NRow, MCol, n, m, k;

    if (argc != 3)  // number of inputs + 1
    {
        printf("Two inputs required\n");
        printf("Usage: min_max_ci NRow MCol\n");
        exit(0);
    }
    else
    {
        NRow = atoi(argv[1]);  // get the 1st input and convert to integer
        MCol = atoi(argv[2]);  // get the 2nd input and convert to integer
    }

    if (NRow < 0 || MCol < 0)
    {   printf("Non-negative inputs required\n");
        printf("Usage: min_max_ci NRow MCol\n");
        exit(0);
    }

    printf("NRow = %i MCol = %i\n", NRow, MCol);

    // allocate memory for the array
    a = (float *) malloc(NRow * MCol * sizeof(float));

    // initialize random seed:
    srand(time(NULL));

    for (n = 0; n < NRow; n++)
        for (m = 0; m < MCol; m++)
        {   k = n * MCol + m;
            a[k] = (float) rand();
        }

    // print a
    printf("Printing a\n");
    for (n = 0; n < NRow; n++)
    {
        for (m = 0; m < MCol; m++)
            printf("%f ", a[n * MCol + m]);
        printf("\n");
    }

    max = -1e12;
    min = 1e12;

    for (n = 0; n < NRow; n++)
        for (m = 0; m < MCol; m++)
        {   k = n * MCol + m;
            if (a[k] >= max)
                max = a[k];
            if (a[k] <= min)
                min = a[k];
        }

    printf("Max = %f Min = %f\n", max, min);

    // finished using the memory, free it
    free(a);
    return 0;
}

2.7.1 Exercises

• Compile the program by using: cl min_max_ci.c -o min_max_ci and run the program by typing:
min_max_ci 5 6. What do you see?

• You can do a test: min_max_ci 5 (supply only one parameter) or min_max_ci 5 6 7 (supply more
than two parameters). What are the results?

• You can do another test: min_max_ci 5 -6 (supply a negative number). What do you see?

• Obviously, a negative number is not allowed. Modify the code such that it can detect a negative
number, print an error message and then terminate.

2.8 File input and output

Here is another application of command line input. We want to process certain data, such as an image
stored in a file. We need to read it into an array (1-d or 2-d), process it and write the output to a file. This
example shows you how to do it. The line "FILE *fpr, *fpw;" declares pointers to the files you want to
read and write. You then use command line input to obtain the file names. Next you use "fopen" to
associate each pointer with a file name (e.g., argv[1]) and to say whether the file is intended for read
(e.g., "rb") or write (e.g., "wb") operations.

You also need to declare pointers to store the data. In this case, the type is unsigned char, which is 8 bits,
because we assume the image data is 8 bits/pixel. You should be able to understand the rest of the code.

// fileIO.c
// read an image and write it out to another file
#include <stdio.h>
#include <stdlib.h>  // defines malloc(), atoi() and exit()
#include <time.h>

int main(int argc, char **argv)
{
    FILE *fpr, *fpw;
    unsigned char *in, *out;
    int NRow, MCol, n, m, k;

    if (argc != 5)  // number of inputs + 1
    {
        printf("Four inputs required\n");
        printf("Usage: fileIO input output NRow MCol\n");
        exit(0);
    }
    else
    {   fpr = fopen(argv[1], "rb");  // get the pointer to read the input image
        fpw = fopen(argv[2], "wb");  // get the pointer to write the output image
        NRow = atoi(argv[3]);        // size of the image
        MCol = atoi(argv[4]);        // size of the image
    }

    // if something is still wrong!
    if (fpr == 0 || fpw == 0 || NRow < 0 || MCol < 0)
    {   printf("Usage: fileIO input output NRow MCol\n");
        exit(0);
    }

    // allocate memory for the two arrays
    in  = (unsigned char *) malloc(NRow * MCol * sizeof(unsigned char));
    out = (unsigned char *) malloc(NRow * MCol * sizeof(unsigned char));

    // read image
    fread(in, sizeof(unsigned char), NRow * MCol, fpr);

    // the output is the same as the input
    for (n = 0; n < NRow; n++)
        for (m = 0; m < MCol; m++)
        {   k = n * MCol + m;  // use linear indexing
            out[k] = in[k];
        }

    // write data to a file
    fwrite(out, sizeof(unsigned char), NRow * MCol, fpw);

    // finished using the memory, free it
    free(in);
    free(out);

    // close the files
    fclose(fpr);
    fclose(fpw);

    printf("%d\n", CLOCKS_PER_SEC);
    return 0;
}

2.8.1 Exercises

• Download the image file “airplane”. Compile the program and run it using: fileIO airplane out 512 512.
We assume the image size is 512 × 512. You can compare the two files by using: comp airplane out. If
the two files are the same, you should have a message like: “Files compare OK”.

• Modify the code fileIO.c such that it reads an input image and determines the minimum and
maximum pixel values of the image.

2.9 Functions

A typical function takes inputs from the main function, performs certain calculations and then returns
the result to the main function. Here is an example. The main function passes two numbers to the function,
which calculates the sum and returns the result.

#include <stdio.h>
int add(int x, int y);

int main()
{   int a = 1, b = 2;
    int c;

    c = add(a, b);
    printf("%d + %d = %d\n", a, b, c);
    return 0;
}

int add(int x, int y)
{   int sum = 0;
    sum = x + y;
    return sum;
}

2.9.1 Exercises

• Modify the program to allow a user to input the two integers from the command line.

• Write a function that returns the larger number of the two input numbers.

2.10 Function using pointers

Here is a very simple example which shows you how to define a function for adding two vectors. The function
accepts pointers to the two input arrays and to the output array. These pointers are best understood as
containing the addresses of particular arrays. For example, when the function is called as
"vectorAdd(c, a, b, Length);", the address of a[0] is passed to the parameter x, so x[0] is the same
element as a[0]. Similarly, z receives the address of c[0], so when the function writes to z[n] it is writing
directly into c[n].

// define a function

#include <stdio.h>
#include <stdlib.h>  // defines malloc() and free()

void vectorAdd(int *z, int *x, int *y, int Length);

int main()
{   int *a, *b, *c;
    int Length, n;

    Length = 10;

    // memory allocation
    a = (int *) malloc(Length * sizeof(int));
    b = (int *) malloc(Length * sizeof(int));
    c = (int *) malloc(Length * sizeof(int));

    // make two vectors
    for (n = 0; n < Length; n++)
    {   a[n] = n;
        b[n] = Length - n;
    }

    // call the function
    vectorAdd(c, a, b, Length);

    // print result
    for (n = 0; n < Length; n++)
        printf("c[%d] = %d\n", n, c[n]);

    free(a);
    free(b);
    free(c);
    return 0;
}

void vectorAdd(int *z, int *x, int *y, int Length)
{   int n;

    for (n = 0; n < Length; n++)
        z[n] = x[n] + y[n];
}

2.10.1 Exercises

• Compile and run the program vectorAdd.c. Does the result make sense? Change Length to 20.
Can you predict the print out? Compile and run the modified program.

• Modify the program such that it calculates the inner product of two vectors: s = Σ_n a[n] * b[n]. Hint:
the function now returns an integer instead of returning the result through a pointer argument. You
can refer to section 2.9 for an example.

2.11 An image processing example

An interesting image processing algorithm is called gamma correction. Let an image be stored in a 1-d array
in[n] (we use linear indexing, see section 2.6). The processed image is given by

    out[n] = 255 * (in[n] / 255)^γ

Here we assume the image is 8 bits/pixel.


The following program involves almost all the programming knowledge we have reviewed: command line
input, file input/output, pointers, linear indexing for 2-d arrays, for-loops, etc. This program is a slightly
modified version of the program fileIO.c. We have added the line "gamma = atof(argv[5]);" to get the
gamma value. Do your homework to find out what "atof" means. We use the "clock" function to measure
how many clock ticks the program takes to process the image. We should know that measuring the
running time of a program is not a simple matter. The clock function only provides a very rough indication
of the running time. You should also pay attention to the line "out[k] = (unsigned char)(255.f
* pow((float)in[k]/255.f, gamma));", where we have used 255.f to indicate a floating point constant
and we have made suitable type conversions, e.g., (float)in[k].

// imagePow.c
// read an image, apply gamma correction, and write it out to another file
#include <stdio.h>
#include <math.h>
#include <stdlib.h>  // defines malloc(), atoi(), atof() and exit()
#include <time.h>

int main(int argc, char **argv)
{
    FILE *fpr, *fpw;
    unsigned char *in, *out;
    int NRow, MCol, n, m, k;
    clock_t t;
    float gamma;

    if (argc != 6)  // number of inputs + 1
    {
        printf("Five inputs required\n");
        printf("Usage: imagePow input output NRow MCol gamma\n");
        exit(0);
    }
    else
    {   fpr = fopen(argv[1], "rb");  // get the pointer to read the input image
        fpw = fopen(argv[2], "wb");  // get the pointer to write the output image
        NRow = atoi(argv[3]);        // size of the image
        MCol = atoi(argv[4]);        // size of the image
        gamma = atof(argv[5]);       // gamma
    }

    // if something is still wrong!
    if (fpr == 0 || fpw == 0 || NRow < 0 || MCol < 0)
    {   printf("Usage: imagePow input output NRow MCol gamma\n");
        exit(0);
    }

    // allocate memory for the two arrays
    in  = (unsigned char *) malloc(NRow * MCol * sizeof(unsigned char));
    out = (unsigned char *) malloc(NRow * MCol * sizeof(unsigned char));

    // read image
    fread(in, sizeof(unsigned char), NRow * MCol, fpr);

    t = clock();
    // gamma-correct every pixel
    for (n = 0; n < NRow; n++)
        for (m = 0; m < MCol; m++)
        {   k = n * MCol + m;  // use linear indexing
            out[k] = (unsigned char)(255.f * pow((float) in[k] / 255.f, gamma));
        }
    t = clock() - t;

    // write data to a file
    fwrite(out, sizeof(unsigned char), NRow * MCol, fpw);

    // finished using the memory, free it
    free(in);
    free(out);

    // close the files
    fclose(fpr);
    fclose(fpw);

    printf("Number of clock ticks used in processing the image = %ld\n", (long) t);
    return 0;
}

2.11.1 Exercises

• Compile the program: "cl imagePow.c -o imagePow" and run it: "imagePow airplane out0.5 512 512 0.5".

• In Matlab you can read the image and display it:


a = fopen('airplane', 'r');
b = fread(a, [512 512]);
fclose(a);
a = fopen('out0.5', 'r');
c = fread(a, [512 512]);
fclose(a);
imshow(double([b c])/256)

• You can experiment with a number of settings for gamma (0.1, 0.5, 1.5, 2) and display the resulting
images.

• Modify the C program imagePow.c such that the image processing task (the double for-loop) is
performed in a function. You can use the program in section 2.10 as an example. The function is given
below. You should NOT look at it unless you really do not know how to write your own.

void gammaCorrection(unsigned char *x, unsigned char *y, int N, int M, float g)
{   int n, m, k;

    // gamma correction: x is the input image, y is the output image
    for (n = 0; n < N; n++)
        for (m = 0; m < M; m++)
        {   k = n * M + m;  // use linear indexing
            y[k] = (unsigned char)(255.f * pow((float) x[k] / 255.f, g));
        }
}

You need to replace the double for-loops with the following:


gammaCorrection(in, out, NRow, MCol, gamma);

2.12 Make it run faster: optimization


Now we can write C programs to perform quite complicated signal processing tasks. Can we make our
programs run faster? There are many tricks and tools for writing computationally efficient C programs. One
of the easiest ways is to write the program in a "standard" way and let the compiler do the hard work of
optimizing it. In the command window, type cl /? and you will see a lot of options.

Some are related to optimization for faster running speed. For example, the option /O2 will maximize
running speed. The option /fp:fast will use less accurate but faster floating point calculations. The
option /arch:SSE2 will enable the use of instructions available on SSE2-enabled CPUs. By default, all
optimization options are disabled!

According to Wikipedia, Streaming SIMD Extensions (SSE) is a SIMD instruction set extension to the x86
architecture, designed by Intel and introduced in 1999 in their Pentium III series processors as a reply to
AMD's 3DNow!.

2.12.1 Exercises

• Compile and run the program imagePow.c using different combinations of optimization options and
observe the running time. For example, you can compile it without any optimization: cl imagePow.c
-o imagePow. Then you can use:

– (a) cl /O2 imagePow.c -o imagePow1,


– (b) cl /O2 /fp:fast imagePow.c -o imagePow2
– (c) cl /O2 /fp:fast /arch:SSE2 imagePow.c -o imagePow3

• Perform a literature search to understand the benefit of using SSE2 or other processor-specific instruction sets.

• Perform a literature search to understand the optimization options of the GCC and Intel C compiler.

3 A very very brief introduction to CUDA

3.1 Motivation
From a scientific computation point of view, the processor takes data from the memory, performs certain
calculations and writes the result back to the memory. The program (software) tells the processor what to

do. The processed result will be written into a file and stored in a hard disk. For signal/image processing
applications, it is usually the case that the same set of operations will be performed on many data. For
example, in the gamma correction case, the following calculation is performed on all pixels:
out[k] = 255 * (in[k] / 255)^γ

where for a 512 × 512 image k = 0, 1, ..., 262143. For non-parallel programs, one pixel is processed at a time
and in a sequential way. So using a for-loop is a natural way to go

N = 512 * 512;
for ( k = 0 ; k < N ; k ++)
out[k] = (unsigned char) floor( 255.f * pow( (float) in[k] / 255.f, gamma ) );

Here is an interesting point. The for-loop index k is actually used as an index for the location of the
pixel. Suppose the processor can finish the calculation for one pixel in tc seconds. Then it takes roughly
N × tc seconds to process the image.
Now we have graphics processors (GPU) which have hundreds of processing units in them. It makes sense to
utilize this computing power. In the above example, we can ask each processing unit in the GPU to process
one pixel simultaneously. Suppose we have M processing units and it takes tg seconds to process one pixel.
Then it takes roughly N × tg /M seconds to process the image. The increase in processing speed can be
calculated as (not accurate!)
(N × tc) / (N × tg / M) = M × (tc / tg)

Obviously, more and faster processing units in a GPU will lead to a higher speed-up ratio.
An interesting point is that just as in the for-loop we can use the loop index as the location index of the
pixel, in CUDA we can use the thread ID as the location index.

3.2 Terms and definitions

The following information is collected from a number of sources: including the UIUC lecture notes and the
CUDA technical training material from NVIDIA.
A typical GPU (the G80 chip from NVIDIA) has 16 streaming multiprocessors (SM). Each SM has 8 streaming
processors (SP), giving 128 SPs in total. Each SP, running at 1.35 GHz, has a multiply-add unit and a multiply
unit. Each SM supports up to 768 threads. The following is a list of terms related to GPU computing.
Host: the computer
Device: the graphics processor
Kernel: a computational task such as the calculation of the gamma correction (the line after the for-loop in
the previous example)
Thread: a single instance of the kernel, working on one piece of data (e.g., one pixel)
Grid: organization of threads into blocks
Block: a 3D array of threads

Block ID: blockIdx.x and blockIdx.y
Thread ID: threadIdx.x, threadIdx.y, and threadIdx.z
In general terms, a thread is related to a processing task (for example the processing of one pixel in the
gamma correction case). When a kernel function is launched, it is executed as a grid of parallel threads.
This can be regarded as a way the run-time system allocates processor resources for the parallel running
of the kernel function. The thread grid is organized into a two-level hierarchy. On the top level, we define
the thread blocks. Each block has a 2-D ID given by the CUDA built-in variables blockIdx.x and blockIdx.y. All
thread blocks must have the same number of threads organized in the same way. A thread block is organized
as a 3D array of threads with a total of up to 512 threads. The ID for a thread in a block is given by the three
indices: threadIdx.x, threadIdx.y, and threadIdx.z.
For example, we can define a 4 × 5 grid of blocks and define each block as a 6 × 7 × 1 array. Then the total
number of threads in this grid is 4 × 5 × 6 × 7 = 840. A particular thread in the block (1, 2) and at the
location (3, 4, 0) of the array has the following IDs
blockIdx.x = 1
blockIdx.y = 2
threadIdx.x = 3
threadIdx.y = 4
threadIdx.z = 0

In fact, these are built-in device variables (declared as follows) which are accessible by all __global__
and __device__ functions
dim3 gridDim
dim3 blockDim
dim3 blockIdx
dim3 threadIdx

3.3 Programming model

The CUDA programming model is simple. A C program which runs in the host is responsible for setting
up the variables for both host and device, data input and output, copying data from host to device, calling
the kernel function which runs in the device, and copying data from the device back to the host once the
processing is finished. It is then possible to invoke another kernel function.
The syntax for invoking a kernel function is as follows

dim3 dimGrid(grid_size_x, grid_size_y);
dim3 dimBlock(block_size_x, block_size_y, block_size_z);
Kernel_Function <<< dimGrid, dimBlock >>> (parameters of the kernel function);

where Kernel_Function is the name of the kernel, dimGrid and dimBlock are the definitions of the thread
grid, and the last items are parameters that are passed from the host to the kernel function.

For a 2-d thread block (block_size_z = 1), the 2-d thread ID for a particular block with indices (blockIdx.x,
blockIdx.y) and the thread indices within the block (threadIdx.x, threadIdx.y) are given as follows
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
Note that this is a common way to “address” a 2D thread within a grid. You should contrast this to the
double for-loops in the C program. Roughly speaking, in the C program you use the double for-loops to
tell the CPU to process each pixel of the image in a sequential way. In CUDA, the runtime system will
look after the calculation of the indices idx and idy and distribute the associated processing tasks to as
many processors as possible at the same time. As such, we do not know which pixel is processed by which
processor.

3.4 CUDA specific functions

There are a number of CUDA-specific functions, such as cudaMalloc, cudaMemcpy, cudaFree, and cudaThreadSynchronize, which are used in the example below.

3.5 Image processing example

3.5.1 Main parts of the program

We will use gamma-correction as an example. The C program will call a function that runs on the CPU and
another function that runs on the GPU. The following is a list of key steps in the program

1. Define the GPU and CPU functions

2. Declare variables

3. Take care of command line input

4. Memory allocation for arrays in host and device

5. Read image

6. Run GPU code

(a) Define the grid and block


(b) Copy image data to device and check errors
(c) Warm up - running cudaThreadSynchronize();
(d) Run GPU code
(e) Wait until all tasks finished and check running error
(f) Copy data from device to host and check error
(g) Free device memory

7. Write data to a file out0.5

8. Run CPU code

9. Calculate the maximum error and the sum of absolute errors between the results from the CPU code and
the GPU code. The results are not necessarily exactly the same!

10. Free host memory.

3.5.2 Details of the GPU code

• Block and grid definition

In this program, we define a 16 × 16 thread block. This is done through the two macros
//define 16x16 thread block
#define BLOCKDIM_X 16
#define BLOCKDIM_Y 16
There are 256 threads in one block, which is within the limit of 512 threads per block.
The grid dimension is calculated by using the function
int iDivUp(int a, int b)
{ return ((a % b) != 0) ? (a / b + 1) : (a / b); }
This function returns a/b if a is divisible by b; otherwise it returns a/b + 1 (integer division). The grid
dimension is defined as
dim3 dimGrid(iDivUp(MCol, BLOCKDIM_X), iDivUp(NRow, BLOCKDIM_Y));
It calculates how many blocks are sufficient to cover all pixels of the image.

• Memory allocation and data transfer

This is through the function cudaMalloc and cudaMemcpy


//allocate memory for device
cudaMalloc( (void **) &d_in, NRow * MCol * sizeof(unsigned char));
cudaMalloc( (void **) &d_out, NRow * MCol * sizeof(unsigned char));
//copy data to device
cudaMemcpy( d_in, in, NRow * MCol * sizeof(unsigned char), cudaMemcpyHostToDevice );

• The kernel function

The definition of a kernel function is similar to that of a normal C function except for the keyword
__global__.
Since we are dealing with an image, we use a 2-D thread ID to “address” each pixel. This is similar to the
double for-loops in normal CPU code, where we use for-loop indices to address pixels. The difference is that
these threads are allocated to many SPs and the kernel is executed simultaneously. The calculation of the 2-D
thread ID (idx and idy) follows a common pattern. You can use it as a template. Note also that in this code
we use the linear index method.

__global__ void gammaCorrectionGPU(unsigned char *d_in, unsigned char *d_out,
                                   const int NROW, const int NCOL, const float gamma)
{
    float x0;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int k;
    if (idx < NCOL && idy < NROW)
    {
        k = idy * NCOL + idx; // linear index
        x0 = (float) d_in[k];
        d_out[k] = (unsigned char) floor(255.f * pow(x0 / 255.f, gamma));
    }
}
There is some interesting detailed information on what happens when the kernel function is invoked and
threads (tasks) are assigned to SPs in D. Kirk and W-M Hwu’s lecture notes (Chapter 3). There are also
limitations on how many blocks can be assigned to an SM and thus how many blocks can run on a particular
processor. There are also issues related to thread scheduling and hardware resources such as the number
of registers used by a thread. These issues must be considered when defining the dimensions of the grid.
Due to time limitations, we will skip these details.

• The rest of the code

The rest of the code is quite easy to understand


//run GPU code
gammaCorrectionGPU <<< dimGrid, dimBlock >>> ( d_in, d_out, NRow, MCol, gamma);
// wait until the device has completed all computations
cudaThreadSynchronize();
// check if kernel execution generated an error
checkCUDAError("kernel execution");
// device to host copy
cudaMemcpy( out_GPU, d_out, NRow * MCol * sizeof(unsigned char), cudaMemcpyDeviceToHost );
// Check for any CUDA errors
checkCUDAError("cudaMemcpy");

• Running time

We use the function clock() from the library “time.h”. This function returns the number of clock ticks elapsed
since the program was launched. We use it to measure the running time of the CPU and the GPU code.

3.5.3 The Program

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b)) // define a max macro
#define BLOCKDIM_X 16
#define BLOCKDIM_Y 16

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
        exit(-1);
    }
}

int iDivUp(int a, int b)
{
    return ((a % b) != 0) ? (a / b + 1) : (a / b);
}

// GPU function
__global__ void gammaCorrectionGPU(unsigned char *d_in, unsigned char *d_out,
                                   const int NROW, const int NCOL, const float gamma)
{
    float x0;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int k;

    if (idx < NCOL && idy < NROW)
    {
        k = idy * NCOL + idx;
        x0 = (float) d_in[k];
        d_out[k] = (unsigned char) floor(255.f * pow(x0 / 255.f, gamma));
    }
}

// CPU function
void gammaCorrectionCPU(unsigned char *x, unsigned char *y, int N, int M, float g)
{
    int n, m, k;

    // gamma correction
    for (n = 0; n < N; n++)
        for (m = 0; m < M; m++)
        {
            k = n * M + m; // use linear indexing
            y[k] = (unsigned char) floor(255.f * pow((float) x[k] / 255.f, g));
        }
}

/////////////////////////////////////////////////////////////////////////
// Program main
/////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
    FILE *fpr, *fpw;
    int NRow, MCol, max_diff, sum, t;
    int i, j, k;
    float gamma;

    // pointers for host memory
    unsigned char *in, *out, *out_GPU;
    // pointers for device memory
    unsigned char *d_in, *d_out;

    if (argc != 6) // number of inputs + 1
    {
        printf("Five inputs required\n");
        printf("Usage: gammaCorrection input output NRow Mcol gamma\n");
        exit(0);
    }
    else
    {
        fpr = fopen(argv[1], "rb"); // file pointer to read input image
        fpw = fopen(argv[2], "wb"); // file pointer to write output image
        NRow = atoi(argv[3]);       // size of the image
        MCol = atoi(argv[4]);       // size of the image
        gamma = atof(argv[5]);      // gamma
    }

    // if something is still wrong!
    if (fpr == 0 || fpw == 0 || NRow < 0 || MCol < 0)
    {
        printf("Usage: gammaCorrection input output NRow Mcol\n");
        exit(0);
    }

    // allocate memory for host
    in = (unsigned char *) malloc(NRow * MCol * sizeof(unsigned char));
    out = (unsigned char *) malloc(NRow * MCol * sizeof(unsigned char));
    out_GPU = (unsigned char *) malloc(NRow * MCol * sizeof(unsigned char));

    // allocate memory for device
    cudaMalloc((void **) &d_in, NRow * MCol * sizeof(unsigned char));
    cudaMalloc((void **) &d_out, NRow * MCol * sizeof(unsigned char));

    // read image
    fread(in, sizeof(unsigned char), NRow * MCol, fpr);
    fclose(fpr);

    // define grid and block size for GPU
    dim3 threads(BLOCKDIM_X, BLOCKDIM_Y);
    dim3 grid(iDivUp(MCol, BLOCKDIM_X), iDivUp(NRow, BLOCKDIM_Y));
    // dim3 grid(64, 64);

    // starting time for GPU
    t = clock();

    // run the GPU code repeatedly for 20 times
    for (k = 0; k < 20; k++)
    {
        // copy data to device
        cudaMemcpy(d_in, in, NRow * MCol * sizeof(unsigned char), cudaMemcpyHostToDevice);
        cudaMemcpy(d_out, in, NRow * MCol * sizeof(unsigned char), cudaMemcpyHostToDevice);

        // warm up
        cudaThreadSynchronize();

        // run GPU code
        gammaCorrectionGPU<<<grid, threads>>>(d_in, d_out, NRow, MCol, gamma);

        // wait until the device has completed all computations
        cudaThreadSynchronize();

        // check if kernel execution generated an error
        checkCUDAError("kernel execution");

        // device to host copy
        cudaMemcpy(out_GPU, d_out, NRow * MCol * sizeof(unsigned char), cudaMemcpyDeviceToHost);

        // check for any CUDA errors
        checkCUDAError("cudaMemcpy");
    }

    // free device memory
    cudaFree(d_in);
    cudaFree(d_out);

    // finish time for GPU
    t = clock() - t;
    printf("It took %d clock ticks to process the signal using GPU\n", t);

    // write data to file
    fwrite(out_GPU, sizeof(char), NRow * MCol, fpw);
    fclose(fpw);

    // CPU version
    t = clock();

    // run CPU code 20 times
    for (k = 0; k < 20; k++)
        gammaCorrectionCPU(in, out, NRow, MCol, gamma);

    t = clock() - t;
    printf("It took %d clock ticks to process the signal using CPU\n", t);

    // see if there is a difference between the two results
    sum = 0;
    max_diff = 0;
    for (i = 0; i < NRow; ++i)
        for (j = 0; j < MCol; ++j)
        {
            k = i * MCol + j;
            max_diff = MAX(max_diff, abs(out_GPU[k] - out[k]));
            sum = sum + abs(out_GPU[k] - out[k]);
        }

    printf("Max_diff and sum of abs diff are %d %d\n", max_diff, sum);

    // free host memory
    free(out_GPU);
    free(in);
    free(out);

    return 0;
}

3.5.4 Compile and run

To compile this program, which is called “gammaCorrection.cu”, you need to use the compiler nvcc in the
Visual Studio Command Prompt window.

• In Windows, click “Start” -> “Programs” -> “Microsoft Visual Studio 2010” -> “Visual Studio Tools”
-> “Visual Studio 2010 Command Prompt”. You will have a command window.

• Suppose your code is saved in a directory z:\abc. In the command window, you need to change the
directory by using: cd z:\abc. You can make sure your program is in the current directory by listing
the files: dir *.cu . If you can see the file gammaCorrection.cu listed, then you can go on to the next
step. Otherwise, you need to download the source code from the LMS web page and store it in the
current directory.

• The simplest way to compile the program is as follows:


nvcc gammaCorrection.cu -o gammaCorrection

• In the command window, you can run the program by:


gammaCorrection airplane out0.5 512 512 0.5
You should have a new file called out0.5 in your current directory. Running the program gammaCorrection,
you should see the number of clock ticks that the GPU and the CPU take to finish the processing.
On my desktop machine with an Intel Core 2 Duo 2.4 GHz processor and an NVIDIA 9800GT GPU, the
program reports 0 and 47 clock ticks for the GPU and CPU, respectively.

• You can load result and the original image into Matlab and display it by using the following commands:
a = fopen(’out0.5’,’r’);
b = fread(a, [512 512]);
imshow(b/256, []).

• To make the program run faster, we can turn on the optimization options for both the CPU and the GPU. We
have discussed the CPU options in 2.12. To pass a C compiler option, we need to follow the format:
-Xcompiler /option. For the GPU, we can use the option -use_fast_math. Another useful option is
-Xptxas=-v. It tells us some information about the hardware usage, e.g., the number of registers used
and the number of bytes of constant and shared memory used.

• Here is an example of turning on the O2 optimization of the C compiler:

nvcc -Xcompiler /O2 gammaCorrection.cu -o gammaCorrection

• A more complicated example is as follows:

nvcc -use_fast_math -Xptxas=-v -Xcompiler /O2 -Xcompiler /fp:fast gammaCorrection.cu
-o gammaCorrection
By turning on the optimization options, I was able to reduce the running time for the CPU code from 47
clock ticks to about 32.

3.6 Exercises

• Identify the GPU and CPU sections of the code in the main function. How many times are the GPU
and CPU codes repeatedly run?

• Compile and run the program gammaCorrection.cu using: nvcc gammaCorrection.cu -o gammaCorrection.
How many clock ticks does it take for the GPU and the CPU to process the image? Load the
result into Matlab and display the processed and original images side-by-side.

• Use some compiler options:


nvcc -use_fast_math -Xptxas=-v -Xcompiler /O2 -Xcompiler /fp:fast gammaCorrection.cu -o gammaCorrection.
How many registers does the kernel use? How many bytes of shared memory [smem] and constant memory [cmem]
does the kernel use? How many clock ticks does it take for the GPU and the CPU to process the image?

• Modify the code for the GPU and CPU sections such that both the GPU and CPU processing code is
repeatedly run 100 times. (Hint: you need to change the for-loops.)

• Compile and run the program again. Based on the timing numbers for the GPU and the CPU, estimate
how many times the processing speed of the GPU is faster than that of the CPU.

4 Linear index of 2D arrays

4.1 C language

When we say we have an array a[3][4], we mean the array has 3 rows and 4 columns. Generally, the size of
the array is represented by something like “NRow” and “MCol”, where the former is the number of rows and the
latter the number of columns. In the C language, matrix data is stored row by row (row-major order). Thus the linear
address of a particular element a[m][n] is calculated as: id = m * MCol + n. For example, the linear
index of the element at the 3rd row and 2nd column, a[2][1], is calculated as: id = 2*4+1 = 9.
This point is important, as the same notation is used in calculating the thread index. For example, we define
a (3 × 5) grid. Each block has (6 × 7) threads. The two variables: blockDim.x and blockDim.y are then
given by
blockDim.x = 7
blockDim.y = 6
The horizontal index of a thread, defined as idx, is calculated as
idx = blockIdx.x * blockDim.x + threadIdx.x;
which is actually the column coordinate of the thread. Similarly, the vertical index of the thread, defined as
idy, is calculated as
idy = blockIdx.y * blockDim.y + threadIdx.y;
which is actually the row coordinate of the thread. For an image with SizeCol columns, the linear index is then
given by
id = idy * SizeCol + idx.
Here is an example to illustrate the above points. Suppose we have two images of the size (480 × 720) stored
in two 1-D arrays: A and B. Suppose we want to blend the two images using C = 0.3 * A + 0.7 * B. In C
language, we have
SizeCol = 720; // number of columns
SizeRow = 480; // number of rows
for (int row = 0; row < SizeRow; ++row)
for (int col = 0; col < SizeCol; ++col)
{ id = row * SizeCol + col;
C[id] = 0.3 * A[id] + 0.7 * B[id];
}
In GPU code, we have
SizeCol = 720;
SizeRow = 480;
idx = blockIdx.x * blockDim.x + threadIdx.x; // col index
idy = blockIdx.y * blockDim.y + threadIdx.y; // row index
if ( idx < SizeCol && idy < SizeRow )
{ id = idy * SizeCol + idx;
C[id] = 0.3 * A[id] + 0.7 * B[id];
}
The important points are

1. In C the first array index refers to the row while the second refers to the column.

2. In CUDA the x-related thread index refers to column index while the y-related index refers to row
index.

3. To avoid confusion, linear indexing is always row-based.

4.2 Matlab

However, in Matlab the storage of a matrix is column-based (column-major). For a matrix of size (M × N), the linear index
of its element at the mth row and nth column is calculated as: id = (n-1) * M + m. Therefore, we must be
careful when programming in mixed C and Matlab. When the matrix data stored as a variable A in Matlab
is passed onto a pointer variable B (declared as float *B) in a C program, the linear index of the element
at the mth row and nth column (0-based) in the C program is given by
idx = n*NROW + m;
where NROW is the number of rows of the matrix. This will be further discussed in the next section.

5 Run CUDA code in Matlab

5.1 Basic steps

The Matlab Parallel Computing Toolbox allows us to run certain Matlab functions in GPU. There are three
methods:

1. Load the data into GPU by using the function gpuArray, then perform the operations
%x is a matrix, y is another matrix
x_d = gpuArray(x);
y_d = gpuArray(y);
z_d = exp(x_d) * cos(y_d) + log(abs(x_d)); % operations running in GPU
z = gather(z_d); %copy the result back to workspace, z_d is in GPU memory

2. Using the function arrayfun. See the Matlab documentation for details.

3. Compile the CUDA kernel and run it in Matlab. The following is an example of a CUDA program
stored as GC.cu.

//GPU function
__global__ void gammaCorrectionGPU(unsigned char *d_in, const int NROW,
                                   const int NCOL, const float gamma)
{
    float x0;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int k;

    if (idx < NROW && idy < NCOL)
    {
        k = idy * NROW + idx;
        x0 = (float) d_in[k];
        d_in[k] = (unsigned char) floor(255.f * pow(x0 / 255.f, gamma));
    }
}

It performs a simple gamma correction operation. To run it in Matlab, we need to follow these steps:

(a) Compile the kernel into ptx code:


nvcc -ptx GC.cu
(b) Make the Matlab kernel:
k = parallel.gpu.CUDAKernel(’GC.ptx’,’GC.cu’).
To see the properties of the kernel, just type k and press return in the command window.
The following message shows up
k =
parallel.gpu.CUDAKernel handle Package: parallel.gpu

Properties:
ThreadBlockSize: [1 1 1]
MaxThreadsPerBlock: 1024
GridSize: [1 1]
SharedMemorySize: 0
EntryPoint: ’_Z18gammaCorrectionGPUPhiif’
MaxNumLHSArguments: 1
NumRHSArguments: 4 ArgumentTypes: {1x4 cell}
Methods, Events, Superclasses
The ThreadBlockSize is set to [1 1 1] and the GridSize is set to [1 1]. This means only one thread
block having one thread is set up by the function parallel.gpu.CUDAKernel. So we need to
set up the threads ourselves. For an image of N rows and M columns, we need N × M threads. Suppose
we set the block size as [16 16], which has 16×16 threads per block. The grid size should then be
[ceil(N/16), ceil(M/16)]. For example, we use the following code to set up the threads
[N, M] = size(x);
k.ThreadBlockSize = [16 16];
k.GridSize = [ceil(N/16) ceil(M/16)];
After setting up the threads, we can run the CUDA kernel by using
y = feval(k, x, N, M, 0.1); % y is in GPU card memory
y = gather(y); %copy it to computer memory

(c) Let us come back to the kernel function. How do we let Matlab know which one is the input and
which one is the output? For example, if the C kernel within a CU file has the following signature:
__global__ void simpleExample( float * pInOut, float c )
The corresponding kernel object (k) in MATLAB has the following properties:
MaxNumLHSArguments: 1
NumRHSArguments: 2
ArgumentTypes: {’inout single vector’ ’in single scalar’}
Therefore, to use the kernel object from this code with the function feval, we need to provide
feval with two input arguments (in addition to the kernel object), and we can use one output argument:
y = feval(k, x, C); The input values x and C correspond to pInOut and c in the C
function prototype. The output argument y corresponds to the value of pInOut in the C function
prototype after the C kernel has executed. This is what happens in the program GC.cu in which
the pointer unsigned char *d_in acts as the input and the output. What do we do if we have
three matrix inputs and two matrix outputs? The rule is that if we declare the pointer to be a
constant, then it will not be used as the output. Here is an example that shows a combination of
const and non-const pointers:
__global__ void complicatedExample( const float * pIn, float * pInOut1, float * pInOut2)
The corresponding kernel object in MATLAB then has the properties:
MaxNumLHSArguments: 2
NumRHSArguments: 4
ArgumentTypes: {’in single vector’ ’inout single vector’ ’inout single vector’}

You can use feval on this code’s kernel (k) with the syntax:
[y1, y2] = feval(k, x1, x2, x3)
The three input arguments x1, x2, and x3, correspond to the three arguments that are passed
into the C function. The output arguments y1 and y2, correspond to the values of pInOut1 and
pInOut2 after the C kernel has executed.

5.2 Dealing with matrix element’s index in CUDA kernel

Here is an example of a CUDA kernel function which is used to calculate the sum of two matrices.

__global__ void addMtx(float *d_out, const float *d_in, const int NROW, const int NCOL)
{
    int k;
    int idx = blockIdx.x * blockDim.x + threadIdx.x; // row idx
    int idy = blockIdx.y * blockDim.y + threadIdx.y; // col idx

    if (idx < NROW && idy < NCOL)
    {
        k = idy * NROW + idx;
        d_out[k] = d_in[k] + d_out[k];
    }
}

Here is another example showing how the indexing works. This example demonstrates how to use a
GPU kernel to calculate a 2-D average filter with a fixed window size.

//GPU function
__global__ void weightedAverageFilter(unsigned char *d_out, const unsigned char *d_in,
                                      const int NROW, const int NCOL,
                                      const int padNum, const int winSize)
{
    float x0;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int k, m, jj, kk, NROW_A;

    // idx must be linked to the row index and idy linked to the col index
    if (idx < NROW && idy < NCOL)
    {
        k = idy * NROW + idx;
        NROW_A = NROW + 2 * padNum;
        x0 = 0.f;
        /* column-wise, a little bit more multiplications */
        for (jj = idx; jj <= (idx + 2 * padNum); jj++)     // row index
            for (kk = idy; kk <= (idy + 2 * padNum); kk++) // col index
            {
                x0 = x0 + ((float) d_in[kk * NROW_A + jj]);
            }

        /* row-wise, one more register used, fewer multiplications */
        /*
         * for (kk = idy; kk <= (idy + 2 * padNum); kk++)    // col index
         * { m = kk * NROW_A;
         *   for (jj = idx; jj <= (idx + 2 * padNum); jj++)  // row index
         *   { x0 = x0 + ((float) d_in[m + jj]); }
         * }
         */
        d_out[k] = (unsigned char) (x0 / (winSize * winSize));
    }
}

We notice that, compared with a C program that is directly compiled and run, the above CUDA kernel
functions differ in the following aspects:

1. The index idx is used to represent the ROW index, while in the C program it is used to represent
the COLUMN index. Similarly the index idy is used to represent the COLUMN index, while in the C
program it is used to represent the ROW index. This is the case, because when the matrix data stored
in a variable is passed on to the CUDA kernel function, the matrix is stored in a column-wise way, i.e.,
it is stored as a 1-D array by stacking one column after another column. For example, a (2 × 3) matrix
is given by

A = [ 1 3 5
      2 4 6 ]

In Matlab, it is stored as [1 2 3 4 5 6]. The 1st two numbers are from the 1st column and the next
two numbers are from the 2nd column, etc. When this 1-D array is passed onto the CUDA kernel, the
data is copied to the device memory. To find out the index of the element at the mth row and nth
column, we perform the calculation: id = n*NROW + m. For example, the element located at row-2
and column-3 in Matlab is A(2,3) = 6. The linear index of this element in Matlab is (3-1)*2+2 = 6.
This is because in Matlab the index starts from 1. However, in C the index starts from 0. So
the same element would be located at row-1 and column-2 and the linear index is id = 2*2+1 = 5, which
is the index of the last element of the 1-D vector, which has 6 elements with indices from 0 to 5.

2. The for-loop can be implemented in the column-wise way or the row-wise way.

5.3 Using C language in Matlab and dealing with matrix element’s index in the
C kernel

Similar to the CUDA kernel, we can write a C program and compile it to run in Matlab using the mex
function. Although Matlab is very powerful in signal and image processing, for some applications it is
desirable to use the C language, which runs faster. To use C in Matlab, you need a C program which contains
two major parts: the actual program for the image processing and a function for the interface between C and
Matlab. Let me use an example to show you how to program it.

The interface function has a fixed format:

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])

We need to define the input and output variables in this function. The four arguments have fixed meanings:

nlhs -- number of variables on the left-hand side (output variables)
nrhs -- number of variables on the right-hand side (input variables)
plhs -- array of pointers for creating the output matrices
prhs -- array of pointers to the input matrices
Once you have the code, you compile it in Matlab. Matlab comes with a built-in compiler called “lcc”; however, this compiler does not produce fast-running programs. A better choice is another compiler such as gcc, the Intel C compiler, or the C compiler of Microsoft Visual Studio. The function you use to compile your C code is mex.
The mex function allows you to select a compiler if you have multiple compilers installed on your system. To do that, run: mex -setup. A very small gcc-based compiler you can install and use is called gnumex. See http://gnumex.sourceforge.net/ for download and installation information. Detailed information and more examples of using the C language in Matlab can be found at http://cnx.org/content/m12348/latest/
and http://www.cs.yale.edu/homes/spielman/ECC/cMatlab.html

5.3.1 Example 1

Here is a simple program taken and modified from http://cnx.org/content/m12348/latest/. The purposes of this code are (1) to show how to write the interface function, and (2) more importantly, to show how indices are treated in Matlab and in C. The code calculates the average value of each column and of each row of a matrix. I save this code in a file called “mySimpleC.c”. To run this function, I first compile it using: mex mySimpleC.c. Then I test it by entering the following in the Matlab command window:

a = [1 2 3 4; 5 6 7 8];
mySimpleC(a);

I have the following results


The average of column 0 is 3.000000
The average of column 1 is 4.000000
The average of column 2 is 5.000000
The average of column 3 is 6.000000
The average of row 0 is 2.500000
The average of row 1 is 6.500000

There are two important points. (1) Just as in the CUDA kernel case, you must keep the storage convention in mind when using the two functions mxGetM and mxGetN: mxGetM returns the number of rows and mxGetN the number of columns of the Matlab matrix. The possible confusion comes from the two different ways of storing a matrix: Matlab stores it column-wise, while in the C language a 2-D array is stored row-wise, so the "rows" a C programmer instinctively loops over do not correspond to the memory layout Matlab uses. (2) Just as in the CUDA kernel, the linear index of a matrix element in the C function is

idxLin = idxCol * MROW + idxRow.

// mySimpleC.c
#include "math.h"
#include "mex.h"   // This one is required

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{   // Declarations
    const mxArray *xData;   // const, because prhs[] is const
    double *xValues;
    int idxRow, idxCol, MROW, NCOL, idxLin;
    double avg;

    // Copy input pointer x
    xData = prhs[0];

    // Get matrix x
    xValues = mxGetPr(xData);
    NCOL = mxGetN(xData);   // this is the number of columns NOT rows
    MROW = mxGetM(xData);   // this is the number of rows NOT columns

    // Print the average of each column to the Matlab console
    for (idxCol = 0; idxCol < NCOL; idxCol++)
    {   avg = 0;
        for (idxRow = 0; idxRow < MROW; idxRow++)
        {   idxLin = idxCol * MROW + idxRow;
            avg = avg + xValues[idxLin];
        }
        avg = avg / MROW;
        printf("The average of column %d is %f\n", idxCol, avg);
    }

    // Print the average of each row to the Matlab console
    for (idxRow = 0; idxRow < MROW; idxRow++)
    {   avg = 0;
        for (idxCol = 0; idxCol < NCOL; idxCol++)
        {   idxLin = idxCol * MROW + idxRow;
            avg = avg + xValues[idxLin];
        }
        avg = avg / NCOL;
        printf("The average of row %d is %f\n", idxRow, avg);
    }
}

5.3.2 Example 2

As a more interesting example, suppose I want to write a program that shrinks an image by a factor of 2. This can be implemented as follows. I divide the image into (2 × 2) blocks; then, for each block, I take the average value of its 4 pixels as the corresponding pixel value of the output image. The code is shown below. In addition to the two important points above, you can see how the linear indices of the input and output pixels are calculated. To test this program, I run the following commands in the Matlab command window:

mex imageShrink2.c
x = double(imread('cameraman.tif')+1)/257;  % read image data into x
y = imageShrink2(x);
imtool([y])

#include <stdio.h>
#include "mex.h"

void imageShrink2(double *imOut, double *imIn, int MROW, int NCOL)
{   int idxRow, idxCol, idxLin_imIn, idxLin_imOut;
    int r, c, M, N;
    double tmp;

    // determine the size of the shrunk image (assumes even dimensions)
    M = MROW / 2;
    N = NCOL / 2;

    for (idxRow = 0; idxRow < MROW; idxRow = idxRow + 2)
        for (idxCol = 0; idxCol < NCOL; idxCol = idxCol + 2)
        {   tmp = 0;
            for (r = idxRow; r <= idxRow + 1; r++)
                for (c = idxCol; c <= idxCol + 1; c++)
                {   idxLin_imIn = c * MROW + r;
                    tmp = tmp + imIn[idxLin_imIn];
                }
            idxLin_imOut = M * idxCol / 2 + idxRow / 2;
            imOut[idxLin_imOut] = tmp / 4;
        }
    return;
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{   double *x, *y;
    int MROW, NCOL;
    int M, N;   // output image size

    /* Check for proper number of arguments. */
    if (nrhs != 1)
    {   mexErrMsgTxt("Usage: y = imageShrink2(x);"); }
    else if (nlhs > 1)
    {   mexErrMsgTxt("Too many output arguments"); }

    // get the image and its size
    x = mxGetPr(prhs[0]);
    NCOL = (int) mxGetN(prhs[0]);   // number of columns NOT rows
    MROW = (int) mxGetM(prhs[0]);   // number of rows NOT columns

    // determine the size of the shrunk image
    M = MROW / 2;
    N = NCOL / 2;

    /* Create matrix for the return argument. */
    plhs[0] = mxCreateDoubleMatrix(M, N, mxREAL);
    /* Assign pointer to the output. */
    y = mxGetPr(plhs[0]);

    /* Call the subroutine. */
    imageShrink2(y, x, MROW, NCOL);
}

5.3.3 Example 3

In this example, I show you how to pass an integer from the interface function to the C program. Here I extend the functionality of imageShrink2.c so that it can handle a user-specified shrink factor, obtained with

shrinkFactor = (int) mxGetScalar(prhs[1]);

To test this program, I run the following commands in the Matlab command window:

mex imageShrink.c
x = double(imread('cameraman.tif')+1)/257;  % read image data into x
y = imageShrink(x,3);  % shrink by a factor of 3
imtool([y])

40
#include <stdio.h>
#include "mex.h"
#include <math.h>

void imageShrink(double *imOut, double *imIn, int MROW, int NCOL, int shrinkFactor)
{   int idxRow, idxCol, idxLin_imIn, idxLin_imOut;
    int r, c, M, N, count;
    double tmp;

    // determine the size of the shrunk image
    M = (int) ceil((double) MROW / (double) shrinkFactor);
    N = (int) ceil((double) NCOL / (double) shrinkFactor);

    for (idxRow = 0; idxRow < MROW; idxRow = idxRow + shrinkFactor)
        for (idxCol = 0; idxCol < NCOL; idxCol = idxCol + shrinkFactor)
        {   tmp = 0;
            count = 0;
            // clamp the block to the image border so we never read past
            // the end when the size is not a multiple of shrinkFactor
            for (r = idxRow; r <= idxRow + shrinkFactor - 1 && r < MROW; r++)
                for (c = idxCol; c <= idxCol + shrinkFactor - 1 && c < NCOL; c++)
                {   idxLin_imIn = c * MROW + r;
                    tmp = tmp + imIn[idxLin_imIn];
                    count = count + 1;
                }
            idxLin_imOut = M * (idxCol / shrinkFactor) + idxRow / shrinkFactor;
            imOut[idxLin_imOut] = tmp / count;   // average over the pixels actually summed
        }
    return;
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{   double *x, *y;
    int shrinkFactor;
    int MROW, NCOL;
    int M, N;   // output image size

    if (nrhs != 2 || nlhs > 1)
    {   mexErrMsgTxt("Usage: y = imageShrink(x, shrinkFactor);"); }

    // get the image and its size
    x = mxGetPr(prhs[0]);
    NCOL = (int) mxGetN(prhs[0]);   // number of columns NOT rows
    MROW = (int) mxGetM(prhs[0]);   // number of rows NOT columns

    // get an integer -- the shrink factor
    shrinkFactor = (int) mxGetScalar(prhs[1]);

    // determine the size of the shrunk image
    M = (int) ceil((double) MROW / (double) shrinkFactor);
    N = (int) ceil((double) NCOL / (double) shrinkFactor);

    /* Create matrix for the return argument. */
    plhs[0] = mxCreateDoubleMatrix(M, N, mxREAL);
    /* Assign pointer to the output. */
    y = mxGetPr(plhs[0]);

    /* Call the subroutine. */
    imageShrink(y, x, MROW, NCOL, shrinkFactor);
}

6 Summary
This has been a very short review of certain key elements of C programming and a very, very short introduction to programming in CUDA. I hope that we can learn more about programming in CUDA in another unit, ELE5IPC (ELE4ICP), in which we can do more image processing.
To learn more about CUDA, NVIDIA's CUDA web page contains a lot of useful information and links to tutorials and technical training materials. UIUC has a one-semester course in CUDA programming whose web page has detailed lecture notes.

