
2016 IEEE Students’ Conference on Electrical, Electronics and Computer Science

PARALLEL EDGE DETECTION BY SOBEL ALGORITHM USING CUDA C

Adhir Jain, Anand Namdev, Dr. Meenu Chawla
Dept. of Computer Science & Engg., MANIT, Bhopal, M.P., India 462003
[email protected] [email protected] [email protected]

Abstract—Edge detection is one of the most important paradigms of image processing. Images contain millions of pixels, and each pixel's information is independent of its neighbouring pixels. This paper therefore tests the capability of the Graphics Processing Unit (GPU) to perform in parallel the millions of pixel calculations involved in image processing. Because each pixel operation is independent of the others, the GPU can be used effectively through high-level programmable interfaces. More specifically, this paper adopts the Compute Unified Device Architecture (CUDA) as its parallel programming platform and examines the possible gain in time that can be attained for edge detection in images. The well-known Sobel algorithm for edge detection is used in the experiment. A dataset of images was tested for edge detection both serially and in parallel. The results of the parallel algorithm were further divided according to the machine on which the algorithm was tested, and then classified according to the number of kernel functions used on each machine. Results showed that, for larger images, the parallel implementation is about 262 times and 943 times faster than the serial implementation when 2 kernel functions were used on the GeForce and Tesla machines respectively, and about 120 times and 455 times faster when 3 kernel functions were used on the GeForce and Tesla machines respectively. Statistics also showed a decline in speedup of about 52% when 3 kernels were used instead of 2, due to the increase in communication time. Hence the analysis shows that the decision regarding which sections of the algorithm to parallelise should be taken wisely; otherwise the additional overhead, i.e. communication time (the time taken in transferring data from CPU to GPU and back from GPU to CPU), reduces the overall speedup.

Keywords — Parallel Edge Detection, Parallel Computing, CUDA, Kernel, Sobel Operator, Speedup, Image Processing, MATLAB, Parallelism v/s Speedup

I. INTRODUCTION

Image processing is one of the most important fields in use today. Human vision can easily acquire, understand and interpret the information stored in the form of images. It is, on the other hand, a challenging task to make a machine acquire and understand this information so as to automate tasks without human intervention. It is thus important to learn and implement the various techniques of image processing. One of the most important paradigms of image processing is image edge detection. Image edges are the most basic features of an image: an edge is the set of connected pixels that forms the boundary between two disjoint regions. Edges allow the user to observe the features of an image where a more or less abrupt change in intensity occurs. Edge detection has several important applications in digital image processing, such as pattern recognition and the medical field. A number of edge detection algorithms have been described in previous papers. The Sobel operator [1] is one of the classic operators used in edge detection, as it is simple, insensitive to noise and requires fewer computations than other operators. How to quickly and accurately extract the edge information of images has always been a hot research topic. An image consists of millions of pixels, and each pixel operation is independent of its neighbouring pixels. Thus a parallel programming model can be used, which focuses on performing many operations concurrently, though individually more slowly, rather than performing individual operations rapidly [10].

CUDA [5] is a parallel computing architecture developed by NVIDIA for massively parallel high-performance computing. It is the compute engine in the GPU and is accessible to developers through standard programming languages. The source code for the edge detection program consists of both serial (CPU) and parallel (GPU) code.

This paper compares the performance differences between program code run on a sequential processor (CPU) and on a parallel processor (GPU). It also compares the results on different types of GPU machine (GeForce [3] and Tesla [4]) and on the basis of the number of kernel functions [2] [11] involved on each machine.

The organization of this paper is as follows. In section 2, the Sobel edge detection operator is discussed; the CUDA processing flow is described in section 3. In section 4, the CUDA thread hierarchy and CUDA kernels are introduced; the detailed algorithm to be implemented is explained in section 5. In section 6, GPU programming in MATLAB is presented. Outcomes and results are shown in section 7. In section 8, graphs are analysed along with some observations. Finally, the conclusions are stated in section 9.

II. SOBEL EDGE DETECTION OPERATOR

Edge detection is a common image processing technique used in feature detection and extraction. Applying edge detection to an image can significantly reduce the amount of

978-1-4673-7918-2/16/$31.00 © 2016 IEEE


Authorized licensed use limited to: VIT University. Downloaded on November 08,2024 at 14:45:10 UTC from IEEE Xplore. Restrictions apply.
data needed to be processed at a later phase, while maintaining the important structure of the image. The idea is to remove everything from the image except the pixels that are part of an edge. These edges have special properties, such as corners, lines and curves. A collection of these properties or features can be used to accomplish a bigger task, such as image recognition. An edge can be identified by significant local changes of intensity in an image; an edge usually divides two different regions of an image. Most edge detection algorithms work best on an image to which a noise removal procedure has already been applied. A simple approach is the Sobel edge detection algorithm. It involves convolving the image with an integer-valued filter, which is both simple and computationally inexpensive.

The Sobel operator is widely used in image processing, particularly within edge detection algorithms. The Sobel operator finds the approximate derivatives in the horizontal and vertical directions:

Gx = {f(x+1, y−1) + 2f(x+1, y) + f(x+1, y+1)} − {f(x−1, y−1) + 2f(x−1, y) + f(x−1, y+1)}

Gy = {f(x−1, y+1) + 2f(x, y+1) + f(x+1, y+1)} − {f(x−1, y−1) + 2f(x, y−1) + f(x+1, y−1)}

And the net gradient is: g(x, y) = √(Gx² + Gy²)

Its convolution templates, written out from the expressions above (rows indexed by y−1, y, y+1 and columns by x−1, x, x+1), are:

Tx = [ −1  0  +1 ]      Ty = [ −1  −2  −1 ]
     [ −2  0  +2 ]           [  0   0   0 ]
     [ −1  0  +1 ]           [ +1  +2  +1 ]

The Sobel operator is used to detect the edges of an image M: the horizontal template Tx and the vertical template Ty are convolved with the image, without taking the border conditions into account. The total gradient value G is then obtained by adding the two gradient matrices; this G is called the gradient image. Finally, edges can be detected by applying a threshold to the gradient image [6].

III. CUDA PROCESSING FLOW

To make the GPU work for general-purpose calculations, a certain processing flow is to be maintained, which is as follows:
1) Copy the input image arrays from CPU memory to GPU memory to load the required data on the device for computation.
2) Load the GPU program and execute it, caching data on chip for performance. The time required for the evaluation of results on the GPU is known as the execution time.
3) Copy the result image arrays back from GPU memory to CPU memory to further manipulate and display the results.
The above processing flow is depicted in Figure 1.

Figure 1. CUDA processing flow [12]

IV. CUDA THREAD HIERARCHY AND CUDA KERNEL

Threads on the device are automatically invoked when a kernel is executed. The programmer determines the number of threads that best suits the given problem; the thread count, along with the thread configuration, is passed into the kernel. Figure 2 shows the entire collection of threads responsible for an execution of the kernel, called a grid. A grid is further partitioned and can consist of one or more thread blocks. A block is an array of concurrent threads that execute the same thread program. A thread block can be partitioned into one, two or three dimensions. All threads within a block can cooperate with each other: they can share data by reading and writing shared memory, and they can synchronize their execution by calling __syncthreads().

SCEECS 2016
The threading configuration is then passed to the kernel. Within the kernel, this information is stored in built-in variables: blockDim holds the dimension information of the current block, while blockIdx and threadIdx provide the current block and thread indices. One limitation on blocks is that each block can hold up to 512 threads. Once a kernel is launched, the corresponding grid and block structure is created.

Figure 2. Grid of thread blocks [7]

V. DETAILED ALGORITHM

The algorithm is as follows:
1) Fetch the red (R(i,j)), green (G(i,j)) and blue (B(i,j)) components of the input colour image.
2) Go to step 3 directly if the image is already in grayscale format. Otherwise, convert the input image into a grayscale [8] image using the formula:

gray(i,j) = 0.2989∗R(i,j) + 0.5870∗G(i,j) + 0.1140∗B(i,j)

3) Apply the Sobel operator in both the horizontal and vertical directions to calculate the horizontal and vertical gradients.
4) Calculate the net gradient and call that image the gradient image.
5) Finally, take input from the user through a slider depicting the different threshold values and apply it to the gradient image.
NOTE: According to their needs, the user can either set the threshold or treat the gradient image with a predefined optimum threshold [6] value.

The above algorithm (flowchart in Figure 3) is implemented both serially on the CPU and in parallel on the GPU. In this paper, three sections of the algorithm have been identified for executing in parallel, which are as follows:
(a) Fetching the red, blue and green components from the colour image.
(b) Conversion of the coloured (RGB) image to a grayscale image.
(c) Applying the Sobel mask and calculating the gradient image.

This paper shows the implementation of the algorithm in two ways: first, by defining 2 kernels, parallelizing only points (b) and (c); secondly, by defining 3 kernels, parallelizing all of the above points. Finally, the results are compared with regard to speedup on the two machines, GeForce and Tesla.

The decision about which sections of the code to parallelise should be taken wisely. As the parallelism in the code increases, the speedup increases only up to a certain image size; beyond that, any further parallelism results in an overall reduction of the speedup. The reason is that parallelism involves a large transfer of data from the CPU to the GPU for parallel execution on the device side, and then again from the GPU to the CPU to copy the results back to the host side. As the image size increases, both execution time and communication time increase due to the larger number of pixel calculations, but the communication time, i.e. the overhead, increases at a higher rate. Thus the ratio of execution time to communication time decreases with increasing image resolution, reducing the overall speedup. This paper explains this phenomenon and analyses the effect of parallelism versus image resolution with the help of the graphs in section 8.

Figure 3. Algorithmic Flow

VI. GPU PROGRAMMING IN MATLAB

GPUs are increasingly applied to scientific calculations. Unlike a traditional CPU, which includes no more than a handful of cores, a GPU has a massively parallel array of integer and floating-point processors, as well as dedicated high-speed memory. A typical GPU comprises hundreds of these smaller processors. The increased throughput made possible by a GPU comes at a cost. Firstly, for computations to be fast enough, data must be sent from the CPU to the GPU before calculation and then retrieved from it afterwards; because a GPU is attached to the host CPU via the Peripheral Component Interconnect (PCI) Express bus, the memory access is slower than with a traditional CPU. This means that the overall computational speedup is limited by the amount of data transfer that occurs in the algorithm. Secondly, programming for GPUs requires a different model and a skill set that can be difficult and time-consuming to acquire. A lot of time is also spent on managing the code and making it work with these large numbers of threads.

Experienced programmers can write their own CUDA code and use the CUDA Kernel interface in the Parallel Computing Toolbox of MATLAB to integrate the CUDA code with MATLAB, thereby creating a MATLAB object that provides access to the existing CUDA kernel, which has already been converted into PTX code (PTX is a low-level Parallel Thread eXecution instruction set). They then invoke the feval command to evaluate the kernel on the GPU, using MATLAB arrays as input and output.

A. 64-bit .ptx generation

The CUDA kernel code, which is written separately, cannot be run directly in MATLAB; hence it is converted to PTX code, which is then executed from MATLAB commands by creating an object of the generated PTX file. This can be done by running CMD as administrator and typing in the required command. A screenshot is given for proper visualization in Figure 4.

Figure 4. Command to generate ptx file [9]

B. Evaluating the CUDA Kernel in MATLAB

To load the kernel into MATLAB, the path is provided to the compiled PTX file and the source code:

ptx_object = parallel.gpu.CUDAKernel('Kernel.ptx', 'Kernel.cu');

Once the ptx_object is created, a few setup tasks must be completed before one can launch it, such as initializing the return data and setting the sizes of the thread blocks and grid. The kernel can then be used just like any other MATLAB function, except that it is launched using the feval command, with the following syntax:

output = feval(ptx_object, input_Arguments) [11]

VII. OUTCOMES AND RESULTS

Figure 5. The above four figures are the outcomes of the experiment done with the parallel algorithm for Sobel edge detection: (a) original image Thunderbird of size 1500x1500 pixels, (b) greyscale-converted image, (c) gradient image and (d) final edge-detected image.

The abbreviations used in the tables below are as follows:
IR = Image Resolution in pixel²
ST = Serial Time in seconds
PT(2k) = Parallel Time in seconds when two kernels were used
S(2k) = Speedup when two kernels were used
PT(3k) = Parallel Time in seconds when three kernels were used
S(3k) = Speedup when three kernels were used
∆(2k,3k) = Percentage decrease in speedup

Table I. ALGORITHM RUNNING TIME AND SPEEDUP ON GEFORCE MACHINE (SLOW)

IR          ST      PT(2k)  S(2k)   PT(3k)  S(3k)   ∆(2k,3k)
1500x1500   59.50   0.35    168.36  0.74    79.83   52.58
1800x1800   86.73   0.45    191.79  1.06    81.74   57.38
2200x2200   137.18  0.70    194.43  1.49    91.81   52.77
3500x3500   425.67  1.69    250.74  3.74    113.63  54.68

Table II. ALGORITHM RUNNING TIME AND SPEEDUP ON TESLA MACHINE (FAST)

IR          ST      PT(2k)  S(2k)   PT(3k)  S(3k)   ∆(2k,3k)
1500x1500   59.50   0.08    705.32  0.17    338.52  52.04
1800x1800   86.73   0.11    725.56  0.25    342.14  52.84
2200x2200   137.18  0.17    771.71  0.36    372.25  51.76
3500x3500   425.67  0.45    943.88  0.93    454.16  51.88

Table I and Table II show the results for the image datasets. The same experiment was performed on around 25 image datasets with resolutions ranging from 256x256 to 10000x10000, on two machines, GeForce and Tesla. The experimental results showed that on average there was about a 52% decrease in speedup when 3 kernels were used compared to 2 kernels. Hence it is not always beneficial to impose parallelism unnecessarily, as it comes at the cost of communication time, as explained earlier in the paper. The hardware used for the simulation environment is described below:

Table III. HARDWARE SPECIFICATIONS

Property               CPU    GeForce  Tesla
No. of cores           8      96       448
Memory capacity (GB)   8      2        8
Memory speed (GHz)     1.90   1.62     1.50

VIII. GRAPHS AND OBSERVATIONS

Figure 6 shows the graph plotted against image resolution (in pixel²) and the speedup on both machines using 2 kernel functions.

Figure 6. Comparative study (2 kernels)

From the graph in Figure 6 it can be seen that as the image resolution increases, the speedup increases, but only up to a resolution of about 6500; any further increase in the size of the image results in a decrease of the speedup. This decrease is mainly due to the communication time, which consists of first copying the data to the GPU side and then copying the results back after execution on the device side. Hence, as the size of the image increases, the time for transferring the data to the GPU, i.e. the communication time, increases at a higher rate than the execution time, which results in the reduction of the speedup.

Figure 7 shows the graph drawn for the test cases when the GPU code consisted of 3 kernels. One extra kernel was added in order to produce maximum parallelism in the code, and the results were then analysed. From the graph it can clearly be observed that the speedup decreased in comparison to the earlier GPU code with 2 kernels, the reason being the increase in communication overhead due to that extra kernel, which also involved transferring the whole data to the GPU side before executing the kernel there. This simply adds to the total time. The graph has been plotted up to a resolution of 7000 on the faster machine and up to 5000 on the slower machine, as memory was insufficient to store the huge image arrays on the GPU side. The graph also shows that the speedup gradually decreases after a resolution of 6500 because of the extra added burden of the communication time of the 3 kernels.

Figure 7. Comparative study (3 kernels)

IX. CONCLUSION

This project work concludes that if parallelism is used optimally then one can achieve high speedups, even 1000 or 10000 times faster. The implementation contains both the sequential version and the parallel version. This

allows the reader to compare and contrast the performance differences between the two executions. The implementation used in this project achieves a maximum speedup of around 950 times; however, if one tries to maximize parallelism where it is not necessary, the result is a reduced speedup, due to the large increase in communication time, which is an overhead for performance. This project examines two separate speedups, one with 2 kernels and the other with 3 kernels, and the observation is that the speedup reduced by around 52% on both machines: a GeForce with 96 CUDA cores and a Tesla with 448 CUDA cores. So it is clear that proper and efficient use of CUDA programming, or of parallelism in general, can perform complex computations in much less time and provide higher speedups.
Future work will involve the edge detection of images with resolution greater than 10000x10000 with the help of image segmentation, since GPU memory becomes insufficient to store the large arrays of high-resolution images. Image segmentation is the process of dividing an image into multiple sub-images such that the result is a set of segments that collectively cover the entire image. With the image segmented into parts, instead of transferring the whole array to the GPU, the calculation can be done in parts and the results merged, which will help evaluate results for even higher-resolution images.
ACKNOWLEDGMENT
We express our sincere gratitude to our guide Dr. Meenu
Chawla and thank her for her guidance and support in
completing this paper.

REFERENCES
[1] Sobel, I., "An Isotropic 3x3 Gradient Operator", in Machine Vision for Three-Dimensional Scenes, Freeman, H. (ed.), Academic Press, NY, pp. 376-379, 1990.
[2] CUDA C Programming Guide, PG-02829-001 v7.0, pp. 9-10, 2015.
[3] http://www.geforce.com/hardware/notebook-gpus/geforce-gt-630m/specifications
[4] Michael Garland, "Parallel Computing Experiences with CUDA", in IPDPS 2010, pp. 13-27.
[5] Jayshree Ghorpade, Jitendra Parande, Madhura Kulkarni, Amit Bawaskar, "GPGPU Processing in CUDA Architecture", Advanced Computing: An International Journal (ACIJ), Vol. 3, No. 1, January 2012, pp. 105-120.
[6] Jin-Yu, Yan, Xiang, "Edge Detection of Images Based on Improved Sobel Operator and Genetic Algorithms", IEEE International Conference on Image Analysis and Signal Processing (IASP 2009), April 2009, pp. 31-35.
[7] CUDA C Programming Guide, PG-02829-001 v7.5, p. 11, 2015.
[8] R. C. Gonzalez and R. E. Woods, Digital Image Processing, Prentice Hall, New Jersey, 2008.
[9] Generating CUDA ptx files from Visual Studio [online]. Available: http://stackoverflow.com/questions/13426170/convert-cu-file-to-ptx-file-in-windows?rq=1
[10] Tinku Acharya and Ajay K. Ray, Image Processing: Principles and Applications, John Wiley & Sons, New Jersey, 2005.
[11] Cliff Woolley, CUDA Overview, NVIDIA Developer Technology Group, pp. 20-30.
[12] Cyril Zeller, CUDA C/C++ Basics, NVIDIA Corporation, Supercomputing Tutorial, pp. 9-11, 2011.

