Neural Network Implementation Using CUDA and OpenMP
$$m_j = \sum_i w_{ij}\, x_i + b_j \qquad (1)$$

$$r_j = \left(1 + e^{-m_j}\right)^{-1} \qquad (2)$$
In Eqs. (1) and (2), the subscript $j$ indexes the nodes in the current layer being calculated, $i$ indexes the nodes of the lower layer connected to the $j$th node, and $w_{ij}$ denotes the weight of the connection between the $i$th and $j$th nodes. $x_i$ is the value input to the $i$th node, $b_j$ is the bias term of the $j$th node, and $r_j$ is the output value of the $j$th node. This general operation is performed repeatedly from the first hidden layer to the output layer. Since other NNs are also calculated with the same general operations, an inner product followed by an activation function, this scheme can easily be applied to other NNs.
Moreover, since many inner-product operations can be replaced with a single matrix multiplication, the MLP is well suited to CUDA implementation. The computation per layer can then be written as follows:
$$V = \begin{vmatrix} v_{10} & v_{11} & \cdots & v_{1N} \\ v_{20} & v_{21} & \cdots & v_{2N} \\ & & \vdots & \\ v_{M0} & v_{M1} & \cdots & v_{MN} \end{vmatrix} = \begin{vmatrix} V_1 \\ V_2 \\ \vdots \\ V_M \end{vmatrix},$$

$$A = \begin{vmatrix} 1 & 1 & \cdots & 1 \\ x_{11} & x_{12} & \cdots & x_{1L} \\ & & \vdots & \\ x_{N1} & x_{N2} & \cdots & x_{NL} \end{vmatrix} = \begin{vmatrix} A_1 & A_2 & \cdots & A_L \end{vmatrix},$$

$$H = VA = \begin{vmatrix} V_1 A_1 & V_1 A_2 & \cdots & V_1 A_L \\ V_2 A_1 & V_2 A_2 & \cdots & V_2 A_L \\ & & \vdots & \\ V_M A_1 & V_M A_2 & \cdots & V_M A_L \end{vmatrix} = \begin{vmatrix} m_{11} & m_{12} & \cdots & m_{1L} \\ m_{21} & m_{22} & \cdots & m_{2L} \\ & & \vdots & \\ m_{M1} & m_{M2} & \cdots & m_{ML} \end{vmatrix},$$

$$R = \mathrm{sigmoid}(H) = \begin{vmatrix} (1 + e^{-m_{11}})^{-1} & (1 + e^{-m_{12}})^{-1} & \cdots & (1 + e^{-m_{1L}})^{-1} \\ (1 + e^{-m_{21}})^{-1} & (1 + e^{-m_{22}})^{-1} & \cdots & (1 + e^{-m_{2L}})^{-1} \\ & & \vdots & \\ (1 + e^{-m_{M1}})^{-1} & (1 + e^{-m_{M2}})^{-1} & \cdots & (1 + e^{-m_{ML}})^{-1} \end{vmatrix}$$
where $M$ is the number of nodes in the current layer, $N$ is the number of nodes in the lower layer, $L$ is the number of input vectors, and $x_{ij}$ is the $i$th feature value of the $j$th input vector. The result $R_{ij}$ is the output of the $i$th output node for the $j$th input vector. Here, the subscript 0 denotes the bias term, which allows the computation to be expressed as a single matrix multiplication without the separate summation term of Eq. (1).
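To make the per-layer computation concrete, the following is a minimal CPU sketch of one layer in this matrix form; the function name, row-major layout, and float types are illustrative assumptions, not taken from the paper.

#include <math.h>

/* Sketch of one layer (Eqs. (1)-(2) in matrix form), row-major layout.
   V: M x (N+1) weights, column 0 holding the biases v_{i0},
   A: (N+1) x L inputs, row 0 all ones,
   R: M x L sigmoid outputs. */
void ForwardLayerCPU(const float* V, const float* A, float* R,
                     int M, int N, int L)
{
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < L; ++j) {
            /* inner product V_i * A_j, Eq. (1) */
            float m_ij = 0.0f;
            for (int k = 0; k <= N; ++k)
                m_ij += V[i * (N + 1) + k] * A[k * L + j];
            /* sigmoid activation, Eq. (2) */
            R[i * L + j] = 1.0f / (1.0f + expf(-m_ij));
        }
    }
}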
When implementing the NN using CUDA, all of the input feature data cannot be transferred to the GPU at once, due to the limited memory of the GPU. Therefore, the proposed architecture divides the whole process into two parts. The first part prepares blocks of feature data sized to fit the memory of the GPU, and can also include a feature extraction step that extracts the features for the NN. However, this part is much slower than the computation performed on the GPU. Therefore, OpenMP is used to parallelize the first part, i.e. the feature data are prepared concurrently on the multi-core CPU. The second part computes the NN on the GPU using the feature data received from the CPU. With OpenMP, the computational times of the two parts become similar to each other, unlike the implementation without OpenMP. The efficiency of the proposed architecture is therefore achieved by using the CPU and GPU in parallel.
Fig. 3. Operations of NN using CUDA.
Fig. 3 shows the matrix multiplication and the computation of the sigmoid function using CUDA. CUDA can compute matrix multiplication efficiently by using shared memory: in general GPU operation, about 400-600 cycles are required to access global memory, whereas only about 4 cycles are required to access shared memory. Therefore, the shared memory of CUDA helps to compute the operations efficiently. The sigmoid function can be computed in parallel by allocating one thread per matrix element and performing the operation in each thread independently.
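As a rough illustration of this one-thread-per-element scheme, a sigmoid kernel might look like the following sketch; the kernel name and launch configuration are assumptions for illustration, not the paper's actual code.

// Hypothetical sketch: each thread applies the sigmoid to one matrix element.
__global__ void SigmoidKernel(float* data, int numElements)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < numElements)
        data[idx] = 1.0f / (1.0f + expf(-data[idx]));
}

// Launch with enough threads to cover all M x L elements:
// int threads = 256;
// int blocks  = (numElements + threads - 1) / threads;
// SigmoidKernel<<<blocks, threads>>>(deviceMatrix, numElements);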
4. Experimental Results
All experiments were carried out on an Intel Core 2 Quad Q6600 CPU (2.4 GHz) and GeForce 8800 GTX graphics hardware. OpenMP was used to process four sets of data in parallel on the CPU. We evaluated the proposed method through an NN-based text detection application: section 4.1 describes the text detection, and section 4.2 presents the result images and time complexity.
4.1. NN-based Text Detection
Recently, researchers have attempted text-based
retrieval of image and video data using several image
processing techniques [13]. As such, an automatic text
detection algorithm for image data and video
documents is important as a preprocessing stage for
optical character recognition, and an NN-based text
detection method has several advantages over other
methods [13].
Therefore, this subsection briefly describes the text detection method, and readers are referred to the authors' previous publication for more detail [13]. In the proposed architecture, an NN is used to classify the pixels of input images, whereby the feature extraction and pattern recognition stages are integrated in the NN. The NN examines local regions, looking for text pixels that may be contained in a text region. Therefore, 1) the gray values of the pixels at predefined positions inside an M×M window over an input frame are received as the input, and 2) a classified image is generated as the output. After the features pass through the network, the value of the output node is compared with a threshold value and the class of each pixel is determined, resulting in a classified image. Experiments were conducted using an 11×11 input window, with the number of nodes in the hidden layer set at 30.
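Under this configuration, and assuming a single output node as described above, the first-layer weight matrix is 30 × 122 (121 gray values from the 11×11 window plus one bias column) and the output-layer weight matrix is 1 × 31 (30 hidden outputs plus one bias), so each layer reduces to one matrix multiplication as in the formulation above.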
In the proposed architecture, the first step described above was performed on the multi-core CPU using OpenMP for parallel implementation, and Fig. 4 shows the pseudo code for this step. The second step was performed on the GPU, and Fig. 5 shows its pseudo code. In Fig. 4, two threads are allocated to perform the first step, so the pseudo code includes two #pragma omp section directives, one per thread.
for (every pixel block of the image)
{
    // Parallel implementation using OpenMP
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            // pixel check in window range, first data block
            GetConfigMatrix(cpuData1);
        }
        #pragma omp section
        {
            // pixel check in window range, second data block
            GetConfigMatrix(cpuData2);
        }
    }
    // calculate the neural net on the GPU using CUDA
    ForwardCUDA(cpuData1, outputCUDAData);
    SaveOutputData(outputCUDAData);
    ForwardCUDA(cpuData2, outputCUDAData);
    SaveOutputData(outputCUDAData);
}
Fig. 4. Pseudo code for OpenMP performed on
multi-core CPU.
// memory copy from CPU to GPU
cublasSetMatrix(CPUData, CUDAData);
// Result0 = Weight0 * CUDAData
// matrix multiplication of the first layer
cublasSgemm(Weight0, CUDAData, Result0);
// sigmoid calculation of the first layer
Sigmoid(Result0);
// Result1 = Weight1 * Result0
// matrix multiplication of the second layer
cublasSgemm(Weight1, Result0, Result1);
// sigmoid calculation of the second layer
Sigmoid(Result1);
// memory copy from GPU to CPU
cublasGetMatrix(Result1, outputCPUData);
Fig. 5. Pseudo code for NN performed on GPU.
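For reference, the cuBLAS calls above are abbreviated; with the modern cuBLAS v2 API (which postdates the paper), the first-layer multiplication might look like the following sketch, where the handle setup, function name, and dimension names are illustrative assumptions.

#include <cublas_v2.h>

// Hypothetical sketch of the first-layer product Result0 = Weight0 * CUDAData.
// cuBLAS assumes column-major storage; M = hidden nodes, N+1 = inputs
// including the bias row, L = number of input vectors in the block.
void FirstLayer(cublasHandle_t handle,
                const float* dWeight0, // M x (N+1)
                const float* dInput,   // (N+1) x L
                float* dResult0,       // M x L
                int M, int N, int L)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                M, L, N + 1,
                &alpha, dWeight0, M,
                dInput, N + 1,
                &beta, dResult0, M);
}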
4.2. Result of Text Detection
Fig. 6 shows the result images for three image sizes: (a,b) 320×240, (c,d) 571×785, and (e,f) 1152×15466. Figs. 6(b,d,f) show the pixel classification results for the input images in Figs. 6(a,c,e), where a black pixel denotes a text pixel. The classification using the GPU produced almost the same results as that without the GPU.
Fig. 6. Result images: (a,c,e) input images, and
(b,d,f) result images.
Fig. 7 shows the computational times for the images in Fig. 6, where the x-axis indicates the image size and the y-axis indicates the computational time (sec). As shown in Fig. 7, the proposed architecture was about 20 times faster than the CPU-only implementation and about 5 times faster than the GPU-only implementation, and the computational times for pixel classification were significantly reduced by the proposed method. The proposed method was faster than the GPU-only implementation because the time needed to generate the data to be processed on the GPU was reduced on the multi-core CPU using OpenMP. Therefore, we analyzed the computational times when using OpenMP.
Fig. 8 shows the effectiveness of using OpenMP, where the y-axis indicates the computational time (msec). If only CUDA without OpenMP were used to implement the NN, there would be a large difference between the computational times on the CPU and the GPU, i.e. the GPU computation is about 8 times faster than the CPU computation. The performance of the GPU is maximized by accumulating a large number of input vectors, the number depending on the GPU configuration, so the CPU should generate as many input vectors as possible to be processed by the GPU in one step. OpenMP reduced the computational time spent on the CPU, and thus reduced the difference between the two computational times. Consequently, OpenMP helped to reduce the bottleneck between the CPU and GPU.
Fig. 7. Computational times of the three architectures.

Fig. 8. Difference in computational times with and without OpenMP.
5. Conclusions
This paper proposed a faster and more efficient multi-threaded implementation of a neural network on both commodity graphics hardware and a multi-core CPU. CUDA was used because CUDA code is written in a C-like style and has fewer computational restrictions, whereas traditional GPUs could be programmed only through a graphics API, which requires specialized knowledge of computer graphics. Moreover, OpenMP, which allows two or more data sets to be processed concurrently on the multi-core CPU while only one is processed on the GPU, was used to minimize the difference between the two computational times on a single piece of graphics hardware. Based on the proposed architecture, we implemented a neural network in which feature extraction is processed on the multi-core CPU and the main NN operations, consisting of inner products and an activation function, are processed with CUDA. The experiments evaluated the proposed implementation through NN-based text detection, and showed faster computational times on the proposed architecture than on CUDA or the CPU alone.
Acknowledgment: This work was supported by grant No. R01-2006-000-11214-0 from the Basic Research Program of the Korea Science and Engineering Foundation.
6. References
[1] K.-S. Oh and K. Jung. GPU Implementation of Neural Networks, Pattern Recognition, Vol. 37, Issue 6, pp. 1311-1314, 2004.
[2] K. Moreland and E. Angel. The FFT on a GPU, Proceedings of SIGGRAPH Conference on Graphics Hardware, pp. 112-119, 2003.
[3] J. Mairal, R. Keriven, and A. Chariot. Fast and Efficient Dense Variational Stereo on GPU, Proceedings of International Symposium on 3D Data Processing, Visualization, and Transmission, pp. 97-704, 2006.
[4] R. Yang and G. Welch. Fast Image Segmentation and Smoothing using Commodity Graphics Hardware, Journal of Graphics Tools, Vol. 7, Issue 4, pp. 91-100, 2002.
[5] J. Fung and S. Mann. OpenVIDIA: Parallel GPU Computer Vision, Proceedings of ACM International Conference on Multimedia, pp. 849-852, 2005.
[6] http://ati.amd.com/developer/
[7] http://developer.nvidia.com/object/cg_toolkit.html/
[8] http://www.opengl.org/documentation/glsl/
[9] http://graphics.stanford.edu/projects/brookgpu/
[10] http://www.nvidia.com/object/cuda_home.html/
[11] J. Fung and S. Mann. OpenVIDIA: Parallel GPU Computer Vision, Proceedings of ACM International Conference on Multimedia, pp. 849-852, 2005.
[12] http://www.openmp.org/
[13] K. Jung. Neural Network-based Text Localization in Color Images, Pattern Recognition Letters, Vol. 22, Issue 4, pp. 1503-1515, 2001.