I. INTRODUCTION
Recent technological advancements in the human-computer interaction field have shown that conventional tools, such as the keyboard, mouse, and light pen, do not provide a natural form of interaction. Even though those tools have been the standard forms of input for many decades, the ubiquity of digital systems has revealed the urgent need for a more accessible interaction method that can be used by anyone, regardless of educational background. Since the hand has always been the natural means of interaction among humans, a recent resurgence in the development of new hand modeling techniques has been observed. Regardless of the technique used, the main goals have always been the same: using descriptive gestures while keeping the computer processing and modeling as simple as possible.

The system can be applied to various backgrounds, changeable lighting of the environment, and different skin colors. To achieve that, we construct a simple representation of the human hand (the hand-skeleton) by applying edge detection and a thinning algorithm (Fig. 1) to the input image; we then define a gesture for each English character [1].

Fig. 1. The left image shows a human hand after applying the edge detection algorithm, while the right image shows the branches produced after applying the thinning algorithm to that image.

Recently, complex-valued data have been used in many applications, such as array signal processing [2], radar and magnetic resonance data processing [3], [4], communication systems [5], signal representation in the complex baseband [6], and processing data in the frequency domain [3].

In our approach, complex-valued data that represent a hand gesture can be obtained by applying a sequence of filters to the image captured with the Kinect camera [7]. We use a three-layer complex-valued neural network (CVNN) and the Complex Levenberg-Marquardt (CLM) algorithm [8] for training, owing to the nature of the data that can be collected from the generated hand-skeleton representation. We investigate the recognition performance with respect to various activation functions in the hidden layer. The output layer, however, uses a recently proposed activation function [11] that helps an output neuron behave like a discriminative function.

The remainder of the paper is organized as follows. Section II discusses the procedures used to generate the hand-skeleton structure. Section III introduces the CLM algorithm and various activation functions. Computer simulation results are discussed in Section IV. Finally, concluding remarks are given in Section V.

II. PROCEDURES
In this research, we utilized the camera of a Microsoft Kinect motion-sensing input device [7], accompanied by the OpenCV platform [9], which provides the computational capabilities required for real-time image acquisition and handling. Fig. 2 shows a modular view of the final system. The image acquired by the Video Input Module is passed to both the Hand Location Module and the Image Processing Module. While the Hand Location Module is responsible for detecting the location of the hand within the image, the Image Processing Module processes the area previously detected by the Hand Location Module. The output of the Image Processing Module is then passed to the Hand-Skeleton Construction Module, which is responsible for creating a skeleton model from the image of the hand. This module has two outputs: the skeleton model of the hand, and supplementary data passed back to the Hand Location Module to increase the location detection accuracy. The skeleton model is then passed to the CLM, where the actual recognition takes place. The following subsections describe each stage of the system in detail.

Fig. 2. Human hand gesture recognition system, showing its modules and the connections between them.

A. Tracking and Detection

The first step involves separating the image of the hand from the rest of the image. To do that, we used the Kinect's depth map to wipe out the background of the image. As we can see in Fig. 3, only the silhouette of the human body was extracted from the image, discarding any other unneeded objects.

Fig. 3. Simple illustration of background deletion: the image on the left is the source image, the image in the middle is the depth map provided by the Kinect's camera, and the image on the right is the result after deleting the background.
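The depth-based background removal can be sketched as follows. This is a minimal Python illustration with OpenCV and NumPy, assuming a color frame and an aligned Kinect depth map; the 400-1200 mm foreground range and the helper name remove_background are illustrative choices, not values from the paper.

    import cv2
    import numpy as np

    def remove_background(color_bgr, depth_mm, near=400, far=1200):
        # Keep only pixels whose depth lies inside [near, far] millimetres;
        # the user standing in front of the Kinect becomes the foreground.
        mask = cv2.inRange(depth_mm, near, far)
        # Median filtering suppresses speckle noise in the depth mask.
        mask = cv2.medianBlur(mask, 5)
        # Zero out everything outside the foreground mask.
        return cv2.bitwise_and(color_bgr, color_bgr, mask=mask)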
Next, we removed the regions that do not have the color of human skin. The resulting image contains the locations of the human hand and face. We then used the HSL representation of color to identify the color of human skin; it is known that the HSL representation identifies the color of human skin more accurately than the RGB representation [12]. The last step involved separating the image of the hand from the image of the face (Fig. 4).

Fig. 4. The result after skin detection, showing the human hand in two states: on the left, the fingers are together, and on the right, the fingers are apart.
Although the hues of the hand's color and the face's color are different, the difference is too small to be considered reliable by itself. Accordingly, we had to support it with another source of information. When the system is initialized, it relies on the motion of the hand to distinguish it from the face; afterwards, the system keeps track of the hand's location using the feedback from the hand-skeleton construction part of the system.
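The paper does not spell out how the motion cue is evaluated; one plausible reading, sketched below with simple frame differencing, is to pick the skin-colored region that moves the most when the system starts. The threshold of 20 and the helper name moving_skin_region are assumptions for illustration only.

    import cv2
    import numpy as np

    def moving_skin_region(prev_gray, curr_gray, skin):
        # Difference of two consecutive grayscale frames highlights motion.
        motion = cv2.absdiff(curr_gray, prev_gray)
        _, motion = cv2.threshold(motion, 20, 255, cv2.THRESH_BINARY)
        # Keep only the motion that falls on skin-colored pixels.
        moving_skin = cv2.bitwise_and(motion, skin)
        # The largest moving skin component is taken to be the hand.
        n, labels, stats, _ = cv2.connectedComponentsWithStats(moving_skin)
        if n <= 1:
            return None
        largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
        x, y, w, h = stats[largest, :4]
        return (x, y, w, h)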
From Fig. 4, we can notice that when the fingers are close to each other, we might lose some information about the state of each finger. To compensate for that problem, we used a sequence of image processing algorithms to aid the correct recognition of the fingers' states, as described in the next section.

B. Image Processing

After locating the human hand in the image, the system filters the region where the hand is located, as shown in Fig. 5.

Fig. 5. The image filters applied in real time to the hand region, producing connected branches of lines that represent the fingers of the human hand.

First, the system applies the Sobel edge detection algorithm [10] to obtain a contour of the hand. This filter scans the image for sharp contrast differences and assigns a white color shade proportional to the contrast in that region.
Next, by restricting the whiteness to a specific threshold, the system deletes any noisy edges, effectively creating a sharper edge representation. However, that step can produce disconnected regions in the edges of the hand, affecting the outcome of the thinning algorithm. To avoid that drawback, we used a dilation algorithm, resulting in a fully connected figure.

The dilated image is then passed to a thinning algorithm [8]. This algorithm generates one-pixel-wide lines representing the branching of the structure. The outcome of that step is a representation of the hand as interconnected lines meeting at multiple nodes.
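Under the assumption that a standard thinning implementation can stand in for the one cited in the paper, the whole filter chain of this subsection can be sketched as below; cv2.ximgproc.thinning requires the opencv-contrib package, and the threshold value, kernel size, and iteration count are illustrative.

    import cv2
    import numpy as np

    def hand_skeleton_image(hand_gray):
        # Sobel responses in x and y give the edge (contrast) strength.
        gx = cv2.Sobel(hand_gray, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(hand_gray, cv2.CV_32F, 0, 1, ksize=3)
        edges = cv2.convertScaleAbs(cv2.magnitude(gx, gy))
        # Discard weak, noisy edges by thresholding the edge strength.
        _, edges = cv2.threshold(edges, 60, 255, cv2.THRESH_BINARY)
        # Dilation reconnects fragments so the contour forms one figure.
        edges = cv2.dilate(edges, np.ones((3, 3), np.uint8), iterations=2)
        # Thinning reduces the connected figure to one-pixel-wide branches.
        return cv2.ximgproc.thinning(edges)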
The final step involves reading that representation. The system creates a pair of data for each branch by tracing the line that connects two nodes, calculating the length and the angle of that line. The calculation method for the length of the line connecting two nodes and its relative angle is shown in Fig. 6.

Fig. 6. [Diagram illustrating the length and relative angle (θ1-θ8) of the hand-skeleton branches for a "Two fingers" and a "Four fingers" gesture.]
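Combining this description with the representation x_i = r_i e^{jθ_i} given in the next section, each traced branch can be turned into one complex-valued feature. The sketch below assumes the two end nodes of a branch are already known, and it measures the angle from the image x-axis rather than the relative angles of Fig. 6.

    import numpy as np

    def branch_feature(node_a, node_b):
        # node_a, node_b: (x, y) pixel coordinates of the two nodes that
        # the traced branch connects.
        dx = node_b[0] - node_a[0]
        dy = node_b[1] - node_a[1]
        r = np.hypot(dx, dy)        # branch length
        theta = np.arctan2(dy, dx)  # branch orientation
        return r * np.exp(1j * theta)

    # Example feature vector built from (dummy) node pairs of one skeleton.
    pairs = [((0, 0), (10, 0)), ((0, 0), (7, 7))]
    x = np.array([branch_feature(a, b) for a, b in pairs])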
For recognizing the English characters, we defined distinguishable hand gestures to represent each character [1]. These gestures have been chosen so that it is easier for the system to recognize them. Consideration was also given to the human's natural ability to move from one gesture to another. Our algorithm detects the edges between the fingers even when the fingers are held together. This allowed us to design a simple representation for each character, as shown in Fig. 7.

Fig. 7. Hand gestures for each English character; the gestures differ in the number of fingers and the angles they make with each other.

Each hand skeleton is represented by a feature vector of eight complex-valued elements. Each element x_i, 1 ≤ i ≤ 8, is a Cartesian representation of one segment of the hand skeleton, i.e., x_i = r_i e^{jθ_i} = r_i cos(θ_i) + j r_i sin(θ_i), where j = √−1. To classify the patterns represented by complex-valued feature vectors, we apply a feedforward complex-valued neural network (CVNN) with one hidden layer. The output layer uses an activation function proposed in [11], which can act as a discriminating function giving a discriminating score; we call this function discrim here. The function has the following form:

f_{C→R}(z) = (f_R(u) − f_R(v))^2    (1)

where z = u + jv denotes the weighted sum of the input signals along with the bias, called the net-input, and f_R(·) is a real-valued log-sigmoid function. The hidden layer, however, may take any activation function found in the CVNN literature. Recently, the CLM algorithm has been proposed in [8] as a fast learning algorithm for feedforward complex-valued neural networks. The CVNN in this study is trained with the CLM algorithm because of its faster convergence. Since we have a total of 26 different gestures, the output layer has 26 neurons, each representing one gesture.

In order to see the effect of the hidden layer activation functions on the hand gesture recognition problem, we investigate a number of complex activation functions, listed below.
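The forward pass of the network described here can be written in a few lines of NumPy: eight complex inputs, a hidden layer with a split-type activation (splitTanh from Table I), and 26 real-valued outputs computed with the discrim function of Eq. (1). This is only a sketch; the weights below are random placeholders, whereas in the paper they are learned with the CLM algorithm [8], and the final argmax decision rule is an assumption for illustration.

    import numpy as np

    def split_tanh(z):
        # Split-type activation: tanh applied to real and imaginary parts.
        return np.tanh(z.real) + 1j * np.tanh(z.imag)

    def discrim(z):
        # Eq. (1): (f_R(u) - f_R(v))^2 with f_R the log-sigmoid, z = u + jv.
        sig = lambda t: 1.0 / (1.0 + np.exp(-t))
        return (sig(z.real) - sig(z.imag)) ** 2

    def cvnn_forward(x, W1, b1, W2, b2):
        h = split_tanh(W1 @ x + b1)   # hidden layer (complex-valued)
        return discrim(W2 @ h + b2)   # 26 real scores, one per gesture

    rng = np.random.default_rng(0)
    n_in, n_hidden, n_out = 8, 4, 26  # 4 hidden neurons: the count reported as optimal (Fig. 8)
    W1 = rng.standard_normal((n_hidden, n_in)) + 1j * rng.standard_normal((n_hidden, n_in))
    b1 = rng.standard_normal(n_hidden) + 1j * rng.standard_normal(n_hidden)
    W2 = rng.standard_normal((n_out, n_hidden)) + 1j * rng.standard_normal((n_out, n_hidden))
    b2 = rng.standard_normal(n_out) + 1j * rng.standard_normal(n_out)
    x = rng.standard_normal(n_in) + 1j * rng.standard_normal(n_in)
    gesture = int(np.argmax(cvnn_forward(x, W1, b1, W2, b2)))  # assumed decision rule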
We presented all the input data to the CLM and computed the outputs and the validation error. From Fig. 8 we can notice that the optimal number of neurons in the hidden layer was 4.

Fig. 8. Validation error for different activation functions and numbers of neurons in the hidden layer.

The learning process was terminated when a stopping criterion was met, such as the validation error increasing rather than decreasing.

[Figure: training error (MSE) versus iteration for the asin, George, discrim, and linear activation functions.]

Table I shows the classification error for the different activation functions, sorted from the smallest value. From the table we can notice that the split-type activation functions performed better for this problem.

TABLE I
CLASSIFICATION ERROR FOR DIFFERENT ACTIVATION FUNCTIONS

Activation Function    Classification Error (%)
splitTanh [14]          8.46
splitSigm [14]         11.29
linear [15]            13.59
asin [15]              15.65
tan [15]               16.41
discrim [11]           17.18
acos [15]              17.69
tanh [15]              17.95
George [13]            19.49
sin [15]               20.00
asinh [15]             20.26
sinh [15]              21.54
atan [15]              23.85
atanh [15]             28.97

V. CONCLUSION

In this paper, the CLM algorithm has been used in a hand gesture recognition system to distinguish 26 different gestures (the English alphabet). By using the Kinect depth map and the human skin color, we could isolate the human hand from the rest of the image; we then used a sequence of image filters to generate a descriptive representation of the human hand, which we call the "Hand-Skeleton". This representation allows us to use the CLM algorithm for the learning and recognition stages. The results show that the CLM algorithm with split-type activation functions achieves the highest recognition performance.
ACKNOWLEDGMENT

This study was supported by grants to K.M. from the Japanese Society for Promotion of Sciences and Technology, and the University of Fukui.
REFERENCES

[1] A. Hafiz, Md.F. Amin, and K. Murase, Real-Time Hand Gesture Recognition Using Complex-Valued Neural Network (CVNN), 2011 International Conference on Neural Information Processing (ICONIP 2011), Shanghai, China, Nov. 2011.
[2] H.L. van Trees, Optimum Array Processing, Wiley Interscience, New York, 2002.
[3] A. Hirose, An improved parallel thinning algorithm, Springer, Heidelberg, 2006.
[4] V.D. Calhoun, T. Adali, G.D. Pearlson, P.C.M. van Zijl, and J.J. Pekar, Independent component analysis of fMRI data in the complex domain, Magnetic Resonance in Medicine, vol. 48, no. 1, pp. 180-192, 2002.
[5] G.L. Stuber, Principles of Mobile Communication, Kluwer, Boston, 2001.
[6] C.W. Helstrom, Elements of Signal Detection and Estimation, Prentice Hall, New Jersey, 1995.
[7] J. Shotton and T. Sharp, Real-Time Human Pose Recognition in Parts from Single Depth Images, Microsoft Research, Cambridge, 2011.
[8] Md.F. Amin, I.A. Muhammad, Y.A.N. Ahmed, and K. Murase, Wirtinger Calculus Based Gradient Descent and Levenberg-Marquardt Learning Algorithms in Complex-Valued Neural Networks, 2011 International Conference on Neural Information Processing (ICONIP 2011), Shanghai, China, Nov. 2011.
[9] Intel Corporation, Open Source Computer Vision Library, Reference Manual, 1999-2001. Available: www.developer.intel.com
[10] I. Sobel and G. Feldman, A 3x3 Isotropic Gradient Operator for Image Processing, presented at the Stanford Artificial Intelligence Project, 1968 (unpublished but often cited).
[11] Md.F. Amin and K. Murase, A Single-Layered Complex-Valued Neural Network for Real-Valued Classification Problems, Neurocomputing, vol. 72, pp. 945-955, 2009.
[12] J.B. Martinkauppi, M.N. Soriano, and M.H. Laaksonen, Behavior of skin color under varying illumination seen by different cameras at different color spaces, in Proc. SPIE, Machine Vision Applications in Industrial Inspection IX, San Jose, CA, Jan. 2001, pp. 102-113.
[13] G.M. Georgiou and C. Koutsougeras, Complex domain backpropagation, IEEE Transactions on Circuits and Systems II, vol. 39, no. 5, pp. 330-334, 1992.
[14] A. Hirose, Complex-Valued Neural Networks, Springer, 2006.
[15] T. Kim and T. Adali, Approximation by fully complex multilayer perceptrons, Neural Computation, vol. 15, no. 7, pp. 1641-1666, 2003.