Paper 5
https://doi.org/10.1007/s42835-021-00972-6
ORIGINAL ARTICLE
Received: 14 September 2021 / Revised: 21 November 2021 / Accepted: 23 November 2021 / Published online: 21 January 2022
© The Korean Institute of Electrical Engineers 2021
Abstract
The use of gesture control has numerous advantages compared to the use of physical hardware. However, it has yet to gain popularity, as most gesture control systems require extra sensors or depth cameras to detect or capture the movement of gestures before a meaningful signal can be triggered for a corresponding course of action. This research proposes a hand gesture control system that uses an object detection algorithm, YOLOv3, combined with handcrafted rules to achieve dynamic gesture control of a computer. The system uses a single RGB camera for hand gesture recognition and localization. The dataset of all gestures used for training, and their corresponding commands, was custom designed by the authors due to the lack of standard gestures specifically for human–computer interaction. Algorithms to integrate gesture commands with virtual mouse and keyboard input through the Pynput library in Python were developed to handle commands such as mouse control, media control, and others. The YOLOv3 model obtained an mAP of 96.68% on the test results. Rule-based algorithms for gesture interpretation were successfully implemented to transform static gesture recognition into dynamic gesture control.
Keywords Hand gesture · Human–computer interaction · Deep learning · Object detection
convolutional neural networks (CNNs), many researchers began adopting CNNs, which can be trained to extract important features on their own, effectively removing the need for feature engineering while increasing accuracy and recognition speed.
Most of the prominent modern research on vision-based gesture recognition is based on video classification [10, 21]. However, the video classification method is restricted to one-off classification, and each gesture can only execute a one-time command rather than continuous command input. For example, users cannot drag and drop a file or folder in a computer's GUI, since the user does not have full control over the command in terms of time and distance. Therefore, this paper proposes a rule-based algorithm that, using only a normal RGB camera with a YOLOv3 object detector, recognizes gestures and interprets sequences and movements of static gestures as dynamic gestures, issuing computer commands through virtual keyboard and mouse input. In this paper, the YOLOv3 object detector is chosen to classify and localize gestures. Several techniques are proposed for gesture control using the gesture class and its location.
2 Related Work

Before CNNs were heavily used in vision-based gesture recognition, there was a large body of research on gesture recognition using feature engineering [18, 31]. However, researchers began to shift their interest to CNNs for gesture recognition due to their superior performance. Furthermore, with the commercialization of depth (RGB-D) sensors such as the Microsoft Kinect, much research on gesture or action recognition began to take advantage of them due to their robustness against illumination variations and the abundant 3D structural information they provide.
Most of the research in hand gesture recognition focuses on the video classification problem to classify dynamic gestures. Wan et al. [30] summarized current methods utilizing RGB-D sensors, which can be divided into two main categories: isolated and continuous gesture recognition. Unlike isolated gesture recognition, continuous gesture recognition is harder, as the system needs to recognize more than one gesture in a video. This is challenging because the detector should recognize the start and end of each gesture in the video by itself [21]. There are several strategies for this problem. For example, Chai et al. [10] assume all gestures start and end with the performer's hands down, so that the system knows when a gesture is performed. Meanwhile, Camgoz et al. [8] treat the segmentation process itself as something to be learned, while Köpüklü et al. [21] proposed a hierarchical structure to ensure single-time activation for each performed gesture. However, video classification has limitations for some aspects of HCI, such as mouse cursor control, where continuous input of the gesture class and its location is needed instead of single-time command activation. Therefore, this work proposes to use static gesture recognition that can output the gesture class and its location continuously.

For static gesture recognition, [17] used a CNN to recognize gestures and achieved 97.12% accuracy. They claimed that data augmentation of the dataset, such as rescaling, zooming, shearing, rotation, and width and height shifting, increased their accuracy by 4%. A similar approach was undertaken by [14] to recognize 24 static hand gestures from the Peruvian sign language alphabet. Kim et al. [20] proposed the use of the You-Only-Look-Once (YOLO) object detection network and concluded that using YOLO with ROI segmentation achieved higher accuracy while accelerating training. Meanwhile, Ni et al. [24] proposed a Light YOLO model that improved the accuracy, speed, and model size of the YOLOv2 model for gesture recognition. In Bai et al. [4], a modified Single Shot Multibox Detector (SSD) network is adopted for skeleton-based gesture recognition. In this work, the well-established YOLOv3 object detector was chosen as the gesture detector, as it is one of the fastest state-of-the-art algorithms with high accuracy.

There are several publicly available hand gesture recognition datasets. For example, one of the largest is the ChaLearn dataset, which contains both isolated and continuous gestures. However, the gestures in that dataset are derived from Italian sign language and may not be universally suitable for interaction with computers. Meanwhile, the nvGesture dataset [23] is designed for in-car automotive devices, and the EgoGesture dataset [32] is captured from the egocentric view. Given their limitations with respect to the objective of this project, a custom gesture dataset was designed for the proposed gesture control system.
3 Methodology

The proposed gesture recognition and control system involves two main steps: gesture detection using the YOLOv3 object detector, followed by a rule-based gesture interpreter. In this project, the camera used to capture and recognize gestures was a Logitech C922 webcam. The computer was an ASUS GL552VW laptop with an Nvidia GTX960m graphics card, which has a compute capability of 5.0 and 2048 MB of GDDR5 memory. The system was designed for the Windows 10 operating system.

3.1 Defining Gestures and Corresponding Commands

The system utilizes image classification and localization, combined with handcrafted rules, to activate control
Table 1 Defined gestures and corresponding commands

No. | Control | Left hand | Right hand
1 | Close active application window | None | Sequential gesture (Gesture 1: Pre-flick; Gesture 2: Post-flick)
2 | Scroll up/down | None | Circular movement (Gesture: Two-fingers)
3 | Scroll horizontally | Static movement (Gesture: Fist) | Horizontal movement (Gesture: Two-fingers)
4 | Zoom in/out | Static movement (Gesture: Two-fingers) | Circular movement (Gesture: Two-fingers)
5 | Mouse control | Static movement (Gesture: 1. Palm: absolute cursor movement; 2. Fist: relational cursor movement) | Sequential gesture (Gesture 1: Pre-pinch; Gesture 2: Post-pinch)
6 | Mute | None | Sequential gesture (Gesture 1: Palm; Gesture 2: Fist)
7 | Media play or pause | Sequential gesture (Gesture 1: Palm; Gesture 2: Fist) | None
8 | Adjust volume | Static movement (Gesture: Palm) | Vertical movement (Gesture: Two-fingers)
9 | Show desktop | Static movement (Gesture: Fist) | Sequential gesture (Gesture 1: Palm; Gesture 2: Fist)
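Table 1 can be read as a rule table from detected gesture pairs to commands. As a purely illustrative sketch (the class labels, pair encoding, and handler names below are assumptions, not the authors' published code), such a mapping could be kept in a Python dictionary and queried once per frame:

```python
from typing import Optional

# Hypothetical encoding of a few Table 1 rules: the key is the pair of
# detected left- and right-hand gesture states, the value is the command.
RULES = {
    ("none",        "pre_flick->post_flick"):  "close_active_window",   # row 1
    ("fist",        "two_fingers_horizontal"): "scroll_horizontally",   # row 3
    ("two_fingers", "two_fingers_circular"):   "zoom",                  # row 4
    ("palm",        "two_fingers_vertical"):   "adjust_volume",         # row 8
    ("fist",        "palm->fist"):             "show_desktop",          # row 9
}

def lookup_command(left_state: str, right_state: str) -> Optional[str]:
    """Return the command for the current gesture pair, or None."""
    return RULES.get((left_state, right_state))
```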
Table data rows (column headings not captured in the extraction):
1: 100, 70, 15
2: 100, 100, 15
3: 100, 130, 15
4: 115, 70, 15
5: 115, 100, 15
6: 115, 130, 15
7: 130, 70, 15
8: 130, 100, 15
9: 130, 130, 15
10: 100, 30, 15
11: 130, 30, 15
12: 120, 80, 35
1920 × 1080 pixels computer screen. The blue frame represents the output video (640 × 840) of the YOLO detector, while the red control frame (240 × 135) is a virtual frame in which the position of the hand gesture controls the position of the cursor. The control frame is kept significantly smaller so that the edges of the computer screen can be reached with smaller hand movements.

The YOLO network outputs two coordinates that form a bounding box around a hand gesture, namely the top-left and bottom-right vertices of the box. To smooth the control, the centre coordinate is used, because using the corner coordinates may cause undesired fluctuations. The centre is the average of the two vertices:

C_x,y = ((C_x1 + C_x2) / 2, (C_y1 + C_y2) / 2)  (6)
Then, with the cursor position coordinate P and the hand centre coordinate C from Eq. (6),

P = (C − Cl) × i  (7)

where Cl is the top and left clearance distance of the control frame from the screen and i is the scale coefficient from the hand coordinate to the cursor position. For this project, the clearance is set to Clx = 300 and Cly = 180, as shown in Fig. 4. Therefore, with a 1080p screen, i = 8. The result of this configuration is that the control frame sits slightly to the right of the camera frame, because only the right hand is used to control the cursor position. In addition, this function is used with other gesture commands that require hand position input, such as scrolling and volume adjustment.
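To illustrate Eqs. (6) and (7), the sketch below maps a detected bounding box to an on-screen cursor position with the pynput mouse controller. The clearance and scale values follow the numbers quoted above; the function names and detection format are assumptions, since the authors' implementation is not published.

```python
from pynput.mouse import Controller

mouse = Controller()

CL_X, CL_Y = 300, 180   # clearance of the control frame (Clx, Cly)
SCALE = 8               # scale coefficient i for a 1920 x 1080 screen

def bbox_centre(x1, y1, x2, y2):
    # Eq. (6): centre of the detected bounding box.
    return (x1 + x2) / 2, (y1 + y2) / 2

def move_cursor_absolute(x1, y1, x2, y2):
    # Eq. (7): subtract the clearance, then scale up to screen coordinates.
    cx, cy = bbox_centre(x1, y1, x2, y2)
    px = (cx - CL_X) * SCALE
    py = (cy - CL_Y) * SCALE
    # Clamp so the cursor stays on the screen.
    mouse.position = (max(0, min(1919, int(px))), max(0, min(1079, int(py))))
```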
3.5.1.2 Relative Positioning Cursor positioning based on the absolute coordinate of the hand is not precise and can be hard to control due to the inconsistent bounding box. To solve this problem, a relational cursor movement approach is proposed to reduce the cursor movement. This mode is activated by changing the left-hand modifier gesture to a fist (see Table 1). The relational approach moves the cursor according to the distance moved by the hand. Using the hand coordinate output from the previous frame, the vector distance M by which to move the cursor can be calculated as

M = (Cc − Cp) / s  (8)

where Cc is the current hand coordinate output, Cp is the previous hand coordinate output, and s is the sensitivity index. The selected s in this project is 3. Comparing Eqs. (7) and (8), the cursor moves 24 times more slowly using relational cursor positioning; hence, the cursor can be controlled more precisely.
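A matching sketch for Eq. (8); the previous-frame centre would have to be stored between detections, and the helper name is again an assumption.

```python
from pynput.mouse import Controller

mouse = Controller()
SENSITIVITY = 3        # sensitivity index s quoted in the text
prev_centre = None     # hand centre from the previous frame

def move_cursor_relative(cx, cy):
    """Eq. (8): move the cursor by the scaled displacement of the hand."""
    global prev_centre
    if prev_centre is not None:
        dx = (cx - prev_centre[0]) / SENSITIVITY
        dy = (cy - prev_centre[1]) / SENSITIVITY
        mouse.move(int(dx), int(dy))   # relative move provided by pynput
    prev_centre = (cx, cy)
```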
3.5.2 Scrolling Algorithm and Related Commands

One simple technique proposed for scrolling uses a dragging-like concept, as if operating a touchscreen device in mid-air. This technique is used for scrolling left and right. When the "right two-fingers" gesture is detected, the distance moved by the hand is translated into the amount of horizontal scrolling. The horizontal distance S is calculated by subtracting the previous horizontal hand coordinate Cp from the current coordinate Cc:

S = Cc − Cp  (9)

The S variable is stored in the system, and when a threshold is reached, the scrolling command is activated and S is reset to 0 to repeat the process. The chosen threshold is 80 for scrolling left and −80 for scrolling right. This method is also used in the volume adjustment command, where moving the gesture up and down adjusts the volume. This approach is straightforward and very user friendly. However, when a large amount of scrolling is needed, it becomes inefficient, because the system cannot distinguish which hand movements should be registered as scrolling and which should not when the user intends to return the hand to its starting position for more scrolling. Therefore, another method is proposed for vertical scrolling.

The vertical scrolling command is activated when the user performs the two-finger gesture with the right hand in a circular movement. However, an algorithm is needed, since the computer cannot detect the circular movement directly. A method is proposed to detect circular rotation indirectly by registering directional changes of the hand position. The method compares the latest hand gesture coordinate with the previous coordinate to determine the movement direction of the hand. Each directional change corresponds to one circular movement, so a scrolling command can be issued. For example, a movement going left and down that changes to left and up corresponds to a clockwise rotation (Fig. 5); therefore, the scroll-down command is activated.

Fig. 5 Illustration of scrolling command activation for clockwise rotation

The location of the predicted hand can fluctuate over time, which may register unintended movement.
A threshold of movement distance can easily be set for the scrolling command to be activated, eliminating the effect of the fluctuating hand coordinate. The threshold used in this project is 80.
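The two scrolling rules above can be sketched as follows. The accumulation of S (Eq. 9) with the 80-unit threshold handles drag-style scrolling, and a reversal of the accumulated vertical direction of travel stands in for the directional-change rule used for circular motion. The state handling and the pynput calls are illustrative assumptions rather than the authors' code.

```python
from pynput.mouse import Controller

mouse = Controller()
THRESHOLD = 80          # movement threshold quoted in the text

prev = None             # previous hand centre (x, y)
accum_x = 0             # accumulated horizontal distance S (Eq. 9)
accum_y = 0             # accumulated vertical travel
prev_dir_y = None       # last vertical direction of travel

def on_two_fingers(cx, cy):
    """Called once per frame while the right 'two-fingers' gesture is held."""
    global prev, accum_x, accum_y, prev_dir_y
    if prev is None:
        prev = (cx, cy)
        return
    dx, dy = cx - prev[0], cy - prev[1]
    prev = (cx, cy)

    # Drag-style horizontal scrolling: fire once |S| crosses the threshold.
    accum_x += dx
    if abs(accum_x) >= THRESHOLD:
        mouse.scroll(1 if accum_x > 0 else -1, 0)
        accum_x = 0

    # Simplified circular-motion rule: after more than THRESHOLD of vertical
    # travel, a reversal of direction counts as one rotation step.
    accum_y += dy
    if abs(accum_y) >= THRESHOLD:
        dir_y = "down" if accum_y > 0 else "up"
        if prev_dir_y is not None and dir_y != prev_dir_y:
            mouse.scroll(0, -1)        # scroll down for the clockwise case
        prev_dir_y = dir_y
        accum_y = 0
```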
3.5.3 Sequential Gestures

Gesture commands that require sequential gestures are implemented simply by storing the gesture performed in the previous video capture frame and using an if/else statement for execution. For example, to close the active Windows application using the flick gesture, if the current gesture is post-flick and the previous gesture is pre-flick, the virtual 'CTRL + F4' keystroke is pressed to close the active window. This is done by assigning 'previous gesture' = 'current gesture' at the end of each execution loop.
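A minimal sketch of this previous/current-gesture rule with the pynput keyboard controller (the gesture label strings are assumptions; the CTRL + F4 combination follows the text):

```python
from pynput.keyboard import Controller, Key

keyboard = Controller()
previous_gesture = None

def on_gesture(current_gesture):
    """Close the active window when pre-flick is followed by post-flick."""
    global previous_gesture
    if previous_gesture == "pre_flick" and current_gesture == "post_flick":
        with keyboard.pressed(Key.ctrl):       # virtual CTRL + F4 keystroke
            keyboard.press(Key.f4)
            keyboard.release(Key.f4)
    previous_gesture = current_gesture         # update at the end of each loop
```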
4 Experiment Result and Discussion

4.1 Model Evaluation

The model's performance on the testing dataset is shown in Fig. 6. The three test categories that involve foreign subjects performed significantly worse than those with the training subject. Subject 2 of the three foreign-subject categories had the lowest mAP, at 45.90%, while the 120 cm test category showed the highest mAP, at 96.68%. One conclusion that can be drawn from this experiment is that feature similarity of the gestures greatly affects the detection rate. For example, subjects 2 and 3 showed lower average precision, which may be due to gender and age differences from the training subject, while subject 1 may have performed better due to having the same gender as the training subject. Although the model was trained with a very limited dataset, it still showed relatively good performance on other test subjects. Therefore, if more training data are provided to train the model, it should be able to perform well in recognizing the gestures of different people. Further, hand gestures with higher feature similarity can increase the rate of false detection, such as the 'right pre-flick' and 'right post-flick' gestures, or the 'fist' and 'post-pinch' gestures, which have very similar features. Therefore, the use of gestures with distinctive features should improve the recognition rate.
The chart in Fig. 7 shows a significant decrease in mAP when the IoU threshold is increased to 75%. The combined mAP series, which evaluates the entire testing dataset together, decreases by 25.26% with the increased IoU. The extent of the reduction indicates that the model is not very good at predicting the accurate location of the bounding box: with a higher IoU threshold, detections that do not meet the threshold are categorized as false positives. A similar trend is observed in the original YOLOv3 article [27], where the drop in AP is relatively larger compared to other state-of-the-art object detectors.

Fig. 7 Graph of mAP with different IoUs
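For reference, the evaluation rule behind this drop can be sketched as follows: a detection only counts as a true positive if its IoU with the ground-truth box reaches the chosen threshold, so raising the threshold from 0.5 to 0.75 reclassifies loosely localized detections as false positives. This is the standard evaluation convention, not code from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_true_positive(pred_box, gt_box, iou_threshold=0.5):
    # At iou_threshold = 0.75 the same detection may become a false positive,
    # which is why the mAP in Fig. 7 falls at the higher threshold.
    return iou(pred_box, gt_box) >= iou_threshold
```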
Table 3 compares the hand recognition results with other similar studies of gesture recognition by image classification. The highest mAP at 50% IoU for this project is 96.68%, which is very close to the other studies. All the studies reported rather high recognition rates, since gestures are generally distinctive and consistent in their features.
Table 3 Comparison with other studies

Author (year) | Method | Accuracy (%)
Maqueda et al. [22] | VS-LBP + SVM (h–h) | 97.3
Chen et al. [11] | GMM + GMF | 98.7
Oyedotun and Khashman [25] | CNN and SDAE | 91.33 and 92.83
Ni et al. [24] | Light YOLO | 98.06
Bush et al. [7] | SSD + CNN | 98.97
This project | YOLOv3 | 96.68
However, there is a limitation for cursor control where
the control may experience fluctuation in position due to
inconsistent bounding box from the YOLOv3 output. This
is undeniable considering that YOLOv3 has lower AP at
higher IoU which indicates that it has a lower capability to
output accurate bounding box compared to another state-
of-the-art object detector [27]. Some smoothing algorithm
may be done to improve the control of cursor position,
such as applying the Bezier curves. Aside from that, other
object detectors may be used to improve the system. For
example, the newer YOLOv4 [6] is reported to have bet-
ter accuracy and detection speed. Also, the use of class-
agnostic non-maximal suppression with low IoU threshold
(0.1 to 0) can improve the system by eliminating repeated
posture detection on a single hand, thereby lowering the
false positive rate.
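As a sketch of this suggested improvement, class-agnostic non-maximum suppression compares boxes across all classes, so two different gesture labels predicted on the same hand collapse into the single highest-confidence detection. The snippet below (reusing the iou() helper sketched earlier) is a generic illustration under that assumption, not part of the paper's system.

```python
def class_agnostic_nms(detections, iou_threshold=0.1):
    """detections: list of (x1, y1, x2, y2, confidence, class_id) tuples.
    Overlap is checked regardless of class, unlike per-class NMS."""
    detections = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    for det in detections:
        if all(iou(det[:4], k[:4]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept
```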
Acknowledgements This research was funded by Universiti Malaysia Sarawak under the UNIMAS publication support fee fund.

Declarations

Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.

References

1. Al-Shamayleh AS, Ahmad R, Abushariah MAM, Alam KA, Jomhari N (2018) A systematic literature review on vision based gesture recognition techniques. Multimed Tools Appl. https://doi.org/10.1007/s11042-018-5971-z
2. Anwar S, Sinha SK, Vivek S, Ashank V (2019) Hand gesture recognition: a survey. Lecture notes in electrical engineering. https://doi.org/10.1007/978-981-13-0776-8_33
3. Ayooshkathuria (2018) pytorch-yolo-v3. GitHub
4. Bai Y, Zhang L, Wang T, Zhou X (2019) A skeleton object detection-based dynamic gesture recognition method. In: Proceedings of the 2019 IEEE 16th international conference on networking, sensing and control, ICNSC 2019. https://doi.org/10.1109/ICNSC.2019.8743166
5. Beyer G, Meier M (2011) Music interfaces for novice users: composing music on a public display with hand gestures. In: Proceedings of the international conference on new interfaces for musical expression
6. Bochkovskiy A, Wang CY, Liao M (2020) YOLOv4: optimal speed and accuracy of object detection. https://arxiv.org/pdf/2004.10934v1.pdf
7. Bush IJ, Abiyev R, Arslan M (2019) Impact of machine learning techniques on hand gesture recognition. J Intell Fuzzy Syst. https://doi.org/10.3233/JIFS-190353
8. Camgoz NC, Hadfield S, Bowden R (2017) Particle filter based probabilistic forced alignment for continuous gesture recognition. In: Proceedings—2017 IEEE international conference on computer vision workshops, ICCVW 2017. https://doi.org/10.1109/ICCVW.2017.364
9. Chandrasekaran G, Periyasamy S, Panjappagounder Rajamanickam K (2020) Minimization of test time in system on chip using artificial intelligence-based test scheduling techniques. Neural Comput Appl. https://doi.org/10.1007/s00521-019-04039-6
10. Chai X, Liu Z, Yin F, Liu Z, Chen X (2016) Two streams recurrent neural networks for large-scale continuous gesture recognition. In: Proceedings—international conference on pattern recognition. https://doi.org/10.1109/ICPR.2016.7899603
11. Chen D, Li G, Sun Y, Kong J, Jiang G, Tang H, Ju Z, Yu H, Liu H (2017) An interactive image segmentation method in hand gesture recognition. Sensors (Switzerland). https://doi.org/10.3390/s17020253
12. Chua SND, Lim SF, Lai SN et al (2019) Development of a child detection system with artificial intelligence using object detection method. J Electr Eng Technol 14:2523–2529. https://doi.org/10.1007/s42835-019-00255-1
13. Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2010) The pascal visual object classes (VOC) challenge. Int J Comput Vis. https://doi.org/10.1007/s11263-009-0275-4
14. Flores CJL, Cutipa AEG, Enciso RL (2017) Application of convolutional neural networks for static hand gestures recognition under different invariant features. In: Proceedings of the 2017 IEEE 24th international congress on electronics, electrical engineering and computing, INTERCON 2017. https://doi.org/10.1109/INTERCON.2017.8079727
15. Geirhos R, Schütt HH, Medina Temme CR, Bethge M, Rauber J, Wichmann FA (2018) Generalisation in humans and deep neural networks. In: Advances in neural information processing systems
16. Huang H, Chong Y, Nie C, Pan S (2019) Hand gesture recognition with skin detection and deep learning method. J Phys Conf Ser. https://doi.org/10.1088/1742-6596/1213/2/022001
17. Islam MZ, Hossain MS, Ul Islam R, Andersson K (2019) Static hand gesture recognition using convolutional neural network with data augmentation. In: 2019 Joint 8th international conference on informatics, electronics and vision, ICIEV 2019 and 3rd international conference on imaging, vision and pattern recognition, IcIVPR 2019 with international conference on activity and behavior computing, ABC 2019. https://doi.org/10.1109/ICIEV.2019.8858563
18. Ji S, Xu W, Yang M, Yu K (2013) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2012.59
19. Kim H, Albuquerque G, Havemann S, Fellner DW (2005) Tangible 3D: hand gesture interaction for immersive 3D modeling. In: 9th international workshop on immersive projection technology—11th Eurographics symposium on virtual environments, IPT/EGVE 2005
20. Kim S, Ji Y, Lee KB (2018) An effective sign language learning with object detection based ROI segmentation. In: Proceedings—2nd IEEE international conference on robotic computing, IRC 2018. https://doi.org/10.1109/IRC.2018.00069
21. Köpüklü O, Gunduz A, Kose N, Rigoll G (2019) Real-time hand gesture detection and classification using convolutional neural networks. In: Proceedings—14th IEEE international conference on automatic face and gesture recognition, FG 2019. https://doi.org/10.1109/FG.2019.8756576
22. Maqueda AI, Del-Blanco CR, Jaureguizar F, García N (2015) Human-computer interaction based on visual hand-gesture recognition using volumetric spatiograms of local binary patterns. Comput Vis Image Underst. https://doi.org/10.1016/j.cviu.2015.07.009
23. Molchanov P, Yang X, Gupta S, Kim K, Tyree S, Kautz J (2016) Online detection and classification of dynamic hand gestures with recurrent 3D convolutional neural networks. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition. https://doi.org/10.1109/CVPR.2016.456
24. Ni Z, Chen J, Sang N, Gao C, Liu L (2018) Light YOLO for high-speed gesture recognition. In: Proceedings—international conference on image processing, ICIP. https://doi.org/10.1109/ICIP.2018.8451766
25. Oyedotun OK, Khashman A (2017) Deep learning in vision-based static hand gesture recognition. Neural Comput Appl. https://doi.org/10.1007/s00521-016-2294-8
26. Rahmat RF, Chairunnisa T, Gunawan D, Pasha MF, Budiarto R (2019) Hand gestures recognition with improved skin color segmentation in human-computer interaction applications. J Theor Appl Inf Technol 97(3):727–739
27. Redmon J, Farhadi A (2018) YOLO v.3. Tech Report
28. Tzutalin (2015) LabelImg. https://github.com/tzutalin/labelImg
29. Walker A (2013) Voice commands or gesture recognition: how will we control the computers of the future? https://www.independent.co.uk/life-style/gadgets-and-tech/voice-commands-or-gesture-recognition-how-will-we-control-the-computers-of-the-future-8899614.html
30. Wan J, Li SZ, Zhao Y, Zhou S, Guyon I, Escalera S (2016) ChaLearn looking at people RGB-D isolated and continuous datasets for gesture recognition. In: IEEE computer society conference on computer vision and pattern recognition workshops. https://doi.org/10.1109/CVPRW.2016.100
31. Yang X, Tian Y (2014) Super normal vector for activity recognition using depth sequences. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition. https://doi.org/10.1109/CVPR.2014.108
32. Zhang Y, Cao C, Cheng J, Lu H (2018) EgoGesture: a new dataset and benchmark for egocentric hand gesture recognition. IEEE Trans Multimed. https://doi.org/10.1109/TMM.2018.2808769

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

K. Y. Richard Chin received his B.Eng. from Universiti Malaysia Sarawak, Malaysia. His research interests include computer vision and mechanical processes.

S. F. Lim received her Ph.D. degree from the National University of Singapore, Singapore. Her research interests include adsorption process and AI.

Pushpdant Jain received his Ph.D. from the National Institute of Technology, Rourkela (Odisha). His research interests include new product development and finite element analysis.

S. N. David Chua received his Ph.D. degree from Dublin City University, Ireland. His research interests include finite element modelling, simulation, biomechanical and AI.