Computer Vision Based Hand Gesture Recognition For Human-Robot Interaction
https://doi.org/10.1007/s40747-023-01173-6
Abstract
As robots become more pervasive in our daily lives, natural human-robot interaction (HRI) has had a positive impact on the development of robotics. Thus, there has been growing interest in the development of vision-based hand gesture recognition for HRI to bridge human-robot barriers, with the aim of making interaction with robots as natural as interaction between individuals. Accordingly, incorporating hand gestures in HRI is a significant research area: hand gestures provide natural, intuitive, and creative methods for communicating with robots. This paper provides an analysis of hand gesture recognition using both monocular and RGB-D cameras for this purpose. Specifically, the main stages of visual gesture recognition, namely data acquisition, hand gesture detection and segmentation, feature extraction, and gesture classification, are discussed, and experimental evaluations are reviewed. Furthermore, algorithms of hand gesture recognition for human-robot interaction are examined, and the advances required to improve present hand gesture recognition systems for effective and efficient human-robot interaction are discussed.
in contemporary human-computer interaction. A successful implementation of gesture recognition will provide a revolutionary human-computer interaction experience.

Taxonomy of hand gesture recognition

Gesture recognition methods are divided into sensor-based and vision-based methods.

Sensor-based methods

Owing to the variety of available sensors, sensor-based recognition may be grouped into three categories: methods using data gloves, methods using EMG signals, and methods using Wi-Fi signals.

(1) Data gloves
With the widespread use of sensors, wearable sensor-based gesture recognition has advanced quickly. To recognize gestures using wearable technology, a glove with numerous sensors must be worn on the hand, and the data from the glove must be analyzed. In particular, gesture recognition based on data gloves can intuitively gather three-dimensional spatial information on hand posture by utilizing many sensors and is not limited by the surrounding environment.
A sensing glove was proposed by Komura and Lam [1] for controlling 3D game characters. Kim et al. [2] suggested a data glove-based sign language recognition system and achieved a 99.26% motion detection rate and an approximately 98% finger state recognition rate. Using data gloves, Rodriguez et al. [3] investigated the use of gestures for human-computer interaction (HCI) in virtual reality (VR) applications. Helen Jenefa and Gokulakrishnan [4] made gloves for people with speech impairments that individuals with hearing and speech impairments can readily wear and use. Such a glove is equipped with bending sensors, accelerometer sensors, and touch sensors that measure the bending and movement of the user's hand and enable nonmute persons to comprehend the gestures produced by a speech-impaired individual. To recognize several motions, including move-ready, grip, loose, landing, takeoff, and hover motions, as well as to enable remote control of a six-axis vehicle, Huang et al. [5] created a data glove; the researchers achieved an overall recognition rate of 84.3%. Using five fundamental classification algorithms (decision trees, support vector machines, logistic regression, Gaussian naive Bayes, and a multilayer perceptron), Antillon et al. [6] created a smart diving glove that was trained and tested; a study was also performed underwater to determine whether the environment had any impact on how each algorithm classified gestures. Mummadi et al. [7] demonstrated a data glove prototype with a glove-embedded gesture classifier that used information from inertial measurement units on the user's fingertips.

(2) Electromyography (EMG)
Using electrodes affixed to the skin or inserted into the muscles, electromyography records the electrical activity of muscle tissue. This technique primarily uses sensors to gather electrical signals from the skin and muscles on the surface of the human body. After amplifying and further processing the signal data, the method screens the information that might be contained in each gesture before recognizing gestures.
In 2002, Vuskovic and Du [8] used two-channel sEMG signals simultaneously to identify six different gestures with an accuracy of 78%. In 2005, Nazarpour et al. [9] performed feature extraction using high-order statistics and correctly identified four forearm movements using a clustering method with an accuracy of 91%. In 2018, Wu et al. [10] considered 15 features, including integral electromyographic data, to recognize five hand motions with an average accuracy of 90% using upgraded k-nearest neighbor algorithms. In 2015, Guo et al. [11] used four different types of data as feature input; the best accuracies of the random forest classifier and the support vector machine (SVM) classifier were 88.7% and 85.9%, respectively. To identify four different types of gesture motions, Kim et al. [12] considered power spectral density as feature input and chose the SVM classifier, achieving an accuracy of 91.97%.

(3) Wi-Fi and radar
Due to the growing presence of Wi-Fi devices in indoor settings, gesture recognition technology based on Wi-Fi signals has drawn increasing attention.
Wi-Fi signals were first used for sensing in 2000 by Bahl and Padmanabhan [13], who suggested a system for indoor localisation based on the received signal strength of such signals. In 2013, Pu et al. [14] proposed WiSee, which used the Doppler shift as a feature to identify nine gestures, such as pushing and pulling, with an accuracy of 94%. WiSee, however, relied on specialized software-defined radio devices, could not be immediately implemented on existing Wi-Fi devices, and required a distance of more than 10 cm between the two antennas to obtain good results. Mudra [15] developed a Wi-Fi-based finger-level gesture detection method with 96% accuracy by utilizing the difference in signals between antennas in different locations. WiFall [16] applied the random forest method and an SVM classifier to categorize various human activities and implement fall detection, yielding an average false alarm rate of 18% and a detection accuracy of 87%. To address the issue of signal propagation through walls, Wu et al. [17] suggested a passive human activity identification system based on Wi-Fi signals without any extra equipment.
Users of sensor-based methods typically must wear gloves with sensors or have probes attached to their arms. Additionally, the methods used in a laboratory setting are frequently constrained by the instruments that must be set up before recognition.

Computer vision-based methods

Image sensor technology has undergone continuous updating and iteration since its inception. Because 2D image sensors cannot provide the additional information required to fulfill the needs of contemporary society, academic interest in such sensors is currently declining, and the AI internet-of-things (IoT) field is shifting toward 3D. Monocular, binocular, and depth (RGB-D) cameras are the three main types of cameras used in vision-based gesture detection systems.

Microsoft's Kinect V1 (the first generation of Kinect) is a depth camera first unveiled on June 14, 2010; it combines OpenNI and the SDK library to track the joints of the human skeleton as a foundation for gesture recognition research. A dynamic Arabic sign language recognition system for Kinect was introduced by Hisham and Hamouda [18]. It combined decision trees with Bayesian classifiers for gesture identification and then used the AdaBoost method to improve the system, resulting in a recognition rate of approximately 93.7%.

Leap Motion, a body controller manufacturer focused on the PC and Mac platforms, introduced its body controller on February 27, 2013; it utilizes the stereo vision principle and two cameras to determine the coordinates of spatial objects similarly to the human eye.

RealSense cameras from Intel are also depth cameras with gesture recognition capabilities. Extracting valid descriptors from the connected joints of the hand skeleton returned by the Intel RealSense depth camera, De Smedt et al. [19] proposed a skeleton-based 3D gesture recognition method. A Fisher vector obtained using a Gaussian mixture model represented the encoding of each descriptor used to obtain the final feature vector. The original data were not filtered, and SVM was the only classifier used, leading to a comparatively poor recognition rate.

Gesture recognition processes

Static gesture recognition and dynamic gesture recognition are the two categories of gesture recognition technologies [20]. The former implies that the hand is fixed for recognition and that aspects such as hand posture, shape, and location do not change [21]. Dynamic gestures are composed of sequential frames of static gestures, implying that the latter are a subset of dynamic gestures [22].

The main process of visual gesture recognition is as follows:

• Data acquisition: acquiring gesture images with a video camera and preprocessing the images;
• Gesture detection and segmentation: detecting the position of the hand in the gesture image and segmenting the hand region;
• Gesture recognition: extracting image features from the hand region and recognizing the gesture type based on the features. In "Hand gesture recognition process", the discussion is divided into these respective parts.

"Hand gesture recognition process" will thoroughly analyze and define the essential tactics involved in gesture recognition, using the basic steps of gesture recognition as the main consideration. "Experimental Evaluation" will present several evaluation metrics for gesture recognition and segmentation. With the emergence of depth cameras, there has been significant growth in studies of gesture recognition based on depth data; "Hand gesture recognition based on RGB-D cameras" will detail the research of various scholars on gesture recognition based on RGB-D cameras. The applications of gesture recognition, particularly in robotics and human-computer interaction, will be described in "Hand gesture recognition applications". "Problems, outlook, and conclusion" will discuss the current difficulties and future directions in gesture recognition.

Hand gesture recognition process

Designed to enable information exchange between users and intelligent devices, vision-based gesture recognition technology refers to the acquisition of video images containing operation gestures through acquisition devices such as cameras and the corresponding processing of the video images, such as gesture segmentation, gesture feature extraction, and gesture feature classification.

Data acquisition

The data collected for vision-based gesture recognition are image frames. An image needs to be preprocessed because it cannot be recognized directly after being acquired by the camera. To improve the overall performance of the system, the image preprocessing stage modifies the input image or video. The following are some common preprocessing steps.
Image grayscaling

Image grayscaling is the conversion of color images to grayscale images for better processing of image information. Ĉadik et al. [23] and Benedetti et al. [24] reviewed the research on grayscaling of color images. In grayscaling, the value of each pixel in an image is calculated from the values of its red, green, and blue channels by a specific algorithm to obtain a gray value that represents the luminance of that pixel [25]. The purpose of grayscaling is to simplify image processing and reduce computational and storage requirements. Common image grayscaling algorithms include the average method, the weighted average method, the maximum value method, and the minimum value method. In general, the average method and the weighted average method are used more often because they are simpler and more effective.

Image smoothing

Noise is removed from images using image smoothing techniques (e.g., Gaussian filtering, median filtering, etc.) to extract gesture information more accurately [26]. Gaussian filtering is a smoothing method based on a Gaussian function that can effectively smooth an image while preserving edge information. Such filtering involves convolving the image and replacing the value of each pixel with the weighted average of the pixels in the region around that pixel. The kernel size and standard deviation of Gaussian filtering determine the degree of image smoothing; in gesture preprocessing, an appropriate Gaussian kernel size and standard deviation are usually chosen to achieve the best smoothing effect. Median filtering, on the other hand, is a smoothing method based on rank-order statistics that can eliminate outliers such as salt-and-pepper noise while preserving the details of the image. The basic idea is to replace the value of a pixel with the median of its neighbors. Certain objects with hue and saturation characteristics similar to those of skin can produce salt-and-pepper noise in images generated by skin region detection, and this noise can be suppressed using median filtering and morphological methods [27]. In [28, 29], researchers used median filtering to process images.

Edge detection

Edge detection algorithms (e.g., the Canny algorithm, the Sobel algorithm, etc.) are used to detect edge information in an image for better segmentation of gestures. Edge detection is the process of identifying and locating points of sharp discontinuity in an image that represent the boundary between an object and the background or between adjacent objects [30, 31]. The Canny algorithm detects edges by calculating the gradient and orientation of pixel points in an image, and its main features are high accuracy, low noise sensitivity, and clear details of the detected edges [30]. The advantage of this algorithm is that it can effectively remove noise from an image and extract clear, continuous edges; the disadvantage is that it is computationally intensive, requires multiple filtering and processing operations, and is highly complex. The Sobel algorithm is similar to the Canny algorithm in that it detects edges by calculating the gradients in the x and y directions of each pixel; it has the advantage of being computationally simple, fast, and capable of detecting fine edges [32]. The disadvantage is that it is not sufficiently accurate at detecting straight edges, and it easily generates noise. Therefore, the image is usually smoothed using methods such as Gaussian filtering in the early stages of edge detection to reduce the effect of noise.

Morphological image processing

Morphological operations (e.g., expansion, erosion, etc.) are used to morphologically process an image to better extract the shape information of a gesture [26]. Commonly used morphological operations include expansion, erosion, the open operation, and the close operation. The expansion operation can enlarge the target region in the image to make it more visible, which is suitable for extracting gesture edge information. The erosion operation can shrink the target area in the image, which facilitates removing noise and small details. The open and close operations can remove burrs and holes in the image, respectively, to make the shape of the gesture clearer.

Optimum thresholding

Threshold segmentation algorithms (such as the Otsu algorithm, the Niblack algorithm, etc.) are used to divide an image into two parts, foreground and background, to better segment the gestures [33]. The Otsu algorithm is a global threshold segmentation algorithm; its basic idea is to divide the pixel gray values of an image into two classes such that the weighted sum of the within-class variances is minimized [34]. This algorithm can adaptively determine the threshold and is hence suitable for image segmentation tasks in various scenes. The Niblack algorithm is a local binarization algorithm that divides the image into several small regions and then binarizes each region; it uses gray value thresholding to decide whether each pixel belongs to the foreground or the background [35]. This algorithm is more suitable than global thresholding for segmenting images with highly diverse gray value distributions and can effectively handle images with uneven illumination and complex backgrounds.

In general, an input image is first thresholded into a binary image, then noise is removed using median and Gaussian filters, and a preprocessing stage using morphological operations follows. A minimal code sketch of this pipeline is given below.
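To make these steps concrete, the sketch below chains them with OpenCV in one plausible ordering (grayscale, smoothing, Otsu thresholding, median filtering, and morphological cleanup). It is a minimal illustration rather than the pipeline of any surveyed system; the function name, kernel sizes, and parameter choices are assumptions that would be tuned per application.

```python
import cv2
import numpy as np

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Gesture-image preprocessing: grayscale -> smooth -> threshold -> morphology."""
    # Grayscale conversion (a weighted average of the R, G, and B channels).
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Gaussian smoothing suppresses sensor noise before thresholding.
    smoothed = cv2.GaussianBlur(gray, (5, 5), sigmaX=1.0)
    # Otsu's method adaptively picks a global threshold separating the two classes.
    _, binary = cv2.threshold(smoothed, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Median filtering removes residual salt-and-pepper noise.
    binary = cv2.medianBlur(binary, 5)
    # Opening removes small specks; closing fills small holes in the hand region.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    return binary
```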
Gesture detection and segmentation

The gesture must first be separated from the background for the computer to recognize it, because the computer records the details of both the gesture and the scene in which it occurs. Hand segmentation is the division of the collection of pixel point coordinates obtained in the earlier gesture detection phase, which reduces the computation over pixel points and facilitates subsequent operations. Gesture segmentation is the first crucial stage in a gesture recognition algorithm, and its successful completion is necessary for accurate gesture identification. There are many gesture segmentation techniques, but from a practical and application point of view, almost all of them still face great difficulties in terms of accuracy, stability, and speed (such as in the case of gesture segmentation in complex backgrounds), as well as the impact of the distance between the camera and the person on gesture segmentation. A comparison of gesture detection and segmentation methods is shown in Table 1.

Skin color segmentation

The most fundamental apparent characteristic of a human hand is skin color. Even though everyone has a unique skin tone, the skin tones of the human body are concentrated in a certain region of a particular color space. In addition, the orientation, size, and perspective of an image itself have very little impact on skin color, which is highly invariant under rotation, translation, and scale reduction. Hence, a significant portion of current studies of gesture recognition rely on skin tone information for gesture segmentation. The three most commonly used color spaces are the RGB, HSV, and YCbCr color systems. (A code sketch of this approach is given at the end of this section.)

In [36], the hand skin tone was segmented using the threshold method. In [37], a skin color detection method was used to detect hands and faces. A skin tone model was utilized in [38] for segmentation, and the HSV color model was selected after a comparison with the RGB model because of the influence of luminance. A simple segmentation technique based on calculating the maximum and minimum skin probabilities of the input RGB image was used in [39]. In [40], a skin detection model and an approximate median model were applied to segment the image; the approximate median model was utilized for background subtraction, and the skin detection model was used to identify the hands and fingers in the image. At the same time, without relying on any artificial neural network training, Dhule determined the precise sequence of moving hands and fingers by calculating the change in RGB color pixel values in a video and controlled the mouse movement in a window in real time according to the hand and finger movements. A gesture recognition scheme based on the skin color model approach and the threshold approach combined with effective template matching using principal component analysis was proposed in [41]. Veluchamy et al. [42] used a skin color thresholding model for segmentation; numerous characteristics were extracted using the scale-invariant feature transform and monogenic binary coding algorithms before being identified using an efficient classifier. In [43], the problem of segmenting the hands involved in gesture production was solved in two different ways using a ribbon-based segmentation algorithm: the first used special color stickers on the fingers, and the second was based on normal skin color segmentation. Wang et al. [44] segmented the hand region by locating the skin tone region in the CbCr plane of the YCbCr color space using the "elliptical boundary model". Considering the YCbCr color space, Patel [45] noted that the hand region could be cropped from all the images in the dataset by thresholding segmentation. The skin color detection algorithm used in [46] facilitated communication between the user and the computer.

Contour information segmentation

Another crucial component of gesture segmentation is detection of the presence of contours and edges. Target segmentation techniques based on contour information have provided a new technological advance in gesture segmentation. Edge detection operators, template matching, and active contour models are the three primary categories of conventional gesture segmentation techniques based on contour information.

Traditionally, gesture contours have been extracted from photos by using edge detection operators to identify edges in images. In the context of gesture segmentation, template matching, a traditional target localization technique, has also been applied to some extent. To discover the best match, template matching requires placing a preset template on a point in the image, calculating how well the template matches the image at that point, and then iteratively moving through the entire image.

A segmentation of grayscale gesture images was performed in [47] using a histogram thresholding segmentation technique. A morphological filtering technique was created to represent the gesture contours to successfully filter out the background and target noise from the segmented image. The wrist cropping approach in [48] used a width and contour heuristic to estimate the wrist position among the segmented hand patches and then extracted the hand from the estimated wrist position to separate the segmented hand from the rest of the arm. The nodes of the human body were identified in [49] by using background subtraction, contour detection, and edge detection with the Sobel operator. Generating a population gesture feature collection for multiview gesture photos was proposed in [50], along with a new Pareto optimum frontier-based multiview gesture detection method.
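As an illustration of the threshold-based skin-color methods above, the following sketch segments skin pixels with a fixed rectangular range in the YCbCr space (YCrCb in OpenCV's channel order). The Cr/Cb bounds are commonly quoted rough values, not parameters taken from any of the cited papers; the elliptical boundary model of Wang et al. [44] would replace the rectangular test with an ellipse in the CbCr plane.

```python
import cv2
import numpy as np

def skin_mask_ycbcr(frame_bgr: np.ndarray) -> np.ndarray:
    """Rough skin segmentation via a rectangular threshold in YCbCr space."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    # Luminance Y is left nearly unconstrained; skin tones cluster in the Cr/Cb plane.
    lower = np.array([0, 133, 77], dtype=np.uint8)     # (Y, Cr, Cb) lower bounds
    upper = np.array([255, 173, 127], dtype=np.uint8)  # (Y, Cr, Cb) upper bounds
    mask = cv2.inRange(ycrcb, lower, upper)
    # Morphological cleanup suppresses isolated skin-like background pixels.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # 255 = likely skin
```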
Table 1 Comparison of gesture detection and segmentation methods

Based on skin color: color space methods
  Advantages: fast processing; invariance to rotation, partial occlusion, and pose changes
  Disadvantages: susceptible to interference from skin-like areas

Based on contour information: edge detection operators
  Advantages: fast and accurate extraction of gesture edge information
  Disadvantages: edge extraction results may be broken, overlapping, etc., and require subsequent processing

Based on contour information: template matching
  Advantages: adaptable to different shapes and sizes in gesture segmentation; for relatively simple gestures, the matching accuracy is high and the segmentation results are more accurate
  Disadvantages: the need to prepare many templates in advance increases system complexity

Based on contour information: active contour models
  Advantages: adaptive adjustment of contour shape, suitable for gesture segmentation of various shapes; better for gesture segmentation in the presence of noise, complex backgrounds, etc.
  Disadvantages: high algorithmic complexity and significant demand for computational resources

Based on a depth sensor
  Advantages: high identification efficiency
  Disadvantages: reliance on depth cameras; accuracy needs improvement

Based on deep learning
  Advantages: no manual analysis of gesture data is required for segmentation, which is thus more convenient and robust
  Disadvantages: the real-time performance of detection needs improvement

Gesture tracking (dynamic gestures): frame differential method
  Advantages: simple to implement with low programming overhead; lower sensitivity to scene changes such as lighting; adapts to various dynamic environments; relatively robust
  Disadvantages: cannot extract the complete area of an object, leaving "holes" inside it; extracts only the boundary, with an outline that is coarse and often larger than the actual object

Gesture tracking (dynamic gestures): background subtraction
  Advantages: extracts the complete area of an object; reduced sensitivity to scene changes such as light variations; adapts to various dynamic environments
  Disadvantages: requires initial background modeling; sensitive to light changes

Gesture tracking (dynamic gestures): optical flow
  Advantages: extracts the complete area of an object; insensitive to light changes
  Disadvantages: inadequate detection for fast movement or unclear object surface texture

Gesture tracking (dynamic gestures): MeanShift algorithm
  Advantages: good real-time performance; fast calculations
  Disadvantages: ineffective tracking under large variations in target shape and size

Gesture tracking (dynamic gestures): CAMShift algorithm
  Advantages: better tracking under large variations in target shape and size
  Disadvantages: computationally heavy and slow

Gesture tracking (dynamic gestures): particle filtering algorithm
  Advantages: fast calculations and low storage requirements
  Disadvantages: prone to mismatches and to losing detailed features if the target resembles the background

Of course, there are other methods of gesture segmentation based on appearance features apart from skin color and contour. The shape and direction of the hand, taken from an input video stream captured under stable lighting and simple background conditions, were proposed in [51] as a basis for recognition of static gesture images. In [52], an adaptive threshold binarization-based homomorphic filter was used in the construction of a system resistant to changing lighting conditions. Additionally, edge-based grayscale segmentation ensured that the method could be applied to users with a range of skin tones and backgrounds. In [53], a hand detection method that incorporated skin filtering and three-frame differencing was proposed.
Other segmentation approaches

The depth sensor-based gesture segmentation method uses a depth camera to gather hand-depth structure information and then segments the hand region, as detailed in "Hand gesture recognition based on RGB-D cameras".

The gesture segmentation methods mentioned above require manually designed features of the target images; they can achieve gesture segmentation in simple contexts, but it is difficult to design effective gesture features in complex environments, and thus this approach is difficult to apply in natural human-computer interaction systems. The development of deep learning opens up new options for gesture segmentation, since a model can be trained on vast amounts of data to automatically learn target gesture attributes, allowing it to complete target gesture recognition and gesture segmentation using the identified target gestures [54]. Paul et al. [55] offered a method for extending the convolutional neural network-based hand segmentation approach from still photos to video images. The proposed technique was more resistant to distortion and occlusion issues, resulting in improved accuracy and delay tradeoffs. Compared to traditional approaches, the deep learning-based gesture segmentation method eliminates the need for manual analysis of gesture data for segmentation, making segmentation more convenient and making this a promising approach to gesture segmentation [56, 57]. However, at the current stage of development, this method has flaws: first, some network layers are complicated, slowing gesture segmentation; second, edge detection may yield blurred results, and its accuracy still needs to be improved [56, 57].

Multiple methods have indeed been used to acquire hand segmentation images. For example, a single Gaussian model was applied in [58] to describe the hand color in the HSV color space. To achieve reliable hand tracking, Ding and Su [58] combined optical flow and color cues. Zhao and Jia [59] proposed a manual segmentation technique for depth images based on a random decision forest architecture.

Tracking

In this paper, tracking is considered a component of segmentation because the goal of both tracking and segmentation is to separate the hand from the background. The frame-by-frame analysis of temporally continuous images and the determination of the tracked target during the image change interval are the fundamentals of gesture tracking.

Fukunaga proposed the MeanShift algorithm in 1975; its basic idea is to use the gradient ascent of probability density to find the local optimum [60]. It is a straightforward algorithm with excellent real-time performance. If the target size changes, however, it is prone to tracking drift [61]. Khan et al. [62] combined the spatial information of moving targets with the traditional MeanShift algorithm to effectively solve the problem of tracking degradation caused by occlusion of the moving target.

The CAMShift algorithm is a modification of the MeanShift algorithm; its full name is "Continuously Adaptive MeanShift". Its basic idea is to run the MeanShift operation on every frame of a video and use the result from the previous frame (i.e., the search window's center and size) as the initial search window for the next frame, iteratively [63]. It can adapt to target deformation by changing the window size adaptively, but if the surrounding environment is complex, the tracking window may diverge, and the target will be lost [64]. Several studies have used the CAMShift method to track the position of gestures; examples are the applications in [65, 66] to detect and track gestures. The CAMShift method detects the position of gestures by continuously resizing the search window. (A tracking sketch is given at the end of this subsection.)

Motion detection is a popular technique for segmenting dynamic targets. Its fundamental principle is to fully localize and extract currently moving targets by combining the visual information of previous moments in a video. The temporal difference approach [53, 67], the background subtraction method [68, 69], and the optical flow method [70, 71] are the three primary categories of conventional gesture segmentation techniques based on motion information.

The temporal difference method's primary premise is to choose a number of adjacent frames in a video sequence, execute the difference operation, and then extract the moving target by separating it from the backdrop using a predetermined threshold. The pixels of two successive frames are subtracted from one another: if the difference is negligible relative to the environmental brightness, the object can be assumed to be stationary, whereas if a significant change in pixel values occurs anywhere in the image region, the change is assumed to be the result of a moving object. If a moving object was previously stationary or blocked by another object, the temporal difference approach and its upgraded algorithms frequently lose object information. Additionally, because the temporal difference method assumes the invariance of the image backdrop, it is inappropriate if the background is moving. In [72], the issue of the frame difference pairs between the stored gestures and the query gestures used for matching was resolved; the hand trajectory following the center was monitored according to the direction between consecutive frames and the distance from the center of the frame.

The background subtraction method's fundamental concept is similar to that of the temporal difference method: the input image is compared to a background model, and the moving target is segmented by looking for changes in statistical information (such as histograms) or in features (such as a grayscale representation). A background model is first constructed, and the backdrop image is stored. The current frame is then subtracted from the background image, and a threshold T decides which pixels belong to the foreground target: pixels whose difference exceeds T are marked as foreground. After thresholding, points with a value of 0 represent the background image, and points with a value of 1 represent pixels in motion in the scene. If the entire backdrop image is available, this approach can capture objects effectively with good real-time performance. However, the method is less reliable, and the outcome is significantly affected by changes in the dynamic scene.
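The MeanShift/CAMShift tracking loop described above is available directly in OpenCV; the sketch below follows a hand from a given initial bounding box by back-projecting a hue histogram and letting CAMShift adapt the window frame by frame. The function name, the initial window, and the histogram parameters are illustrative assumptions.

```python
import cv2
import numpy as np

def track_camshift(video_path: str, init_window: tuple) -> None:
    """Track a hand region with CAMShift using hue-histogram back-projection."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    x, y, w, h = init_window                # initial hand box, assumed given
    hsv_roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    # Hue histogram of the hand region; dark/unsaturated pixels are masked out.
    mask = cv2.inRange(hsv_roi, (0, 60, 32), (180, 255, 255))
    hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    # Stop after 10 iterations or a window shift of less than 1 pixel.
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    track_window = init_window
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        # CAMShift resizes and rotates the search window to follow the target.
        rot_rect, track_window = cv2.CamShift(backproj, track_window, criteria)
        pts = cv2.boxPoints(rot_rect).astype(np.int32)
        cv2.polylines(frame, [pts], True, (0, 255, 0), 2)
        cv2.imshow("CAMShift", frame)
        if cv2.waitKey(30) & 0xFF == 27:    # Esc quits
            break
    cap.release()
    cv2.destroyAllWindows()
```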
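Likewise, the temporal-difference and background-subtraction ideas just described reduce to a few lines. The sketch below pairs a simple two-frame difference with OpenCV's MOG2 background model; the threshold value of 25 is only an arbitrary stand-in for the threshold T in the text.

```python
import cv2

def motion_masks(video_path: str, diff_thresh: int = 25):
    """Yield (temporal-difference mask, background-subtraction mask) per frame."""
    cap = cv2.VideoCapture(video_path)
    bg_model = cv2.createBackgroundSubtractorMOG2(detectShadows=False)
    prev_gray = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Background subtraction: compare the frame against a learned model.
        bg_mask = bg_model.apply(frame)
        if prev_gray is not None:
            # Temporal difference: pixels whose intensity changed by more
            # than the threshold are marked as moving (value 255).
            diff = cv2.absdiff(gray, prev_gray)
            _, diff_mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
            yield diff_mask, bg_mask
        prev_gray = gray
    cap.release()
```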
Simple background subtraction was avoided in [73] due to the complex background and potential dynamic hand movements; instead, certain morphological approaches and two-stage skin color identification were applied to reduce noise. The method suggested in [74] used background subtraction techniques and skin color-based schemes to identify palms in video feeds, exploring the potential for a usable computer vision framework for gesture detection. To overcome the restrictions on gesture input caused by background subtraction and by hand and rotation invariance, a new effective recognition elimination method was provided in [75].

An optical flow field is a velocity field that depicts the three-dimensional movements of object points via a two-dimensional map. Optical flow is defined based on pixel points that indicate picture changes caused by motion over a time interval. The pixel regions that best match the motion model can be found using the optical flow method, and the regions can then be combined to create moving objects for object detection. The optical flow method can detect an object independently without the need for additional camera data, but in most cases it is more time-consuming than desired, computationally complex, and susceptible to noise. Real-time detection can only be accomplished with specialized hardware support, meaning that the optical flow method has a significant time overhead and poor real-time performance and is often impractical. In a related study, the hand was separated from the background by using the inverse projection of the color model and motion cues in [76]. Ganokratanaa and Pumrin [77] proposed a dynamic gesture identification system for older individuals that tracked six dynamic movements and categorized their meanings using optical flow and speckle analysis in vision-based gesture recognition.

For visual tracking, the particle filter method is commonly applied. It is a sequential Monte Carlo importance-sampling method for estimating the latent state variables of a dynamic system from a series of observations. Particle filtering is commonly used in conjunction with other techniques for gesture tracking. The combination of particle filtering and the MeanShift algorithm was shown in [78, 79] to accurately identify hands. In the study of [80], the Kalman particle filter was introduced as an improvement to particle filtering in gesture tracking. The Kalman filter was also applied for gesture tracking in [81, 82].

Feature extraction

Gesture feature extraction is a key step in gesture recognition. The feature extraction part involves processing the input gesture image and then extracting from it the features that can represent the gesture. Gesture features are the features that characterize the gesture's form and motion state, extracted through the analysis of hand movements and postures [83]. The selection and design of gesture features are related not only to the accuracy of gesture recognition but also to the complexity and real-time performance of the system. Gesture features are mainly divided into global and local features.

(1) Global features
Global features describe the morphology and movement of the entire hand. They include the size, shape, direction, speed, acceleration, rotation angle, etc., of the hand. These features can describe the overall state of the hand and are applicable to some simple gesture recognition tasks. Global features also usually include color histograms, grayscale histograms, Gabor filters, etc. A color histogram records the frequency of occurrence of various colors in an image, represented in the form of a histogram. A gray histogram is a count of the frequency of occurrence of each gray level in an image, represented in the form of a histogram. A Gabor filter is a filter capable of extracting image texture information; it detects the texture direction and frequency in an image and represents them in the form of a feature vector [84]. Global features are simple to compute and have intuitive representations and good invariance properties. However, such features mostly have pixel point-based representations, and hence there are problems such as high feature dimensionality and large computational effort. In addition, such feature descriptions are inapplicable in the case of image blending and occlusion. (A sketch of simple global descriptors follows below.)

(2) Local features
Local features are extracted by analyzing local areas such as the fingers, palm, and wrist. They include the curvature of the fingers, the degree of palm protrusion, the rotation angle of the wrist, etc. These features can more accurately characterize detailed information about the hand and are more applicable to gesture tasks that require high-precision recognition. Commonly used local features include the histogram of oriented gradients (HOG), the local binary pattern (LBP), the scale-invariant feature transform (SIFT), the speeded-up robust features (SURF), features derived from principal component analysis (PCA) and linear discriminant analysis (LDA), etc.
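Before turning to the individual local descriptors, the global features above (color histogram, gray histogram, Gabor responses) can be illustrated with a short sketch; the bin counts and Gabor parameters are arbitrary illustrative choices.

```python
import cv2
import numpy as np

def global_features(frame_bgr: np.ndarray) -> np.ndarray:
    """Concatenate simple global descriptors: color histogram, gray histogram,
    and mean Gabor response energy at four orientations."""
    # Color histogram: 8 bins per BGR channel, normalized to sum to 1.
    color_hist = cv2.calcHist([frame_bgr], [0, 1, 2], None, [8, 8, 8],
                              [0, 256, 0, 256, 0, 256]).flatten()
    color_hist /= color_hist.sum() + 1e-9
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Gray-level histogram: frequency of each of 32 intensity bins.
    gray_hist = cv2.calcHist([gray], [0], None, [32], [0, 256]).flatten()
    gray_hist /= gray_hist.sum() + 1e-9
    # Gabor bank: texture energy along four orientations.
    energies = []
    for theta in np.arange(0, np.pi, np.pi / 4):
        kernel = cv2.getGaborKernel((21, 21), sigma=4.0, theta=theta,
                                    lambd=10.0, gamma=0.5)
        response = cv2.filter2D(gray, cv2.CV_32F, kernel)
        energies.append(float(np.abs(response).mean()))
    return np.concatenate([color_hist, gray_hist, np.array(energies)])
```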
HOG is a feature descriptor proposed by Navneet Dalal and Bill Triggs in 2005 [85]. Used in computer vision and image processing for target detection, it is a statistic that encodes the directional information of local image gradients [85, 86]. LBP is a feature extraction method applied in image processing and computer vision. It converts a pixel into a binary code by comparing the pixel's gray value with those of the surrounding pixels, yielding a feature that describes the texture of the image. LBP features have significant advantages such as grayscale invariance and rotation invariance [86, 87].

SIFT is a scale- and rotation-invariant feature extraction technique proposed by Lowe [88] that has been widely used in gesture recognition. It is a feature detection and description algorithm that can detect and describe feature points in images at different scales and rotation angles. SURF is a descriptor developed from SIFT that improves the speed and robustness of feature detection and description by using techniques such as the integral image and fast Hessian matrix computation; SURF has the clear advantage of being computationally faster than SIFT [89]. Sykora et al. [90] applied a support vector machine classifier to classify SIFT and SURF features extracted from 500 test images, with recognition rates of 81.2% and 82.8%, respectively.

PCA is a commonly used data dimensionality reduction and feature extraction method that converts high-dimensional data into low-dimensional data while preserving as much information as possible about the original data [83]. LDA is a common classification algorithm and feature extraction method that converts high-dimensional data into low-dimensional data while maximizing the variability between different categories and minimizing the variability within each category [91].

In the study of gesture image features, global features have difficulty extracting the information of interest within the hand region because the gesture region occupies only a small proportion of the image, and their performance is therefore poor. In contrast, local image features are numerous and stable, have low interfeature correlation compared with that of global features, can avoid occlusion of the hand region to some extent, and are robust to image transformations such as illumination, rotation, and viewpoint changes. Therefore, to enrich gesture feature information, most researchers tend to fuse local features with global features to achieve higher recognition rates. In this paper, we summarize the gesture features and the accuracy rates mentioned in several recent studies; the results are shown in Table 2.

Table 2 Partial gesture feature extraction techniques

References   Features                                                      Accuracy (%)
[92]         LBP, PCA                                                      99.97
[93]         Haar-like features                                            95.37
[94]         SIFT                                                          99
[95]         SURF                                                          63
[96]         Fused features consisting of blended Hu moments, finger       90.0
             angle counts, skin tone angles, and nonskin tone angles
[97]         SIFT, Hu moments, LBP                                         87.3, 85.1
[98]         Distance and angle from the end point of the hand             92.13
[99]         SURF                                                          84.6
[100]        SURF, longest common subsequence                              93.0
[101]        Skin detector                                                 97
[102]        Harris                                                        94.8
[103]        SIFT                                                          90
[104]        HOG, SIFT                                                     91
[105]        PCA                                                           91.5

Feature elimination and selection is an important step in machine learning that aims to select the most useful features for a classification or regression task to improve the accuracy and generalizability of the model. Features can be eliminated by the following methods:

• Variance filtering: eliminate features with variance below a certain threshold, because they have less impact on the classification or regression task.
• Correlation filtering: eliminate features that have a low correlation with the target variable.
• Regularization: eliminate features by making the weights of some features converge to zero through L1 or L2 regularization.

Features can be selected by the following methods (code sketches follow the list):

• Filtering: evaluate each feature according to dispersion or relevance, set a threshold or the number of features to be selected, and select the features.
• Wrapper: select or exclude a number of features each time according to an objective function until the best subset is selected.
• Embedding: first, train machine learning algorithms and models to obtain the weight coefficients of each feature, and then select features according to the coefficients from largest to smallest. This approach is similar to the filtering method, but training is used to determine the utility of features.
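The variance and correlation filters in the lists above map directly onto standard library calls; below is a sketch with scikit-learn (the thresholds are placeholders, and the helper name is hypothetical):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

def filter_features(X: np.ndarray, y: np.ndarray,
                    var_thresh: float = 1e-3,
                    corr_thresh: float = 0.05) -> np.ndarray:
    """Variance filtering followed by correlation filtering against the labels."""
    # Drop near-constant features: they carry little class information.
    X_var = VarianceThreshold(threshold=var_thresh).fit_transform(X)
    # Keep features whose absolute Pearson correlation with y exceeds the threshold.
    corr = np.array([abs(np.corrcoef(col, y)[0, 1]) for col in X_var.T])
    return X_var[:, corr > corr_thresh]
```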
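For the HOG and LBP descriptors described earlier in this subsection, a minimal extraction sketch with scikit-image follows; the cell, block, and neighborhood sizes are common defaults rather than values from the cited studies.

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern

def hog_lbp_features(gray: np.ndarray) -> np.ndarray:
    """HOG vector plus uniform-LBP histogram for a grayscale hand image."""
    # HOG: block-normalized histograms of local gradient orientations.
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                  cells_per_block=(2, 2), block_norm="L2-Hys")
    # LBP: each pixel is binary-coded against its 8 neighbors at radius 1;
    # the "uniform" mapping gives a compact, rotation-robust code set (0-9).
    lbp = local_binary_pattern(gray, P=8, R=1.0, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=np.arange(0, 11), density=True)
    return np.concatenate([hog_vec, lbp_hist])
```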
In traditional methods, after multiple features such as HOG, Haar, and skin color are extracted, classifiers such as SVM are used to segment or recognize gestures, which involves feature elimination and selection. In deep learning methods, multiple branches can be used to extract different features; for example, one branch can extract the motion trajectory of the gesture, and another can extract its color information. These branches are then used accordingly to obtain the final gesture features, which also involves feature elimination and selection.

Gesture classification

Following the acquisition of the segmented image, important information is extracted from the image via feature extraction, and the gesture type is recognized using these features. Gesture classification is the classification of the extracted spatiotemporal features of gestures and is the last stage of gesture recognition. The main methods of gesture classification are listed and compared in Table 3.

Template matching

The first suggested recognition technique was a very simple template matching technique, usually used for static gesture recognition. The approach involves classifying an input image according to how closely it matches a template (a point, a curve, or a shape). To calculate the matching degree, one can use the coordinate distance, the point set distance, contour edge matching, elastic graph matching, etc. Although the classification accuracy is not very high and the types of gestures that can be recognized are limited, the template matching method has the advantages of being very quick in the case of small samples, being adaptable to lighting and background changes, and having a wide range of applications.

Similarity of gestures was determined in [47] by assessing the similarity of sequences of local gesture outlines. Bhame et al. [39] used variable distance features and straightforward logic to determine the active fingers involved in a gesture, which sped up recognition and made the technique suitable for real-time human-computer interaction applications, in contrast to the conventional method of extracting input features and comparing them with all database features. In [74], an iterative polygonal shape approximation technique was proposed and combined with a unique chain coding scheme for shape similarity matching to recognize gestures. RGB and depth descriptors were combined to classify the movements in [106]. Chaudhary and Raheja [107] argued that lighting inconsistencies and backdrop irregularities had an impact on image segmentation, and these scholars provided a method for recognizing gestures under constant light intensity. The system was tested using the Euclidean distance approach and artificial neural networks, and the method relied on a gesture image database to match the test movements.

Methods based on geometric information

Gestures can also be recognized using geometric information, e.g., by fingertip detection, convex packet (convex hull) detection, the circle drawing method, the cross-hatch method, etc. Depending on the application, fingertip detection can be separated into single-fingertip detection and multiple-fingertip detection. The "distance from the center of gravity" method can be used to detect a single fingertip. This method involves finding the point in the hand area that is furthest from the center of gravity and then determining whether that point is a fingertip: if the distance from the point to the center of gravity is greater than 1.6 times the average distance from the edge to the center of gravity, the point is a fingertip; otherwise, it is not.

Wen and Niu [108] suggested a fingertip angle calculation method to detect the fingers of the hand after discovering that the fingertip angle values at the turning points of the curve were much larger than the fingertip angle values at other points. Shin and Kim [109] detected fingertips based on the coordinates of the hand position derived from skeletal information. Meng and Wang [110] read and preprocessed gesture sample templates, which included filtering, segmenting using an HSV color space threshold, and extracting contours; the contours were then approximated to produce polygons, enabling the detection of the fingertips, and the Hu moments and the number of fingertips were finally determined.

A convex packet, which can hold all the points in a contour, is a convex polygon created by linking the outermost points. Following contour analysis, convex packet identification is frequently utilized: a convex packet can be built for each contour of a binary image, and the set of points contained in the packet is returned once the building is finished. The convex packet matching the contour can be drawn using the returned collection of points. Convex profiles are often fully convex or at least flat; a convexity defect is a concavity in at least one location. In a related study, Wang et al. [111] applied the Douglas-Peucker technique for contour approximation to produce polygons throughout the feature recognition procedure. The type of gesture was then determined by bump (convexity) detection on the polygons.
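The centroid-distance rule and the convexity-defect analysis just described can be sketched as follows. The 1.6 factor comes from the text above; the contour handling and the defect-depth cutoff of 20 pixels are illustrative assumptions.

```python
import cv2
import numpy as np

def analyze_hand(binary_mask: np.ndarray):
    """Fingertip test via the centroid-distance rule, plus convexity-defect count."""
    contours, _ = cv2.findContours(binary_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)       # assume hand = largest blob
    m = cv2.moments(contour)
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]  # center of gravity
    pts = contour.reshape(-1, 2).astype(np.float64)
    dists = np.hypot(pts[:, 0] - cx, pts[:, 1] - cy)
    # Centroid-distance rule: the farthest contour point is a fingertip if its
    # distance exceeds 1.6 times the mean edge-to-centroid distance.
    fingertip = pts[np.argmax(dists)]
    is_fingertip = bool(dists.max() > 1.6 * dists.mean())
    # Convexity defects are the deep concavities between extended fingers;
    # their count roughly equals (number of extended fingers - 1).
    hull_idx = cv2.convexHull(contour, returnPoints=False)
    defects = cv2.convexityDefects(contour, hull_idx)
    deep = 0 if defects is None else int((defects[:, 0, 3] / 256.0 > 20).sum())
    return fingertip, is_fingertip, deep
```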
Table 3 Comparison of gesture classification methods

Template matching
  Advantages: high speed in the case of small samples; good adaptability to light and background changes; wide range of applications
  Disadvantages: low classification accuracy; limited types of gestures that can be recognized

Geometric information-based: fingertip detection
  Advantages: quick detection of the location and number of fingers
  Disadvantages: recognition effectiveness may differ across hand types; with small finger spacing, false or missed detections may occur

Geometric information-based: convex packet detection
  Advantages: suitable for recognizing various hand types; good ability to detect the position and number of fingers; detects correctly even when the finger spacing is small
  Disadvantages: greater algorithmic complexity and difficulty of implementation; recognition may suffer under hand occlusion or insufficient light

Dynamic time warping
  Advantages: good performance in matching and recognizing gestures with different motion speeds if the gesture sample template library is relatively small
  Disadvantages: greatly reduced recognition speed and stability if the template library is large, especially for complex gestures or combinations of two-handed gestures

Hidden Markov model
  Advantages: captures dynamic features and important timing information in gestures
  Disadvantages: needs a large amount of data; may over- or underfit if the dataset is too small

Machine learning (traditional algorithms)
  Advantages: simple algorithms that are easy to implement and debug; relatively small data requirements
  Disadvantages: sensitive to interference from the lighting, angle, and background of gestures; low accuracy; gesture features must be extracted manually

Deep learning
  Advantages: automatic extraction of gesture features, eliminating the need for manual extraction; greater robustness to interference such as that of lighting, angle, and background; high accuracy
  Disadvantages: greater algorithmic complexity; needs large amounts of data and computational resources; longer training time and greater computational requirements

Dynamic time warping

Dynamic time warping (DTW) is a nonlinear time-normalized matching technique that is frequently used in speech recognition, image matching, gesture classification, etc. It overcomes the problem of matching two sequences of inconsistent lengths. To identify the optimal alignment between the two sequences, a dynamic programming approach is used to allow the points of the input sequence and the template sequence to achieve one-to-many or one-to-one matching [112]. The DTW algorithm performs well at matching and recognizing gestures with various motion speeds if the gesture sample template library is not too large. However, if there is a vast number of gesture sample templates, especially if the gestures are complicated or combinations of two-handed gestures are involved, the identification speed and stability of the algorithm decrease significantly.

The DTW technique was utilized in [113] to determine the optimal alignment between the query features and the stored database features for the recognition of Indian sign language. Zhi et al. [114] implemented and trained two classifiers for static and dynamic gesture recognition: an N-dimensional DTW classifier and a multiclass support vector machine classifier. The running time was greatly reduced, and the average recognition rate was 95.5%.
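The dynamic-programming recurrence behind DTW fits in a few lines of NumPy; the sketch below computes the DTW distance between two sequences of feature vectors, after which a query gesture can simply be assigned the label of the nearest template.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """DTW distance between sequences a (n x d) and b (m x d)."""
    n, m = len(a), len(b)
    # cost[i, j] = minimal cumulative cost of aligning a[:i] with b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local Euclidean distance
            # The three step directions permit one-to-one and one-to-many
            # matches, which is what aligns sequences of unequal lengths.
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return float(cost[n, m])
```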
Hidden Markov model

The hidden Markov model (HMM) is a statistical model whose theory was developed in the 1960s and 1970s [115, 116]. HMM offers a broad range of applications and a great learning capacity and is efficient at modeling time-varying and nonstationary time series. Due to the context-sensitive nature of gesture actions, in the domain of gesture recognition HMM is better suited for continuous gesture recognition scenarios. However, HMM training and recognition are computationally demanding in continuous signal applications, where the state transition aspect requires a significant amount of probability density computation, and as the number of parameters rises, the pace of model training and target identification declines. The discrete HMM is frequently utilized in generic gesture recognition systems to overcome this problem.

Numerous studies [37, 48, 117–119] used HMM for gesture classification. In [37], HMM was used to compute the log-likelihood of the symbols and to identify the most likely route through a network. In [48], discrete-density and continuous-density HMMs were trained and tested, and it was demonstrated that the continuous HMM performed better than the discrete HMM at classifying the data. In [119], the retrieved symbols were classified and identified using HMM, which was commonly used following the testing of alternative techniques such as Independent Bayesian classifier combinations.

Machine learning

With the use of machine learning, computers can create models that follow the fundamental rules of the data in big datasets. For example, Tutsoy [120] proposed an artificial intelligence-based multidimensional policy-making algorithm, an advanced predictive model developed under a large number of uncertain factors and time-varying dynamics, aimed at controlling epidemic casualties. Tutsoy [121] also proposed a new high-order, multidimensional, strongly coupled, parametric suspicious-infected-death model. This model uses machine learning algorithms to learn from large datasets, including epidemiological data, demographic information, and environmental factors, to identify complex relationships and make accurate predictions. There are many well-known machine learning classification algorithms, such as support vector machines, neural networks, conditional random fields, and k-nearest neighbor algorithms, which can address the problem of gesture recognition.

(1) Support vector machines
Support vector machines, first described in [122] in 1995, are a class of generalized linear classifiers for binary classification of data via supervised learning [123]; they are mainly governed by the idea of structural risk minimization and Vapnik–Chervonenkis dimension theory. SVM-based gesture recognition is currently an important research direction in gesture recognition technology [124]. SVM is a learning technique that optimizes both empirical risk and model complexity: the training error is the constraint of the optimization problem, and the goal is to minimize the confidence range. Because increasing the dimensionality of the sample space has no effect on computational complexity, the SVM approach is frequently applied to high-dimensional problems. The main challenges currently faced by research on gesture-driven interaction are how to process the acceleration values of gesture signals, build a multiclassification model using the SVM algorithm, improve the accuracy of gesture recognition, and create a gesture-based interaction model.
Predicted wrist positions in [125] are used to extract HOG image descriptors, and a multiclass SVM trained offline is used to categorize the hand shapes. In [126], a previously trained SVM is used to normalize and classify the collected feature vectors. In [127], a method for RGB video-based gesture identification using SVM is proposed. In [128], real-time video-based signs are retrieved using a skin tone segmentation method, and appropriate feature vectors are produced from motion sequences; the features are then categorized using an SVM. In [129], two modules, namely an SVM model for static gesture classification and an HMM for dynamic single-stroke gesture detection, are used to understand a user command consisting of a set of static and dynamic gestures. The hand gesture classifier in the system described in [130] is a linear SVM. Local binary patterns and a binary SVM classifier are utilized as feature vectors in [131] to look for probable hand motions in every frame of a video stream.
A classification of gestures in photographs using chosen combinatorial characteristics is proposed in [136], based on an upgraded version of the RBF neural network. In the latter, the estimated weight matrix is iteratively updated for better hand gesture image identification using the least-mean-square algorithm, and the centers are automatically determined using the k-means algorithm.
(3) Conditional random fields
The theory of the conditional random field (CRF) model, first proposed in 2001, was quickly applied to a variety of problems because of its extension to the structure of an undirected graph model, which can characterize data dependence problems more accurately than other models. The CRF, a probabilistic graph model, was first described by Lafferty et al. [137]. This original construction of the conditional random field was based on the HMM in terms of model structure and was influenced by the maximum entropy model (MEM) in terms of model probability representation. In [138], a gesture identification approach was suggested to detect forward and backward movement toward and away from the camera, respectively, using a CRF as a classifier and a parallax map-based center-of-mass motion and the fluctuation of its intensity as features.
(4) K-nearest neighbors
In 1967, Cover and Hart proposed the K-nearest neighbors (KNN) algorithm. Although the idea behind this traditional classification algorithm is straightforward and easy to understand, the algorithm is now very mature and stable [139, 140]. The classification effectiveness of this fundamental algorithm is good, but due to the use of numerous square root operations, its computational efficiency is insufficient compared with that of other classification algorithms in complex classification scenarios, particularly image classification [141, 142]. Jasim et al. [143] used the KNN algorithm to classify static hand gestures and the longest common subsequence (LCS) algorithm to classify dynamic hand gestures.
(5) Naive Bayes classifiers
The Naive Bayes classifier [144] is an algorithm for classification based on Bayes' theorem. It assumes that the extracted features are uncorrelated and independent of each other, an assumption that simplifies the subsequent operations. Hands matching skin color patches were identified using a Bayesian classifier, and spots with the desired color distribution were recognized and modeled in [145].

In addition to the common machine learning algorithms for gesture recognition discussed above, many researchers have compared different classifiers. HOG characteristics of cropped images were generated in [45] and used to train classifiers; the study discussed the recognition rates of classifiers such as LDA, SVM, and KNN. In [53], the chosen features were inputs to ANN, SVM, and KNN models, which were then fused to create a classifier fusion model with an accuracy of 92.23%. In [146], three classification techniques (the nearest mean classifier (NMC), KNN, and the Naive Bayes classifier) were applied to categorize and compare data. In [147], five classifiers (SVM, KNN, Naive Bayes, ANN, and extreme learning machines (ELM)) were utilized for a comparative analysis; their respective accuracy rates were 96.17% (ELM), 96.95% (SVM), 96.60% (KNN), and 96.38% (NB). Neural network and Naive Bayesian classification methods based on data mining were applied in [148] for gesture learning and recognition, with the neural network attaining an accuracy of 98.11% and the Naive Bayesian classification method an accuracy of 88.84%.
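To make such comparisons concrete, the following minimal sketch (an illustrative setup, not the pipeline of any cited study) trains an SVM, a KNN classifier, a Naive Bayes classifier and a small ANN on the same feature vectors and reports the test accuracy of each; synthetic features stand in for real descriptors such as HOG, and all names below are illustrative:

# Illustrative comparison of classical gesture classifiers on synthetic
# stand-in features (e.g., in place of real HOG descriptors).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# 1000 synthetic 64-dimensional feature vectors for 5 gesture classes.
X, y = make_classification(n_samples=1000, n_features=64, n_informative=32,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

classifiers = {
    "SVM": SVC(kernel="rbf", C=10.0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "ANN (MLP)": MLPClassifier(hidden_layer_sizes=(128,), max_iter=500,
                               random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                 # train on identical features
    print(name, accuracy_score(y_test, clf.predict(X_test)))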
Deep learning

A recent area of research in machine learning is deep learning, the process of discovering the innate patterns and depth of representation of sample data; the knowledge gained from such learning can be very useful in understanding the meaning of data such as text, photos, and sounds. Its ultimate objective is to provide machines with analytical learning abilities similar to those of humans, enabling machines to recognize data types including text, images, and sounds. Deep learning does not require manual feature engineering, in contrast to conventional learning algorithms, making it possible to utilize the rapidly expanding amounts of data and computational power that are currently available [149]. Convolutional neural networks and recurrent neural networks are two popular deep learning algorithms.

(1) Recurrent neural network
The recurrent neural network (RNN) traces back to the Hopfield network proposed in 1982 [150]. An RNN processes each time step t sequentially, and each step is closely related to the preceding one. The RNN's potent temporal modeling ability introduces a novel strategy for gesture recognition. However, if there are more than 10 time steps between the relevant input and the target event, it is challenging to train a simple RNN structure [151].
Neverova et al. [152] were the first to use recurrent neural networks for gesture recognition. Their proposal was a multimodal gesture recognition system for speech, skeleton pose, and depth: each modality was first processed independently on a short time series, with features either manually extracted or obtained by learning, before an RNN built a long time-dependent model for data fusion and final classification.
To examine the benefits of RNNs under various training methods and to suggest an efficient learning process based on suitable adjustments to the real-time recursive learning algorithm, an RNN was also utilized to recognize gestures in [48] and compared with an HMM. For skeleton action recognition, Geng et al. [153] proposed a sequence-to-sequence hierarchical RNN structure. Shin and Kim [154] separated the features into various components and fed each hand's input into a GRU-RNN, which enhanced performance and lowered the number of parameters needed for the neural network. Zhang et al. [155] proposed a variant of the long short-term memory (LSTM) model for dynamic gesture recognition by combining ResC3D and ConvLSTM.
(2) Convolutional neural network
A convolutional neural network (CNN) [156] is a feedforward neural network that has advanced quickly in the field of image analysis and processing. Compared with traditional image processing algorithms, a CNN effectively avoids the preprocessing stage and a substantial amount of manual involvement. However, much relevant data consists of more than single images; to process video data, the 3D CNN [157] was developed and applied to behavior recognition in surveillance videos.
Two-dimensional convolutional neural networks are mostly used to process static gestures or dynamic gesture sequences on a frame-by-frame basis. John et al. [158] used a long-term recursive neural network to classify gesture video sequences. All 24 motions from Thomas Moeslund's gesture recognition database were used to apply deep learning to the gesture identification problem in [159], demonstrating that deep neural networks are capable of learning complicated gesture categorization tasks with a low error rate. The approach in [160] combined a skeletonization algorithm with a CNN, which lessened the impact of capture angle and surroundings on recognition effectiveness and increased the precision of gesture recognition in complicated contexts. In [161], a vision- and CNN-based system for transforming Arabic sign language letters into Arabic speech was suggested. In [162], a comparison of various gesture recognition techniques showed that the CNN outperformed the other classification systems. Noreen et al. [163] proposed a multiparallel streaming two-dimensional CNN model to recognize hand gestures.
Several three-dimensional convolutional neural network (3D-CNN) models have been proposed for gesture recognition. To address the lack of large labeled gesture datasets, an efficient deep convolutional neural network method called the 3D-CNN was proposed in [164]. A 3D-CNN model was suggested by Molchanov et al. [165] to identify driving gestures based on depth and intensity data and to combine data from various spatial scales for the final prediction. Using a recurrent mechanism for dynamic gesture detection and classification, Molchanov et al. [166] enhanced the 3D-CNN model: a 3D-CNN structure for extracting spatiotemporal features and a recurrent layer for global temporal modeling were both included in the network. Li et al. [167] enhanced the 3D-CNN model of Tran et al. [168] for large-scale gesture recognition using depth and RGB videos. Similarly, Camgoz et al. [169] developed an end-to-end 3D-CNN model for large-scale gesture recognition.
Lightweight convolutional neural networks have also been created in recent years: a lightweight network is a lighter model that performs on par with a heavier one, making hardware-friendly deployment possible. Baumgartl et al. [170] proposed a lightweight, robust and fast MobileNetv2-based CNN for hand gesture recognition by image classification, reaching an accuracy of 99.96%. In [171], a hybrid structure of a lightweight VGG16 model and a random forest was presented for recognizing gestures from visual input.
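The 3D-CNN-plus-recurrent-layer pattern recurring in these studies can be sketched as follows. The network below is a deliberately small illustration, not the architecture of [166] or of any other cited model: a 3D convolutional stem extracts spatiotemporal features from a short clip, and a GRU performs global temporal modeling before classification.

# Illustrative 3D-CNN stem + recurrent layer for clip classification
# (a toy-scale stand-in for the published 3D-CNN gesture models).
import torch
import torch.nn as nn

class R3DGestureNet(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.stem = nn.Sequential(            # 3D convs over (C, T, H, W)
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),   # pool space, keep time
        )
        self.gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, clip):                  # clip: (B, 3, T, H, W)
        f = self.stem(clip)                   # (B, 32, T, 1, 1)
        f = f.squeeze(-1).squeeze(-1).transpose(1, 2)   # (B, T, 32)
        _, h_n = self.gru(f)                  # final hidden state
        return self.head(h_n[-1])             # (B, n_classes) logits

# Example: a batch of two 16-frame 64x64 RGB clips.
logits = R3DGestureNet()(torch.randn(2, 3, 16, 64, 64))
print(logits.shape)                           # torch.Size([2, 10])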
Experimental evaluation

In "Hand gesture recognition process", the process of gesture recognition has been described. In this section, several evaluation metrics for gesture recognition and segmentation are presented.

Accuracy

Accuracy is the ratio of the number of samples correctly classified by a classifier to the total number of samples. In gesture recognition and segmentation, the accuracy rate can be used to measure the overall performance of the classifier. The calculation formula is

Accuracy = (TP + TN) / (TP + TN + FP + FN),

where TP denotes true-positive cases, TN denotes true-negative cases, FP denotes false-positive cases, and FN denotes false-negative cases.

Precision

The precision rate is the percentage of samples identified by the classifier as belonging to positive classes that indeed belong to positive classes. In gesture recognition and segmentation, the precision rate can be used to measure the accuracy of the classifier. The calculation formula is

Precision = TP / (TP + FP).
Recall

The recall rate is the proportion of samples that indeed belong to positive classes and that the classifier correctly identifies as belonging to positive classes. In gesture recognition and segmentation, recall can be used to measure the completeness of the classifier. The formula is as follows:

Recall = TP / (TP + FN).

F1 score

The F1 score is the harmonic mean of precision and recall, and hence assesses both the accuracy and the completeness of the classifier. In gesture recognition and segmentation, the F1 score can be used as an evaluation metric to help select the optimal classifier. The calculation formula is

F1 = 2 × (Precision × Recall) / (Precision + Recall).

Intersection over union (IoU)

IoU is the ratio of the overlap between the predicted region and the real one to their combined size. In gesture segmentation, IoU can be used to measure the segmentation effectiveness of the model. The calculation formula is

IoU = Intersection / Union,

where Intersection denotes the intersection of the predicted region and the true region, and Union denotes their union. The larger the intersection is, the closer the predicted result is to the truth.
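All of the above metrics are simple to compute once the confusion counts are available. The following minimal sketch (binary labels assumed, with 1 as the positive class; function names are illustrative) implements them directly, together with IoU for boolean segmentation masks:

# Direct implementation of the evaluation formulas given above.
import numpy as np

def classification_metrics(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

def iou(mask_pred, mask_true):
    inter = np.logical_and(mask_pred, mask_true).sum()
    union = np.logical_or(mask_pred, mask_true).sum()
    return inter / union

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
print(classification_metrics(y_true, y_pred))   # all four equal 0.75 here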
Hand gesture recognition based on RGB-D cameras

The paper has introduced the use of three widely used depth cameras, namely, Kinect, Leap Motion and RealSense, in "Introduction". The data structure of RGB-D images produced by depth cameras is more complicated than that of earlier 2D images, opening new possibilities for gesture recognition studies. A simple thresholding algorithm can accurately split the hand zone in the depth map using depth data, reducing the gesture detection problem to a problem of recognizing the 3D shape of the hand.

Given the popularity of depth sensors, scholars have conducted extensive research on gesture segmentation based on depth information. In [172], Jiang used the Kinect sensor to gather depth information, established a threshold for each frame in accordance with each pixel's depth value, extracted the largest region as the foreground, and then removed the other patches with smaller areas. Kane and Khanna [173] described an acquisition module that used depth thresholding and velocity tracking to execute pen lifting and falling movements. The hand needed to be in the foreground of the camera for the depth thresholding-based hand segmentation method to work well. Zhao and Jia [59] presented an enhanced hand segmentation approach based on the randomized decision forest framework for depth sensor-acquired images by manually integrating the essentials of depth thresholding-based segmentation methods. To ensure the accuracy of hand segmentation, the method generated new depth features from the centroid of the hand structure, improved the generalizability of earlier depth features, and preserved the depth invariance of hand pixels as much as possible.
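A minimal sketch of this per-frame depth-thresholding idea follows. It assumes a depth frame in millimetres in which the hand is the object nearest the camera, and it is a schematic reading of the approach rather than the exact pipeline of [172]; names and the band width are illustrative:

# Depth-threshold hand segmentation: keep near pixels, then keep only
# the largest connected region and discard smaller patches.
import numpy as np
from scipy import ndimage

def segment_hand(depth, band_mm=150.0):
    valid = depth > 0                           # 0 usually means "no reading"
    nearest = depth[valid].min()
    mask = valid & (depth < nearest + band_mm)  # per-frame depth threshold
    labels, n = ndimage.label(mask)             # connected components
    if n == 0:
        return np.zeros_like(mask)
    sizes = ndimage.sum(mask, labels, np.arange(1, n + 1))
    return labels == (1 + int(np.argmax(sizes)))  # largest region only

depth = np.full((120, 160), 2000.0)             # far background
depth[40:80, 60:100] = 600.0                    # hand-sized near blob
depth[5:8, 5:8] = 620.0                         # small near noise patch
print(segment_hand(depth).sum())                # 1600: only the blob survives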
Using a depth camera, more information about appearance features can be obtained. In [174], a hand contour model incorporating the Kinect sensor was suggested to make gesture matching easier and less computationally complex. The phonological structure of Brazilian sign language was investigated in [175], relying on RGB-D sensors to collect intensity, location, and depth data; Almeida et al. extracted seven vision-based features from the RGB-D images and studied the relationship between these features and structural elements based on hand shape, motion, and location in Brazilian sign language. A 3D hand shape was separated from a cluttered background using a depth map of hand gestures recorded by the Kinect sensor to extract patterns of 3D shape features, and a 3D shape context description approach was proposed in [176] for 3D gesture representation. A TOF depth camera was used in [177] to gather depth data, determine the wrist cut edge, and capture palm images. In [178], the coordinates of 21 joint points of the human hand were recorded using Leap Motion, and the motion images were captured using an RGB camera.

One of the more popular state space-based techniques for matching time-varying data is the hidden Markov model, with its two main stages of training and classification. The Baum–Welch algorithm is a fundamental algorithm used to solve the training problem, whereas the Viterbi algorithm is a fundamental algorithm used to solve the classification problem [179]. Hoque et al. [180] proposed a real-time gesture recognition system based on Kinect that could manipulate desktop objects by identifying the hand's 3D position using Kinect's depth sensor. To identify predefined gestures, these location points were subsequently examined.
The HMM was trained using the Baum–Welch algorithm, which resulted in an accuracy rate of 89%. A dynamic hand gesture detection system based on an RGB-D camera was proposed by Simao et al. [181]. Hand segmentation in color photos was performed using a large illumination-invariant skin tone model, and hand detection in depth images was performed using a chamfer distance matching-based technique. Hand movements were modeled and classified using an HMM with a left-right band state graph topology.

Table 4 shows the results of the studies mentioned above.

Table 4 Results of the studies

References | Experimental results
[172] | Average task completion time for target capture: "semiautomatic" (176.9 s) and "manual" (287.4 s)
[173] | On the self-acquisition dataset, the EPS solution reached an accuracy of over 96.5% with an average runtime of 30 ms. On the AIR Handwriting dataset, the recognition time per gesture was 24.3 ms with an average accuracy of 95.5%
[59] | The percentages of false negatives and false positives were 2.00% and 4.38%, respectively. Training time was approximately 16 min. Real-world image classification time was approximately 2.1 s/frame
[174] | The accuracy rate of hand inspection procedures improved to 95.43%. The average accuracy of hand part classification improved to 74.65%
[175] | The average recognition rate was above 80%
[177] | Average recognition success rate of 84.5%
[178] | Highest recognition accuracy of up to 99.66%
[180] | The accuracy rate reached 89%
[181] | The best accuracy for static gesture recognition was 95.6%. The best accuracy for dynamic gesture recognition was 97.2%
[176] | On the NTU Hand Digit Dataset, the best obtained performance was 98.7%. On the Kinect Leap Dataset, an accuracy of 96.8% was reached. On the Senz3d Dataset, an accuracy of 100% was attained. On the ASL-FS Dataset, an accuracy of 87.1% was obtained. On the ChaLearn LAP IsoGD Dataset, an accuracy of 60.12% was reached. The average runtime per query on an average PC (without a GPU) was only 6.3 ms
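The train-and-classify pattern behind such HMM systems can be sketched with the third-party hmmlearn package (an illustrative stand-in, not the software of the cited studies): one Gaussian HMM per gesture class is fitted with Baum–Welch, and a new sequence is assigned to the class whose model yields the highest log-likelihood; hmmlearn's decode() also exposes the Viterbi path. The gesture names and toy features below are assumptions for illustration only.

# One HMM per gesture class; classification by maximum log-likelihood.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

def toy_sequences(offset, n_seqs=20, length=30):
    # Stand-ins for per-frame features such as hand-centroid positions.
    return [rng.normal(loc=offset, size=(length, 2)) for _ in range(n_seqs)]

train = {"wave": toy_sequences(0.0), "push": toy_sequences(5.0)}
models = {}
for name, seqs in train.items():
    m = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=25)
    m.fit(np.vstack(seqs), lengths=[len(s) for s in seqs])   # Baum-Welch
    models[name] = m

test_seq = toy_sequences(5.0, n_seqs=1)[0]
scores = {name: m.score(test_seq) for name, m in models.items()}
print(max(scores, key=scores.get))           # "push" for this test sequence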
Hand gesture recognition applications

Gesture recognition has a wide range of applications, such as healthcare, safe driving, sign language awareness, virtual reality, and device control. This section mainly focuses on human-robot interaction using vision-based hand gestures captured by monocular cameras and RGB-D cameras. The main application areas for gesture recognition technologies are listed below.

• Healthcare: Emergency rooms and operating rooms can be chaotic, with a significant amount of noise from individuals and equipment. In such an environment, voice commands are not as effective as hand gestures. Touchscreens are also not an option because of the strict boundaries between sterile and nonsterile domains. However, accessing information and images during surgery or other procedures is possible with gesture recognition technology, as demonstrated by Microsoft. GestSure, a gesture control technology that can be used to control medical devices, allows physicians to examine MRI, CT and other images with simple gestures without scrubbing. This touch-free interaction reduces the number of times doctors and nurses touch patients, reducing the risk of cross-contamination.
• Safe driving: Advanced driver assistance systems that incorporate gesture recognition can somewhat increase driving safety. Through an advanced driver assistance system, drivers can modify many parameters inside the automobile using gestures, allowing them to focus more on the road and perhaps reducing traffic accidents. The BMW 7 Series has an integrated hand gesture recognition system that recognizes five gestures to control music, incoming calls, etc. Reducing interaction with the touchscreen makes the driving experience safer and more convenient.
• Sign language awareness: The primary means of communication for hearing-impaired individuals is sign language; however, understanding sign language is difficult for those who have not received formal instruction. The ability of hearing-impaired and other individuals to communicate will be enhanced substantially using sign recognition technology for sign language cognition. The Italian startup Limix combines IoT and dynamic gesture recognition technology to record sign language, translate it to text, and then play it back on a smartphone via a voice synthesizer.
• Virtual reality: Gesture recognition allows users to interact with and control virtual reality scenes more naturally, enhancing users' immersion and experience. In 2016, Leap Motion demonstrated updated gesture recognition software that allowed users to track gestures in virtual reality in addition to controlling computers. ManoMotion's hand-tracking application recognizes 3D gestures through a smartphone camera (on Android and iOS) and can be applied to AR and VR environments. Use cases for this technology include gaming, IoT devices, consumer electronics, and robotics.
• Device control: Intelligent robots can also be controlled by gestures. With the advancement of artificial intelligence, home robots and smart home equipment will progressively appear in millions of households, and consumers will feel more at ease using gesture control.
Complex background

A complex background environment will increase the difficulty of gesture detection and lead to a decrease in the accuracy of gesture recognition. Many researchers have therefore sought to enhance the robustness of gesture recognition in complex backgrounds and to improve the interaction capability of gesture recognition in complex scenes.

Sheenu et al. [212] proposed a new method for gesture recognition in images with complex backgrounds based on histograms of the orientation gradient and sequential minimal optimization, which had an overall recognition rate of 93.12% for complex backgrounds. Chen et al. [213] suggested a gesture recognition method based on an improved YOLOv5 approach that reduced various types of interference in gesture images with complex backgrounds and improved the robustness of the network to complex backgrounds. Zhang et al. [214] proposed a two-stage gesture recognition method whose first stage used the convolutional pose machine to localize a hand's key points, which remained effective even in cases of complex backgrounds. Vishwakarma [215] researched and developed a method for effective detection and classification of hand gestures in cases of complex backgrounds. Pabendon et al. [216] suggested a gesture recognition method based on spatiotemporal domain pattern analysis, which could significantly reduce the irregular noise affecting gesture recognition in cases of complex backgrounds. Elsayed et al. [217] described a robust gesture segmentation method based on adaptive background subtraction with skin color thresholding, which aimed to automatically segment gestures from a given video under different lighting conditions and complex backgrounds. Qi et al. [218] suggested an improved atrous spatial pyramid pool to improve the accuracy of gesture feature representation in images. Zhou et al. [219] proposed a two-stage gesture recognition system to solve the problem of recognizing gestures in cases of complex backgrounds.

Distance and hand anatomy

The distance between the cameras and the person is an important factor in hand gesture segmentation. If the cameras are too far away, hand gestures may not be captured accurately, resulting in incorrect segmentation. However, if the cameras are too close, there may be occlusion as hands move in and out of the frame, leading to incomplete segmentation.

Additionally, the hand anatomy should be considered. Different types of hand gestures involve different parts of the hand, and some gestures may be more difficult to segment accurately than others. For example, gestures that involve the fingers being close together may be more challenging to distinguish from each other.

To improve hand gesture segmentation accuracy, researchers may use various techniques, such as depth sensing (e.g., RealSense and Kinect), machine learning algorithms, and hand-tracking algorithms. These methods can help identify the different parts of the hand and track their movement accurately even in complex gesture sequences.
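As one concrete example of the hand-tracking route, the sketch below uses the third-party MediaPipe Hands model (an assumed tool for illustration, not a method from the surveyed papers) to recover 21 normalized landmarks per detected hand from a single RGB frame; in practice consecutive video frames would be fed in and the landmarks tracked over time:

# MediaPipe Hands: 21 (x, y, z) landmarks per detected hand.
import numpy as np
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2,
                                 min_detection_confidence=0.5)

frame_rgb = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in camera frame
results = hands.process(frame_rgb)

if results.multi_hand_landmarks:
    for hand in results.multi_hand_landmarks:
        tip = hand.landmark[8]                # index fingertip landmark
        print("index fingertip:", round(tip.x, 2), round(tip.y, 2))
else:
    print("no hand detected")                 # expected for a blank frame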
Future outlook

With the rise of artificial intelligence, deep learning is undoubtedly a benign accelerator for gesture recognition, and gesture recognition systems will become more accurate and stable. Future gesture recognition systems will also be more diversified and applicable to more fields such as medical care, education, and entertainment, bringing more convenience and innovation to people. In the future, gesture recognition technology will continue to develop in the following directions:

• More intelligent: Gesture recognition will become more intelligent with the continued development of deep learning and artificial intelligence technology. Training a model will allow it to understand more complex gestures while reducing the demands on the user, making gesture recognition more natural and intelligent.
• More accurate: As computer vision and sensor technology continue to improve, gesture recognition will become more accurate. For example, higher-resolution cameras and more sensitive sensors can capture more subtle hand movements, improving the accuracy of gesture recognition.
• More capable of real-time performance: Future gesture recognition technology will operate closer to real time and be capable of processing large numbers of gestures and translating them into commands or actions. This will enable gesture recognition's wider use in virtual reality, gaming, medical, and other fields.
• More reliable: As the applications of gesture recognition technology expand, its reliability becomes increasingly important. Future gesture recognition technologies will require more rigorous testing and validation to ensure their reliable operation in a variety of environments.
• More personalized: Future gesture recognition technologies will be more personalized and able to adapt to different users' gesture habits and preferences. For example, users may be able to customize specific gestures to accomplish a particular operation or function.
15. Zhang O, Srinivasan K (2016) Mudra: user-friendly fine-grained 37. Zaki MM, Shaheen SI (2011) Sign language recognition using a
gesture recognition using wifi signals. In: Proceedings of the 12th combination of new vision based features. Pattern Recognit Lett
international on conference on emerging networking experiments 32(4):572–577
and technologies, pp 83–96 38. Shangeetha R, Valliammai V, Padmavathi S (2012) Computer
16. Wang Y, Wu K, Ni LM (2016) Wifall: device-free fall detection vision based approach for Indian sign language character recog-
by wireless networks. IEEE Trans Mob Comput 16(2):581–594 nition. In: 2012 international conference on machine vision and
17. Wu X, Chu Z, Yang P, Xiang C, Zheng X, Huang W (2018) Tw- image processing (MVIP). IEEE, pp 181–184
see: human activity recognition through the wall with commodity 39. Bhame V, Sreemathy R, Dhumal H (2014) Vision based hand
wi-fi devices. IEEE Trans Veh Technol 68(1):306–319 gesture recognition using eccentric approach for human com-
18. Hisham B, Hamouda A (2019) Supervised learning classifiers for puter interaction. In: 2014 international conference on advances
Arabic gestures recognition using kinect v2. SN Appl Sci 1(7):1– in computing, communications and informatics (ICACCI). IEEE,
21 pp 949–953
19. De Smedt Q, Wannous H, Vandeborre J-P (2016) Skeleton-based 40. Dhule C, Nagrare T (2014) Computer vision based human-
dynamic hand gesture recognition. In: Proceedings of the IEEE computer interaction using color detection techniques. In: 2014
conference on computer vision and pattern recognition work- fourth international conference on communication systems and
shops, pp 1–9 network technologies. IEEE, pp 934–938
20. Rautaray SS, Agrawal A (2015) Vision based hand gesture recog- 41. Ahuja MK, Singh A (2015) Static vision based hand gesture recog-
nition for human computer interaction: a survey. Artif Intell Rev nition using principal component analysis. In: 2015 IEEE 3rd
43:1–54 international conference on moocs, innovation and technology in
21. Ji Y, Kim S, Lee K-B (2017) Sign language learning system with education (MITE). IEEE, pp 402–406
image sampling and convolutional neural network. In: 2017 first 42. Veluchamy S, Karlmarx L, Sudha JJ (2015) Vision based gestu-
IEEE international conference on robotic computing (IRC). IEEE, rally controllable human computer interaction system. In: 2015
pp 371–375 international conference on smart technologies and management
22. ElBadawy M, Elons A, Shedeed HA, Tolba M (2017) Arabic sign for computing, communication, controls, energy and materials
language recognition with 3d convolutional neural networks. In: (ICSTM). IEEE, pp 8–15
2017 eighth international conference on intelligent computing and 43. Sreekanth N, Narayanan N (2017) Dynamic gesture recognition—
information systems (ICICIS). IEEE, pp 66–71 a machine vision based approach. In: Proceedings of the interna-
23. Ĉadík M (2008) Perceptual evaluation of color-to-grayscale image tional conference on signal, networks, computing, and systems.
conversions. In: Computer graphics forum, vol 27. Wiley Online Springer, pp 105–115
Library, pp 1745–1754 44. Wang K, Xiao B, Xia J, Li D, Luo W (2016) A real-time vision-
24. Benedetti L, Corsini M, Cignoni P, Callieri M, Scopigno R (2012) based hand gesture interaction system for virtual east. Fusion Eng
Color to gray conversions in the context of stereo matching algo- Des 112:829–834
rithms: an analysis and comparison of current methods and an 45. Patel P, Patel N (2019) Vision based real-time recognition of hand
ad-hoc theoretically-motivated technique for image matching. gestures for Indian sign language using histogram of oriented
Mach Vis Appl 23:327–348 gradients features. Int J Next-Gener Comput 10:92–102
25. Fairchild MD (2013) Color appearance models. Wiley, New York 46. Zhou W, Lyu C, Jiang X, Li P, Chen H, Liu Y-H (2017) Real-time
26. Rosenfeld A (1976) Digital picture processing. Academic Press, implementation of vision-based unmarked static hand gesture
Cambridge recognition with neural networks based on fpgas. In: 2017 IEEE
27. Xu Y, Gu J, Tao Z, Wu D (2009) Bare hand gesture recognition international conference on robotics and biomimetics (ROBIO).
with a single color camera. In: 2009 2nd international congress IEEE, pp 1026–1031
on image and signal processing. IEEE, pp 1–4 47. Gupta L, Ma S (2001) Gesture-based interaction and communi-
28. Zhang H, Wang Y, Deng C (2011) Application of gesture recog- cation: automated classification of hand gesture contours. IEEE
nition based on simulated annealing bp neural network. In: Trans Syst Man Cybern Part C (Applications and Reviews)
Proceedings of 2011 international conference on electronic & 31(1):114–120
mechanical engineering and information technology, vol 1. IEEE, 48. Ng CW, Ranganath S (2002) Real-time gesture recognition system
pp 178–181 and application. Image Vis Comput 20(13–14):993–1007
29. Lahiani H, Elleuch M, Kherallah M (2015) Real time hand gesture 49. Sharma N, Maringanti HB, Asawa K (2012) Upper body pose
recognition system for android devices. In: 2015 15th interna- recognition and classifier. In: Acm compute conference: intelli-
tional conference on intelligent systems design and applications gent & scalable system technologies
(ISDA). IEEE, pp 591–596 50. Sun J, Zhang Z, Yang L, Zheng J (2020) Multi-view hand
30. Canny J (1986) A computational approach to edge detection. IEEE gesture recognition via pareto optimal front. IET Image Proc
Trans Pattern Anal Mach Intell 6:679–698 14(14):3579–3587
31. Panda CS, Patnaik S (2010) Better edgegap in grayscale image 51. Li Y (2012) Hand gesture recognition using kinect. In: 2012
using gaussian method. Int J Comput Appl Math 5(1):53–66 IEEE international conference on computer science and automa-
32. Deng G, Pinoli J-C (1998) Differentiation-based edge detection tion engineering. IEEE, pp 196–199
using the logarithmic image processing model. J Math Imaging 52. Anant S, Veni S (2018) Safe driving using vision-based hand ges-
Vis 8:161–180 ture recognition system in non-uniform illumination conditions.
33. Sonka M, Hlavac V, Boyle R (2014) Image processing, analysis, J ICT Res Appl 12(2)
and machine vision. Cengage Learning, Boston 53. Singha J, Roy A, Laskar RH (2018) Dynamic hand gesture
34. Otsu N (1979) A threshold selection method from gray-level his- recognition using vision-based approach for human-computer
tograms. IEEE Trans Syst Man Cybern 9(1):62–66 interaction. Neural Comput Appl 29(4):1129–1141
35. Niblack W (1985) An introduction to digital image processing. 54. Simonyan K, Zisserman A (2014) Very deep convolutional net-
Strandberg Publishing Company, Copenhagen works for large-scale image recognition. Comput Sci
36. Malima AK, Özgür E, Çetin M (2006) A fast algorithm for vision- 55. Paul S, Bhattacharyya A, Mollah AF, Basu S, Nasipuri M (2020)
based hand gesture recognition for robot control Hand segmentation from complex background for gesture recog-
123
Complex & Intelligent Systems
nition. In: Emerging technology in modelling and graphics: recognition. In: 2018 IEEE international symposium on smart
proceedings of IEM graph 2018. Springer, pp 775–782 electronic systems (iSES)(Formerly iNiS). IEEE, pp 265–268
56. Shelhamer E, Long J, Darrell T (2016) Fully convolutional net- 76. Wachs JP, Stern HI, Edan Y, Gillam M, Handler J, Feied C, Smith
works for semantic segmentation. IEEE Trans Pattern Anal Mach M (2008) A gesture-based tool for sterile browsing of radiology
Intell 1 images. J Am Med Inform Assoc 15(3):321–323
57. Badrinarayanan V, Kendall A, Cipolla R (2017) Segnet: A deep 77. Ganokratanaa T, Pumrin S (2017) The vision-based hand gesture
convolutional encoder-decoder architecture for image segmenta- recognition using blob analysis. In: 2017 international conference
tion. IEEE Trans Pattern Anal Mach Intell 1 on digital arts, media and technology (ICDAMT). IEEE, pp 336–
58. Ding I-J, Su J-L (2022) Designs of human-robot interaction 341
using depth sensor-based hand gesture communication for smart 78. Shan C, Wei Y, Tan T, Ojardias F (2004) Real time hand tracking
material-handling robot operations. Proc Inst Mech Eng Part B J by combining particle filtering and mean shift. In: IEEE inter-
Eng Manuf 237(3):392–413 national conference on automatic face & gesture recognition, pp
59. Zhao M, Jia Q (2016) Hand segmentation using randomized 669–674
decision forest based on depth images. In: 2016 international con- 79. Shan C, Tan T, Wei Y (2007) Real-time hand tracking using a mean
ference on virtual reality and visualization (ICVRV). IEEE, pp shift embedded particle filter. Pattern Recognit 40(7):1958–1970
110–113 80. Li P, Zhang T, Pece A (2003) Visual contour tracking based on
60. Fukunaga K, Hostetler L (1975) The estimation of the gradient of particle filters. Image Vis Comput 21(1):111–123
a density function, with applications in pattern recognition. IEEE 81. Ma C, Wang A, Ge C, Chi X (2018) Hand joints-based ges-
Trans Inf Theory ture recognition for noisy dataset using nested interval unscented
61. Guo Y, Şengür A, Akbulut Y, Shipley A (2018) An effective color kalman filter with lstm network. Vis Comput 34(6–8):1053–1063
image segmentation approach using neutrosophic adaptive mean 82. Lech M, Kostek B (2012) Hand gesture recognition supported
shift clustering. Measurement 119:28–40 by fuzzy rules and kalman filters. Int J Intell Inf Database Syst
62. Khan B, Khan AK, Raja G, Yousaf MH (2013) Implementation 6(5):407–420
of modified mean-shift tracking algorithm for occlusion handling. 83. Kumar G, Bhatia PK (2014) A detailed review of feature extrac-
Life Science Journal 10(11s):337–342 tion in image processing systems. In: 2014 fourth international
63. Bradski GR (1998) Computer vision face tracking for use in a conference on advanced computing & communication technolo-
perceptual user interface. Intel Technol J gies. IEEE, pp 5–12
64. Allen JG, Xu R, Jin JS (2004) Object tracking using camshift 84. Luan S, Chen C, Zhang B, Han J, Liu J (2018) Gabor convolutional
algorithm and multiple quantized feature spaces networks. IEEE Trans Image Process 27(9):4357–4366
65. Ghotkar A, Kharate G (2012) Hand segmentation techniques to 85. Dalal N, Triggs B, Schmid C (2006) Human detection using
hand gesture recognition for natural human computer interaction. oriented histograms of flow and appearance. In: Computer vision–
Int J Hum Comput Interact 3(1):15–25 ECCV 2006: 9th European conference on computer vision, Graz,
66. Akmeliawati R, Dadgostar F, Demidenko S, Gamage N, Sengupta Austria, May 7–13, 2006. Proceedings, Part II 9. Springer, pp
G (2009) Towards real-time sign language analysis via marker- 428–441
less gesture tracking. In: IEEE instrumentation & measurement 86. Surasak T, Takahiro I, Cheng C-h, Wang C-e, Sheng P-y (2018)
technology conference Histogram of oriented gradients for human detection in video.
67. Collins RT, Lipton AJ, Kanade T, Fujiyoshi H, Burt P (2000) A In: 2018 5th international conference on business and industrial
system for video surveillance and monitoring. VSAM final report, research (ICBIR). IEEE, pp 172–176
Carnegie Mellon University Technical Report 87. Ojala T, Pietikainen M, Maenpaa T (2002) Multiresolution gray-
68. Shen Y, Wen H, Yang M, Liu J, Chou CT (2012) Efficient back- scale and rotation invariant texture classification with local binary
ground subtraction for tracking in embedded camera networks. patterns. IEEE Trans Pattern Anal Mach Intell 24(7):971–987
ACM 88. Lowe DG (2004) Distinctive image features from scale-invariant
69. Apolinário L, Armesto N, Cunqueiro L (2012) An analysis of the keypoints. Int J Comput Vis 60:91–110
influence of background subtraction and quenching on jet observ- 89. Bay H, Tuytelaars T, Van Gool L (2006) Surf: speeded up robust
ables in heavy-ion collisions features. Lect Notes Comput Sci 3951:404–417
70. Denman S, Chandran V, Sridharan S (2007) An adaptive optical 90. Sykora P, Kamencay P, Hudec R (2014) Comparison of sift and
flow technique for person tracking systems. Pattern Recognit Lett surf methods for use on hand gesture recognition based on depth
28(10):1232–1239 map. Aasri Proc 9:19–24
71. Jayabalan E, Krishnan A, Pugazendi R (2007) Non rigid object 91. Suriya M, Sathyapriya N, Srinithi M, Yesodha V (2016) Survey
tracking in aerial videos by combined snake and optical flow tech- on real time sign language recognition system: an lda approach.
nique. In: Computer graphics, imaging & visualisation In: International conference on exploration and innovations in
72. Chanda K, Ahmed W, Mitra S (2015) A new hand gesture engineering and technology, ICEIET, pp 219–225
recognition scheme for similarity measurement in a vision based 92. Ahmed AA, Aly S (2014) Appearance-based Arabic sign lan-
barehanded approach. In: International conference on image infor- guage recognition using hidden markov models. In: 2014 interna-
mation processing, pp 17–22 tional conference on engineering and technology (ICET). IEEE,
73. Liao C-J, Su S-F, Chen M-C (2015) Vision-based hand gesture pp 1–6
recognition system for a dynamic and complicated environment. 93. Hsieh C-C, Liou D-H (2015) Novel haar features for real-time
In: 2015 IEEE international conference on systems, man, and hand gesture recognition using svm. J Real-Time Image Proc
cybernetics. IEEE, pp 2891–2895 10:357–370
74. De O, Deb P, Mukherjee S, Nandy S, Chakraborty T, Saha S 94. Tharwat A, Gaber T, Hassanien AE, Shahin MK, Refaat B (2015)
(2016) Computer vision based framework for digit recognition Sift-based Arabic sign language recognition system. In: Afro-
by hand gesture analysis. In: 2016 IEEE 7th annual information European conference for industrial advancement: proceedings
technology, electronics and mobile communication conference of the first international afro-european conference for industrial
(IEMCON). IEEE, pp 1–5 advancement AECIA 2014. Springer, pp 359–370
75. Panigrahi A, Mohanty JP, Swain AK, Mahapatra K (2018) 95. Hartanto R, Susanto A, Santosa PI (2014) Real time static hand
Real-time efficient detection in vision based static hand gesture gesture recognition system prototype for Indonesian sign lan-
123
Complex & Intelligent Systems
guage. In: 2014 6th international conference on information In: 2016 international conference on information science (ICIS).
technology and electrical engineering (ICITEE). IEEE, pp 1–6 IEEE, pp 120–125
96. Yun L, Lifeng Z, Shujun Z (2012) A hand gesture recognition 114. Zhi D, de Oliveira TEA, da Fonseca VP, Petriu EM (2018)
method based on multi-feature fusion and template matching. Proc Teaching a robot sign language using vision-based hand gesture
Eng 29:1678–1684 recognition. In: 2018 IEEE international conference on compu-
97. Pan T-Y, Lo L-Y, Yeh C-W, Li J-W, Liu H-T, Hu M-C (2016) Real- tational intelligence and virtual environments for measurement
time sign language recognition in complex background scene systems and applications (CIVEMSA). IEEE, pp 1–6
based on a hierarchical clustering classification method. In: 2016 115. Rabiner LR (1989) A tutorial on hidden markov models and
IEEE second international conference on multimedia big data selected applications in speech recognition. In: Proc IEEE, p 77
(BigMM). IEEE, pp 64–67 116. Oka K, Sato Y, Koike H (2002) Real-time fingertip tracking and
98. Rokade US, Doye D, Kokare M (2009) Hand gesture recognition gesture recognition. IEEE Comput Graph Appl 22(6):64–71
using object based key frame selection. In: 2009 international 117. Chen FS, Fu CM, Huang CL (2003) Hand gesture recognition
conference on digital image processing. IEEE, pp 288–291 using a real-time tracking method and hidden Markov models.
99. Bao J, Song A, Guo Y, Tang H (2011) Dynamic hand gesture Image Vis Comput 21(8):745–758
recognition based on surf tracking. In: 2011 international confer- 118. Malgireddy MR, Nwogu I, Govindaraju V (2013) Language-
ence on electric information and control engineering. IEEE, pp motivated approaches to action recognition. Springer, Cham
338–341 119. Jebali M, Dakhli A, Jemni M (2021) Vision-based continuous sign
100. Baranwal N, Nandi GC (2017) An efficient gesture based language recognition using multimodal sensor fusion. Evolut Syst
humanoid learning using wavelet descriptor and mfcc techniques. 12(4):1031–1044
Int J Mach Learn Cybern 8:1369–1388 120. Tutsoy O (2022) Pharmacological, non-pharmacological policies
101. Ibrahim NB, Selim MM, Zayed HH (2018) An automatic Ara- and mutation: an artificial intelligence based multi-dimensional
bic sign language recognition system (arslrs). J King Saud Univ policy making algorithm for controlling the casualties of the
Comput Inf Sci 30(4):470–477 pandemic diseases. IEEE Trans Pattern Anal Mach Intell
102. Chen J, Han M, Yang S, Chang Y (2016) A fingertips detection 44(12):9477–9488
method based on the combination of centroid and Harris corner 121. Tutsoy O, Çolak Ş, Polat A, Balikci K (2020) A novel parametric
algorithm. In: 2016 17th IEEE/ACIS international conference on model for the prediction and analysis of the covid-19 casualties.
software engineering, artificial intelligence, networking and par- IEEE Access 8:193898–193906
allel/distributed computing (SNPD). IEEE, pp 225–230 122. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn
103. Dardas N, Chen Q, Georganas ND, Petriu EM (2010) Hand ges- 20(3):273–297
ture recognition using bag-of-features and multi-class support 123. Vapnik V, Vapnik V et al (1998) Statistical learning theory. Wiley,
vector machine. In: 2010 IEEE international symposium on haptic New York
audio visual environments and games. IEEE, pp 1–5 124. Keerthi SS, Gilbert EG (2002) Convergence of a generalized smo
104. Gupta B, Shukla P, Mittal A (2016) K-nearest correlated neigh- algorithm for svm classifier design. Mach Learn 46(1):351–360
bor classification for Indian sign language gesture recognition 125. Song Y, Demirdjian D, Davis R (2012) Continuous body and hand
using feature fusion. In: 2016 international conference on com- gesture recognition for natural human-computer interaction. ACM
puter communication and informatics (ICCCI). IEEE, pp 1–5 Trans Interact Intell Syst (TiiS) 2(1):1–28
105. Huong TNT, Huu TV, Le Xuan T et al (2015) Static hand gesture 126. Trigueiros P, Ribeiro F, Reis LP (2014) Vision-based Portuguese
recognition for Vietnamese sign language (vsl) using princi- sign language recognition system. In: New perspectives in infor-
ple components analysis. In: 2015 international conference on mation systems and technologies, vol 1. Springer, pp 605–617
communications, management and telecommunications (Com- 127. Al Farid F, Hashim N, Abdullah J (2019) Vision-based hand
ManTel). IEEE, pp 138–141 gesture recognition from rgb video data using svm. In: Interna-
106. Ohn-Bar E, Trivedi MM (2014) Hand gesture recognition tional workshop on advanced image technology (IWAIT) 2019,
in real time for automotive interfaces: a multimodal vision- vol 11049. SPIE, pp 265–268
based approach and evaluations. IEEE Trans Intell Transp Syst 128. Athira P, Sruthi C, Lijiya A (2019) A signer independent sign
15(6):2368–2377 language recognition with co-articulation elimination from live
107. Chaudhary A, Raheja J (2018) Light invariant real-time robust videos: an Indian scenario. J King Saud Univ Comput Inf Sci
hand gesture recognition. Optik 159:283–294 129. Trigueiros P, Ribeiro F, Reis LP (2013) Vision-based gesture
108. Wen X, Niu Y (2010) A method for hand gesture recognition based recognition system for human-computer interaction. In: Computa-
on morphology and fingertip-angle. In: 2010 the 2nd international tional vision and medical image processing IV: VIPIMAGE 2013,
conference on computer and automation engineering (ICCAE), pp 137–142
vol 1. IEEE, pp 688–691 130. Sahoo JP, Ari S, Ghosh DK (2018) Hand gesture recognition
109. Shin J, Kim CM (2016) Character input system using fingertip using dwt and f-ratio based feature descriptor. IET Image Proc
detection with kinect sensor. In: Proceedings of the international 12(10):1780–1787
conference on research in adaptive and convergent systems, pp 131. Maqueda AI, del-Blanco CR, Jaureguizar F, García N (2015)
74–79 Human-computer interaction based on visual hand-gesture recog-
110. Meng G, Wang M (2013) Hand gesture recognition based on fin- nition using volumetric spatiograms of local binary patterns.
gertip detection. In: 2013 fourth global congress on intelligent Comput Vis Image Underst 141:126–137
systems (GCIS). IEEE, pp 107–111 132. Kubat M (1999) Neural networks: a comprehensive foundation
111. Wang M, Lin J-S, Meng GQ (2015) Fingertip detection and ges- by Simon Haykin, Macmillan, 1994, isbn 0-02-352781-7. Knowl
ture recognition based on contour approximation. Int J Pattern Eng Rev 13(4):409–412
Recognit Artif Intell 29(07):1555016 133. Haykin SS, Haykin SS (2001) Kalman filtering and neural net-
112. Rakthanmanon T, Campana B, Mueen A, Batista G, Keogh E works, vol 284. Wiley Online Library
(2012) Searching and mining trillions of time series subsequences 134. Balasundaram A, Chellappan C (2017) Vision based gesture
under dynamic time warping. ACM recognition: a comprehensive study. IIOAB J 8:20–28
113. Ahmed W, Chanda K, Mitra S (2016) Vision based hand gesture
recognition using dynamic time warping for Indian sign language.
123
Complex & Intelligent Systems
135. Schwenker F, Kestler HA, Palm G (2001) Three learning phases 156. Anastassiou D, Kollias S (1988) Digital image halftoning using
for radial-basis-function networks. Neural Netw 14(4–5):439– neural networks. In: Visual communications and image process-
458 ing’88: third in a series, vol 1001. SPIE, pp 1062–1069
136. Ghosh DK, Ari S (2016) On an algorithm for vision-based hand 157. Ji S, Xu W, Yang M, Yu K (2013) 3d convolutional neural networks
gesture recognition. SIViP 10(4):655–662 for human action recognition. IEEE Trans Pattern Anal Mach
137. Lafferty J, McCallum A, Pereira FC (2001) Conditional random Intell 35(1):221–231
fields: probabilistic models for segmenting and labeling sequence 158. John V, Boyali A, Mita S, Imanishi M, Sanma N (2016) Deep
data learning-based fast hand gesture recognition using representative
138. Laskar MA, Das AJ, Talukdar AK, Sarma KK (2015) Stereo frames. In: 2016 international conference on digital image com-
vision-based hand gesture recognition under 3d environment. Proc puting: techniques and applications (DICTA). IEEE, pp 1–8
Comput Sci 58:194–201 159. Oyedotun OK, Khashman A (2017) Deep learning in vision-
139. Jiang S, Pang G, Wu M, Kuang L (2012) An improved k-nearest- based static hand gesture recognition. Neural Comput Appl
neighbor algorithm for text categorization. Expert Syst Appl 28(12):3941–3951
39(1):1503–1509 160. Jiang D, Li G, Sun Y, Kong J, Tao B (2019) Gesture recognition
140. Su M-Y (2011) Using clustering to improve the knn-based clas- based on skeletonization algorithm and cnn with asl database.
sifiers for online anomaly network traffic identification. J Netw Multimed Tools Appl 78(21):29953–29970
Comput Appl 34(2):722–730 161. Kamruzzaman M (2020) Arabic sign language recognition and
141. Mejdoub M, Ben Amar C (2013) Classification improvement of generating Arabic speech using convolutional neural network.
local feature vectors over the knn algorithm. Multimed Tools Appl Wirel Commun Mob Comput 2020
64(1):197–218 162. Chhajed RR, Parmar KP, Pandya MD, Jaju NG (2021) Messaging
142. Sankaranarayanan J, Samet H, Varshney A (2007) A fast all and video calling application for specially abled people using
nearest neighbor algorithm for applications involving large point- hand gesture recognition. In: 2021 6th international conference
clouds. Comput Graph 31(2):157–174 for convergence in technology (I2CT). IEEE, pp 1–4
143. Jasim M, Zhang T, Hasanuzzaman M (2014) A real-time computer 163. Noreen I, Hamid M, Akram U, Malik S, Saleem M (2021)
vision-based static and dynamic hand gesture recognition system. Hand pose recognition using parallel multi stream cnn. Sensors
Int J Image Graph 14(01n02):1450006 21(24):8469
144. Venkatesh, Ranjitha KV (2019) Classification and optimization 164. Al-Hammadi M, Muhammad G, Abdul W, Alsulaiman M,
scheme for text data using machine learning nave Bayes classifier. Bencherif MA, Mekhtiche MA (2020) Hand gesture recognition
In: 2018 IEEE world symposium on communication engineering for sign language using 3dcnn. IEEE Access 8:79491–79509
(WSCE) 165. Molchanov P, Gupta S, Kim K, Kautz J (2015) Hand gesture
145. Argyros AA, Lourakis MI (2006) Vision-based interpretation of recognition with 3d convolutional neural networks. In: Proceed-
hand gestures for remote control of a computer mouse. In: Euro- ings of the IEEE conference on computer vision and pattern
pean conference on computer vision. Springer, pp 40–51 recognition workshops, pp 1–7
146. Kharate GK, Ghotkar AS (2016) Vision based multi-feature hand 166. Molchanov P, Yang X, Gupta S, Kim K, Tyree S, Kautz J (2016)
gesture recognition for Indian sign language manual signs. Int J Online detection and classification of dynamic hand gestures with
Smart Sens Intell Syst 9(1):124 recurrent 3d convolutional neural network. In: Proceedings of the
147. Misra S, Singha J, Laskar RH (2018) Vision-based hand gesture IEEE conference on computer vision and pattern recognition, pp
recognition of alphabets, numbers, arithmetic operators and ascii 4207–4215
characters in order to develop a virtual text-entry interface system. 167. Li Y, Li W, Mahadevan V, Vasconcelos N (2016) Vlad3: encoding
Neural Comput Appl 29(8):117–135 dynamics of deep features for action recognition. In: Proceedings
148. Heickal H, Zhang T, Hasanuzzaman M (2015) Computer vision- of the IEEE conference on computer vision and pattern recogni-
based real-time 3d gesture recognition using depth image. Int J tion, pp 1951–1960
Image Graph 15(01):1550004 168. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learn-
149. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature ing spatiotemporal features with 3d convolutional networks. In:
521(7553):436–44 Proceedings of the IEEE international conference on computer
150. Yamada T, Murata S, Arie H, Ogata T (2017) Representation vision, pp 4489–4497
learning of logic words by an rnn: from word sequences to robot 169. Camgoz NC, Hadfield S, Koller O, Bowden R (2016) Using con-
actions. Front Neurorobot 11:70 volutional 3d neural networks for user-independent continuous
151. Auephanwiriyakul S, Phitakwinai S, Suttapak W, Chanda P, gesture recognition. In: 2016 23rd international conference on
Theera-Umpon N (2013) Thai sign language translation using pattern recognition (ICPR). IEEE, pp 49–54
scale invariant feature transform and hidden Markov models. Pat- 170. Baumgartl H, Sauter D, Schenk C, Atik C, Buettner R (2021)
tern Recognit Lett 34(11):1291–1298 Vision-based hand gesture recognition for human-computer inter-
152. Neverova N, Wolf C, Paci G, Sommavilla G, Taylor G, Nebout action using mobilenetv2. In: 2021 IEEE 45th annual computers,
F (2013) A multi-scale approach to gesture detection and recog- software, and applications conference (COMPSAC). IEEE, pp
nition. In: Proceedings of the IEEE international conference on 1667–1674
computer vision workshops, pp 484–491 171. Ewe ELR, Lee CP, Kwek LC, Lim KM (2022) Hand gesture recog-
153. Geng L, Ma X, Wang H, Gu J, Li Y (2014) Chinese sign language nition via lightweight vgg16 and ensemble classifier. Appl Sci
recognition with 3d hand motion trajectories and depth images. 12(15):7643
In: Proceeding of the 11th world congress on intelligent control 172. Jiang H, Wachs JP, Duerstock BS (2013) Integrated vision-based
and automation. IEEE, pp 1457–1461 robotic arm interface for operators with upper limb mobility
154. Shin S, Kim W-Y (2020) Skeleton-based dynamic hand gesture impairments. In: 2013 IEEE 13th international conference on
recognition using a part-based gru-rnn for gesture-based interface. rehabilitation robotics (ICORR). IEEE, pp 1–6
IEEE Access 8:50236–50243 173. Kane L, Khanna P (2017) Vision-based mid-air unistroke char-
155. Zhang L, Zhu G, Mei L, Shen P, Shah SAA, Bennamoun M (2018) acter input using polar signatures. IEEE Trans Hum Mach Syst
Attention in convolutional lstm for gesture recognition. Advances 47(6):1077–1088
in neural information processing systems, p 31
123
Complex & Intelligent Systems
174. Yao Y, Fu Y (2014) Contour model-based hand-gesture recognition using the kinect sensor. IEEE Trans Circuits Syst Video Technol 24(11):1935–1944
175. Almeida SGM, Guimarães FG, Ramírez JA (2014) Feature extraction in Brazilian sign language recognition based on phonological structure and using rgb-d sensors. Expert Syst Appl 41(16):7259–7271
176. Zhu C, Yang J, Shao Z, Liu C (2019) Vision based hand gesture recognition using 3d shape context. IEEE/CAA J Autom Sin 8(9):1600–1613
177. Yu C-W, Liu C-H, Chen Y-L, Lee P, Tian M-S (2018) Vision-based hand recognition based on tof depth camera. Smart Sci 6(1):21–28
178. Wu B-X, Yang C-G, Zhong J-P (2021) Research on transfer learning of vision-based gesture recognition. Int J Autom Comput 18(3):422–431
179. Starner TE (1995) Visual recognition of American sign language using hidden Markov models. Technical report, Massachusetts Inst of Tech, Cambridge, Dept of Brain and Cognitive Sciences
180. Hoque SA, Haq MS, Hasanuzzaman M (2018) Computer vision based gesture recognition for desktop object manipulation. In: 2018 international conference on innovation in engineering and technology (ICIET). IEEE, pp 1–6
181. Simao MA, Gibaru O, Neto P (2019) Online recognition of incomplete gesture data to interface collaborative robots. IEEE Trans Ind Electron 66(12):9372–9382
182. Nguyen V-T, Tran T-H, Le T-L, Mullot R, Courboulay V (2015) Using hand postures for interacting with assistant robot in library. In: 2015 seventh international conference on knowledge and systems engineering (KSE). IEEE, pp 354–359
183. Grzejszczak T, Legowski A, Niezabitowski M (2015) Robot manipulator teaching techniques with use of hand gestures. In: 2015 20th international conference on control systems and computer science. IEEE, pp 71–77
184. Peral M, Sanfeliu A, Garrell A (2022) Efficient hand gesture recognition for human-robot interaction. IEEE Robot Autom Lett 7(4):10272–10279
185. Shang-Liang C, Li-Wu H (2021) Using deep learning technology to realize the automatic control program of robot arm based on hand gesture recognition. Int J Eng Technol Innov 11(4):241
186. Wu B, Zhong J, Yang C (2021) A visual-based gesture prediction framework applied in social robots. IEEE/CAA J Autom Sin 9(3):510–519
187. Qi W, Ovur SE, Li Z, Marzullo A, Song R (2021) Multi-sensor guided hand gesture recognition for a teleoperated robot using a recurrent neural network. IEEE Robot Autom Lett 6(3):6039–6045
188. Torres SHM, Kern MJ et al (2017) 7 dof industrial robot controlled by hand gestures using microsoft kinect v2. In: 2017 IEEE 3rd Colombian conference on automatic control (CCAC). IEEE, pp 1–6
189. Gao Q, Ju Z, Chen Y, Wang Q, Chi C (2022) An efficient rgb-d hand gesture detection framework for dexterous robot hand-arm teleoperation system. IEEE Trans Hum Mach Syst
190. Xue Z, Chen X, He Y, Cao H, Tian S (2022) Gesture- and vision-based automatic grasping and flexible placement in teleoperation. Int J Adv Manuf Technol 1–16
191. Fahn C-S, Chu K-Y (2011) Hidden-markov-model-based hand gesture recognition techniques used for a human-robot interaction system. In: International conference on human-computer interaction. Springer, pp 248–258
192. Wang M, Chen W-Y, Li XD (2016) Hand gesture recognition using valley circle feature and hu's moments technique for robot movement control. Measurement 94:734–744
193. Zhao H, Hu J, Zhang Y, Cheng H (2017) Hand gesture based control strategy for mobile robots. In: 2017 29th Chinese control and decision conference (CCDC). IEEE, pp 5868–5872
194. Zhang T, Su Z, Cheng J, Xue F, Liu S (2022) Machine vision-based testing action recognition method for robotic testing of mobile application. Int J Distrib Sens Netw 18(8):15501329221115376
195. Wang W, He M, Wang X, Song H, Ma J (2022) Medical gesture recognition method based on improved lightweight network. Available at SSRN 4102589
196. Xu J, Li J, Zhang S, Xie C, Dong J (2020) Skeleton guided conflict-free hand gesture recognition for robot control. In: 2020 11th international conference on awareness science and technology (iCAST). IEEE, pp 1–6
197. Togo S, Ukida H (2021) Gesture recognition using hand region estimation in robot manipulation. In: 2021 60th annual conference of the society of instrument and control engineers of Japan (SICE). IEEE, pp 1122–1127
198. Castro-Vargas J, Zapata-Impata B, Gil P, Garcia-Rodriguez J, Torres F (2019) 3dcnn performance in hand gesture recognition applied to robot arm interaction
199. Almarzuqi AA, Buhari SM (2016) Enhance robotics ability in hand gesture recognition by using leap motion controller. In: International conference on broadband and wireless computing, communication and applications. Springer, pp 513–523
200. Luo X, Amighetti A, Zhang D (2019) A human-robot interaction for a mecanum wheeled mobile robot with real-time 3d two-hand gesture recognition. J Phys Conf Ser 1267:012056
201. Moysiadis V, Katikaridis D, Benos L, Busato P, Anagnostis A, Kateris D, Pearson S, Bochtis D (2022) An integrated real-time hand gesture recognition framework for human-robot interaction in agriculture. Appl Sci 12(16):8160
202. Gao Q, Chen Y, Ju Z, Liang Y (2021) Dynamic hand gesture recognition based on 3d hand pose estimation for human-robot interaction. IEEE Sens J
203. Vishwakarma DK, Maheshwari R, Kapoor R (2015) An efficient approach for the recognition of hand gestures from very low resolution images. In: 2015 fifth international conference on communication systems and network technologies. IEEE, pp 467–471
204. Tsai T-H, Huang C-C, Zhang K-L (2020) Design of hand gesture recognition system for human-computer interaction. Multimed Tools Appl 79(9):5989–6007
205. Rawat P, Kane L, Goswami M, Jindal A, Sehgal S (2022) A review on vision-based hand gesture recognition targeting rgb-d sensors. Int J Inf Technol Decis Mak
206. Chanu OR, Pillai A, Sinha S, Das P (2017) Comparative study for vision based and data based hand gesture recognition technique. In: 2017 international conference on intelligent communication and computational techniques (ICCT). IEEE, pp 26–31
207. Hasan MM, Mishra PK (2010) Hsv brightness factor matching for gesture recognition system. Int J Image Process (IJIP) 4(5):456–467
208. Xu C, Govindarajan LN, Zhang Y, Cheng L (2017) Lie-x: depth image based articulated object pose estimation, tracking, and action recognition on lie groups. Int J Comput Vis 123(3):454–478
209. Islam M et al (2020) An efficient human computer interaction through hand gesture using deep convolutional neural network. SN Comput Sci 1(4):1–9
210. Zengeler N, Kopinski T, Handmann U (2018) Hand gesture recognition in automotive human-machine interaction using depth cameras. Sensors 19(1):59
211. Liu Y, Song S, Yang L, Bian G, Yu H (2022) A novel dynamic gesture understanding algorithm fusing convolutional neural networks with hand-crafted features. J Vis Commun Image Represent 83:103454
212. Joshi G, Vig R et al (2015) A multi-class hand gesture recognition in complex background using sequential minimal optimization. In: 2015 international conference on signal processing, computing and control (ISPCC). IEEE, pp 92–96
213. Chen R, Tian X (2023) Gesture detection and recognition based on object detection in complex background. Appl Sci 13(7):4480
214. Zhang T, Lin H, Ju Z, Yang C (2020) Hand gesture recognition in complex background based on convolutional pose machine and fuzzy gaussian mixture models. Int J Fuzzy Syst 22:1330–1341
215. Vishwakarma DK (2017) Hand gesture recognition using shape and texture evidences in complex background. In: 2017 international conference on inventive computing and informatics (ICICI). IEEE, pp 278–283
216. Pabendon E, Nugroho H, Suheryadi A, Yunanto PE (2017) Hand gesture recognition system under complex background using spatio temporal analysis. In: 2017 5th international conference on instrumentation, communications, information technology, and biomedical engineering (ICICI-BME). IEEE, pp 261–265
217. Elsayed RA, Sayed MS, Abdalla MI (2015) Skin-based adaptive background subtraction for hand gesture segmentation. In: 2015 IEEE international conference on electronics, circuits, and systems (ICECS). IEEE, pp 33–36
218. Cui Z, Lei Y, Wang Y, Yang W, Qi J (2022) Hand gesture segmentation against complex background based on improved atrous spatial pyramid pooling. J Ambient Intell Humaniz Comput 1–13
219. Zhou W, Chen K (2022) A lightweight hand gesture recognition in complex backgrounds. Displays 74:102226

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.