Sensors 24 06262
Article
Real-Time Hand Gesture Monitoring Model Based on
MediaPipe’s Registerable System
Yuting Meng 1, *, Haibo Jiang 2 , Nengquan Duan 1 and Haijun Wen 1, *
1 College of Mechanical Engineering, North University of China, Taiyuan 030051, China; [email protected]
2 SIMITECH Co., Xi’an 710086, China; [email protected]
* Correspondence: [email protected] (Y.M.); [email protected] (H.W.)
Abstract: Hand gesture recognition plays a significant role in human-to-human and human-to-
machine interactions. Currently, most hand gesture detection methods rely on fixed hand gesture
recognition. However, with the diversity and variability of hand gestures in daily life, this paper
proposes a registerable hand gesture recognition approach based on Triple Loss. By learning the
differences between different hand gestures, it can cluster them and identify newly added gestures.
This paper constructs a registerable gesture dataset (RGDS) for training registerable hand gesture
recognition models. Additionally, it proposes a normalization method for transforming hand gesture
data and a FingerComb block for combining and extracting hand gesture data to enhance features
and accelerate model convergence. It also improves ResNet and introduces FingerNet for registerable
single-hand gesture recognition. The proposed model performs well on the RGDS dataset. The
system is registerable, allowing users to flexibly register their own hand gestures for personalized
gesture recognition.
Citation: Meng, Y.; Jiang, H.; Duan, N.; Wen, H. Real-Time Hand Gesture Monitoring Model Based on MediaPipe’s Registerable System. Sensors 2024, 24, 6262. https://doi.org/10.3390/s24196262
Academic Editor: Stefanos Kollias
Received: 13 August 2024
Revised: 24 September 2024
Accepted: 25 September 2024
Published: 27 September 2024
Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
1. Introduction
In daily life, gestures provide an efficient means of communication and play a significant role in interpersonal interactions [1]. With the advancement of technology and the development of human–computer interactions, gesture recognition has gradually gained attention as a natural and intuitive way of interacting [2]. Gesture recognition refers to the analysis and identification of human hand movements and postures by computers, enabling natural interactions between humans and machines. It is a vital research area in the fields of human–computer interaction and computer vision [3], with a wide range of potential applications, such as virtual reality, smart homes, security monitoring, handwriting recognition, medical image analysis, and more. Gesture recognition technologies can be classified into two main categories: sensor-based gesture recognition and image-based gesture recognition [4].
In the early stages, gesture recognition relied heavily on sensor-based methods due to immature hardware and algorithms. Data gloves or electromagnetic waves were commonly used to capture hand movements. For instance, IBM introduced a device called “DataGlove” [5], which monitored hand movements in real-time and transmitted them to computers. While this approach often provided more accurate results, the need to wear sensor devices limited its widespread adoption in everyday applications.
In recent years, with the rapid development of computer vision technology, the use of visuals in the industry is becoming increasingly widespread [6–8]. The methods for recognizing target movements and postures through image or video analysis are becoming more mature. Compared to traditional methods, image-based gesture recognition does not require additional wearable devices, making the recognition process more natural and gradually becoming mainstream technology [9]. In 2002, a research team at the
2. Related Work
There are many existing datasets for sign language recognition tasks. In this section, we first review some existing sign language datasets. Athitsos et al. [17] proposed an American sign language dataset called ASLLVD, consisting of 9800 video samples containing 3300 sign language words. Sincan and Keles proposed a Turkish sign language dataset called AUTSL, which contains 38,336 video samples covering 226 signs performed by 43 different signers. These videos were recorded in both indoor and outdoor environments. This dataset has color, depth, and skeleton modes. Necati [18] proposed RWTH-PHOENIX-WEATHER-2014T, a German sign language dataset for continuous sign language recognition, which is based on weather forecast videos broadcast by nine sign language hosts. The training set, validation set, and test set of the dataset contain 7096, 519, and 642 data samples, respectively. The CSL-Daily [19] dataset can be used for continuous sign language recognition and translation tasks. CSL-Daily focuses more on daily life scenarios, including multiple themes such as family life, healthcare, and school life. The training, validation, and testing sets of CSL-Daily contain 18,401, 1077, and 1176 video samples, respectively. These datasets have their own focuses, but none of them addresses the problem this article aims to solve: the same gesture appearing in different postures.
Therefore, in subsequent work, this article creates a small gesture dataset, RGDS, to supplement existing gesture datasets, which contain few training samples for different postures of the same gesture.
Generally speaking, constructing gesture recognition and classification models is the main stage of gesture recognition technology. Existing techniques extract the spatiotemporal features of gestures and classify them, and can be divided into two main categories: traditional methods and deep learning. Most traditional methods use pre-defined templates or sequence similarity matching schemes. Among them, He Li et al. [14] used the Hausdorff distance under the maximum likelihood criterion for recognition [15], together with a multi-resolution search strategy that improved the search speed, and were also able to recognize letter gestures. However, the recognition effect is not good for gestures that undergo deformation (rotation and scaling). Zhang Liangguo et al. [20] used the Hausdorff distance template matching method to compare the contour feature points of gesture areas and successfully completed the recognition of 30 Chinese sign language gestures. Yang Xuewen et al. [21] overcame the shortcomings of existing gesture recognition algorithms in handling scaled, translated, and rotated gestures by using a method that combines the gesture main direction and a Hausdorff-like distance. Zhu Jiyu et al. [22] used a gesture recognition method based on structural analysis to obtain the overall and local change information of gestures, increasing the number of recognizable gesture types. However, these traditional methods share common drawbacks, such as relatively complex calculation processes, high manpower consumption, and unsatisfactory real-time performance.
Deep learning methods have now become the mainstream approach to gesture recognition: experimental results from studies [23–29] have shown that gesture recognition based on neural network training can raise the accuracy of some meaningful gesture recognition tasks to over 95%. Xin Wenbin et al. [23] used a static gesture real-time recognition method based on the ShuffleNetv2-YOLOv3 model, which extracted features from gesture images using ShuffleNetv2 and then classified gestures using the YOLOv3 neural network model, increasing the fps from 41 to 44. By adopting the CBAM attention mechanism module, the model’s accuracy increased from 96.6% to 98.2%. Wu [25] adopted a novel dual-channel convolutional neural network (DC-CNN) recognition algorithm to simultaneously extract features from gesture images and hand images, effectively improving the generalization ability of CNN, but the improvement in accuracy was not very significant. At this point, researchers noticed the shortcomings of relying on visual recognition alone to read gestures. Songyao Jiang et al. [27] proposed SAM-SLR-v2, a skeleton-aware multimodal framework with a global ensemble model for isolated sign language recognition. This framework provides annotations for skeleton-based sign language recognition, but the recognition performance was poor in cases of occlusion or insignificant posture. Papadimitriou et al. [28] proposed a deep learning framework, 3D-DCNN + ST-MGCN, that uses appearance and skeletal information for automatic sign language recognition without special input, which reduced the relative error rate of Greek sign language recognition by 53%. However, this was only for the recognition of the entire arm and did not show any improvement compared to single-finger or hand recognition.
The literature shows that deep learning gesture recognition is now quite mature. However, current image-based gesture recognition works on the projection of the hand onto the image plane, whereas what we generally consider a gesture is defined by the combination of finger joints; that is, the same gesture produces different images in different poses. Therefore, image-based gesture recognition has limitations in recognizing the same gesture in different poses.
In response to the above issues, our method obtains the three-dimensional coordinates of the finger’s joint points and identifies the features of the joint points, enabling the system to recognize the same gesture in different postures. Considering the classification of features in gesture recognition models, we did not adopt the commonly used Softmax classifier. Florian Schroff et al. [30] used Triple Loss to construct FaceNet, which directly learned the mapping from facial images to a compact Euclidean space, where distance directly corresponded to the measure of facial similarity. The goal of Triple Loss is to make the embeddings of samples with the same label as close as possible in the embedding space and the embeddings of samples with different labels as far apart as possible. The Softmax loss function is commonly used for multi-class classification tasks. It maps the output of the network to a probability distribution such that the probabilities over all categories sum to one. The objective of the Softmax loss function is to maximize the probability of the correct category while minimizing the probability of the incorrect categories. However, the Softmax loss function may struggle when dealing with large intra-class differences, as it only separates a fixed set of categories and places no explicit constraint on the distances between embeddings. In contrast, the Triple Loss function works directly on those distances, increasing the similarity between samples of the same category and the degree of difference between different categories.
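As a small numerical illustration of the Softmax mapping described above (this is not code from the paper), the sketch below converts raw network outputs into probabilities that sum to one. The function name `softmax` and the example scores are assumptions chosen purely for illustration.

```python
import math


def softmax(scores):
    """Map raw scores to a probability distribution (numerically stable form)."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs for three gesture classes.
probs = softmax([2.0, 1.0, 0.1])
```

The output always has one entry per trained class, which is why a Softmax classifier on its own cannot accommodate a gesture class that is registered after training.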
Finally, the MediaPipe we used in our research is a cross-platform machine learning
framework open-sourced by Google [31], designed to assist developers in building machine
learning applications based on visual, audio, and sensor inputs. Finger in MediaPipe is
an algorithm module used for hand pose estimation [32]. It can locate and track fingers
through input hand images and estimate the three-dimensional posture of fingers. The
input of the Finger module is a set of hand images, such as hand images captured by a
camera or hand images that have already been captured. The Finger module detects and
tracks fingers, obtaining the position and posture information of each finger. Based on deep
neural networks, it trains on a large amount of hand image data to achieve the accurate
detection and tracking of fingers. In addition, the Finger module also utilizes some image
processing and computer vision algorithms, such as morphological processing, filters, and
Kalman filters, to further improve the accuracy and stability of detection and tracking.
3. Research Methodology
3.1. Gesture Recognition Process
In the context of image recognition and classification, there are four main tasks:
classification, localization, detection, and segmentation. For gesture recognition, we needed
to address the tasks of classification, localization, and detection. To achieve this, we used
the following steps:
1. Classification and Localization:
Utilize the open-source MediaPipe provided by Google to obtain the position and
coordinate points of the hand. This step helps us identify and locate the hand within
the image.
2. Data Preprocessing:
Apply geometric transformations to the coordinate points of the hand obtained in the
previous step to normalize the data. This normalization step ensures that the hand’s shape
and position are consistent and comparable across different images.
3. Feature Embedding:
Use FingerNet to perform embedding on normalized hand data. FingerNet is a
deep neural network designed to extract representative features from the hand’s coordi-
nate points.
4. Loss Functions:
Apply the Triple Loss and CrossEntropy Loss functions during the network training
phase. The Triple Loss function is responsible for pushing embeddings of samples with
the same gesture label closer together in the embedding space and pushing embeddings
of samples with different gesture labels farther apart. The CrossEntropy Loss function is
utilized for the classification aspect of network training.
By following these steps (Figure 1), the gesture recognition model can be trained based on the features extracted from hand coordinates, leading to accurate classifying. This approach combines localization and classification to achieve effective gesture recognition.
Figure 1. Gesture classification implementation process.
Please note that the RGDS dataset has been curated to encompass diverse hand gesture samples, facilitating effective training and the evaluation of the gesture recognition model.
Figure 2. Gesture data. (1) and (2) represent gesture photographs for two different gestures.
3.3. Data Preprocessing
After processing the images containing hands through the MediaPipe network, we obtained the three-dimensional coordinates of 21 key points of the fingers, as shown in Figure 3. At this point, the x and y coordinates of the three-dimensional points are based on the image’s left-bottom corner as the coordinate origin. The z-coordinate is based on the 0th point as the coordinate origin.
To facilitate further image processing, we standardized and rectified the coordinates
of the 21 key points of the fingers. Here, we take the 0th point as the coordinate origin; the
x-axis runs from 0 to 17, and the y-axis runs from 0 to 5. Using the right-hand rule, we can
obtain a three-dimensional coordinate system based on the 0th point, which will be used
for subsequent processing.
Figure 3. MediaPipe finger landmark. The red dots are the 21 key points selected for the hand, which are connected by a green line to form a complete line of identification of the hand.
Canonical transformations are mathematical transformations commonly used to analyze and simplify the description of physical systems. They involve selecting appropriate coordinate system transformations to convert physical quantities in the original coordinate system, making the problem’s formulation more concise or convenient for solving. The core idea of canonical transformations is to choose a new set of basis vectors to represent vectors in the original coordinate system. These new basis vectors often possess special properties, such as orthogonality or normalization, to simplify the problem’s description or solution.
In this specific case, we are considering two sets of basis vectors: one based on the points 0 to 5 and another based on the points 0 to 17.
(1) Taking the 0th point as the coordinate origin, we performed a translation of the original coordinates. Let $x_0, y_0$ represent the original coordinates of point 0, and $x_i, y_i$ represent the original coordinates of point $i$. After the translation, we obtained new coordinates $X_i, Y_i$.
$$X_i = x_i - x_0 \tag{1}$$
$$Y_i = y_i - y_0 \tag{2}$$
(2) We calculated the vectors $\overrightarrow{P_0P_{17}}$ and $\overrightarrow{P_0P_5}$.
$$\overrightarrow{P_0P_{17}} = \big((x_{17} - x_0),\ (y_{17} - y_0),\ (z_{17} - z_0)\big) \tag{3}$$
$$\overrightarrow{P_0P_5} = \big((x_5 - x_0),\ (y_5 - y_0),\ (z_5 - z_0)\big) \tag{4}$$
(3) We calculated the normal vector $\vec{P_z}$ of the plane formed by vectors $\overrightarrow{P_0P_5}$ and $\overrightarrow{P_0P_{17}}$.
$$\vec{P_z} = \overrightarrow{P_0P_5} \times \overrightarrow{P_0P_{17}} \tag{5}$$
$$\vec{P_z} = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ x_5 - x_0 & y_5 - y_0 & z_5 - z_0 \\ x_{17} - x_0 & y_{17} - y_0 & z_{17} - z_0 \end{vmatrix} \tag{6}$$
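Steps (1) to (3) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names and the toy landmark values are assumptions, and the translation is applied to all three coordinates so that it matches the $z - z_0$ terms in Equations (3) and (4).

```python
def subtract(p, q):
    """Component-wise difference p - q of two 3-D points."""
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])


def cross(u, v):
    """Cross product u x v, i.e. the expanded determinant of Equation (6)."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def hand_basis(landmarks):
    """Translate 21 landmarks to point 0 and return (points, P0P17, P0P5, Pz)."""
    origin = landmarks[0]
    translated = [subtract(p, origin) for p in landmarks]  # Equations (1)-(2)
    p0p17 = translated[17]                                  # Equation (3)
    p0p5 = translated[5]                                    # Equation (4)
    pz = cross(p0p5, p0p17)                                 # Equation (5)
    return translated, p0p17, p0p5, pz


# Toy hand (hypothetical values): every landmark sits at the wrist except 5 and 17.
toy = [(1.0, 1.0, 0.0)] * 21
toy[5] = (1.0, 2.0, 0.0)
toy[17] = (2.0, 1.0, 0.0)
points, p0p17, p0p5, pz = hand_basis(toy)
```

In this toy case the two in-plane vectors lie along the coordinate axes, so the computed normal is perpendicular to both of them.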
(4) Change in Basis
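For a point $\vec{P}$, its coordinates in the basis $\vec{R} = (\vec{a}, \vec{b}, \vec{c})$ are given by $(x_r, y_r, z_r)$ using Equation (7). In the basis $\vec{Q} = (\vec{A}, \vec{B}, \vec{C})$, its coordinates are given by $(x_q, y_q, z_q)$ using Equation (8). The coordinates of $\vec{A}$, $\vec{B}$, and $\vec{C}$ in the basis of $\vec{R}$ are given, respectively, by $(X_{ar}, Y_{ar}, Z_{ar})$, $(X_{br}, Y_{br}, Z_{br})$, and $(X_{cr}, Y_{cr}, Z_{cr})$ using Equations (9)–(11). Formula (12) is obtained by stacking Formulas (9)–(11). Because Equations (7) and (8) describe the same point $\vec{P}$, setting (7) = (8) gives Equation (13). From this, we could derive the coordinate transformation matrix $F$ (14) that transforms point $\vec{P}$ from $\vec{R}$ to $\vec{Q}$. Here, coordinates are written as row vectors, and each basis is treated as a matrix whose rows are its basis vectors.
$$\vec{P} = (x_r, y_r, z_r) \cdot \vec{R} = (x_r, y_r, z_r) \cdot \begin{pmatrix} \vec{a} \\ \vec{b} \\ \vec{c} \end{pmatrix} = x_r\vec{a} + y_r\vec{b} + z_r\vec{c} \tag{7}$$
$$\vec{P} = (x_q, y_q, z_q) \cdot \vec{Q} = (x_q, y_q, z_q) \cdot \begin{pmatrix} \vec{A} \\ \vec{B} \\ \vec{C} \end{pmatrix} = x_q\vec{A} + y_q\vec{B} + z_q\vec{C} \tag{8}$$
$$\vec{A} = (X_{ar}, Y_{ar}, Z_{ar}) \cdot \vec{R} = X_{ar}\vec{a} + Y_{ar}\vec{b} + Z_{ar}\vec{c} \tag{9}$$
$$\vec{B} = (X_{br}, Y_{br}, Z_{br}) \cdot \vec{R} = X_{br}\vec{a} + Y_{br}\vec{b} + Z_{br}\vec{c} \tag{10}$$
$$\vec{C} = (X_{cr}, Y_{cr}, Z_{cr}) \cdot \vec{R} = X_{cr}\vec{a} + Y_{cr}\vec{b} + Z_{cr}\vec{c} \tag{11}$$
$$\vec{Q} = \begin{pmatrix} \vec{A} \\ \vec{B} \\ \vec{C} \end{pmatrix} = \begin{pmatrix} X_{ar} & Y_{ar} & Z_{ar} \\ X_{br} & Y_{br} & Z_{br} \\ X_{cr} & Y_{cr} & Z_{cr} \end{pmatrix} \cdot \vec{R} \tag{12}$$
$$(x_q, y_q, z_q) = (x_r, y_r, z_r) \cdot \vec{R} \cdot \vec{Q}^{-1} = (x_r, y_r, z_r) \cdot F \tag{13}$$
$$F = \vec{R} \cdot \vec{Q}^{-1} \tag{14}$$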
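The change of basis can be checked numerically with a short sketch. It assumes the row-vector reading of Equations (13) and (14) and takes the original basis R to be the identity, so that F reduces to the inverse of the matrix whose rows are the new basis vectors. All function names and the example basis are hypothetical.

```python
def inverse_3x3(m):
    """Invert a 3x3 matrix given as three rows, using the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("basis vectors are linearly dependent")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def row_times_matrix(v, m):
    """Multiply a row vector v (length 3) by a 3x3 matrix m."""
    return tuple(sum(v[k] * m[k][j] for k in range(3)) for j in range(3))


def to_new_basis(point_r, new_basis_rows):
    """Coordinates of a point in the new basis, assuming R is the identity."""
    f = inverse_3x3(new_basis_rows)  # F = R * Q^-1 with R = I
    return row_times_matrix(point_r, f)


# Example basis Q with rows A, B, C expressed in the original coordinates.
q_rows = [(1.0, 1.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 2.0)]
# The point 1*A + 2*B + 3*C equals (1, 3, 6) in the original coordinates.
coords_q = to_new_basis((1.0, 3.0, 6.0), q_rows)
```

The example point was built as A + 2B + 3C, so the transformed coordinates recover the coefficients 1, 2, and 3.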
Layer        Kernel, channels, stride (s), padding (p)    Output size
Input                                                      1 × 22 × 3
FingerComb   1 × 2, 32, s = 1, p = 0                       1 × 33 × 32
             1 × 3, 32, s = 1, p = 1
             1 × 4, 32, s = 1, p = 1
Conv2_x      (1 × 3, 32, s = 1, p = 1                      1 × 33 × 32
              1 × 3, 32, s = 1, p = 1) × 2
Conv3_x      1 × 3, 64, s = 2, p = 1                       1 × 17 × 64
             1 × 3, 64, s = 1, p = 1
             1 × 3, 64, s = 1, p = 1
             1 × 3, 64, s = 1, p = 1
Conv4_x      1 × 3, 128, s = 2, p = 1                      1 × 9 × 128
             1 × 3, 128, s = 1, p = 1
             1 × 3, 128, s = 1, p = 1
             1 × 3, 128, s = 1, p = 1
Conv5_x      1 × 3, 256, s = 2, p = 1                      1 × 5 × 256
             1 × 3, 256, s = 1, p = 1
             1 × 3, 256, s = 1, p = 1
             1 × 3, 256, s = 1, p = 1
Avgpool      256, K = 3, s = 1, p = 1                      1 × 1 × 256
FC           (256, 128)                                    1 × 128
FC           (128, 32)                                     1 × 32
Figure 5. Structure of FingerNet.
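The output lengths listed for Conv2_x to Conv5_x (33, 17, 9, and 5) are consistent with the standard convolution output-length formula, floor((n + 2p − k) / s) + 1. The sketch below checks this arithmetic; it assumes that only the first layer of Conv3_x, Conv4_x, and Conv5_x uses stride 2, and the function name is hypothetical.

```python
def conv_output_length(n, kernel, stride, padding):
    """Length of a 1-D convolution output for input length n."""
    return (n + 2 * padding - kernel) // stride + 1


# Stride-1 layers with kernel 3 and padding 1 keep the length unchanged.
same_length = conv_output_length(33, kernel=3, stride=1, padding=1)

# Each stride-2 layer roughly halves the length: 33 -> 17 -> 9 -> 5.
length = 33
lengths = []
for _ in range(3):  # first layer of Conv3_x, Conv4_x, and Conv5_x
    length = conv_output_length(length, kernel=3, stride=2, padding=1)
    lengths.append(length)
```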
CrossEntropy loss, where $m$ is the number of samples, $w_j$ represents the jth weight in the model parameters, $y_i$ is the true label for each sample, and $\hat{y}_i$ is the model output. Formula (17) represents the total loss, which is the combination of Triple Loss and CrossEntropy Loss.
$$L_t = \max\{d(a, p) - d(a, n) + \mathit{margin},\ 0\} \tag{15}$$
$$L(w) = -\frac{1}{m}\sum_{i=1}^{m} y_i(\hat{y}_i) + \frac{1}{m}\sum_{i=1}^{m}(1 - y_i)\log e^{\frac{1}{m}\sum_{j=1}^{m} w_j} \tag{16}$$
$$L = L_t + L(w) \tag{17}$$
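Equation (15) can be illustrated with a few lines of Python. This is a sketch under stated assumptions rather than the training code: the distance d is taken to be the Euclidean (L2) distance, the default margin of 5 follows the training parameters reported in the paper, and the two-dimensional embeddings are toy values.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triple_loss(anchor, positive, negative, margin=5.0):
    """Equation (15): max(d(a, p) - d(a, n) + margin, 0)."""
    gap = l2_distance(anchor, positive) - l2_distance(anchor, negative)
    return max(gap + margin, 0.0)


anchor = (0.0, 0.0)
positive = (3.0, 4.0)  # same gesture, distance 5 from the anchor

# Negative already far enough away: the loss vanishes.
easy = triple_loss(anchor, positive, negative=(6.0, 8.0))
# Negative closer than the positive: the loss is positive and drives training.
hard = triple_loss(anchor, positive, negative=(0.0, 1.0))
```

The loss is zero only when the negative sample is at least `margin` farther from the anchor than the positive sample, which is what pushes different gestures apart in the embedding space.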
After the completion of the model’s construction, various training parameters needed
to be set before training, as shown in Table 3. The model was trained for a total of
500 epochs, and the margin was a parameter for the Triple Loss. Additionally, the
“min_tracking_confidence” is the minimum confidence threshold for hand tracking in
MediaPipe finger detection, while the “min_detection_confidence” represents the mini-
mum confidence threshold for hand detection. The “max_num_hands” parameter specifies
the maximum number of hands to be recognized in the input data.
Type Parameter
epoch 500
BatchSize 256
margin 5
lr 0.2
min_tracking_confidence 0.8
min_detection_confidence 0.62
max_num_hands 1
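The three MediaPipe parameters listed above map onto keyword arguments of the legacy `mp.solutions.hands.Hands` class. The sketch below assumes that this legacy Solutions API is available in the installed MediaPipe version; the helper names `create_hand_tracker` and `landmarks_to_points` are hypothetical and do not come from the paper.

```python
def create_hand_tracker():
    """Build a MediaPipe Hands tracker with the thresholds listed above."""
    import mediapipe as mp  # imported lazily so the helper below works without it

    return mp.solutions.hands.Hands(
        max_num_hands=1,
        min_detection_confidence=0.62,
        min_tracking_confidence=0.8,
    )


def landmarks_to_points(hand_landmarks):
    """Convert one detected hand into a list of (x, y, z) tuples."""
    return [(lm.x, lm.y, lm.z) for lm in hand_landmarks.landmark]
```

In a typical loop, each RGB frame would be passed to the tracker's `process` method and every entry of `results.multi_hand_landmarks` handed to `landmarks_to_points` before normalization.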
After setting up the various training parameters, the model was trained, and Figure 6
shows the change in model training loss with respect to the training steps. Before 200k
steps, the model’s loss appears to be unstable but shows an overall convergence trend.
After 200k steps, the model’s loss mostly tends to approach zero; although there may be
some fluctuations in the process, the loss values are very small.
Figure 7. Test results box diagram. The parts marked in red are the gestures in this class that have the smallest L2 distance compared to the other gestures.
For example, in the first subplot, the registered gesture is labeled as the 0th class. From this plot, it can be observed that the L2 distance between the registered 0th class gesture and other 0th class gestures is the smallest, indicating that there is a high intra-class similarity.
On the other hand, the distance between the registered gesture and gestures from other
classes is larger, indicating a greater inter-class dissimilarity.
Additionally, each class of gestures is well-clustered, showing high intra-class cohesion.
Therefore, when setting the threshold to 40, it is possible to completely distinguish between
gestures of the same class as the registered gesture and gestures from other classes. This
indicates that the FingerNet model is capable of effectively clustering and differentiating
between hand gestures based on the learned gesture embeddings.
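The registration and thresholding idea can be sketched as a nearest-neighbor lookup over stored embeddings. This is an assumption-laden illustration, not the authors' code: the registry layout, the function names, and the two-dimensional toy embeddings are hypothetical, while the threshold of 40 is the value quoted above.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def recognize(embedding, registry, threshold=40.0):
    """Return the closest registered label, or None if none is within threshold."""
    best_label, best_distance = None, float("inf")
    for label, reference in registry.items():
        distance = l2_distance(embedding, reference)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label if best_distance <= threshold else None


# Registering a gesture only stores its embedding; no retraining is involved.
registry = {}
registry["ok"] = (0.0, 0.0)
registry["fist"] = (100.0, 0.0)

match_ok = recognize((3.0, 4.0), registry)    # distance 5 to "ok"
match_none = recognize((50.0, 0.0), registry)  # distance 50 to both entries
```

Adding a new gesture class is then a matter of inserting one more entry into the registry.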
After training the model, real-time hand gesture processing and displays were per-
formed using OpenCV to read images from the camera. The images were processed through
MediaPipe and then passed to FingerNet for hand gesture recognition. The results are
shown in Figure 8.
From this figure, it can be observed that after processing by FingerNet, the model
accurately detected the hand positions and hand gesture classes in real-time. Moreover, the
model was able to recognize new hand gesture classes through the registration process.
The real-time processing and recognition capabilities of FingerNet demonstrate its
effectiveness and practicality in hand gesture recognition applications. The model shows
promising performance in accurately identifying hand gestures and can be extended to
recognize a wide range of hand movements and poses.
In this paper, in order to further validate the effectiveness of the proposed method, we also compared some existing models with the model proposed in this paper on the AUTSL dataset (Table 4), and the results are displayed in accuracy order in the table. Among them, the baseline RGB accuracy comes from the CNN-FPM BiLSTM attention model proposed by Sincan et al. [33], which extracts image features using a CNN, fuses multi-scale dilated convolutions through FPM, collects time-domain features using BiLSTM with the attention mechanism, and finally performs time-domain pooling and gesture classification. The accuracy of the LSTM, Transformer, ST-GCN, and SL-GCN methods comes from the OpenHands framework proposed by Selvaraj et al. [34], which mainly uses pre-training and self-supervised learning to improve the accuracy of the model. The remaining models, FE + LSTM, MViT-SLR, HWGAT, 3D-DCNN + ST-MGCN, SAM-SLR, and STF + LSTM, are from the corresponding papers [28,35–40]. These models reflect the best results on the current AUTSL dataset. From the comparison results, it can be seen that the accuracy of the method proposed in this paper reaches 0.953. Although this is not the best result on this dataset, the method was able to effectively recognize large-scale gestures through multimodal skeleton data.
Table 4. Comparison results between our method and existing methods on AUTSL.
Method Accuracy
Baseline RGB 0.425
Baseline RGB-D 0.632
LSTM 0.774
Transformer 0.810
ST-GCN 0.904
SL-GCN 0.919
FE + LSTM 0.934
MViT-SLR 0.957
HWGAT 0.958
3D-DCNN + ST-MGCN 0.984
SAM-SLR 0.985
STF + LSTM 0.986
FINGER-NET 0.953
The sample recording in the AUTSL dataset is clear, the lighting is uniform, the frame
rate is high, and there is less dynamic blur, making it suitable for skeleton pose estimation.
It can accurately obtain finger positions and recognize gestures using graph convolution
methods. However, in some scenarios, the quality of the samples to be recognized is
relatively low, which has a significant impact on the estimation of body and hand skeletons,
limiting the accuracy of subsequent gesture recognition.
In addition, the videos in the ChaLearn LAP IsoGD gesture dataset were recorded relatively early, the frame rate is relatively low, and many samples suffer from poor lighting and blur, which makes skeleton estimation prone to errors. We also conducted comparative tests of the model on this dataset. Table 5 shows the accuracy comparison results of this method on the dataset.
Table 5. Comparison results between our method and existing methods on ChaLearn IsoGD.
Method Accuracy
Baseline 0.241
Zhu et al. [41] 0.509
Wang et al. [42] 0.555
Li et al. [43] 0.569
FINGER-NET 0.572
At the same time, we also conducted further research on another contribution of this article, the registerable dataset RGDS. We tested existing algorithm models and our model on this smaller-scale dataset and compared them. The accuracy is shown in Table 6.
Table 6. Comparison results between our method and existing methods on RGDS.
Method Accuracy
LSTM 0.574
Transformer 0.610
Wang et al. [42] 0.706
Li et al. [43] 0.745
StepNet [44] 0.821
FINGER-NET 0.878
further promote and improve this model by achieving the registrable recognition of hand
gestures and enhancing the model’s ability to understand complex gestures.
Author Contributions: The authors confirm their contributions to the paper as follows: Concep-
tualization and design of the study: Y.M., H.J., N.D. and H.W.; Data collection: H.J.; Analysis and
interpretation of the results: N.D.; Drafting of the manuscript: Y.M. and H.J. All authors have read
and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: For reasons (the study was not in the biomedical field, the
hand data are not identifiable, and the source of the hand data was only the authors of this study),
ethical review and approval was waived for this study.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: Data is unavailable due to privacy restrictions.
Conflicts of Interest: The authors declare that they have no conflict of interest to report regarding the present study. Haibo Jiang is employed by SIMITECH Co. The company played no role in the design of the study, the collection, analysis, or interpretation of data, the writing of the manuscript, or the decision to publish the article.
References
1. Jiang, D.; Zheng, Z.; Li, G.; Sun, Y.; Kong, J.; Jiang, G.; Xiong, H.; Tao, B.; Xu, S.; Yu, H.; et al. Gesture Recognition Based on
Binocular Vision. Cluster Comput. 2019, 22, 13261–13271. [CrossRef]
2. Feng, Z.; Yang, B.; Zheng, Y.W.; Xu, T.; Tang, H.K. Hand Tracking Based on Behavioral Analysis for Users; China Science Publishing &
Media Ltd.: Beijing, China, 2013; Volume 24, pp. 2101–2116.
3. Rautaray, S.S.; Agrawal, A. Vision based hand gesture recognition for human computer interaction: A survey. Artif. Intell. Rev.
2015, 43, 1–54. [CrossRef]
4. Kaaniche, M.B.; Bremond, F. Recognizing gestures by learning local motion signatures of HOG descriptors. IEEE Trans. Pattern
Anal. Mach. Intell. 2012, 34, 2247–2258. [CrossRef] [PubMed]
5. Weissmann, J.; Salomon, R. Gesture recognition for virtual reality applications using data gloves and neural networks. In
Proceedings of the International Joint Conference on Neural Networks, Washington, DC, USA, 10–16 July 1999; pp. 2043–2046.
6. Yan, S.; Shao, H.D.; Min, Z.S.; Peng, J.J.; Cai, B.P.; Liu, B. FGDAE: A new machinery anomaly detection method towards complex
operating conditions. Reliab. Eng. Syst. Saf. 2023, 236, 109319. [CrossRef]
7. Chen, M.Z.; Shao, H.D.; Dou, H.X.; Li, W.; Liu, B. Data augmentation and intelligent fault diagnosis of planetary gearbox using
ILoFGAN under extremely limited samples. IEEE Trans. Reliab. 2022, 72, 1029–1037. [CrossRef]
8. Wang, Z.J.; Li, Y.J.; Dong, L.; Li, Y.F.; Du, W.H. A RUL Prediction of Bearing Using Fusion Network through Feature Cross
Weighting. Meas. Sci. Technol. 2023, 34, 105908. [CrossRef]
9. Wang, T.Q.; Li, Y.D.; Hu, J.F.; Khan, A.; Liu, L.; Li, C.; Hashmi, A.; Ran, M. A survey on vision-based hand gesture recognition. In
Smart Multimedia; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 11010,
pp. 219–231.
10. Wang, Y.; Zhang, J.; Qin, Y.; Chai, X. Gesture Recognition Method Based on Multi-feature Fusion. J. Syst. Simul. 2019, 31, 346–352.
11. Oikonomidis, I.; Kyriazis, N.; Argyros, A.; Hoey, J.; McKenna, S. Efficient model-based 3D tracking of hand articulations using
Kinect. In Proceedings of the British Machine Vision Conference, Dundee, UK, 29 August–2 September 2011; Volume 2. 11p.
12. Moeslund, T.B.; Hilton, A.; Krüger, V. A survey of advances in vision-based human motion capture and analysis. Comput. Vis.
Image Underst. 2006, 104, 90–126. [CrossRef]
13. Yedder, H.B.; Cardoen, B.; Hamarneh, G. Deep learning for biomedical image reconstruction: A survey. Artif. Intell. Rev. 2021, 54,
215–251. [CrossRef]
14. Suk, H.; Sin, B.; Lee, S. Hand Gesture Recognition Based on Dynamic Bayesian Network Framework. Pattern Recognit. 2010, 43,
3059–3072. [CrossRef]
15. Bécha Kaâniche, M.; Brémond, F. Gesture Recognition by Learning Local Motion Signatures. In Proceedings of the 2010
IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010;
pp. 2745–2752.
16. Baligh, M.; Sabri, A. Arabic online handwriting recognition (AOHR): A survey. ACM Comput. Surv. 2017, 50, 33.
17. Neidle, C.; Thangali, A.; Sclaroff, S. Challenges in Development of the American Sign Language Lexicon Video Dataset (ASLLVD)
Corpus. In Proceedings of the 5th Workshop on the Representation and Processing of Sign Languages: Interactions between
Corpus and Lexicon, LREC 2012, Istanbul, Turkey, 27 May 2012.
18. Camgoz, N.C.; Hadfield, S.; Koller, O.; Ney, H.; Bowden, R. Neural Sign Language Translation. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
19. Zhou, H.; Zhou, W.; Qi, W.; Pu, J.; Li, H. Improving Sign Language Translation with Monolingual Data by Sign Back-Translation.
In Proceedings of the IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN,
USA, 20–25 June 2021.
20. Zhang, L.G.; Wu, J.Q. Hand Gesture Recognition Based on Hausdorff Distance. J. Image Graph. 2002, 7, 1144–1150.
21. Yang, X.W.; Feng, Z.Q.; Huang, Z.Z.; He, N.N. Gesture Recognition Based on Combining Main Direction of Gesture and
Hausdorff-like Distance. J. Comput.-Aided Des. Comput. Graph. 2016, 28, 75–81.
22. Zhu, J.Y.; Wang, X.Y.; Wang, W.X.; Dai, G.Z. Hand Gesture Recognition Based on Structure Analysis. Chin. J. Comput. 2012, 29,
2130–2137.
23. Xin, W.B.; Hao, H.M.; Bu, M.L.; Lan, Y.; Huang, J.H.; Xiong, X.Y. Static gesture real-time recognition method based on ShuffleNetv2-
YOLOv3 model. J. Zhejiang Univ. 2021, 55, 1815–1824.
24. Lin, H.I.; Hsu, M.H.; Chen, W.K. Human hand gesture recognition using a convolution neural network. In Proceedings of the 2014
IEEE International Conference on Automation Science and Engineering, New Taipei, Taiwan, 18–22 August 2014; pp. 1038–1043.
25. Wu, X.Y. A hand gesture recognition algorithm based on DC-CNN. Multimed. Tools Appl. 2020, 79, 9193–9205.
26. Wu, J.; Tian, Q.; Yue, J. Static gesture recognition based on residual dual attention and cross level feature fusion module. Comput.
Syst. Appl. 2022, 31, 111–119.
27. Jiang, S.; Sun, B.; Wang, L.; Bai, Y.; Li, K.; Fu, Y. Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble. arXiv
2021, arXiv:2110.06161.
28. Papadimitriou, K.; Potamianos, G. Sign Language Recognition via Deformable 3D Convolutions and Modulated Graph Convo-
lutional Networks. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
29. Al-onazi, B.B.; Nour, M.K.; Alshahran, H.; Elfaki, M.A.; Alnfiai, M.M.; Marzouk, R.; Othman, M.; Sharif, M.M.; Motwakel, A.
Arabic Sign Language Gesture Classification Using Deer Hunting Optimization with Machine Learning Model. Comput. Mater.
Contin. 2023, 75, 3413–3429. [CrossRef]
30. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; Volume 7, pp. 815–823.
31. Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.L.; Yong, M.G.; Lee, J.; et al.
MediaPipe: A Framework for Building Perception Pipelines. arXiv 2019, arXiv:1906.08172.
32. Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.L.; Grundmann, M. MediaPipe Hands: On-Device
Real-Time Hand Tracking. arXiv 2020, arXiv:2006.10214.
33. Sincan, O.M.; Keles, H.Y. AUTSL: A large scale multi-modal Turkish sign language dataset and baseline methods. IEEE Access 2020,
8, 181340–181355. [CrossRef]
34. Selvaraj, P.; Nc, G.; Kumar, P.; Khapra, M. OpenHands: Making Sign Language Recognition Accessible with Pose-based Pretrained
Models across Languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin,
Ireland, 22–27 May 2022; Volume 1, pp. 2114–2133.
35. Ryumin, D.; Ivanko, D.; Ryumina, E. Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices. Sensors 2023,
23, 2284. [CrossRef]
36. Jiang, S.; Sun, B.; Wang, L.; Bai, Y.; Li, K.; Fu, Y. Skeleton Aware Multi-modal Sign Language Recognition. In Proceedings of the
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June
2021; pp. 3408–3418.
37. Patra, S.; Maitra, A.; Tiwari, M.; Kumaran, K.; Prabhu, S.; Punyeshwarananda, S.; Samanta, S. Hierarchical Windowed Graph
Attention Network and a Large Scale Dataset for Isolated Indian Sign Language Recognition. arXiv 2024, arXiv:2407.14224.
38. Maxim, N.; Leonid, V.; Ruslan, M.; Dmitriy, M.; Iuliia, Z. Fine-tuning of sign language recognition models: A technical report.
arXiv 2023, arXiv:2302.07693.
39. Ryumin, D.; Ivanko, D.; Axyonov, A. Cross-Language Transfer Learning Using Visual Information for Automatic Sign Gesture
Recognition. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, XLVIII-2/W3, 209–216. [CrossRef]
40. De Coster, M.; Van Herreweghe, M.; Dambre, J. Isolated sign recognition from RGB video using pose flow and self-attention. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Nashville, TN, USA, 19–25 June
2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3441–3450.
41. Zhu, G.; Zhang, L.; Mei, L.; Shao, J.; Song, J.; Shen, P. Large-scale Isolated Gesture Recognition using pyramidal 3D convolutional
networks. In Proceedings of the International Conference on Pattern Recognition, Cancun, Mexico, 4–8 December 2016; IEEE:
Piscataway, NJ, USA, 2016; pp. 19–24.
42. Wang, P.; Li, W.; Liu, S.; Gao, Z.; Tang, C.; Ogunbona, P. Large-scale isolated gesture recognition using convolutional neural
networks. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8
December 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 7–12.
43. Li, Y.; Miao, Q.; Tian, K.; Fan, Y.; Xu, X.; Li, R.; Song, J. Large-scale gesture recognition with a fusion of RGB-D data based on the
C3D model. In Proceedings of the International Conference on Pattern Recognition, Cancun, Mexico, 4–8 December 2016; IEEE:
Piscataway, NJ, USA, 2016; pp. 25–30.
44. Shen, X.; Zheng, Z.; Yang, Y. StepNet: Spatial-temporal Part-aware Network for Isolated Sign Language Recognition. ACM Trans.
Multimedia Comput. Commun. 2024, 226, 19. [CrossRef]