Vietnamese Sign Language Detection Using Mediapipe
A few studies implement the Hidden Markov Model (HMM) and its modifications to build a Sign Language Recognition system. HMM is a statistical model in which a set of parameters is hidden; the hidden parameters can be acquired from related observation parameters [5]. HMM is also used in glove-based Sign Language Recognition systems, since it is suited to sequential data and Sign Language consists of continuous gestures that make up a word or sentence [5]. Wang et al. (2006) [6] used a Multi-Dimensional Hidden Markov Model for American Sign Language Recognition and achieved 96.7% accuracy. The most recent HMM-based research uses Kinect to create 3D models of the captured gestures: a Kinect-based Hidden Markov Model system (Lee et al., 2014) [7] attained an 85.14% recognition rate.

A Self-Organizing Map was used in an Argentinean Sign Language Recognition system (Ronchetti et al., 2016 [8]); using ProbSom to classify hand shapes, it achieved an accuracy above 90%.
In recent years, there has been growing interest in feature extraction with deep neural networks due to their superior representation capability [9]. The Convolutional Neural Network (CNN) gradually became a familiar tool for Sign Language recognition. Huang et al. (2015) [10] used a 3D CNN to interpret Sign Language into text and speech and reached 94.2% accuracy. Sign language recognition aims at learning the correspondences between input sequences and sign labels with sequence learning models [9], and the Recurrent Neural Network (RNN) is taking HMM's place in sequence learning from continuous time series.
Sarfaraz et al. [11] built a model using CNN and RNN to recognize video sequences of Argentinean Sign Language gestures covering 46 gesture categories. They used the Inception model, a deep convolutional neural network (CNN), to train on spatial features and a recurrent neural network (RNN) to train on temporal features [11], achieving a high accuracy of 95.2%. This model is a good approach and highly accurate. However, their work pairs the CNN with additional devices (colored gloves in this case), which reduces its practicality, and using every picture frame as the RNN's input made training take too long and even overfit the model.

1.3 Contributions
Deaf and mute people must use hand sign language to communicate. However, there is a gap between them and the community, especially people who are not familiar with Vietnamese sign language. In this research, we tried to create a model that can translate Vietnamese sign language into regular Vietnamese. This research provides a much more convenient way to capture multi-hand gesture movement than a CNN combined with OpenCV. By modifying Mediapipe, this approach gains an advantage in hand detection: it needs no additional device, yields more granular and lighter-weight data than raw frames, and is less affected by the background.

2 MATERIAL AND METHOD
We tried an approach different from using OpenCV with a CNN or catching the hand skeleton with an RNN: we use 21 landmarks, an RNN, and raBit64's modified Mediapipe version on GitHub [12] to recognize sign language movement. With this modified Mediapipe, we get text files and output hand-tracking videos.

To extract the 42 landmark pairs (21 per hand × 2 hands) of each frame and combine them into one text file, we find the corresponding source file and modify the Mediapipe code, so we have to use videos as inputs instead of a data stream such as a webcam (Mediapipe is optimized for real-time detection, but we wanted it to create a data set from video input). To make the output data ready for the RNN model to learn, we extract one text file per video (one word) for a number of videos, then combine the text files for every word and label them into a pkl file, which a Python shell script does automatically. Mediapipe does not provide a file to extract landmarks automatically; landmarks are only used as intermediate values inside the graph pipeline. We therefore have to replace the landmarks_to_render_data_calculator.cc file with raBit64's modified version. The default input of Mediapipe is a webcam, so we must use raBit64's build.py Python shell script to automatically extract the processed mp4 video and text data files.
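The exact text format depends on raBit64's calculator, which we do not reproduce here; the following is only a minimal sketch of the packing step, assuming one folder per word, one text file per video, and one line of 84 whitespace-separated coordinates (42 landmark pairs) per frame. All paths and helper names are hypothetical.

```python
import os
import pickle
import numpy as np

# Assumed layout: data/<word_label>/<video>.txt, where each line of a text
# file holds one frame's 42 landmark pairs (21 per hand x 2 hands) as
# whitespace-separated x, y floats -- 84 numbers per frame in total.
DATA_DIR = "data"
FEATURES_PER_FRAME = 42 * 2  # 42 landmark pairs -> 84 coordinates

def load_video_txt(path):
    """Parse one per-video text file into a (num_frames, 84) float array."""
    frames = []
    with open(path) as f:
        for line in f:
            values = [float(v) for v in line.split()]
            if len(values) == FEATURES_PER_FRAME:
                frames.append(values)
    return np.array(frames, dtype=np.float32)

sequences, labels = [], []
for word in sorted(os.listdir(DATA_DIR)):           # one folder per word
    word_dir = os.path.join(DATA_DIR, word)
    if not os.path.isdir(word_dir):
        continue
    for name in os.listdir(word_dir):
        if name.endswith(".txt"):
            sequences.append(load_video_txt(os.path.join(word_dir, name)))
            labels.append(word)

# Pack every labelled landmark sequence into a single pkl file for the RNN.
with open("dataset.pkl", "wb") as f:
    pickle.dump({"sequences": sequences, "labels": labels}, f)
```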
We then use an LSTM architecture for the classification task, as shown in Figure 1, where LSTM layers 1 to 3 have 256, 128, and 64 units respectively.
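The paper specifies only the layer sizes, so the following Keras sketch of the Figure 1 classifier is an approximation: the fixed sequence length, the number of classes, and the plain-SGD optimizer (matching the gradient descent mentioned in the results) are assumptions.

```python
from tensorflow.keras import layers, models

MAX_FRAMES = 100  # assumed fixed sequence length after padding
FEATURES = 84     # 42 landmark pairs (x, y) per frame
NUM_WORDS = 12    # assumed size of the sign-word vocabulary

model = models.Sequential([
    layers.Input(shape=(MAX_FRAMES, FEATURES)),
    # LSTM 1 to 3 with 256, 128, and 64 units, as in Figure 1.
    layers.LSTM(256, return_sequences=True),
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(64),
    # Softmax output over the vocabulary of sign words.
    layers.Dense(NUM_WORDS, activation="softmax"),
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```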
Figure 3: Mediapipe mistakes sleeve wrinkles for a hand; Mediapipe misses capturing the right-hand movement.
3 EXPERIMENTS

3.1 Dataset
The data were collected by team members, as shown in Figure 2. The model is trained on different poses and hand-tracking positions: in the middle, far from the middle, on the left, and on the right (25% of the data for each position). We chose data samples with high variation to estimate the efficiency of Mediapipe hand detection, so that we can judge whether this approach is effective.

3.2 Results
The accuracy of this research depends mostly on the Mediapipe framework (Figure 3): the more accurate Mediapipe's hand recognition, the better the result. The variation of position and the data size also greatly impact the accuracy of Mediapipe hand detection. We therefore conducted this research with two sets of data of different sizes to estimate the model's efficiency.

In the cases shown in Figure 3, video recorded far from the frame center leaves Mediapipe unable to detect hand movement correctly during preprocessing. The preprocessed data were not as good as we expected, so the input data were inferior. This makes the labels messy, since different words end up with the same undetected or unwanted points. Since this is a disadvantage of Mediapipe, we tried to create more standardized data in the next experiment.

After training for 50 epochs, the accuracy and loss of our model are represented by the graphs in Figure 4. Using gradient descent, the model loss fell from 2.25 to 0.75 after only 23 epochs (Figure 4, right side) and did not improve after that. On the left side of Figure 4, the validation accuracy converged quickly over the first 30 epochs and did not improve afterwards. The model overfits slightly; this could easily be improved by removing some unwanted data and adding more valid data in the preprocessing steps.
Figure 4: The accuracy and loss values generated by the model at each epoch.
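A sketch of the 50-epoch training run behind Figure 4, continuing the two sketches above; the loading and padding, the shuffled 80/20 split, and the batch size are assumptions, not the authors' exact setup.

```python
import pickle
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the packed dataset and pad every video to a fixed number of frames.
with open("dataset.pkl", "rb") as f:
    data = pickle.load(f)
class_names = sorted(set(data["labels"]))
y = np.array([class_names.index(label) for label in data["labels"]])
X = pad_sequences(data["sequences"], maxlen=MAX_FRAMES,
                  dtype="float32", padding="post")

# Shuffled 80/20 train/validation split (assumed).
idx = np.random.permutation(len(X))
X, y = X[idx], y[idx]
split = int(0.8 * len(X))
history = model.fit(X[:split], y[:split],
                    validation_data=(X[split:], y[split:]),
                    epochs=50, batch_size=16)

# Plot the two panels of Figure 4: accuracy (left) and loss (right).
fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
ax_acc.plot(history.history["accuracy"], label="train")
ax_acc.plot(history.history["val_accuracy"], label="validation")
ax_acc.set_title("Model accuracy")
ax_acc.legend()
ax_loss.plot(history.history["loss"], label="train")
ax_loss.plot(history.history["val_loss"], label="validation")
ax_loss.set_title("Model loss")
ax_loss.legend()
plt.show()
```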
Figure 5 shows how accurate the model is on the test data. The overall accuracy reaches 0.635. "AmAp" and "AoThuat" are words whose accuracy is lower than 0.5, since their movements are similar to the word "HomNay". What these three words have in common is that they have simple movements performed low in the Mediapipe coordinate system. "AoThuat" is the most similar to the others, so it is misrecognized as other words most often. Words with more complex and distinctive movements return much higher accuracy, for example "CauHoi" (0.75), "Ban" (0.75), "AnhTrai" (0.8), and "AoDai" (0.775).

Figure 5: Confusion matrix on the test data

In this experiment, accuracy improved considerably. This shows the huge potential of this approach for translating Vietnamese sign language into formal language, even though there is still misrecognition between words that have similar movements.
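The paper shows only the resulting matrix, but per-word accuracies like those quoted above can be read off the diagonal of a row-normalised confusion matrix. A scikit-learn sketch follows, with X_test, y_test, model, and class_names assumed from the sketches above.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Predict on the held-out test sequences and build the confusion matrix.
y_pred = model.predict(X_test).argmax(axis=1)
cm = confusion_matrix(y_test, y_pred)

# Per-word accuracy is the diagonal of the row-normalised matrix,
# e.g. values such as "CauHoi" 0.75 or "AnhTrai" 0.8 in the text.
per_word_acc = cm.diagonal() / cm.sum(axis=1)
for name, acc in zip(class_names, per_word_acc):
    print(f"{name}: {acc:.3f}")

plt.imshow(cm, cmap="Blues")
plt.xticks(range(len(class_names)), class_names, rotation=90)
plt.yticks(range(len(class_names)), class_names)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.title("Confusion matrix on test data")
plt.show()
```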
REFERENCES
[1] G. Adithya V., P. R. Vinod, Usha Gopalakrishnan, "Artificial Neural Network Based Method for Indian Sign Language Recognition", IEEE Conference on Information and Communication Technologies (ICT), 2013, pp. 1080-1085.
[2] P. Subha Rajam and G. Balakrishnan, "Real Time Indian Sign Language Recognition System to aid Deaf and Dumb people", 13th International Conference on Communication Technology (ICCT), 2011, pp. 737-742.
[3] G. R. S. Murthy, R. S. Jadon (2009). "A Review of Vision Based Hand Gestures Recognition", International Journal of Information Technology and Knowledge Management, vol. 2(2), pp. 405-410.
[4] P. Garg, N. Aggarwal and S. Sofat (2009). "Vision Based Hand Gesture Recognition", World Academy of Science, Engineering and Technology, vol. 49, pp. 972-977.
[5] "A Survey of Hand Gesture Recognition Methods in Sign Language Recognition", Pertanika Journal of Science and Technology, 2018.
[6] Wang et al. (2006). "American Sign Language Recognition Using Multi-dimensional Hidden Markov Models", Journal of Information Science and Engineering 22, pp. 1109-1123.
[7] Lee et al. (2014). "Kinect-based Taiwanese sign-language recognition system", Springer Science+Business Media New York, 2014.
[8] Ronchetti et al. (2016). "Handshape recognition for Argentinian Sign Language using ProbSom", Instituto de Investigación en Informática LIDI, Facultad de Informática, Universidad Nacional de La Plata.
[9] Runpeng Cui, Hu Liu, Changshui Zhang. "Recurrent Convolutional Neural Networks for Continuous Sign Language Recognition by Staged Optimization".
[10] Huang et al. (2015). "Sign Language Recognition Using 3D Convolutional Neural Networks".
[11] Sarfaraz et al. "Real-Time Sign Language Gesture (Word) Recognition from Video Sequences Using CNN and RNN".
[12] raBit64. https://github.com/rabBit64/Sign-language-recognition-with-RNN-and-Mediapipe