
Vietnamese sign language detection using Mediapipe

Khuat Duy Bach, ICT Department, FPT University, Hanoi, Vietnam. Email: [email protected]
Phung Thai Duong, ICT Department, FPT University, Hanoi, Vietnam. Email: [email protected]
Pham Thi Thu Ha, ICT Department, FPT University, Hanoi, Vietnam. Email: [email protected]
Bui Ngoc Anh, ICT Department, FPT University, Hanoi, Vietnam. Email: [email protected]
Ngo Tung Son, ICT Department, FPT University, Hanoi, Vietnam. Email: [email protected]
ABSTRACT

Sign language is the only means of communication for deaf and mute people, who cannot hear or speak. Vietnam has nearly 2.5 million people with hearing and speaking disabilities, while the number of sign language interpreters in Vietnam is tiny. The hearing impaired have the same need for everyday communication, access to information, and public services such as hospitals as ordinary people do. The lack of sign language interpreters and of effective methods to help ordinary people communicate with the hearing impaired calls for a convenient tool that makes sign language accessible to everyone. This paper presents an implementation using a recurrent neural network (RNN) with the Mediapipe hand tracking framework for sign language gesture recognition. Training data is created from input video using multi-hand tracking, and a deep learning model recognizes gestures from hand landmark features per frame through RNN training. The dataset contains gestures for the most common words in Vietnamese. The model produces accurate results in word recognition.

CCS CONCEPTS

• Computing methodologies → Artificial intelligence; Computer vision; Computer vision problems; Object recognition.

KEYWORDS

Sign language recognition, Mediapipe, Recurrent Neural Network, LSTM, Deep Learning

ACM Reference Format:
Khuat Duy Bach, Phung Thai Duong, Pham Thi Thu Ha, Bui Ngoc Anh, and Ngo Tung Son. 2021. Vietnamese sign language detection using Mediapipe. In 2021 10th International Conference on Software and Computer Applications (ICSCA 2021), February 23–26, 2021, Kuala Lumpur, Malaysia. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3457784.3457810

1 INTRODUCTION

1.1 Research Context

Everyday communication is the most difficult problem for deaf and mute people: they can express their thoughts only through sign language. Sign language combines many factors: gestures, hand shapes, body language, and facial expressions. Each sequence of gestures carries a different meaning and can represent a word or a phrase of a common language. Using it, people with hearing and speaking disabilities can convey many messages: letters of the alphabet, digits, and words. However, the many gesture sequences to remember are a hindrance for both deaf-mute people and ordinary people trying to learn. Moreover, sign language differs from place to place around the world, and even sign languages within the same country differ slightly. Effective methods and tools are therefore needed to reduce the struggle and save time and effort when deaf-mute people and ordinary people approach sign language. With the growth of computing power and the improvement of neural networks, many sign language models have appeared and proved their power in gesture recognition and in learning sequences of sign gestures. Neural networks such as convolutional neural networks (CNN) and recurrent neural networks (RNN) have achieved breakthroughs. However, sign language recognition with deep neural networks remains challenging and non-trivial. In this paper, we approach Vietnamese Sign Language using the Mediapipe framework from Google. A simple recurrent neural network (RNN) captures body movement, including the wrist, hand, and fingers, from different poses in the video, which is split into frames by OpenCV. We use Long Short-Term Memory (LSTM) layers followed by a fully connected layer through which the data flows continuously, enriching the network and improving the model's accuracy. The data is encoded as relative positions: we calculate the change of each (x, y) coordinate of the two-hand landmarks per frame and transform it into integers.
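The relative-position encoding can be sketched as follows. This is a minimal illustration, not the authors' code: the (T, 42, 2) frame layout and the scale factor of 1000 are assumptions; the paper only states that per-frame (x, y) coordinate changes of the two-hand landmarks are computed and transformed into integers.

```python
import numpy as np

def landmark_deltas(frames, scale=1000):
    """Turn a sequence of two-hand landmark frames into relative-motion
    features: the change of each (x, y) coordinate between consecutive
    frames, scaled and rounded to integers.

    frames: array of shape (T, 42, 2) -- 21 landmarks per hand, two hands,
    normalized (x, y) image coordinates. Returns shape (T - 1, 84).
    """
    frames = np.asarray(frames, dtype=np.float64)
    deltas = np.diff(frames, axis=0)          # per-frame coordinate changes
    deltas = deltas.reshape(len(deltas), -1)  # flatten to 84 values per step
    return np.rint(deltas * scale).astype(int)

# Toy example: every landmark moves +0.01 in x and y between two frames
t0 = np.zeros((42, 2))
t1 = np.full((42, 2), 0.01)
feats = landmark_deltas(np.stack([t0, t1]))
```

Encoding deltas rather than absolute positions makes the features invariant to where the hands sit in the frame, which matches the paper's motivation for a relative-position approach.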

1.2 Related Works


Sign language comprises different gestures, shapes, and movements of the hand, body, and facial expression. With the help of sign language, deaf and mute people express their thoughts [1]. Each movement of the hand, each gesture, and each facial expression has a unique meaning [2]. The hand plays a vital role in sign language; therefore, recognizing hand gestures is the basis for developing an efficient sign language recognition system [4]. Most methods used to recognize hand gestures are based on vision [3], sensor gloves, or color.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICSCA 2021, February 23–26, 2021, Kuala Lumpur, Malaysia
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8882-5/21/02. . . $15.00
https://doi.org/10.1145/3457784.3457810


Figure 1: Model architecture

A few studies implement the Hidden Markov Model (HMM) and its modifications to build sign language recognition systems. An HMM is a statistical model in which a set of parameters is hidden; the hidden parameters can be inferred from related observed parameters [5]. HMMs are also used in glove-based sign language recognition systems. The Hidden Markov Model is suited to sequential data, and sign language consists of continuous gestures that make up a word or sentence [5]. Wang et al. (2006) [6] used a multi-dimensional Hidden Markov Model for American Sign Language recognition and reached 96.7% accuracy. More recent HMM-based research uses Kinect to create 3D models of the captured gestures: a Kinect-based Hidden Markov Model system (Lee et al., 2014) [7] attained an 85.14% recognition rate.

A Self-Organizing Map was used in an Argentinean Sign Language recognition system (Ronchetti et al., 2016 [8]) based on ProbSom, which classified hand shapes and achieved an accuracy above 90%.

In recent years, there has been growing interest in feature extraction with deep neural networks due to their superior representation capability [9]. The convolutional neural network (CNN) gradually became a familiar tool for recognizing sign language. Huang et al. (2015) [10] used a 3D CNN to interpret sign language into text and speech and reached 94.2% accuracy. Sign language recognition aims at learning the correspondences between input sequences and sign labels with sequence learning models [9]. The recurrent neural network (RNN) is taking HMM's place in learning from continuous time series.

Sarfaraz et al. [11] built a model using a CNN and an RNN to recognize video sequences of Argentinean Sign Language gestures spanning 46 gesture categories. They used the Inception model, a deep convolutional neural network (CNN), to train on spatial features, and recurrent neural networks (RNN) to train on temporal features [11], achieving a high accuracy of 95.2%. This approach is sound and highly accurate. However, their work used additional devices (colored gloves in this case) with the CNN, which reduces its practicality, and using every picture frame as the RNN's input made training take too long and even overfit the model.

1.3 Contributions
Deaf and mute people must use hand sign language to communicate. However, there is a gap between them and the community, especially people who are not familiar with Vietnamese sign language. In this research, we tried to create a model that can translate Vietnamese sign language into regular Vietnamese. This research provides a much more convenient way to capture multi-hand gesture movement than a Convolutional Neural Network (CNN) with OpenCV. By modifying Mediapipe, this approach gains an advantage in hand detection: it needs no additional device, yields more granular and lighter-weight data than raw frames, and is less affected by background.

2 MATERIAL AND METHOD
We tried an approach different from using OpenCV with a CNN or catching the hand skeleton with an RNN. We use 21 landmarks, an RNN, and raBit64's modified version of Mediapipe on GitHub [12] to recognize sign language movement. With this modified Mediapipe, we obtain text files and output hand-tracking videos.

We extract 42 landmark pairs (21 × 2 hands) for each frame and combine them into one text file. To do so, we find the corresponding file and modify the Mediapipe code, which means we have to use videos as inputs instead of a data stream such as a webcam (Mediapipe is optimized for real-time detection, but we wanted it to create a dataset from video input). To make the output data ready for the RNN model to learn, we extract one text file per video (one word each) for a number of videos, then combine the text files for every word and label them into a pkl file via an automatically generated Python shell script. Mediapipe does not provide a file to extract landmarks automatically; the landmarks are only used as intermediate values inside the graph pipeline. We therefore replace the landmarks_to_render_data_calculator.cc file with raBit64's modified version. The default input of Mediapipe is a webcam, so we must use raBit64's build.py Python shell script to automatically extract processed mp4 video and text data files. We then use an LSTM architecture for the classification task, as shown in Figure 1, where LSTM 1 to 3 denote layers of 256, 128, and 64 units, respectively.
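A minimal sketch of the classifier described above, using tf.keras. The three stacked LSTM layers of 256, 128, and 64 units follow Figure 1, and the 84-feature input width (42 landmark pairs per frame) follows the text; the number of word classes and the choice of plain SGD to stand in for gradient descent are placeholder assumptions, not details from the paper.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

NUM_FEATURES = 84   # 42 landmark pairs * (x, y) values per frame
NUM_WORDS = 10      # placeholder: number of sign-word classes in the dataset

def build_model(timesteps=None):
    """LSTM 1-3 with 256, 128, and 64 units (Figure 1), followed by a
    fully connected softmax layer over the word labels."""
    model = keras.Sequential([
        keras.Input(shape=(timesteps, NUM_FEATURES)),
        layers.LSTM(256, return_sequences=True),
        layers.LSTM(128, return_sequences=True),
        layers.LSTM(64),
        layers.Dense(NUM_WORDS, activation="softmax"),
    ])
    model.compile(optimizer="sgd",  # plain gradient descent as a stand-in
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model(timesteps=30)
```

Training would then be a call such as `model.fit(X, y, epochs=50)`, matching the 50-epoch setup reported in the experiments.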


Figure 2: Mediapipe capturing movement in different areas of the frame.

Figure 3: Mediapipe mistakes sleeve wrinkles for a hand, and misses capturing the right hand's movement.

3 EXPERIMENTS

3.1 Dataset
Data was collected by team members, as shown in Figure 2. The model is trained on different poses and hand-tracking positions: in the middle, and far from the middle to the left and right (25% in weight for each data position). We chose data samples with high variation to estimate the efficiency of Mediapipe hand detection, so that we can judge whether this approach is effective or not.

3.2 Results
The accuracy of this research mostly depends on the Mediapipe framework (Figure 3): the more accurate Mediapipe's hand recognition, the better the result. The variation of position and the data size also greatly impact the accuracy of Mediapipe hand detection. We conducted this research with two sets of data of different sizes to estimate the model's efficiency.

In some cases, video recorded far from the frame center makes Mediapipe unable to detect hand movement correctly. The preprocessed data was not as good as we expected, so the input data was inferior. This makes the labels messy, since different words share the same undetected or unwanted points. As this is a disadvantage of Mediapipe itself, we tried to create more standard data in the next experiment.

After training for 50 epochs, the accuracy and loss of our model are shown by the graphs in Figure 4. Using gradient descent, the model loss drops from 2.25 to 0.75 after only 23 epochs (Figure 4, right side) and does not improve after that. On the left side of Figure 4, the validation accuracy fluctuates throughout the first 30 epochs and does not improve afterwards: the model overfits slightly. This could easily be improved by removing some unwanted data and adding more valid data from the preprocessing steps.

Figure 5 shows the model's accuracy on the test data. The overall accuracy reaches 0.635. "AmAp" and "AoThuat" are words whose accuracy is below 0.5.
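The per-word accuracies discussed here correspond to the diagonal of a confusion matrix like the one in Figure 5. The sketch below (not the authors' evaluation code; the labels are toy values) shows how such a matrix and the per-class and overall accuracies are computed:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true word labels, columns are predicted word labels."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_accuracy(cm):
    # Fraction of each word's test samples predicted correctly
    # (diagonal divided by row sums).
    return np.diag(cm) / cm.sum(axis=1)

# Toy example with 3 word classes
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]
cm = confusion_matrix(y_true, y_pred, 3)
acc = per_class_accuracy(cm)        # per-word accuracies
overall = np.trace(cm) / cm.sum()   # overall accuracy
```

Off-diagonal mass in a row shows which words a sign is confused with, which is how the "AoThuat"/"HomNay" confusions described below can be read off Figure 5.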


Figure 4: The accuracy and loss values generated by the model at each epoch.


Figure 5: Confusion matrix on test data.

These low-accuracy words have movements similar to that of the word "HomNay". What these three words have in common is a simple movement at a low position in the Mediapipe coordinate frame. "AoThuat" is the most similar to the others, so it is the word most often confused with them. Words with more complex and distinctive movements return much higher accuracy, for example "CauHoi" (0.75), "Ban" (0.75), "AnhTrai" (0.8), and "AoDai" (0.775).

In this experiment the accuracy improved markedly. This shows the huge potential of this approach for translating Vietnamese sign language into formal language, even though words with similar movements are still misrecognized.

REFERENCES
[1] Adithya V., Vinod P. R., Usha Gopalakrishnan. "Artificial Neural Network Based Method for Indian Sign Language Recognition". IEEE Conference on Information and Communication Technologies (ICT), 2013, pp. 1080-1085.
[2] Rajam, P. Subha and G. Balakrishnan. "Real Time Indian Sign Language Recognition System to aid Deaf and Dumb people". 13th International Conference on Communication Technology (ICCT), 2011, pp. 737-742.
[3] G. R. S. Murthy, R. S. Jadon. "A Review of Vision Based Hand Gestures Recognition". International Journal of Information Technology and Knowledge Management, vol. 2(2), 2009, pp. 405-410.
[4] P. Garg, N. Aggarwal and S. Sofat. "Vision Based Hand Gesture Recognition". World Academy of Science, Engineering and Technology, vol. 49, 2009, pp. 972-977.
[5] "A Survey of Hand Gesture Recognition Methods in Sign Language Recognition". Pertanika Journal of Science and Technology, 2018.
[6] Wang et al. "American Sign Language Recognition Using Multi-dimensional Hidden Markov Models". Journal of Information Science and Engineering 22, 2006, pp. 1109-1123.
[7] Lee et al. "Kinect-based Taiwanese sign-language recognition system". Springer Science+Business Media New York, 2014.
[8] Ronchetti et al. "Handshape recognition for Argentinian Sign Language using ProbSom". Instituto de Investigación en Informática LIDI, Facultad de Informática, Universidad Nacional de La Plata, 2016.
[9] Runpeng Cui, Hu Liu, Changshui Zhang. "Recurrent Convolutional Neural Networks for Continuous Sign Language Recognition by Staged Optimization".
[10] Huang et al. "Sign Language Recognition Using 3D Convolutional Neural Networks". 2015.
[11] Sarfaraz et al. "Real-Time Sign Language Gesture (Word) Recognition from Video Sequences Using CNN and RNN".
[12] raBit64. https://github.com/rabBit64/Sign-language-recognition-with-RNN-and-Mediapipe

