
Proceedings of the Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV 2021).
IEEE Xplore Part Number: CFP21ONG-ART; ISBN: 978-0-7381-1183-4
2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV) | 978-1-6654-1960-4/20/$31.00 ©2021 IEEE | DOI: 10.1109/ICICV50876.2021.9388569

A Survey on Deep Learning based Lip-Reading Techniques
Sheetal Pujari, Sneha SK, Vinusha R, Yashaswini C
Department of Computer Science and Engineering, Nitte Meenakshi Institute of Technology, Bangalore, India
[email protected], [email protected], [email protected], [email protected]

Bhuvaneshwari P
Assistant Professor, Department of Computer Science and Engineering, Nitte Meenakshi Institute of Technology, Bangalore, India
[email protected]

Abstract— Lip reading is a way of using skills and knowledge to understand a speaker's verbal communication by visually interpreting the lip movements of the speaker. This becomes a tedious task when there are obstructions or background noise in the data. Over the past decade, with the rise of deep learning, there has been an increasing demand for developing systems that can help mankind by converting speech to text. Lip-reading systems are intelligent automated systems which, on looking at the face of the user, try to comprehend what he/she is trying to convey and represent it visually. This process can be done by using various deep learning algorithms to detect the face, localize the lips, extract the features, train a classifier to detect the lip movement and finally convert it into text. This survey paper presents a summary of various deep learning techniques, different datasets and adopted methodologies, highlighting their performance and limitations in speech and vision applications.

Keywords— deep learning, lip reading, intelligent automated systems, converting speech to text

I. INTRODUCTION

Communication is fundamental to the existence and survival of humans, allowing them to express their ideas and feelings and reach a common understanding. Lip reading plays a supreme role in grasping human speech, particularly for listeners with hearing impairment. It serves as a hearing aid for the hearing impaired, particularly when interacting with people who have no knowledge of sign language. For building such systems, various techniques and approaches have been adopted by many researchers. As Artificial Intelligence performs well in the majority of domains, this is where deep learning comes into play in the field of automatic lip-reading.

Deep learning is a subset of machine learning concerned with algorithms that mimic the structure and function of the human brain, called artificial neural networks. Deep learning models are trained using a large set of labeled data and multi-layered neural network architectures, which can be visualized as a set of nodes, each of which decides based on the inputs to the node. These algorithms learn gradually about the image as it goes through each neural network layer: initial layers learn to detect low-level features, and following layers combine features from the initial layers into a more holistic representation. In contrast to traditional machine learning techniques, deep learning classifiers are trained through automatic feature learning.

ALR (Automatic Lip Reading) [13][18] systems are comparatively behind when compared to Automatic Speech Recognition (ASR). In automatic lip reading, we use different approaches to preprocess the raw data, extract the features and train the model in such a way that speech is converted to text through visual cues [4]. Just like speech recognition, these lip-reading systems encounter several challenges, which may take the form of background noise, color of the skin, the intensity of a person's speech and many more, and which can be dealt with using various deep learning algorithms.

Communication is the only key to survival, and we personally wanted that to be possible for everyone in the world. This motivated us to do a survey on the various deep learning techniques existing for a lip-reading model which helps the hearing impaired to better understand the speaker.


The paper is organized as follows: Section 2 presents the literature survey of the analogous work. Section 3 describes the characteristics of the lip-reading datasets generally used and outlines the process flow. Section 4 traces out some of the deep learning network models used for the implementation of lip-reading systems, followed by the overall summary of our work.

II. LITERATURE SURVEY

In this section, we have explored the performance and limitations of distinct deep learning algorithms on various datasets (Table I).

TABLE I. SUMMARY OF STATE-OF-THE-ART TECHNIQUES IN DEEP LIP READING

| S.No | Author | Dataset | Techniques | Performance | Limitation |
| 1 | Jake Burton [1] | TCD-TIMIT | HMM, CNN, LSTM | 69.58% | Accuracy was less when compared to speaker-dependent machines. HMM did not train some of the data due to insufficient occurrences of phonemes. |
| 2 | Abhinav Thanda [2] | GRID audio-visual database | Bi-directional RNN, connectionist temporal classification | - | Attempts at visual feature engineering using unsupervised methods like the multi-modal autoencoder did not yield a notable solution. |
| 3 | Dilip Kumar Margam [3] | GRID and Indian English dataset | 3D-2D-CNN-BLSTM and Connectionist Temporal Classification (CTC) | - | The GRID dataset performed better than the Indian English dataset. |
| 4 | Fenghour, S. [4] | BBC-LRS2 | Viseme classifiers and attention-based transformer | 64.6% | Accuracy dropped after conversion of visemes to words. |
| 5 | Mathulaprangsan [6] | XM2VTS | Gaussian Mixture Model | - | Classification is entirely dependent on the extracted features. |
| 6 | Fung, I. and Mak, B. [7] | Ouluvs2 10-phrase task | CNN together with BLSTM with a fusion of maxout units | 87.6% | The model performance decreased when more than 2 dimensions were used. |
| 7 | Wand, M. [8] | GRID corpus | LSTM | 85% to 95% | The lip indication for alike words was one of the most common problems faced by the researchers. |
| 8 | Kastaniotis, D. [9] | CUAVE and VPR | VPR, HMM, ASR, RBM, DBN | 45.63% | The lip gesture for similar words was one of the most common problems faced by the researchers; as this was one of the main aspects, it is the limitation of the proposed model. |
| 9 | Yuanyao Lu et al. [11] | Custom audio-visual database | Fusion of CNN and Bi-LSTM neural network architecture | The BiLSTM achieved a greater accuracy than LSTM | Yet to explore the proposed method in a practical environment with a vast database. |
| 10 | Abderrahim Mesbah [15] | AVLetters, Ouluvs2, BBC-LRW | Heterogeneous Convolutional Neural Networks | 90.86% on the BBC LRW dataset | Accuracy was low on the AVLetters dataset. |
| 11 | N. Puviarasan [16] | Recorded videos of Indian speakers | HMM with DCT and DWT features | DCT 91%, DWT 97% | HMM with DWT gave better performance than HMM with DCT. |
| 12 | Apurva H. Kulkarni [17] | IBM ViaVoice, CUAVE, AVLetters2 | CNN+BLSTM, DNN and LSTM | 87.6% | Some challenges are: characters producing the same sounds (e.g. buy, by), and decoding speech in multi-view scenarios. |

III. EXPERIMENTAL DATABASE

In this section we discuss the prominent datasets used by researchers in lip reading, along with their properties.


A. Datasets

Fig. 1 depicts the sample MIRACL-VC1 [10] dataset, which comprises both depth and color images (640 x 480 pixels) of fifteen speakers who were placed in front of the MS Kinect sensor. They were made to pronounce a collection of 10 words along with 10 phrases (Table II) 10 times each, which led to a total of 3000 instances.

TABLE II. THE SAMPLE WORDS FROM MIRACL-VC1 DATASET

Fig. 1 Instances of MIRACL-VC1 dataset [10]

The MIRACL-VC1 corpus has two divisions, namely the cropped and dataset folders. The cropped folder contains the region of interest, which is the lip region, whereas the dataset folder contains the whole image of the speaker.

Lip Reading in the Wild [12] is a dataset that has a set of phrases and sentences spoken by talking faces, which makes it challenging to capture the words spoken by the people at their own speed. It contains up to 1000 utterances of 500 distinct words, which are spoken by hundreds of distinct speakers (excerpts from BBC TV). All the videos in this dataset are 29 frames in length, and the word is generally uttered or caught in the middle portion of the video.

The LRW dataset, depicted in Fig. 2, can be downloaded only after signing a Data Sharing Agreement with BBC Research and Development; only after the approval can the dataset be downloaded, which generally takes a lot of time.

Fig. 2 The sample of speakers in LRW dataset [20]

Ouluvs2 is a low-resource dataset/corpus. The dataset is divided into 3 different parts, 5 views and 52 subjects, wherein 13 are female and 39 are male. It also includes videos with head motion from front to left, from front to right and right to left, 5 times for each person, without any utterance included.

Fig. 3 One second clip of speaker uttering the word 'about' [15][20]

Fig. 3 shows a one-second clip of a male speaker pronouncing the word 'about'; the one-second clip is divided into various frames [14][20]. The OuluVS2 dataset proposed in [21] contains thousands of videos recorded using six cameras from different angles, which makes the Ouluvs2 dataset the most suitable one for a lip-reading setup.

In the GRID audio-visual dataset, each sentence has a fixed structure with six words: "Command + Color + Preposition + Letter + Digit + Adverb". For example, "set black into x four please". The dataset consists of 51 distinct words, which include 4 Commands, 4 Colors, 4 Prepositions, 25 Letters, 10 Digits and 4 Adverbs. In GRID, each sentence is a randomly chosen combination of these words, and the duration of each sentence utterance is 3 seconds. GRID contains a 51-word vocabulary, 33000 utterances and 33 speakers.

Table III shows the sample phrases used in the Ouluvs2 dataset; these are just general phrases which a person might use every day. Table IV depicts a simple comparison among the various datasets in terms of their number of speakers, utterances and accuracy.

TABLE III. LIST OF DIFFERENT PHRASES IN THE OULUVS2 DATASET

| Sl. No | Phrase Used |
| 1 | Hello |
| 2 | How are you doing |
| 3 | Thanks |
| 4 | Nice to meet you |
| 5 | See you |
| 6 | I am sorry |
| 7 | Have a good time |
| 8 | You are welcome |
| 9 | Excuse me |
| 10 | Goodbye |

TABLE IV. COMPARISON AMONG VARIOUS DATASETS

| Database | Speakers | Utterances | Accuracy | Year |
| MIRACL-VC1 [25] | 15 | 10 words, 10 phrases | 71.15% | 2014 |
| Lip Reading in the Wild [4] | 100's | 1000 utterances of 500 words | 64.6% | 2016 |
| GRID [8] | 33 | 3300 utterances | 80% | 2006 |
| Ouluvs2 [15] | 52 | 10000 | 93.72% | 2015 |

IV. METHODOLOGY

A. Data Preprocessing

Data preprocessing is the first step in the automatic lip-reading process (Fig. 4). Preprocessing has a large impact on the entire process because it enhances the performance of the algorithm. The main challenges faced during lip reading are:

1. Presence of background noise and variation in the intensity of speech.
2. Color of the skin.
3. Sometimes lips cannot be detected because of the presence of a moustache.
4. Characters which have the same pronunciation sound, also known as phonemes.

Fig. 4 Process flow of auto lip-reading

Input data, which is in the form of raw video, is fed to the system and contains a lot of noise. By using preprocessing techniques, the unwanted data specified above are removed. There are several factors which disturb the performance and accuracy of image processing techniques, such as pose, scaling, background noise, facial hair, rotation of the head, etc. This leads to a focus on the specific features that play a vital role in converting speech to text. A number of preprocessing techniques are performed on the datasets before actually training them, to cut out the interference of irrelevant factors for better recognition performance and finer outcomes. If the dataset used for preprocessing contains video, the videos are first sampled into evenly sized image frames; after this the facial landmarks are located and sent as an input to the neural network for further processing. The further processing here refers to the usage of the Single Shot MultiBox Detector (SSD) [4], which is a CNN-based detector used for detecting the face and recognizing facial landmarks. Thus, data preprocessing is required because the data consists of plenty of background information which is useless for the lip-reading task.
of speech.
B. Feature Extraction

The next most crucial part of lip reading is feature extraction. While the face can be detected easily with any of the cascade algorithms [16], as shown in Fig. 5, the feature extraction step, which is extracting the lip region, is a tedious task. However, in any standard face the mouth will be in the lower half of the face, so the region of interest can be set by gradually reducing the width and length. Lips are the most deformable part of the face, thus making them even harder to detect. After preprocessing, localizing the lips and detecting the lip movement using appropriate techniques are applied. Here the rough region of the mouth is estimated relative to the face.
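A minimal sketch of that lower-half heuristic is given below; it assumes the input is a NumPy image of the detected face, and the trim fractions are illustrative assumptions rather than values reported in the paper.

```python
def rough_mouth_region(face_img):
    """Rough mouth ROI: keep the lower half of the detected face and trim the sides."""
    h, w = face_img.shape[:2]
    return face_img[h // 2:, int(0.2 * w):int(0.8 * w)]   # illustrative fractions
```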

The main methods [5] used for any lip detection can be
classified as follows:


• Image-Based: This technique makes use of information like the intensity, corners, motion, edges and sometimes the pixel color.

• Model-Based: Models like Active Shape Models (ASM), Active Appearance Models and Hidden Markov Models are examples.

Whenever a fresh image is sent as an input to the network, the ASM starts searching for the lips. These models are generally built during the training phase, wherein different images along with their corresponding facial landmarks and coordinates are provided while the model is built. This model is not feasible for developing a lip-reading system as it is person specific and highly dependent on the person, so the speaker dependency is high here. Other traditional, familiar models used for recognition of lip movement, the Hidden Markov Model (HMM) [5] and the Gaussian Mixture Model (GMM), are analyzed in detail in [5]. HMMs are basically statistical models suited to features carrying temporal information, whereas GMMs are more appropriate for physiological features that deal with statistical data. Lombardi et al. described a hybrid model of the Active Markov Model (AMM) [5] and HMM. The AMM is used for detection of the facial features and feature points from faces directly, whereas the HMM, which is a statistical model, is used for lip recognition. The HMM is based on the Baum-Welch forward-backward re-estimation algorithm. The AMM model appeared to be more promising when compared to the HMM because it was more consistent in detecting non-speech sections which contained complex lip movements.

Fig. 5 Extracting Region of Interest [11]

The two operations which need to be performed are:

1. Extracting the ROI:
Training all the images together on a neural network is time consuming, so it would be better and more accurate if we cropped the images accordingly. Using the facial landmark detector of the dlib library, the lip coordinates can be obtained and the region extracted as shown in Fig. 6. All such frames which contain just the lip region can be added into an array and sent to the neural network, or the frames can simply be concatenated and given as an input to the model.

Fig. 6 The detection of ROI

2. Concatenation of Frames:
We concatenate the extracted frames to form a sequence as shown in Fig. 7. Every person has a different intensity of speech and thus the number of frames for a word varies from person to person; to combat this we can extend the short sequences to match the other frames accordingly.

Fig. 7 The Array of Concatenated frames [10]
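To make these two operations concrete, the sketch below extracts the lip ROI with dlib's 68-point facial landmark detector (the mouth corresponds to landmark indices 48-67) and then pads short sequences to a fixed length. It is an illustrative sketch only: the landmark model file must be downloaded separately, and the ROI size and target sequence length are assumed values, not figures from the surveyed papers.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Standard 68-point dlib landmark model (downloaded separately).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_roi(gray_frame, size=(64, 64)):
    """Crop the lip region using dlib landmarks 48-67 (the mouth outline)."""
    faces = detector(gray_frame)
    if len(faces) == 0:
        return None
    shape = predictor(gray_frame, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                   dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    return cv2.resize(gray_frame[y:y + h, x:x + w], size)

def pad_sequence(frames, target_len=30):
    """Extend short sequences by repeating the last frame so every word has equal length."""
    frames = list(frames)
    while len(frames) < target_len:
        frames.append(frames[-1])
    return np.stack(frames[:target_len])   # (target_len, H, W) array fed to the network
```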

Amit Garg et al. [10] used the MIRACL-VC1 dataset in their implementation. To detect and extract faces from the images, the face-detector module in OpenCV was used, and each image was resized to 128 x 128. Since the dataset was small, each image from the dataset was further augmented to create more instances with different properties from a single frame.

In another approach [11], a custom audio-visual database with a frame rate of 25-30 frames/s was used. These frames carried redundant details, which led to an acute increase in computation and complexity during processing. With the intention of removing the interference of several ambiguities in the database, the sequence of lip movements was processed and the frames containing the region of interest


were extracted. Initially the sample video was divided into ten equal segments. Then the central frame of each segment was extracted. After the extraction of the key frames, OpenCV was used to detect faces. In addition, 5 key points of the mouth are tracked, which are used to locate the center of the mouth and finally obtain the lip region.
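A minimal sketch of that key-frame step is shown below, assuming OpenCV video I/O; the segment count follows the description above, while everything else (function name, return format) is illustrative.

```python
import cv2

def central_key_frames(video_path, num_segments=10):
    """Split a clip into equal segments and keep the central frame of each segment."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    key_frames = []
    for seg in range(num_segments):
        centre = int((seg + 0.5) * total / num_segments)   # middle frame of this segment
        cap.set(cv2.CAP_PROP_POS_FRAMES, centre)
        ok, frame = cap.read()
        if ok:
            key_frames.append(frame)
    cap.release()
    return key_frames
```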

In [15] the datasets used are AVLetters and the OuluVS2 digits dataset. While working with the OuluVS2 digit dataset, the first step is the retrieval of the frames from the videos and then combating the issue of scaling. While using the 'Lip Reading in the Wild' dataset, after the first step of retrieving the frames, the video is cut into several frames in order to ensure that only the frames which map to the word spoken are kept.

The systems built on GRID and OuluVS2 [15] have shown more accuracy in comparison to datasets like LRS2. The main reason is that LRS2 consists of sentences drawn from 40,000 random words, whereas GRID and OuluVS2 consist of mostly repetitive sentences in a sequence, which amounts to a small vocabulary.

Preprocessing and feature extraction are the most important steps because they ensure that the region of the face needed for classification is extracted and served as an input to the neural network. In some cases the ROI must also undergo grayscale conversion, and sometimes z-score normalization as well. Since these steps play a major role, they improve the performance as well as the accuracy of the algorithm when training a neural network. Some of the major deep learning algorithms used in lip reading are as follows.

V. NEURAL NETWORK STRUCTURE


The next phase after feature extraction is the recognition model. In this phase the template is matched and the appropriate text is displayed. As neural networks are performing well in many applications, in this section we describe the various neural network architectures used for the implementation of lip-reading systems. Neural networks are a series of layers that mimic the operations of a human brain to recognize human behavior and train the model in such a way that it becomes a substitute for human actions. The structure of the neural network consists of an input layer and an output layer.

A. VGG-LSTM Model

Fig. 8 illustrates the VGG-LSTM model which was proposed by Amit Garg et al. [10].

Fig. 8 VGG and LSTM model [10]

They tried two different models. In the first model they decreased the learning rate of VGGNet by factors of 100 and 1000 and trained the complete model; both settings gave equivalent results. In the second model they froze VGGNet and trained only the LSTM. This involved performing a single forward pass through the VGGNet, the extracted features of which were then fed as inputs to the LSTM for each epoch, without updating the VGGNet. The pre-trained VGGNet was able to perform well on the integrated images. The LSTM model failed, the reason being that it did not operate on the sequence until after feature extraction was done. In addition, especially while updating the VGGNet, the LSTM model took an extended time to train; with more time it may have produced better results.
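A minimal PyTorch sketch of the second (frozen-VGG) variant is shown below. This is our illustration rather than the authors' code: the class name, hidden size, number of classes and input shapes are assumptions, and in practice the VGG16 backbone would be loaded with pretrained ImageNet weights.

```python
import torch
import torch.nn as nn
from torchvision import models

class VGGLSTMLipReader(nn.Module):
    """Frozen VGG16 feature extractor per frame, followed by an LSTM classifier."""
    def __init__(self, num_classes=10, hidden_size=256):
        super().__init__()
        vgg = models.vgg16()                 # load pretrained ImageNet weights in practice
        self.backbone = vgg.features         # convolutional layers only
        for p in self.backbone.parameters():
            p.requires_grad = False          # VGG stays frozen; only the LSTM head trains
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        frames = clips.flatten(0, 1)         # fold time into the batch dimension
        with torch.no_grad():                # single forward pass, no VGG update
            feats = self.pool(self.backbone(frames)).flatten(1)   # (batch*time, 512)
        out, _ = self.lstm(feats.view(b, t, -1))
        return self.fc(out[:, -1])           # word logits from the last time step
```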
B. CNN-BiLSTM Network

In this segment we outline the hybrid architecture of a Convolutional Neural Network together with a Bidirectional Long Short-Term Memory neural network, proposed by Yuanyao Lu et al. [11] and depicted in Fig. 9.

Fig. 9 The CNN-BiLSTM hybrid architecture [11]

In this system the salient frames from each individual video clip are extracted first. Then invariant features of each mouth image are obtained using an 8-layer CNN, which consists of convolutional layers together with pooling layers plus fully connected layers. The extracted features are then fed into the Bidirectional Long Short-Term Memory in order to model the correlation of sequential details among the extracted features in two different directions. The probability of each category is obtained by a SoftMax function, and the greatest value is adopted as the final recognition result. After evaluation of experiments on the dataset, this method outperformed the traditional approaches, namely the Active Contour Model (ACM) as well as HMM.
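The sketch below shows the shape of such a CNN-BiLSTM pipeline in PyTorch. It is a simplified stand-in, not Lu et al.'s exact network: the layer counts, channel sizes, ROI size and number of classes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    """Per-frame CNN features, a bidirectional LSTM over time, and a SoftMax classifier."""
    def __init__(self, num_classes=10, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(            # small stand-in for the 8-layer CNN
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, feat_dim), nn.ReLU(),
        )
        self.bilstm = nn.LSTM(feat_dim, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 128, num_classes)

    def forward(self, clips):                # clips: (batch, time, 1, H, W) lip ROIs
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)   # per-frame features
        out, _ = self.bilstm(feats)          # forward and backward temporal context
        return self.fc(out[:, -1]).softmax(dim=-1)   # class probabilities (SoftMax)
```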


C. Heterogeneous Convolutional Neural Network (HCNN)

The HCNN network mainly aims at resolving the lip-reading problem by processing the images of lips efficiently and rapidly. HCNN [15] is a combination of convolutional neural networks and different orthogonal moments. These orthogonal moments [19] are very useful tools in pattern recognition as well as image analysis. Here the orthogonal moments refer to the Hahn moments [15], which are based on the Hahn polynomials defined over an image coordinate space. The HCNN improves the process of feature extraction from the image.

Hahn moments are moments from which we can extract the most useful information from the image; this is done by covering the global and the local features with good efficiency at the same time. Using Hahn moments, the authors of [15] achieved a 20% improvement in accuracy when compared to a plain CNN.

D. Deep Neural Network (DNN)

A deep neural network or DNN [18] is a powerful category of ML. A DNN is an artificial neural network consisting of N layers in between the input and output layers. These multiple layers between the input layer and the output layer are called hidden layers. A DNN requires a large amount of labeled data for training. Millions of labeled images can be classified by using deep neural network algorithms, as shown in Fig. 10. A DNN is used in lip reading by identifying the layers.

Fig. 10 DNN used in lip reading [11]
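For illustration, a generic fully connected DNN of this kind can be written in a few lines of PyTorch; the layer widths, input resolution and number of output classes below are assumptions, not values from the surveyed papers.

```python
import torch.nn as nn

# Generic DNN classifier with several hidden layers between input and output.
dnn = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 64, 512), nn.ReLU(),   # hidden layer 1 (assumes 64x64 lip ROIs)
    nn.Linear(512, 256), nn.ReLU(),       # hidden layer 2
    nn.Linear(256, 128), nn.ReLU(),       # hidden layer 3
    nn.Linear(128, 10),                   # output layer (e.g., 10 word classes)
)
```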

VI. CONCLUSION

As deep learning is excelling in many new applications, this survey gave a brief idea about the various deep learning algorithms which yield better results in the field of speech-to-text conversion. In this paper we have reviewed various techniques and datasets which have paved the path for better performance in automatic lip reading. Upon the analysis of various datasets, we have figured out the performance and limitations of various algorithms. With respect to the datasets analyzed so far, the OuluVS2 dataset outperformed the others. Furthermore, the key observation from this survey is that the Convolutional Neural Network and Bi-LSTM were dominant in yielding better results. However, some of the models did not show remarkable performance because of certain physical factors, which include skin color, facial hair, face angle, background noise, etc. Also, lip traces of similar words were one of the most common problems faced by researchers. Moreover, some of the proposed methods were yet to be explored in a practical environment with large datasets. Implementing the lip-reading system on a variety of platforms, after improving its performance and recognition rates, could be a major helping hand for hearing-impaired people.

REFERENCES

[1] Burton, J., Frank, D., Saleh, M., Navab, N. and Bear, H.L., 2018, December. The speaker-independent lipreading play-off; a survey of lipreading machines. In 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS) (pp. 125-130). IEEE.

[2] Thanda, A. and Venkatesan, S.M., 2016, December. Audio visual speech recognition using deep recurrent neural networks. In IAPR Workshop on Multimodal Pattern Recognition of Social Signals in Human-Computer Interaction (pp. 98-109). Springer, Cham.

[3] Margam, D.K., Aralikatti, R., Sharma, T., Thanda, A., Roy, S. and Venkatesan, S.M., 2019. LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models. arXiv preprint arXiv:1906.12170.

[4] Fenghour, S., Chen, D., Guo, K. and Xiao, P., 2020. Lip reading sentences using deep learning with only visual cues. IEEE Access, 8, pp.215516-215530.

[5] Lombardi, L., 2013, September. A survey of automatic lip-reading approaches. In Eighth International Conference on Digital Information Management (ICDIM 2013) (pp. 299-302). IEEE.

[6] Mathulaprangsan, S., Wang, C.Y., Kusum, A.Z., Tai, T.C. and Wang, J.C., 2015, December. A survey of visual lip reading and lip-password verification. In 2015 International Conference on Orange Technologies (ICOT) (pp. 22-25). IEEE.

[7] Fung, I. and Mak, B., 2018, April. End-to-end low-resource lip-reading with maxout CNN and LSTM. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2511-2515). IEEE.

[8] Wand, M., Koutník, J. and Schmidhuber, J., 2016, March. Lipreading with long short-term memory. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6115-6119). IEEE.

[9] Kastaniotis, D., Tsourounis, D. and Fotopoulos, S., 2020, October. Lip reading modeling with temporal convolutional networks for medical support applications. In 2020 13th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (pp. 366-371). IEEE.

[10] Garg, A., Noyola, J. and Bagadia, S., 2016. Lip reading using CNN and LSTM. Technical report, Stanford University, CS231n project report.

[11] Lu, Y. and Yan, J., 2020. Automatic lip-reading using convolution neural network and bidirectional long short-term memory. International Journal of Pattern Recognition and Artificial Intelligence, 34(01), p.2054003.

[12] Stafylakis, T. and Tzimiropoulos, G., 2018, April. Deep word embeddings for visual speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4974-4978). IEEE.

[13] Howell, D., Cox, S. and Theobald, B., 2016. Visual units and confusion modelling for automatic lip-reading. Image and Vision Computing, 51, pp.1-12.

[14] Alam, M., Samad, M.D., Vidyaratne, L., Glandon, A. and Iftekharuddin, K.M., 2020. Survey on deep neural networks in speech and vision systems. Neurocomputing, 417, pp.302-321.

[15] Mesbah, A., Berrahou, A., Hammouchi, H., Berbia, H., Qjidaa, H. and Daoudi, M., 2019. Lip reading with Hahn convolutional neural networks. Image and Vision Computing, 88, pp.76-83.

[16] Puviarasan, N. and Palanivel, S., 2011. Lip reading of hearing-impaired persons using HMM. Expert Systems with Applications, 38(4), pp.4477-4481.

[17] Kulkarni, A.H. and Kirange, D., 2019, July. Artificial intelligence: A survey on lip-reading techniques. In 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT) (pp. 1-5). IEEE.

[18] Fernandez-Lopez, A. and Sukno, F.M., 2018. Survey on automatic lip-reading in the era of deep learning. Image and Vision Computing, 78, pp.53-72.

[19] Zhou, J., Shu, H., Zhu, H., Toumoulin, C. and Luo, L., 2005, September. Image analysis by discrete orthogonal Hahn moments. In International Conference on Image Analysis and Recognition (pp. 524-531). Springer, Berlin, Heidelberg.

[20] Chung, J.S. and Zisserman, A., 2016, November. Lip reading in the wild. In Asian Conference on Computer Vision (pp. 87-103). Springer, Cham.

[21] Anina, I., Zhou, Z., Zhao, G. and Pietikäinen, M., 2015, May. OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) (Vol. 1, pp. 1-5). IEEE.

[25] Rekik, A., Ben-Hamadou, A. and Mahdi, W., 2014, October. A new visual speech recognition approach for RGB-D cameras. In International Conference on Image Analysis and Recognition (pp. 21-28). Springer, Cham.
