A Survey On Deep Learning Based Lip-Reading Techniques
Yashaswini C
Department of Computer Science
and Engineering
Nitte Meenakshi Institute of
Technology
Bangalore, India
[email protected]
II. LITERATURE SURVEY

In this section, we explore the performance and limitations of distinct deep learning algorithms on various datasets (Table I).

TABLE I. SUMMARY OF STATE-OF-THE-ART TECHNIQUES IN DEEP LIP READING

| S.No | Author | Dataset | Techniques | Performance | Limitation |
| 1 | Jake Burton [1] | TCD-TIMIT | HMM, CNN, LSTM | 69.58% | Accuracy was lower when compared to speaker-dependent machines. The HMM did not train on some of the data due to insufficient occurrences of phonemes. |
| 2 | Abhinav Thanda [2] | GRID audio-visual database | Bi-directional RNN, visual features, connectionist temporal classification | - | Attempts at feature engineering using unsupervised methods such as a multi-modal autoencoder did not yield a notable solution. |
| 3 | Dilip Kumar Margam [3] | GRID and Indian English dataset | 3D-2D-CNN-BLSTM and Connectionist Temporal Classification (CTC) | - | The GRID dataset gave better performance than the Indian English dataset. |
| 4 | Fenghour, S. [4] | BBC-LRS2 | Viseme classifiers and attention-based transformer | 64.6% | Accuracy dropped after the conversion of visemes to words. |
| 5 | Mathulaprangsan [6] | XM2VTS | Gaussian Mixture Model | - | Classification is entirely dependent on the extracted features. |
| 6 | Fung, I. and Mak, B. [7] | OuluVS2 10-phrase task | CNN combined with BLSTM and a fusion of maxout units | 87.6% | Model performance decreased when more than two dimensions were used. |
| 8 | Kastaniotis, D. [9] | CUAVE and VPR | VPR, HMM, ASR, RBM, DBN | 45.63% | Lip gestures of similar words were one of the most common problems faced by the researchers; as this was one of the main aspects of the study, it is the main drawback of the proposed model. |
| 9 | Yuanyao Lu et al. [11] | Custom audio-visual database | Fusion of CNN and Bi-LSTM neural network architecture | The BiLSTM achieved greater accuracy than the LSTM | The proposed method is yet to be explored in a practical environment with a vast database. |
| 10 | Abderrahim Mesbah [15] | AVLetters, OuluVS2, BBC-LRW | Heterogeneous Convolutional Neural Networks | 90.86% on the BBC-LRW dataset | Accuracy was low on the AVLetters dataset. |
| 11 | N. Puviarasan [16] | Recorded videos of speakers | HMM with DCT and DWT features | 91% (DCT), 97% (DWT) | HMM with DWT features performed better than HMM with DCT features. |
| 12 | Apurva H. Kulkarni [17] | IBM ViaVoice, CUAVE, AVLetters2 | CNN+BLSTM, DNN and LSTM | 87.6% | Challenges include characters producing the same sounds (e.g. buy, by) and decoding speech in multi-view scenarios. |

III. EXPERIMENTAL DATABASE

In this section we discuss the prominent datasets used by researchers in lip reading, along with their properties.
In the GRID audio-visual dataset, each sentence has a fixed structure of six words: "Command + Color + Preposition + Letter + Digit + Adverb", for example, "set black into x four please". The dataset consists of 51 distinct words, which include 4 commands, 4 colors, 4 prepositions, 25 letters, 10 digits and 4 adverbs. In GRID, each sentence is a randomly chosen combination of these words, and the duration of each sentence utterance is 3 seconds. GRID contains a vocabulary of 51 words, 33,000 utterances and 33 speakers.
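To make the template concrete, the following small Python sketch (not from the paper) assembles a sentence in this six-slot format; the word lists are illustrative placeholders rather than the exact GRID vocabulary.

```python
# Minimal sketch of the GRID six-slot sentence template:
# command + color + preposition + letter + digit + adverb.
# The word lists are illustrative placeholders, not the official GRID lists.
import random

COMMANDS     = ["bin", "lay", "place", "set"]            # 4 commands
COLORS       = ["blue", "green", "red", "white"]         # 4 colors
PREPOSITIONS = ["at", "by", "in", "with"]                # 4 prepositions
LETTERS      = list("abcdefghijklmnopqrstuvxyz")         # 25 letters
DIGITS       = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"] # 10 digits
ADVERBS      = ["again", "now", "please", "soon"]        # 4 adverbs

def random_grid_sentence() -> str:
    """Return one sentence following the fixed GRID word order."""
    slots = (COMMANDS, COLORS, PREPOSITIONS, LETTERS, DIGITS, ADVERBS)
    return " ".join(random.choice(words) for words in slots)

print(random_grid_sentence())   # e.g. "set white by q four please"
```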
Fig. 3 shows a one-second clip of a male speaker pronouncing the word 'about'; the one-second clip is divided into various frames. The OuluVS2 dataset proposed in [21] contains thousands of videos recorded using six cameras from different angles, which makes OuluVS2 the most suitable dataset for a lip-reading setup.

Table III shows the sample phrases used in the OuluVS2 dataset; these are just general phrases which a person might use every day. Table IV depicts a simple comparison among these prominent databases.
| Databases | Speakers | Utterances | Accuracy | Year |
| MIRACL-VC1 [25] | 15 | 10 words, 10 phrases | 71.15% | 2014 |
| Lip Reading in the Wild [4] | 100s | 1000 utterances of 500 words | 64.6% | 2016 |
| GRID [8] | 33 | 33,000 utterances | 80% | 2006 |
| OuluVS2 [15] | 52 | 10,000 utterances | 93.72% | 2015 |

IV. METHODOLOGY

A. Data Preprocessing
Data preprocessing is the first step in the automatic lip-reading process (Fig. 5). Preprocessing has a strong impact on the entire process because it enhances the performance of the algorithm. The main challenges faced during lip reading are:

1. Presence of background noise and variation in the intensity of speech.
2. Color of the skin.
3. Sometimes the lips cannot be detected because of the presence of a moustache.
4. Characters which have the same pronunciation sound, also known as phonemes.

Input data in the form of raw video is fed to the system and contains a lot of noise; preprocessing removes this unwanted data. Several factors disturb the performance and accuracy of image processing techniques, such as pose, scaling, background noise, facial hair and rotation of the head. This has led to a focus on the specific features that play a vital role in converting speech to text. A number of preprocessing techniques are applied to the datasets before actually training on them, to cut out the interference of irrelevant factors, improve recognition performance and obtain finer outcomes. If the dataset contains video, the videos are first sampled into evenly sized image frames; the facial landmarks are then located and sent as input to the neural network for further processing. Further processing here refers to the use of the Single Shot MultiBox Detector (SSD) [4], a CNN-based detector used to detect the face and recognize facial landmarks. Data preprocessing is thus required because the data contains plenty of background information which is useless for the lip-reading task.
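As a rough illustration of this stage (an assumption on our part, with an OpenCV Haar cascade standing in for the SSD-based detector of [4]), the following Python sketch samples frames from a raw video and keeps only the detected face region:

```python
# Sketch of the preprocessing stage: sample frames from a raw video and keep
# only the face region. A Haar cascade stands in for the SSD-based face and
# landmark detector referenced in [4].
import cv2

def video_to_face_crops(video_path, every_nth=5):
    """Return grayscale face crops taken from every `every_nth` frame."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    capture = cv2.VideoCapture(video_path)
    crops, index = [], 0
    while True:
        ok, frame = capture.read()
        if not ok:                                   # end of video
            break
        if index % every_nth == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for (x, y, w, h) in detector.detectMultiScale(gray, 1.3, 5):
                crops.append(gray[y:y + h, x:x + w]) # discard background
        index += 1
    capture.release()
    return crops
```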
B. Feature Extraction
The next most crucial part of lip reading is feature extraction. While the face itself can be detected easily with any of the cascade algorithms [16], as shown in Fig. 5, feature extraction, that is, extracting the lip region, is a tedious task. However, in any standard face the mouth lies in the lower half of the face, so the region of interest can be set by gradually reducing its width and length. Lips are the most deformable part of the face, which makes them even harder to detect. After preprocessing, the lip is localized and its movement is detected using appropriate techniques; here a rough region of the mouth is estimated relative to the face.
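A simple sketch of this heuristic (our assumption, not a method prescribed by the surveyed papers) crops the lower half of a detected face and then trims it toward its center to approximate the lip region:

```python
# Heuristic lip ROI: the mouth sits in the lower half of a detected face, so
# crop that half and shrink it toward its center to approximate the lip region.
def mouth_roi(face, width_keep=0.6, height_keep=0.5):
    """Crop an approximate lip region from a face image (H x W [x C] array)."""
    h, w = face.shape[:2]
    lower = face[h // 2:, :]                  # lower half of the face
    lh, lw = lower.shape[:2]
    dw = int(lw * (1 - width_keep) / 2)       # trim left and right
    dh = int(lh * (1 - height_keep) / 2)      # trim top and bottom
    return lower[dh:lh - dh, dw:lw - dw]
```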
The main methods [5] used for any lip detection can be
classified as follows:
were extracted. Initially, the sample video was divided into ten equal segments, and the central frame of each segment was extracted. After the extraction of these key frames, OpenCV was used to detect the faces. In addition, five key points of the mouth are tracked and used to locate the center of the mouth, from which the lip region is finally obtained.
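A hedged sketch of this key-frame selection is given below; tracking the five mouth key points would additionally require a facial-landmark model and is omitted here.

```python
# Key-frame selection: split the clip into ten equal segments and keep the
# central frame of each. Locating the five mouth key points would require a
# separate facial-landmark model, which is not shown.
import cv2

def central_frames(video_path, segments=10):
    """Return the middle frame of each of `segments` equal-length segments."""
    capture = cv2.VideoCapture(video_path)
    total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for s in range(segments):
        middle = int((s + 0.5) * total / segments)    # index of the segment's middle frame
        capture.set(cv2.CAP_PROP_POS_FRAMES, middle)
        ok, frame = capture.read()
        if ok:
            frames.append(frame)
    capture.release()
    return frames
```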
In [15] the datasets used are AVLetters and the OuluVS2 digits dataset. While working with the OuluVS2 digits dataset, the first step is the retrieval of the frames from the videos, followed by combating the issue of scaling. While using the 'Lip Reading in the Wild' dataset, after the first step of frame retrieval the video is cut into several frames in order to ensure that only the frames which map to the spoken word are kept.

The systems built on GRID and OuluVS2 [15] have shown higher accuracy in comparison to datasets like LRS2. The main reason is that LRS2 consists of sentences drawn from 40,000 random words, whereas GRID and OuluVS2 consist mostly of repetitive sentences in a sequence and hence cover only a small vocabulary.

Preprocessing and feature extraction are the most important steps because they ensure that the region of the face needed for classification is extracted and served as input to the neural network. In some cases the ROI must also undergo greyscale conversion, and sometimes z-score normalization as well. Since these steps play such a major role, training a neural network on this input improves both the performance and the accuracy of the algorithm. Some of the major deep learning algorithms used in lip reading are as follows.
Fig. 8 VGG and LSTM model [10]

The authors of [10] tried two different models (Fig. 8). In the first model they decreased the learning rate of VGGNet by factors of 100 and 1000 and trained the complete model; both settings gave equivalent results. In the second model they froze VGGNet and trained only the LSTM. This involved performing a single forward pass through VGGNet, whose extracted features were then fed as inputs to the LSTM at each epoch without updating VGGNet. The pre-trained VGGNet was able to perform well on the integrated images. The LSTM model failed, the reason being that it did not operate on the sequence until after feature extraction was done. In addition, the LSTM model took an extended time to train, especially while updating VGGNet; with more time it might have produced better results.
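A minimal PyTorch sketch of this second strategy (an interpretation of [10], not the authors' code; layer sizes are assumptions) freezes the convolutional part of VGG and trains only the LSTM on its per-frame features:

```python
# Frozen VGG feature extractor feeding an LSTM: one gradient-free pass through
# the convolutional layers per frame, and only the LSTM + classifier are trained.
import torch
import torch.nn as nn
from torchvision import models

class FrozenVGGLSTM(nn.Module):
    def __init__(self, hidden=256, classes=10):
        super().__init__()
        vgg = models.vgg16(weights=None)            # pretrained weights would be loaded in practice
        self.features = vgg.features                # convolutional part only
        for p in self.features.parameters():
            p.requires_grad = False                 # VGG is never updated
        self.lstm = nn.LSTM(512 * 7 * 7, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, classes)

    def forward(self, clip):                        # clip: (batch, time, 3, 224, 224)
        b, t = clip.shape[:2]
        with torch.no_grad():                       # single forward pass, no gradients
            feats = self.features(clip.flatten(0, 1)).flatten(1)
        out, _ = self.lstm(feats.view(b, t, -1))
        return self.classifier(out[:, -1])          # predict from the last time step
```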
B. CNN-BiLSTM Network
In this segment we outline the hybrid architecture of a Convolutional Neural Network together with a Bidirectional Long Short-Term Memory neural network, proposed by Yuanyao Lu et al. [11] and depicted in Fig. 9.
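A schematic PyTorch sketch of this hybrid idea (a simplified stand-in with assumed layer sizes, not the exact architecture of [11]) encodes each mouth frame with a small CNN and models the frame sequence with a bidirectional LSTM:

```python
# CNN + BiLSTM hybrid: a per-frame CNN encoder followed by a bidirectional LSTM
# over the sequence of frame features, ending in a word-level classifier.
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, feat=128, hidden=128, classes=10):
        super().__init__()
        self.cnn = nn.Sequential(                        # spatial encoder for one frame
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(64 * 4 * 4, feat))
        self.bilstm = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, classes) # forward + backward states

    def forward(self, clip):                             # clip: (batch, time, 1, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        out, _ = self.bilstm(feats)                      # temporal modelling in both directions
        return self.classifier(out[:, -1])
```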
C. Heterogeneous Convolutional Neural Network (HCNN)
The HCNN network mainly aims at resolving the lip-reading problem by processing the images of lips efficiently and rapidly. HCNN [15] is a combination of convolutional neural networks and different orthogonal moments (Fig. 6). These orthogonal moments [19] are very useful tools in pattern recognition as well as image analysis. Here the orthogonal moments are the Hahn moments [15], which are based on the Hahn polynomials defined over an image coordinate space. HCNN improves the process of feature extraction in the image.

Hahn moments are moments from which the most useful information can be extracted from the image; this is achieved by covering the global and the local features with good efficiency at the same time. Using Hahn moments, the authors of [15] achieved a 20% improvement in accuracy compared to CNN. Fig. 6 describes the HCNN architecture.
D. Deep Neural Network (DNN)
A deep neural network, or DNN [18], is a powerful category of machine learning model. A DNN is an artificial neural network consisting of a number of layers between the input and output layers; these intermediate layers are called hidden layers. A DNN requires a large amount of labeled data for training: millions of labeled images can be classified using deep neural network algorithms, as shown in Fig. 10. In lip reading, a DNN is applied by learning discriminative features through these layers.
Fig. 10 DNN used in lip reading [11]
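For completeness, a tiny PyTorch sketch of such a fully connected DNN is shown below; it is a generic illustration with assumed layer sizes, not the specific network of [11].

```python
# A plain deep neural network: several fully connected hidden layers between
# the input (a flattened lip image) and the output class scores.
import torch.nn as nn

def make_dnn(input_dim=64 * 64, hidden_dims=(512, 256, 128), classes=10):
    """Stack of fully connected hidden layers with ReLU activations."""
    layers, last = [], input_dim
    for width in hidden_dims:
        layers += [nn.Linear(last, width), nn.ReLU()]
        last = width
    layers.append(nn.Linear(last, classes))              # output layer
    return nn.Sequential(*layers)
```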
VI. CONCLUSION

As deep learning is outperforming in many new applications, this survey gives a brief idea of the various deep learning algorithms which yield better results in the field of speech-to-text conversion. In this paper we have reviewed various techniques and datasets which have paved the path for better performance in automatic lip reading. Upon analysis of the various datasets, we have identified the performance and limitations of various algorithms. With respect to the datasets analyzed so far, the OuluVS2 dataset outperformed the others. Furthermore, the key observation from this survey is that the Convolutional Neural Network and Bi-LSTM dominated in yielding better results. However, some of the models did not show remarkable performance because of certain physical factors, which include skin color, facial hair, face angle, background noise, etc. Also, lip traces of similar words were one of the most common problems faced by researchers. Moreover, some of the proposed methods are yet to be explored in a practical environment with large datasets. Implementing the lip-reading system on a variety of platforms, after improving its performance and recognition rates, could be a major helping hand for hearing-impaired people.

REFERENCES

[1] Burton, J., Frank, D., Saleh, M., Navab, N. and Bear, H.L., 2018, December. The speaker-independent lipreading play-off; a survey of lipreading machines. In 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS) (pp. 125-130). IEEE.
[2] Thanda, A. and Venkatesan, S.M., 2016, December. Audio visual speech recognition using deep recurrent neural networks. In IAPR Workshop on Multimodal Pattern Recognition of Social Signals in Human-Computer Interaction (pp. 98-109). Springer, Cham.
[3] Margam, D.K., Aralikatti, R., Sharma, T., Thanda, A., Roy, S. and Venkatesan, S.M., 2019. LipReading with 3D-2D-CNN BLSTM-HMM and word-CTC models. arXiv preprint arXiv:1906.12170.
[4] Fenghour, S., Chen, D., Guo, K. and Xiao, P., 2020. Lip reading sentences using deep learning with only visual cues. IEEE Access, 8, pp. 215516-215530.
[5] Lombardi, L., 2013, September. A survey of automatic lip-reading approaches. In Eighth International Conference on Digital Information Management (ICDIM 2013) (pp. 299-302). IEEE.
[6] Mathulaprangsan, S., Wang, C.Y., Kusum, A.Z., Tai, T.C. and Wang, J.C., 2015, December. A survey of visual lip reading and lip-password verification. In 2015 International Conference on Orange Technologies (ICOT) (pp. 22-25). IEEE.
[7] Fung, I. and Mak, B., 2018, April. End-to-end low-resource lip-reading with maxout CNN and LSTM. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2511-2515). IEEE.
[8] Wand, M., Koutník, J. and Schmidhuber, J., 2016, March. Lipreading with long short-term memory. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6115-6119). IEEE.
[9] Kastaniotis, D., Tsourounis, D. and Fotopoulos, S., 2020, October. Lip reading modeling with temporal convolutional networks for medical support applications. In 2020 13th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI) (pp. 366-371). IEEE.
[10] Garg, A., Noyola, J. and Bagadia, S., 2016. Lip reading using CNN and LSTM. Technical report, Stanford University, CS231n project report.
[11] Lu, Y. and Yan, J., 2020. Automatic lip-reading using convolution neural network and bidirectional long short-term memory. International Journal of Pattern Recognition and Artificial Intelligence, 34(01), p. 2054003.
[12] Stafylakis, T. and Tzimiropoulos, G., 2018, April. Deep word embeddings for visual speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4974-4978). IEEE.
[13] Howell, D., Cox, S. and Theobald, B., 2016. Visual units and confusion modelling for automatic lip-reading. Image and Vision Computing, 51, pp. 1-12.
[14] Alam, M., Samad, M.D., Vidyaratne, L., Glandon, A. and Iftekharuddin, K.M., 2020. Survey on deep neural networks in speech and vision systems. Neurocomputing, 417, pp. 302-321.
[15] Mesbah, A., Berrahou, A., Hammouchi, H., Berbia, H., Qjidaa, H. and Daoudi, M., 2019. Lip reading with Hahn convolutional neural networks. Image and Vision Computing, 88, pp. 76-83.
[17] Kulkarni, A.H. and Kirange, D., 2019, July. Artificial intelligence: A survey on lip-reading techniques. In 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT) (pp. 1-5). IEEE.
[19] Zhou, J., Shu, H., Zhu, H., Toumoulin, C. and Luo, L., 2005, September. Image analysis by discrete orthogonal Hahn moments. In International Conference on Image Analysis and Recognition (pp. 524-531). Springer, Berlin, Heidelberg.
[20] Chung, J.S. and Zisserman, A., 2016, November. Lip reading in the wild. In Asian Conference on Computer Vision (pp. 87-103). Springer, Cham.
[21] Anina, I., Zhou, Z., Zhao, G. and Pietikäinen, M., 2015, May. OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) (Vol. 1, pp. 1-5). IEEE.
[25] Rekik, A., Ben-Hamadou, A. and Mahdi, W., 2014, October. A new visual speech recognition approach for RGB-D cameras. In International Conference on Image Analysis and Recognition (pp. 21-28). Springer, Cham.