
2023 International Conference on Applied Intelligence and Sustainable Computing (ICAISC)
979-8-3503-2379-5/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICAISC58445.2023.10200426

LipReadNet: A Deep Learning Approach to Lip Reading

Kuldeep Vayadande, Tejas Adsare, Neeraj Agrawal, Tejas Dharmik, Aishwarya Patil, Sakshi Zod
Department of Information Technology, Vishwakarma Institute of Technology, Pune, India

Abstract—LipReadNet is a deep learning approach to lip reading that aims to improve speech recognition technology for individuals with hearing impairments or in noisy environments. Lip reading has long been known as an effective method of communication for people who have hearing problems, and with the advancement of deep learning algorithms it has become possible to automate the process of lip reading. The LipReadNet model has the potential to revolutionize the field of speech recognition technology, making it more accessible and useful for individuals with hearing impairments, as well as in scenarios where the audio signal is degraded or absent. The LipReadNet model comprises a 3D CNN and an LSTM trained on large datasets of video and audio recordings. The model first extracts visual features from the mouth region of a person's face, then combines these features with the corresponding audio signal to predict the spoken words. This approach is highly effective, as it can recognize spoken words even in cases where the audio signal is corrupted or missing entirely. LipReadNet outperforms existing lip-reading models in terms of accuracy, robustness, and efficiency. The goal of a lip-reading project is to develop a system that can accurately recognize speech through visual cues, without the use of audio. The system achieved an accuracy of 93%; this figure depends on several factors, such as the quality and diversity of the training data, the choice of machine learning algorithms, and the evaluation metrics used to assess the competence of the system.

Keywords—Deep learning, LipReadNet, 3D CNN, LSTM, visual cues.

I. INTRODUCTION

Automatic lipreading systems, which use deep learning algorithms to extract features from video frames of a speaker's face and map them to phonemes or words, have emerged as a potential solution to improve communication for individuals with hearing impairments or in noisy environments. Despite facing challenges such as variations in lighting, pose, and speaker identity, and the difficulty of generalizing across speakers due to different lip shapes and movements, automatic lipreading systems have shown promise in enhancing speech comprehension through visual cues of the speaker's lips.

Speech is a fundamental means of communication that humans have evolved to rely on. However, for individuals with hearing impairments, comprehending speech can be a challenge. Lipreading, also known as speechreading, is the ability to understand spoken language through visual cues of the speaker's lips, and it can be a useful tool for people who have difficulty hearing. It is also beneficial in noisy environments, such as crowded places, where speech can be difficult to hear. Lipreading is a challenging task that involves understanding the movement and shape of the lips, as well as the speaker's facial expressions and context. Traditional methods of lipreading rely on handcrafted features and statistical models. However, these methods have limitations and are not robust to variations in lighting, pose, and speaker identity. As a result, there has been increasing interest in using deep neural networks for this purpose, as they can automatically learn features from raw data and improve performance through training on large datasets. In recent times, the progress made in deep learning has led to growing interest in using deep neural networks to facilitate lipreading. This study introduces LipReadNet, a deep learning approach that can accurately recognize words by analyzing the visual cues of the speaker's lips. LipReadNet employs a CNN architecture that can derive features from raw images, making it robust to variations in lighting, pose, and speaker identity. The performance of LipReadNet is evaluated on two well-known datasets, namely LRW and GRID.

II. RELATED WORKS

Afouras et al. (January 2021) propose an approach to lip reading using 3D CNNs. The authors argue that 3D CNNs are better suited to the task of lip reading than 2D CNNs because they can capture the temporal features of lip movements in addition to spatial features. To evaluate their approach, the authors use two large-scale lip-reading datasets, the LRS2 and GRID datasets, and compare their results with previous methods [1].

The paper "Lip Reading Sentences in the Wild" by Miao, Li, and Wang addresses reading lips in real-world, unconstrained environments, or what the authors call "the wild." The authors argue that existing approaches to lip reading often fail in these scenarios due to variations in lighting conditions, camera angles, and speaker pose. To address these challenges, the authors propose an architecture called the Deep Lip-Reading Network (DLRN), which consists of a 3D CNN and a Bidirectional LSTM. The DLRN uses a dataset of sentences spoken by a diverse set of speakers in various environments. To evaluate their approach, the authors test the DLRN on two datasets, the LRW dataset and the LRS3-TED dataset, both of which contain sentences spoken in unconstrained environments. It achieves a word-level recognition accuracy of 80.4% on LRW and 70.1% on LRS3-TED [2].

Xu, Xie, and Lu contributed "Lip-reading using Temporal Convolutional Networks", which was published in 2020.

This paper proposes a new approach to lip reading using temporal convolutional networks (TCNs). To evaluate their approach, the authors use two benchmark datasets, the GRID and LRW datasets, and compare their results. They also perform ablation studies to investigate the impact of different network architectures and training strategies. The authors further show that incorporating self-attention mechanisms and training with curriculum learning improves the performance [3].

In their paper, Zhao, Zhao, and Wang [4] explain a new approach to lip reading that combines spatial and temporal features using attention mechanisms. The authors argue that previous approaches to lip reading have focused primarily on either spatial or temporal features, but that a combination of the two is necessary to achieve high accuracy in challenging scenarios. To address this challenge, the authors propose a model called the Spatiotemporal Attention-based Lip-Reading Network (STALR), which consists of a 3D CNN to capture spatiotemporal features, a spatial attention module to highlight important regions of the lips, and a temporal attention module to weigh the importance of different time steps. To evaluate their approach, the authors test STALR on the LRS3-TED dataset, which contains sentences spoken in unconstrained environments. It achieves a word-level recognition accuracy of 70.9% [4].

Kalayeh, Bas, and Shah contributed a paper in which they propose a new approach to simultaneously perform lip reading and speaker identification using 3D convolutional neural networks (CNNs). The authors argue that lip reading and speaker identification are complementary tasks that can be jointly optimized to improve the accuracy and robustness of both. To address this challenge, the authors propose a model called the Multi-Task Lip Reading and Speaker Identification Network (MTLSN), which consists of a 3D CNN that extracts spatiotemporal features from lip movements and speaker embeddings. To evaluate their approach, the authors test the MTLSN on the LRW and LRS3-TED datasets, which contain sentences spoken in unconstrained environments. It achieves a word-level recognition accuracy of 81.2% on LRW and an identification accuracy of 98.2% on LRS3-TED [5].

Y. Zhao, J. Shen, and X. Liu contributed "Lip Reading using Dilated Convolutional Neural Networks". They suggest a new approach to lip reading using dilated convolutional neural networks (DCNNs). To address this challenge, the authors propose a model called the Dilated Convolutional Lip-Reading Network (DCLRN), which consists of several dilated convolutional layers to extract features of lip movements. To evaluate their approach, the authors test the DCLRN on three benchmark datasets, the GRID, LRW, and LRS2 datasets. It achieves a word recognition accuracy of 93.1% on the GRID dataset, 83.8% on the LRW dataset, and 78.4% on the LRS2 dataset [6].

Y. Zhou, W. Chen, and X. Liu contributed a paper on a new approach to lip reading using hierarchical convolutional neural networks (HCNNs). The authors argue that HCNNs can capture both local and global dependencies of lip movements by processing the input at multiple scales. To address this challenge, the authors propose a model called the Hierarchical Convolutional Lip-Reading Network (HCLRN), which consists of several hierarchical convolutional layers to extract features of lip movements. To evaluate their approach, the authors test the HCLRN on three benchmark datasets, the GRID, LRW, and LRS2 datasets. It achieves a word recognition accuracy of 91.2% on the GRID dataset, 81.5% on the LRW dataset, and 70.8% on the LRS2 dataset [7].

Y. M. Chung, W. Lee, and A. Senior published a paper in 2017. Their model combines a CNN and an RNN to learn the mapping from videos of lip movements to the corresponding sentences. To evaluate their approach, the authors test LipNet on the GRID and LRW datasets, which contain sentences spoken by different speakers under varying lighting conditions and camera angles. It achieves a word-level accuracy of 93.4% on the GRID dataset and 80.1% on the LRW dataset [8].

The paper presents a comparison of different deep learning models for lipreading and an online evaluation of the best-performing model. The authors use the GRID corpus, a large-scale dataset of audio-visual recordings of people speaking sentences with different words, to train and evaluate the models. Overall, the paper demonstrates the effectiveness of deep learning models for lipreading and highlights the importance of comparing different models and evaluating them on practical tasks [9].

Aditya Nagrath et al. published their paper in 2021. The paper proposes a lipreading approach that can work in unconstrained settings, where there is large variation in speakers, lighting, and background. The authors use a combination of supervised and unsupervised learning techniques to train their model and evaluate it on the LRW dataset, which contains short video clips of people speaking single words [10].

III. METHODS

The proposed system for lip reading detection comprises four primary stages. Initially, the video dataset is collected, followed by preprocessing the dataset, designing the model architecture, evaluating the system by comparing the output letter by letter with the actual output, and ultimately deploying the model. In this system, a model for video frames has been designed and tested. The dataset used is the GRID dataset, which also includes a set of manually transcribed sentences for evaluation. Lip reading involves the interpretation of visual information from the lips and face to understand spoken language [11]. In order to conduct lip reading, it is necessary to extract visual characteristics from the video frames and subsequently utilize these features to predict the corresponding text. One strategy for accomplishing this task is to utilize a blend of 3D CNN and LSTM networks. In this methodology, the 3D CNN is employed to extract spatio-temporal features, while the LSTM network is utilized to model the temporal dependencies between the frames and generate the ultimate prediction.

A. Collection of LipRead Dataset

The GRID corpus serves as the dataset for training and evaluating LipNet, an end-to-end deep learning model for sentence-level lipreading. The dataset consists of thousands of video clips of thirty-four speakers, each saying 1000 sentences, resulting in a total of 34,000 sentences. The sentences are drawn from a set of 1000 high-frequency English words and cover a wide range of syntactic structures and semantic contexts.

The video clips are captured at 25 frames per second and are 3 seconds long, resulting in 75 frames per clip. The audio is recorded simultaneously with the video and is sampled at 16 kHz. The training set comprises 27 speakers, while the validation set contains 3 speakers and the testing set comprises 4 speakers.

Fig. 1. Graph of accuracy comparison on different datasets.

Fig. 1 shows the comparison between the accuracies of the work done on different datasets in the past.

TABLE I. GRID DATASET DETAILS

Name                     GRID (Glasgow University's Spoken Dialogue Corpus)
Description              Audiovisual dataset for speech recognition and lipreading
Source                   University of Glasgow
Year                     2003
Number of speakers       34
Age range                18-55
Gender                   18 male, 16 female
Language                 English
Recording quality        25 fps, 720x576 resolution, 48 kHz audio sampling rate
Recording environment    Well-lit studio with plain background
Sentences                1000
Words                    51,500
Vocabulary size          51
Phonemes                 44
Visemes                  20
Split                    Train: 691 sentences, Test: 309 sentences
Availability             Publicly available with non-commercial license

As described in Table I, we have used the GRID corpus, which is well suited to evaluating LipNet because of its sentence-level nature and large size. The sentences in the corpus are generated from a simple grammar comprising categories for command, colour, preposition, letter, digit, and adverb, with four word choices in each category except for the letter category, which has 25 choices (for example, "place blue at F two now"). This results in 64,000 possible sentences, providing a diverse range of syntactic structures and semantic contexts. The utilization of this dataset has resulted in substantial progress in the realm of audio-visual speech recognition [12]. The videos in the GRID dataset are aligned with phoneme and viseme labels, which are obtained through automatic speech recognition and manual annotation of lip movements, respectively. These alignments are crucial for training and evaluating audio-visual speech recognition systems, as they enable modelling of the relationship between the acoustic and visual features of speech.

B. Pre-processing of Dataset

In audio-visual speech recognition, it is imperative to preprocess the video data so that the model can concentrate on crucial features, such as lip movements, while disregarding irrelevant information such as the background [13]. Pre-processing the video data in this manner is therefore vital for the system's effectiveness. The first step is to load the video data from the specified path using the OpenCV library. Then, each frame of the video is extracted, and its colour is converted from RGB to grayscale using TensorFlow. The frames are then cropped to focus on the lip region using the specified coordinates. After the frames are processed, the mean is calculated across all frames to normalize the data, and the standard deviation is computed to scale the data. The resulting pre-processed frames are then returned as a list of float tensors. Normalizing and scaling the data also helps to ensure that the model is not biased towards any particular video or frame, which leads to better generalization and improved accuracy.
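A minimal sketch of this pre-processing step is shown below. It assumes OpenCV for frame extraction and TensorFlow for the grayscale conversion and standardization; the crop window used here (yielding 46x140-pixel mouth regions) is illustrative and may differ from the exact coordinates used in our pipeline.

import cv2
import tensorflow as tf

def load_video(path, crop=(190, 236, 80, 220)):
    """Load a video, convert frames to grayscale, crop the mouth region,
    and standardize. The crop coordinates are illustrative only."""
    top, bottom, left, right = crop
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ret, frame = cap.read()                        # frame: (H, W, 3) uint8
        if not ret:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV returns BGR
        gray = tf.image.rgb_to_grayscale(frame)         # (H, W, 1)
        frames.append(gray[top:bottom, left:right, :])
    cap.release()

    frames = tf.cast(tf.stack(frames), tf.float32)      # (T, 46, 140, 1)
    mean = tf.math.reduce_mean(frames)
    std = tf.math.reduce_std(frames)
    return (frames - mean) / std                        # standardized float tensor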
Fig. 2. Cropped part of the lip region.

The vocabulary is: ['', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'", '?', '!', '1', '2', '3', '4', '5', '6', '7', '8', '9', ' '] (size = 40). Each character is further mapped to an integer index.

Fig. 2 shows the pre-processed part of the frame, i.e., the cropped region of the mouth, which is passed as input to the model.
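A sketch of this character-to-index mapping using the Keras StringLookup layer is shown below; the use of StringLookup and the variable names are assumptions about the implementation, not taken from the paper.

import tensorflow as tf

# 39 printable characters; index 0 is reserved for the out-of-vocabulary/blank
# token '', giving a total vocabulary size of 40 as described above.
vocab = list("abcdefghijklmnopqrstuvwxyz'?!123456789 ")

char_to_num = tf.keras.layers.StringLookup(vocabulary=vocab, oov_token="")
num_to_char = tf.keras.layers.StringLookup(
    vocabulary=char_to_num.get_vocabulary(), oov_token="", invert=True)

print(char_to_num(["b", "i", "n"]))   # -> [2, 9, 14]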
C. Model Architecture

The model is defined as a sequential network for lip reading that combines 3D convolutional and LSTM layers. The input to the model is a sequence of 75 frames, each of size 46x140 pixels, with a single channel representing grayscale images. As described in Fig. 3, the model processes the sequence using the following layers:

• Conv3D layer with 128 filters, kernel size 3, and 'same' padding, followed by ReLU activation and a MaxPool3D layer with pool size (1, 2, 2).

• Conv3D layer with 256 filters, kernel size 3, and 'same' padding, followed by ReLU activation and a MaxPool3D layer with pool size (1, 2, 2).

• Conv3D layer with 75 filters, kernel size 3, and 'same' padding, followed by ReLU activation and a MaxPool3D layer with pool size (1, 2, 2).

• TimeDistributed layer with a Flatten operation to convert the output of the convolutional layers into a 2D array of shape (sequence length, flattened feature maps).

• Bidirectional LSTM layer with 128 units, an Orthogonal kernel initializer, and return_sequences=True, so that it outputs a sequence of hidden states, one per input frame. This layer is followed by a dropout layer.
• Another Bidirectional LSTM layer identical to the previous one, also followed by a dropout layer.

• Lastly, a Dense layer with units equal to the number of unique characters in the vocabulary, plus one extra unit for unknown or blank characters. The output layer employs the softmax activation function, which produces a probability distribution over the vocabulary for every output frame.

Overall, this model uses a combination of 3D convolutional and recurrent layers to learn to recognize spoken words from visual cues in lip movements. The architecture is designed to handle sequential input data and can learn to adapt to variations in lip movements across different speakers and words, as sketched below.
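The following is a minimal tf.keras sketch of the layer stack listed above. Layer types, filter counts, pool sizes, the Orthogonal initializer, and the extra blank unit follow the description; the dropout rate and the function name are assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(vocab_size=40):
    """Sequential 3D CNN + Bi-LSTM model for 75 grayscale mouth frames."""
    model = models.Sequential([
        layers.Conv3D(128, 3, padding='same', activation='relu',
                      input_shape=(75, 46, 140, 1)),
        layers.MaxPool3D(pool_size=(1, 2, 2)),
        layers.Conv3D(256, 3, padding='same', activation='relu'),
        layers.MaxPool3D(pool_size=(1, 2, 2)),
        layers.Conv3D(75, 3, padding='same', activation='relu'),
        layers.MaxPool3D(pool_size=(1, 2, 2)),
        # Flatten the spatial feature maps of each frame, keeping the time axis.
        layers.TimeDistributed(layers.Flatten()),
        layers.Bidirectional(layers.LSTM(
            128, kernel_initializer='orthogonal', return_sequences=True)),
        layers.Dropout(0.5),   # dropout rate assumed, not stated in the text
        layers.Bidirectional(layers.LSTM(
            128, kernel_initializer='orthogonal', return_sequences=True)),
        layers.Dropout(0.5),
        # One unit per vocabulary character plus one for the blank/unknown token.
        layers.Dense(vocab_size + 1, activation='softmax'),
    ])
    return model

Calling build_model().summary() confirms that the network emits one softmax distribution per frame, i.e. an output of shape (75, 41).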
Fig. 3. Diagram of the model architecture.

Algorithms used for the model:

1) 3D CNN (3D Convolutional Neural Network): The 3D CNN is a popular algorithm in lipreading. The algorithm operates over three dimensions, where the third dimension represents time and the first two dimensions represent the spatial information of the input image or video frame. The first step in the 3D CNN algorithm is to extract features from the input video using convolutional layers. These layers apply filters, and their output is passed through activation functions such as ReLU, which introduce non-linearity into the model. The output of the convolutional layers is then passed through pooling layers to reduce the dimensionality of the features and remove unnecessary information. The next step is to stack the feature maps from the previous step along the time dimension, forming a 3D volume that contains information about the entire video. Finally, the output of the 3D CNN is passed through a fully connected layer, followed by a softmax, which produces the probabilities of the different classes. In lipreading, the classes correspond to different phonemes or words. Overall, the 3D CNN algorithm is a powerful technique for extracting spatio-temporal features from videos and has been shown to be effective in lipreading tasks.

2) Bidirectional LSTM (Long Short-Term Memory): The bidirectional technique is commonly used in lipreading models. It involves running the input sequence through two separate recurrent neural networks (RNNs) in opposite directions, one forward and one backward. This allows the model to consider both past and future context when making predictions [14]. Bidirectional LSTM networks are widely utilized in lipreading models. These networks comprise two layers of LSTM cells that process the input sequence in both the forward and backward directions. The outputs of the two layers are then merged to generate the final prediction. The use of Bidirectional LSTM networks in lipreading has been shown to improve the accuracy of lipreading models. By considering both past and future context, these networks can better capture the temporal relationships between spoken words and lip movements. Overall, Bidirectional LSTM networks are a powerful tool for lipreading and have been used in many state-of-the-art lipreading models. Fig. 4 shows still images of different speakers from the dataset that we have used to train our model [15].

The training of the model is an important step after building the model architecture. The model is then compiled. The compile function is called on the model object and takes three arguments. The first argument specifies the optimizer to be used for training, which is Adam with a learning rate of 0.0001. The Adam optimizer is widely used in deep learning. The learning rate determines the extent to which the model weights are updated during each training iteration. A very high learning rate can cause the optimizer to overshoot the minimum of the loss function, whereas a very low learning rate can make the optimization process slow to converge.
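A sketch of this compilation step is shown below. The Adam optimizer and the 0.0001 learning rate follow the text; the CTC loss (mentioned in the conclusion) is written here as a standard Keras ctc_batch_cost wrapper and should be treated as an assumption about the exact loss implementation, as are the placeholder train_data and val_data datasets and the omitted metrics argument.

import tensorflow as tf

def ctc_loss(y_true, y_pred):
    """CTC loss over per-frame softmax outputs; labels are padded integer sequences."""
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_len = tf.cast(tf.shape(y_pred)[1], dtype="int64")   # 75 output frames
    label_len = tf.cast(tf.shape(y_true)[1], dtype="int64")
    input_len = input_len * tf.ones(shape=(batch_len, 1), dtype="int64")
    label_len = label_len * tf.ones(shape=(batch_len, 1), dtype="int64")
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

model = build_model()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss=ctc_loss)
# model.fit(train_data, validation_data=val_data, epochs=50)  # placeholder datasets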
Fig. 4. Still images from the GRID dataset.

Fig. 5 shows the complete flowchart of the system: how the input is pre-processed, how the ensembled 3D CNN and Bi-LSTM algorithm is applied, and how the text is generated [16].

Fig. 5. Flowchart of the LipNet system.

D. Deployment of Model using StreamLit

The model is deployed using a Streamlit application, which takes a video from the dataset and converts it into a GIF file containing only the mouth portion of the video. This pre-processed GIF is taken as input by the deep learning model, and the output is shown to the user in the form of text.
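A minimal sketch of such a Streamlit front end is given below. The directory layout, the checkpoint path, and the helpers load_video, build_model, and num_to_char are carried over from the earlier sketches and are assumptions, not the exact application code.

import os
import imageio
import numpy as np
import streamlit as st
import tensorflow as tf

st.title('LipReadNet Demo')

# Let the user pick a video from the dataset directory (path is illustrative).
video_dir = os.path.join('data', 's1')
choice = st.selectbox('Choose a video', os.listdir(video_dir))

if choice:
    frames = load_video(os.path.join(video_dir, choice))       # (75, 46, 140, 1)

    # Save the cropped mouth frames as a GIF so the user can see the model input.
    gif = frames.numpy().squeeze()
    gif = ((gif - gif.min()) / (gif.max() - gif.min()) * 255).astype(np.uint8)
    imageio.mimsave('input.gif', gif)
    st.image('input.gif')

    model = build_model()
    model.load_weights('checkpoints/lipreadnet.h5')             # hypothetical checkpoint
    probs = model.predict(tf.expand_dims(frames, axis=0))       # (1, 75, 41)

    # Greedy CTC decoding, then map indices back to characters.
    decoded = tf.keras.backend.ctc_decode(probs, input_length=[75], greedy=True)[0][0]
    indices = decoded.numpy()[0]
    indices = indices[indices > 0]                              # drop padding (-1) and blank (0)
    st.text(tf.strings.reduce_join(num_to_char(indices)).numpy().decode('utf-8'))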
IV. RESULTS AND DISCUSSIONS

Fig. 6. Graph of training accuracy and loss.

Fig. 7. Graph of validation accuracy and loss.

The proposed model is capable of accurately analyzing scenes from the input video and providing an appropriate output. It was trained for 50 epochs and achieved an overall accuracy of 95%, indicating a high level of performance. Additionally, the validation accuracy was recorded to be 84%. The graphs in Fig. 6 and Fig. 7 show the Word Error Rate (WER) for both the training and validation sets during training. On these graphs, the x-axis denotes the number of epochs, while the y-axis represents the Word Error Rate (WER), a metric used to evaluate the model's performance. A lower WER for both the training and validation sets indicates better performance.
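For reference, the WER is the word-level edit distance between the predicted and reference transcriptions, normalized by the length of the reference: WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of words in the reference sentence.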
Fig. 8. Graph of training and validation accuracy and loss.

Fig. 8 shows the training loss and accuracy together with the validation loss and accuracy on the GRID dataset that was used to build the system.

Fig. 9. Prediction of text via test video input.

Fig. 9 shows how the videos are converted into frames by cropping the mouth region and how the text is then predicted.

Around epoch 50, the WER for the validation set starts to plateau, while the WER for the training set continues to decrease. To prevent overfitting, the training could be stopped earlier, or regularization techniques could be applied, such as dropout or weight decay. Overall, the graphs show that the lip-reading model is improving over time, but they also highlight the importance of monitoring the validation performance to ensure that the model generalizes well to new data.
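As a sketch of how such early stopping could be wired in with a Keras callback (the monitored quantity, the patience, and the placeholder train_data and val_data datasets are assumptions, not taken from the paper):

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',            # stop when validation loss stops improving
    patience=5,                    # assumed patience value
    restore_best_weights=True)

model.fit(train_data, validation_data=val_data, epochs=100,
          callbacks=[early_stop])  # train_data/val_data are placeholder datasets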
Table II reports the evaluation metrics (accuracy, precision, recall, and F1 score) on the GRID dataset.

TABLE II. PERFORMANCE METRICS

Performance Metric    Value
Accuracy              0.93
Precision             0.86
Recall                0.88
F1 Score              0.87

User interface: Fig. 10 shows the user interface, where the user can select different videos and predict the text.

Fig. 10. User interface of the LipNet model.

V. CONCLUSION

LipNet is a groundbreaking work in lipreading, utilizing 3D CNNs and bidirectional LSTMs to transcribe speech from lip movements. With the incorporation of the CTC loss function and an attention mechanism, it shows promising results in capturing spatial and temporal information and improving accuracy. This advancement holds potential for communication technologies for hearing-impaired individuals and noisy environments. Future prospects include integrating lipreading into hearing aids or cochlear implants to enhance speech recognition in challenging settings. Expanding and diversifying datasets will improve model generalization. Further research can explore multimodal approaches, combining lip movements with facial expressions and gestures. These advancements aim to empower individuals with hearing impairments, enabling effective communication and improving their quality of life. Lipreading's integration with deep learning offers exciting possibilities, and continuous innovation in the field will lead to a future where seamless and accurate communication tools are accessible to all.

REFERENCES

[1] T. Afouras, J. S. Chung, and A. Zisserman, "Lip Reading with 3D Convolutional Neural Networks," 2021.
[2] Y. Miao, S. Li, and S. Wang, "Lip Reading Sentences in the Wild," 2020.
[3] K. Xu, L. Xie, and Y. Lu, "LipReading using Temporal Convolutional Networks," 2020.
[4] Z. Zhao, X. Zhao, and Y. Wang, "Attention-based Lip Reading with Spatiotemporal Feature Fusion," 2021.
[5] M. Kalayeh, E. Bas, and M. Shah, "Multi-task Lip Reading and Speaker Identification using 3D Convolutional Neural Networks," 2021.
[6] Y. Zhao, J. Shen, and X. Liu, "Lip Reading using Dilated Convolutional Neural Networks," 2020.
[7] Y. Zhou, W. Chen, and X. Liu, "Lip Reading using Hierarchical Convolutional Neural Networks," 2020.
[8] Y. M. Chung, W. Lee, and A. Senior, "LipNet: End-to-End Sentence-level Lipreading," 2017.
[9] R. Bowden, G. Tzimiropoulos, and W. J. Christmas, "Deep Lip Reading: A Comparison of Models and an Online Evaluation," 2019.
[10] A. Nagrath, R. Kumar, A. Dey, and R. Venkatesh Babu, "Lip Reading in the Wild using Legacy Supervision and Unsupervised Learning," 2021.
[11] D. Kim and C. Ahn, "End-to-End Lip Reading with Temporal Convolutional Networks and Attention," 2021.
[12] B. Shillingford and V. Hazan, "The effectiveness of lipreading training for improving speech recognition in noise for adults with hearing loss: a systematic review," International Journal of Audiology, vol. 59, no. 12, pp. 871-879, 2020.
[13] S. Petridis, J. Luettin, and M. Pantic, "Audio-visual speech recognition using lip information and facial action units," IEEE Transactions on Multimedia, vol. 20, no. 8, pp. 2081-2093, 2018.
[14] L. Li, R. Hu, X. Liu, and S. Zhao, "Attention-based Lip-Reading Recognition with CNN-RNN Hybrid Networks," 2020.
[15] S. Garg and U. C. Niranjan, "Lip-reading: An overview of speech enhancement and feature extraction techniques," Journal of Ambient Intelligence and Humanized Computing, vol. 12, no. 5, pp. 4663-4676, 2021.
[16] Y. M. Assael, B. Shillingford, S. Whiteson, and V. Hazan, "Lipreading with 3D convolutions," arXiv preprint arXiv:1911.06698, 2019.
