Developing Phoneme Based Lip Reading Sentences System For Silent Speech Recognition
DOI: 10.1049/cit2.12131
ORIGINAL RESEARCH
Revised: 15 July 2022    Accepted: 20 July 2022
1 School of Engineering, London South Bank University, London, UK
2 Faculty of Informatics and Computer Science, British University in Egypt, Cairo, Egypt
3 School of Electronics and Informatics, Northwestern Polytechnical University, Xi'an, China

Correspondence
Randa El-Bialy, School of Engineering, London South Bank University, London, UK.
Email: [email protected]

Abstract
Lip-reading is a process of interpreting speech by visually analysing lip movements. Recent research in this area has shifted from simple word recognition to lip-reading sentences in the wild. This paper attempts to use phonemes as a classification schema for lip-reading sentences to explore an alternative schema and to enhance system performance. Different classification schemas have been investigated, including character-based and viseme-based schemas. The visual front-end model of the system consists of a Spatial-Temporal (3D) convolution followed by a 2D ResNet. Transformers utilise multi-headed attention for phoneme recognition models. For the language model, a Recurrent Neural Network is used. The performance of the proposed system has been validated with the BBC Lip Reading Sentences 2 (LRS2) benchmark dataset. Compared with the state-of-the-art approaches in lip-reading sentences, the proposed system has demonstrated an improved performance by a 10% lower word error rate on average under varying illumination ratios.
KEYWORDS
deep learning, deep neural networks, lip‐reading, phoneme‐based lip‐reading, spatial‐temporal convolution,
transformers
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
© 2022 The Authors. CAAI Transactions on Intelligence Technology published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology and Chongqing University of Technology.
and tackling the effect of different face structures, etc. [5]. Different pronunciations exist due to different dialects of people in different regions. Also, in reality, some people have short lip movements compared to others. Furthermore, people can talk from different angles towards a camera. Because of these issues, it is essential to create more robust models [6].

With the power of Deep Learning (DL) architectures and the availability of large-scale databases, it is possible to shift from the early lip-reading systems, which addressed simple word recognition tasks, to more realistic and complex tasks [7]. Accordingly, these DL architectures have led to current systems that target continuous lip-reading [8, 9] and improve the performance of visual speech recognition in general. However, due to the complexity of image processing and the difficulty of training classifiers, it is difficult for traditional lip-reading systems to meet the requirements of real-time applications. As a result of the advancements in lip-reading systems, numerous applications are conceivable, for example, resolving multi-talker simultaneous speech [10], developing augmented lip views to assist people with hearing impairments [11], dictating messages to smartphones in noisy environments [12], transcribing and re-dubbing silent films [8], and discriminating between native and non-native speakers [13].

Currently, there are two leading approaches to solving the lip-reading problem. The first approach handles it as a word or phrase classification task. This approach uses video samples to predict a word or phrase label [14]. The second is a more recent one, which has gained its strength from the deep network's capability to perform text predictions, such as complete sentences. Accordingly, instead of predicting word labels for solving lip-reading problems, this approach predicts character sequences or viseme sequences [8, 15, 16]. As for visual speech units, the two primary forms are phonemes and visemes. Phonemes are strongly linked to an acoustic speech signal [17]. A viseme, on the other hand, is the most basic visual unit of speech, reflecting a gesture of the mouth, face, and visible elements of the teeth and tongue, also known as visible articulators [4]. Even though phonemes represent distinct short sounds, some studies employ phoneme units to increase lip-reading accuracy [18], while others focus on visemes. The efficacy of phoneme or viseme units in lip-reading systems is a point of contention.

Using phonemes for lip-reading sentences has some advantages over other systems since it overcomes the cumulative loss of information caused by the mapping process from phoneme classes (the number of classes is between 45 and 53) to viseme classes (the number of classes is between 10 and 14) [19]. However, due to the reduction of a set of phonemes to a set of visemes, the complexity of the pronunciation dictionary increases due to the increasing volume of homophonic words, and the discriminative power of the classification model is reduced. Essentially, there is a trade-off between unit and model accuracy at the sentence level.

Another problem that a viseme-based system suffers from is the large number of proposed phoneme-to-viseme maps: a phoneme is related to one viseme class, but a viseme may represent many phonemes. This causes ambiguity between phonemes when using viseme classifiers; as an illustration, the viseme class 'FV' can be mapped into the 'ae', 'eh', 'ay', 'ey', and 'hh' phoneme classes [20, 21]. In comparison, only two variations of phoneme dictionaries have been used. In addition, most English words have a one-to-one mapping to a word, with only a few exceptions having a one-to-many relationship to a set of words. Therefore, converting recognized phonemes to words will have less complexity and require less computational effort.

This paper uses phonemes as a classification schema for lip-reading sentences in the wild rather than character-based or viseme-based schemas. The main aim of this research is to explore an alternative schema and to enhance the system's performance. The proposed system's performance has been validated using the BBC Lip Reading Sentences 2 (LRS2) benchmark dataset. The system displayed a 10% average reduction in word error rate under varying illumination ratios compared to the state-of-the-art systems in lip-reading sentences.

The rest of this paper is organized as follows: Section 2 provides a literature review on phoneme-based lip-reading systems and discusses the different architectures used in feature extraction, phoneme labelling, and classifiers; in addition, the relevant works on Automatic Speech Recognition (ASR) are discussed. Section 3 discusses the methodology and the proposed system in detail, including its pre-processing steps, the structure of the visual front-end model, the phoneme recognition model, the pronunciation dictionary used, and the Recurrent Neural Network-based language model. Section 4 briefly discusses the BBC Lip Reading Sentences 2 benchmark dataset. Section 5 presents a comparison of models, which addresses the details of the state-of-the-art character-based lip-reading system and a viseme-based lip-reading system. Section 6 presents the experimental results and demonstrates the performance of the proposed model with evaluation; in addition, it presents how noise is added, through Gamma correction, to test the robustness of the proposed model. Finally, concluding remarks and future work are given in Section 7.

2 | RELATED WORKS

Phonemes are mainly used with acoustic signals and are considered the main building blocks of speech. However, in scenarios where audio signals are corrupted or unavailable, in noisy environments, or in the case of individuals with partial or total hearing loss, it will usually be challenging to detect audio signals. Accordingly, lip-reading is a complementary method to compensate for the lack of audio information. The literature can therefore be divided into two directions. The first one is where audio signals are absent, and only video signals are available. The second direction is where only audio signals are present. As phonemes have traditionally been associated with sound or audio, research studies are rare in relation to video-based lip-reading. This section reviews the literature on video-based phoneme recognition for lip-reading and some of the relevant works on Automatic Speech Recognition (ASR).
According to Howell et al. [22], an approach was proposed to treat visual speech as dysarthria to compensate for the gap of having a reduced phonemic repertoire. For visual feature extraction, the Active Appearance Model was used, the Hidden Markov Model was utilised for phoneme recognition, and Weighted Finite-State Transducers were employed for word recognition. The dataset was captured from a single female speaker who spoke six repetitions of a set of 112 isolated words. The word accuracy rate was 49.70%.

Noda et al. [23] were the first to apply Convolutional Neural Networks (CNN) for feature extraction in visual speech recognition systems and also for phoneme recognition, as their purpose was to prove whether feature extraction mechanisms using CNN would outperform other models, which use classic dimensionality reduction techniques. For identifying isolated words, an HMM was used. The dataset used contained six different speakers pronouncing 300 Japanese words. The average phoneme recognition rate was 58%, and the accuracy for word recognition was 37%.

Thangthai et al. [18] compared viseme-based and phoneme-based lip-reading systems to add more evidence to the argument that phonemes can surpass visemes in lip-reading systems, and they suggested that phonemes are the current optimal class labels for lip-reading. Using the TCD-TIMIT corpus for sentences, the Discrete Cosine Transform and Eigenlip for feature extraction, and a Weighted Finite-State Transducer as the word recogniser, the phoneme recognition accuracy acquired was 33.44%. Furthermore, the word accuracy rate was 48.74% in speaker-dependent tests using Eigenlip, compared to 46.6% and 33.06% for viseme and word recognition, respectively.

To enhance the accuracy of phoneme-based lip-reading systems, Shillingford et al. [24] proposed a Deep Neural Network and a production-level speech decoder for mapping videos into a sequence of phoneme distributions and generating the corresponding word sequences, respectively. A Large-Scale Visual Speech Recognition dataset was constructed and used in the study. Spatio-temporal Convolutions, Bi-directional Long Short-Term Memory, and Finite-State Transducers were utilised for feature extraction, phoneme recognition, and word recognition. The phoneme recognition accuracy rate was 66%, and the word accuracy rate was 60%.

The relevant works on Automatic Speech Recognition are discussed below.

Chiu et al. [25] presented a so-called Listen, Attend, and Spell (LAS) architecture, an attention-based encoder-decoder, in which traditional automatic speech recognition system components were included in a single neural network. They proved that graphemes could be substituted with a word piece model. The work has shown that the performance of ASR can be significantly improved by optimising the LAS model and introducing a multi-head attention architecture. They also improved the accuracy by exploring synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization. The experiments were conducted on a 12,500-h voice search task (Google Voice Search).

The work by Anjie Fang et al. [26] aimed to present a classification model that is robust to ASR errors and acquires pronunciation similarities ignored in word-level representations by creating an ASR transcription at the phoneme level. Four existing datasets were used in the research, including the Stanford Sentiment Treebank (SST), the TREC Question classification (TQ), SQuAD, and the subjectivity dataset (SUBJ), by generating noisy ASR transcriptions for them. The authors demonstrated the integration of phoneme embeddings into existing neural network architectures and the improvement of classification models when handling data containing ASR errors. The accuracies for SST, TQ-50, and TQ-6 were 41%, 65%, and 75%, respectively.

A comparison was conducted by Mohammad Zeineldeen et al. [27] between phoneme-based and grapheme-based output labels utilising the encoder-decoder-attention ASR model. Furthermore, the use of byte-pair-encoding (BPE)-based phonemes and single phonemes as output labels was investigated, with the conclusion that both had a similar performance, which has further proven that phoneme-based models are competitive with grapheme-based models. The Switchboard 300 h and LibriSpeech 960 h benchmarks were used to conduct the experiments. As a result of these experiments, the accuracy obtained on the Switchboard 300 h dataset for BPE-based graphemes was 85%, and 86.2% was achieved for both single and BPE-based phonemes. As for the LibriSpeech 960 h dataset, the accuracies acquired were 89.44%, 86.2%, and 90.86% for BPE-based phonemes, single phonemes, and BPE-based graphemes, respectively. As such, it was observed that grapheme- and phoneme-based BPE outperform single phonemes on LibriSpeech 960 h, which contradicts the results of Switchboard 300 h.

Wei Zhou et al. [28] adopted a simple competitive approach for phoneme-based neural transducer modelling, sustaining the advantages of both classical and end-to-end approaches. In order to maintain the sequence-to-sequence modelling consistency, a simplified neural network structure along with direct integration with an external word-level language model was presented by utilising the local dependencies of phonemes. Furthermore, augmentation for word-end-based phoneme labels was proposed to improve the system performance, and frame-wise cross-entropy loss was used for an efficient training procedure. The proposed model was evaluated on both the TED-LIUM release 2 (TLv2) and Switchboard (SWBD) corpora, and the word error rates obtained were 6.3% and 11.5%, respectively (Tables 1 and 2).

As shown in the literature, research on phoneme-based lip-reading systems is very limited, and to the best of the authors' knowledge, this study is the first work that purely uses phonemes from videos for lip-reading sentences in the wild. Most of the time, using phonemes is associated with audio signals; however, in this research, the audio is not presented/provided, as in some scenarios and potential applications, such as CCTV footage analysis, forensic investigations, silent dictation in public places, wearable optical technologies to aid hearing, animation and digital avatars, silent movie restoration, and, last but not least, humanoid robotics.
TABLE 1
References | Dataset | Year | Signal | Feature extraction | Phoneme recognition | PAR | Classification | Classifier | WAR (%)
Howell et al. [22] | — | 2013 | Video | AAM | HMM | — | Isolated words | WFST | 49.7
Noda et al. [23] | — | 2014 | Video | CNN | CNN | 58% | Isolated words | HMM | 37
Thangthai et al. [18] | TCD-TIMIT | 2017 | Video | Eigenlip | Hybrid DNN-HMM | 33.4% | Sentences | WFST | 48.7
Shillingford et al. [24] | LSVSR | 2018 | Video | Spatio-temporal convolutions | Bi-LSTM | 66% | Sentences | FST | 60

TABLE 2
References | Dataset | Year | Signal | Model | Output label | Accuracy (%)
Fang et al. [26] | Amazon ALEXA data | 2020 | Audio | CNN | — | 76
Zeineldeen et al. [27] | Switchboard 300 h | 2020 | Audio | Attention-based encoder-decoder | BPE-grapheme | 85
 | | | | | Single-phoneme | 86.2
 | | | | | BPE-phoneme | 86.2
 | LibriSpeech 960 h | | | | BPE-grapheme | 90.8
 | | | | | Single-phoneme | 86.3
 | | | | | BPE-phoneme | 90.7
 | SWBD | | | | | 88.5
3 | METHODOLOGY
All of the videos are pre-processed as shown in Figure 2. With a framing rate of 25 frames per second, images with red, green, and blue pixel values and a resolution of 160 pixels by 160 pixels are utilised. Because the region of interest (ROI) and feature input to the visual front end are the speaker's lips, the steps for video pre-processing are as follows:

• Step 1: Sample the videos into image frames.
• Step 2: Identify face landmarks in the sampled videos. Based on iBug [29] and a Convolutional Neural Network detector known as the Single Shot Multi-Box Detector (SSD) [34], facial landmarks are extracted by detecting face presence in every single frame.
• Step 3: Generate image dimensions of 112 × 112 × T (where T is the number of image frames) by converting each video frame to greyscale, followed by scaling and cropping at the centre of the facial landmark boundary.
• Step 4: Conduct data augmentation with horizontal flipping, random frame removal [30, 31], and random shifts in the temporal and spatial dimensions of ±2 frames and ±5 pixels, respectively.
• Step 5: Normalise each pixel in a frame to the overall mean and variance.
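The paper does not give an implementation for these steps, so the following is a minimal sketch of Steps 1, 3, and 5 (with a simplified Step 4) using OpenCV and NumPy; the landmark-driven crop of Step 2 is replaced here by a plain centre crop, and all function names are illustrative only.

```python
# Minimal sketch (assumed tooling, not the authors' code) of the pre-processing steps above.
import cv2
import numpy as np

def preprocess_video(path, crop_size=112):
    """Step 1: sample the video into frames; Step 3: greyscale, scale and centre-crop
    to 112 x 112; Step 5: normalise to zero mean and unit variance.
    (Step 2, landmark detection with an SSD face detector, is replaced by a centre crop here.)"""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()                                   # Step 1: one frame per iteration
        if not ok:
            break
        grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)           # Step 3: greyscale
        grey = cv2.resize(grey, (160, 160))                      # scale to 160 x 160
        y0 = x0 = (160 - crop_size) // 2
        frames.append(grey[y0:y0 + crop_size, x0:x0 + crop_size])  # centre crop to 112 x 112
    cap.release()
    clip = np.stack(frames).astype(np.float32)                   # shape: (T, 112, 112)
    return (clip - clip.mean()) / (clip.std() + 1e-8)            # Step 5: normalisation

def augment(clip, rng=np.random.default_rng()):
    """Step 4 (simplified): horizontal flip, random frame removal and a small spatial shift."""
    if rng.random() < 0.5:
        clip = clip[:, :, ::-1]                                  # horizontal flip
    if clip.shape[0] > 2:
        clip = np.delete(clip, rng.integers(clip.shape[0]), axis=0)  # drop one random frame
    return np.roll(clip, rng.integers(-5, 6), axis=2)            # +/- 5 pixel shift
```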
Each video is padded to 180 characters to ensure that all videos are of the same length; a space character, a Start of Sentence token <SoS>, and an End of Sentence token <EoS> are also included. Table 3 illustrates the classes used by the phoneme classifier:

TABLE 3 Classes used by the phoneme classifier
{[pad], <sos>, 'AA', 'AE', 'AH', 'AO', 'AW', 'AY', 'B', 'CH', 'D', 'DH', 'EH', 'ER', 'EY', 'F', 'G', 'HH', 'IH', 'IY', 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OY', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH', 'UW', 'V', 'W', 'Y', 'Z', 'ZH', <eos>, [space]}

FIGURE 4 Phoneme transformer architecture
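As an illustration of how the classes in Table 3 can be used in practice, the snippet below maps them to integer labels and decodes a predicted label sequence back to phonemes; the vocabulary ordering, the 180-token padding length, and the helper names are assumptions rather than details taken from the paper.

```python
# Illustrative label encoding for the phoneme classes in Table 3 (assumed detail).
PHONEMES = ['AA', 'AE', 'AH', 'AO', 'AW', 'AY', 'B', 'CH', 'D', 'DH', 'EH', 'ER', 'EY',
            'F', 'G', 'HH', 'IH', 'IY', 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OY', 'P',
            'R', 'S', 'SH', 'T', 'TH', 'UH', 'UW', 'V', 'W', 'Y', 'Z', 'ZH']
CLASSES = ['[pad]', '<sos>'] + PHONEMES + ['<eos>', '[space]']

TOKEN_TO_ID = {token: i for i, token in enumerate(CLASSES)}
ID_TO_TOKEN = {i: token for token, i in TOKEN_TO_ID.items()}

def encode(phoneme_sequence, max_len=180):
    """Wrap a phoneme sequence in <sos>/<eos> and pad it to a fixed length."""
    ids = [TOKEN_TO_ID['<sos>']] + [TOKEN_TO_ID[p] for p in phoneme_sequence] + [TOKEN_TO_ID['<eos>']]
    return ids + [TOKEN_TO_ID['[pad]']] * (max_len - len(ids))

def decode(ids):
    """Drop padding and sentence markers, keeping phonemes and word boundaries."""
    return [ID_TO_TOKEN[i] for i in ids if ID_TO_TOKEN[i] not in ('[pad]', '<sos>', '<eos>')]

# e.g. "lip reading" -> L IH P [space] R IY D IH NG
ids = encode(['L', 'IH', 'P', '[space]', 'R', 'IY', 'D', 'IH', 'NG'])
print(decode(ids))
```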
As the primary building block in an encoder-decoder architecture, as illustrated in Figure 4, multi-headed attention is implemented by Transformers [34]. A stack of self-attention layers, with the input tensor as the attention queries, keys, and values, constitutes the encoder. As for the decoder, it follows the model presented in [15] and consists of a dense layer, batch normalisation, ReLU, and a dropout layer with a probability of 0.1 for each of the three fully connected layers; there are 1024 nodes in the first and the last fully connected dense layers, while the dense middle layer consists of 2048 nodes. Furthermore, the decoder generates phoneme probabilities with a cross-entropy loss function corresponding to the ground truth table. The encoder utilises the base model of [34], which has six layers, a model size of 512, eight attention heads, and a dropout probability of 0.1.
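A minimal TensorFlow/Keras sketch of the phoneme recognition model as described above (six encoder layers, model size 512, eight heads, dropout 0.1, and a 1024–2048–1024 fully connected head with batch normalisation, ReLU, and dropout) is given below; positional encoding, masking, and training details are omitted, and the exact wiring is an assumption, not the authors' implementation.

```python
# Sketch only: a self-attention encoder plus the three-layer dense head described above.
import tensorflow as tf

NUM_CLASSES = 43           # 39 phonemes + [pad], <sos>, <eos>, [space]; count assumed from Table 3
D_MODEL, NUM_HEADS, NUM_LAYERS, DROPOUT = 512, 8, 6, 0.1

def encoder_block(x):
    # Self-attention: the input tensor serves as queries, values and keys.
    attn = tf.keras.layers.MultiHeadAttention(num_heads=NUM_HEADS,
                                              key_dim=D_MODEL // NUM_HEADS)(x, x, x)
    x = tf.keras.layers.LayerNormalization()(x + tf.keras.layers.Dropout(DROPOUT)(attn))
    ffn = tf.keras.layers.Dense(2048, activation="relu")(x)
    ffn = tf.keras.layers.Dense(D_MODEL)(ffn)
    return tf.keras.layers.LayerNormalization()(x + tf.keras.layers.Dropout(DROPOUT)(ffn))

def fc_block(x, units):
    # Dense -> batch normalisation -> ReLU -> dropout, as in the head described above.
    x = tf.keras.layers.Dense(units)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    return tf.keras.layers.Dropout(DROPOUT)(x)

def build_phoneme_recogniser(num_frames=None):
    feats = tf.keras.Input(shape=(num_frames, D_MODEL))   # one 512-d visual feature per frame
    x = feats
    for _ in range(NUM_LAYERS):
        x = encoder_block(x)
    for units in (1024, 2048, 1024):
        x = fc_block(x, units)
    probs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(feats, probs)

model = build_phoneme_recogniser()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```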
3.4 | Language model

An attention-based RNN is utilised to convert the recognized phonemes into meaningful sentences [35]. As shown in Figure 5, the network consists of two multilayer RNNs: an encoder for the source phoneme sequences and a decoder for the target word sequences. By initialising the decoder with the last hidden state of the encoder, the decoder gains access to the source information. The main goal of the attention mechanism is to create direct connections between the target and the source. At every decoder time step, the following attention computations take place. First, all source states and the current target hidden state are compared to produce the attention weights, as in Equation (1). Second, a context vector is computed based on the attention weights, as shown in Equation (2). Third, as shown in Equation (3), in order to produce the attention vector, the current target hidden state is combined with the context vector and then fed as input to the following time step, where $\alpha_{ts}$ represents the attention weights, $h_t$ is the target hidden state, $h_s$ is the source hidden state, $c_t$ is the context vector, and $a_t$ is the attention vector.

$$\alpha_{ts} = \frac{\exp\left(\mathrm{score}(h_t, h_s)\right)}{\sum_{s'=1}^{S} \exp\left(\mathrm{score}(h_t, h_{s'})\right)} \qquad (1)$$

$$c_t = \sum_{s} \alpha_{ts} h_s \qquad (2)$$

$$a_t = f(c_t, h_t) = \tanh\left(W_c [c_t; h_t]\right) \qquad (3)$$

FIGURE 5 Encoder-decoder-attention architecture
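Equations (1)–(3) can be written compactly in NumPy as follows; the dot-product score function and the array shapes are assumptions, since the paper follows [35] without fixing these details.

```python
# Minimal NumPy sketch of one attention step, Equations (1)-(3) (Luong-style attention [35]).
import numpy as np

def luong_attention_step(h_t, H_s, W_c):
    """h_t: (d,) current target hidden state; H_s: (S, d) source hidden states;
    W_c: (d, 2d) combination weights. Returns attention weights, context and attention vector."""
    scores = H_s @ h_t                                  # score(h_t, h_s) via dot product (assumed)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                         # Equation (1): softmax over source positions
    c_t = alpha @ H_s                                   # Equation (2): context vector
    a_t = np.tanh(W_c @ np.concatenate([c_t, h_t]))     # Equation (3): attention vector
    return alpha, c_t, a_t

# Usage with random values
d, S = 4, 6
rng = np.random.default_rng(0)
alpha, c_t, a_t = luong_attention_step(rng.standard_normal(d),
                                       rng.standard_normal((S, d)),
                                       rng.standard_normal((d, 2 * d)))
```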
4 | DATASET USED

The BBC Lip Reading Sentences 2 (LRS2) [36] dataset has been used for this work. There are over 46,000 videos in total in the dataset, with over two million word occurrences and a vocabulary of approximately 40,000 words. The video with the most extended duration is 180 frames long, with each video having a frame rate of 25 frames per second. The dataset consists of spoken sentences, where each sentence is up to 100 ASCII characters, extracted from BBC videos with a range of facial positions from frontal to profile. The dataset is quite challenging because of the variety of perspectives, lighting settings, genres, and speakers.

5 | MODELS COMPARISON

Recently, two prominent techniques have been used to tackle the lip-reading challenge. The first technique treats it as a word or phrase recognition problem, where video samples are analysed to predict a word or phrase label. The second technique, the latest solution, addresses the lip-reading problem by predicting a viseme sequence or a character sequence rather than a word label. In this section, a comparison is provided regarding the work of both [15, 37].

Lip-reading systems consist of three stages. The first stage is the pre-processing stage, where videos are input into the system and facial landmark extraction is applied; this consists of face detection, face tracking, and facial landmark detection, followed by greyscale conversion, scaling, central cropping, horizontal flipping, random frame removal, pixel shifting, and Z-score normalisation to extract the appropriate region of interest (ROI). Next, the extracted ROI, as a sequence of frames, is input into the visual front-end model, where a spatial-temporal (3D) convolution is applied. Subsequently, a 2D ResNet is utilised to decrease the spatial dimensions with depth. Accordingly, the output is a 512-dimensional feature vector for each input video frame (a sketch of such a front end is given below).

The second stage depends on the classification scheme: the authors in [37] use characters for labelling the videos with 26 classes, while the authors in [15] use visemes with 13 classes. The authors in [37] discussed three models for this task: the first is a recurrent model consisting of stacked Bidirectional LSTM layers, the second model is fully convolutional, and the third is a Transformer model that follows an encoder-decoder structure with multi-head attention layers as a building block. The authors in [15] presented a Transformer model with a different decoder and dense layer structure than the work presented in [37] due to the difference in nature between visemes and characters.

The third stage is the language model, where the input is the labels for each sequence of frames and the output is the uttered sentence. For example, the authors in [37] use a character-level external language model consisting of four unidirectional layers of a Recurrent Neural Network with 1024 LSTM cells each that outputs a sentence character by character. As for [15], a word detector consisting of two steps is used: the first step is a word lookup step and the second is Perplexity Calculations.
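For illustration, a minimal TensorFlow/Keras sketch of such a visual front end is given below: a spatial-temporal (3D) convolution over the greyscale clip followed by a per-frame 2D residual network that outputs a 512-dimensional feature vector per frame. The kernel sizes, strides, and block counts are assumptions and not the exact configuration used in [15, 37].

```python
# Sketch of a 3D-convolution + per-frame 2D ResNet front end producing 512-d features per frame.
import tensorflow as tf

def residual_block(x, filters, stride=1):
    shortcut = x
    x = tf.keras.layers.Conv2D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = tf.keras.layers.Conv2D(filters, 1, strides=stride, use_bias=False)(shortcut)
        shortcut = tf.keras.layers.BatchNormalization()(shortcut)
    return tf.keras.layers.ReLU()(x + shortcut)

def build_visual_front_end(num_frames=None):
    clip = tf.keras.Input(shape=(num_frames, 112, 112, 1))       # T x 112 x 112 greyscale clip
    x = tf.keras.layers.Conv3D(64, (5, 7, 7), strides=(1, 2, 2), padding="same", use_bias=False)(clip)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.MaxPool3D(pool_size=(1, 3, 3), strides=(1, 2, 2), padding="same")(x)

    # Apply the 2D ResNet trunk to every frame independently.
    def resnet_trunk():
        inp = tf.keras.Input(shape=(28, 28, 64))
        y = inp
        for filters, stride in [(64, 1), (128, 2), (256, 2), (512, 2)]:
            y = residual_block(y, filters, stride)
        y = tf.keras.layers.GlobalAveragePooling2D()(y)          # -> 512-d vector per frame
        return tf.keras.Model(inp, y)

    feats = tf.keras.layers.TimeDistributed(resnet_trunk())(x)   # (batch, T, 512)
    return tf.keras.Model(clip, feats)

front_end = build_visual_front_end()
```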
The datasets used by [37] are Lip Reading in the Wild (LRW) and Lip Reading Sentences 2 (LRS2); the authors in [15] used Lip Reading Sentences 2 (LRS2). Both [15, 37] trained their models on a single GeForce GTX 1080 Ti GPU with 11 GB memory and implemented all operations in TensorFlow.

According to [37], the Transformer is the best performing network among the three presented networks, with a 50% word error rate, while in [15] the model presented achieved a better performance with a 35.4% word error rate.

6 | EXPERIMENTS AND RESULTS

In this section, we provide some experiments to compare the performance of our proposed model, which uses phonemes as classifiers, with the work presented in [37], which uses characters as classifiers, and the work in [15], which uses visemes as classifiers. The metrics used for model evaluation include the Word Error Rate (WER) and the Character Error Rate (CER), as discussed below. All the simulations have been implemented with TensorFlow, on a GeForce GTX 1080 Ti GPU with 11 GB memory for the first set of simulations with 90% training to 10% validation, and on a GeForce RTX 3070 GPU with 16 GB memory for the second set of simulations with 70% training to 30% testing.

$$ER = \frac{S + D + I}{N} \qquad (4)$$

$$PER = \frac{P_S + P_D + P_I}{P_N} \qquad (5)$$

$$CER = \frac{C_S + C_D + C_I}{C_N} \qquad (6)$$

$$WER = \frac{W_S + W_D + W_I}{W_N} \qquad (7)$$

where S, D, and I denote the numbers of substituted, deleted, and inserted units, N is the number of units in the ground truth, and the subscripts P, C, and W refer to phonemes, characters, and words, respectively.
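For reference, the error rates in Equations (4)–(7) can all be computed with the same edit-distance routine; the sketch below is illustrative and not taken from the authors' evaluation code.

```python
# Minimal sketch: a Levenshtein alignment yields the substitution, deletion and insertion
# counts, and the error rate is their sum divided by the length of the reference.
def error_rate(reference, hypothesis):
    """reference/hypothesis are lists of units (phonemes, characters, words or visemes)."""
    R, H = len(reference), len(hypothesis)
    # dist[i][j] = minimum edits turning the first i reference units into the first j hypothesis units
    dist = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        dist[i][0] = i                     # i deletions
    for j in range(H + 1):
        dist[0][j] = j                     # j insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[R][H] / max(R, 1)          # (S + D + I) / N

# WER over words, CER over characters, PER over phonemes:
print(error_rate("the cat sat".split(), "the bat sat down".split()))   # 2 edits / 3 words
```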
6.2 | Experimental results

In this section, the proposed model is evaluated and compared to different state-of-the-art classification schemas. Different ratios of training and testing data are used to verify the robustness of the model. The phoneme classifier was trained for 2000 epochs. The results of the first set of simulations are shown in Table 4, and the plots of the loss and the PER for both training and validation over 2000 epochs are given in Figures 6 and 7. The confusion matrix for the classification of ASCII characters is provided in Figure 8.

TABLE 4 Results of the phoneme-based lip-reading system
Epochs | Validation samples | PER (%) | CER (%) | WER (%)
2000 | 1500 | 30 | 32 | 40

FIGURE 6 Loss curve for training and validation

FIGURE 7 PER curve for training and validation

The phoneme-based lip-reading system achieved an overall WER of 40%, a reduction of 10% compared to the 50% achieved by the previous state-of-the-art model [37], as presented in Table 5.

Comparing the results of the phoneme-based lip-reading system with those of a viseme-based system that uses the LRS2 dataset with the same ratio of training samples to test samples, the observed accuracy of the phoneme-based model was lower than that of the viseme-based one. Table 6 shows the PER, VER (Viseme Error Rate), and WER results. VER is calculated as

$$VER = \frac{V_S + V_D + V_I}{V_N} \qquad (8)$$

TABLE 6 PER, VER, and WER results of the phoneme-based and viseme-based [15] systems
Phoneme-based lip-reading | | Viseme-based lip-reading [15] |
PER (%) | WER (%) | VER (%) | WER (%)
30 | 40 | 5 | 35

After running more simulations with a different ratio of training to testing, until no further convergence was recorded, the achieved results are reported in Table 7, and the plots of the loss and the PER for both training and validation are shown in Figures 9 and 10.

6.3 | Gamma correction

The pixel brightness has been altered to vary the illumination of the image frames in order to test the robustness of the proposed model. Videos consist of images with red, green, and blue pixels, and the intensity values range from a minimum of 0 to a maximum of 255. Normalisation is the first step, used to map the pixel values to a range from a minimum of 0 to a maximum of 1. The next step is to apply gamma correction according to Equation (9):

$$V_o = A\,V_I^{\gamma} \qquad (9)$$

where A represents a constant equal to 1, V_I represents the matrix of pixels, and γ, when given a value of less than 1, makes dark parts lighter and, when given a value larger than 1, makes shadowed parts darker. The last step is re-normalisation, where all the pixels are re-normalised to values from 0 to 255.
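A minimal NumPy sketch of this three-step procedure (normalise, apply Equation (9) with A = 1, re-normalise) is given below; the function name and the example values are illustrative only.

```python
# Sketch of the gamma-correction procedure used to vary illumination of image frames.
import numpy as np

def gamma_correct(frame, gamma, A=1.0):
    """frame: uint8 image array; gamma < 1 lightens dark regions, gamma > 1 darkens them."""
    v_i = frame.astype(np.float32) / 255.0                  # step 1: normalise pixel values to [0, 1]
    v_o = A * np.power(v_i, gamma)                          # step 2: Equation (9)
    return np.clip(v_o * 255.0, 0, 255).astype(np.uint8)    # step 3: re-normalise to [0, 255]

# Example: simulate brighter and darker illumination of the same frame
frame = (np.random.default_rng(0).random((112, 112)) * 255).astype(np.uint8)
brighter, darker = gamma_correct(frame, 0.5), gamma_correct(frame, 1.5)
```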
Tables 8 and 9 show the performance of the phoneme-based lip-reading system under varying illumination ratios compared to the state-of-the-art model [37]. The proposed system has an improved performance with a 10% lower word error rate. It can be seen that the lip-reading system is generally
7 | CONCLUSION

Using phonemes is usually associated with audio signals; however, in this research, audio signals are not presented/provided. A purely phoneme-based lip-reading system from videos, using a spatial-temporal Convolutional Neural Network as the front end and a Recurrent Neural Network as the back end, has been proposed in this study. The advantage of using phonemes for lip-reading sentences is to overcome the cumulative loss of information caused by the mapping process from phonemes to visemes. Another advantage is having only two variations of dictionaries used in phoneme recognition, compared to the large number of phoneme-to-viseme maps; this means that the conversion part of the phoneme system has less complexity, that is, the required computational effort is lower. With the BBC LRS2 benchmark dataset, the proposed model has demonstrated an improved performance by an 18% lower word error rate on average compared with the state-of-the-art lip-reading sentence systems. The results prove that using phonemes as a classification schema is a promising alternative to other classification schemas.

DATA AVAILABILITY STATEMENT
BBC Lip Reading Sentences 2 Dataset LRS2 (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html).

REFERENCES
1. Dupont, S., Luettin, J.: Audio-visual speech modeling for continuous speech recognition. IEEE Trans. Multimed. 2(3), 141–151 (2000). https://doi.org/10.1109/6046.865479
2. Zhou, Z., et al.: A review of recent advances in visual speech decoding. Image Vis. Comput. 32(9), 590–605 (2014). https://doi.org/10.1016/j.imavis.2014.06.004
3. Twaddell, W.F.: On defining the phoneme. Linguistic Society of America 11(1), 5–62 (1935). https://doi.org/10.2307/522070
4. Fisher, C.G.: Confusions among visually perceived consonants. J. Speech Hear. Res. 11(4), 796–804 (1968). https://doi.org/10.1044/jshr.1104.796
5. Fernandez-Lopez, A., Sukno, F.M.: Survey on automatic lip-reading in the era of deep learning. Image Vis. Comput. 78, 53–72 (2018). https://doi.org/10.1016/j.imavis.2018.07.002
6. Mestri, R., et al.: Analysis of feature extraction and classification models for lip-reading. In: Proceedings of the International Conference on Trends in Electronics and Informatics (ICOEI), pp. 911–915 (2019). https://doi.org/10.1109/icoei.2019.8862649
7. Hao, M., et al.: A survey of research on lipreading technology. IEEE Access 8, 204518–204544 (2020). https://doi.org/10.1109/ACCESS.2020.3036865
8. Chung, J.S., et al.: Lip reading sentences in the wild. In: Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 3444–3450 (2017). https://doi.org/10.1109/CVPR.2017.367
9. Fenghour, S., et al.: Deep learning-based automated lip-reading: a survey. IEEE Access 9, 121205 (2021). https://doi.org/10.1109/ACCESS.2021.3107946
10. Noda, K., et al.: Audio-visual speech recognition using deep learning. Appl. Intell. 42(4), 722–737 (2015). https://doi.org/10.1007/s10489-014-0629-7
11. Mattos, A.B., Oliveira, D.A.B.: Multi-view mouth renderization for assisting lip-reading. In: Proceedings of the 15th Web for All Conference: Internet of Accessible Things (W4A 2018) (2018). https://doi.org/10.1145/3192714.3192824
12. Gabbay, A., et al.: Seeing through noise: visually driven speaker separation and enhancement. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 3051–3055 (2018). https://doi.org/10.1109/ICASSP.2018.8462527
13. Georgakis, C., Petridis, S., Pantic, M.: Visual-only discrimination between native and non-native speech. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 4828–4832 (2014). https://doi.org/10.1109/ICASSP.2014.6854519
14. Chung, J.S., Zisserman, A.: Lip reading in the wild (2018). https://doi.org/10.1007/978-3-319-54184-6
15. Fenghour, S., Chen, D., Xiao, P., et al.: Disentangling homophemes in lip reading using perplexity analysis, pp. 1–17 (2020)
16. Fenghour, S., Chen, D., Xiao, P.: Decoder-encoder LSTM for lip reading. In: Proceedings of the 8th International Conference on Software and Information Engineering (ICSIE) (2019)
17. Henning, P., Finn, U., Tønnessen, E.: The status of the concept of phoneme in psycholinguistics. Speech Lang. Hear. Res. (2014). https://doi.org/10.1007/s10936-010-9149-8
18. Thangthai, K., Bear, H.L., Harvey, R.: Comparing phonemes and visemes with DNN-based lip-reading (2018). http://arxiv.org/abs/1805.02924
19. Bear, H.L., et al.: Some observations on computer lip-reading: moving from the dream to the reality. In: Optics and Photonics for Counterterrorism, Crime Fighting, and Defence X; and Optical Materials and Biomaterials in Security and Defence Systems Technology XI, p. 92530G (2014). https://doi.org/10.1117/12.2067464
20. Bear, H.L., Harvey, R.: Alternative visual units for an optimized phoneme-based lip-reading system. Appl. Sci. 9(18), 3870 (2019). https://doi.org/10.3390/app9183870
21. Montgomery, A.A., Jackson, P.L.: Physical characteristics of the lips underlying vowel lip-reading performance. J. Acoust. Soc. Am. 73(6), 2134–2144 (1983). https://doi.org/10.1121/1.389537
22. Howell, D., Theobald, B.-J., Cox, S.J.: Confusion modelling for automated lip-reading using weighted finite-state transducers. In: Proceedings of the International Conference on Auditory-Visual Speech Processing, pp. 197–203 (2013)
23. Noda, K., et al.: Lipreading using convolutional neural network. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1149–1153 (2014). https://doi.org/10.21437/interspeech.2014-293
24. Shillingford, B., et al.: Large-scale visual speech recognition. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 4135–4139 (2019). https://doi.org/10.21437/Interspeech.2019-1669
25. Chiu, C.C., et al.: State-of-the-art speech recognition with sequence-to-sequence models. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp. 4774–4778 (2018). https://doi.org/10.1109/ICASSP.2018.8462105
26. Fang, A., et al.: Using phoneme representations to build predictive models robust to ASR errors. In: SIGIR 2020 - Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 699–708 (2020). https://doi.org/10.1145/3397271.3401050
27. Zeineldeen, M., et al.: A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models (2020). http://arxiv.org/abs/2005.09336
28. Zhou, W., et al.: Phoneme based neural transducer for large vocabulary speech recognition. In: Proc. ICASSP, pp. 5644–5648 (2021)
29. Fenghour, S., et al.: An effective conversion of visemes to words for high-performance. Sensors 21(23), 7890 (2021). https://doi.org/10.3390/s21237890
30. Fenghour, S., et al.: Lip reading sentences using deep learning with only visual cues. IEEE Access (2020). https://doi.org/10.1109/ACCESS.2020.3040906
31. Afouras, T., et al.: Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–13 (2018). https://doi.org/10.1109/TPAMI.2018.2889052
32. Stafylakis, T., Tzimiropoulos, G.: Combining residual networks with LSTMs for lip-reading. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 3652–3656 (2017). https://doi.org/10.21437/Interspeech.2017-85
33. Treiman, R., Kessler, B., Bick, S.: Context sensitivity in the spelling of English vowels. Memory and Language 47(3), 448–468 (2002). https://doi.org/10.1016/s0749-596x(02)00010-4
34. Vaswani, A., et al.: Attention is all you need. In: Neural Information Processing Systems (NIPS), pp. 5998–6008 (2017)
35. Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Conference Proceedings, EMNLP 2015: Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421 (2015). https://doi.org/10.18653/v1/d15-1166
36. Chung, J.S., et al.: Lip reading sentences in the wild. In: Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), pp. 3444–3450 (2017). https://doi.org/10.1109/CVPR.2017.367
37. Afouras, T., Chung, J.S., Zisserman, A.: Deep lip reading: a comparison of models and an online application. In: Interspeech, pp. 3514–3518 (2018). https://doi.org/10.21437/Interspeech.2018-1943

How to cite this article: El-Bialy, R., et al.: Developing phoneme-based lip-reading sentences system for silent speech recognition. CAAI Trans. Intell. Technol. 8(1), 129–138 (2023). https://doi.org/10.1049/cit2.12131