A Lip Reading Method Based On 3D Convolutional Vision Transformer
ABSTRACT Lip reading has received increasing attention in recent years. It infers the content of speech from the movement of the speaker's lips. The rapid development of deep learning has promoted progress in lip reading. However, because lip reading must process continuous video frames, it has to consider both the correlation between adjacent images and the correlation between long-distance images. Moreover, lip reading recognition focuses on subtle changes of the lips and their surrounding region, so it must extract fine-grained features from small images. As a result, the performance of machine lip reading is generally low, and research progress has been slow. To improve the performance of machine lip reading, we propose a lip reading method based on a 3D convolutional vision transformer (3DCvT), which combines a vision transformer with 3D convolution to extract the spatio-temporal features of continuous images, and takes full advantage of the properties of convolutions and transformers to extract local and global features from continuous images effectively. The extracted features are then sent to a Bidirectional Gated Recurrent Unit (BiGRU) for sequence modeling. We prove the effectiveness of our method on the large-scale lip reading datasets LRW and LRW-1000 and achieve state-of-the-art performance.
have been proposed recently. Initially, a neural network was used only for feature extraction and was combined with an HMM [18]. Later, with the development of recurrent networks, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models gradually replaced HMMs [19], [20]. More recently, temporal convolution has been proposed in place of LSTM or GRU to capture the temporal correlation of features [21], [22]. After 3D convolution and ResNet were introduced, their combination became the most commonly used feature extractor [23], [24]. Lip reading researchers have also begun to focus on self-attention mechanisms [25], [26]. The core of such a mechanism is to determine which part deserves attention given the goal and then to analyze the details found there.

However, several problems still need attention. Firstly, since the input is a sequence of continuous video frames, the temporal and spatial information between adjacent frames must be taken into account. Secondly, recognition mainly focuses on the lips and their surroundings, which places higher demands on the network's ability to extract subtle features. Additionally, as the network deepens, the information loss caused by resolution reduction must also be considered [16], [17]. Because of these problems and the inherent complexity of lip reading, the accuracy of lip reading systems has remained low, and related work has made slow progress in improving it.

In view of the above problems, this paper improves on previous lip reading methods that use ResNet as the backbone network [27], [28] and proposes a lip reading method based on a 3D convolutional vision transformer (3DCvT). In our method, lip movement features are extracted by the 3D convolutional vision transformer, and the extracted features are processed by a Bidirectional Gated Recurrent Unit (BiGRU) and a Fully Connected (FC) layer for sequence modeling and classification. The method effectively extracts the high-dimensional features of an image sequence, enhances the semantic representation between video keyframes, and reduces the loss caused by global averaging over the image sequence.

Specifically, 3D convolution captures the spatio-temporal correlation between continuous images. The transformer divides each image into non-overlapping patches, which is conducive to capturing global information, while the added convolution operations improve the fine-grained extraction of local regions of interest. We add a Squeeze-and-Excitation (SE) [29] module to the Convolutional Token Embedding [30]; this structure increases the weights of informative feature maps and decreases the weights of uninformative ones, which improves the model. In addition, we gradually reduce the number of tokens and increase their width at each stage to realize spatial downsampling and a richer representation, making up for the information loss caused by the reduction of resolution. The model thus retains the characteristics of convolution, for example local receptive fields and shared weights, and adds the advantages of a transformer, such as dynamic attention and global context fusion. These properties help us better extract the temporal and spatial information of continuous video frames and identify changes in lip movement more accurately. The proposed model is tested on the LRW [31] and LRW-1000 [32] datasets, achieves state-of-the-art performance, and is supported by comparative experiments that prove its effectiveness.

Our contributions are summarized as follows: (1) 3D convolution is added to CvT [30] to capture the spatio-temporal feature information of video frames. (2) An SE structure is introduced into the Convolutional Token Embedding layer of CvT to improve the extraction of effective features along the channel dimension. (3) We provide some useful tricks for lip reading tasks and conduct ablation experiments to prove their effectiveness. (4) On the LRW and LRW-1000 lip reading datasets, our proposed 3DCvT achieves accuracies of 88.5% and 57.5%, respectively, exceeding the current state-of-the-art performance.

The rest of this paper is organized as follows. Section II introduces related methods for feature extraction and temporal sequence modeling. Section III describes the proposed 3DCvT method. Section IV presents the experiments and analyzes the results. The final section gives our conclusions and outlook.

II. RELATED WORK

A. TRANSFORMER
The transformer was proposed in 2017 [33], and its performance in machine translation surpassed that of RNNs and CNNs; using only an encoder-decoder structure and an attention mechanism, it achieves good results. Its most significant advantage is that it can be parallelized efficiently. It was initially used in natural language processing and achieved considerable success [34], [35]. Recently, the transformer has been introduced into the field of visual recognition, for example [36]-[38], achieving better results than previous approaches. We therefore have reason to believe that the transformer can also perform well in lip reading, and our results confirm this.

The Vision Transformer (ViT) [36] introduced the transformer into visual recognition trained on a large amount of data. It decomposes each input image into non-overlapping sub-images to form a token sequence of fixed length, and then several standard transformer layers are applied to model these tokens. ViT uses only transformer layers and achieves good classification results, but large datasets are required for good convergence. The Convolutional vision Transformer (CvT) [30], recently proposed by Haiping Wu et al., is used in image processing and achieves high accuracy on the ImageNet dataset. In this architecture, two convolution-based operations, Convolutional Token Embedding and Convolutional Projection, are introduced into ViT. Convolutional Token Embedding models the local spatial context in a multi-stage manner, and the purpose of Convolutional Projection is to capture the local spatial context better. The convolution operations model the local spatial context at multiple stages and provide additional modeling of the local spatial environment. However, CvT cannot obtain the temporal information of continuous video frames: it only captures features in the spatial dimension, without considering information in the channel and time dimensions.
B. BiGRU
The Gated Recurrent Unit (GRU) is a variant of the traditional RNN [39] that can effectively capture semantic associations across long sequences and alleviate gradient vanishing or explosion. A BiGRU is a neural network composed of two GRUs running in opposite directions. The GRU formulas are as follows:

$$
\begin{aligned}
z_t &= \sigma(W^z x_t + U^z h_{t-1})\\
r_t &= \sigma(W^r x_t + U^r h_{t-1})\\
\tilde{h}_t &= \tanh\big(W^h x_t + U^h (h_{t-1} \odot r_t)\big)\\
h_t &= (1 - z_t) \odot \tilde{h}_t + z_t \odot h_{t-1}
\end{aligned} \tag{1}
$$

where $z_t$ is the update gate, $r_t$ is the reset gate, $\tilde{h}_t$ is the candidate state, $h_t$ is the hidden state, and $W$ and $U$ are the input and hidden weight matrices, respectively.
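As a concrete illustration of Eq. (1), the following is a minimal PyTorch sketch of a single GRU step; bias terms are omitted and the weight shapes are assumed to be (hidden, input) and (hidden, hidden). In practice, torch.nn.GRU packages exactly these gates.

```python
import torch

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step following Eq. (1); bias terms omitted for brevity."""
    z_t = torch.sigmoid(x_t @ Wz.T + h_prev @ Uz.T)            # update gate
    r_t = torch.sigmoid(x_t @ Wr.T + h_prev @ Ur.T)            # reset gate
    h_tilde = torch.tanh(x_t @ Wh.T + (h_prev * r_t) @ Uh.T)   # candidate state
    return (1.0 - z_t) * h_tilde + z_t * h_prev                # new hidden state
```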
III. METHOD
This section describes the overall approach to lip reading. The overall architecture is shown in Figure 1. Our improved 3DCvT constitutes the front-end network for feature extraction. The back-end network is composed of BiGRU layers and performs sequence modeling on the extracted features. Finally, the output is classified by an FC layer.

A. FRONT-END: 3DCVT
In order to better capture the temporal and spatial information between video frames, this paper constructs a 3D convolutional vision transformer (3DCvT); its pipeline is shown in Figure 2.

1) 3DCNN
A traditional CNN processes only a single image and cannot capture the relevant information between continuous video frames in lip reading. To solve this problem, a 3DCNN is added before the transformer blocks to handle the temporal characteristics of the input data. The 3DCNN was first proposed and used for human action recognition [40]. It adds a new dimension of information, the time dimension, and mainly extracts the correlation between adjacent pictures. The 3DCNN formula is as follows:

$$
v_{ij}^{xyz} = \mathrm{ReLU}\Big(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x-p)(y-q)(z-r)}\Big) \tag{2}
$$

where $v_{ij}^{xyz}$ is the value at position $(x, y, z)$ of the $j$th feature map in the $i$th layer, ReLU is the activation function, $b_{ij}$ is the bias, and $m$ indexes the feature maps in layer $i-1$ connected to the current feature map. The indices $p$, $q$ and $r$ in $w_{ijm}^{pqr}$ run over the width, height, and temporal extent of the convolution kernel, respectively.

The input of the lip reading task is a sequence of continuous video frames of size 88 × 88. The 3D convolution kernel size is 5 × 7 × 7, the stride is set to 1 × 2 × 2, and the number of output channels is 64.
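For reference, a minimal PyTorch sketch of this 3D convolutional front end is shown below. The kernel size, stride, and channel count follow the description above; the padding, batch normalization, and ReLU are assumptions in line with common lip reading front ends rather than the exact configuration of 3DCvT.

```python
import torch
import torch.nn as nn

class Conv3DStem(nn.Module):
    """Minimal sketch of the 3D convolutional front end described above."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv3d(1, out_channels,
                              kernel_size=(5, 7, 7),   # time x height x width
                              stride=(1, 2, 2),
                              padding=(2, 3, 3),       # assumed "same" temporal padding
                              bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, 1, T, 88, 88) grayscale video clip
        return self.relu(self.bn(self.conv(x)))

# Example: a batch of 2 clips with 29 frames of 88 x 88 pixels.
clip = torch.randn(2, 1, 29, 88, 88)
feat = Conv3DStem()(clip)   # -> (2, 64, 29, 44, 44): temporal length kept, spatially halved
```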
2) TRANSFORMER BLOCK
The lip movement features extracted by the 3DCNN are input to the Transformer Blocks. A Transformer Block extracts global feature information through the self-attention mechanism and captures fine-grained local features through its convolution structure. The specific structure is shown in Figure 3.

FIGURE 3. The transformer block.
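The sketch below illustrates the idea of a Transformer Block with convolutional projection in the spirit of CvT [30]: depth-wise convolutions replace the linear query/key/value projections before global multi-head self-attention. The depth-wise 3 × 3 kernels, the single-head setting, and the MLP ratio are illustrative assumptions, not the exact block used in 3DCvT.

```python
import torch
import torch.nn as nn

class ConvProjectionBlock(nn.Module):
    """Simplified transformer block with convolutional q/k/v projection (assumed sizes)."""
    def __init__(self, dim: int = 64, heads: int = 1, mlp_ratio: int = 4):
        super().__init__()
        # depth-wise 3x3 convolutions replace the linear q/k/v projections
        self.q_conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.k_conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.v_conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        # x: (batch, channels, H, W) feature map from the previous stage
        b, c, h, w = x.shape
        flatten = lambda t: t.flatten(2).transpose(1, 2)   # -> (b, h*w, c) tokens
        q, k, v = (flatten(conv(x)) for conv in (self.q_conv, self.k_conv, self.v_conv))
        tokens = flatten(x)
        attn_out, _ = self.attn(q, k, v)                   # global self-attention
        tokens = self.norm1(tokens + attn_out)             # residual + norm
        tokens = self.norm2(tokens + self.mlp(tokens))     # feed-forward
        return tokens.transpose(1, 2).reshape(b, c, h, w)

feat = torch.randn(2, 64, 22, 22)
out = ConvProjectionBlock()(feat)   # same shape as the input
```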
a: SE-CONV-EMBEDDING
With the stacking of Transformer Blocks and the deepening of the network, the resolution of the feature maps decreases and the network's ability to capture spatial feature information weakens. To solve this problem, we improve the Convolutional Token Embedding layer by adding a Squeeze-and-Excitation structure to it and define the new layer as SE-Conv-Embedding. It weights the channels to obtain the importance of each feature channel and correlates the channel information. As the spatial dimensions shrink, spatial features become harder to capture; at the same time the number of channels grows, so extracting features along the channel dimension becomes particularly important. The feature map is first passed through the Squeeze operation to obtain channel-wise statistics, the statistics are then passed through the Excitation operation to obtain the channel correlations, and finally the correlations are applied in the Scale operation to produce the new feature map. The formula is as follows:

$$
\hat{x} = F_{\mathrm{scale}}\big(F_{\mathrm{ex}}(F_{\mathrm{sq}}(\mathrm{conv}(x))),\, x\big) \tag{3}
$$

where $x$ is the input feature map, $F_{\mathrm{sq}}$ is the squeeze function, $F_{\mathrm{ex}}$ is the excitation function, $F_{\mathrm{scale}}$ is the scale function, and $\hat{x}$ is the new feature map. This layer can also change its parameters to increase the token feature dimension and reduce the token sequence length, compensating for the loss caused by the reduction of resolution.
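A minimal sketch of such an SE-Conv-Embedding layer is given below, following Eq. (3): a strided convolution produces the new, wider token map, and a squeeze-and-excitation branch [29] re-weights its channels. The reduction ratio r = 16 and the application of the scaling to the convolved features are assumptions.

```python
import torch
import torch.nn as nn

class SEConvEmbedding(nn.Module):
    """Sketch of Eq. (3): strided conv embedding followed by SE channel re-weighting."""
    def __init__(self, in_ch: int, out_ch: int, patch: int = 3, stride: int = 2, r: int = 16):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, patch, stride, padding=patch // 2)
        self.squeeze = nn.AdaptiveAvgPool2d(1)              # F_sq: channel statistics
        self.excite = nn.Sequential(                        # F_ex: channel correlations
            nn.Linear(out_ch, out_ch // r), nn.ReLU(inplace=True),
            nn.Linear(out_ch // r, out_ch), nn.Sigmoid())

    def forward(self, x):
        x = self.conv(x)                                    # conv(x): downsample, widen tokens
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                        # F_scale: re-weight the channels
```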
B. BACK-END: BIGRU
The back-end network consists of BiGRU layers that perform sequence modeling on the extracted visual features. Lip reading processes continuous video frames, so after the high-dimensional features of each image are extracted, these features must be modeled as a temporal sequence. To better capture the global correlation of the feature sequence and identify the essential information, a BiGRU structure is applied to the lip reading task in this paper. The visual features output by the front-end are sent directly to the BiGRU. The input dimension of each unit is 513, the hidden layer dimension is 1024, there are three layers in total, and the output dimension is 2048. Finally, the output is sent to the FC layer for classification.
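The back-end just described can be sketched in PyTorch as follows; the temporal average pooling before the FC layer is an assumption, since the text does not state how the per-frame BiGRU outputs are aggregated.

```python
import torch
import torch.nn as nn

class BiGRUBackend(nn.Module):
    """Sketch of the back-end: 3-layer BiGRU (513 -> 2*1024 = 2048) plus FC classifier.
    num_classes is 500 for LRW and 1000 for LRW-1000."""
    def __init__(self, num_classes: int, in_dim: int = 513, hidden: int = 1024, layers: int = 3):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, num_layers=layers,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):
        # x: (batch, T, 513) frame-level features from the 3DCvT front end
        out, _ = self.gru(x)              # (batch, T, 2048)
        return self.fc(out.mean(dim=1))   # average over time (assumed), then classify

logits = BiGRUBackend(num_classes=500)(torch.randn(2, 29, 513))   # -> (2, 500)
```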
IV. EXPERIMENTS AND RESULTS
In this section, we train and evaluate on two large lip reading datasets, LRW [31] and LRW-1000 [32]. The effectiveness of our method is demonstrated and compared with other state-of-the-art lip reading methods.

A. DATASETS
1) LRW [31]
The LRW dataset was proposed by the Visual Geometry Group of Oxford University in 2016. With the rise of deep learning, the demand for large-scale datasets has grown, and LRW emerged to meet it. Unlike previous datasets, the data in LRW comes from BBC radio and television programs instead of being recorded by volunteers or experimenters, which gives the dataset a qualitative leap in volume. The dataset selects the 500 most common words and extracts clips of speakers saying these words. With more than 1,000 speakers and more than 550,000 utterance examples, it meets the data requirements of deep learning to a considerable extent.

2) LRW-1000 [32]
The LRW-1000 dataset was proposed in 2018 by a team from the Institute of Computing Technology of the Chinese Academy of Sciences, the University of Chinese Academy of Sciences, and Huazhong University of Science and Technology. It aims to establish a large-scale benchmark with varying image sizes in unconstrained environments. The dataset covers natural variations in speaking style and imaging conditions to meet the challenges of practical applications. The data comes from Chinese TV programs and contains 1000 classes, each corresponding to one or more Chinese words. It is the largest Chinese word-level lip reading dataset, with more than 2,000 speakers and nearly 720,000 utterances. The richness of the data ensures that deep learning models can be fully trained. It is also the only open dataset for Mandarin lip reading.

B. IMPLEMENTATION DETAILS
1) DATA PREPROCESSING
We shuffle the initial input videos and resize the frames to 96 × 96, then crop them to 88 × 88 and apply Mixup [41] for data augmentation to form the final input. We select a batch of 256 video clips in each training epoch. Each video frame is horizontally flipped with probability 0.5, converted to a grayscale image, and finally normalized to [0, 1]. In addition, before the extracted features enter the back-end, we expand the feature dimension from 512 to 513 with the so-called word boundary, which provides contextual and environmental information that helps the classification of lip reading.
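The preprocessing pipeline can be sketched as follows. The crop position and the Mixup coefficient alpha are assumptions; the resize/crop sizes, flip probability, grayscale conversion, and normalization follow the description above.

```python
import torch

def mixup(frames, labels, num_classes, alpha=0.2):
    """Mixup [41] on a batch of clips; alpha = 0.2 is an assumed value."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(frames.size(0))
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    mixed_x = lam * frames + (1.0 - lam) * frames[perm]
    mixed_y = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed_x, mixed_y

def preprocess(clip):
    """clip: (T, 96, 96) grayscale frames already resized to 96 x 96.
    Crop to 88 x 88 (assumed center crop), flip with p = 0.5, scale to [0, 1]."""
    t = clip[:, 4:92, 4:92].float() / 255.0   # 88 x 88 crop, normalize
    if torch.rand(1).item() < 0.5:
        t = torch.flip(t, dims=[-1])          # horizontal flip
    return t.unsqueeze(0)                      # (1, T, 88, 88) channel dim for Conv3d
```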
2) NETWORK ARCHITECTURE
This paper divides the lip reading network into a front-end and a back-end; the specific configuration is shown in Table 1. In the front-end network 3DCvT, the input size is (256 × 1 × 88 × 88). The input first passes through the 3DConv layer and then enters the Transformer Blocks, whose parameter settings differ from stage to stage. When entering the back-end network, the features first go through the BiGRU with input size (256 × 513 × 5 × 5). Before the feature information enters the back-end, we adjust the feature dimension from 512 to 513 through word boundary [42] processing, and the output is finally sent to the fully connected layer for classification.

TABLE 1. Classification architecture of 3DCvT on LRW-1000. The default input video frame size is 88 × 88; k is the kernel size, s is the stride, and c is the number of channels. H_i is the number of heads and D_i the embedded feature dimension of the i-th MHSA module; R_i is the expansion ratio of the feature dimension in the i-th MLP layer; N_i is the number of Transformer Blocks (repetitions) in the i-th stage. In the BiGRU, h is the hidden layer dimension, l is the number of layers, and Bi = True denotes a bidirectional GRU.
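Our reading of the word-boundary trick [42] is sketched below: a per-frame binary indicator marking whether a frame lies inside the target word is appended to the 512-dimensional visual feature, yielding the 513-dimensional BiGRU input. The exact encoding is not spelled out in the text, so treat this as an assumption.

```python
import torch

def append_word_boundary(features, boundary_mask):
    """features: (batch, T, 512) visual features; boundary_mask: (batch, T) in {0, 1}.
    Returns (batch, T, 513) by appending the boundary indicator as an extra channel."""
    return torch.cat([features, boundary_mask.unsqueeze(-1).float()], dim=-1)

x = append_word_boundary(torch.randn(2, 29, 512), torch.ones(2, 29))   # -> (2, 29, 513)
```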
3) MODEL VARIANTS
By changing the number of Transformer Blocks N_i in the i-th stage and the number of attention heads H_i of the i-th MHSA module, we obtain three model variants with different capacities, defined as 3DCvT-I, 3DCvT-II and 3DCvT-III. For 3DCvT-I, N_i is (1, 2, 10) and H_i is (1, 3, 6); for 3DCvT-II, N_i is (1, 4, 16) and …

4) TRAINING
This experiment used Python 3.8 and PyTorch 1.7 to build the training network on a platform equipped with an NVIDIA Tesla V100-PCIe 32 GB graphics card. We set the batch size to 256, use the Adam optimizer with an initial learning rate of 6e-4, train for 120 epochs, and adopt a cosine learning rate scheduling strategy. When the validation error is stable for three consecutive epochs, we reduce the learning rate by a factor of two. We choose the cross-entropy loss function with label smoothing as the final decoder. The traditional cross-entropy loss is:

$$
L = -\sum_{i=1}^{N} q_i \log(p_i), \qquad q_i = \begin{cases} 0, & y \neq i\\ 1, & y = i \end{cases} \tag{5}
$$

where $p_i$ is the predicted probability of class $i$, $q_i$ is the target distribution, and $y$ is the ground-truth label. In the cross-entropy loss with the label smoothing mechanism, $q_i$ is changed to:

$$
q_i = \begin{cases} \dfrac{\epsilon}{N}, & y \neq i\\[4pt] 1 - \dfrac{(N-1)\,\epsilon}{N}, & y = i \end{cases} \tag{6}
$$

where $\epsilon$ is a small constant, set to 0.1, and $N$ is the number of categories. Training with label smoothing produces a better-calibrated network, which generalizes better and yields more accurate predictions on unseen data.
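For concreteness, the label-smoothed loss of Eq. (6) and the optimizer and schedule described above can be sketched as follows; the model in the snippet is only a stand-in for the full 3DCvT + BiGRU network.

```python
import torch
import torch.nn as nn

def label_smoothing_loss(logits, target, eps=0.1):
    """Cross-entropy with the smoothed targets of Eq. (6)."""
    n = logits.size(-1)
    log_p = torch.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_p, eps / n)                           # eps/N for y != i
    smooth.scatter_(1, target.unsqueeze(1), 1.0 - eps * (n - 1) / n)   # value for y == i
    return -(smooth * log_p).sum(dim=-1).mean()

# Optimizer and cosine schedule as described above (the model is a placeholder).
model = nn.Linear(513, 500)   # stand-in for the 3DCvT + BiGRU network
optimizer = torch.optim.Adam(model.parameters(), lr=6e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=120)
for epoch in range(120):
    # ... forward pass, label_smoothing_loss, backward, optimizer.step() ...
    scheduler.step()
```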
C. RESULTS
The effectiveness of our method is evaluated on the two large lip reading datasets LRW and LRW-1000. We changed the relevant parameters of 3DCvT to obtain 3DCvT-I, 3DCvT-II and 3DCvT-III and carried out the corresponding comparative experiments. 3DCvT-III achieves the best recognition accuracy on both datasets; the specific results are shown in Table 2.

TABLE 2. Comparison of the recognition rates of the model variants on the two datasets.

In addition, we provide some useful tricks for lip reading tasks and demonstrate their effectiveness through ablation experiments. They include the word boundary input, the Mixup data augmentation used in the preprocessing stage, and the label smoothing used in the cross-entropy loss. We performed ablation experiments against baseline networks that do not use these methods. The experiments are based on 3DCvT-III, and the results are shown in Table 3, where each adjustment builds on the previous one from top to bottom. These results demonstrate the effectiveness of these tricks and also verify the scalability of the proposed model.

TABLE 3. Effect of adding the various tricks on top of the baseline.

Finally, the model proposed in this paper achieves its best accuracies of 88.5% and 57.5% on the LRW and LRW-1000 datasets, respectively. The accuracy curves are shown in Figure 4: (a) shows that the accuracy on LRW rises sharply at the beginning of training, then fluctuates slightly and levels off after 100 epochs; (b) shows that the accuracy on LRW-1000 rises rapidly during the first 20 epochs, increases slowly up to 60 epochs, and levels off after 100 epochs. The linguistic characteristics of the two datasets differ, which leads to the different accuracy trends. Generally speaking, the accuracy on the Chinese lip reading dataset LRW-1000 is lower because Chinese is relatively complex.

FIGURE 4. (a) LRW dataset recognition accuracy curve. (b) LRW-1000 dataset recognition accuracy curve.

Lip reading has developed rapidly in recent years, and various methods have been put forward. We list several representative methods and compare them with ours; see Table 4 for details. Most of these networks use ResNet as the baseline, whereas we introduce the vision transformer into lip reading, and the benefit is clear. The word recognition rate is 88.5% on LRW and 57.5% on LRW-1000, which exceeds the current state-of-the-art results.
V. CONCLUSION
This paper proposes a lip reading method based on a 3D convolutional vision transformer, which takes full advantage of both convolutions and transformers. To achieve the best performance, we improve on the previous vision transformer: the spatio-temporal features of continuous video frames are extracted to effectively obtain the global features of the images and the correlations between frames, and the extracted features are then sent to a BiGRU for sequence modeling. Experiments prove the effectiveness of our method on LRW and LRW-1000, where it obtains state-of-the-art performance.

While introducing Transformer Blocks, the method also increases the number of FLOPs and parameters of the model and extends training and testing time. In future work, we will continue to study lip reading in combination with transformer techniques, optimizing the transformer structure and applying it to lip reading with less computation so as to facilitate practical applications. In addition, audio and multimodal video inputs can be used for training to improve the model's performance. Applying transformer technology to the back-end network to replace the BiGRU and constructing an end-to-end transformer structure is also worth exploring.
REFERENCES
[1] L. Lu, J. Yu, Y. Chen, H. Liu, Y. M. Zhu, L. Kong, and M. Li, "Lip reading-based user authentication through acoustic sensing on smartphones," IEEE/ACM Trans. Netw., vol. 27, no. 1, pp. 447–460, Feb. 2019.
[2] S. Mathulaprangsan, C.-Y. Wang, A. Z. Kusum, T.-C. Tai, and J.-C. Wang, "A survey of visual lip reading and lip-password verification," in Proc. Int. Conf. Orange Technol. (ICOT), Dec. 2015, pp. 22–25.
[3] T. R. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Deep audio-visual speech recognition," IEEE Trans. Pattern Anal. Mach. Intell., early access, Dec. 21, 2018, doi: 10.1109/TPAMI.2018.2889052.
[4] L. Chen, R. K. Maddox, Z. Duan, and C. Xu, "Hierarchical cross-modal talking face generation with dynamic pixel-wise loss," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7832–7841.
[5] K. R. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. V. Jawahar, "Learning individual speaking styles for accurate lip to speech synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 13796–13805.
[6] R. Ding, C. Pang, and H. Liu, "Audio-visual keyword spotting based on multidimensional convolutional neural network," in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, pp. 4138–4142.
[7] A. Fernandez-Lopez and M. F. Sukno, "Survey on automatic lip-reading in the era of deep learning," Image Vis. Comput., vol. 78, pp. 53–72, Oct. 2018.
[8] H. McGurk and J. MacDonald, "Hearing lips and seeing voices," Nature, vol. 264, no. 5588, pp. 746–748, 1976.
[9] J. S. Chung and A. Zisserman, "Learning to lip read words by watching videos," Comput. Vis. Image Understand., vol. 173, pp. 76–85, Aug. 2018.
[10] E. D. Petajan, Automatic Lipreading to Enhance Speech Recognition (Speech Reading). Champaign, IL, USA: Univ. Illinois at Urbana-Champaign, 1984.
[11] K. Mase and A. Pentland, "Automatic lipreading by optical-flow analysis," Syst. Comput. Jpn., vol. 22, no. 6, pp. 67–76, 1991.
[12] N. Eveno, A. Caplier, and P.-Y. Coulon, "Accurate and quasi-automatic lip tracking," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 5, pp. 706–715, May 2004.
[13] R. Bowden, S. Cox, R. Harvey, Y. Lan, and B. J. Theobald, "Recent developments in automated lip-reading," Proc. SPIE, vol. 8901, Oct. 2013, Art. no. 89010J.
[14] Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen, "A review of recent advances in visual speech decoding," Image Vis. Comput., vol. 32, no. 9, pp. 590–605, Sep. 2014.
[15] R. Seymour, D. Stewart, and J. Ming, "Comparison of image transform-based features for visual speech recognition in clean and corrupted videos," EURASIP J. Image Video Process., vol. 2008, no. 2, pp. 1–9, 2008.
[16] D. Bhatt, C. Patel, H. Talsania, J. Patel, R. Vaghela, S. Pandya, K. Modi, and H. Ghayvat, "CNN variants for computer vision: History, architecture, application, challenges and future scope," Electronics, vol. 10, no. 20, p. 2470, Oct. 2021.
[17] C. Patel, D. Bhatt, U. Sharma, R. Patel, S. Pandya, K. Modi, N. Cholli, A. Patel, U. Bhatt, M. A. Khan, S. Majumdar, M. Zuhair, K. Patel, S. A. Shah, and H. Ghayvat, "DBGC: Dimension-based generic convolution block for object recognition," Sensors, vol. 22, no. 5, p. 1780, Feb. 2022.
[18] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, "Lipreading using convolutional neural network," in Proc. INTERSPEECH, 2014.
[19] S. Petridis, Y. Wang, Z. Li, and M. Pantic, "End-to-end multi-view lipreading," in Proc. Brit. Mach. Vis. Conf. (BMVC), 2017.
[20] J. Xiao, S. Yang, Y. Zhang, S. Shan, and X. Chen, "Deformation flow based two-stream network for lip reading," in Proc. 15th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), Nov. 2020, pp. 364–370.
[21] B. Martinez, P. Ma, S. Petridis, and M. Pantic, "Lipreading using temporal convolutional networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 6319–6323.
[22] P. Ma, B. Martinez, S. Petridis, and M. Pantic, "Towards practical lipreading with distilled and efficient models," 2020, arXiv:2007.06504.
[23] S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, "End-to-end audiovisual speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 6548–6552.
[24] K. Xu, D. Li, N. Cassimatis, and X. Wang, "LCANet: End-to-end lipreading with cascaded attention-CTC," in Proc. 13th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2018, pp. 548–555.
[25] X. Zhang, F. Cheng, and S. Wang, "Spatio-temporal fusion based convolutional sequence learning for lip reading," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 713–722.
[26] M. Luo, S. Yang, S. Shan, and X. Chen, "Synchronous bidirectional learning for multilingual lip reading," in Proc. Brit. Mach. Vis. Conf. (BMVC), 2020.
[27] X. Weng and K. Kitani, "Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading," in Proc. 30th Brit. Mach. Vis. Conf. (BMVC), 2019.
[28] D. Feng, S. Yang, S. Shan, and X. Chen, "Learn an effective lip reading model without pains," 2020, arXiv:2011.07557.
[29] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, "Squeeze-and-excitation networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 8, pp. 2011–2023, Aug. 2020.
[30] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, "CvT: Introducing convolutions to vision transformers," 2021, arXiv:2103.15808.
[31] J. S. Chung and A. Zisserman, "Lip reading in the wild," in Proc. Asian Conf. Comput. Vis., 2016, pp. 87–103.
[32] S. Yang, Y. Zhang, D. Feng, M. Yang, C. Wang, J. Xiao, K. Long, S. Shan, and X. Chen, "LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild," in Proc. 14th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2019, pp. 1–8.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. NIPS, 2017.
[34] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. NAACL, 2019.
[35] D. Mahajan, R. Girshick, V. Ramanathan, K. He, and L. van der Maaten, "Exploring the limits of weakly supervised pretraining," in Proc. Eur. Conf. Comput. Vis., 2018.
[36] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16×16 words: Transformers for image recognition at scale," in Proc. Int. Conf. Learn. Represent. (ICLR), 2021.
[37] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions," 2021, arXiv:2102.12122.
[38] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. Tay, J. Feng, and S. Yan, "Tokens-to-token ViT: Training vision transformers from scratch on ImageNet," 2021, arXiv:2101.11986.
[39] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," 2014, arXiv:1412.3555.
[40] K. Hara, H. Kataoka, and Y. Satoh, "Learning spatio-temporal features with 3D residual networks for action recognition," in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2017, pp. 3154–3160.
[41] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "Mixup: Beyond empirical risk minimization," in Proc. Int. Conf. Learn. Represent. (ICLR), 2018.
[42] T. Stafylakis, M. H. Khan, and G. Tzimiropoulos, "Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs," Comput. Vis. Image Understand., vol. 176, pp. 22–32, Nov. 2018.
[43] M. Kim, J. Hong, S. J. Park, and Y. M. Ro, "Multi-modality associative bridging through memory: Speech sound recollected from face video," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 296–306.
[44] M. Kim, J. H. Yeo, and Y. M. Ro, "Distinguishing homophenes using multi-head visual-audio memory for lip reading," in Proc. AAAI Conf. Artif. Intell. (AAAI), 2022, pp. 1174–1182.

HUIJUAN WANG was born in Dacheng, Hebei, China, in 1982. She received the B.S. and M.S. degrees in computer science and technology from Nankai University, China, in 2005 and 2008, respectively, and the Ph.D. degree from the Hebei University of Technology, China, in 2019. She is currently an Associate Professor with the North China Institute of Aerospace Engineering. She has published more than 20 articles. Her research interests include computer vision, pattern recognition, and deep learning.

GANGQIANG PU was born in Bengbu, Anhui, China, in 1995. He received the bachelor's degree in software engineering from the School of Computer Science, Anhui Polytechnic University, China, in 2019. He is currently pursuing the M.S. degree with the Department of Computer, North China Institute of Aerospace Engineering, China. His research interests include computer vision, image processing, and pattern recognition.

TINGYU CHEN was born in Langfang, Hebei, China, in 1993. She received the bachelor's degree in computer science and technology from the Hebei University of Technology, in 2012, and the M.S. degree in advanced computing with management from King's College London, U.K., in 2017. She is currently a Lecturer at the North China Institute of Aerospace Engineering. Her research interests include artificial intelligence and pattern recognition.