A Lip Reading Method Based On 3D Convolutional Vision Transformer
ABSTRACT Lip reading has received increasing attention in recent years. It infers the content of speech from the movement of the speaker's lips. The rapid development of deep learning has promoted progress in lip reading. However, because lip reading must process continuous video frames, it has to consider both the correlation between adjacent images and the correlation between long-distance images. Moreover, lip reading recognition focuses on subtle changes of the lips and their surrounding region, so it must extract fine-grained features from small images. As a result, the performance of machine lip reading is generally low, and research progress has been slow. To improve the performance of machine lip reading, we propose a lip reading method based on a 3D convolutional vision transformer (3DCvT), which combines a vision transformer with 3D convolution to extract the spatio-temporal features of continuous images, and takes full advantage of the properties of convolutions and transformers to extract local and global features from continuous images effectively. The extracted features are then sent to a Bidirectional Gated Recurrent Unit (BiGRU) for sequence modeling. We prove the effectiveness of our method on the large-scale lip reading datasets LRW and LRW-1000 and achieve state-of-the-art performance.
have been proposed recently. Initially, a neural network was used only for feature extraction and was combined with an HMM [18]. Later, with the development of recurrent networks, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models gradually replaced HMMs [19], [20]. More recently, temporal convolution has been proposed in place of LSTM or GRU to capture the temporal correlation of features [21], [22]. After 3D convolution and ResNet were introduced, their combination became the most commonly used feature extractor [23], [24]. Lip reading researchers have also begun to focus on self-attention mechanisms [25], [26]. The core of such a mechanism is to determine which part deserves attention given the goal and then to analyze the details found there.

However, several problems still need attention. Firstly, since the input is a sequence of continuous video frames, the temporal and spatial information between adjacent frames must be taken into account. Secondly, recognition mainly focuses on the lips and their surroundings, which places higher demands on the network's ability to extract subtle features. Additionally, as the network deepens, the information loss caused by resolution reduction must also be considered [16], [17]. Because of these problems and the inherent complexity of lip reading, the accuracy of lip reading systems has remained low, and related work has made slow progress in improving it.

In view of the above problems, this paper improves on previous lip reading methods that use ResNet as the backbone network [27], [28] and proposes a lip reading method based on a 3D convolutional vision transformer (3DCvT). In our method, lip movement features are extracted by the 3D convolutional vision transformer, and the extracted features are processed by a Bidirectional Gated Recurrent Unit (BiGRU) and a Fully Connected (FC) layer for sequence modeling and classification. The method effectively extracts the high-dimensional features of an image sequence, enhances the semantic representation between video keyframes, and reduces the loss caused by global averaging over the image sequence.

Specifically, 3D convolution captures the spatio-temporal correlation between continuous images. The transformer divides each image into non-overlapping patches, which is conducive to capturing global information, while the added convolution operations improve the fine-grained extraction of local regions of interest. We add a Squeeze-and-Excitation (SE) [29] module to the Convolutional Token Embedding [30]; this structure increases the weights of informative feature maps and decreases the weights of uninformative ones, which improves the model. In addition, we gradually reduce the number of tokens and increase their width at each stage to realize spatial downsampling and a richer representation, making up for the information loss caused by the reduction of resolution. The model thus retains the characteristics of convolution, for example local receptive fields and shared weights, and adds the advantages of a transformer, such as dynamic attention and global context fusion. These properties help us better extract the temporal and spatial information of continuous video frames and identify changes in lip movement more accurately. The proposed model is tested on the LRW [31] and LRW-1000 [32] datasets, achieves state-of-the-art performance, and is supported by comparative experiments that prove its effectiveness.

Our contributions are summarized as follows: (1) 3D convolution is added to CvT [30] to capture the spatio-temporal feature information of video frames. (2) An SE structure is introduced into the Convolutional Token Embedding layer of CvT to improve the extraction of effective features along the channel dimension. (3) We provide some useful tricks for lip reading tasks and conduct ablation experiments to prove their effectiveness. (4) On the LRW and LRW-1000 lip reading datasets, our proposed 3DCvT achieves accuracies of 88.5% and 57.5%, respectively, exceeding the current state-of-the-art performance.

The rest of this paper is organized as follows. Section II introduces related methods for feature extraction and temporal sequence modeling. Section III describes the proposed 3DCvT method. Section IV presents the experiments and analyzes the results. The final section gives our conclusions and outlook.

II. RELATED WORK

A. TRANSFORMER
The transformer was proposed in 2017 [33], and its performance in machine translation surpassed that of RNNs and CNNs; using only an encoder-decoder structure and an attention mechanism, it achieves good results. Its most significant advantage is that it can be parallelized efficiently. It was initially used in natural language processing and achieved considerable success [34], [35]. Recently, the transformer has been introduced into the field of visual recognition, for example [36]-[38], achieving better results than previous approaches. We therefore have reason to believe that the transformer can also perform well in lip reading, and our results confirm this.

The Vision Transformer (ViT) [36] introduced the transformer into visual recognition trained on a large amount of data. It decomposes each input image into non-overlapping sub-images to form a token sequence of fixed length, and then several standard transformer layers are applied to model these tokens. ViT uses only transformer layers and achieves good classification results, but large datasets are required for good convergence. The Convolutional vision Transformer (CvT) [30], recently proposed by Haiping Wu et al., is used in image processing and achieves high accuracy on the ImageNet dataset. In this architecture, two convolution-based operations, Convolutional Token Embedding and Convolutional Projection, are introduced into ViT. Convolutional Token Embedding models the local spatial context in a multi-stage manner, and the purpose of Convolutional Projection is to capture the local spatial context better. The convolution operations model the local spatial context at multiple stages and provide additional modeling of the local spatial environment. However, CvT cannot obtain the temporal information of continuous video frames: it only captures features in the spatial dimension, without considering information in the channel and time dimensions.
B. BiGRU
The Gated Recurrent Unit (GRU) is a variant of the traditional RNN [39] that can effectively capture semantic associations across long sequences and alleviate gradient vanishing or explosion. A BiGRU is a neural network composed of two GRUs running in opposite directions. The GRU formulas are as follows:

$$
\begin{aligned}
z_t &= \sigma(W^z x_t + U^z h_{t-1})\\
r_t &= \sigma(W^r x_t + U^r h_{t-1})\\
\tilde{h}_t &= \tanh\big(W^h x_t + U^h (h_{t-1} \odot r_t)\big)\\
h_t &= (1 - z_t) \odot \tilde{h}_t + z_t \odot h_{t-1}
\end{aligned} \tag{1}
$$

where $z_t$ is the update gate, $r_t$ is the reset gate, $\tilde{h}_t$ is the candidate state, $h_t$ is the hidden state, and $W$ and $U$ are the input and hidden weight matrices, respectively.
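As a concrete illustration of Eq. (1), the following is a minimal PyTorch sketch of a single GRU step; bias terms are omitted and the weight shapes are assumed to be (hidden, input) and (hidden, hidden). In practice, torch.nn.GRU packages exactly these gates.

```python
import torch

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step following Eq. (1); bias terms omitted for brevity."""
    z_t = torch.sigmoid(x_t @ Wz.T + h_prev @ Uz.T)            # update gate
    r_t = torch.sigmoid(x_t @ Wr.T + h_prev @ Ur.T)            # reset gate
    h_tilde = torch.tanh(x_t @ Wh.T + (h_prev * r_t) @ Uh.T)   # candidate state
    return (1.0 - z_t) * h_tilde + z_t * h_prev                # new hidden state
```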
III. METHOD
This section describes the overall approach to lip reading. The overall architecture is shown in Figure 1. Our improved 3DCvT constitutes the front-end network for feature extraction. The back-end network is composed of BiGRU layers and performs sequence modeling on the extracted features. Finally, the output is classified by an FC layer.

A. FRONT-END: 3DCVT
In order to better capture the temporal and spatial information between video frames, this paper constructs a 3D convolutional vision transformer (3DCvT); its pipeline is shown in Figure 2.

1) 3DCNN
A traditional CNN processes only a single image and cannot capture the relevant information between continuous video frames in lip reading. To solve this problem, a 3DCNN is added before the transformer blocks to handle the temporal characteristics of the input data. The 3DCNN was first proposed and used for human action recognition [40]. It adds a new dimension of information, the time dimension, and mainly extracts the correlation between adjacent pictures. The 3DCNN formula is as follows:

$$
v_{ij}^{xyz} = \mathrm{ReLU}\Big(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{r=0}^{R_i-1} w_{ijm}^{pqr}\, v_{(i-1)m}^{(x-p)(y-q)(z-r)}\Big) \tag{2}
$$

where $v_{ij}^{xyz}$ is the value at position $(x, y, z)$ of the $j$th feature map in the $i$th layer, ReLU is the activation function, $b_{ij}$ is the bias, and $m$ indexes the feature maps in layer $i-1$ connected to the current feature map. The indices $p$, $q$ and $r$ in $w_{ijm}^{pqr}$ run over the width, height, and temporal extent of the convolution kernel, respectively.

The input of the lip reading task is a sequence of continuous video frames of size 88 × 88. The 3D convolution kernel size is 5 × 7 × 7, the stride is set to 1 × 2 × 2, and the number of output channels is 64.
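For reference, a minimal PyTorch sketch of this 3D convolutional front end is shown below. The kernel size, stride, and channel count follow the description above; the padding, batch normalization, and ReLU are assumptions in line with common lip reading front ends rather than the exact configuration of 3DCvT.

```python
import torch
import torch.nn as nn

class Conv3DStem(nn.Module):
    """Minimal sketch of the 3D convolutional front end described above."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv3d(1, out_channels,
                              kernel_size=(5, 7, 7),   # time x height x width
                              stride=(1, 2, 2),
                              padding=(2, 3, 3),       # assumed "same" temporal padding
                              bias=False)
        self.bn = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # x: (batch, 1, T, 88, 88) grayscale video clip
        return self.relu(self.bn(self.conv(x)))

# Example: a batch of 2 clips with 29 frames of 88 x 88 pixels.
clip = torch.randn(2, 1, 29, 88, 88)
feat = Conv3DStem()(clip)   # -> (2, 64, 29, 44, 44): temporal length kept, spatially halved
```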
2) TRANSFORMER BLOCK
The lip movement features extracted by the 3DCNN are input to the Transformer Blocks. A Transformer Block extracts global feature information through the self-attention mechanism and captures fine-grained local features through its convolution structure. The specific structure is shown in Figure 3.

FIGURE 3. The transformer block.
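The sketch below illustrates the idea of a Transformer Block with convolutional projection in the spirit of CvT [30]: depth-wise convolutions replace the linear query/key/value projections before global multi-head self-attention. The depth-wise 3 × 3 kernels, the single-head setting, and the MLP ratio are illustrative assumptions, not the exact block used in 3DCvT.

```python
import torch
import torch.nn as nn

class ConvProjectionBlock(nn.Module):
    """Simplified transformer block with convolutional q/k/v projection (assumed sizes)."""
    def __init__(self, dim: int = 64, heads: int = 1, mlp_ratio: int = 4):
        super().__init__()
        # depth-wise 3x3 convolutions replace the linear q/k/v projections
        self.q_conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.k_conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.v_conv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        # x: (batch, channels, H, W) feature map from the previous stage
        b, c, h, w = x.shape
        flatten = lambda t: t.flatten(2).transpose(1, 2)   # -> (b, h*w, c) tokens
        q, k, v = (flatten(conv(x)) for conv in (self.q_conv, self.k_conv, self.v_conv))
        tokens = flatten(x)
        attn_out, _ = self.attn(q, k, v)                   # global self-attention
        tokens = self.norm1(tokens + attn_out)             # residual + norm
        tokens = self.norm2(tokens + self.mlp(tokens))     # feed-forward
        return tokens.transpose(1, 2).reshape(b, c, h, w)

feat = torch.randn(2, 64, 22, 22)
out = ConvProjectionBlock()(feat)   # same shape as the input
```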
a: SE-CONV-EMBEDDING
With the stacking of Transformer Blocks and the deepening of the network, the resolution of the feature maps decreases and the network's ability to capture spatial feature information weakens. To solve this problem, we improve the Convolutional Token Embedding layer by adding a Squeeze-and-Excitation structure to it and define the new layer as SE-Conv-Embedding. It weights the channels to obtain the importance of each feature channel and correlates the channel information. As the spatial dimensions shrink, spatial features become harder to capture; at the same time the number of channels grows, so extracting features along the channel dimension becomes particularly important. The feature map is first passed through the Squeeze operation to obtain channel-wise statistics, the statistics are then passed through the Excitation operation to obtain the channel correlations, and finally the correlations are applied in the Scale operation to produce the new feature map. The formula is as follows:

$$
\hat{x} = F_{\mathrm{scale}}\big(F_{\mathrm{ex}}(F_{\mathrm{sq}}(\mathrm{conv}(x))),\, x\big) \tag{3}
$$

where $x$ is the input feature map, $F_{\mathrm{sq}}$ is the squeeze function, $F_{\mathrm{ex}}$ is the excitation function, $F_{\mathrm{scale}}$ is the scale function, and $\hat{x}$ is the new feature map. This layer can also change its parameters to increase the token feature dimension and reduce the token sequence length, compensating for the loss caused by the reduction of resolution.
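A minimal sketch of such an SE-Conv-Embedding layer is given below, following Eq. (3): a strided convolution produces the new, wider token map, and a squeeze-and-excitation branch [29] re-weights its channels. The reduction ratio r = 16 and the application of the scaling to the convolved features are assumptions.

```python
import torch
import torch.nn as nn

class SEConvEmbedding(nn.Module):
    """Sketch of Eq. (3): strided conv embedding followed by SE channel re-weighting."""
    def __init__(self, in_ch: int, out_ch: int, patch: int = 3, stride: int = 2, r: int = 16):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, patch, stride, padding=patch // 2)
        self.squeeze = nn.AdaptiveAvgPool2d(1)              # F_sq: channel statistics
        self.excite = nn.Sequential(                        # F_ex: channel correlations
            nn.Linear(out_ch, out_ch // r), nn.ReLU(inplace=True),
            nn.Linear(out_ch // r, out_ch), nn.Sigmoid())

    def forward(self, x):
        x = self.conv(x)                                    # conv(x): downsample, widen tokens
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                        # F_scale: re-weight the channels
```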
B. BACK-END: BIGRU
The back-end network consists of BiGRU layers that perform sequence modeling on the extracted visual features. Lip reading processes continuous video frames, so after the high-dimensional features of each image are extracted, these features must be modeled as a temporal sequence. To better capture the global correlation of the feature sequence and identify the essential information, a BiGRU structure is applied to the lip reading task in this paper. The visual features output by the front-end are sent directly to the BiGRU. The input dimension of each unit is 513, the hidden layer dimension is 1024, there are three layers in total, and the output dimension is 2048. Finally, the output is sent to the FC layer for classification.
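The back-end just described can be sketched in PyTorch as follows; the temporal average pooling before the FC layer is an assumption, since the text does not state how the per-frame BiGRU outputs are aggregated.

```python
import torch
import torch.nn as nn

class BiGRUBackend(nn.Module):
    """Sketch of the back-end: 3-layer BiGRU (513 -> 2*1024 = 2048) plus FC classifier.
    num_classes is 500 for LRW and 1000 for LRW-1000."""
    def __init__(self, num_classes: int, in_dim: int = 513, hidden: int = 1024, layers: int = 3):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, num_layers=layers,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):
        # x: (batch, T, 513) frame-level features from the 3DCvT front end
        out, _ = self.gru(x)              # (batch, T, 2048)
        return self.fc(out.mean(dim=1))   # average over time (assumed), then classify

logits = BiGRUBackend(num_classes=500)(torch.randn(2, 29, 513))   # -> (2, 500)
```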
IV. EXPERIMENTS AND RESULTS
In this section, we train and evaluate on two large lip reading datasets, LRW [31] and LRW-1000 [32]. The effectiveness of our method is demonstrated and compared with other state-of-the-art lip reading methods.

A. DATASETS
1) LRW [31]
The LRW dataset was proposed by the Visual Geometry Group of Oxford University in 2016. With the rise of deep learning, the demand for large-scale datasets has grown, and LRW emerged to meet it. Unlike previous datasets, the data in LRW comes from BBC radio and television programs instead of being recorded by volunteers or experimenters, which gives the dataset a qualitative leap in volume. The dataset selects the 500 most common words and extracts clips of speakers saying these words. With more than 1,000 speakers and more than 550,000 utterance examples, it meets the data requirements of deep learning to a considerable extent.

2) LRW-1000 [32]
The LRW-1000 dataset was proposed in 2018 by a team from the Institute of Computing Technology of the Chinese Academy of Sciences, the University of Chinese Academy of Sciences, and Huazhong University of Science and Technology. It aims to establish a large-scale benchmark with varying image sizes in unconstrained environments. The dataset covers natural variations in speaking style and imaging conditions to meet the challenges of practical applications. The data comes from Chinese TV programs and contains 1000 classes, each corresponding to one or more Chinese words. It is the largest Chinese word-level lip reading dataset, with more than 2,000 speakers and nearly 720,000 utterances. The richness of the data ensures that deep learning models can be fully trained. It is also the only open dataset for Mandarin lip reading.

B. IMPLEMENTATION DETAILS
1) DATA PREPROCESSING
We shuffle the initial input videos and resize the frames to 96 × 96, then crop them to 88 × 88 and apply Mixup [41] for data augmentation to form the final input. We select a batch of 256 video clips in each training epoch. Each video frame is horizontally flipped with probability 0.5, converted to a grayscale image, and finally normalized to [0, 1]. In addition, before the extracted features enter the back-end, we expand the feature dimension from 512 to 513 with the so-called word boundary, which provides contextual and environmental information that helps the classification of lip reading.
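The preprocessing pipeline can be sketched as follows. The crop position and the Mixup coefficient alpha are assumptions; the resize/crop sizes, flip probability, grayscale conversion, and normalization follow the description above.

```python
import torch

def mixup(frames, labels, num_classes, alpha=0.2):
    """Mixup [41] on a batch of clips; alpha = 0.2 is an assumed value."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(frames.size(0))
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    mixed_x = lam * frames + (1.0 - lam) * frames[perm]
    mixed_y = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed_x, mixed_y

def preprocess(clip):
    """clip: (T, 96, 96) grayscale frames already resized to 96 x 96.
    Crop to 88 x 88 (assumed center crop), flip with p = 0.5, scale to [0, 1]."""
    t = clip[:, 4:92, 4:92].float() / 255.0   # 88 x 88 crop, normalize
    if torch.rand(1).item() < 0.5:
        t = torch.flip(t, dims=[-1])          # horizontal flip
    return t.unsqueeze(0)                      # (1, T, 88, 88) channel dim for Conv3d
```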
2) NETWORK ARCHITECTURE
This paper divides the lip reading network into a front-end and a back-end; the specific configuration is shown in Table 1. In the front-end network 3DCvT, the input size is (256 × 1 × 88 × 88). The input first passes through the 3DConv layer and then enters the Transformer Blocks, whose parameter settings differ from stage to stage. When entering the back-end network, the features first go through the BiGRU with input size (256 × 513 × 5 × 5). Before the feature information enters the back-end, we adjust the feature dimension from 512 to 513 through word boundary [42] processing, and the output is finally sent to the fully connected layer for classification.

TABLE 1. Classification architecture of 3DCvT on LRW-1000. The default input video frame size is 88 × 88; k is the kernel size, s is the stride, and c is the number of channels. H_i is the number of heads and D_i the embedded feature dimension of the i-th MHSA module; R_i is the expansion ratio of the feature dimension in the i-th MLP layer; N_i is the number of Transformer Blocks (repetitions) in the i-th stage. In the BiGRU, h is the hidden layer dimension, l is the number of layers, and Bi = True denotes a bidirectional GRU.
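Our reading of the word-boundary trick [42] is sketched below: a per-frame binary indicator marking whether a frame lies inside the target word is appended to the 512-dimensional visual feature, yielding the 513-dimensional BiGRU input. The exact encoding is not spelled out in the text, so treat this as an assumption.

```python
import torch

def append_word_boundary(features, boundary_mask):
    """features: (batch, T, 512) visual features; boundary_mask: (batch, T) in {0, 1}.
    Returns (batch, T, 513) by appending the boundary indicator as an extra channel."""
    return torch.cat([features, boundary_mask.unsqueeze(-1).float()], dim=-1)

x = append_word_boundary(torch.randn(2, 29, 512), torch.ones(2, 29))   # -> (2, 29, 513)
```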
3) MODEL VARIANTS
By changing the number of Transformer Blocks N_i in the i-th stage and the number of attention heads H_i of the i-th MHSA module, we obtain three model variants with different capacities, defined as 3DCvT-I, 3DCvT-II and 3DCvT-III. For 3DCvT-I, N_i is (1, 2, 10) and H_i is (1, 3, 6); for 3DCvT-II, N_i is (1, 4, 16) and …

4) TRAINING
This experiment used Python 3.8 and PyTorch 1.7 to build the training network on a platform equipped with an NVIDIA Tesla V100-PCIe 32 GB graphics card. We set the batch size to 256, use the Adam optimizer with an initial learning rate of 6e-4, train for 120 epochs, and adopt a cosine learning rate scheduling strategy. When the validation error is stable for three consecutive epochs, we reduce the learning rate by a factor of two. We choose the cross-entropy loss function with label smoothing as the final decoder. The traditional cross-entropy loss is:

$$
L = -\sum_{i=1}^{N} q_i \log(p_i), \qquad q_i = \begin{cases} 0, & y \neq i\\ 1, & y = i \end{cases} \tag{5}
$$

where $p_i$ is the predicted probability of class $i$, $q_i$ is the target distribution, and $y$ is the ground-truth label. In the cross-entropy loss with the label smoothing mechanism, $q_i$ is changed to:

$$
q_i = \begin{cases} \dfrac{\epsilon}{N}, & y \neq i\\[4pt] 1 - \dfrac{(N-1)\,\epsilon}{N}, & y = i \end{cases} \tag{6}
$$

where $\epsilon$ is a small constant, set to 0.1, and $N$ is the number of categories. Training with label smoothing produces a better-calibrated network, which generalizes better and yields more accurate predictions on unseen data.
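For concreteness, the label-smoothed loss of Eq. (6) and the optimizer and schedule described above can be sketched as follows; the model in the snippet is only a stand-in for the full 3DCvT + BiGRU network.

```python
import torch
import torch.nn as nn

def label_smoothing_loss(logits, target, eps=0.1):
    """Cross-entropy with the smoothed targets of Eq. (6)."""
    n = logits.size(-1)
    log_p = torch.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_p, eps / n)                           # eps/N for y != i
    smooth.scatter_(1, target.unsqueeze(1), 1.0 - eps * (n - 1) / n)   # value for y == i
    return -(smooth * log_p).sum(dim=-1).mean()

# Optimizer and cosine schedule as described above (the model is a placeholder).
model = nn.Linear(513, 500)   # stand-in for the 3DCvT + BiGRU network
optimizer = torch.optim.Adam(model.parameters(), lr=6e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=120)
for epoch in range(120):
    # ... forward pass, label_smoothing_loss, backward, optimizer.step() ...
    scheduler.step()
```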
C. RESULTS
The effectiveness of our method is evaluated on the two large lip reading datasets LRW and LRW-1000. We changed the relevant parameters of 3DCvT to obtain 3DCvT-I, 3DCvT-II and 3DCvT-III and carried out the corresponding comparative experiments. 3DCvT-III achieves the best recognition accuracy on both datasets; the specific results are shown in Table 2.

TABLE 2. Comparison of the recognition rates of the model variants on the two datasets.

In addition, we provide some useful tricks for lip reading tasks and demonstrate their effectiveness through ablation experiments. They include the word boundary input, the Mixup data augmentation used in the preprocessing stage, and the label smoothing used in the cross-entropy loss. We performed ablation experiments against baseline networks that do not use these methods. The experiments are based on 3DCvT-III, and the results are shown in Table 3, where each adjustment builds on the previous one from top to bottom. These results demonstrate the effectiveness of these tricks and also verify the scalability of the proposed model.

TABLE 3. Effect of adding the various tricks on top of the baseline.

Finally, the model proposed in this paper achieves its best accuracies of 88.5% and 57.5% on the LRW and LRW-1000 datasets, respectively. The accuracy curves are shown in Figure 4: (a) shows that the accuracy on LRW rises sharply at the beginning of training, then fluctuates slightly and levels off after 100 epochs; (b) shows that the accuracy on LRW-1000 rises rapidly during the first 20 epochs, increases slowly up to 60 epochs, and levels off after 100 epochs. The linguistic characteristics of the two datasets differ, which leads to the different accuracy trends. Generally speaking, the accuracy on the Chinese lip reading dataset LRW-1000 is lower because Chinese is relatively complex.

FIGURE 4. (a) LRW dataset recognition accuracy curve. (b) LRW-1000 dataset recognition accuracy curve.

Lip reading has developed rapidly in recent years, and various methods have been put forward. We list several representative methods and compare them with ours; see Table 4 for details. Most of these networks use ResNet as the baseline, whereas we introduce the vision transformer into lip reading, and the benefit is clear. The word recognition rate is 88.5% on LRW and 57.5% on LRW-1000, which exceeds the current state-of-the-art results.
V. CONCLUSION
This paper proposes a lip reading method based on a 3D convolutional vision transformer, which takes full advantage of both convolutions and transformers. To achieve the best performance, we improve on the previous vision transformer: the spatio-temporal features of continuous video frames are extracted to effectively obtain the global features of the images and the correlations between frames, and the extracted features are then sent to a BiGRU for sequence modeling. Experiments prove the effectiveness of our method on LRW and LRW-1000, where it obtains state-of-the-art performance.

While introducing Transformer Blocks, the method also increases the number of FLOPs and parameters of the model and extends training and testing time. In future work, we will continue to study lip reading in combination with transformer techniques, optimizing the transformer structure and applying it to lip reading with less computation so as to facilitate practical applications. In addition, audio and multimodal video inputs can be used for training to improve the model's performance. Applying transformer technology to the back-end network to replace the BiGRU and constructing an end-to-end transformer structure is also worth exploring.
REFERENCES
[1] L. Lu, J. Yu, Y. Chen, H. Liu, Y. M. Zhu, L. Kong, and M. Li, "Lip reading-based user authentication through acoustic sensing on smartphones," IEEE/ACM Trans. Netw., vol. 27, no. 1, pp. 447–460, Feb. 2019.
[2] S. Mathulaprangsan, C.-Y. Wang, A. Z. Kusum, T.-C. Tai, and J.-C. Wang, "A survey of visual lip reading and lip-password verification," in Proc. Int. Conf. Orange Technol. (ICOT), Dec. 2015, pp. 22–25.
[3] T. R. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Deep audio-visual speech recognition," IEEE Trans. Pattern Anal. Mach. Intell., early access, Dec. 21, 2018, doi: 10.1109/TPAMI.2018.2889052.
[4] L. Chen, R. K. Maddox, Z. Duan, and C. Xu, "Hierarchical cross-modal talking face generation with dynamic pixel-wise loss," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7832–7841.
[5] K. R. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. V. Jawahar, "Learning individual speaking styles for accurate lip to speech synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 13796–13805.
[6] R. Ding, C. Pang, and H. Liu, "Audio-visual keyword spotting based on multidimensional convolutional neural network," in Proc. 25th IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, pp. 4138–4142.
[7] A. Fernandez-Lopez and M. F. Sukno, "Survey on automatic lip-reading in the era of deep learning," Image Vis. Comput., vol. 78, pp. 53–72, Oct. 2018.
[8] H. McGurk and J. MacDonald, "Hearing lips and seeing voices," Nature, vol. 264, no. 5588, pp. 746–748, 1976.
[9] J. S. Chung and A. Zisserman, "Learning to lip read words by watching videos," Comput. Vis. Image Understand., vol. 173, pp. 76–85, Aug. 2018.
[10] E. D. Petajan, Automatic Lipreading to Enhance Speech Recognition (Speech Reading). Champaign, IL, USA: Univ. Illinois at Urbana-Champaign, 1984.
[11] K. Mase and A. Pentland, "Automatic lipreading by optical-flow analysis," Syst. Comput. Jpn., vol. 22, no. 6, pp. 67–76, 1991.
[12] N. Eveno, A. Caplier, and P.-Y. Coulon, "Accurate and quasi-automatic lip tracking," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 5, pp. 706–715, May 2004.
[13] R. Bowden, S. Cox, R. Harvey, Y. Lan, and B. J. Theobald, "Recent developments in automated lip-reading," Proc. SPIE, vol. 8901, Oct. 2013, Art. no. 89010J.
[14] Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen, "A review of recent advances in visual speech decoding," Image Vis. Comput., vol. 32, no. 9, pp. 590–605, Sep. 2014.
[15] R. Seymour, D. Stewart, and J. Ming, "Comparison of image transform-based features for visual speech recognition in clean and corrupted videos," EURASIP J. Image Video Process., vol. 2008, no. 2, pp. 1–9, 2008.
[16] D. Bhatt, C. Patel, H. Talsania, J. Patel, R. Vaghela, S. Pandya, K. Modi, and H. Ghayvat, "CNN variants for computer vision: History, architecture, application, challenges and future scope," Electronics, vol. 10, no. 20, p. 2470, Oct. 2021.
[17] C. Patel, D. Bhatt, U. Sharma, R. Patel, S. Pandya, K. Modi, N. Cholli, A. Patel, U. Bhatt, M. A. Khan, S. Majumdar, M. Zuhair, K. Patel, S. A. Shah, and H. Ghayvat, "DBGC: Dimension-based generic convolution block for object recognition," Sensors, vol. 22, no. 5, p. 1780, Feb. 2022.
[18] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, "Lipreading using convolutional neural network," in Proc. INTERSPEECH, 2014.
[19] S. Petridis, Y. Wang, Z. Li, and M. Pantic, "End-to-end multi-view lipreading," in Proc. Brit. Mach. Vis. Conf. (BMVC), 2017.
[20] J. Xiao, S. Yang, Y. Zhang, S. Shan, and X. Chen, "Deformation flow based two-stream network for lip reading," in Proc. 15th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), Nov. 2020, pp. 364–370.
[21] B. Martinez, P. Ma, S. Petridis, and M. Pantic, "Lipreading using temporal convolutional networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 6319–6323.
[22] P. Ma, B. Martinez, S. Petridis, and M. Pantic, "Towards practical lipreading with distilled and efficient models," 2020, arXiv:2007.06504.
[23] S. Petridis, T. Stafylakis, P. Ma, F. Cai, G. Tzimiropoulos, and M. Pantic, "End-to-end audiovisual speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 6548–6552.
[24] K. Xu, D. Li, N. Cassimatis, and X. Wang, "LCANet: End-to-end lipreading with cascaded attention-CTC," in Proc. 13th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2018, pp. 548–555.
[25] X. Zhang, F. Cheng, and S. Wang, "Spatio-temporal fusion based convolutional sequence learning for lip reading," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 713–722.
[26] M. Luo, S. Yang, S. Shan, and X. Chen, "Synchronous bidirectional learning for multilingual lip reading," in Proc. Brit. Mach. Vis. Conf. (BMVC), 2020.
[27] X. Weng and K. Kitani, "Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading," in Proc. 30th Brit. Mach. Vis. Conf. (BMVC), 2019.
[28] D. Feng, S. Yang, S. Shan, and X. Chen, "Learn an effective lip reading model without pains," 2020, arXiv:2011.07557.
[29] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, "Squeeze-and-excitation networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 8, pp. 2011–2023, Aug. 2020.
[30] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, "CvT: Introducing convolutions to vision transformers," 2021, arXiv:2103.15808.
[31] J. S. Chung and A. Zisserman, "Lip reading in the wild," in Proc. Asian Conf. Comput. Vis., 2016, pp. 87–103.
[32] S. Yang, Y. Zhang, D. Feng, M. Yang, C. Wang, J. Xiao, K. Long, S. Shan, and X. Chen, "LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild," in Proc. 14th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), May 2019, pp. 1–8.
[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. NIPS, 2017.
[34] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. NAACL, 2019.
[35] D. Mahajan, R. Girshick, V. Ramanathan, K. He, and L. van der Maaten, "Exploring the limits of weakly supervised pretraining," in Proc. Eur. Conf. Comput. Vis., 2018.
[36] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16×16 words: Transformers for image recognition at scale," in Proc. Int. Conf. Learn. Represent. (ICLR), 2021.
[37] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions," 2021, arXiv:2102.12122.
[38] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. Tay, J. Feng, and S. Yan, "Tokens-to-token ViT: Training vision transformers from scratch on ImageNet," 2021, arXiv:2101.11986.
[39] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," 2014, arXiv:1412.3555.
[40] K. Hara, H. Kataoka, and Y. Satoh, "Learning spatio-temporal features with 3D residual networks for action recognition," in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2017, pp. 3154–3160.
[41] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "Mixup: Beyond empirical risk minimization," in Proc. Int. Conf. Learn. Represent. (ICLR), 2018.
[42] T. Stafylakis, M. H. Khan, and G. Tzimiropoulos, "Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs," Comput. Vis. Image Understand., vol. 176, pp. 22–32, Nov. 2018.
[43] M. Kim, J. Hong, S. J. Park, and Y. M. Ro, "Multi-modality associative bridging through memory: Speech sound recollected from face video," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 296–306.
[44] M. Kim, J. H. Yeo, and Y. M. Ro, "Distinguishing homophenes using multi-head visual-audio memory for lip reading," in Proc. AAAI Conf. Artif. Intell. (AAAI), 2022, pp. 1174–1182.

HUIJUAN WANG was born in Dacheng, Hebei, China, in 1982. She received the B.S. and M.S. degrees in computer science and technology from Nankai University, China, in 2005 and 2008, respectively, and the Ph.D. degree from the Hebei University of Technology, China, in 2019. She is currently an Associate Professor with the North China Institute of Aerospace Engineering. She has published more than 20 articles. Her research interests include computer vision, pattern recognition, and deep learning.

GANGQIANG PU was born in Bengbu, Anhui, China, in 1995. He received the bachelor's degree in software engineering from the School of Computer Science, Anhui Polytechnic University, China, in 2019. He is currently pursuing the M.S. degree with the Department of Computer, North China Institute of Aerospace Engineering, China. His research interests include computer vision, image processing, and pattern recognition.

TINGYU CHEN was born in Langfang, Hebei, China, in 1993. She received the bachelor's degree in computer science and technology from the Hebei University of Technology, in 2012, and the M.S. degree in advanced computing with management from King's College London, U.K., in 2017. She is currently a Lecturer at the North China Institute of Aerospace Engineering. Her research interests include artificial intelligence and pattern recognition.