
Learning to Lip Read Words by Watching Videos

Joon Son Chung, Andrew Zisserman


Visual Geometry Group, Department of Engineering Science, University of Oxford

Abstract

Our aim is to recognise the words being spoken by a talking face, given only
the video but not the audio. Existing works in this area have focussed on trying
to recognise a small number of utterances in controlled environments (e.g. digits
and alphabets), partially due to the shortage of suitable datasets.
We make three novel contributions: first, we develop a pipeline for fully
automated data collection from TV broadcasts. With this we have generated
a dataset with over a million word instances, spoken by over a thousand dif-
ferent people; second, we develop a two-stream convolutional neural network
that learns a joint embedding between the sound and the mouth motions from
unlabelled data. We apply this network to the tasks of audio-to-video synchroni-
sation and active speaker detection; third, we train convolutional and recurrent
networks that are able to effectively learn and recognize hundreds of words from
this large-scale dataset.
In lip reading and in speaker detection, we demonstrate results that exceed
the current state-of-the-art on public benchmark datasets.
Keywords: lip reading, lip synchronisation, active speaker detection, large
vocabulary, dataset

1. Introduction

Lip-reading, the ability to understand speech using only visual information,
is a very attractive skill. It has clear applications in speech transcription for
cases where audio is not available, such as for archival silent films or (less ethi-
cally) off-mike exchanges between politicians or celebrities (the visual equivalent
of open-mike mistakes). It is also complementary to the audio understanding of
speech, and indeed can adversely affect perception if audio and lip motion are
not consistent (as evidenced by the McGurk [1] effect). For such reasons, lip-
reading has been the subject of a vast research effort over the last few decades.
Our objective in this work is a scalable approach to large lexicon speaker
independent lip-reading. Furthermore, we aim to recognize words from contin-
uous speech, where words are not segmented, and there may be co-articulation
of the lips from preceding and subsequent words. Achieving this goal enables a
form of ‘word spotting’ in (no-audio) video streams.
In lip-reading there is a fundamental limitation on performance due to ho-
mophones. These are sets of words that sound different, but involve identical
movements of the speaker’s lips. Thus they cannot be distinguished using visual
information alone. For example, in English the phonemes ‘p’, ‘b’ and ‘m’ are visu-
ally identical, and consequently the words mark, park and bark are homophones
(as are pat, bat and mat) and so cannot be distinguished by lip-reading. This
problem has been well studied and there are lists of ambiguous phonemes and
words available [2, 3]. It is worth noting that the converse problem also applies:
for example ‘m’ and ‘n’ are easily confused in audio, but are visually distinct.
We take account of such homophone ambiguity in assessing the performance of
our methods.
Apart from this limitation, lip-reading is a challenging problem in any case
due to intra-class variations (such as accents, speed of speaking, mumbling), and
adversarial imaging conditions (such as poor lighting, strong shadows, motion,
resolution, foreshortening, etc.).
In this paper we investigate using Convolutional Neural Networks (CNNs)
for directly recognizing individual words from a sequence of lip movements.
Our reason for considering CNNs, rather than the more usual Recurrent Neural
Networks that are used for sequence modelling, is their ability to learn to classify
images based on their content given only class-level supervision, i.e.
without having to provide stronger supervisory information such as bounding
boxes or pixel-wise segmentation. This ability is evident from the results of the
ImageNet classification challenge [4].
We take advantage of this ability to recognize temporal signals in an image
time series. In particular we consider one second sequences of lip movements
of continuous speech and learn to recognize words within the sequence given
only class-level supervision, without requiring stronger temporal supervision
such as specifying the start and end of the word. Clearly, spatial registration of
the mouth is an important element to consider in the design of the networks.
Typically, the imaged head will move in the video, either due to actual move-
ment of the head or due to camera motion. One approach would be to tightly
register the mouth region (including lips, teeth and tongue, that all contribute
to word recognition), but another is to develop networks that are tolerant to
some degree of motion jitter. We take the latter approach, and do not enforce
tight registration.
We make contributions in three areas: first, we build a pipeline for automated
large scale data collection, including visual and temporal alignment. With this
we are able to obtain training data for hundreds of distinct words, thousands
of instances for each word, and over a thousand speakers (Section 2); second,
we develop a two-stream convolutional neural network SyncNet that learns a
joint embedding between the sound and the mouth motions using cross-modal
self-supervision. We apply this network to the tasks of audio-to-video synchro-
nisation and active speaker detection (Section 3); third, we develop and compare
a number of network architectures for classifying multi-frame time series of lips
(Section 4). In speaker detection and lip reading, our results exceed the state-
of-the-art on public datasets, Columbia [5] and OuluVS2 [6].
As discussed in the related work below, in three aspects: (i) speaker inde-
pendence, (ii) learning from continuous speech, and (iii) lexicon (vocabulary)
size, we go far beyond the current state of the art. We also exceed the state of
the art in terms of performance, as is also shown in Section 5 by comparisons
on the standard OuluVS2 benchmark dataset [6].

1.1. Related work
Research on lip reading (a.k.a. visual speech recognition) has a long history.
A thorough survey of shallow (i.e. not deep learning) methods is given in the
recent review [7], and will not be repeated in detail here. Many of the existing
works in this field have followed similar pipelines which first extract spatio-
temporal features around the lips (either motion-based, geometric-feature based
or both), and then align these features with respect to a canonical template. For
example, Pei et al. [8], which holds the state of the art on many datasets, extracts
patch trajectories as spatio-temporal features, and then aligns these features
to reference motion patterns.
A number of recent papers have used deep learning methods to tackle prob-
lems related to lip reading. Koller et al. [9] train an image classifier CNN to
discriminate visemes (mouth shapes, visual equivalent of phonemes) on a sign
language dataset where the signers mouth words. Similar CNN methods have
been used by [10] to predict phonemes in spoken Japanese. In the context
of word recognition, [11] has used deep bottleneck features (DBF) to encode
shallow input features such as LDA and GIF [12]. Similarly [13] uses DBF to
encode the image for every frame, and trains an LSTM classifier to generate a
word-level classification.
One of the major obstacles to progress in this field has been the lack of
suitable datasets [7]. Table 1 gives a summary of existing datasets. The amount
of available data is far from sufficient to train scalable and representative models
that will be able to generalise beyond the controlled environments and the very
limited domains (e.g. digits and the alphabet).
Word classification with large lexicons has not been attempted in lip reading,
but [23] has tackled a similar problem in the context of text spotting. Their
work shows that it is feasible to train a general and scalable word recognition
model for a large pre-defined dictionary, as a multi-class classification problem.
We take a similar approach.
Of relevance to the architectures and methods developed in this paper are
CNNs for action recognition that learn from multiple-frame image sequences,
such as [24, 25, 26], particularly the ways in which they capture spatio-temporal
information in the image sequence using temporal pooling layers and 3D con-
volutional filters.

Name            Env.     Output     I/C   # class   # subj.   Best perf.
AVICAR [14]     In-car   Digits     C     10        100       37.9% [15]
AVLetter [16]   Lab      Alphabet   I     26        10        43.5% [17]
CUAVE [18]      Lab      Digits     I     10        36        83.0% [19]
GRID [20]       Lab      Words      C     8.5∗      34        79.6% [21]
OuluVS1 [17]    Lab      Phrases    I     10        20        89.7% [8]
OuluVS2 [6]     Lab      Phrases    I     10        52        73.5% [22]
LRW             TV       Words      C     500       1000+     -

Table 1: Existing lip reading datasets. I for Isolated (one word, letter or digit per record-
ing); C for Continuous recording. The reported performance is on speaker-independent ex-
periments. (∗ For GRID [20], there are 51 classes in total, but the first word in a phrase is
restricted to 4, the second word 4, etc. 8.5 is the average number of possible classes at each
position in the phrase.)

2. Building the dataset

This section describes our multi-stage pipeline for automatically collecting
and processing a very large-scale visual speech recognition dataset, starting from
British television programmes. Using this pipeline we have been able to extract
1000s of hours of spoken text covering an extensive vocabulary of 1000s of
different words, with over 1M word instances, and over 1000 different speakers.
The key ideas are to: (i) obtain a temporal alignment of the spoken audio
with a text transcription (broadcast as subtitles with the programme). This
in turn provides the time alignment between the visual face sequence and the
words spoken; (ii) obtain a spatio-temporal alignment of the lower face for the
frames corresponding to the word sequence; and, (iii) determine that the face is
speaking the words (i.e. that the words are not being spoken by another person
in the shot). The pipeline is summarised in Figure 2 and the individual stages
are discussed in detail in the following paragraphs.

Figure 1: A sample of speakers in our dataset.

Figure 2: Pipeline to generate the text and visually aligned dataset (shot detection; face
detection and tracking; facial landmark detection; OCR of the subtitles; audio-subtitle forced
alignment and alignment verification; AV synchronisation and active speaker detection).

Stage 1. Selecting programme types. We require programmes that have
a changing set of talking heads, so choose news and current affairs, rather than
dramas with a fixed cast. Table 2 lists the programmes. There is significant
variation in format across the programmes – from regular news, where a single
speaker talks directly at the camera, to panel debates, where the speakers look
at each other and often shift their attention. A few people appear repeatedly
in the videos (e.g. the news presenter in BBC News, or the host in the other
programmes), but the large majority of participants change every episode
(Figure 1).

Channel    Series name     Description              # vid.   Length    Yield
BBC 1 HD   News at 1       Regular news             1242     30 mins   39.9%
BBC 1 HD   News at 6       Regular news             1254     30 mins   33.9%
BBC 1 HD   News at 10      Regular news             1301     30 mins   32.9%
BBC 1 HD   Breakfast       Regular news             395      varied    39.2%
BBC 1 HD   Newsnight       Current affairs debate   734      35 mins   40.0%
BBC 2 HD   World News      Regular news             376      30 mins   31.9%
BBC 2 HD   Question Time   Current affairs debate   353      60 mins   48.8%

Table 2: Video statistics. The yield is the proportion of useful face appearance relative to the
total length of video. A useful face appearance is one that appears continuously for at least
5 seconds, with the face being that of the speaker.

Figure 3: Subtitles on BBC TV. Left: ‘Question Time’, Right: ‘BBC News at One’.

Stage 2. Subtitle processing and alignment. We require the alignment
between the audio and the subtitle in order to get a timestamp for every word
that is being spoken in the videos. The BBC transmits subtitles as bitmaps
rather than text, therefore subtitle text is extracted from the broadcast video
using standard OCR methods [27, 28]. The subtitles are not time-aligned, and
also not verbatim as they are generated live. The Penn Phonetics Lab Forced
Aligner [29, 30] (based on the open-source HTK toolbox [31]) is used to force-
align the subtitle to the audio signal. The aligner uses the Viterbi algorithm
to compute the maximum likelihood alignment between the audio (modelled by
PLP features [32]) and the text. This method of obtaining the alignment has
significant performance benefits over regular speech recognition methods that
do not use prior knowledge of what is being said. The alignment result, however,
is not perfect due to: (1) the method often misses words that are spoken too
quickly; (2) the subtitles are not verbatim; (3) the acoustic model is only trained
to recognise American English. The noisy labels are filtered by double-checking
against the commercial IBM Watson Speech to Text service. In this case, the
only remaining label noise is where an interview is dubbed in the news, which
is rare.
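To make the filtering step concrete, the sketch below shows one way such a double-check could be implemented. It is a minimal sketch under assumed inputs: the (word, timestamp) data structures and the 0.5-second tolerance are illustrative and not the exact procedure or thresholds used in the pipeline.

```python
# Hypothetical sketch: keep only words whose forced-alignment timestamp is
# corroborated by a second, independent transcription (e.g. a commercial ASR).
# The data structures and the tolerance are assumptions for illustration.

def filter_aligned_words(aligned, asr, tolerance=0.5):
    """aligned, asr: lists of (word, start_time_in_seconds) tuples."""
    kept = []
    for word, t in aligned:
        # Accept the label if the same word appears in the second transcript
        # within `tolerance` seconds of the forced-alignment timestamp.
        if any(w == word and abs(s - t) <= tolerance for w, s in asr):
            kept.append((word, t))
    return kept

# Toy usage:
aligned = [("afternoon", 12.3), ("government", 15.1)]
asr = [("afternoon", 12.4), ("leaders", 18.0)]
print(filter_aligned_words(aligned, asr))  # [('afternoon', 12.3)]
```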
Stage 3. Shot boundary detection, face detection, and tracking. The
shot boundaries are determined to find the within-shot frames for which face
tracking is to be run. This is done by comparing colour histograms across consec-
utive frames [33]. The HOG-based face detection method of [34] is performed on
every frame of the video (Figure 4 left). As with most face detection methods,
this results in many false positives and some missed detections. In a similar
manner to [28], all face detections of the same person are grouped across frames
using a KLT tracker [35] (Figure 4 middle). If the track overlaps with face
detections on the majority of frames, it is assumed to be correctly tracking the
face.
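As an illustration of the histogram-based shot detection, the following sketch compares colour histograms of consecutive frames with OpenCV. The similarity threshold and histogram binning are assumed values, not parameters taken from the paper.

```python
# Minimal sketch of histogram-based shot boundary detection (assumed threshold).
import cv2

def shot_boundaries(video_path, threshold=0.7):
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # 3D colour histogram over the BGR channels, normalised for comparison.
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, None).flatten()
        if prev_hist is not None:
            # Correlation close to 1 means similar frames; a sharp drop suggests a cut.
            sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if sim < threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```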

Figure 4: Left: Face detections; Middle: KLT features and the tracked bounding box (in
yellow); Right: Facial landmarks.

Stage 4. Facial landmark detection and speaker identification. Facial
landmarks are needed to determine the mouth position for speaker/non-speaker
classification. They are determined in every frame of the face track using the
method of [36] (Figure 4 right). The landmarks are used to determine the
mouth region, and to map it to a canonical position as input to the two-stream
network described in Section 3 that is used to determine who is speaking in the
video, and to reject the clip if the face is not speaking in sync. It is important
to determine whether the face shown is actually speaking or not. For example,
there may be a reaction shot or voice-over.
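For concreteness, a minimal sketch of landmark-based mouth cropping using dlib (which provides an implementation of the regression-tree method of [36]) is given below; the 68-point model file name and the crop margin are assumptions for illustration, not the exact settings of the pipeline.

```python
# Sketch: detect facial landmarks with dlib [34, 36] and crop a mouth-centred box.
# The model file name and the margin are assumptions for illustration.
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_mouth(gray, margin=20):
    """gray: a greyscale frame as a uint8 numpy array."""
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Points 48-67 of the 68-point model describe the outer and inner lips.
    xs = [shape.part(i).x for i in range(48, 68)]
    ys = [shape.part(i).y for i in range(48, 68)]
    cx, cy = sum(xs) // len(xs), sum(ys) // len(ys)
    return gray[cy - margin:cy + margin, cx - margin:cx + margin]
```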

Figure 5: One-second clips that contain the word ‘about’. Top: male speaker, bottom: female
speaker.

Figure 6: Word statistics: histograms of word length (in characters), and of word duration
(in seconds) for 5-character and 9-character words. Regardless of the actual duration of the
word, we take a 1-second clip for training and test.

Stage 5. Compiling the training and test data. The training, validation
and test sets are disjoint in time. The dates of the videos corresponding to each
set are shown in Table 3. Note that we leave a week’s gap between the test set
and the rest in case any news footage is repeated. The lexicon is obtained by
selecting the 500 most frequently occurring words between 5 and 10 characters in
length (Figure 6 gives the word duration statistics). This word length is chosen
such that the speech duration does not exceed the fixed one-second bracket that
is used in the recognition architecture, whilst shorter words are not included
because there are too many ambiguities due to homophones (e.g. ‘bad’, ‘bat’,
‘pat’, ‘mat’, etc. are all visually identical), and sentence-level context would be
needed to disambiguate these.
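A minimal sketch of this lexicon selection step, assuming the aligned transcripts are simply available as lists of word strings, is:

```python
# Sketch: select the 500 most frequent words of 5-10 characters from the
# aligned training transcripts (the transcript format is an assumption).
from collections import Counter

def build_lexicon(transcripts, vocab_size=500, min_len=5, max_len=10):
    counts = Counter()
    for text in transcripts:
        for word in text.upper().split():
            if min_len <= len(word) <= max_len and word.isalpha():
                counts[word] += 1
    return [w for w, _ in counts.most_common(vocab_size)]

# Toy usage:
lexicon = build_lexicon(["ABOUT THE GOVERNMENT SAID TODAY", "ABOUT LEADERS"])
print(lexicon)  # e.g. ['ABOUT', 'GOVERNMENT', 'TODAY', 'LEADERS']
```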
These 500 words occur at least 800 times in the training set, and at least
40 times in each of the validation and test sets. For each of the occurrences,
the one-second clip is taken, and the face is cropped with the mouth centered
using the registration found in Stage 4. The words are not isolated, as is the
case in other lip-reading datasets; as a result, there may be co-articulation of
the lips from preceding and subsequent words. The test set is manually checked
for errors.

Set     Dates                      # class   #/class
Train   01/01/2010 - 28/02/2015    500       800+
Val     01/03/2015 - 25/07/2015    500       50
Test    01/08/2015 - 31/03/2016    500       50

Table 3: Dataset statistics.

3. Learning a Synchronization Network for Lip Motion and Audio

The ability to identify who is speaking is crucial in building the dataset
described in Section 2, and has many applications beyond this task.
This section describes the representations and network architectures for a
Synchronization Network (SyncNet), which ingests 0.2-second audio and video
clips, and generates a joint embedding between the inputs. The audio-to-video
embedding is then used to identify the active speaker and to correct the lip-sync
error.
No explicit annotations (e.g. word labels, or the precise time offset) are
used to train this network – we only assume that in the majority of television
videos, the audio and the video are usually synced, and we use cross-modal
self-supervision to learn the embedding.
The model consists of two asymmetric streams for audio and video, each of
which is described below.
Figure 7: SyncNet architecture, showing the layer support, number of filters and stride for
each layer of the two streams. The visual stream ingests a 120×120×5 input and the audio
stream a 13×20×1 MFCC input; both streams terminate in 256-dimensional fc7 embeddings
joined by a contrastive loss. Both streams are trained simultaneously.

3.1. Loss function

The training objective is that the output of the audio and the video networks
are similar for genuine pairs, and different for false pairs. Specifically, the
Euclidean distance between the network outputs is minimised or maximised.
We propose to use the contrastive loss (Equation 1), originally proposed for
training Siamese networks [37]. v and a are the fc7 vectors of the video and the
audio streams, respectively. y ∈ {0, 1} is the binary similarity between the audio
and the video inputs.

E = \frac{1}{2N} \sum_{n=1}^{N} y_n \, d_n^2 + (1 - y_n) \, \max(\mathrm{margin} - d_n, 0)^2          (1)

d_n = \lVert v_n - a_n \rVert_2          (2)

An alternative to this would be to approach the problem as one of classifi-
cation (binary classification of on-sync and off-sync, or multi-class between the
different offset bins using synthetic data), however we were unable to achieve
convergence using this method.
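A minimal PyTorch sketch of the contrastive loss of Equation 1 is given below for reference; the paper's implementation used MatConvNet and Caffe, and the margin value shown is an assumption.

```python
# Sketch of the contrastive loss of Equation 1 (the margin is an assumed value).
import torch
import torch.nn.functional as F

def contrastive_loss(v, a, y, margin=1.0):
    """v, a: (N, 256) fc7 embeddings of the video and audio streams.
    y: (N,) binary labels, 1 for genuine (synchronised) pairs, 0 for false pairs."""
    d = F.pairwise_distance(v, a)                        # Euclidean distance d_n
    loss = y * d.pow(2) + (1 - y) * torch.clamp(margin - d, min=0).pow(2)
    return loss.mean() / 2                               # 1/(2N) * sum over the batch

# Toy usage:
v = torch.randn(8, 256)
a = torch.randn(8, 256)
y = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(v, a, y))
```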

3.2. Training

The training procedure is inspired by Siamese networks [37]; however, our
network differs in that it consists of two non-identical streams with independent
sets of parameters, taking inputs from two different domains. The network weights
are learnt using stochastic gradient descent with momentum. The parameters
for both streams of the network are learnt simultaneously.

3.3. Audio stream

The input audio data consists of MFCC (Mel-frequency cepstral coefficient) values. This is a representation of the short-
term power spectrum of a sound on a non-linear mel scale of frequency. 13 mel
frequency bands are used at each time step. The features are computed at a
sampling rate of 100Hz, giving 20 time steps for a 0.2-second input signal.
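To illustrate, the sketch below computes such a 13×20 MFCC input with librosa; the 16 kHz sample rate and the hop length are assumptions chosen to give a 100 Hz feature rate, not settings taken from the paper.

```python
# Sketch: compute a 13 x 20 MFCC input for a 0.2-second audio window.
# The 16 kHz sample rate and the hop length are illustrative assumptions.
import numpy as np
import librosa

sr = 16000
audio = np.random.randn(int(0.2 * sr)).astype(np.float32)   # stand-in for real audio

# hop_length = sr / 100 gives features at 100 Hz, i.e. 20 steps for 0.2 seconds.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=sr // 100)
mfcc = mfcc[:, :20]          # 13 coefficients x 20 time steps
print(mfcc.shape)            # (13, 20)
```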

Figure 8: Input representations. Left: temporal representations as heatmaps for audio.
The 13 rows (A to M) in the audio image encode each of the 13 MFCC features representing
powers at different frequency bins. Right: Grayscale images of the mouth area.

Representation. The audio is encoded as a heatmap image representing
MFCC values for each time step and each mel frequency band (see Figure 8).
The top and bottom three rows of the image are reflected to reduce boundary
effects. Previous work [38] has also attempted to train an image-style CNN for
similar inputs.
Architecture. We use a convolutional neural network inspired by those de-
signed for image recognition. Our layer architecture (Figure 7) is based on
VGG-M [39], but with modified filter sizes to ingest the inputs of unusual di-
mensions of 13 × 20 (13 in the frequency domain, and 20 in the time domain).

3.4. Visual stream

Representation. The input format to the visual network is a sequence of
mouth regions as grayscale images, as shown in Figure 8. The input dimensions
are 111×111×5 (W×H×T) for 5 frames, which corresponds to 0.2-seconds at
25 Hz.
Architecture. The visual stream is based on the VGG-M network, but the
conv1 filter size has been modified to ingest the 5-channel input instead of the
usual 3.

3.5. Applications

Figure 9: Mean distance between the audio and the video features for different offset values,
averaged over a clip. The actual offset lies at the trough. The three example clips shown here
are for different scenarios. Left: synchronised AV data; Middle: the audio leads the video
by 4 frames; Right: the audio and the video are uncorrelated.

The problems of AV synchronisation and active speaker detection are closely
related in that the correspondence between the video and the accompanying
audio must be established.
Audio-to-video synchronisation. To find the time offset between the audio
and the video, we use a sliding-window approach. For each sample, the distance
is computed between one 5-frame video feature and every audio feature in the ±
1 second range. The correct offset is found where this distance is at a minimum.
Since not every 0.2-second sample contains discriminative information (e.g., the
person might be taking a breath), the distance for every offset value is averaged
across the video clip. Typical distances against offset plots are shown in Figure 9.
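A sketch of this sliding-window offset search, assuming per-frame audio and video embeddings have already been extracted, is shown below; the offset range and feature format are assumptions for illustration.

```python
# Sketch of the audio-to-video synchronisation search: for each candidate offset,
# average the Euclidean distance between video features and time-shifted audio
# features over the clip, and pick the offset with the smallest mean distance.
# Feature extraction itself is assumed to have been done elsewhere.
import numpy as np

def find_av_offset(video_feats, audio_feats, max_offset=25):
    """video_feats, audio_feats: (T, D) arrays of embeddings at 25 Hz."""
    T = min(len(video_feats), len(audio_feats))
    mean_dists = {}
    for off in range(-max_offset, max_offset + 1):
        dists = []
        for t in range(T):
            ta = t + off
            if 0 <= ta < T:
                dists.append(np.linalg.norm(video_feats[t] - audio_feats[ta]))
        mean_dists[off] = np.mean(dists)
    # The correct offset is at the minimum of the averaged distance curve (Figure 9).
    return min(mean_dists, key=mean_dists.get)
```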
Active speaker detection. We test our method using the dataset (Figure 10)
and the evaluation protocol of Chakravarty et al. [5]. The objective is to deter-
mine who the speaker is in a multi-subject scene.

Figure 10: Still images from the Columbia dataset [5].

The dataset contains 6 speakers, of which 1 is used for development and
5 (Bell, Bollinger, Lieberman, Long, Sick) for testing. A score threshold is set using
the annotations on the remaining speaker (Abbas), at the point where the ROC
curve intersects the diagonal (the equal error rate).
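A short sketch of choosing the threshold at the equal error rate with scikit-learn follows; the toy labels and scores are placeholders for the held-out speaker's annotations and our network's scores.

```python
# Sketch: choose a score threshold at the equal error rate of the ROC curve
# (toy labels and scores shown for illustration).
import numpy as np
from sklearn.metrics import roc_curve

labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])          # 1 = speaking, 0 = not speaking
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.5, 0.6, 0.2])

fpr, tpr, thresholds = roc_curve(labels, scores)
# Equal error rate: the point where the false positive rate equals the false negative rate.
eer_idx = np.argmin(np.abs(fpr - (1 - tpr)))
threshold = thresholds[eer_idx]
print(threshold)
```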
We report the F1-scores in Table 4. The scores for each test sample are
averaged over a 10-frame or 100-frame window. The performance is almost
perfect for the 100-frame window. The disadvantage of increasing the size of
the averaging window is that the method cannot detect examples in which the
person speaks for a very short period; though this is not a problem for this
dataset.

Method        [5]                 Ours
Window        10       100        10       100
Bell          82.9%    90.3%      93.7%    100%
Bollinger     65.8%    69.0%      83.4%    100%
Lieberman     73.6%    82.4%      86.8%    100%
Long          86.9%    96.0%      97.7%    99.8%
Sick          81.8%    89.3%      86.1%    99.8%

Table 4: F1-scores on the Columbia speaker detection dataset. The results of [5] have been
digitised from Figure 3b of their paper, and are accurate to around ±0.5%.

4. Models for Lip Reading

The task for the network is to predict which words are being spoken, given a
video of a talking face. The input format to the network is a sequence of mouth
regions, as shown in Figure 5. Previous attempts at visual speech recognition
have relied on very precise localisation of the facial landmarks (the mouth in
particular); our aim is to learn from noisier data, and to tolerate some
localisation irregularities both in position and in time.

4.1. Architectures

We cast the problem as one of multi-way classification, and so base our ar-
chitecture on ones designed for image classification [40, 39, 41]. In particular,
we build on the VGG-M model [39] since this has a good classification perfor-
mance, but is much faster to train and experiment on than deeper models, such
as VGG-16 [41]. We develop and compare models that differ principally in how
they ‘ingest’ the T input frames (where T = 25 for a 1-second interval).
These variations take inspiration from previous work on human action classi-
fication [24, 25, 42, 26]. Apart from these differences, the architectures share
the configuration of VGG-M, and this allows us to directly compare the perfor-
mance across different input designs. These configurations are closely related
to the visual stream of SyncNet (Section 3).

We next describe the five architectures, summarised in Figure 11, followed
by a discussion of their differences. Their performance is compared in Section 5.
The numbers in the names refer to the number of temporal frames ingested by
each tower, and the prefixes EF, MT, LF and LSTM indicate where the fusion occurs.

Figure 11: Left: VGG-M architecture that is used as a base. Right: Network architectures
for lip reading (EF-25, MT-1, MT-5, LF-5 and LSTM-5).

Early Fusion (EF-25). The network ingests a 25-channel image, where each
of the channels encode an individual frame in greyscale. The layer structure for
the subsequent layers is identical to that of the regular VGG-M network. This
method is related to the Early Fusion model in [25], which takes colour images
and uses a T×3-channel convolutional filter at conv1. We did experiment with
25×3-channel colour input, but found that the increased number of parame-
ters at conv1 made training difficult due to overfitting (resulting in validation
performance that is around 5% weaker; not quoted in Section 5).
Multiple Towers (MT-1). There are T = 25 towers with common conv1 layers
(with shared weights), each of which takes one input frame. The activations
from the towers are concatenated channel-wise after pool1, producing an output
activation with 1200 channels. The subsequent 1×1 convolution is performed
to reduce this dimension, to keep the number of parameters at conv2 at a
manageable level. The rest of the network is the same as the regular VGG-M.
Multiple Towers (MT-5). There are 21 towers with common conv1 layers,
each of which takes a 5-frame window, moving one frame at a time. The
subsequent layers are configured in the same way as MT-1.
Late Fusion (LF-5). Like MT-5, the 21 towers each take 5-frame windows,
with a stride of 1. However, each tower in this variant has common conv1 to fc6
layers with shared weights, after which the activations are concatenated. The
subsequent layer structure is the same as EF-25, MT-1 and MT-5.
Long Short-Term Memory (LSTM-5). Each convolutional tower shares
the layer configuration of the LF-5 model. The two-layer LSTM ingests the
visual features (fc6 activations) of the 5-frame sliding window, moving 1-frame
at a time, and returns the classification result at the end of the sequence.
Discussion. The early fusion architecture EF-25 shares similarities with pre-
vious work on human action recognition using CNNs [24, 25, 42] in that registra-
tion between frames is assumed. The models perform time-domain operations
beginning from the first layer to precisely capture local motion direction and
speed [25]. For these methods to capture useful information, good registration
of details between frames is critical. However, we are not imposing strict regis-
tration, and in any case it goes slightly against the signal (lip motion and mouth
region deformation) that we are trying to capture.
In contrast, the MT-1 model delays all time-domain registrations (and op-
erations) until after the first set of convolutional and pooling layers. This gives
tolerance against minor registration errors (the receptive field size at conv2 is
11 pixels). Note, the common conv1 layers of the multiple towers ensures that
the same filter weights are used for all frames, whereas in the early fusion ar-
chitecture EF-25 it is possible to learn different weights for each frame.
The MT-5 model shares similarities with both EF-25 and MT-1 models –
the 5-frame input to each tower allows the network to learn some local motion
information, while remaining more tolerant to movement than the EF-25 model over
the whole time period.
The LF-5 model also shares many characteristics of MT-5, but delays time-
domain operations until after all of the convolutional layers, except within the
5 neighbouring frames between which the movement would be negligible.
Likewise, the LSTM-5 delays time-domain operations, and in addition,
this model benefits from the ability to accept sequences of variable lengths,
unlike the other models.
One other design choice is the size of the input images. This was chosen as
111×111 pixels, which is smaller than that typically used in image classification
networks. The reason is that the cropped mouth images are rarely
larger than 111×111 pixels, and this smaller choice means that smaller filters
can be used at conv1 than those used in VGG-M without sacrificing receptive
fields, while avoiding learning unnecessary parameters.
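As a rough illustration of the early-fusion design, the sketch below defines a VGG-M-style network ingesting a 25-channel 111×111 input; the exact filter sizes and channel counts are assumptions and differ in detail from the configurations in Figure 11.

```python
# Rough sketch of an EF-25-style network: a VGG-M-like stack whose conv1 ingests
# 25 greyscale frames as channels. The filter sizes and channel counts here are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class EF25(nn.Module):
    def __init__(self, num_classes=500):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(25, 96, kernel_size=3), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(96, 256, kernel_size=3, stride=2), nn.ReLU(), nn.MaxPool2d(3, 2),
            nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(),    # fc6
            nn.Linear(4096, 4096), nn.ReLU(),  # fc7
            nn.Linear(4096, num_classes),      # fc8 over the 500-word lexicon
        )

    def forward(self, x):                      # x: (batch, 25, 111, 111)
        return self.classifier(self.features(x))

logits = EF25()(torch.randn(2, 25, 111, 111))
print(logits.shape)                            # torch.Size([2, 500])
```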

4.2. Training

Data augmentation. Data augmentation often helps to improve valida-
tion performance by reducing overfitting in CNN image classification tasks [40].
We apply the augmentation techniques used on the ImageNet classification task
by [41, 40] (e.g. random cropping, flipping, colour shift), with a consistent trans-
formation applied to all frames of a single clip. To further augment the training
data, we make random shifts in time by up to 0.2 seconds, which improves the
top-1 validation error by 3.5% compared to the standard ImageNet augmenta-
tion methods. It was not feasible to scale in the time domain, as this introduces visible
artifacts due to the relatively low video frame rate of 25 fps.
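A sketch of the temporal-shift augmentation, assuming each training sample is stored as a slightly longer frame sequence centred on the word, is given below; the storage format and flipping probability are assumptions.

```python
# Sketch of the temporal augmentation: jitter the 25-frame (1-second) window by
# up to +/- 5 frames (0.2 s at 25 fps), and apply one consistent horizontal flip
# to all frames of the clip. Assumes clips are stored with a few spare frames.
import random
import numpy as np

def augment_clip(frames, clip_len=25, max_shift=5):
    """frames: (T, H, W) greyscale array with T >= clip_len + 2 * max_shift."""
    centre = (len(frames) - clip_len) // 2
    shift = random.randint(-max_shift, max_shift)
    start = centre + shift
    clip = frames[start:start + clip_len]
    if random.random() < 0.5:
        clip = clip[:, :, ::-1]            # the same flip is applied to every frame
    return np.ascontiguousarray(clip)
```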
Details. Our implementation is based on the MATLAB toolbox MatCon-
vNet [43] and Caffe [44]. The network is trained using SGD with momentum 0.9
and batch normalisation [45], but without dropout. The training was stopped
after 20 epochs, or when the validation error did not improve for 3 epochs,
whichever is sooner. The learning rate was decreased from 10^-2 to 10^-4 on a
log scale.

5. Experiments

In this section we evaluate and compare the proposed architectures,
and discuss the challenges arising from the visual ambiguities between words.
We then compare to the state of the art on a public benchmark.

5.1. Comparison of architectures

Evaluation protocol. The models are evaluated on the independent test set
(Section 2). We report top-1 and top-10 accuracies, as well as recall against
rank curves. Here, the ‘Recall@K’ is the proportion of times that the correct
class is found in the top-K predictions for the word. The experiments were per-
formed under two different conditions: ‘continuous’ where the input sequences
also contain co-articulation from the neighbouring words within the one-second
window and ‘isolated’ where the words are segmented according to the forced
alignment output, and thus can last less than one-second.
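For reference, Recall@K can be computed from the network's class scores as in the short sketch below (toy arrays are shown in place of real predictions).

```python
# Sketch: Recall@K, the proportion of samples whose true class is among the
# top-K scoring classes.
import numpy as np

def recall_at_k(scores, labels, k=10):
    """scores: (N, C) class scores; labels: (N,) ground-truth class indices."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()

# Toy usage:
scores = np.random.randn(4, 500)
labels = np.array([3, 10, 499, 42])
print(recall_at_k(scores, labels, k=10))
```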
Results. The results are shown in Table 5. The experimental results show that
the registration-tolerant models give a modest improvement over EF-25, and
the performance improvement is likely to be more significant where the tracking
quality is less ideal. Having 5 frames as input seems to achieve a good balance
for registration, in that it is able to compute useful temporal derivatives (MT-5
is better than MT-1 where no temporal derivatives are computed) which re-
quires local (in time) registration, but does not require the global registration
of EF-25 (which is inferior to both MT models). The LSTM-5 shows stronger
performance compared to the CNN-based models. For all models, the perfor-
mance is slightly better under the ‘isolated’ conditions since there are fewer
ambiguities due to co-articulation.
The top-10 accuracy for the best models is over 95%, despite the relatively
modest top-1 figure of around 70%. This is a result of ambiguities in lip reading,
which we will discuss next.

Net       LRW (Con.)           LRW (Iso.)               OuluVS2
          R@1      R@10        R@1      R@10            Method    R@1
EF-25     57.0%    88.8%       62.5%    92.6%           [22]      73.5%
MT-1      61.1%    90.4%       64.2%    94.2%           [46]      85.6%
MT-5      66.8%    94.6%       69.0%    95.6%           MT-1      93.2%
LF-5      65.4%    93.3%       68.2%    94.8%           MT-5      93.2%
LSTM-5    66.0%    94.3%       71.5%    96.4%           LSTM-5    94.1%

Table 5: Word classification accuracy. Left: On the LRW dataset for the different architec-
tures. Right: On OuluVS2 (short phrases, frontal view). Con. (continuous): the input
sequences also contain co-articulation from the neighbouring words within the one-second
window; Iso. (isolated): the words are segmented according to the forced alignment output,
and thus can last less than one-second.

5.2. Analysis of confusions

0.32  BENEFITS    BENEFIT        0.24  HAPPEN      HAPPENED
0.31  QUESTIONS   QUESTION       0.24  FORCE       FORCES
0.31  REPORT      REPORTS        0.23  HAPPENED    HAPPEN
0.31  BORDER      IMPORTANT      0.23  SERIOUS     SERIES
0.31  AMERICA     AMERICAN       0.23  TROOPS      GROUPS
0.29  GROUND      AROUND         0.22  QUESTION    QUESTIONS
0.28  RUSSIAN     RUSSIA         0.21  PROBLEM     PROBABLY
0.28  FIGHT       FIGHTING       0.21  WANTED      WANTS
0.26  FAMILY      FAMILIES       0.21  RUSSIA      RUSSIAN
0.26  AMERICAN    AMERICA        0.20  TAKEN       TAKING
0.26  BENEFIT     BENEFITS       0.20  PROBLEM     PROBLEMS
0.25  ELECTIONS   ELECTION       0.20  MISSING     MEETING
0.24  WANTS       WANTED         0.20  PARTIES     PARTY

Table 6: Most frequently confused word pairs for the ‘continuous’ experiment. The numbers
refer to class confusions.

Here, we examine the classification results, in particular, the scenarios in
which the network fails to correctly classify the spoken word. Table 6 shows
the most common confusions between words in the test set for the ‘continuous’
experiment. This is generated by taking the largest off-diagonal values in the
word confusion matrix. This result confirms our prior knowledge about the
challenges in visual speech recognition – almost all of the top confusions are
either (i) a plural of the original word (e.g. ‘report’ and ‘reports’) which is
ambiguous because one word is a subset of the other, and the words are not
isolated so this can be due to co-articulation; or (ii) a known homophone visual
ambiguity (explained in Section 1) where the words cannot be distinguished
using visual information alone (e.g. ‘billion’ and ‘million’, ‘worse’ and ‘worst’).
Such errors are phonetically understandable. For example, some of the most
common confusions, e.g. ‘groups’ which is phonetically (G R UW P S) and ‘troops’
(T R UW P S), ‘ground’ (G R AW N D) and ‘around’ (ER AW N D), actually share
most of the phonemes.
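The confusion pairs in Table 6 can be extracted from a confusion matrix as in the sketch below; the row normalisation and the toy counts are assumptions for illustration.

```python
# Sketch: extract the largest off-diagonal entries of a (row-normalised)
# confusion matrix to list the most frequently confused word pairs.
import numpy as np

def top_confusions(conf, words, n=10):
    conf = conf.astype(float) / conf.sum(axis=1, keepdims=True)  # assumed row normalisation
    np.fill_diagonal(conf, 0)                 # ignore correct classifications
    flat = np.argsort(-conf, axis=None)[:n]   # indices of the largest off-diagonal values
    pairs = np.unravel_index(flat, conf.shape)
    return [(round(conf[i, j], 2), words[i], words[j]) for i, j in zip(*pairs)]

# Toy usage:
words = ["REPORT", "REPORTS", "RUSSIA"]
conf = np.array([[50, 16, 1], [14, 40, 2], [1, 1, 60]])
print(top_confusions(conf, words, n=2))
```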
Apart from these difficulties, the failure cases typically arise from extreme
samples: for example, strong international accents, or poor-quality, low-bandwidth
location reports and Skype interviews, where there are motion compression
artifacts or frames dropped from the transmission.

5.3. Comparison to state of the art

It is worth noting that the top-1 classification accuracy of over 70%, shown in
Table 5, is comparable to that of many of the recent works [13, 15, 47] performed
on lexicon sizes that are orders of magnitude smaller (Table 1).

Figure 12: Original video frames for ‘hello’ on OuluVS. Compare this to our original input
frames in Figure 3.

OuluVS2. We evaluate our method on the OuluVS2 dataset [6]. The dataset
consists of 52 subjects uttering 10 phrases (e.g. ‘thank you’, ‘hello’, etc.), and has
been widely used in previous works. Here, we assess on a speaker-independent
experiment, where some of the subjects are reserved for testing.
To apply our method on this dataset, we pre-train the convolutional layers
on the BBC data, and re-train the fully-connected layers from scratch. Training
from scratch on OuluVS2 underperforms as the size of this dataset is insufficient
to train a deep network. For all models apart from LSTM-5, we simply repeat
the first and the last frames to fill the 1-second clip if the phrase is shorter than
25 frames. If the clip is longer, we take a random crop.
As can be seen in Table 5 the method achieves a strong performance, and sets
the new state-of-the-art. Note that, without retraining the convolutional part
of the network, we achieve these strong results on videos that are very different
to ours in terms of lighting, background, camera perspective, etc. (Figure 12),
which shows that the model generalises well across different formats.

6. Summary and extensions

We have shown that CNN and LSTM architectures can be used to classify
temporal lip motion sequences of words with excellent results. We also demon-
strated a recognition performance that exceeds the state of the art on a standard
public benchmark dataset, OuluVS2.
Extensions could include lip reading of profile views, and varying the ar-
chitecture (in terms of depth, 3D CNNs etc) to improve performance – there is
already evidence that there are benefits of using deeper architectures [48] on our
released dataset. It is worth noting that recent papers have combined CNNs
with sequence models in order to recognize sentences rather than individual
words [49, 50].
The dataset is available for download at http://www.robots.ox.ac.uk/~vgg/data/lip_reading/
and the trained SyncNet is available at http://www.robots.ox.ac.uk/~vgg/software/lipsync/.

Acknowledgements.
Funding for this research is provided by the EPSRC Programme Grant See-
bibyte EP/M013774/1. We are very grateful to Rob Cooper and Matt Haynes
at BBC Research for help in obtaining the dataset.

References

[1] H. McGurk, J. MacDonald, Hearing lips and seeing voices, Nature 264 (1976) 746–748.

[2] A. J. Goldschen, O. N. Garcia, E. D. Petajan, Rationale for phoneme-viseme mapping and feature selection in visual speech recognition, in: Speechreading by Humans and Machines, Springer, 1996, pp. 505–515.

[3] P. Lucey, T. Martin, S. Sridharan, Confusability of phonemes grouped according to their viseme classes in noisy environments, in: Proc. of Australian Int. Conf. on Speech Science & Tech, 2004, pp. 265–270.

[4] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, S. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, F. Li, ImageNet large scale visual recognition challenge, IJCV.

[5] P. Chakravarty, T. Tuytelaars, Cross-modal supervision for learning active speaker detection in video, arXiv preprint arXiv:1603.08907.

[6] I. Anina, Z. Zhou, G. Zhao, M. Pietikäinen, OuluVS2: a multi-view audio-visual database for non-rigid mouth motion analysis, in: Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, Vol. 1, IEEE, 2015, pp. 1–5.

[7] Z. Zhou, G. Zhao, X. Hong, M. Pietikäinen, A review of recent advances in visual speech decoding, Image and Vision Computing 32 (9) (2014) 590–605.

[8] Y. Pei, T.-K. Kim, H. Zha, Unsupervised random forest manifold alignment for lipreading, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 129–136.

[9] O. Koller, H. Ney, R. Bowden, Deep learning of mouth shapes for sign language, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 85–91.

[10] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, T. Ogata, Lipreading using convolutional neural network, in: INTERSPEECH, 2014, pp. 1149–1153.

[11] S. Tamura, H. Ninomiya, N. Kitaoka, S. Osuga, Y. Iribe, K. Takeda, S. Hayamizu, Audio-visual speech recognition using deep bottleneck features and high-performance lipreading, in: 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), IEEE, 2015, pp. 575–582.

[12] N. Ukai, T. Seko, S. Tamura, S. Hayamizu, GIF-LR: GA-based informative feature for lipreading, in: Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), 2012 Asia-Pacific, IEEE, 2012, pp. 1–4.

[13] S. Petridis, M. Pantic, Deep complementary bottleneck features for visual speech recognition, ICASSP (2016) 2304–2308.

[14] B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu, T. S. Huang, AVICAR: audio-visual speech corpus in a car environment, in: INTERSPEECH, 2004.

[15] Y. Fu, S. Yan, T. S. Huang, Classification and feature extraction by simplexization, Information Forensics and Security, IEEE Transactions on 3 (1) (2008) 91–100.

[16] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, R. Harvey, Extraction of visual features for lipreading, Pattern Analysis and Machine Intelligence, IEEE Transactions on 24 (2) (2002) 198–213.

[17] G. Zhao, M. Barnard, M. Pietikäinen, Lipreading with local spatiotemporal descriptors, Multimedia, IEEE Transactions on 11 (7) (2009) 1254–1265.

[18] E. K. Patterson, S. Gurbuz, Z. Tufekci, J. N. Gowdy, CUAVE: A new audio-visual database for multimodal human-computer interface research, in: Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, Vol. 2, IEEE, 2002, pp. II–2017.

[19] G. Papandreou, A. Katsamanis, V. Pitsikalis, P. Maragos, Adaptive multimodal fusion by uncertainty compensation with application to audiovisual speech recognition, Audio, Speech, and Language Processing, IEEE Transactions on 17 (3) (2009) 423–435.

[20] M. Cooke, J. Barker, S. Cunningham, X. Shao, An audio-visual corpus for speech perception and automatic speech recognition, The Journal of the Acoustical Society of America 120 (5) (2006) 2421–2424.

[21] M. Wand, J. Koutník, et al., Lipreading with long short-term memory, in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 6115–6119.

[22] Z. Zhou, X. Hong, G. Zhao, M. Pietikäinen, A compact representation of visual speech data using latent variables, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (1) (2014) 1–1.

[23] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Synthetic data and artificial neural networks for natural scene text recognition, in: Workshop on Deep Learning, NIPS, 2014.

[24] S. Ji, W. Xu, M. Yang, K. Yu, 3D convolutional neural networks for human action recognition, IEEE PAMI 35 (1) (2013) 221–231.

[25] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, Large-scale video classification with convolutional neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.

[26] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3D convolutional networks.

[27] P. Buehler, M. Everingham, A. Zisserman, Learning sign language by watching TV (using weakly aligned subtitles), in: Proc. CVPR, 2009.

[28] M. Everingham, J. Sivic, A. Zisserman, “Hello! My name is... Buffy” – automatic naming of characters in TV video, in: Proc. BMVC, 2006.

[29] J. Yuan, M. Liberman, Speaker identification on the SCOTUS corpus, Journal of the Acoustical Society of America 123 (5) (2008) 3878.

[30] H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, The Journal of the Acoustical Society of America 87 (4) (1990) 1738–1752.

[31] P. C. Woodland, C. Leggetter, J. Odell, V. Valtchev, S. J. Young, The 1994 HTK large vocabulary speech recognition system, in: Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, Vol. 1, IEEE, 1995, pp. 73–76.

[32] S. Rubin, F. Berthouzoz, G. J. Mysore, W. Li, M. Agrawala, Content-based tools for editing audio stories, in: Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology, ACM, 2013, pp. 113–122.

[33] R. Lienhart, Reliable transition detection in videos: A survey and practitioner’s guide, International Journal of Image and Graphics.

[34] D. E. King, Dlib-ml: A machine learning toolkit, The Journal of Machine Learning Research 10 (2009) 1755–1758.

[35] C. Tomasi, T. Kanade, Selecting and tracking features for image sequence analysis, Robotics and Automation.

[36] V. Kazemi, J. Sullivan, One millisecond face alignment with an ensemble of regression trees, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1867–1874.

[37] S. Chopra, R. Hadsell, Y. LeCun, Learning a similarity metric discriminatively, with application to face verification, in: Proc. CVPR, Vol. 1, IEEE, 2005, pp. 539–546.

[38] K. J. Geras, A.-r. Mohamed, R. Caruana, G. Urban, S. Wang, O. Aslan, M. Philipose, M. Richardson, C. Sutton, Compressing LSTMs into CNNs, arXiv preprint arXiv:1511.06433.

[39] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the devil in the details: Delving deep into convolutional nets, in: Proc. BMVC, 2014.

[40] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: NIPS, 2012, pp. 1106–1114.

[41] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015.

[42] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, G. Toderici, Beyond short snippets: Deep networks for video classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4694–4702.

[43] A. Vedaldi, S. Mahendran, S. Tsogkas, S. Maji, R. Girshick, J. Kannala, E. Rahtu, I. Kokkinos, M. B. Blaschko, D. Weiss, B. Taskar, K. Simonyan, N. Saphra, S. Mohamed, Understanding objects in detail with fine-grained attributes, in: Proc. CVPR, 2014.

[44] Y. Jia, Caffe: An open source convolutional architecture for fast feature embedding, http://caffe.berkeleyvision.org/ (2013).

[45] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167.

[46] T. Saitoh, Z. Zhou, G. Zhao, M. Pietikäinen, Concatenated frame image based CNN for visual speech recognition, in: Asian Conference on Computer Vision, Springer, 2016, pp. 277–289.

[47] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A. Y. Ng, Multimodal deep learning, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 689–696.

[48] T. Stafylakis, G. Tzimiropoulos, Combining residual networks with LSTMs for lipreading, in: Interspeech, 2017.

[49] J. S. Chung, A. Senior, O. Vinyals, A. Zisserman, Lip reading sentences in the wild, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[50] Y. M. Assael, B. Shillingford, S. Whiteson, N. de Freitas, LipNet: Sentence-level lipreading, arXiv:1611.01599.