Abstract
Our aim is to recognise the words being spoken by a talking face, given only
the video but not the audio. Existing works in this area have focussed on trying
to recognise a small number of utterances in controlled environments (e.g. digits
and the alphabet), partly due to the shortage of suitable datasets.
We make three novel contributions: first, we develop a pipeline for fully
automated data collection from TV broadcasts. With this we have generated
a dataset with over a million word instances, spoken by over a thousand dif-
ferent people; second, we develop a two-stream convolutional neural network
that learns a joint embedding between the sound and the mouth motions from
unlabelled data. We apply this network to the tasks of audio-to-video synchroni-
sation and active speaker detection; third, we train convolutional and recurrent
networks that are able to effectively learn and recognize hundreds of words from
this large-scale dataset.
In lip reading and in speaker detection, we demonstrate results that exceed
the current state-of-the-art on public benchmark datasets.
Keywords: lip reading, lip synchronisation, active speaker detection, large
vocabulary, dataset
1. Introduction
ImageNet classification challenge [4].
We take advantage of this ability to recognize temporal signals in an image
time series. In particular, we consider one-second sequences of lip movements
of continuous speech and learn to recognise words within the sequence given
only class-level supervision, without requiring stronger temporal supervision
such as specifying the start and end of the word. Clearly, spatial registration of
the mouth is an important element to consider in the design of the networks.
Typically, the imaged head will move in the video, either due to actual move-
ment of the head or due to camera motion. One approach would be to tightly
register the mouth region (including lips, teeth and tongue, that all contribute
to word recognition), but another is to develop networks that are tolerant to
some degree of motion jitter. We take the latter approach, and do not enforce
tight registration.
We make contributions in three areas: first, we build a pipeline for automated
large scale data collection, including visual and temporal alignment. With this
we are able to obtain training data for hundreds of distinct words, thousands
of instances for each word, and over a thousand speakers (Section 2); second,
we develop a two-stream convolutional neural network SyncNet that learns a
joint embedding between the sound and the mouth motions using cross-modal
self-supervision. We apply this network to the tasks of audio-to-video synchro-
nisation and active speaker detection (Section 3); third, we develop and compare
a number of network architectures for classifying multi-frame time series of lips
(Section 4). In speaker detection and lip reading, our results exceed the state-
of-the-art on public datasets, Columbia [5] and OuluVS2 [6].
As discussed in the related work below, we go far beyond the current state of
the art in three aspects: (i) speaker independence, (ii) learning from continuous
speech, and (iii) lexicon (vocabulary) size. We also exceed the state of the art
in terms of performance, as shown in Section 5 by comparisons on the standard
OuluVS2 benchmark dataset [6].
1.1. Related work
Research on lip reading (a.k.a. visual speech recognition) has a long history.
A thorough survey of shallow (i.e. not deep learning) methods is given in the
recent review [7], and will not be repeated in detail here. Many of the existing
works in this field have followed similar pipelines which first extract spatio-
temporal features around the lips (either motion-based, geometric-feature based
or both), and then align these features with respect to a canonical template. For
example, Pei et al. [8], which holds the state of the art on many datasets, extracts
patch trajectories as spatio-temporal features, and then aligns these features
to reference motion patterns.
A number of recent papers have used deep learning methods to tackle prob-
lems related to lip reading. Koller et al. [9] train an image classifier CNN to
discriminate visemes (mouth shapes, visual equivalent of phonemes) on a sign
language dataset where the signers mouth the words. Similar CNN-based methods
have been used by [10] to predict phonemes in spoken Japanese. In the context
of word recognition, [11] has used deep bottleneck features (DBF) to encode
shallow input features such as LDA and GIF [12]. Similarly, [13] uses DBF to
encode the image for every frame, and trains an LSTM classifier to generate a
word-level classification.
One of the major obstacles to progress in this field has been the lack of
suitable datasets [7]. Table 1 gives a summary of existing datasets. The amount
of available data is far from sufficient to train scalable and representative models
that will be able to generalise beyond the controlled environments and the very
limited domains (e.g. digits and the alphabet).
Word classification with large lexicons has not been attempted in lip reading,
but [23] has tackled a similar problem in the context of text spotting. Their
work shows that it is feasible to train a general and scalable word recognition
model for a large pre-defined dictionary, as a multi-class classification problem.
We take a similar approach.
Of relevance to the architectures and methods developed in this paper are
CNNs for action recognition that learn from multiple-frame image sequences
Name           Env.     Output    I/C   # class   # subj.   Best perf.
AVICAR [14]    In-car   Digits    C     10        100       37.9% [15]
AVLetter [16]  Lab      Alphabet  I     26        10        43.5% [17]
CUAVE [18]     Lab      Digits    I     10        36        83.0% [19]
GRID [20]∗     Lab      Words     C     8.5       34        79.6% [21]
OuluVS1 [17]   Lab      Phrases   I     10        20        89.7% [8]
OuluVS2 [6]    Lab      Phrases   I     10        52        73.5% [22]
LRW            TV       Words     C     500       1000+     -
Table 1: Existing lip reading datasets. I for Isolated (one word, letter or digit per record-
ing); C for Continuous recording. The reported performance is on speaker-independent ex-
periments. (∗ For GRID [20], there are 51 classes in total, but the first word in a phrase is
restricted to 4, the second word 4, etc. 8.5 is the average number of possible classes at each
position in the phrase.)
such as [24, 25, 26], particularly the ways in which they capture spatio-temporal
information in the image sequence using temporal pooling layers and 3D con-
volutional filters.
Figure 1: A sample of speakers in our dataset.
[Figure: data collection pipeline, comprising face detection, face tracking, OCR subtitle extraction, audio-subtitle forced alignment, and alignment verification.]
who appear repeatedly in the videos (e.g. news presenter in BBC News or the
host in the others), but the large majority of participants change every episode
(Figure 1).
Table 2: Video statistics. The yield is the proportion of useful face appearance relative to the
total length of video. A useful face appearance is one that appears continuously for at least
5 seconds, with the face being that of the speaker.
Figure 3: Subtitles on BBC TV. Left: ‘Question Time’, Right: ‘BBC News at One’.
to compute the maximum likelihood alignment between the audio (modelled by
PLP features [32]) and the text. This method of obtaining the alignment has
significant performance benefits over regular speech recognition methods that
do not use prior knowledge of what is being said. The alignment result, however,
is not perfect due to: (1) the method often misses words that are spoken too
quickly; (2) the subtitles are not verbatim; (3) the acoustic model is only trained
to recognise American English. The noisy labels are filtered by double-checking
against the commercial IBM Watson Speech to Text service. In this case, the
only remaining label noise is where an interview is dubbed in the news, which
is rare.
Stage 3. Shot boundary detection, face detection, and tracking. The
shot boundaries are determined to find the within-shot frames for which face
tracking is to be run. This is done by comparing color histograms across consec-
utive frames [33]. The HOG-based face detection method of [34] is performed on
every frame of the video (Figure 4 left). As with most face detection methods,
this results in many false positives and some missed detections. In a similar
manner to [28], all face detections of the same person are grouped across frames
using a KLT tracker [35] (Figure 4 middle). If the track overlaps with face
detections on the majority of frames, it is assumed to be correctly tracking the
face.
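The acceptance test for a track can be implemented as a simple overlap check; the Python sketch below is illustrative only, and the IoU threshold and the exact majority criterion are our assumptions rather than values specified above.

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2).
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter + 1e-9)

def accept_track(track_boxes, detections, iou_thresh=0.5):
    # track_boxes: {frame: box}; detections: {frame: [boxes]} from the face detector.
    hits = sum(any(iou(tb, db) > iou_thresh for db in detections.get(f, []))
               for f, tb in track_boxes.items())
    return hits > 0.5 * len(track_boxes)   # accept if detections agree on most frames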
Figure 4: Left: Face detections; Middle: KLT features and the tracked bounding box (in
yellow); Right: Facial landmarks.
Stage 4. Facial landmark detection and speaker identification. Facial
landmarks are needed to determine the mouth position for speaker/non-speaker
classification. They are determined in every frame of the face track using the
method of [36] (Figure 4 right). The landmarks are used to determine the
mouth region, and to map it to a canonical position as input to the two-stream
network described in Section 3 that is used to determine who is speaking in the
video, and to reject the clip if the face is not speaking in sync. It is important
to determine whether the face shown is actually speaking or not. For example,
there may be a reaction shot or voice-over.
Figure 5: One-second clips that contain the word ‘about’. Top: male speaker, bottom: female
speaker.
[Figure 6 panels: histogram of word lengths (frequency against word length in characters), and word duration distributions for 5-character and 9-character words (frequency against word duration in seconds).]
Figure 6: Word statistics. Regardless of the actual duration of the word, we take a 1-second
clip for training and test.
Stage 5. Compiling the training and test data. The training, validation
and test sets are disjoint in time. The dates of videos corresponding to each
set are shown in Table 3. Note that we leave a week’s gap between the test set
and the rest in case any news footage is repeated. The lexicon is obtained by
selecting the 500 most frequently occurring words between 5 and 10 characters in
length (Figure 6 gives the word duration statistics). This word length is chosen
such that the speech duration does not exceed the fixed one-second bracket that
is used in the recognition architecture, whilst shorter words are not included
because there are too many ambiguities due to homophones (e.g. ‘bad’, ‘bat’,
‘pat’, ‘mat’, etc. are all visually identical), and sentence-level context would be
needed to disambiguate these.
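As an illustration, the lexicon selection reduces to a frequency filter over the aligned transcripts; the function below is a sketch, not the actual pipeline code.

from collections import Counter

def build_lexicon(word_counts, min_len=5, max_len=10, vocab_size=500):
    # word_counts: Counter of word occurrences in the force-aligned training transcripts.
    # Keep words of 5-10 characters, then take the 500 most frequent as the lexicon.
    candidates = Counter({w: c for w, c in word_counts.items()
                          if min_len <= len(w) <= max_len})
    return [w for w, _ in candidates.most_common(vocab_size)]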
These 500 words occur at least 800 times in the training set, and at least
40 times in each of the validation and test sets. For each of the occurrences,
the one-second clip is taken, and the face is cropped with the mouth centered
using the registration found in Stage 4. The words are not isolated, as is the
case in other lip-reading datasets; as a result, there may be co-articulation of
the lips from preceding and subsequent words. The test set is manually checked
for errors.
embedding is then used to identify the active speaker and to correct the lip-sync
error.
No explicit annotations (e.g. word labels, or the precise time offset) are
used to train this network – we only assume that in the majority of television
videos, the audio and the video are usually synced, and we use cross-modal
self-supervision to learn the embedding.
The model consists of two asymmetric streams for audio and video, each of
which is described below.
[Figure 7 diagram: the two-stream architecture, with a 120×120×5 video input and a 13×20×1 audio input.]
The training objective is that the outputs of the audio and the video networks
are similar for genuine pairs, and different for false pairs. Specifically, the
Euclidean distance between the network outputs is minimised for genuine pairs and
maximised for false pairs. We propose to use the contrastive loss (Equation 1),
originally proposed for training Siamese networks [37]. Here, v and a are the fc7
vectors of the video and the audio streams, respectively, and y ∈ {0, 1} is the
binary similarity label for the audio and the video inputs.
E = \frac{1}{2N} \sum_{n=1}^{N} y_n \, d_n^2 + (1 - y_n) \, \max(\mathrm{margin} - d_n, 0)^2 \qquad (1)
d_n = \lVert v_n - a_n \rVert_2 \qquad (2)
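A minimal sketch of Equations 1 and 2 in Python (using PyTorch; the paper does not prescribe a framework, and the margin value here is a placeholder):

import torch.nn.functional as F

def contrastive_loss(v, a, y, margin=1.0):
    # v, a: (N, D) fc7 embeddings of the video and audio streams.
    # y: (N,) binary labels, 1 for genuine (synchronised) pairs, 0 for false pairs.
    d = F.pairwise_distance(v, a)                        # d_n = ||v_n - a_n||_2
    loss = y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)
    return loss.mean() / 2                               # = (1 / 2N) * sum over n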
3.2. Training
The training procedure is inspired by the Siamese network [37]; however, our
network differs in that it consists of two non-identical streams with independent
sets of parameters, taking inputs from two different domains. The network weights
are learnt using stochastic gradient descent with momentum. The parameters
for both streams of the network are learnt simultaneously.
The input audio data are MFCC (mel-frequency cepstral coefficient) values, a representation of the short-
term power spectrum of a sound on a non-linear mel scale of frequency. 13 mel
frequency bands are used at each time step. The features are computed at a
sampling rate of 100Hz, giving 20 time steps for a 0.2-second input signal.
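For illustration, the 13-band MFCC features at 100 Hz can be computed as follows; the python_speech_features package and the file name are our choices, not something specified in the paper.

from python_speech_features import mfcc
from scipy.io import wavfile

rate, signal = wavfile.read('clip.wav')   # hypothetical mono audio clip
# 25 ms analysis windows with a 10 ms step give 100 feature vectors per second,
# with 13 cepstral coefficients per step.
feats = mfcc(signal, samplerate=rate, winlen=0.025, winstep=0.01, numcep=13)
window = feats[:20].T                     # 13 x 20 input for a 0.2-second segment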
[Figure: MFCC input representation, showing the 13 mel-frequency bands (rows) over time.]
The top and bottom three rows of the image are reflected to reduce boundary
effects. Previous work [38] has also attempted to train image-style CNNs on
similar inputs.
Architecture. We use a convolutional neural network inspired by those de-
signed for image recognition. Our layer architecture (Figure 7) is based on
VGG-M [39], but with modified filter sizes to ingest the inputs of unusual di-
mensions of 13 × 20 (13 in the frequency domain, and 20 in the time domain).
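The exact layer dimensions of the audio stream are given in Figure 7 rather than in the text, so the snippet below is only indicative of the idea (a small VGG-M-style stack shrunk to fit a 13×20 input); the filter sizes, channel counts and embedding size are placeholders, not the published configuration.

import torch.nn as nn

# Indicative only: a compact convolutional stack for a 1 x 13 x 20 MFCC input.
audio_stream = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 2)),            # pool along the time axis only
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=(1, 2)),
    nn.Flatten(),
    nn.Linear(128 * 13 * 5, 256),                # fc7-style embedding (size is a placeholder)
)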
3.5. Applications
Figure 9: Mean distance between the audio and the video features for different offset values,
averaged over a clip. The actual offset lies at the trough. The three example clips shown here
are for different scenarios. Left: synchronised AV data; Middle: the audio leads the video
by 4 frames; Right: the audio and the video are uncorrelated.
related in that the correspondence between the video and the accompanying
audio must be established.
Audio-to-video synchronisation. To find the time offset between the audio
and the video, we use a sliding-window approach. For each sample, the distance
is computed between one 5-frame video feature and every audio feature in the ±
1 second range. The correct offset is found where this distance is at a minimum.
Since not every 0.2-second sample contains discriminative information (e.g., the
person might be taking a breath), the distance for every offset value is averaged
across the video clip. Typical distances against offset plots are shown in Figure 9.
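A sketch of this sliding-window search, assuming per-position video and audio embeddings have already been extracted; the function name and the 25 fps frame-rate assumption are ours.

import numpy as np

def estimate_av_offset(video_feats, audio_feats, max_offset=25):
    # video_feats[t], audio_feats[t]: embeddings of the 0.2-second window starting
    # at frame t. At 25 fps, max_offset=25 corresponds to the +/- 1 second range.
    offsets = list(range(-max_offset, max_offset + 1))
    mean_dist = []
    for off in offsets:
        dists = [np.linalg.norm(video_feats[t] - audio_feats[t + off])
                 for t in range(len(video_feats))
                 if 0 <= t + off < len(audio_feats)]
        mean_dist.append(np.mean(dists))              # average over the clip
    return offsets[int(np.argmin(mean_dist))]         # offset at the distance trough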
Active speaker detection. We test our method using the dataset (Figure 10)
and the evaluation protocol of Chakravarty et al. [5]. The objective is to deter-
mine who the speaker is in a multi-subject scene.
Method       [5]                Ours
Window       10       100       10       100
Bell         82.9%    90.3%     93.7%    100%
Bollinger    65.8%    69.0%     83.4%    100%
Lieberman    73.6%    82.4%     86.8%    100%
Long         86.9%    96.0%     97.7%    99.8%
Sick         81.8%    89.3%     86.1%    99.8%
Table 4: F1 -scores on the Columbia speaker detection dataset. The results of [5] have been
digitised from Figure 3b of their paper, and are accurate to around ±0.5%.
The task for the network is to predict which words are being spoken, given a
video of a talking face. The input format to the network is a sequence of mouth
regions, as shown in Figure 5. Previous attempts at visual speech recognition
have relied on very precise localisation of the facial landmarks (the mouth in
particular); our aim is to learn from noisier data, and to tolerate some
localisation irregularities both in position and in time.
4.1. Architectures
We cast the problem as one of multi-way classification, and so base our ar-
chitecture on ones designed for image classification [40, 39, 41]. In particular,
we build on the VGG-M model [39] since this has a good classification perfor-
mance, but is much faster to train and experiment on than deeper models, such
as VGG-16 [41]. We develop and compare models that differ principally in how
they ‘ingest’ the T input frames (where T = 25 for a 1-second interval).
These variations take inspiration from previous work on human action classi-
fication [24, 25, 42, 26]. Apart from these differences, the architectures share
the configuration of VGG-M, and this allows us to directly compare the perfor-
mance across different input designs. These configurations are closely related
to the visual stream of SyncNet (Section 3).
We next describe the five architectures, summarised in Figure 11, followed
by a discussion of their differences. Their performance is compared in Section 5.
The numbers in the names refer to the number of temporal frames ingested by
each tower, and EF, MT, LF and LSTM indicate where the fusion occurs.
[Figure 11 diagram: the VGG-M base uses conv1 7×7×96 (stride 2) with pool1 3×3 (stride 2); the lip-reading variants use per-tower conv1 3×3×48 or 3×3×96 (stride 1) with pool1 3×3 (stride 2), each ending in a softmax layer.]
Figure 11: Left: VGG-M architecture that is used as a base. Right: Network architectures
for lip reading.
Early Fusion (EF-25). The network ingests a 25-channel image, where each
of the channels encode an individual frame in greyscale. The layer structure for
the subsequent layers is identical to that of the regular VGG-M network. This
method is related to the Early Fusion model in [25], which takes colour images
and uses a T×3-channel convolutional filter at conv1. We did experiment with
25×3-channel colour input, but found that the increased number of parame-
ters at conv1 made training difficult due to overfitting (resulting in validation
performance that is around 5% weaker; not quoted in Section 5).
Multiple Towers (MT-1). There are T= 25 towers with common conv1 layers
(with shared weights), each of which takes one input frame. The activations
from the towers are concatenated channel-wise after pool1, producing an output
activation with 1200 channels. The subsequent 1×1 convolution is performed
to reduce this dimension, to keep the number of parameters at conv2 at a
manageable level. The rest of the network is the same as the regular VGG-M.
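A sketch of the MT-1 front end in PyTorch (the framework is our choice; the 3×3×48 conv1 and the 1200-channel concatenation follow the description above, while the pool1 stride and the output size of the 1×1 convolution are placeholders):

import torch
import torch.nn as nn

class MT1Frontend(nn.Module):
    # Shared conv1/pool1 applied to each of the 25 greyscale frames, channel-wise
    # concatenation (25 x 48 = 1200 channels), then a 1x1 convolution to reduce
    # the dimension before the VGG-M conv2 stack.
    def __init__(self, reduced_channels=96):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 48, kernel_size=3, stride=1, padding=1)
        self.pool1 = nn.MaxPool2d(kernel_size=3, stride=2)
        self.reduce = nn.Conv2d(25 * 48, reduced_channels, kernel_size=1)

    def forward(self, x):                      # x: (batch, 25, 111, 111)
        towers = [self.pool1(torch.relu(self.conv1(x[:, t:t + 1])))
                  for t in range(x.size(1))]   # each tower: (batch, 48, 55, 55)
        fused = torch.cat(towers, dim=1)       # (batch, 1200, 55, 55)
        return self.reduce(fused)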
Multiple Towers (MT-5). There are 21 towers with common conv1 layers,
each of which takes a 5-frame window, moving one frame at a time. The
subsequent layers are configured in the same way as MT-1.
Late Fusion (LF-5). Like MT-5, the 21 towers each take 5-frame windows,
with a stride of 1. However, each tower in this variant has common conv1 to fc6
layers with shared weights, after which the activations are concatenated. The
subsequent layer structure is the same as EF-25, MT-1 and MT-5.
Long Short-Term Memory (LSTM-5). Each convolutional tower shares
the layer configuration of the LF-5 model. The two-layer LSTM ingests the
visual features (fc6 activations) of the 5-frame sliding window, moving one frame
at a time, and returns the classification result at the end of the sequence.
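A sketch of the LSTM-5 head, assuming the per-window fc6 activations have already been computed by the shared convolutional tower; the hidden size is a placeholder, and the 500-way output matches the lexicon above.

import torch.nn as nn

class LSTM5Head(nn.Module):
    # Two-layer LSTM over the sequence of fc6 activations (one per 5-frame window);
    # the word classification is read off the final time step.
    def __init__(self, feat_dim=4096, hidden=256, num_words=500):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, num_words)

    def forward(self, fc6_seq):                # fc6_seq: (batch, 21, feat_dim) for a 25-frame clip
        out, _ = self.lstm(fc6_seq)
        return self.classifier(out[:, -1])     # logits over the 500-word lexicon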
Discussion. The early fusion architecture EF-25 shares similarities with pre-
vious work on human action recognition using CNNs [24, 25, 42] in that registra-
tion between frames is assumed. The models perform time-domain operations
beginning from the first layer to precisely capture local motion direction and
speed [25]. For these methods to capture useful information, good registration
of details between frames is critical. However, we are not imposing strict regis-
tration, and in any case it goes slightly against the signal (lip motion and mouth
region deformation) that we are trying to capture.
In contrast, the MT-1 model delays all time-domain registrations (and op-
erations) until after the first set of convolutional and pooling layers. This gives
tolerance against minor registration errors (the receptive field size at conv2 is
11 pixels). Note that the common conv1 layers of the multiple towers ensure that
the same filter weights are used for all frames, whereas in the early fusion ar-
chitecture EF-25 it is possible to learn different weights for each frame.
The MT-5 model shares similarities with both EF-25 and MT-1 models –
the 5-frame input to each tower allows the network to learn some local motion
information, while remaining more tolerant than EF-25 to movement over
the whole time period.
The LF-5 model also shares many characteristics of MT-5, but delays time-
domain operations until after all of the convolutional layers, except within the
5 neighbouring frames between which the movement would be negligible.
Likewise, the LSTM-5 delays time-domain operations, and in addition,
this model benefits from the ability to accept sequences of variable lengths,
unlike the other models.
One other design choice is the size of the input images. This was chosen as
111×111 pixels, which is smaller than that typically used in image classification
networks. The reason is that the cropped mouth images are rarely larger than
111×111 pixels; this smaller input means that smaller filters can be used at
conv1 than those used in VGG-M without sacrificing the receptive field, while
avoiding learning unnecessary parameters.
4.2. Training
log scale.
5. Experiments
Evaluation protocol. The models are evaluated on the independent test set
(Section 2). We report top-1 and top-10 accuracies, as well as recall against
rank curves. Here, the ‘Recall@K’ is the proportion of times that the correct
class is found in the top-K predictions for the word. The experiments were per-
formed under two different conditions: ‘continuous’ where the input sequences
also contain co-articulation from the neighbouring words within the one-second
window, and ‘isolated’ where the words are segmented according to the forced
alignment output, and thus can last less than one-second.
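For concreteness, ‘Recall@K’ can be computed from the ranked class scores as follows (a minimal sketch):

import numpy as np

def recall_at_k(scores, labels, k):
    # scores: (N, num_classes) classifier outputs; labels: (N,) ground-truth class ids.
    # Returns the fraction of samples whose true class appears in the top-k predictions.
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))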
Results. The results are shown in Table 5. The experimental results show that
the registration-tolerant models give a modest improvement over EF-25, and
the performance improvement is likely to be more significant where the tracking
quality is less ideal. Having 5 frames as input seems to achieve a good balance
for registration, in that it is able to compute useful temporal derivatives (MT-5
is better than MT-1 where no temporal derivatives are computed) which re-
quires local (in time) registration, but does not require the global registration
of EF-25 (which is inferior to both MT models). The LSTM-5 shows stronger
performance compared to the CNN-based models. For all models, the perfor-
mance is slightly better under the ‘isolated’ conditions since there are fewer
ambiguities due to co-articulation.
The top-10 accuracies for the best models are over 95%, despite the relatively
modest top-1 figure of around 70%. This is a result of ambiguities in lip reading,
which we will discuss next.
Net       LRW (Con.)           LRW (Iso.)
          R@1       R@10       R@1       R@10
EF-25     57.0%     88.8%      62.5%     92.6%
MT-1      61.1%     90.4%      64.2%     94.2%
MT-5      66.8%     94.6%      69.0%     95.6%
LF-5      65.4%     93.3%      68.2%     94.8%
LSTM-5    66.0%     94.3%      71.5%     96.4%

Net       OuluVS2 R@1
[22]      73.5%
[46]      85.6%
MT-1      93.2%
MT-5      93.2%
LSTM-5    94.1%
Table 5: Word classification accuracy. Left: On the LRW dataset for the different architec-
tures. Right: On OuluVS2 (short phrases, frontal view). Con. (continuous): the input
sequences also contain co-articulation from the neighbouring words within the one-second
window; Iso. (isolated): the words are segmented according to the forced alignment output,
and thus can last less than one-second.
Table 6: Most frequently confused word pairs for the ‘continuous’ experiment. The numbers
refer to class confusions.
either (i) a plural of the original word (e.g. ‘report’ and ‘reports’) which is
ambiguous because one word is a subset of the other, and the words are not
isolated so this can be due to co-articulation; or (ii) a known homophone visual
ambiguity (explained in Section 1) where the words cannot be distinguished
using visual information alone (e.g. ‘billion’ and ‘million’, ‘worse’ and ‘worst’).
Such errors are phonetically understandable. For example, some of the most
common confusions, e.g. ‘groups’ (G R UW P S) and ‘troops’ (T R UW P S), or
‘ground’ (G R AW N D) and ‘around’ (ER AW N D), actually share most of the
phonemes.
Apart from these difficulties, the failure cases typically involve extreme samples:
for example, strong international accents, or poor-quality, low-bandwidth location
reports and Skype interviews, where there are motion compression artifacts or
frames dropped from the transmission.
It is worth noting that the top-1 classification accuracy of over 70%, shown in
Table 5, is comparable to that of many of the recent works [13, 15, 47] performed
on lexicon sizes that are orders of magnitude smaller (Table 1).
Figure 12: Original video frames for ‘hello’ on OuluVS. Compare these to our original input
frames in Figure 3.
OuluVS2. We evaluate our method on the OuluVS2 dataset [6]. The dataset
consists of 52 subjects uttering 10 phrases (e.g. ‘thank you’, ‘hello’, etc.), and has
been widely used in previous works. Here, we assess on a speaker-independent
experiment, where some of the subjects are reserved for testing.
To apply our method on this dataset, we pre-train the convolutional layers
on the BBC data, and re-train the fully-connected layers from scratch. Training
from scratch on OuluVS2 underperforms as the size of this dataset is insufficient
to train a deep network. For all models apart from LSTM-5, we simply repeat
the first and the last frames to fill the 1-second clip if the phrase is shorter than
25 frames. If the clip is longer, we take a random crop.
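A sketch of the padding / random-cropping step described above; how the padding is split between the start and the end of the clip is our assumption.

import random

def fit_to_clip(frames, clip_len=25):
    # frames: list of mouth-region images for one OuluVS2 phrase.
    if len(frames) < clip_len:
        pad = clip_len - len(frames)
        front = pad // 2                       # assumption: pad roughly equally at both ends
        back = pad - front
        return [frames[0]] * front + frames + [frames[-1]] * back
    start = random.randint(0, len(frames) - clip_len)
    return frames[start:start + clip_len]      # random 25-frame crop of longer clips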
As can be seen in Table 5, the method achieves strong performance and sets
a new state of the art. Note that, without retraining the convolutional part
of the network, we achieve these strong results on videos that are very different
to ours in terms of lighting, background, camera perspective, etc. (Figure 12),
which shows that the model generalises well across different formats.
We have shown that CNN and LSTM architectures can be used to classify
temporal lip motion sequences of words with excellent results. We also demon-
strated a recognition performance that exceeds the state of the art on a standard
public benchmark dataset, OuluVS2.
Extensions could include lip reading of profile views, and varying the ar-
chitecture (in terms of depth, 3D CNNs, etc.) to improve performance – there is
already evidence of the benefits of deeper architectures [48] on our
released dataset. It is worth noting that recent papers have combined CNNs
with sequence models in order to recognize sentences rather than individual
words [49, 50].
The dataset is available for download at http://www.robots.ox.ac.uk/~vgg/data/lip_reading/
and the trained SyncNet is available at http://www.robots.ox.ac.uk/~vgg/software/lipsync/.
Acknowledgements.
Funding for this research is provided by the EPSRC Programme Grant See-
bibyte EP/M013774/1. We are very grateful to Rob Cooper and Matt Haynes
at BBC Research for help in obtaining the dataset.
References
[1] H. McGurk, J. MacDonald, Hearing lips and seeing voices, Nature 264
(1976) 746–748.
[8] Y. Pei, T.-K. Kim, H. Zha, Unsupervised random forest manifold alignment
for lipreading, in: Proceedings of the IEEE International Conference on
Computer Vision, 2013, pp. 129–136.
[9] O. Koller, H. Ney, R. Bowden, Deep learning of mouth shapes for sign lan-
guage, in: Proceedings of the IEEE International Conference on Computer
Vision Workshops, 2015, pp. 85–91.
[10] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, T. Ogata, Lipreading
using convolutional neural network, in: INTERSPEECH, 2014, pp. 1149–
1153.
in: Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE Inter-
national Conference on, Vol. 2, IEEE, 2002, pp. II–2017.
[21] M. Wand, J. Koutník, et al., Lipreading with long short-term memory, in:
2016 IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), IEEE, 2016, pp. 6115–6119.
[24] S. Ji, W. Xu, M. Yang, K. Yu, 3D convolutional neural networks for human
action recognition, IEEE PAMI 35 (1) (2013) 221–231.
[27] P. Buehler, M. Everingham, A. Zisserman, Learning sign language by
watching TV (using weakly aligned subtitles), in: Proc. CVPR, 2009.
[35] C. Tomasi, T. Kanade, Selecting and tracking features for image sequence
analysis, Robotics and Automation.
[37] S. Chopra, R. Hadsell, Y. LeCun, Learning a similarity metric discrimina-
tively, with application to face verification, in: Proc. CVPR, Vol. 1, IEEE,
2005, pp. 539–546.
[44] Y. Jia, Caffe: An open source convolutional architecture for fast feature
embedding, http://caffe.berkeleyvision.org/ (2013).
[47] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A. Y. Ng, Multimodal deep
learning, in: Proceedings of the 28th international conference on machine
learning (ICML-11), 2011, pp. 689–696.