Abstract

Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with the visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence the progress was hindered by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset LRS3, our approach outperforms the prior state-of-the-art by ∼50% (28.0% vs. 14.1%) using less than 10% of labeled data (433hr vs. 30hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.¹

Index Terms: audio-visual speech recognition, self-supervised learning, representation learning, robust speech recognition

1. Introduction

With the recent development of supervised neural models [1, 2], the performance of automatic speech recognition (ASR) systems has improved significantly, achieving human parity [3] or even outperforming humans on several clean speech benchmarks [4, 5]. However, ASR systems are vulnerable to noise and may degrade drastically when speech recordings are corrupted with noise [6]. To make ASR more reliable in various scenarios, research on noise robustness [7–9] has received increasing attention in recent years.

An active research direction on noise robustness combines the audio and visual streams of the speaker to utilize the noise-invariant lip movement information. Audio-visual speech recognition (AVSR) models, which combine these two modalities, are therefore less susceptible to acoustic noise. Although AVSR appeared more than 60 years ago [13], recent developments in novel model architectures [14, 15] and large-scale data collection [14, 16] have brought AVSR performance to new heights. Nonetheless, while modern neural architectures are hungry for large training data, existing AVSR research efforts are fully supervised, requiring costly labeled data. This limitation hinders the application of modern AVSR systems in low-resource settings, which is the case for most of the ∼7,000 spoken languages [17].

This paper presents a self-supervised framework for robust AVSR, which is based upon the recently introduced Audio-Visual HuBERT (AV-HuBERT) pre-training approach [18]. First, large quantities of unlabeled audio-visual speech data are used to pre-train our model to capture the nuanced correlations between sounds and associated lip movements; then only a tiny amount of transcribed audio-visual speech data is used for fine-tuning the model for the best AVSR performance. The efficacy of our framework is demonstrated on low-resource (30h) and mid-resource (433h) setups, showing WER reductions of up to 50% compared to previous SOTA models. Furthermore, we investigate the robustness of the proposed approach and of audio-only systems against different types of noise, which have not been studied in prior work but are essential for practical applications. For example, an AVSR system deployed in meeting scenarios is subject to babble noise, while one used in a home environment naturally encounters music, cooking, or vacuum machine noise.

∗ Work done at Meta AI
¹ Our code and models are available at https://github.com/facebookresearch/av_hubert

2. Method

In this section, we present our methodology for audio-visual speech recognition. First, we introduce the Audio-Visual HuBERT (AV-HuBERT) pre-training approach, which we use for unsupervised learning of joint representations over audio and visual streams. We then describe how we adopt AV-HuBERT for robust audio-visual speech recognition.

2.1. AV-HuBERT for Audio-Visual Speech Recognition

Figure 1: AV-HuBERT for audio-visual speech recognition. X: mask; blue waveform: original audio; orange waveform: noise; Cn: audio-visual clusters. Dashed box: the pre-trained part. (The figure depicts the pre-training stage (masked prediction of clusters) and the fine-tuning stage, with ResNet/FFN front-ends feeding an audio-visual fusion module.)
AV-HuBERT [18] is a self-supervised approach for learning joint speech representations from audio and lip-movement information in video recordings, which extends the HuBERT [19] speech representation learning framework to multimodal inputs. AV-HuBERT consumes frame-level synchronous audio and video streams as input to produce contextualized audio-visual representations for each frame. AV-HuBERT pre-training iterates over two steps: feature clustering and masked prediction. Feature clustering creates discrete frame-level targets for the subsequent masked prediction step. Audio-based mel-frequency cepstral coefficient (MFCC) features are always used for cluster generation in the first iteration. For multi-iteration pre-training, the learned audio-visual features extracted from the latest AV-HuBERT transformer network are used for cluster generation in all subsequent iterations. Inspired by BERT pre-training, widely used for text data [20], and by deep clustering for visual data [21], the masked prediction loss drives training of the AV-HuBERT model by predicting the cluster assignments of the masked frames given input streams corrupted by randomly masked segments. To finetune the model for a downstream task, the cluster prediction head of the pre-trained model is removed. Depending on the desired architecture of the final model, either a linear layer is added for an encoder-only model, or a randomly initialized decoder module with cross-attention over the pre-trained encoder is used for a sequence-to-sequence model. Some or all layers may be updated during finetuning.
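To make the two steps concrete, the following Python sketch illustrates frame-level cluster-target generation and the masked cluster-prediction loss. It is a minimal sketch under stated assumptions, not the released implementation: the model's call signature, per-frame (rather than span-level) masking, and the masking probability are illustrative.

# Minimal sketch of AV-HuBERT-style pre-training targets and loss.
# Assumptions: `features` is a (T, D) array of frame-level features (MFCCs in
# the first iteration, learned audio-visual features later); `model` returns
# per-frame logits of shape (T, n_clusters) and accepts a boolean frame mask.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def make_cluster_targets(features, n_clusters=100):
    # Step 1 (feature clustering): assign every frame a discrete cluster id.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    return torch.as_tensor(km.labels_, dtype=torch.long)  # shape (T,)

def masked_prediction_loss(model, audio, video, targets, mask_prob=0.3):
    # Step 2 (masked prediction): corrupt random frames (span masking is
    # simplified to per-frame masking here) and predict their cluster ids.
    num_frames = targets.shape[0]
    frame_mask = torch.rand(num_frames) < mask_prob
    logits = model(audio, video, frame_mask=frame_mask)   # (T, n_clusters)
    # The loss is computed only over the masked frames.
    return F.cross_entropy(logits[frame_mask], targets[frame_mask])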
Unlike the prior work in [18], which utilizes the pre-trained AV-HuBERT encoder for unimodal downstream scenarios such as lip-reading and ASR, this paper examines the effectiveness of the learned multimodal representations of AV-HuBERT for the multimodal audio-visual speech recognition (AVSR) task, which aims to transcribe speech videos using both the audio and visual streams. Given a pre-trained AV-HuBERT model, we keep both its audio and video frontends during finetuning. We use a sequence-to-sequence model for AVSR, where AV-HuBERT serves as the encoder module. In contrast to pre-training, we do not apply input masking or modality dropout during finetuning. Also, we freeze the pre-trained AV-HuBERT encoder for a certain number of training steps, after which we update all model weights.
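The freeze-then-unfreeze schedule can be sketched as follows; the attribute name model.encoder, the batch format, and the step count are assumptions for illustration, not the actual training code.

# Sketch of the finetuning schedule: the pre-trained AV-HuBERT encoder is kept
# frozen for the first `freeze_steps` updates (only the decoder is trained),
# after which all model weights are updated.
def set_requires_grad(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def finetune(model, batches, optimizer, freeze_steps=10_000):
    set_requires_grad(model.encoder, False)            # freeze the encoder first
    for step, batch in enumerate(batches):
        if step == freeze_steps:
            set_requires_grad(model.encoder, True)     # then update all weights
        loss = model(**batch)                          # seq2seq loss, assumed interface
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()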
2.2. Noise-Augmented AV-HuBERT

AVSR systems leverage the visual modality under noisy conditions [14]; however, a recognizer trained on clean conditions may rely overly on the audio stream, since a model can predict with audio more effortlessly, thus failing to leverage the visual information in adverse auditory conditions at test time. A typical solution adopted by prior work is noise-augmented supervised training [14, 15], which adds noise sampled from a separate noise dataset to the clean audio at a fixed or sampled signal-to-noise ratio (SNR). We adopt this strategy during the finetuning stage and refer to it as noise-augmented finetuning to emphasize the stage at which noise is employed.
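Concretely, mixing a noise clip into clean speech at a target SNR only requires rescaling the noise according to the definition SNR_dB = 10 log10(P_speech / P_noise). Below is a minimal sketch of such augmentation, assuming 1-D waveforms at the same sample rate; it is an illustration, not the paper's exact code.

# Sketch of noise-augmented training data creation: add a noise clip to clean
# speech at a target signal-to-noise ratio (SNR).
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Loop or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise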
To further boost our model's robustness to acoustic noise, we extend noise augmentation to AV-HuBERT pre-training by randomly adding different types of noise to the audio input, making AV-HuBERT more suitable for AVSR applications. We refer to this as noise-augmented pre-training. Incorporating noise in the pre-training phase benefits the model by closing the domain gap between pre-training, finetuning, and testing. We still use the cluster assignments inferred from clean audio-visual speech, because phonetic information, which is highly correlated with the clusters, should be invariant to noise.
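In other words, only the input audio is corrupted while the cluster targets remain those inferred from clean data. Combining the hypothetical helpers from the earlier sketches, a pre-training step would look roughly as follows.

# Sketch of a noise-augmented pre-training step: the input audio is corrupted,
# but the cluster targets come from clean audio-visual features (in practice
# the assignments are precomputed once per training iteration).
def noisy_pretraining_step(model, audio, video, clean_features, noise, snr_db=0):
    targets = make_cluster_targets(clean_features)   # targets stay clean
    noisy_audio = mix_at_snr(audio, noise, snr_db)   # only the input is corrupted
    return masked_prediction_loss(model, noisy_audio, video, targets)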
A concurrent work, WavLM [22], proposes utterance mixing, a similar technique to ours but applied to audio-only speech representation learning. Utterance mixing augments the input audio by randomly sampling speech utterances from the same minibatch. We use more diverse noise sources in our noise-augmented pre-training, including both speech and non-speech noise, e.g., ambient and babble noise. Additionally, since WavLM targets audio-only self-supervised learning, the overlap between the secondary and the primary utterances needs to be less than 50% to signify which utterance in the mixture is the main one. Our approach is unconstrained and more flexible in mixing noise, because the accompanying visual stream disambiguates the primary and secondary utterances.

3. Experiments

3.1. Data and Experimental Setup

Our experiments are conducted on LRS3 [16], which contains around 433 hours of audio-visual speech from over 5,000 speakers and is the largest publicly available labeled audio-visual speech recognition dataset. VoxCeleb2 [23], a large-scale audio-visual speech dataset initially proposed for the speaker recognition task, is used for our self-supervised pre-training. VoxCeleb2 has around 2,442 hours of video from over 6,000 speakers and contains utterances in multiple languages. We follow the pre-processing steps in [18] to select the "English" portion, which amounts to 1,326 hours of video.

We augment input samples using multiple noise categories. The total duration of noise in each category is shown in Table 1. The noise audio clips in the "natural", "music" and "babble" categories are sampled from the MUSAN dataset [24], while the overlapping "speech" noise samples are drawn from LRS3. In creating the "speech" and "babble" noise sets, we ensured that there is no speaker overlap among the different partitions.

Table 1: Total duration in hours of noise samples in different categories

Partition     natural   music   babble   speech
train             6       35      20       50
validation        1        4       2        6
test              1        4       2        6

We follow the protocol of [18] to create two settings for finetuning the model: a low-resource setting using 30h of labeled videos and a mid-resource setting using 433h of labels. Unless otherwise specified, we use AV-HuBERT LARGE as the default model architecture for all our experiments. The model has 24 transformer blocks, where each block has 16 attention heads and 1024/4096 embedding/feedforward dimensions. We add a 9-layer randomly initialized transformer decoder with similar embedding/feedforward dimensions during finetuning.
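For reference, the architecture described above can be summarized as a small configuration object; the field names are illustrative and not the actual configuration keys of the codebase.

# Model dimensions as stated in the text (field names are illustrative).
from dataclasses import dataclass

@dataclass
class AVHubertAVSRConfig:
    encoder_layers: int = 24      # transformer blocks in AV-HuBERT LARGE
    attention_heads: int = 16
    embed_dim: int = 1024
    ffn_dim: int = 4096
    decoder_layers: int = 9       # randomly initialized decoder added for finetuning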
During training, we first select one noise category and sample a noise audio clip from its training partition. We randomly mix the sampled noise at 0 dB SNR with a probability of 0.25, following [15]. At test time, we evaluate the model separately for each noise type. The testing noise clips are added at five SNR levels: {−10, −5, 0, 5, 10} dB. The performance on the original clean test set is also reported for comparison. By default, noise clips are added during both pre-training and finetuning. We follow the training pipeline in [18], where the model is trained for five iterations in total. To save computation time, we always use the smaller BASE model architecture [18] in all iterations except the last one, where we use a LARGE model. Video samples are batched so as not to exceed 1,000 image frames per GPU. The model is pre-trained for 600K steps using 64 V100 GPUs and finetuned for 30K/100K steps in the 30h/433h settings, respectively.
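A compact sketch of this augmentation and evaluation protocol is given below; the constants come from the text, while sample_noise_clip and the function names are hypothetical.

# Sketch of the noise protocol: during training, noise from a randomly chosen
# category is mixed in at 0 dB with probability 0.25; at test time, each noise
# type is evaluated at five SNR levels.
import random

NOISE_CATEGORIES = ["natural", "music", "babble", "speech"]
TRAIN_SNR_DB = 0
TRAIN_NOISE_PROB = 0.25
TEST_SNRS_DB = [-10, -5, 0, 5, 10]

def augment_training_utterance(speech, sample_noise_clip):
    # `sample_noise_clip(category, split)` is an assumed helper returning a
    # noise waveform from the corresponding partition of Table 1.
    if random.random() < TRAIN_NOISE_PROB:
        category = random.choice(NOISE_CATEGORIES)
        noise = sample_noise_clip(category, split="train")
        speech = mix_at_snr(speech, noise, TRAIN_SNR_DB)  # from the earlier sketch
    return speech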
Table 3: Comparison among models with different pre-training configurations and input modalities. C: clean audio, N: noisy audio. The N-WER is averaged over 4 noise types and 5 SNRs.

Model Size   PT Type   FT Data   Audio-only        Audio-visual
                                 C-WER   N-WER     C-WER   N-WER
(a)  LARGE   None      30h       20.6    59.2      20.8    42.9
(b)  LARGE   Clean     30h        4.3    39.8       3.3     9.3
(c)  LARGE   Noisy     30h        3.8    28.7       3.3     7.8
(d)  LARGE   None      433h       4.7    39.2       3.5    14.8
(e)  LARGE   Clean     433h       1.5    29.1       1.4     6.9
(f)  LARGE   Noisy     433h       1.6    25.8       1.4     5.8

Figure 2: Comparison of models using different inputs and pre-training methods.

3.3.3. Effect of noise-augmented pre-training

The impact of incorporating noise in pre-training is presented in the (AV, PT=Noisy) vs. (AV, PT=Clean) bars in Figure 2 and the (Noisy vs. Clean) rows in Table 3 (i.e., (b) vs. (c) and (e) vs. (f)).

Overall, noise-augmented pre-training improves results in noisy settings. The WER is reduced by 16.1% (9.3%→7.8%) / 15.9% (6.9%→5.8%) on average in the low-resource (30h) and mid-resource (433h) settings compared to pre-training on clean data. Compared to an audio-visual model trained from scratch, the noise-augmented pre-training approach reduces recognition error by 81.8% (42.9%→7.8%) / 60.8% (14.8%→5.8%) in the low/mid-resource settings, and is key to achieving the best AVSR performance in adverse acoustic conditions. Compared to training from scratch, the gain of noise-augmented pre-training peaks around 0 dB SNR, which matches the SNR used in training.
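The relative reductions quoted above follow the usual definition (WER_before − WER_after) / WER_before; the numbers can be reproduced with a few lines.

# Relative WER reduction used for the figures quoted in this section.
def relative_reduction(before, after):
    return 100 * (before - after) / before

print(round(relative_reduction(9.3, 7.8), 1))    # 16.1  (30h: clean PT -> noisy PT)
print(round(relative_reduction(6.9, 5.8), 1))    # 15.9  (433h: clean PT -> noisy PT)
print(round(relative_reduction(42.9, 7.8), 1))   # 81.8  (30h: from scratch -> noisy PT)
print(round(relative_reduction(14.8, 5.8), 1))   # 60.8  (433h: from scratch -> noisy PT)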
Compared to the "babble", "music" and "natural" noise types, noise-augmented AV-HuBERT pre-training is more effective for overlapping "speech" noise, as shown in Figure 2. Consistent with previous findings when comparing audio-visual and audio-only recognizers trained from scratch, the visual modality provides a strong clue for "choosing" the target speech track. AV-HuBERT effectively learns visual representations from paired audio-visual data, leading to significant gains in speech separation.

The gains from noise-augmented pre-training generalize across model architectures, as shown in Table 4. In addition, regardless of the finetuning strategy, our proposed pre-training approach is helpful, as shown by row (b) vs. row (c) and row (e) vs. row (f) in N-WER in Table 4.

Table 4: Comparison among BASE models with different pre-training configurations and input modalities. The model is finetuned with 30h of labeled data. C: clean audio, N: noisy audio. The N-WER is averaged over 4 noise types and 5 SNRs.

4. Conclusion

This paper presented a new state-of-the-art audio-visual speech recognition (AVSR) model based on the AV-HuBERT approach for multimodal speech representation learning. To our knowledge, this is the first attempt towards building an AVSR model using a large volume of unlabeled audio-visual speech data. Our audio-visual speech recognizer achieves high recognition accuracy and is robust to different noise categories even with a few hours of labeled data. With less than 10% of labeled data, our model outperforms the prior SOTA by ∼50%. Our future work includes applying audio-visual speech recognition to real-world low-resource and multilingual settings.
6. Appendix
6.1. Full results
Tables 5 and 6 show the WERs of the LARGE and BASE AV-HuBERT models under various noise types and SNR levels.
Table 5: Test WER (%) of LARGE AV-HuBERT under different levels and types of noise. Lower is better. B: babble, S: speech, M: music,
N: natural noise.