Abstract

Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with the visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence the progress was hindered by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset LRS3, our approach outperforms the prior state-of-the-art by ∼50% (28.0% vs. 14.1%) using less than 10% of labeled data (433hr vs. 30hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.¹

Index Terms: audio-visual speech recognition, self-supervised learning, representation learning, robust speech recognition

1. Introduction

With the recent development of supervised neural models [1, 2], the performance of automatic speech recognition (ASR) systems has improved significantly, achieving human parity [3] or even outperforming humans on several clean speech benchmarks [4, 5]. However, ASR systems are vulnerable to noise and may degrade drastically when speech recordings are corrupted with noise [6]. To make ASR more reliable in various scenarios, research on noise robustness [7–9] has received increasing attention in recent years.

An active research direction on noise robustness combines the audio and visual streams of the speaker to utilize the noise-invariant lip movement information. Audio-visual speech recognition (AVSR) models, which combine these two modalities, are therefore less susceptible to acoustic noise. Although AVSR appeared more than 60 years ago [13], recent developments in novel model architectures [14, 15] and large-scale data collection [14, 16] have brought AVSR performance to new heights. Nonetheless, while modern neural architectures are hungry for large training data, existing AVSR research efforts are fully supervised, requiring costly labeled data. This limitation hinders the application of modern AVSR systems in low-resource settings, which is the case for most of the ∼7,000 spoken languages [17].

This paper presents a self-supervised framework for robust AVSR, which is based upon the recently introduced Audio-Visual HuBERT (AV-HuBERT) pre-training approach [18]. First, large quantities of unlabeled audio-visual speech data are used to pre-train our model to capture the nuanced correlations between sounds and associated lip movements; then only a tiny amount of transcribed audio-visual speech data is used for fine-tuning the model for the best AVSR performance. The efficacy of our framework is demonstrated on low-resource (30h) and mid-resource (433h) setups, showing WER reductions of up to 50% compared to previous SOTA models. Furthermore, we investigate the robustness of the proposed approach and of audio-only systems against different types of noise, which have not been studied in prior work but are essential for practical applications. For example, an AVSR system deployed in meeting scenarios is subject to babble noise, while one used in a home environment naturally encounters music, cooking, or vacuum machine noise.

∗ Work done at Meta AI
¹ Our code and models are available at https://github.com/facebookresearch/av_hubert

2. Method

In this section, we present our methodology for audio-visual speech recognition. First, we introduce the Audio-Visual HuBERT (AV-HuBERT) pre-training approach, which we use for unsupervised learning of joint representations over audio and visual streams. We then describe how we adopt AV-HuBERT for robust audio-visual speech recognition.

2.1. AV-HuBERT for Audio-Visual Speech Recognition

Figure 1: AV-HuBERT for audio-visual speech recognition. X: mask; blue waveform: original audio; orange waveform: noise; Cn: audio-visual clusters. Dashed box: the pre-trained part. (The figure depicts the pre-training stage (masked prediction of clusters) and the fine-tuning stage, with ResNet/FFN front-ends feeding an audio-visual fusion module.)
AV-HuBERT [18] is a self-supervised approach for learning joint speech representations from audio and lip-movement information in video recordings, which extends the HuBERT [19] speech representation learning framework to multimodal inputs. AV-HuBERT consumes frame-level synchronous audio and video streams as input to produce contextualized audio-visual representations for each frame. AV-HuBERT pre-training iterates over two steps: feature clustering and masked prediction. Feature clustering creates discrete frame-level targets for the subsequent masked prediction step. Audio-based mel-frequency cepstral coefficient (MFCC) features are always used for cluster generation in the first iteration. For multi-iteration pre-training, the learned audio-visual features extracted from the latest AV-HuBERT transformer network are used for cluster generation in all subsequent iterations. Inspired by BERT pre-training, widely used for text data [20], and by deep clustering for visual data [21], the masked prediction loss drives training of the AV-HuBERT model by predicting the cluster assignments of the masked frames given input streams corrupted by randomly masked segments. To finetune the model for a downstream task, the cluster prediction head of the pre-trained model is removed. Depending on the desired architecture of the final model, either a linear layer is added for an encoder-only model, or a randomly initialized decoder module with cross-attention over the pre-trained encoder is used for a sequence-to-sequence model. Some or all layers may be updated during finetuning.
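To make the two steps concrete, the following Python sketch illustrates frame-level cluster-target generation and the masked cluster-prediction loss. It is a minimal sketch under stated assumptions, not the released implementation: the model's call signature, per-frame (rather than span-level) masking, and the masking probability are illustrative.

# Minimal sketch of AV-HuBERT-style pre-training targets and loss.
# Assumptions: `features` is a (T, D) array of frame-level features (MFCCs in
# the first iteration, learned audio-visual features later); `model` returns
# per-frame logits of shape (T, n_clusters) and accepts a boolean frame mask.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def make_cluster_targets(features, n_clusters=100):
    # Step 1 (feature clustering): assign every frame a discrete cluster id.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    return torch.as_tensor(km.labels_, dtype=torch.long)  # shape (T,)

def masked_prediction_loss(model, audio, video, targets, mask_prob=0.3):
    # Step 2 (masked prediction): corrupt random frames (span masking is
    # simplified to per-frame masking here) and predict their cluster ids.
    num_frames = targets.shape[0]
    frame_mask = torch.rand(num_frames) < mask_prob
    logits = model(audio, video, frame_mask=frame_mask)   # (T, n_clusters)
    # The loss is computed only over the masked frames.
    return F.cross_entropy(logits[frame_mask], targets[frame_mask])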
Unlike the prior work in [18], which utilizes the pre-trained AV-HuBERT encoder for unimodal downstream scenarios such as lip-reading and ASR, this paper examines the effectiveness of the learned multimodal representations of AV-HuBERT for the multimodal audio-visual speech recognition (AVSR) task, which aims to transcribe speech videos using both the audio and visual streams. Given a pre-trained AV-HuBERT model, we keep both its audio and video frontends during finetuning. We use a sequence-to-sequence model for AVSR, where AV-HuBERT serves as the encoder module. In contrast to pre-training, we do not apply input masking or modality dropout during finetuning. Also, we freeze the pre-trained AV-HuBERT encoder for a certain number of training steps, after which we update all model weights.
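The freeze-then-unfreeze schedule can be sketched as follows; the attribute name model.encoder, the batch format, and the step count are assumptions for illustration, not the actual training code.

# Sketch of the finetuning schedule: the pre-trained AV-HuBERT encoder is kept
# frozen for the first `freeze_steps` updates (only the decoder is trained),
# after which all model weights are updated.
def set_requires_grad(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def finetune(model, batches, optimizer, freeze_steps=10_000):
    set_requires_grad(model.encoder, False)            # freeze the encoder first
    for step, batch in enumerate(batches):
        if step == freeze_steps:
            set_requires_grad(model.encoder, True)     # then update all weights
        loss = model(**batch)                          # seq2seq loss, assumed interface
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()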
2.2. Noise-Augmented AV-HuBERT

AVSR systems leverage the visual modality under noisy conditions [14]; however, a recognizer trained on clean conditions may rely overly on the audio stream, since a model can predict with audio more effortlessly, thus failing to leverage the visual information in adverse auditory conditions at test time. A typical solution adopted by prior work is noise-augmented supervised training [14, 15], which adds noise sampled from a separate noise dataset to the clean audio at a fixed or sampled signal-to-noise ratio (SNR). We adopt this strategy during the finetuning stage and refer to it as noise-augmented finetuning to emphasize the stage at which noise is employed.
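Concretely, mixing a noise clip into clean speech at a target SNR only requires rescaling the noise according to the definition SNR_dB = 10 log10(P_speech / P_noise). Below is a minimal sketch of such augmentation, assuming 1-D waveforms at the same sample rate; it is an illustration, not the paper's exact code.

# Sketch of noise-augmented training data creation: add a noise clip to clean
# speech at a target signal-to-noise ratio (SNR).
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Loop or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise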
To further boost our model's robustness to acoustic noise, we extend noise augmentation to AV-HuBERT pre-training by randomly adding different types of noise to the audio input, making AV-HuBERT more suitable for AVSR applications. We refer to this as noise-augmented pre-training. Incorporating noise in the pre-training phase benefits the model by closing the domain gap between pre-training, finetuning, and testing. We still use the cluster assignments inferred from clean audio-visual speech, because phonetic information, which is highly correlated with the clusters, should be invariant to noise.
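In other words, only the input audio is corrupted while the cluster targets remain those inferred from clean data. Combining the hypothetical helpers from the earlier sketches, a pre-training step would look roughly as follows.

# Sketch of a noise-augmented pre-training step: the input audio is corrupted,
# but the cluster targets come from clean audio-visual features (in practice
# the assignments are precomputed once per training iteration).
def noisy_pretraining_step(model, audio, video, clean_features, noise, snr_db=0):
    targets = make_cluster_targets(clean_features)   # targets stay clean
    noisy_audio = mix_at_snr(audio, noise, snr_db)   # only the input is corrupted
    return masked_prediction_loss(model, noisy_audio, video, targets)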
A concurrent work, WavLM [22], proposes utterance mixing, a similar technique to ours but applied to audio-only speech representation learning. Utterance mixing augments the input audio by randomly sampling speech utterances from the same minibatch. We use more diverse noise sources in our noise-augmented pre-training, including both speech and non-speech noise, e.g., ambient and babble noise. Additionally, since WavLM targets audio-only self-supervised learning, the overlap between the secondary and the primary utterances needs to be less than 50% to signify which utterance in the mixture is the main one. Our approach is unconstrained and more flexible in mixing noise, because the accompanying visual stream disambiguates the primary and secondary utterances.

3. Experiments

3.1. Data and Experimental Setup

Our experiments are conducted on LRS3 [16], which contains around 433 hours of audio-visual speech from over 5,000 speakers and is the largest publicly available labeled audio-visual speech recognition dataset. VoxCeleb2 [23], a large-scale audio-visual speech dataset initially proposed for the speaker recognition task, is used for our self-supervised pre-training. VoxCeleb2 has around 2,442 hours of video from over 6,000 speakers and contains utterances in multiple languages. We follow the pre-processing steps in [18] to select the "English" portion, which amounts to 1,326 hours of video.

We augment input samples using multiple noise categories. The total duration of noise in each category is shown in Table 1. The noise audio clips in the "natural", "music" and "babble" categories are sampled from the MUSAN dataset [24], while the overlapping "speech" noise samples are drawn from LRS3. In creating the "speech" and "babble" noise sets, we ensured that there is no speaker overlap among the different partitions.

Table 1: Total duration in hours of noise samples in different categories

Partition     natural   music   babble   speech
train             6       35      20       50
validation        1        4       2        6
test              1        4       2        6

We follow the protocol of [18] to create two settings for finetuning the model: a low-resource setting using 30h of labeled videos and a mid-resource setting using 433h of labels. Unless otherwise specified, we use AV-HuBERT LARGE as the default model architecture for all our experiments. The model has 24 transformer blocks, where each block has 16 attention heads and 1024/4096 embedding/feedforward dimensions. We add a 9-layer randomly initialized transformer decoder with similar embedding/feedforward dimensions during finetuning.
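For reference, the architecture described above can be summarized as a small configuration object; the field names are illustrative and not the actual configuration keys of the codebase.

# Model dimensions as stated in the text (field names are illustrative).
from dataclasses import dataclass

@dataclass
class AVHubertAVSRConfig:
    encoder_layers: int = 24      # transformer blocks in AV-HuBERT LARGE
    attention_heads: int = 16
    embed_dim: int = 1024
    ffn_dim: int = 4096
    decoder_layers: int = 9       # randomly initialized decoder added for finetuning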
During training, we first select one noise category and sample a noise audio clip from its training partition. We randomly mix the sampled noise at 0 dB SNR with a probability of 0.25, following [15]. At test time, we evaluate the model separately for each noise type. The testing noise clips are added at five SNR levels: {−10, −5, 0, 5, 10} dB. The performance on the original clean test set is also reported for comparison. By default, noise clips are added during both pre-training and finetuning. We follow the training pipeline in [18], where the model is trained for five iterations in total. To save computation time, we always use the smaller BASE model architecture [18] in all iterations except the last one, where we use a LARGE model. Video samples are batched so as not to exceed 1,000 image frames per GPU. The model is pre-trained for 600K steps using 64 V100 GPUs and finetuned for 30K/100K steps in the 30h/433h settings, respectively.
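A compact sketch of this augmentation and evaluation protocol is given below; the constants come from the text, while sample_noise_clip and the function names are hypothetical.

# Sketch of the noise protocol: during training, noise from a randomly chosen
# category is mixed in at 0 dB with probability 0.25; at test time, each noise
# type is evaluated at five SNR levels.
import random

NOISE_CATEGORIES = ["natural", "music", "babble", "speech"]
TRAIN_SNR_DB = 0
TRAIN_NOISE_PROB = 0.25
TEST_SNRS_DB = [-10, -5, 0, 5, 10]

def augment_training_utterance(speech, sample_noise_clip):
    # `sample_noise_clip(category, split)` is an assumed helper returning a
    # noise waveform from the corresponding partition of Table 1.
    if random.random() < TRAIN_NOISE_PROB:
        category = random.choice(NOISE_CATEGORIES)
        noise = sample_noise_clip(category, split="train")
        speech = mix_at_snr(speech, noise, TRAIN_SNR_DB)  # from the earlier sketch
    return speech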
Table 3: Comparison among models with different pre-training configurations and input modalities. C: clean audio, N: noisy audio. The N-WER is averaged over 4 noise types and 5 SNRs.

Model Size   PT Type   FT Data   Audio-only        Audio-visual
                                 C-WER   N-WER     C-WER   N-WER
(a)  LARGE   None      30h       20.6    59.2      20.8    42.9
(b)  LARGE   Clean     30h        4.3    39.8       3.3     9.3
(c)  LARGE   Noisy     30h        3.8    28.7       3.3     7.8
(d)  LARGE   None      433h       4.7    39.2       3.5    14.8
(e)  LARGE   Clean     433h       1.5    29.1       1.4     6.9
(f)  LARGE   Noisy     433h       1.6    25.8       1.4     5.8

Figure 2: Comparison of models using different inputs and pre-training methods.

3.3.3. Effect of noise-augmented pre-training

The impact of incorporating noise in pre-training is presented in the (AV, PT=Noisy) vs. (AV, PT=Clean) bars in Figure 2 and the (Noisy vs. Clean) rows in Table 3 (i.e., (b) vs. (c) and (e) vs. (f)).

Overall, noise-augmented pre-training improves results in noisy settings. The WER is reduced by 16.1% (9.3%→7.8%) / 15.9% (6.9%→5.8%) on average in the low-resource (30h) and mid-resource (433h) settings compared to pre-training on clean data. Compared to an audio-visual model trained from scratch, the noise-augmented pre-training approach reduces recognition error by 81.8% (42.9%→7.8%) / 60.8% (14.8%→5.8%) in the low/mid-resource settings, and is key to achieving the best AVSR performance in adverse acoustic conditions. Compared to training from scratch, the gain of noise-augmented pre-training peaks around 0 dB SNR, which matches the SNR used in training.
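The relative reductions quoted above follow the usual definition (WER_before − WER_after) / WER_before; the numbers can be reproduced with a few lines.

# Relative WER reduction used for the figures quoted in this section.
def relative_reduction(before, after):
    return 100 * (before - after) / before

print(round(relative_reduction(9.3, 7.8), 1))    # 16.1  (30h: clean PT -> noisy PT)
print(round(relative_reduction(6.9, 5.8), 1))    # 15.9  (433h: clean PT -> noisy PT)
print(round(relative_reduction(42.9, 7.8), 1))   # 81.8  (30h: from scratch -> noisy PT)
print(round(relative_reduction(14.8, 5.8), 1))   # 60.8  (433h: from scratch -> noisy PT)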
Compared to the "babble", "music" and "natural" noise types, noise-augmented AV-HuBERT pre-training is more effective for overlapping "speech" noise, as shown in Figure 2. Consistent with previous findings when comparing audio-visual and audio-only recognizers trained from scratch, the visual modality provides a strong clue for "choosing" the target speech track. AV-HuBERT effectively learns visual representations from paired audio-visual data, leading to significant gains in speech separation.

The gains from noise-augmented pre-training generalize across model architectures, as shown in Table 4. In addition, regardless of the finetuning strategy, our proposed pre-training approach is helpful, as shown by row (b) vs. row (c) and row (e) vs. row (f) in N-WER in Table 4.

Table 4: Comparison among BASE models with different pre-training configurations and input modalities. The model is finetuned with 30h of labeled data. C: clean audio, N: noisy audio. The N-WER is averaged over 4 noise types and 5 SNRs.

4. Conclusion

This paper presented a new state-of-the-art audio-visual speech recognition (AVSR) model based on the AV-HuBERT approach for multimodal speech representation learning. To our knowledge, this is the first attempt towards building an AVSR model using a large volume of unlabeled audio-visual speech data. Our audio-visual speech recognizer achieves high recognition accuracy and is robust to different noise categories even with a few hours of labeled data. With less than 10% of labeled data, our model outperforms the prior SOTA by ∼50%. Our future work includes applying audio-visual speech recognition to real-world low-resource and multilingual settings.
6. Appendix
6.1. Full results
Tables 5 and 6 show the WERs of the LARGE and BASE AV-HuBERT models under various noise types and SNR levels.
Table 5: Test WER (%) of LARGE AV-HuBERT under different levels and types of noise. Lower is better. B: babble, S: speech, M: music,
N: natural noise.