Exploring Wav2vec 2.0 Fine Tuning For Improved Speech Emotion Recognition
Fig. 1: System overview of our methods. (a) Emotion state estimation phase of P-TAPT. An additional CNN with stride 2 is used to align the time steps between wav2vec and wav2vec 2.0. The resulting cluster assignments are used as pseudo-labels for the P-TAPT objective. (b) Model architecture and pretraining objective of wav2vec 2.0, along with our P-TAPT objective.
2. METHOD

We first review wav2vec 2.0, which serves as the backbone model for the methods we examine. We then present the two baseline methods we established. Finally, we introduce pseudo-label task adaptive pretraining (P-TAPT), a novel method we designed to fine-tune wav2vec 2.0 on SER.

2.1. The wav2vec 2.0 model

Wav2vec 2.0 is a transformer-based model trained to extract contextualized representations from raw audio signals. Figure 1.b shows the wav2vec 2.0 model architecture and its pretraining objective. It consists of three sub-modules: a feature encoder, a transformer module, and a quantization module. The feature encoder is a multi-layer CNN that processes the input signal into low-level features. Based on this representation, the transformer module is further applied to produce contextualized representations. The quantization module discretizes the low-level features into a trainable codebook. To train the model, part of the low-level features is masked from the transformer module, and the objective is to identify the quantized version of the masked features based on their context.2

2 There is an additional diversity loss in pretraining which promotes the diversity of the quantization codebook.
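For concreteness, the minimal sketch below shows one way to extract these contextualized representations from the pretrained wav2vec 2.0 base checkpoint with the huggingface transformers library; the checkpoint is the one cited in Section 3.2, while the audio file name and preprocessing are illustrative assumptions rather than part of our pipeline.

```python
# Minimal sketch (assumptions: `transformers` and `torchaudio` installed; file name illustrative).
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

waveform, sr = torchaudio.load("utterance.wav")                     # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = model(inputs.input_values).last_hidden_state           # (1, T, 768) frame-level features
print(frames.shape)
```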
2.2. Comparing methods

As there is no existing baseline system that fine-tunes wav2vec 2.0 on SER, we created two baselines. One is the conventional fine-tuning method, and the other is task adaptive pretraining, which was first introduced in NLP.

Vanilla fine-tuning. Wav2vec 2.0 differs from its NLP counterparts [7] in that there is no utterance-level pretraining task to naturally form a sentence representation. As a consequence, aggregation across time steps is required to fine-tune on utterance-level classification tasks. We experimented with different configurations and found that using average pooling on the final layer is simple yet effective for SER. Specifically, the final contextualized representation extracted by wav2vec 2.0 is first processed by global average pooling across the time dimension, followed by a ReLU activation and a single linear layer that predicts the emotion categories. In addition, a modified version of SpecAugment [22] proposed in wav2vec 2.0 is applied during training for better generalization. We use this architecture for the fine-tuning stage of all three methods. We abbreviate the vanilla fine-tuning method as V-FT.
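A minimal sketch of this V-FT architecture is given below, assuming the huggingface Wav2Vec2Model as the backbone; the number of emotion classes and other details are placeholders, and the SpecAugment-style masking is omitted for brevity.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class VanillaFineTuneSER(nn.Module):
    """Sketch of the V-FT head: mean-pool wav2vec 2.0 frames, ReLU, one linear layer."""

    def __init__(self, num_classes: int = 4, checkpoint: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        frames = self.encoder(input_values).last_hidden_state   # (B, T, H)
        pooled = frames.mean(dim=1)                              # global average pooling over time
        return self.classifier(torch.relu(pooled))               # (B, num_classes) logits

# Usage: logits = VanillaFineTuneSER()(batch_of_waveforms); train with nn.CrossEntropyLoss().
```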
Task adaptive pretraining. Task adaptive pretraining (TAPT) [17] is a simple but effective method for fine-tuning pretrained language models [7] on domain-specific tasks. It bridges the difference between the pretraining and target domains by continuing to pretrain on the target dataset. In this paper, we examine TAPT as one of the methods for fine-tuning wav2vec 2.0 on SER. To distinguish it from the original pretraining and fine-tuning stages, we define an intermediate task adaptation stage for the continual pretraining process.
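For illustration, a rough sketch of one such task adaptation step using the huggingface Wav2Vec2ForPreTraining class (which exposes the masked contrastive objective described in Section 2.1) is shown below; the masking hyperparameters, batching, and optimizer settings are assumptions rather than our exact configuration.

```python
import torch
from transformers import Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def task_adaptation_step(input_values: torch.Tensor) -> float:
    """One continual-pretraining step on a batch of raw target-domain audio (B, num_samples)."""
    batch_size, raw_len = input_values.shape
    seq_len = int(model._get_feat_extract_output_lengths(raw_len))
    # Sample masked frame positions and negative examples for the contrastive objective.
    mask = _compute_mask_indices((batch_size, seq_len), mask_prob=0.065, mask_length=10)
    negatives = _sample_negative_indices((batch_size, seq_len),
                                         model.config.num_negatives,
                                         mask_time_indices=mask)
    mask = torch.tensor(mask, dtype=torch.bool, device=input_values.device)
    negatives = torch.tensor(negatives, dtype=torch.long, device=input_values.device)

    loss = model(input_values,
                 mask_time_indices=mask,
                 sampled_negative_indices=negatives).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```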
2.3. Pseudo-label task adaptive pretraining

While TAPT adapts to emotive speech by continual training with the pretraining objective, it does not make use of emotion labels. Essentially, the contextualized representations obtained will be general features suitable for various downstream tasks. As we focus only on SER, we propose to adapt this objective to generate emotion-specific features. Instead of identifying the missing low-level features, we focus on predicting the emotion state of the masked sequence. One advantage this brings is better data efficiency: reconstructing missing audio is a more complicated task, which makes the model vulnerable to over-fitting. Additionally, it simplifies the fine-tuning stage, as it already filters out information unrelated to emotion recognition from the contextualized representation.
However, frame-level emotion states need to be recognized to realize our method. While only utterance-level emotion labels are given for most SER datasets, several studies [15, 1, 20] indicate that frame-level emotion information can still be inferred by training with a segment-based classification objective. In particular, as shown in Figure 1.a, we fine-tune wav2vec to extract frame-level emotion representations that are useful for predicting an utterance-level emotion label. We find that using CNN architectures such as wav2vec is important, since the locality of the CNN preserves sequential structure. After training, we run the k-means clustering algorithm [23] on all of the extracted representations from the target dataset. As Caron et al. [21] have shown, the k-means cluster assignments on intermediate layers of CNN classifiers can capture information related to the target labels. Therefore, we interpret this cluster assignment as a pseudo-label that represents the local emotion state.
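A minimal sketch of this clustering step, assuming the frame-level emotion representations have already been extracted into a single array (file names and the number of clusters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# frame_features: (num_frames_in_dataset, feature_dim) frame-level emotion
# representations from the segment-based classifier (assumed precomputed).
frame_features = np.load("frame_features.npy")           # hypothetical file

kmeans = KMeans(n_clusters=64, random_state=0).fit(frame_features)
pseudo_labels = kmeans.labels_                            # one cluster id per frame
np.save("pseudo_labels.npy", pseudo_labels)               # later used as P-TAPT targets
```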
We then replace the TAPT objective with our new P-TAPT objective. We add a position-wise linear head composed of two linear layers to predict the k-means cluster assignments of the masked frames. In practice, we run k-means clustering multiple times with different numbers of clusters, and our model predicts an ensemble of cluster assignments with multiple linear heads. This cluster ensemble technique has been shown to facilitate representation learning in HuBERT [11], a recent self-supervised speech representation learning model.
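The sketch below illustrates the resulting masked pseudo-label prediction loss for a single clustering; the cluster ensemble simply sums this loss over several such heads. Tensor names and sizes are assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn

class PseudoLabelHead(nn.Module):
    """Position-wise head: two linear layers mapping each frame to cluster logits."""

    def __init__(self, hidden_size: int = 768, num_clusters: int = 64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU(),
                                  nn.Linear(hidden_size, num_clusters))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_states)                  # (B, T, num_clusters) from (B, T, H)

def p_tapt_loss(logits: torch.Tensor, pseudo_labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on masked frames only.
    logits: (B, T, C); pseudo_labels: (B, T) cluster ids; mask: (B, T) bool for masked frames."""
    return nn.functional.cross_entropy(logits[mask], pseudo_labels[mask])
```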
3. EXPERIMENTAL SETUP

3.1. Dataset

We use two datasets for evaluation, IEMOCAP [4] and SAVEE [5]. We only use the speech modality.

IEMOCAP. Interactive Emotional Dyadic Motion Capture (IEMOCAP) is a popular dataset for evaluating SER systems. It contains five recording sessions, each with one male speaker and one female speaker. To compare with previous works, we use the default labels provided by IEMOCAP. However, only four emotion categories are considered: neutral, sad, angry, and happy. In particular, the "excited" category is merged with "happy" due to its sparsity in the dataset. The total amount of speech is about 7 hours.

SAVEE. The Surrey Audio-Visual Expressed Emotion (SAVEE) dataset contains four male speakers: DC, JE, JK, and KL. Each speaker reads out the same set of 120 sentences5 labeled with one of 7 emotion categories: angry, disgust, sad, fear, happy, surprise, and neutral. We use all of the emotion categories, which results in 480 utterances with a total of 30 minutes of speech.

5 Two-thirds of the sentences are specific to one emotion and shared across all speakers.

3.2. Training and evaluation procedure

All experiments use the same learning rate of 1 × 10−4 with the Adam optimizer [24]. For the wav2vec model, we use a pretrained model developed by Facebook AI3. We build our wav2vec 2.0 implementation on top of the huggingface implementation and adopt a pretrained model checkpoint from Facebook AI4. Both models are pretrained on the unsupervised speech of LibriSpeech 960h [25] without transcriptions. We evaluate our systems using unweighted accuracy (UA) [2] under a speaker-independent setting; the speakers in the test set are excluded from the training data. Additional implementation details are provided in our github repository. We run each experiment 5 times for the full IEMOCAP dataset and 20 times for SAVEE and for sub-sampled versions of IEMOCAP. Additionally, we observe that wav2vec 2.0 fails to converge with some of the random seeds. Therefore, we discard and rerun outlier runs where the performance is outside two standard deviations from the mean.

3 https://github.com/pytorch/fairseq/tree/master/examples/wav2vec
4 https://huggingface.co/facebook/wav2vec2-base

IEMOCAP. To have a fair comparison with the majority of previous works, we split the dataset by leaving one session out as the test set; the remaining four sessions are used for training. Note that most of the papers using IEMOCAP do not explicitly define their validation set [26]. We therefore train with all four sessions for a fixed 15 epochs without validation, using a batch size of 64. The number of epochs is chosen so that each of our competing methods converges in terms of training loss.

SAVEE. A similar evaluation procedure is used for SAVEE. In each fold, one speaker is left out for testing and the remaining three are used for training. We increase the number of training epochs to 30, and the batch size is halved to 32 for the smaller training set.
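For clarity, unweighted accuracy is the recall averaged over emotion classes; a minimal sketch, assuming predictions and labels for one test fold are available as arrays:

```python
import numpy as np

def unweighted_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Unweighted accuracy (UA): per-class recall averaged over classes,
    so rare classes count as much as frequent ones."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

# Example with toy labels for one leave-one-session-out fold: prints 0.75.
print(unweighted_accuracy(np.array([0, 0, 1, 2, 2, 3]),
                          np.array([0, 1, 1, 2, 0, 3])))
```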
4. RESULTS AND DISCUSSION

4.1. Comparison of fine-tuning methods

Table 1 compares the performance of the fine-tuning methods on IEMOCAP. For all sessions except the first, TAPT yields a noticeable improvement over V-FT, and P-TAPT performs better than TAPT for all sessions. On the other hand, Table 2 shows that on SAVEE, both TAPT and P-TAPT outperform V-FT by a large margin. However, the performance of P-TAPT is very close to that of TAPT on SAVEE. We analyze these results by considering the characteristics of SAVEE and IEMOCAP.

Domain shift and linguistic content. We first quantify the domain shift between both datasets and the pretraining dataset. We take the wav2vec 2.0 model pretrained on LibriSpeech and calculate the pretraining loss on both datasets along with the test set of LibriSpeech. Table 3 verifies the presence of domain shift on both datasets, providing room for TAPT to improve. A smaller loss indicates that SAVEE is closer to LibriSpeech, as the model can already generalize well to SAVEE. However, this improvement is larger on SAVEE than IEMOCAP despite the smaller domain shift. We observe a strong correlation between linguistic content and
Table 1: Comparison of methods on IEMOCAP in UA(%)

Session    1      2      3      4      5      Mean
V-FT       71.0   76.2   66.3   68.7   67.3   69.9
TAPT       71.8   79.6   70.2   73.2   72.5   73.5
P-TAPT     72.8   80.2   71.0   73.6   73.7   74.3

Table 2: Comparison of methods on SAVEE in UA(%)

Table 4: Comparison of methods on data efficiency in UA(%) on session 5 of IEMOCAP

4.2. Comparison with prior works

Table 5 compares our performance on IEMOCAP to that of existing SOTA models. We only include methods that evaluate under speaker-independent settings. Simply fine-tuning the wav2vec 2.0 model (using V-FT) outperforms wav2vec 2.0 without fine-tuning [16] by 3.6% absolute UA. The P-TAPT method provides a 7.4% absolute improvement over SOTA models on IEMOCAP. We also show performance for methods that use both speech and text [27, 28]; our audio-only method appears comparable.

6 We identify it as a feature rather than a model architecture, since they only use wav2vec/wav2vec 2.0 as a feature extractor without fine-tuning.
7 Referring to the text features extracted from ALBERT [8] and BERT [7].

6. ACKNOWLEDGEMENTS

We are grateful to PwC USA as well as to The Digital Transformation and Innovation Center at Carnegie Mellon University for supporting our research. We thank Yangyang Xia and Richard M. Stern for the discussions and feedback.
7. REFERENCES

[1] H. M. Fayek, M. Lech, and L. Cavedon, "Evaluating deep learning architectures for speech emotion recognition," Neural Networks, vol. 92, pp. 60–68, 2017.

[2] K. Han, D. Yu, and I. Tashev, "Speech emotion recognition using deep neural network and extreme learning machine," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[3] Y. Zhang, J. Du, Z. Wang, J. Zhang, and Y. Tu, "Attention based fully convolutional network for speech emotion recognition," in 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2018, pp. 1771–1775.

[4] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.

[5] P. Jackson and S. ul Haq, "Surrey Audio-Visual Expressed Emotion (SAVEE) database," 2011.

[6] S. R. Livingstone and F. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English," PLoS ONE, vol. 13, 2018.

[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL-HLT, 2019.

[8] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," in International Conference on Learning Representations, 2020.

[9] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," in Proc. Interspeech 2019, 2019, pp. 3465–3469.

[10] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 12449–12460.

[11] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," 2021.

[12] Z. Fan, M. Li, S. Zhou, and B. Xu, "Exploring wav2vec 2.0 on speaker verification and language identification," in Proc. Interspeech 2021, 2021, pp. 1509–1513.

[13] L. Peng, K. Fu, B. Lin, D. Ke, and J. Zhan, "A study on fine-tuning wav2vec2.0 model for the task of mispronunciation detection and diagnosis," in Proc. Interspeech 2021, 2021, pp. 4448–4452.

[14] J. Boigne, B. Liyanage, and T. Östrem, "Recognizing more emotions with less data using self-supervised transfer learning," ArXiv, vol. abs/2011.05585, 2020.

[15] Y. Xia, L.-W. Chen, A. Rudnicky, and R. M. Stern, "Temporal context in speech emotion recognition," in Proc. Interspeech 2021, 2021, pp. 3370–3374.

[16] L. Pepino, P. Riera, and L. Ferrer, "Emotion recognition from speech using wav2vec 2.0 embeddings," in Proc. Interspeech 2021, 2021, pp. 3400–3404.

[17] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith, "Don't stop pretraining: Adapt language models to domains and tasks," in Proceedings of ACL, 2020.

[18] W.-N. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V. Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve, and M. Auli, "Robust wav2vec 2.0: Analyzing domain shift in self-supervised pre-training," in Proc. Interspeech 2021, 2021, pp. 721–725.

[19] M. D. Pell, "Influence of emotion and focus location on prosody in matched statements and questions," The Journal of the Acoustical Society of America, vol. 109, no. 4, pp. 1668–1680, 2001.

[20] S. Mao, P. Ching, C.-C. J. Kuo, and T. Lee, "Advancing multiple instance learning with attention modeling for categorical speech emotion recognition," in Proc. Interspeech 2020, 2020, pp. 2357–2361.

[21] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, "Deep clustering for unsupervised learning of visual features," in European Conference on Computer Vision, 2018.

[22] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. Interspeech 2019, 2019, pp. 2613–2617.

[23] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.

[24] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.

[25] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.

[26] C. Etienne, G. Fidanza, A. Petrovskii, L. Devillers, and B. Schmauch, "CNN+LSTM architecture for speech emotion recognition with data augmentation," in Proc. Workshop on Speech, Music and Mind (SMM 2018), 2018, pp. 21–25.

[27] M. Chen and X. Zhao, "A multi-scale fusion framework for bimodal speech emotion recognition," in Proc. Interspeech 2020, 2020, pp. 374–378.

[28] J. Santoso, T. Yamada, S. Makino, K. Ishizuka, and T. Hiramura, "Speech emotion recognition based on attention weight correction using word-level confidence measure," in Proc. Interspeech 2021, 2021, pp. 1947–1951.