
EXPLORING WAV2VEC 2.0 FINE TUNING FOR IMPROVED SPEECH EMOTION RECOGNITION

Li-Wei Chen, Alexander Rudnicky

Language Technologies Institute, Carnegie Mellon University

ABSTRACT

While Wav2Vec 2.0 has been proposed for speech recognition (ASR), it can also be used for speech emotion recognition (SER); its performance can be significantly improved using different fine-tuning strategies. Two baseline methods, vanilla fine-tuning (V-FT) and task adaptive pretraining (TAPT), are first presented. We show that V-FT is able to outperform state-of-the-art models on the IEMOCAP dataset. TAPT, an existing NLP fine-tuning strategy, further improves the performance on SER. We also introduce a novel fine-tuning method termed P-TAPT, which modifies the TAPT objective to learn contextualized emotion representations. Experiments show that P-TAPT performs better than TAPT, especially under low-resource settings. Compared to prior works in this literature, our top-line system achieves a 7.4% absolute improvement in unweighted accuracy (UA) over the state-of-the-art performance on IEMOCAP. Our code is publicly available.^1

Index Terms— Speech emotion recognition, deep neural networks, wav2vec 2.0, fine-tuning, pretrained models

1. INTRODUCTION

Speech emotion recognition (SER) remains one of the key components in human-machine interaction and human communication systems. With the development of deep learning, several attempts [1, 2, 3] have been made to automatically learn emotion representations from audio signals using neural nets. However, the improvement of deep learning based systems is often limited by the lack of annotated data. Commonly used SER datasets [4, 5, 6] are relatively small compared to automatic speech recognition (ASR) datasets. Moreover, systems trained on these datasets may not generalize well to other domains such as call centers.

Self-supervised pretrained models [7, 8] provide a solution by first learning from a large-scale speech corpus without explicit labeling. The knowledge learned from pretraining can be transferred to downstream tasks either by using the model as a feature extractor or by directly fine-tuning the whole model. While first introduced for natural language processing (NLP), several pretrained models [9, 10, 11] have been developed for speech processing. Wav2vec [9] is a multi-layer convolutional neural network (CNN) trained to predict future frames conditioned on past frames by minimizing a contrastive loss. Wav2vec 2.0 [10], on the other hand, is a transformer-based model that adopts a masked learning objective to predict missing frames from the remaining context.

Despite the success of these methods in ASR, speaker verification, and mispronunciation detection [10, 12, 13], only a few attempts [14, 15, 16] have been made to apply them to SER. Boigne et al. [14] find that wav2vec features are superior to traditional spectral-based features for SER. Xia et al. [15] compare features extracted with different time spans and conclude that features with longer temporal context, such as wav2vec, perform better on SER. Pepino et al. [16] show that features extracted from a linear combination of wav2vec 2.0 layers outperform single-layer representations on SER. While these studies demonstrate the usefulness of the pretrained models as feature extractors, little research has been conducted on fine-tuning them for SER.

One persistent issue in fine-tuning pretrained models is the mismatch between the pretraining and target domains [17, 18]. Task adaptive pretraining (TAPT) [17] was proposed to resolve this domain shift by continuing the pretraining process on the target dataset. Hsu et al. [18] show that TAPT greatly improves generalization and robustness on ASR when the pretraining and fine-tuning data are dissimilar. Since the speech in pretraining ASR corpora differs from emotive speech in multiple regards [19], we consider TAPT a compelling method for fine-tuning on SER.

In this paper, we explore methods for fine-tuning wav2vec 2.0 on SER. We show that by adding a simple neural network on top of wav2vec 2.0, vanilla fine-tuning (V-FT) outperforms state-of-the-art (SOTA) methods on the IEMOCAP [4] dataset. In addition, with V-FT as a baseline, TAPT significantly boosts the performance of fine-tuning wav2vec 2.0 on SER. Furthermore, motivated by previous work on the benefits of segment-based emotion features [1, 20, 15] and self-supervised representation learning [11, 21], we develop a novel fine-tuning procedure for SER that yields even better performance, especially in low-resource conditions. Finally, we achieve a 7.4% absolute increase in unweighted accuracy (UA) over the SOTA performance on IEMOCAP.

^1 https://github.com/b04901014/FT-w2v2-ser

Fig. 1: System overview of our methods. (a) Emotion state estimation phase of P-TAPT. An additional CNN with stride 2 is used to align the time steps between wav2vec and wav2vec 2.0. The resulting cluster assignments are used as pseudo-labels for the P-TAPT objective. (b) Model architecture and pretraining objective of wav2vec 2.0, along with our P-TAPT objective.

2. METHOD

We first review wav2vec 2.0, which serves as the backbone model for the methods we examine. We then present the two baseline methods we established. Finally, we introduce pseudo-label task adaptive pretraining (P-TAPT), a novel method we designed for fine-tuning wav2vec 2.0 on SER.

2.1. The wav2vec 2.0 model

Wav2vec 2.0 is a transformer-based model trained to extract contextualized representations from raw audio signals. Figure 1.b shows the wav2vec 2.0 model architecture and its pretraining objective. The model consists of three sub-modules: a feature encoder, a transformer module, and a quantization module. The feature encoder is a multi-layer CNN that processes the input signal into low-level features. Based on this representation, the transformer module produces contextualized representations, while the quantization module discretizes the low-level features into a trainable codebook. To train the model, part of the low-level features are masked from the transformer module, and the objective is to identify the quantized version of the masked features based on their context.^2

^2 There is an additional diversity loss in pretraining which promotes the diversity of the quantization codebook.
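To make the extraction of contextualized representations concrete, the sketch below pulls frame-level features from the released wav2vec 2.0 base checkpoint with the huggingface transformers library (the checkpoint cited later in Section 3.2). This is an illustrative sketch, not the authors' code: the audio file name is a placeholder and the resampling step assumes 16 kHz input is required.

```python
# Sketch: extracting wav2vec 2.0 contextualized representations with the
# huggingface transformers library. Assumes a 16 kHz mono waveform; the
# file name is a placeholder.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

waveform, sr = torchaudio.load("utterance.wav")                  # placeholder file
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# (batch, frames, hidden): one contextualized vector per ~20 ms frame
print(outputs.last_hidden_state.shape)
```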
2.2. Comparing methods

As there is no existing baseline system that fine-tunes wav2vec 2.0 on SER, we created two baselines: the conventional fine-tuning method, and task adaptive pretraining, which was first introduced in NLP.

Vanilla fine-tuning. Wav2vec 2.0 differs from its NLP counterparts [7] in that there is no utterance-level pretraining task that naturally forms a sentence representation. As a consequence, aggregation across time steps is required to fine-tune on utterance-level classification tasks. We experimented with different configurations and found that using average pooling on the final layer is simple yet effective for SER. Specifically, the final contextualized representation extracted by wav2vec 2.0 is first processed by global average pooling across the time dimension, followed by a ReLU activation and a single linear layer that predicts the emotion categories. In addition, a modified version of SpecAugment [22], proposed in wav2vec 2.0, is applied during training for better generalization. We use this architecture for the fine-tuning stage of all three methods and abbreviate the vanilla fine-tuning method as V-FT.
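As a minimal sketch of the V-FT architecture just described (mean pooling over time, a ReLU, and one linear layer on top of wav2vec 2.0), assuming the four-way IEMOCAP label set; the class name is ours, and SpecAugment-style masking is left to the backbone's own configuration rather than reproduced here:

```python
# Minimal sketch of V-FT: wav2vec 2.0 backbone + global average pooling over
# time + ReLU + a single linear classifier, trained end-to-end. The class name
# and the 4-way label set are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class VanillaFineTuneSER(nn.Module):
    def __init__(self, num_emotions: int = 4, ckpt: str = "facebook/wav2vec2-base"):
        super().__init__()
        # The backbone's apply_spec_augment / mask_time_prob options provide
        # wav2vec 2.0-style masking; the paper's exact settings are not assumed.
        self.backbone = Wav2Vec2Model.from_pretrained(ckpt)
        self.classifier = nn.Linear(self.backbone.config.hidden_size, num_emotions)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_values).last_hidden_state   # (B, T, H)
        pooled = hidden.mean(dim=1)                              # average over time
        return self.classifier(torch.relu(pooled))               # emotion logits

model = VanillaFineTuneSER()
logits = model(torch.randn(2, 32000))   # two dummy 2-second waveforms at 16 kHz
```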
Task adaptive pretraining. Task adaptive pretraining (TAPT) [17] is a simple but effective method for fine-tuning pretrained language models [7] on domain-specific tasks. It bridges the gap between the pretraining and target domains by continuing to pretrain on the target dataset. In this paper, we examine TAPT as one of the methods for fine-tuning wav2vec 2.0 on SER. To distinguish it from the original pretraining and fine-tuning stages, we refer to the continual pretraining process as an intermediate task adaptation stage.
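To make the task adaptation stage concrete, here is a rough sketch of continued pretraining on the target SER data with the wav2vec 2.0 contrastive objective, using Wav2Vec2ForPreTraining from huggingface transformers. The masking and negative-sampling helpers are private utilities used in the library's own pretraining example, so their exact signatures may vary across versions; the masking probability, number of negatives, and data handling are assumptions, not the authors' settings.

```python
# Rough sketch of TAPT: continue the wav2vec 2.0 pretraining objective on the
# target (emotion) dataset before fine-tuning. The underscore-prefixed helpers
# are private transformers utilities and may differ between library versions.
import torch
from transformers import Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base").train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def adaptation_step(input_values: torch.Tensor) -> float:
    """One continued-pretraining step on a batch of raw waveforms (B, samples)."""
    batch, samples = input_values.shape
    frames = int(model._get_feat_extract_output_lengths(torch.tensor(samples)))
    mask = _compute_mask_indices((batch, frames), mask_prob=0.065, mask_length=10)
    negatives = _sample_negative_indices((batch, frames), num_negatives=100,
                                         mask_time_indices=mask)
    out = model(input_values,
                mask_time_indices=torch.tensor(mask, dtype=torch.bool),
                sampled_negative_indices=torch.tensor(negatives))
    out.loss.backward()        # contrastive (+ diversity) loss on masked frames
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```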

2.3. Pseudo-label task adaptive pretraining

While TAPT adapts to emotive speech by continued training with the pretraining objective, it makes no use of the emotion labels; the contextualized representations it produces are general features suitable for various downstream tasks. Since we focus only on SER, we propose to adapt this objective to generate emotion-specific features. Instead of identifying the missing low-level features, we focus on predicting the emotion state of the masked sequence. One advantage this brings is better data efficiency: reconstructing missing audio is a more complicated task, which makes the model vulnerable to over-fitting. It also simplifies the fine-tuning stage, since information unrelated to emotion recognition has already been filtered out of the contextualized representation.

However, frame-level emotion states need to be recognized to realize our method. While only utterance-level emotion labels are given for most SER datasets, several studies [15, 1, 20] indicate that frame-level emotion information can still be inferred by training with a segment-based classification objective. In particular, as shown in Figure 1.a, we fine-tune wav2vec to extract frame-level emotion representations that are useful for predicting an utterance-level emotion label. We find that using a CNN architecture such as wav2vec is important, since the locality of the CNN preserves sequential structure. After training, we run the k-means clustering algorithm [23] on all of the representations extracted from the target dataset. As Caron et al. [21] have shown, k-means cluster assignments on intermediate layers of CNN classifiers can capture information related to the target labels. We therefore interpret each cluster assignment as a pseudo-label that represents the local emotion state.
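As a sketch of this pseudo-label generation step, k-means (Lloyd's algorithm, as in [23]) can be run with scikit-learn on the frame-level representations extracted from the fine-tuned wav2vec model; the feature dimension and number of clusters below are illustrative, not the paper's settings.

```python
# Sketch: cluster frame-level emotion representations with k-means and use the
# per-frame cluster ids as pseudo-labels. `frame_features` is assumed to be a
# list of (T_i, D) arrays, one per utterance in the target dataset.
import numpy as np
from sklearn.cluster import KMeans

def make_pseudo_labels(frame_features, n_clusters=64, seed=0):
    lengths = [f.shape[0] for f in frame_features]
    stacked = np.concatenate(frame_features, axis=0)           # (sum T_i, D)
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed).fit(stacked)
    # split the flat assignment vector back into per-utterance label sequences
    return np.split(kmeans.labels_, np.cumsum(lengths)[:-1])

# Toy example: two utterances of different lengths with random 512-d features.
labels = make_pseudo_labels([np.random.randn(120, 512), np.random.randn(80, 512)])
print([seq.shape for seq in labels])   # [(120,), (80,)]
```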
We then replace the TAPT objective with our new P-TAPT objective: a position-wise head composed of two linear layers is added to predict the k-means cluster assignments of the masked frames. In practice, we run k-means several times with different numbers of clusters, and the model predicts this ensemble of cluster assignments with multiple linear heads. This cluster-ensemble technique has been shown to facilitate representation learning in HuBERT [11], a recent self-supervised speech representation learning model.
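The prediction side of this objective can be sketched as below: one position-wise head per clustering, each made of two linear layers, trained with cross-entropy only on the masked frames. The hidden width, the ReLU between the two layers, and the set of cluster counts are illustrative assumptions.

```python
# Sketch of the P-TAPT prediction heads: for each k-means clustering in the
# ensemble, a position-wise two-layer head predicts that clustering's
# assignments for the masked frames. Sizes and cluster counts are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoLabelHeads(nn.Module):
    def __init__(self, hidden_size: int = 768, cluster_sizes=(64, 128, 256)):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, hidden_size),
                          nn.ReLU(),
                          nn.Linear(hidden_size, k))
            for k in cluster_sizes
        )

    def forward(self, contextual, pseudo_labels, mask):
        # contextual: (B, T, H) wav2vec 2.0 outputs; pseudo_labels: one (B, T)
        # long tensor per clustering; mask: (B, T) bool marking masked frames.
        loss = 0.0
        for head, labels in zip(self.heads, pseudo_labels):
            logits = head(contextual)                       # (B, T, k)
            loss = loss + F.cross_entropy(logits[mask], labels[mask])
        return loss / len(self.heads)

heads = PseudoLabelHeads()
ctx = torch.randn(2, 50, 768)
labels = [torch.randint(0, k, (2, 50)) for k in (64, 128, 256)]
mask = torch.zeros(2, 50, dtype=torch.bool)
mask[:, 10:20] = True                                       # dummy masked span
print(heads(ctx, labels, mask))
```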
3. EXPERIMENTAL SETUP

3.1. Dataset

We use two datasets for evaluation, IEMOCAP [4] and SAVEE [5]. We use only the speech modality.

IEMOCAP. Interactive Emotional Dyadic Motion Capture (IEMOCAP) is a popular dataset for evaluating SER systems. It contains five recording sessions, each with one male and one female speaker. To compare with previous works, we use the default labels provided by IEMOCAP; however, only four emotion categories are considered: neutral, sad, angry, and happy. In particular, the “excited” category is merged with “happy” due to its sparsity in the dataset. The total amount of speech is about 7 hours.

SAVEE. The Surrey Audio-Visual Expressed Emotion (SAVEE) dataset contains four male speakers: DC, JE, JK, and KL. Each speaker reads the same set of 120 sentences, labeled with one of 7 emotion categories: angry, disgust, sad, fear, happy, surprise, and neutral. We use all of the emotion categories, which results in 480 utterances totaling 30 minutes of speech.

3.2. Training and evaluation procedure

All experiments use the same learning rate of 1 × 10^−4 with the Adam optimizer [24]. For the wav2vec model, we use a pretrained model released by Facebook AI.^3 We build our wav2vec 2.0 implementation on top of the huggingface implementation and adopt a pretrained model checkpoint from Facebook AI.^4 Both models are pretrained on the unsupervised speech of LibriSpeech 960h [25] without transcriptions. We evaluate our systems using unweighted accuracy (UA) [2] under a speaker-independent setting; the speakers in the test set are excluded from the training data. Additional implementation details are provided in our github repository. We run each experiment 5 times for the full IEMOCAP dataset and 20 times for SAVEE and for the sub-sampled versions of IEMOCAP. Additionally, we observe that wav2vec 2.0 fails to converge with some random seeds; we therefore discard and rerun outlier runs whose performance falls outside two standard deviations from the mean.

^3 https://github.com/pytorch/fairseq/tree/master/examples/wav2vec
^4 https://huggingface.co/facebook/wav2vec2-base

IEMOCAP. To allow a fair comparison with the majority of previous works, we split the dataset by leaving one session out as the test set; the remaining four sessions are used for training. Note that most papers using IEMOCAP do not explicitly define a validation set [26]. We therefore train on all four sessions for a fixed 15 epochs without validation, using a batch size of 64. The number of epochs is chosen so that each of the competing methods converges in terms of training loss.

SAVEE. A similar evaluation procedure is used for SAVEE. In each fold, one speaker is left out for testing and the remaining three are used for training. We increase the number of training epochs to 30 and halve the batch size to 32 for the smaller training set.
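For clarity, the sketch below spells out the evaluation protocol used here: leave-one-session-out splits for IEMOCAP and unweighted accuracy computed as the macro average of per-class recall. The utterance dictionary fields are placeholders for however the data is actually stored.

```python
# Sketch of the evaluation protocol: UA is the macro average of per-class
# recall; IEMOCAP is split by holding one session out for testing. The
# "session"/"label" dictionary keys are placeholders.
import numpy as np

def unweighted_accuracy(y_true, y_pred, num_classes: int = 4) -> float:
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean()
               for c in range(num_classes) if (y_true == c).any()]
    return float(np.mean(recalls))

def session_split(utterances, held_out_session):
    train = [u for u in utterances if u["session"] != held_out_session]
    test = [u for u in utterances if u["session"] == held_out_session]
    return train, test

# Toy check of the metric: three classes recalled perfectly, one missed -> 0.75.
print(unweighted_accuracy([0, 1, 2, 3], [0, 1, 2, 0]))
```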

4. RESULTS AND DISCUSSION

4.1. Comparison of fine-tuning methods

Table 1 compares the performance of the fine-tuning methods on IEMOCAP. For all sessions except the first, TAPT yields a noticeable improvement over V-FT, and P-TAPT performs better than TAPT for all sessions. On the other hand, Table 2 shows that on SAVEE, both TAPT and P-TAPT outperform V-FT by a large margin; however, the performance of P-TAPT is very close to that of TAPT on SAVEE. We analyze these results by considering the characteristics of SAVEE and IEMOCAP.

Table 1: Comparison of methods on IEMOCAP in UA (%)

Session    1      2      3      4      5      Mean
V-FT       71.0   76.2   66.3   68.7   67.3   69.9
TAPT       71.8   79.6   70.2   73.2   72.5   73.5
P-TAPT     72.8   80.2   71.0   73.6   73.7   74.3

Table 2: Comparison of methods on SAVEE in UA (%)

Speaker    DC     JE     JK     KL     Mean
V-FT       75.2   78.8   56.0   39.0   62.3
TAPT       81.6   83.3   69.9   49.7   71.1
P-TAPT     86.7   84.2   66.8   45.8   70.9
Human      73.7   67.7   71.2   53.2   66.5
Domain shift and linguistic content. We first quantify the domain shift between both datasets and the pretraining dataset. We take the wav2vec 2.0 model pretrained on LibriSpeech and calculate the pretraining loss on both datasets, along with the test set of LibriSpeech. Table 3 confirms the presence of domain shift on both datasets, leaving room for TAPT to improve. The smaller loss on SAVEE indicates that it is closer to LibriSpeech, as the model already generalizes well to it; yet the improvement from TAPT is larger on SAVEE than on IEMOCAP despite the smaller domain shift. We observe a strong correlation between linguistic content and emotion labels in SAVEE.^5 We conjecture that this correlation is captured by our model, which thereby surpasses human evaluators, who annotate emotion from para-linguistic information only. This also explains why P-TAPT does not further improve on TAPT here: the TAPT objective is already suitable for modeling linguistic information. Nonetheless, on more naturally elicited emotional conversations (IEMOCAP), P-TAPT performs better than TAPT.

^5 Two-thirds of the sentences are specific to one emotion and shared across all speakers.

Table 3: Wav2vec 2.0 pretraining loss on different datasets

Dataset    Libri. (test-clean)   SAVEE   IEMOCAP
Loss       32.04                 41.37   55.42

Data efficiency. We also investigated the behavior of our methods when presented with different amounts of training data. Specifically, we fix session 5 of IEMOCAP as the held-out test set and gradually halve the number of training examples in the remaining four sessions by random selection. We compare TAPT and P-TAPT using the ratio of their corresponding improvements over V-FT; a lower ratio indicates that the improvement from P-TAPT is more significant than that from TAPT. As shown in Table 4, this ratio is lower under low-resource settings with one hour or less of training data; thus P-TAPT is more data-efficient than TAPT. As mentioned in Section 2.3, we attribute this to the change of objective from reconstruction of audio frames to prediction of emotion states, which is less data-intensive although it requires labeled data.

Table 4: Comparison of methods on data efficiency in UA (%) on session 5 of IEMOCAP

Data size   ∼0.5 hr   ∼1 hr   ∼3 hr   ∼7 hr
V-FT        56.8      60.9    66.4    67.3
TAPT        58.1      62.9    69.3    72.5
P-TAPT      58.8      64.1    70.0    73.7
Ratio (%)   65.0      62.5    80.6    81.3
4.2. Comparison with prior works

Table 5 compares our performance on IEMOCAP to that of existing SOTA models. We include only methods evaluated under speaker-independent settings. Simply fine-tuning the wav2vec 2.0 model (using V-FT) outperforms wav2vec 2.0 without fine-tuning [16] by 3.6% absolute UA. The P-TAPT method provides a 7.4% absolute improvement over SOTA models on IEMOCAP. We also show performance for methods that use both speech and text [27, 28]; our audio-only method appears comparable.

Table 5: Comparison with prior works on IEMOCAP

Method                      Feature          UA (%)
FCN+Attention [3]           Spectrogram       63.9
Wav2vec w/o FT [14]         Wav2vec^6         64.3
Wav2vec w/ FT [15]          Waveform          66.9
Wav2vec 2.0 w/o FT [16]     Wav2vec 2.0^6     66.3
Wav2vec 2.0 w/ V-FT         Waveform          69.9
Wav2vec 2.0 w/ TAPT         Waveform          73.5
Wav2vec 2.0 w/ P-TAPT       Waveform          74.3
Audio + Text [27]           MFCC+ALBERT^7     72.1
Audio + ASR [28]            MFCC+BERT^7       75.9

^6 We identify these as features rather than model architectures since these works use wav2vec/wav2vec 2.0 only as a feature extractor, without fine-tuning.
^7 Referring to the text features extracted from ALBERT [8] and BERT [7].

5. CONCLUSION

We describe different fine-tuning strategies for wav2vec 2.0 on SER. These strategies produce SOTA performance on IEMOCAP, a well-studied corpus. We verify the presence of domain shift in SER and demonstrate that addressing it improves performance. We describe an algorithm for learning contextualized emotion representations and show its advantage in fine-tuning a wav2vec 2.0 model for SER. We believe that these techniques can be generalized to other tasks and can provide a basis for research on the utility of contextualized emotion representations. We intend to continue exploring the usefulness of this approach in a multi-modal setting.

6. ACKNOWLEDGEMENTS

We are grateful to PwC USA as well as to The Digital Transformation and Innovation Center at Carnegie Mellon University for supporting our research. We thank Yangyang Xia and Richard M. Stern for discussions and feedback.

7. REFERENCES

[1] H. M. Fayek, M. Lech, and L. Cavedon, “Evaluating deep learning architectures for speech emotion recognition,” Neural Networks, vol. 92, pp. 60–68, 2017.
[2] K. Han, D. Yu, and I. Tashev, “Speech emotion recognition using deep neural network and extreme learning machine,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[3] Y. Zhang, J. Du, Z. Wang, J. Zhang, and Y. Tu, “Attention based fully convolutional network for speech emotion recognition,” in 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2018, pp. 1771–1775.
[4] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.
[5] P. Jackson and S. ul Haq, “Surrey Audio-Visual Expressed Emotion (SAVEE) database,” 2011.
[6] S. R. Livingstone and F. Russo, “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English,” PLoS ONE, vol. 13, 2018.
[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT, 2019.
[8] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A lite BERT for self-supervised learning of language representations,” in International Conference on Learning Representations, 2020.
[9] S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised pre-training for speech recognition,” in Proc. Interspeech 2019, 2019, pp. 3465–3469.
[10] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 12449–12460.
[11] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” 2021.
[12] Z. Fan, M. Li, S. Zhou, and B. Xu, “Exploring wav2vec 2.0 on speaker verification and language identification,” in Proc. Interspeech 2021, 2021, pp. 1509–1513.
[13] L. Peng, K. Fu, B. Lin, D. Ke, and J. Zhan, “A study on fine-tuning wav2vec2.0 model for the task of mispronunciation detection and diagnosis,” in Proc. Interspeech 2021, 2021, pp. 4448–4452.
[14] J. Boigne, B. Liyanage, and T. Östrem, “Recognizing more emotions with less data using self-supervised transfer learning,” arXiv preprint arXiv:2011.05585, 2020.
[15] Y. Xia, L.-W. Chen, A. Rudnicky, and R. M. Stern, “Temporal context in speech emotion recognition,” in Proc. Interspeech 2021, 2021, pp. 3370–3374.
[16] L. Pepino, P. Riera, and L. Ferrer, “Emotion recognition from speech using wav2vec 2.0 embeddings,” in Proc. Interspeech 2021, 2021, pp. 3400–3404.
[17] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith, “Don’t stop pretraining: Adapt language models to domains and tasks,” in Proceedings of ACL, 2020.
[18] W.-N. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V. Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve, and M. Auli, “Robust wav2vec 2.0: Analyzing domain shift in self-supervised pre-training,” in Proc. Interspeech 2021, 2021, pp. 721–725.
[19] M. D. Pell, “Influence of emotion and focus location on prosody in matched statements and questions,” The Journal of the Acoustical Society of America, vol. 109, no. 4, pp. 1668–1680, 2001.
[20] S. Mao, P. Ching, C.-C. J. Kuo, and T. Lee, “Advancing multiple instance learning with attention modeling for categorical speech emotion recognition,” in Proc. Interspeech 2020, 2020, pp. 2357–2361.
[21] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in European Conference on Computer Vision, 2018.
[22] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech 2019, 2019, pp. 2613–2617.
[23] S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
[24] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
[25] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
[26] C. Etienne, G. Fidanza, A. Petrovskii, L. Devillers, and B. Schmauch, “CNN+LSTM architecture for speech emotion recognition with data augmentation,” in Proc. Workshop on Speech, Music and Mind (SMM 2018), 2018, pp. 21–25.
[27] M. Chen and X. Zhao, “A multi-scale fusion framework for bimodal speech emotion recognition,” in Proc. Interspeech 2020, 2020, pp. 374–378.
[28] J. Santoso, T. Yamada, S. Makino, K. Ishizuka, and T. Hiramura, “Speech emotion recognition based on attention weight correction using word-level confidence measure,” in Proc. Interspeech 2021, 2021, pp. 1947–1951.
