Exploring Wav2vec 2.0 Fine Tuning For Improved Speech Emotion Recognition
Fig. 1: System overview of our methods. (a) Emotion state estimation phase of P-TAPT. An additional CNN with stride 2 is used to align the time steps between wav2vec and wav2vec 2.0. The resulting cluster assignments are used as pseudo-labels for the P-TAPT objective. (b) Model architecture and pretraining objective of wav2vec 2.0, along with our P-TAPT objective.
2. METHOD

We first review wav2vec 2.0, which serves as the backbone model for the methods we examine. We then present the two baseline methods we established. Finally, we introduce pseudo-label task adaptive pretraining (P-TAPT), a novel method we designed to fine-tune wav2vec 2.0 on SER.

2.1. The wav2vec 2.0 model

Wav2vec 2.0 is a transformer-based model trained to extract contextualized representations from raw audio signals. Figure 1.b shows the wav2vec 2.0 model architecture and its pretraining objective. It consists of three sub-modules: a feature encoder, a transformer module, and a quantization module. The feature encoder is a multi-layer CNN that processes the input signal into low-level features. Based on this representation, the transformer module is further applied to produce contextualized representations. The quantization module discretizes the low-level features into a trainable codebook. To train the model, part of the low-level features is masked from the transformer module, and the objective is to identify the quantized version of the masked features based on their context.2

2 There is an additional diversity loss in pretraining which promotes the diversity of the quantization codebook.
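For concreteness, the minimal sketch below shows one way to extract these contextualized representations from the pretrained wav2vec 2.0 base checkpoint with the huggingface transformers library; the checkpoint is the one cited in Section 3.2, while the audio file name and preprocessing are illustrative assumptions rather than part of our pipeline.

```python
# Minimal sketch (assumptions: `transformers` and `torchaudio` installed; file name illustrative).
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

waveform, sr = torchaudio.load("utterance.wav")                     # hypothetical input file
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = model(inputs.input_values).last_hidden_state           # (1, T, 768) frame-level features
print(frames.shape)
```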
2.2. Comparing methods

As there is no existing baseline system that fine-tunes wav2vec 2.0 on SER, we created two baselines. One is the conventional fine-tuning method, and the other is task adaptive pretraining, which was first introduced in NLP.

Vanilla fine-tuning. Wav2vec 2.0 differs from its NLP counterparts [7] in that there is no utterance-level pretraining task to naturally form a sentence representation. As a consequence, aggregation across time steps is required to fine-tune on utterance-level classification tasks. We experimented with different configurations and found that using average pooling on the final layer is simple yet effective for SER. Specifically, the final contextualized representation extracted by wav2vec 2.0 is first processed by global average pooling across the time dimension, followed by a ReLU activation and a single linear layer that predicts the emotion categories. In addition, a modified version of SpecAugment [22] proposed in wav2vec 2.0 is applied during training for better generalization. We use this architecture for the fine-tuning stage of all three methods. We abbreviate the vanilla fine-tuning method as V-FT.
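A minimal sketch of this V-FT architecture is given below, assuming the huggingface Wav2Vec2Model as the backbone; the number of emotion classes and other details are placeholders, and the SpecAugment-style masking is omitted for brevity.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class VanillaFineTuneSER(nn.Module):
    """Sketch of the V-FT head: mean-pool wav2vec 2.0 frames, ReLU, one linear layer."""

    def __init__(self, num_classes: int = 4, checkpoint: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(checkpoint)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        frames = self.encoder(input_values).last_hidden_state   # (B, T, H)
        pooled = frames.mean(dim=1)                              # global average pooling over time
        return self.classifier(torch.relu(pooled))               # (B, num_classes) logits

# Usage: logits = VanillaFineTuneSER()(batch_of_waveforms); train with nn.CrossEntropyLoss().
```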
Task adaptive pretraining. Task adaptive pretraining (TAPT) [17] is a simple but effective method for fine-tuning pretrained language models [7] on domain-specific tasks. It bridges the difference between the pretraining and target domains by continuing to pretrain on the target dataset. In this paper, we examine TAPT as one of the methods for fine-tuning wav2vec 2.0 on SER. To distinguish it from the original pretraining and fine-tuning stages, we define an intermediate task adaptation stage for the continual pretraining process.
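For illustration, a rough sketch of one such task adaptation step using the huggingface Wav2Vec2ForPreTraining class (which exposes the masked contrastive objective described in Section 2.1) is shown below; the masking hyperparameters, batching, and optimizer settings are assumptions rather than our exact configuration.

```python
import torch
from transformers import Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def task_adaptation_step(input_values: torch.Tensor) -> float:
    """One continual-pretraining step on a batch of raw target-domain audio (B, num_samples)."""
    batch_size, raw_len = input_values.shape
    seq_len = int(model._get_feat_extract_output_lengths(raw_len))
    # Sample masked frame positions and negative examples for the contrastive objective.
    mask = _compute_mask_indices((batch_size, seq_len), mask_prob=0.065, mask_length=10)
    negatives = _sample_negative_indices((batch_size, seq_len),
                                         model.config.num_negatives,
                                         mask_time_indices=mask)
    mask = torch.tensor(mask, dtype=torch.bool, device=input_values.device)
    negatives = torch.tensor(negatives, dtype=torch.long, device=input_values.device)

    loss = model(input_values,
                 mask_time_indices=mask,
                 sampled_negative_indices=negatives).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```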
2.3. Pseudo-label task adaptive pretraining

While TAPT adapts to emotive speech by continual training with the pretraining objective, it does not make use of emotion labels. Essentially, the contextualized representations obtained will be general features suitable for various downstream tasks. As we focus only on SER, we propose to adapt this objective to generate emotion-specific features. Instead of identifying the missing low-level features, we focus on predicting the emotion state of the masked sequence. One advantage this brings is better data efficiency: reconstructing missing audio is a more complicated task, which makes the model vulnerable to over-fitting. Additionally, it simplifies the fine-tuning stage, as it already filters out information unrelated to emotion recognition from the contextualized representation.
However, frame-level emotion states need to be recognized to realize our method. While only utterance-level emotion labels are given for most SER datasets, several studies [15, 1, 20] indicate that frame-level emotion information can still be inferred by training with a segment-based classification objective. In particular, as shown in Figure 1.a, we fine-tune wav2vec to extract frame-level emotion representations that are useful for predicting an utterance-level emotion label. We find that using CNN architectures such as wav2vec is important, since the locality of the CNN preserves sequential structure. After training, we run the k-means clustering algorithm [23] on all of the extracted representations from the target dataset. As Caron et al. [21] have shown, the k-means cluster assignments on intermediate layers of CNN classifiers can capture information related to the target labels. Therefore, we interpret this cluster assignment as a pseudo-label that represents the local emotion state.
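A minimal sketch of this clustering step, assuming the frame-level emotion representations have already been extracted into a single array (file names and the number of clusters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# frame_features: (num_frames_in_dataset, feature_dim) frame-level emotion
# representations from the segment-based classifier (assumed precomputed).
frame_features = np.load("frame_features.npy")           # hypothetical file

kmeans = KMeans(n_clusters=64, random_state=0).fit(frame_features)
pseudo_labels = kmeans.labels_                            # one cluster id per frame
np.save("pseudo_labels.npy", pseudo_labels)               # later used as P-TAPT targets
```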
We then replace the TAPT objective with our new P-TAPT objective. We add a position-wise linear head composed of two linear layers to predict the k-means cluster assignments of the masked frames. In practice, we run k-means clustering multiple times with different numbers of clusters, and our model predicts an ensemble of cluster assignments with multiple linear heads. This cluster ensemble technique has been shown to facilitate representation learning in HuBERT [11], a recent self-supervised speech representation learning model.
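The sketch below illustrates the resulting masked pseudo-label prediction loss for a single clustering; the cluster ensemble simply sums this loss over several such heads. Tensor names and sizes are assumptions rather than our exact implementation.

```python
import torch
import torch.nn as nn

class PseudoLabelHead(nn.Module):
    """Position-wise head: two linear layers mapping each frame to cluster logits."""

    def __init__(self, hidden_size: int = 768, num_clusters: int = 64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU(),
                                  nn.Linear(hidden_size, num_clusters))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_states)                  # (B, T, num_clusters) from (B, T, H)

def p_tapt_loss(logits: torch.Tensor, pseudo_labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on masked frames only.
    logits: (B, T, C); pseudo_labels: (B, T) cluster ids; mask: (B, T) bool for masked frames."""
    return nn.functional.cross_entropy(logits[mask], pseudo_labels[mask])
```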
3. EXPERIMENTAL SETUP

3.1. Dataset

We use two datasets for evaluation, IEMOCAP [4] and SAVEE [5]. We only use the speech modality.

IEMOCAP. Interactive Emotional Dyadic Motion Capture (IEMOCAP) is a popular dataset for evaluating SER systems. It contains five recording sessions, each with one male speaker and one female speaker. To compare with previous works, we use the default labels provided by IEMOCAP. However, only four emotion categories are considered: neutral, sad, angry, and happy. In particular, the "excited" category is merged with "happy" due to its sparsity in the dataset. The total amount of speech is about 7 hours.

SAVEE. The Surrey Audio-Visual Expressed Emotion (SAVEE) dataset contains four male speakers: DC, JE, JK, and KL. Each speaker reads out the same set of 120 sentences5 labeled with one of 7 emotion categories: angry, disgust, sad, fear, happy, surprise, and neutral. We use all of the emotion categories, which results in 480 utterances with a total of 30 minutes of speech.

5 Two-thirds of the sentences are specific to one emotion and shared across all speakers.

3.2. Training and evaluation procedure

All experiments use the same learning rate of 1 × 10−4 with the Adam optimizer [24]. For the wav2vec model, we use a pretrained model developed by Facebook AI3. We build our wav2vec 2.0 implementation on top of the huggingface implementation and adopt a pretrained model checkpoint from Facebook AI4. Both models are pretrained on the unsupervised speech of LibriSpeech 960h [25] without transcriptions. We evaluate our systems using unweighted accuracy (UA) [2] under a speaker-independent setting; the speakers in the test set are excluded from the training data. Additional implementation details are provided in our github repository. We run each experiment 5 times for the full IEMOCAP dataset and 20 times for SAVEE and for sub-sampled versions of IEMOCAP. Additionally, we observe that wav2vec 2.0 fails to converge with some of the random seeds. Therefore, we discard and rerun outlier runs where the performance is outside two standard deviations from the mean.

3 https://github.com/pytorch/fairseq/tree/master/examples/wav2vec
4 https://huggingface.co/facebook/wav2vec2-base

IEMOCAP. To have a fair comparison with the majority of previous works, we split the dataset by leaving one session out as the test set; the remaining four sessions are used for training. Note that most of the papers using IEMOCAP do not explicitly define their validation set [26]. We therefore train with all four sessions for a fixed 15 epochs without validation, using a batch size of 64. The number of epochs is chosen so that each of our competing methods converges in terms of training loss.

SAVEE. A similar evaluation procedure is used for SAVEE. In each fold, one speaker is left out for testing and the remaining three are used for training. We increase the number of training epochs to 30, and the batch size is halved to 32 for the smaller training set.
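For clarity, unweighted accuracy is the recall averaged over emotion classes; a minimal sketch, assuming predictions and labels for one test fold are available as arrays:

```python
import numpy as np

def unweighted_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Unweighted accuracy (UA): per-class recall averaged over classes,
    so rare classes count as much as frequent ones."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

# Example with toy labels for one leave-one-session-out fold: prints 0.75.
print(unweighted_accuracy(np.array([0, 0, 1, 2, 2, 3]),
                          np.array([0, 1, 1, 2, 0, 3])))
```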
4. RESULTS AND DISCUSSION

4.1. Comparison of fine-tuning methods

Table 1 compares the performance of the fine-tuning methods on IEMOCAP. For all sessions except the first, TAPT yields a noticeable improvement over V-FT, and P-TAPT performs better than TAPT for all sessions. On the other hand, Table 2 shows that on SAVEE, both TAPT and P-TAPT outperform V-FT by a large margin. However, the performance of P-TAPT is very close to that of TAPT on SAVEE. We analyze these results by considering the characteristics of SAVEE and IEMOCAP.

Domain shift and linguistic content. We first quantify the domain shift between both datasets and the pretraining dataset. We take the wav2vec 2.0 model pretrained on LibriSpeech and calculate the pretraining loss on both datasets along with the test set of LibriSpeech. Table 3 verifies the presence of domain shift on both datasets, providing room for TAPT to improve. A smaller loss indicates that SAVEE is closer to LibriSpeech, as the model can already generalize well to SAVEE. However, this improvement is larger on SAVEE than IEMOCAP despite the smaller domain shift. We observe a strong correlation between linguistic content and
Table 1: Comparison of methods on IEMOCAP in UA(%)

Session    1      2      3      4      5      Mean
V-FT       71.0   76.2   66.3   68.7   67.3   69.9
TAPT       71.8   79.6   70.2   73.2   72.5   73.5
P-TAPT     72.8   80.2   71.0   73.6   73.7   74.3

Table 2: Comparison of methods on SAVEE in UA(%)

Table 4: Comparison of methods on data efficiency in UA(%) on session 5 of IEMOCAP

4.2. Comparison with prior works

Table 5 compares our performance on IEMOCAP to that of existing SOTA models. We only include methods that evaluate under speaker-independent settings. Simply fine-tuning the wav2vec 2.0 model (using V-FT) outperforms wav2vec 2.0 without fine-tuning [16] by 3.6% absolute UA. The P-TAPT method provides a 7.4% absolute improvement over SOTA models on IEMOCAP. We also show performance for methods that use both speech and text [27, 28]; our audio-only method appears comparable.

6 We identify it as a feature rather than a model architecture, since they only use wav2vec/wav2vec 2.0 as a feature extractor without fine-tuning.
7 Referring to the text features extracted from ALBERT [8] and BERT [7].

6. ACKNOWLEDGEMENTS

We are grateful to PwC USA as well as to The Digital Transformation and Innovation Center at Carnegie Mellon University for supporting our research. We thank Yangyang Xia and Richard M. Stern for the discussions and feedback.
7. REFERENCES

[1] H. M. Fayek, M. Lech, and L. Cavedon, "Evaluating deep learning architectures for speech emotion recognition," Neural Networks, vol. 92, pp. 60–68, 2017.

[2] K. Han, D. Yu, and I. Tashev, "Speech emotion recognition using deep neural network and extreme learning machine," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[3] Y. Zhang, J. Du, Z. Wang, J. Zhang, and Y. Tu, "Attention based fully convolutional network for speech emotion recognition," in 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2018, pp. 1771–1775.

[4] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.

[5] P. Jackson and S. ul Haq, "Surrey Audio-Visual Expressed Emotion (SAVEE) database," 2011.

[6] S. R. Livingstone and F. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English," PLoS ONE, vol. 13, 2018.

[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL-HLT, 2019.

[8] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," in International Conference on Learning Representations, 2020.

[9] S. Schneider, A. Baevski, R. Collobert, and M. Auli, "wav2vec: Unsupervised pre-training for speech recognition," in Proc. Interspeech 2019, 2019, pp. 3465–3469.

[10] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 12449–12460.

[11] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," 2021.

[12] Z. Fan, M. Li, S. Zhou, and B. Xu, "Exploring wav2vec 2.0 on speaker verification and language identification," in Proc. Interspeech 2021, 2021, pp. 1509–1513.

[13] L. Peng, K. Fu, B. Lin, D. Ke, and J. Zhan, "A study on fine-tuning wav2vec2.0 model for the task of mispronunciation detection and diagnosis," in Proc. Interspeech 2021, 2021, pp. 4448–4452.

[14] J. Boigne, B. Liyanage, and T. Östrem, "Recognizing more emotions with less data using self-supervised transfer learning," ArXiv, vol. abs/2011.05585, 2020.

[15] Y. Xia, L.-W. Chen, A. Rudnicky, and R. M. Stern, "Temporal context in speech emotion recognition," in Proc. Interspeech 2021, 2021, pp. 3370–3374.

[16] L. Pepino, P. Riera, and L. Ferrer, "Emotion recognition from speech using wav2vec 2.0 embeddings," in Proc. Interspeech 2021, 2021, pp. 3400–3404.

[17] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith, "Don't stop pretraining: Adapt language models to domains and tasks," in Proceedings of ACL, 2020.

[18] W.-N. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V. Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve, and M. Auli, "Robust wav2vec 2.0: Analyzing domain shift in self-supervised pre-training," in Proc. Interspeech 2021, 2021, pp. 721–725.

[19] M. D. Pell, "Influence of emotion and focus location on prosody in matched statements and questions," The Journal of the Acoustical Society of America, vol. 109, no. 4, pp. 1668–1680, 2001.

[20] S. Mao, P. Ching, C.-C. J. Kuo, and T. Lee, "Advancing multiple instance learning with attention modeling for categorical speech emotion recognition," in Proc. Interspeech 2020, 2020, pp. 2357–2361.

[21] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, "Deep clustering for unsupervised learning of visual features," in European Conference on Computer Vision, 2018.

[22] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. Interspeech 2019, 2019, pp. 2613–2617.

[23] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.

[24] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.

[25] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.

[26] C. Etienne, G. Fidanza, A. Petrovskii, L. Devillers, and B. Schmauch, "CNN+LSTM architecture for speech emotion recognition with data augmentation," in Proc. Workshop on Speech, Music and Mind (SMM 2018), 2018, pp. 21–25.

[27] M. Chen and X. Zhao, "A multi-scale fusion framework for bimodal speech emotion recognition," in Proc. Interspeech 2020, 2020, pp. 374–378.

[28] J. Santoso, T. Yamada, S. Makino, K. Ishizuka, and T. Hiramura, "Speech emotion recognition based on attention weight correction using word-level confidence measure," in Proc. Interspeech 2021, 2021, pp. 1947–1951.