1. Introduction
In recent years, music emotion recognition (MER) has attracted widespread interest, stimulated by the growing demand for the management of massive music resources. MER is considered a useful auxiliary tool for music information retrieval and organization [
1], music recommendation systems [
2,
3], automatic music composing [
4,
5], and so on. Using manual methods to obtain music emotion labels can be time-consuming, labor-intensive, and error-prone. Therefore, the research field of automatically recognizing emotion labels has come into being.
Automatic MER is the process of using computers to extract and analyze music features, establish the mapping relations between music features and the emotion space, and recognize the emotion that music expresses [
6,
7]. The existing MER methods can be divided into two categories, regression and classification, according to the emotion model adopted. The former expresses human internal emotions by a position in a continuous emotion space, while the latter selects finite discrete emotional labels to classify music. In this paper, we focus on the music emotion classification task and use Russell’s circumplex emotional model [
8] to label music emotions based on the four quadrants of the valence-arousal space.
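For clarity, the four quadrant labels used throughout this paper correspond to the signs of valence and arousal. The short sketch below illustrates this mapping; the neutral origin used as the decision boundary and the example emotion words are illustrative assumptions, not part of any dataset annotation protocol.

```python
# Illustrative mapping from a (valence, arousal) point to one of the four
# quadrants of Russell's circumplex model. Thresholds and example emotion
# words are assumptions for illustration only.
def quadrant_label(valence: float, arousal: float) -> str:
    if valence >= 0 and arousal >= 0:
        return "Q1 (high valence, high arousal): e.g., happy, excited"
    if valence < 0 and arousal >= 0:
        return "Q2 (low valence, high arousal): e.g., angry, tense"
    if valence < 0 and arousal < 0:
        return "Q3 (low valence, low arousal): e.g., sad, depressed"
    return "Q4 (high valence, low arousal): e.g., calm, relaxed"
```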
Since MER requires the establishment of mapping relations between music features and emotion space, the effective extraction of emotion-related features is the key to accurate classification. In early research, most studies used handcrafted features (such as pitch, tone, intensity, rhythm, etc.) and traditional machine learning methods such as the support vector machine (SVM) [
9,
10], k-nearest neighbor (KNN) [
11], Bayesian network [
12], Gaussian mixture model (GMM) [
13], decision tree (DT) [
14], etc., to classify music emotions. These methods complete feature extraction and emotion recognition separately, with no further processing or refinement of the original handcrafted musical features. Consequently, traditional machine learning-based MER methods require significant manual feature-engineering effort, and because the handcrafted features may miss emotion-relevant information, they tend to achieve limited classification accuracy.
With the rapid development of artificial intelligence, deep learning-based music emotion recognition is gradually becoming mainstream, and it significantly contributes to improving classification accuracy by using multi-layer representation and abstract learning [
6]. Compared with traditional machine learning methods, deep learning-based methods reduce the burden of manual feature extraction by learning music features automatically during the training procedure. In MER studies, most works are convolutional neural network (CNN)- or recurrent neural network (RNN)-based models [
15,
16,
17,
18,
19]. CNN-based methods mimic the visual perception of living creatures and can learn feature representations from data effectively. RNN is a sequential model and is good at processing sequence data, so it is widely used in dimensional MER tasks and dynamic emotion detection [
17,
18,
20]. In order to combine the advantages of CNNs and RNNs, some works also propose frameworks that combine the CNN and RNN (including its variants such as Bi-RNN and LSTM) together to strengthen the ability to learn useful features [
21,
22].
Although deep learning methods have become increasingly popular in recent years, they still face some challenges and limitations when performing music emotion recognition. For example, traditional CNN and RNN models do not consider the influence of features with different spatial positions or time intervals, and they treat all the features equally. Fortunately, the emergence of the attention mechanism has largely addressed this issue. The main idea of the attention mechanism is to introduce a dynamic weighting mechanism within the network, allowing the model to perform weighted calculations for different parts of the input, thus enabling the network to focus more on important information while ignoring irrelevant information [
23]. Thus far, various attention-based models have been proposed to learn emotion-related information about music by letting models focus on the important parts of features to achieve better performance [
24,
25,
26,
27,
28]. Another challenge for traditional neural networks is the potential loss of key information and performance degradation as the network depth increases; therefore, residual learning, which uses skip connections to prevent the gradient from vanishing, has been introduced [
29] and has also been employed in emotion recognition tasks [
30,
31].
Motivated by existing works, we propose an attention-based spatial-temporal feature extraction approach for music emotion classification and have called it the FFA-BiGRU model. In the proposed method, we first input the log Mel-spectrogram of music clips into the FFA (feature fusion attention) [
32] module to extract high-level spatial features of the music audio. The FFA module consists of three group architecture blocks, each of which is a stack of channel-spatial attention-based convolutional residual blocks. The three group architecture blocks fully extract the multi-scale spatial features of music clips, and the channel-spatial attention mechanism focuses on the information critical to emotion classification from both the channel and spatial aspects. The output features of the three group architecture blocks are then fused together through channel-spatial attention. In order to further capture the sequence characteristics of music audio, a bidirectional gated recurrent units (BiGRU) module is employed after the FFA module to learn the temporal features. Then, the output feature maps of the FFA and those of the BiGRU are concatenated in the channel direction. Finally, the concatenated features are passed through fully connected layers to predict the emotion classification results.
In summary, the main contributions of this paper are as follows:
- (1)
We propose an end-to-end spatial-temporal feature extraction method for MER called FFA-BiGRU. The proposed model fully considers the spatial and temporal properties of the log Mel-spectrogram features of music audios and extracts rich emotion-related spatial-temporal features through the combination of the FFA and BiGRU modules. The experimental results show that the integration of FFA and BiGRU is effective and can achieve a better classification performance than the existing baselines.
- (2)
In the proposed FFA-BiGRU model, we extract multi-scale spatial features through three group architecture blocks, and each of them is a stack of multiple channel-spatial attention-based residual blocks. The channel-spatial attention mechanism can effectively highlight features that are critical to music emotion classification at both the channel and spatial levels. Moreover, the concatenation of spatial features from FFA and the temporal features from the BiGRU can fully retain the spatial and temporal features of music, which can discriminate emotions well in the final emotion space.
- (3)
Finally, we conduct extensive comparison experiments and an ablation study on the EMOPIA dataset [
33], and the results demonstrate the effectiveness of the network architecture as well as of each component of the approach, including the channel-spatial attention mechanism, the optimal number of group architecture blocks, and the optimal number of network layers in these blocks.
The remainder of this paper is organized as follows. Related works of this paper are given in
Section 2.
Section 3 illustrates the architectural details of the proposed model. Extensive experiments are conducted to demonstrate the effectiveness and efficiency of our method in
Section 4. Finally,
Section 5 concludes this paper and discusses several future research directions.
2. Related Works
In general, MER methods are classified into two types: traditional machine learning approaches and deep learning approaches. Meanwhile, with the emergence of the attention mechanism, it has been widely adopted in deep learning models and has improved MER accuracy. Therefore, machine learning methods and deep learning methods (including those integrated with the attention mechanism) are discussed in detail in the following subsections.
2.1. Machine Learning Method for MER
For music emotion recognition, commonly used machine learning approaches include SVM, the Gaussian mixture model, the KNN method, naïve Bayes, and so on. The representative works using traditional machine learning methods are summarized in the first half of
Table 1. Specifically, the authors in [
11] used rhythm patterns as music features and applied KNN and a self-organizing map (SOM) to predict emotions for a set of children’s songs. Lu et al. [
13] proposed a hierarchical Gaussian mixture model to automate the task of mood detection, where three types of music features—intensity, timbre, and rhythm—are extracted to represent the characteristics of a music clip. The hierarchical framework can emphasize the most suitable features in different detection tasks. Kim et al. [
34] executed music emotion classification based on the lyrics using three machine learning methods: naïve Bayes, the hidden Markov model, and SVM. The results showed that the classification performance of the SVM method was optimal. Malheiro et al. [
35] used SVM to evaluate musical emotions based on three novel lyric features: slang presence, structural analysis features, and semantic features. Hu et al. [
36] collected physiological signals from wearable devices, trained four classification models, and found that the KNN model achieved the best performance. Xu et al. [
37] introduced source separation into a standard music emotion recognition system and extracted a combined 84-dimensional feature vector, consisting of a 72-dimensional acoustic feature vector and a 12-dimensional chroma feature vector. Lastly, an SVM classifier was employed for emotion prediction, and the experimental results verified that source separation can effectively improve MER performance.
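To make the contrast with the deep learning approaches of Section 2.2 concrete, the sketch below outlines this classical pipeline: a matrix of pre-extracted handcrafted features fed to an SVM classifier. The file names, feature dimensions, and hyperparameters are placeholders rather than the settings of the cited works.

```python
# Minimal sketch of a traditional MER pipeline: handcrafted features + SVM.
# "handcrafted_features.npy" and "emotion_labels.npy" are hypothetical files
# holding an (n_clips, n_features) matrix and the corresponding quadrant labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.load("handcrafted_features.npy")
y = np.load("emotion_labels.npy")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```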
Each of these traditional machine learning methods has inherent strengths and limitations. On the one hand, machine learning methods explicitly extract features from music data, allowing for a clear understanding of which features are crucial for music emotion recognition tasks. On the other hand, machine learning methods also have a significant drawback: manually extracted features may not cover the full spectrum of relevant characteristics, and there is a lack of further feature extraction beyond the initial selection.
2.2. Deep Learning Method and Attention Mechanism for MER
With the rapid development of deep learning, significant work has been conducted for MER by constructing various deep learning network architectures, and the accuracy has been greatly improved in recent years. The most widely used network is the convolutional neural network, which emulates biological visual perception and can effectively extract feature representations from music spectrogram data. For example, a deep CNN on music spectrograms is proposed for music emotion classification in [
15]. By using the proposed method, no additional effort is required to extract specific features, which is left to the training procedure of the CNN model. Keelawat et al. [
38] used electroencephalogram (EEG) as the input feature to realize emotion recognition during music listening. CNNs with three to seven convolutional layers were employed in the research, and a binary classification task was measured. Yang et al. [
19] applied the constant-Q transform on music objects to derive the spectrogram and then took the spectrogram as the input of the CNN model to predict the dimensional emotion of music objects.
In addition, the BiLSTM (bidirectional long short-term memory) model, as a two-way recurrent neural network with long short-term memory, is often used in emotion classification tasks due to its ability to process sequence data and maintain long-term memory. Weninger et al. [
20] employed a deep RNN structure for online continuous-time music mood regression. The study first extracted a large set of segmental acoustic features and then performed multi-variate regression using deep recurrent neural networks. The results showed that the deep RNN outperformed SVR and feedforward neural networks in both continuous-time and static music mood regression. In [
17], a deep bidirectional long short-term memory (DBLSTM)-based multi-scale regression method was proposed for dynamic music emotion prediction, in which a fusion component was introduced to integrate the outputs of all DBLSTM models with different scales. The experimental results show that the proposed method achieves a significant improvement when compared with state-of-the-art methods. The representative works using deep learning methods are also summarized in
Table 1.
More recently, the attention mechanism has also been combined with the deep learning methods, further achieving accuracy improvements for the MER tasks. In [
24], the authors proposed multi-scale context-based attention (MCA) using LSTM for dynamic music emotion prediction. The proposed MCA mechanism pays different attention to the previous contexts of different time scales of music, and multi-scale models fused with attention can learn the deep representations of music structure dynamically and lead to better performance. In [
39], a structure that combines 3D convolutions and attention-based sliding recurrent neural networks (ASRNNs) was proposed for speech emotion recognition, in which the 3D convolution model was proposed to obtain both the local features and periodicity information of emotional speech and the ASRNN was used to extract the continuous segment-level internal representations and focus on the salient emotion regions using a temporal attention model. The authors in [
40] proposed two attention-based methods based on a VGG-ish architecture for the task of music emotion recognition. The first attention method used self-attention to replace the spatial convolutions in later layers of the VGG-ish network, and the second method used element-wise attention-based rectified linear units (ReLUs) in all the layers of the baseline VGG-ish network. The experimental results show that the first method can match the baseline performance with fewer computations and parameters, and the second method can achieve a better performance than the baseline without increasing the number of parameters.
In [
25], a novel attention-based joint feature extraction model was proposed for static MER. It utilizes the CNN to learn emotion-related features through the filter bank and log Mel-spectrogram and further uses location-aware attention and self-attention [
23] mechanisms to obtain salient emotion-related features. The authors in [
27] proposed an end-to-end attention-based deep feature fusion (ADFF) approach for MER. The proposed model first uses an adapted VGGNet as a spatial feature learning module and then uses a squeeze-and-excitation (SE) attention-based [
41] temporal feature learning module to obtain multi-level emotion-related spatial-temporal features. The experiments show that the combination of a spatial-temporal feature extractor and SE attention can achieve a better performance than the state-of-the-art model. In [
28], a short-chunk CNN model with multi-head self-attention, called SCMA, and a BiLSTM model with multi-head self-attention, called BiLMA, are proposed for MER. It shows that the multi-head self-attention mechanism can effectively capture relevant information from features for emotion recognition tasks.
Moreover, there are also some studies using a multimodal network to predict music emotion by combining various features such as symbolic, acoustic, and lyric features. In particular, a multimodal neural network for MER is proposed in [
26], where audio features, lyric features, and context features are extracted separately and fused by a cross-modal attention mechanism. Similarly, a multimodal multifaceted MER method is proposed in [
30], where symbolic and acoustic features are extracted from both MIDI and audio data and integrated with a self-attention mechanism. In [
42], the authors propose an end-to-end one-dimensional residual temporal and channel attention network (RTCAN-1D) to fuse the subject’s individual EDA features and the external evoked music features. The experiments show that the proposed method outperforms the existing state-of-the-art models. The representative works utilizing attention mechanisms are also summarized in
Table 1.
Overall, compared with traditional machine learning models, deep learning methods, along with their combination with attention mechanisms, can automatically extract more useful features for MER and lead to significant improvements in MER accuracy.
Table 1.
Representative works of MER.
Method | Reference | Year | Input Features | Learning Model |
---|---|---|---|---|
Machine learning | [13] | 2006 | Intensity, timbre, and rhythm | GMM |
 | [11] | 2010 | Rhythm patterns | KNN, SOM |
 | [34] | 2011 | Emotion vocabulary | NB, HMM, SVM |
 | [37] | 2014 | Acoustic feature, chroma feature | SVM |
 | [35] | 2018 | Slang presence, structural analysis features, semantic features | SVM |
 | [36] | 2018 | Physiological signals | SVM, NB, KNN, and DT |
Deep learning methods | [20] | 2014 | A large set of acoustic features | Deep RNN |
 | [17] | 2016 | Low-level acoustic features | BiLSTM |
 | [15] | 2017 | Spectrogram | CNN |
 | [38] | 2019 | EEG | CNN |
 | [19] | 2020 | Spectrogram | CNN |
Deep learning + attention | [24] | 2017 | MFCCs, spectral flux, centroid, entropy, slope, etc. | LSTM + MCA |
 | [25] | 2021 | Log Mel-spectrogram and filter bank spectrogram | CNN + local attention and global self-attention + GRU-SVM |
 | [27] | 2022 | Log Mel-spectrogram | VGGNet + SE attention + BiLSTM |
 | [42] | 2022 | EDA, external evoked music features | Residual channel-temporal attention |
 | [26] | 2022 | Mel-spectrogram, lyrics, track name, and artist | CNN + cross-modal attention |
 | [28] | 2023 | Mel-spectrogram / MIDI-like representation | Short-chunk CNN / BiLSTM + multi-head self-attention |
 | [30] | 2023 | MIDI-like representation, MFCCs, chromagram | BiGRU + CNN + multi-head self-attention |
3. Proposed FFA-BiGRU Method
For music emotion recognition, past works have shown that spectral features play an important role in identifying the emotion [
27,
28]; moreover, a spectrogram is a good representation of an audio clip, as it contains all the physical information of the original audio [
43]. Therefore, we choose the spectrogram as the model input, since it summarizes spectral information in a concise form. Furthermore, the frequency scale of the spectrogram is converted from a linear scale to a mel scale, which better resembles the human auditory system; as a result, the log Mel-spectrogram of music audio is used as the model input in this paper. The log Mel-spectrogram is a pictorial representation and, at the same time, reflects the temporal changes of frequency content; that is, it has both spatial and temporal characteristics. Therefore, we propose a spatial-temporal feature extraction model, called FFA-BiGRU, for music emotion classification.
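For reference, the log Mel-spectrogram input can be computed as in the sketch below (using librosa; the sampling rate, FFT size, hop length, and number of Mel bands shown here are illustrative assumptions rather than the settings used in our experiments).

```python
# Sketch: computing the log Mel-spectrogram of a music clip with librosa.
# All parameter values are illustrative assumptions.
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=22050)   # hypothetical audio file
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128
)
log_mel = librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, n_frames)
```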
The proposed model is shown in
Figure 1, which consists of three modules: a multi-level spatial feature learning module, a temporal feature learning module, and an emotion prediction module. Specifically, the spatial feature learning module is an attention-based convolutional residual network called FFA [
32], which mainly consists of three group architecture (GA) blocks to learn multi-scale spatial features. Each group architecture is a global residual block that combines 19 basic blocks with local residual learning, and each basic block combines residual learning with a channel-spatial attention module. To further capture the temporal features of music audio, the BiGRU module is introduced after the spatial feature learning module to capture the sequential features. Then, the output feature maps of FFA and those of the BiGRU are concatenated in the channel direction. Finally, the fully connected (FC) layer is used to provide the final emotion prediction result. In the following subsections, we will provide the network details of each module in the proposed model.
3.1. Multi-Level Spatial Feature Learning
In reference [
32], the channel-spatial attention-based feature fusion module, FFA, was proposed and has shown great effectiveness in image processing. In order to extract spatial features from the log Mel-spectrogram of music audio, we adopt the architecture of the FFA module as the multi-level spatial feature learning module in our method.
As shown in
Figure 1, the spatial feature learning module first passes the log Mel-spectrogram of music audio into a convolutional layer containing 64 filters with a filter size of 3 × 3 to perform shallow feature extraction. It then feeds the output into three group architecture blocks to extract multi-level deep spatial features. The output features from the three group architecture blocks are first concatenated in the channel direction and further fused together through channel-spatial attention. After that, the fused features are passed through two convolutional layers with a filter size of 3 × 3 to further extract the spatial features. To avoid losing useful information about the original log Mel-spectrogram, a global skip connection is introduced between the original input and the final output of the spatial feature learning module.
In the spatial feature learning module, group architecture is a key component for extracting multi-level deep spatial music features. In particular, each group architecture consists of 19 basic blocks with local residual learning, and each basic block is a convolutional residual subnet with channel-spatial attention. Therefore, in the following subsections, we will introduce channel-spatial attention, group architecture (including its basic blocks), and the spatial feature fusion strategy successively.
3.1.1. Channel-Spatial Attention
Since the log Mel-spectrogram of music is a pictorial representation, we believe that different spatial regions have varying degrees of importance for emotion classification. Meanwhile, different feature maps from the CNN-based spatial feature learning module also play distinct roles in emotion recognition. Therefore, the group architecture subnet employs channel-spatial attention to obtain importance at both the channel and spatial levels, as shown in
Figure 2. The channel-spatial attention weights are calculated as follows.
Channel Attention (CA) Channel attention is mainly used to expand the representation ability of the spatial feature learning module by assigning different weights to each feature map. This allows it to learn the importance of each channel adaptively, thereby better capturing the emotion information. The realization of the CA block is shown in
Figure 2. First, the input feature maps $F \in \mathbb{R}^{C \times H \times W}$ are fed into a global average pooling layer to squeeze the global spatial information into a channel descriptor $g \in \mathbb{R}^{C \times 1 \times 1}$, whose $c$-th element is given as:

$$g_c = H_p(F_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_c(i, j),$$

where $F_c$ refers to the $c$-th channel feature map of $F$, $F_c(i, j)$ represents the value of $F_c$ at the position $(i, j)$, and $H_p(\cdot)$ denotes the global average pooling function. After global average pooling, the feature shape changes from $C \times H \times W$ to $C \times 1 \times 1$.
Next, to obtain the channel attention weights, the average pooled feature maps are fed into two convolutional layers, followed by a ReLU activation function and a sigmoid activation function, respectively. The channel attention weights are finally given by:

$$CA = \sigma\big(\mathrm{Conv}\big(\delta(\mathrm{Conv}(g))\big)\big),$$

where $\sigma$ represents the sigmoid function and $\delta$ denotes the ReLU function. Both Conv layers utilize a 1 × 1 convolution: the first compresses the channel dimension, and the second restores it to the original number of channels $C$ (for each basic block, $C = 64$).

Finally, the output of the CA block is obtained by scaling the input feature maps $F$ with the attention weight vector $CA$ as follows:

$$F^{*} = CA \otimes F,$$

where $\otimes$ denotes element-wise multiplication, and the attention values are broadcast along the spatial dimension during multiplication.
Spatial Attention (SA) Considering that different regions of the music audio features may have different importance for emotion classification, the output from the CA block is subsequently fed into a spatial attention (SA) block. This allows the network to focus more on the important regions.
As shown in Figure 2, the SA weight is computed by feeding the output $F^{*}$ of the CA block into two convolutional layers, the first followed by a ReLU activation function and the second by a sigmoid activation function. Both convolutional layers use 1 × 1 convolution filters, the first compressing the channel dimension and the second having a single filter, so the dimension of the SA weight is $1 \times H \times W$. The SA weight is given by:

$$SA = \sigma\big(\mathrm{Conv}\big(\delta(\mathrm{Conv}(F^{*}))\big)\big).$$

Finally, the output of the SA block is obtained by scaling the input feature maps $F^{*}$ with the attention weight map $SA$ as follows:

$$\tilde{F} = SA \otimes F^{*},$$

where $\otimes$ also denotes element-wise multiplication, and the SA weight values are broadcast along the channel dimension during this multiplication.
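The following sketch illustrates the channel-spatial attention block described above (written in PyTorch for concreteness; the framework choice and the channel reduction ratio are our illustrative assumptions).

```python
# Sketch of the channel-spatial attention block (CA followed by SA).
# The reduction ratio r and the use of PyTorch are illustrative assumptions.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels: int = 64, r: int = 8):
        super().__init__()
        # Channel attention: GAP -> 1x1 Conv -> ReLU -> 1x1 Conv -> Sigmoid
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: 1x1 Conv -> ReLU -> 1x1 Conv (1 filter) -> Sigmoid
        self.sa = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        f_star = f * self.ca(f)           # channel-wise reweighting (broadcast over H, W)
        return f_star * self.sa(f_star)   # spatial reweighting (broadcast over channels)
```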
3.1.2. Group Architecture Block: Deep Spatial Feature Extractor
For the FFA module in
Figure 1, the log Mel-spectrogram is first fed into a convolution layer to extract the shallow information of music; then, the output features are sent into three group architecture blocks to extract multi-level deep spatial features. Each group architecture block is a global residual block that contains 19 basic blocks with the same structure, as shown in
Figure 1. The stacking of multiple basic blocks increases the depth and enhances the expressiveness of the network to learn high-level emotion-related information. A long shortcut connection is introduced between the first and the last basic blocks to avoid losing useful information. Each of the three group architecture modules can learn spatial features of different levels. Since each group architecture block consists of a stack of multiple basic blocks, we provide a detailed introduction of the basic block in the following.
Basic Block: The basic subnet in group architecture is the basic block, and its detailed realization is given in
Figure 1. It is a local residual learning structure consisting of two convolutional layers, the first followed by a ReLU layer and the second by a channel-spatial attention module. Local residual learning improves training stability and allows less important information to be bypassed through the local residual connections while the main network focuses on effective information. In the basic block, the two convolutional layers (each containing 64 filters with a filter size of 3 × 3) are used to extract local spatial features, and the channel-spatial attention mechanism allows the network to focus on important information at both the channel and spatial levels.
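A compact sketch of the basic block and of a group architecture block is given below; it reuses the ChannelSpatialAttention class from the previous sketch, and any detail not specified in this section (e.g., the closing convolution before the long skip connection) is an illustrative assumption.

```python
# Sketch of the basic block (local residual learning + channel-spatial attention)
# and the group architecture block (19 stacked basic blocks + long skip connection).
# Assumes the ChannelSpatialAttention class from the previous sketch.
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.attn = ChannelSpatialAttention(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(self.relu(self.conv1(x)))
        out = self.attn(out)
        return out + x                                   # local residual connection

class GroupArchitecture(nn.Module):
    def __init__(self, channels: int = 64, n_blocks: int = 19):
        super().__init__()
        self.blocks = nn.Sequential(*[BasicBlock(channels) for _ in range(n_blocks)])
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(self.blocks(x)) + x             # long (global) skip connection
```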
3.1.3. Spatial Feature Fusion Strategy
Channel-spatial attention-based feature fusion: Each GA block extracts deep spatial features at a different level. To make full use of all the emotion-related information, we first concatenate the output spatial feature maps of the three group architecture blocks in the channel direction, as shown in
Figure 1. Then, we multiply the output feature maps of the three group architecture blocks by the corresponding channel attention weights and further fuse these three parts of the weighted feature maps by element-wise summation. The fused features are further passed into the spatial attention module to assign different weights for the feature maps from the spatial level. The channel and spatial attention here employ the same structure as that in
Figure 2, except that the numbers of filters in the two Conv layers of the channel attention module become 4 and 192, respectively. Finally, the output features of the channel-spatial attention are passed through two Conv layers (containing 64 and 1 filters, respectively, each with a filter size of 3 × 3) to learn the high-level spatial features further.
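The attention-based fusion of the three group architecture outputs can be sketched as follows (PyTorch; the filter numbers 4 and 192 follow the description above, while the spatial attention reduction and the activation between the two closing Conv layers are illustrative assumptions).

```python
# Sketch of the channel-spatial attention-based fusion of the three group
# architecture outputs. Unspecified details are illustrative assumptions.
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, channels: int = 64, n_groups: int = 3):
        super().__init__()
        c_cat = channels * n_groups                      # 3 x 64 = 192 channels
        self.ca = nn.Sequential(                         # channel attention over the concatenation
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c_cat, 4, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(4, c_cat, kernel_size=1), nn.Sigmoid(),
        )
        self.sa = nn.Sequential(                         # spatial attention on the fused maps
            nn.Conv2d(channels, channels // 8, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 8, 1, kernel_size=1), nn.Sigmoid(),
        )
        self.post = nn.Sequential(                       # two closing 3x3 Conv layers (64 and 1 filters)
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
        )

    def forward(self, g1, g2, g3):
        w = self.ca(torch.cat([g1, g2, g3], dim=1))      # per-channel weights, shape (B, 192, 1, 1)
        w1, w2, w3 = torch.chunk(w, 3, dim=1)
        fused = w1 * g1 + w2 * g2 + w3 * g3              # weighted element-wise summation
        fused = fused * self.sa(fused)                   # spatial reweighting
        return self.post(fused)
```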
3.2. Temporal Feature Learning and Emotion Prediction
Temporal feature learning: Music audio sequences tend to be long and exhibit strong temporal dependencies. The time-dependent information in the sequences may not be captured by the spatial feature extractor alone. To further capture temporally related emotion features, we employ the BiGRU to learn the temporal features following the spatial feature learning module. The structure of the BiGRU is given in
Figure 1. The BiGRU is a variant of BiLSTM with fewer parameters and higher learning efficiency. By incorporating the BiGRU layer, the model can learn to extract the temporal features in both the forward and backward directions, which helps to improve the overall performance of the model.
The BiGRU is a bidirectional GRU that constructs two GRU layers in opposite directions. The basic GRU unit in the BiGRU is depicted in
Figure 3, and the main calculation formulas for the GRU are as follows:

$$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right)$$
$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right)$$
$$\tilde{h}_t = \tanh\left(W_{\tilde{h}} \cdot [r_t \ast h_{t-1}, x_t]\right)$$
$$h_t = (1 - z_t) \ast h_{t-1} + z_t \ast \tilde{h}_t$$

where $r_t$, $z_t$, $\tilde{h}_t$, and $h_t$ represent the outputs of the reset gate, the update gate, the candidate hidden state, and the GRU unit at time step $t$, respectively; $W_r$, $W_z$, and $W_{\tilde{h}}$ denote the corresponding state weight matrices; $x_t$ is the input at time step $t$; $\ast$ denotes element-wise multiplication; and $\sigma$ denotes the sigmoid function.
For the BiGRU model, which superimposes two single-layer GRU models in opposite directions, the output is determined by the states of the two superimposed GRUs. Specifically, the forward output of the BiGRU is $\overrightarrow{h}_t$ and the backward output is $\overleftarrow{h}_t$; the output of the BiGRU is then given as the concatenation $h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$.
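In a framework such as PyTorch, this temporal module corresponds to a single bidirectional GRU layer, as sketched below (the hidden size and the reshaping of the FFA output into a sequence are illustrative assumptions).

```python
# Sketch of the BiGRU temporal feature learning module.
# Hidden size and input layout are illustrative assumptions.
import torch
import torch.nn as nn

class TemporalModule(nn.Module):
    def __init__(self, input_size: int, hidden_size: int = 64):
        super().__init__()
        self.bigru = nn.GRU(input_size=input_size, hidden_size=hidden_size,
                            batch_first=True, bidirectional=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time_steps, input_size); e.g., the FFA output with the time
        # axis of the spectrogram used as the sequence dimension.
        out, _ = self.bigru(x)      # (batch, time_steps, 2 * hidden_size)
        return out                  # forward and backward hidden states concatenated
```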
After the BiGRU module processes the music features, the temporal features are obtained. To fully utilize both the spatial and temporal features, the output feature maps of the FFA and those from the BiGRU are concatenated in the channel direction. The concatenated spatial-temporal features are subsequently used to predict the emotion results.
Emotion prediction: By passing the log Mel-spectrogram into spatial and temporal feature learning modules, the spatial-temporal features are sufficiently extracted and concatenated. To predict the music emotion results, the concatenated features from the FFA and BiGRU are first flattened and then fed into two FC layers. These layers map the spatial-temporal features into the emotion space. The structure of the FC layer is given in
Figure 4, where the first layer consists of 32 nodes and the second consists of 4 nodes.
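Putting the pieces together, the prediction stage can be sketched as follows: the FFA spatial features and the BiGRU temporal features are concatenated, flattened, and mapped to the four emotion classes by the two FC layers (32 and 4 nodes). The activation between the FC layers and the exact feature shapes are illustrative assumptions.

```python
# Sketch of the emotion prediction head: concatenate the FFA (spatial) and
# BiGRU (temporal) features, flatten, and map to the four emotion quadrants.
# Feature dimensions and the hidden activation are illustrative assumptions.
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    def __init__(self, flattened_dim: int, n_classes: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(flattened_dim, 32)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(32, n_classes)

    def forward(self, spatial_feat: torch.Tensor, temporal_feat: torch.Tensor) -> torch.Tensor:
        # Both inputs are assumed to be reshaped so that they can be concatenated
        # along the channel dimension before flattening.
        x = torch.cat([spatial_feat, temporal_feat], dim=1)
        x = torch.flatten(x, start_dim=1)
        return self.fc2(self.relu(self.fc1(x)))
```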
5. Conclusions
In this work, we propose an attention-based spatial-temporal feature extractor for music emotion classification. The proposed model employs the FFA module as a multi-scale spatial feature extractor to obtain comprehensive and effective spatial information, and it further utilizes the BiGRU to learn the temporal features of music sequences. A series of experiments verifies the effectiveness of the proposed network architecture, which outperforms the baseline models in terms of accuracy, precision, and AUC. Although the proposed model achieves some improvement in accuracy, there is still much room for improvement in the emotion classification of certain categories, such as the “HVLA” and “LVHA” categories. In future work, we will further refine the model to enhance its classification accuracy for these specific categories. Moreover, the proposed method is evaluated only on the EMOPIA dataset, in which the expressed emotions are not universal. In fact, emotional responses to the same piece of music may vary across individuals, cultures, and customs; therefore, more diversified datasets and corresponding research methods are urgently needed for music emotion recognition.