A Music Emotion Recognition Algorithm With Hierarchical SVM Based Classifiers
Fig. 2 The block diagram of the proposed music emotion recognition algorithm.
where Si is the sample within a frame, and n is the frame size (n is set to 705 in this study). The following five dynamic-related features are generated:
: Mean_Loudness, the average Loudness for each 32-ms frame.
: Var_Loudness, the variance of Loudness for each 32-ms frame.
: Range_Loudness, the difference between the maximal and minimal Loudness for each 32-ms frame.
: RMS_Loudness, the root mean square value of Loudness for each 32-ms frame.
: Low-energy_Rate, the percentage of 32-ms frames with less-than-average RMS_Loudness energy for each audio recording.
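As an illustration of how these statistics can be computed, the sketch below assumes a per-frame Loudness curve (computed as defined above) is already available as an array; the function name and dictionary layout are our own, not the authors' implementation.

```python
import numpy as np

def dynamic_features(loudness):
    """Five dynamic features from a per-frame Loudness curve
    (one value per 32-ms frame)."""
    loudness = np.asarray(loudness, dtype=float)
    return {
        "Mean_Loudness": loudness.mean(),
        "Var_Loudness": loudness.var(),
        "Range_Loudness": loudness.max() - loudness.min(),
        "RMS_Loudness": np.sqrt(np.mean(loudness ** 2)),
        # fraction (not percentage) of frames below the average level
        "Low-energy_Rate": np.mean(loudness < loudness.mean()),
    }
```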
2) Rhythm Features: For each audio recording, five rhythm-related features are generated as follows:

3) Pitch Features: Based on the pitch expressed in semitones (Pitch_semitone), three pitch-related features were generated as follows:
: Mean_Pitchsemitone, the average Pitch_semitone for each music audio recording.
: Median_Pitchsemitone, the median Pitch_semitone for each music audio recording.
: Var_Pitchsemitone, the variance of Pitch_semitone for each music audio recording.
Melody is a series of notes composed of a specific relation of pitches. In this study, Pitchstep is used to describe the characteristics of melody as follows:

Pitchstep(i) = Pitch_semitone(i) − Pitch_semitone(i − 1),  i = 1, 2, ..., n,   (3)

where n is the number of notes of each music audio recording. Based on Pitchstep, three melody-related features are defined (see the sketch after this list):
: Max_Pitchstep, the maximum Pitchstep for each music audio recording.
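The sketch below computes the pitch statistics and the Pitchstep differences of Eq. (3) from a per-note pitch sequence in semitones, such as one produced by the sawtooth-waveform-inspired pitch estimator of [7]; the function name is hypothetical, and only the melody feature listed above (Max_Pitchstep) is included.

```python
import numpy as np

def pitch_and_melody_features(pitch_semitone):
    """Pitch statistics and Pitchstep-based melody features from a
    per-note pitch sequence given in semitones."""
    p = np.asarray(pitch_semitone, dtype=float)
    # Eq. (3): Pitchstep(i) = Pitch_semitone(i) - Pitch_semitone(i - 1)
    pitchstep = np.diff(p)
    return {
        "Mean_Pitchsemitone": p.mean(),
        "Median_Pitchsemitone": np.median(p),
        "Var_Pitchsemitone": p.var(),
        "Max_Pitchstep": pitchstep.max(),
    }
```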
: Roughness, the amount of partials that depart from multiples of the fundamental frequency.

4) Timbre Features: A total of 16 timbre-related features are generated. Three features are generated by energy spectral density (ESD) analysis [9] as follows:
: Brightness_1500Hz, the percentage of energy above 1500 Hz.
: Brightness_3000Hz, the percentage of energy above 3000 Hz.
: Spectral_Rolloff, the frequency above which the remaining spectral energy is less than 15% (i.e., 85% of the total energy lies below this frequency).
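A minimal sketch of these three ESD features, under the assumptions that the energy spectral density is taken as the squared-magnitude FFT of the mono signal and that the 15% refers to the residual energy above the rolloff frequency; this is an illustrative reading of the definitions, not the authors' implementation.

```python
import numpy as np

def esd_features(signal, sr, residual=0.15):
    """Brightness and rolloff features from the energy spectral density."""
    esd = np.abs(np.fft.rfft(signal)) ** 2              # energy per frequency bin
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    total = esd.sum()
    # frequency below which (1 - residual) = 85% of the energy is contained
    idx = np.searchsorted(np.cumsum(esd), (1.0 - residual) * total)
    return {
        "Brightness_1500Hz": esd[freqs > 1500].sum() / total,
        "Brightness_3000Hz": esd[freqs > 3000].sum() / total,
        "Spectral_Rolloff": freqs[min(idx, len(freqs) - 1)],
    }
```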
Another 13 features are generated based on Mel-frequency cepstrum (MFC) analysis, which represents the short-term power spectrum of a sound signal on the Mel scale. The MFC is widely utilized in sound processing since its frequency bands approximate the response of the human auditory system more closely than the linearly spaced frequency bands used in the normal cepstrum. A total of 20 Mel-scale frequency cepstral coefficients (MFCCs) make up an MFC; in this study, the first 13 MFCCs are selected as features.
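A minimal sketch of this step in Python with librosa (an assumption; the paper itself cites the Matlab-based MIRtoolbox [8]). The text above does not state how per-frame coefficients are pooled into one value per recording, so the frame-averaging here is also an assumption.

```python
import librosa

def mfcc_features(path):
    """First 13 of 20 MFCCs, pooled over frames by averaging."""
    y, sr = librosa.load(path, sr=None)                  # mono audio, native rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape: (20, n_frames)
    return mfcc[:13].mean(axis=1)                        # 13 recording-level values
```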
In order to reduce the effects of the diverse ranges of values among different features, each feature is normalized by the z-score normalization method. This procedure results in a normalized feature with a mean of zero and a standard deviation of one.
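For reference, z-score normalization of a feature matrix (rows = recordings, columns = features; layout assumed) is a one-liner:

```python
import numpy as np

def z_score(X):
    """Column-wise z-score: zero mean and unit standard deviation per feature."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

The same transformation is available as sklearn.preprocessing.StandardScaler, which also stores the training-set statistics so that test data can be normalized consistently.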
C. Feature Selection
We adopted the kernel-based class separability (KBCS) measure for feature selection, with the best individual N (BIN) strategy as the search strategy over the KBCS criterion. In BIN, each music feature is evaluated individually by its criterion value, and the n most significant features are those with the n highest criterion values. For details about KBCS, please refer to [10].
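The BIN search itself is simple to express; the sketch below assumes some per-feature class-separability criterion is supplied as a function (the KBCS criterion itself is defined in [10] and is not reproduced here).

```python
import numpy as np

def best_individual_n(X, y, criterion, n):
    """Best Individual N (BIN): score every feature independently with the
    given class-separability criterion and keep the n highest scorers."""
    scores = np.array([criterion(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:n]   # column indices of the n best features
```

Note that BIN evaluates features one at a time, so it is fast but ignores interactions between features; any univariate score (e.g., a mutual-information estimate) could stand in for `criterion` while prototyping.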
D. Feature Extraction
We employed nonparametric weighted feature extraction (NWFE). The idea of this method is to assign a different weight to every sample and to define new nonparametric between-class and within-class scatter matrices from these weights. The goal of NWFE is to find a linear transformation that maximizes the between-class scatter while minimizing the within-class scatter. For details about NWFE, please refer to [11].
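Once the two scatter matrices are available, finding the transformation reduces to a generalized eigenproblem. The sketch below assumes precomputed matrices `Sb` and `Sw`; constructing them with NWFE's nonparametric sample weighting follows [11] and is omitted here.

```python
import numpy as np
from scipy.linalg import eigh

def scatter_transform(Sb, Sw, n_components):
    """Projection maximizing between-class scatter relative to within-class
    scatter: top eigenvectors of Sb v = lambda Sw v. Sw must be symmetric
    positive definite (add a small ridge to its diagonal if needed)."""
    eigvals, eigvecs = eigh(Sb, Sw)            # generalized symmetric eigenproblem
    order = np.argsort(eigvals)[::-1]          # largest eigenvalues first
    return eigvecs[:, order[:n_components]]    # matrix A; transform with X @ A
```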
E. Classifier Construction
We used the valence and arousal axes shown in Fig. 1 to divide music emotion into four classes. The structure of the hierarchical SVMs shown in Fig. 4 consists of three SVM classifiers. The first classifier, node I, separates music samples into a low-arousal group and a high-arousal group. The second and third classifiers, node II and node III, discriminate positive from negative valence within each level of arousal, respectively. With the hierarchical SVM classifiers, the four music emotions can be assigned to the corresponding quadrants of the 2D music emotion space. Please note that the KBCS and NWFE methods are utilized at every node to find the best feature set for that node's separation target. The SVM is a binary classifier that relies on a nonlinear mapping of the training set to a higher-dimensional space, wherein the transformed data is well separated by a decision hyperplane. For details about the SVM, please refer to [12].

Fig. 4 The structure of the hierarchical SVMs (node I: high vs. low arousal; node II: happy vs. tensional within the high-arousal group; node III: peaceful vs. sad within the low-arousal group).
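A minimal sketch of this hierarchy with scikit-learn (the paper does not name an implementation, so the library, the binary label encoding, and the omission of the per-node KBCS+NWFE feature reduction are all simplifications here).

```python
import numpy as np
from sklearn.svm import SVC

class HierarchicalEmotionSVM:
    """Three binary SVMs arranged as in Fig. 4: node I splits high vs. low
    arousal; nodes II and III split valence within each arousal group."""

    def fit(self, X, arousal, valence):
        # arousal, valence: arrays of {0, 1}; 1 = high arousal / positive valence
        self.node1 = SVC().fit(X, arousal)
        self.node2 = SVC().fit(X[arousal == 1], valence[arousal == 1])
        self.node3 = SVC().fit(X[arousal == 0], valence[arousal == 0])
        return self

    def predict_one(self, x):
        x = np.asarray(x).reshape(1, -1)
        if self.node1.predict(x)[0] == 1:      # high-arousal branch
            return "Happy" if self.node2.predict(x)[0] == 1 else "Tensional"
        return "Peaceful" if self.node3.predict(x)[0] == 1 else "Sad"
```

In the full method, each node would additionally apply its own KBCS-selected and NWFE-transformed feature subset before training and prediction.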
III. EXPERIMENTAL RESULTS

A total of 219 classical music samples from two music datasets were used to verify the proposed music emotion recognition algorithm. For the first dataset (dataset A), 270 music samples of 30 seconds from 211 pieces of Western classical music were selected. To verify the music emotion perceived by a normal audience, six graduate students were recruited to label the emotions, including happiness, tension, sadness, and peace, based on the emotion they perceived from the music samples. If a music sample was labelled with the same emotion by at least five subjects, the clip was selected for use in this study. A total of 175 music clips were obtained via this process for dataset A (49 happy, 38 tensional, 47 peaceful, and 41 sad). For the second dataset (dataset B), to verify the music emotion perceived by music experts, two music therapists participated in selecting and annotating 60 representative music samples of 180 seconds, all Western classical music. If a music sample was labelled with the same emotion by both experts, it was selected. A total of 45 music samples were selected via this process (13 happy, 11 tensional, 9 peaceful, and 12 sad).

A 5-fold cross-validation was employed to test our method and the accuracy of each SVM classifier. Node I of the trained hierarchical SVMs classifier was used to discriminate music samples with high and low arousal. The KBCS+NWFE methods were used to rank and reduce the normalized music features. At node I, the best average accuracy on dataset A is reached when 15 features are used, and the best average accuracy on dataset B is reached when 5 features are used. Node II and node III are both responsible for classifying positive and negative valence; the same feature selection and extraction procedures were performed as for node I. The best average accuracy is achieved with 10 features for both node II and node III on dataset A, and with 5 features for both nodes on dataset B. With the extracted features for each node, the classification performance of the hierarchical SVMs is summarized by accuracy in Table I.
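The evaluation loop for a single node can be sketched as follows; the stratified splitting and default SVC settings are assumptions (the paper specifies 5-fold cross-validation but not the splitting or kernel details), and the per-node KBCS+NWFE reduction, which should be re-fit on each training fold, is omitted.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def node_cv_accuracy(X, y, n_splits=5):
    """Average 5-fold cross-validated accuracy of one binary SVM node."""
    accs = []
    for train, test in StratifiedKFold(n_splits=n_splits).split(X, y):
        clf = SVC().fit(X[train], y[train])
        accs.append(accuracy_score(y[test], clf.predict(X[test])))
    return float(np.mean(accs))
```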
In dataset A, the proposed music emotion classification algorithm classifies high and low arousal at node I with an accuracy of 90.75%, and positive and negative valence at node II and node III with accuracies of 86.63% and 77.73%, respectively. In dataset B, the corresponding accuracies at nodes I, II, and III are 95.00%, 95.83%, and 83.81%, respectively. The overall accuracy for discriminating the four music emotions based on the corresponding quadrants of the 2D emotion space is 86.94% in dataset A and 92.33% in dataset B.

TABLE I
PERFORMANCE COMPARISON OF THE HIERARCHICAL SVMS CLASSIFIERS

Dataset   Hierarchical SVMs Classifier               Accuracy
A         Node I (High vs. low arousal)              90.75%
          Node II (Positive vs. negative valence)    86.63%
          Node III (Positive vs. negative valence)   77.73%
          Overall (4 emotions)                       86.94%
B         Node I (High vs. low arousal)              95.00%
          Node II (Positive vs. negative valence)    95.83%
          Node III (Positive vs. negative valence)   83.81%
          Overall (4 emotions)                       92.33%

TABLE II
PERFORMANCE ANALYSIS OF DIFFERENT STRUCTURES OF SVMS CLASSIFIER

Dataset   Structure of SVMs Classifier       Overall Accuracy
A         Hierarchical SVMs classifier I     86.94%
          Hierarchical SVMs classifier II    85.26%
          One-against-one                    84.89%
B         Hierarchical SVMs classifier I     92.33%
          Hierarchical SVMs classifier II    89.03%
          One-against-one                    83.22%
IV. DISCUSSION

In this section, we compare three different structures of SVMs classifier. Hierarchical SVMs classifier I is the structure utilized in this study: node I discriminates high and low arousal, while nodes II and III distinguish positive and negative valence. We then changed the classification sequence to form hierarchical SVMs classifier II, in which node I discriminates positive and negative valence while nodes II and III distinguish high and low arousal. The third structure is the basic one-against-one SVMs classifier [12], which uses n × (n − 1)/2 SVMs to distinguish n classes by a maximum-voting strategy. The comparison results are shown in Table II. Hierarchical SVMs classifier I achieves the best performance among these three structures. This indicates that, for music emotion classification, hierarchical SVMs outperform one-against-one SVMs, and that discriminating arousal first is more effective than discriminating valence first.
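For the one-against-one baseline, a standard library implementation already realizes the scheme described above; the sketch below uses scikit-learn (an assumption) with hypothetical variable names.

```python
from sklearn.svm import SVC

def one_against_one_baseline(X_train, y_train, X_test):
    """Flat multiclass baseline: scikit-learn's SVC internally trains
    n*(n-1)/2 pairwise binary SVMs and predicts by majority voting,
    i.e., exactly the one-against-one scheme compared above."""
    return SVC(decision_function_shape="ovo").fit(X_train, y_train).predict(X_test)
```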
V. CONCLUSIONS

In this paper, a music emotion recognition algorithm consisting of the KBCS feature selection method, the NWFE feature extraction method, and a hierarchical SVMs classifier has been proposed. Thirty-five features describing the dynamics, rhythm, pitch, and timbre of music are generated from each music audio recording. To improve classification accuracy, each node of the hierarchical SVMs extracts its own important features from the 35 features by the KBCS+NWFE method. The extracted features are then treated as the inputs of the hierarchical SVMs classifier to separate the four quadrants of the 2D music emotion space. We validated the proposed algorithm using two music datasets. In the first dataset, the emotion of each music clip was annotated by graduate students to represent the emotions perceived by a normal audience; in the second dataset, music emotion was labeled by music therapists to represent the emotions perceived by music experts. The average classification accuracies of the proposed algorithm are 86.94% and 92.33%, respectively. These results validate the effectiveness of the proposed algorithm.

REFERENCES
[1] Y. H. Yang and H. H. Chen, “Machine recognition of music emotion: A review,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 3, no. 3, pp. 1–30, 2012.
[2] Y. Feng, Y. Zhuang, and Y. Pan, “Popular music retrieval by detecting mood,” in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 375–376, 2003.
[3] Y. H. Yang, C. C. Liu, and H. H. Chen, “Music emotion classification: A fuzzy approach,” in Proceedings of the ACM International Conference on Multimedia, pp. 81–84, 2006.
[4] T. Eerola, “Modeling listeners’ emotional response to music,” Topics in Cognitive Science, vol. 4, no. 4, pp. 607–624, 2012.
[5] Y. H. Yang, Y. C. Lin, Y. F. Su, and H. H. Chen, “A regression approach to music emotion recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 448–457, 2008.
[6] L. Lu, D. Liu, and H. Zhang, “Automatic mood detection and tracking of music audio signals,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 5–18, 2006.
[7] A. Camacho and J. G. Harris, “A sawtooth waveform inspired pitch estimator for speech and music,” The Journal of the Acoustical Society of America, vol. 124, no. 3, pp. 1638–1652, 2008.
[8] O. Lartillot, P. Toiviainen, and T. Eerola, MIRtoolbox: An Integrated Set of Functions Written in Matlab, Finnish Center of Excellence in Interdisciplinary Music Research, University of Jyvaskyla, Finland. [Online]. Available: http://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/mirtoolbox
[9] P. N. Juslin, “Cue utilization in communication of emotion in music performance: Relating performance to perception,” Journal of Experimental Psychology: Human Perception and Performance, vol. 26, no. 6, pp. 1797–1813, 2000.
[10] L. Wang, “Feature selection with kernel class separability,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 9, pp. 1534–1546, 2008.
[11] B. C. Kuo and D. A. Landgrebe, “Nonparametric weighted feature extraction for classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 42, no. 5, pp. 1096–1105, 2004.
[12] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995.