
2014 International Symposium on Computer, Consumer and Control

A Music Emotion Recognition Algorithm with Hierarchical SVM Based Classifiers


Wei-Chun Chiang, Jeen-Shing Wang, and Yu-Liang Hsu
Department of Electrical Engineering,
National Cheng Kung University, NCKU
Tainan, Taiwan, R.O.C.
[email protected], [email protected], [email protected]

Abstract—This paper proposes a music emotion recognition
algorithm consisting of a kernel-based class separability (KBCS)
feature selection method, a nonparametric weighted feature
extraction (NWFE) feature extraction method, and a hierarchical
support vector machines (SVMs) classifier to recognize four types
of music emotion. For each music sample, a total of 35 features
from dynamic, rhythm, pitch, and timbre of music were generated
from music audio recordings. With the extracted features via
feature selection and extraction methods, hierarchical SVM-based
classifiers are then utilized to recognize four types of music
emotion including happy, tensional, sad and peaceful. The
performance of the proposed algorithm was evaluated by two
datasets with a total of 219 classical music samples. In the first
dataset, the music emotion of each sample was annotated by recruited
subjects, while the second dataset was labelled by music therapists.
The two datasets were used to verify the emotions perceived by a
normal audience and by music experts, respectively. The average
accuracy of the proposed algorithm reached 86.94% and 92.33% for
these two music datasets, respectively. The experimental results
successfully validate the effectiveness of the proposed music emotion
recognition algorithm with hierarchical SVM-based classifiers.

Keywords-Music emotion; feature extraction; kernel-based class
separability; nonparametric weighted feature extraction; hierarchical
support vector machines

I. INTRODUCTION

Recently, music emotion recognition (MER) has become an active
research topic that has been addressed by categorical approaches or
dimensional approaches [1]. The categorical approach of MER
emphasizes training a classifier with machine learning techniques for
music emotion classification, while the dimensional approach
identifies emotions based on their placement in an emotion coordinate
system with named axes, such as valence and arousal. Among
categorical approaches, Feng et al. [2] classified emotions into four
categories, including happiness, sadness, anger, and fear, using 223
pieces of modern pop music. In their study, tempo and articulation
were used as the input features of neural networks, whose average
accuracy reached 67%. Yang et al. [3] adopted a fuzzy k-NN classifier
and a fuzzy nearest-mean classifier to recognize four classes of music
emotion from 243 pop music segments. The best accuracy they obtained
was 78.33%. In the dimensional approach, Eerola [4] presented a
computational prediction of emotional responses to music. Audio-based
features including dynamics, timbre, register, and harmony were
extracted from 100 film soundtracks for prediction model construction,
whose variance explained (R2) was 0.81 for arousal and 0.66 for
valence. Yang et al. [5] formulated MER as a regression problem to
predict the arousal and valence values (AV values) of each music
sample directly. The best performance, evaluated in terms of the R2
statistic, reaches 0.58 for arousal and 0.28 for valence by employing
support vector machines as regressors.

From a review of the existing literature, we found that there is no
standard rule for defining an emotion taxonomy. Different studies use
different numbers and classes of emotion, and even use different words
to describe the same emotion (e.g., sad versus sorrowful). This may
overwhelm the subjects and mislead them into choosing a wrong music
emotion in experiments. Although dimensional approaches can alleviate
these disadvantages of categorical approaches, subjects may be
confused about the meaning of the valence and arousal axes. For
example, a subject may not know which emotion he/she perceives when
the score on the valence axis is 5 and the score on the arousal axis
is -2.

In order to solve these problems, we combined both categorical and
dimensional approaches into a systematic one that uses the valence and
arousal axes to divide music emotion into four classes: happy,
tensional, sad, and peaceful. The illustration of the two-dimensional
emotion space and its corresponding emotion categories is shown in
Fig. 1. The rest of this paper is organized as follows. Section II
describes the music emotion recognition algorithm. The experimental
results are presented in Section III. Sections IV and V provide
discussions and conclusions, respectively.

Fig. 1 The two-dimensional emotion space defined by the valence and arousal
axes, with the four quadrants corresponding to happy, tensional, sad, and
peaceful.
II. MUSIC EMOTION RECOGNITION ALGORITHM

The proposed music emotion recognition algorithm is composed of the
following procedures: data acquisition, feature generation, feature
selection, feature extraction, and classifier construction. The block
diagram of the proposed algorithm is shown in Fig. 2.

Fig. 2 The block diagram of the proposed music emotion recognition algorithm:
data acquisition (music signals) → feature generation (35 features) → feature
selection (KBCS) → feature extraction (NWFE) → classifier construction
(hierarchical SVMs).
A. Data Acquisition

In this study, the signal of each music sample was converted to a
standard recording format: mono-channel PCM with a sampling rate of
22,050 Hz and 16-bit resolution [1].

B. Feature Generation

To depict the dominant characteristics of emotion presented in music,
a total of 35 features covering the dynamics, rhythm, pitch, and
timbre of music were generated from the converted standard audio
recordings.
1) Dynamic Features: Each audio recording was divided into
non-overlapping frames of 32-ms length [6]. The loudness of each frame
(also called the envelope) is defined as the sum of the absolute
values of the audio samples:

Loudness = \sum_{i=1}^{n} |S_i|,  (1)

where S_i is the i-th sample within a frame and n is the frame size
(n is set to 705 in this study). The following five dynamic related
features are generated:
- Mean_Loudness: the average Loudness over all 32-ms frames.
- Var_Loudness: the variance of Loudness over all 32-ms frames.
- Range_Loudness: the difference between the maximal and minimal
Loudness over all 32-ms frames.
- RMS_Loudness: the root-mean-square value of Loudness over all
32-ms frames.
- Low-energy_Rate: the percentage of 32-ms frames with
less-than-average RMS_Loudness energy for each audio recording.
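As a concrete illustration of this step, the following Python sketch
computes the frame-level Loudness of Eq. (1) and the five dynamic
features. It is a minimal sketch, not the authors' implementation: the
function name, the framing details, and the thresholding used for
Low-energy_Rate are assumptions, and `signal` is assumed to be a mono
PCM array sampled at 22,050 Hz.

```python
# Minimal sketch (assumption, not the authors' code) of Eq. (1) and the
# five dynamic features derived from the frame-level Loudness curve.
import numpy as np

def dynamic_features(signal, sr=22050, frame_ms=32):
    n = int(sr * frame_ms / 1000)                     # frame size: 705 samples in this study
    n_frames = len(signal) // n
    frames = np.reshape(signal[:n_frames * n], (n_frames, n))
    loudness = np.abs(frames).sum(axis=1)             # Eq. (1): sum of absolute samples per frame
    return {
        "Mean_Loudness": loudness.mean(),
        "Var_Loudness": loudness.var(),
        "Range_Loudness": loudness.max() - loudness.min(),
        "RMS_Loudness": np.sqrt(np.mean(loudness ** 2)),
        # fraction of frames whose energy falls below the average (low-energy rate)
        "Low-energy_Rate": float(np.mean(loudness < loudness.mean())),
    }
```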
2) Rhythm Features: For each audio recording, five rhythm related
features are generated:
- Tempo: the average number of peaks of the Loudness curve per minute
for each audio recording.

The rest of the rhythm features are introduced in Fig. 3, where the
blue waveform is the envelope of the audio signal. A note length is
defined as the duration from valley to valley on the envelope
waveform, which represents the lasting time of a note's sound.

Fig. 3 Rhythm features: attack slope, attack time, and note length on an
envelope curve.

- Var_Rhythm: the variance of note length for each music audio
recording.
- Articulation: the average attack time ratio over all notes for each
music audio recording. The attack time ratio is defined as the ratio
of attack time to note length.
- Median_Slope: the median of all attack slopes for each music audio
recording. The attack slope is defined as the slope from valley to
peak at each note.
- Max_Slope: the maximum attack slope for each music audio recording.
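The envelope-based rhythm features above could be derived along the
lines of the following sketch, which segments the Loudness envelope
into valley-to-valley notes. The peak-picking strategy (scipy's
`find_peaks` without smoothing or prominence thresholds) and the
function name are assumptions, not the authors' implementation.

```python
# Minimal sketch (assumption, not the authors' code): rhythm features from the
# per-frame Loudness envelope; each valley-to-valley span is treated as one note.
import numpy as np
from scipy.signal import find_peaks

def rhythm_features(envelope, frame_dur=0.032):
    peaks, _ = find_peaks(envelope)                    # envelope peaks
    valleys, _ = find_peaks(-envelope)                 # envelope valleys (note boundaries)
    tempo = len(peaks) / (len(envelope) * frame_dur / 60.0)   # average peaks per minute

    note_lengths, attack_ratios, attack_slopes = [], [], []
    for v0, v1 in zip(valleys[:-1], valleys[1:]):
        note = envelope[v0:v1 + 1]
        peak = v0 + int(np.argmax(note))
        note_len = (v1 - v0) * frame_dur               # note length (valley to valley)
        attack = max((peak - v0) * frame_dur, frame_dur)   # attack time (valley to peak)
        note_lengths.append(note_len)
        attack_ratios.append(attack / note_len)        # attack time ratio
        attack_slopes.append((envelope[peak] - envelope[v0]) / attack)  # attack slope
    return {
        "Tempo": tempo,
        "Var_Rhythm": float(np.var(note_lengths)),
        "Articulation": float(np.mean(attack_ratios)),
        "Median_Slope": float(np.median(attack_slopes)),
        "Max_Slope": float(np.max(attack_slopes)),
    }
```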
3) Pitch Features: Pitch depends on the frequency content of the sound
stimulus. In this study, the pitch of each note is derived with
Camacho's algorithm (Sawtooth Waveform Inspired Pitch Estimator,
SWIPE) [7], whose output is on the frequency scale (Hz). Since a
semitone is defined as the smallest musical frequency interval in
Western music, pitch is converted to semitones on the chromatic scale
as follows:

Pitch_semitone = 69 + 12 \times \log_2(Pitch_Hz / 440),  (2)

and three pitch related features were generated:
- Mean_Pitchsemitone: the average Pitch_semitone for each music audio
recording.
- Median_Pitchsemitone: the median Pitch_semitone for each music audio
recording.
- Var_Pitchsemitone: the variance of Pitch_semitone for each music
audio recording.

Melody is a series of notes composed of a specific relation of
pitches. In this study, Pitch_step is used to describe the
characteristics of melody as follows:

Pitch_step(i) = Pitch_semitone(i) - Pitch_semitone(i-1),  i = 2, 3, ..., n,  (3)

where n is the number of notes of each music audio recording. Based on
Pitch_step, three melody related features are defined:
- Max_Pitchstep: the maximum Pitch_step for each music audio recording.
- Min_Pitchstep: the minimum Pitch_step for each music audio recording.
- Mean_Pitchstep: the average Pitch_step for each music audio recording.

Three additional pitch related features adopted from [8] are defined
as follows:
- Best_Modality: the most probable mode of each music recording,
chosen from 12 possible modes: C, C♯, D, E♭, E, F, F♯, G, G♯, A, B♭,
and B.
- Inharmonicity: the degree to which the frequencies of partial tones
depart from whole multiples of the fundamental frequency.
- Roughness: the amount of partials that depart from multiples of the
fundamental frequency.
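A small sketch of the semitone conversion of Eq. (2) and the melody
features built on Eq. (3) is given below. The function name is an
assumption, and the per-note pitch estimates are assumed to come from
a tracker such as SWIPE.

```python
# Minimal sketch (assumption, not the authors' code): Eq. (2) semitone
# conversion and the Pitch_step melody features of Eq. (3).
import numpy as np

def pitch_features(pitch_hz):
    """pitch_hz: one fundamental-frequency estimate (Hz) per detected note."""
    semitone = 69 + 12 * np.log2(np.asarray(pitch_hz, dtype=float) / 440.0)  # Eq. (2)
    step = np.diff(semitone)                                                 # Eq. (3)
    return {
        "Mean_Pitchsemitone": semitone.mean(),
        "Median_Pitchsemitone": float(np.median(semitone)),
        "Var_Pitchsemitone": semitone.var(),
        "Max_Pitchstep": step.max(),
        "Min_Pitchstep": step.min(),
        "Mean_Pitchstep": step.mean(),
    }
```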
4) Timbre Features: A total of 16 timbre related features are
generated. Three features are generated by energy spectral density
(ESD) analysis [9]:
- Brightness_1500Hz: the percentage of energy above 1500 Hz.
- Brightness_3000Hz: the percentage of energy above 3000 Hz.
- Spectral_Rolloff: the frequency above which the remaining energy is
less than 15% of the total.

Another 13 features are generated based on Mel-frequency cepstrum
(MFC) analysis, which is a representation of the short-term power
spectrum of a sound signal on the Mel scale. The MFC is widely
utilized in sound processing since its frequency bands approximate the
response of the human auditory system more closely than the linearly
spaced frequency bands used in the normal cepstrum. A total of 20
Mel-frequency cepstral coefficients (MFCCs) make up an MFC; in this
study, the first 13 MFCCs are selected as features.

In order to reduce the effects of variance caused by the diverse range
of values among different features, each feature is normalized by the
z-score normalization method. This procedure results in a normalized
feature with a mean of zero and a standard deviation of one.
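The two brightness features, the spectral roll-off, and the z-score
normalization could be computed as in the sketch below. This is an
assumption rather than the authors' code: the FFT-based spectrum, the
85% roll-off threshold (equivalent to the "remaining energy below 15%"
description above), and the function names are illustrative; MFCC
extraction is left to a standard library such as librosa.

```python
# Minimal sketch (assumption, not the authors' code): ESD-based timbre features
# for one frame, and column-wise z-score normalization of the feature matrix.
import numpy as np

def esd_features(frame, sr=22050):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2             # energy spectral density
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = spectrum.sum()
    brightness_1500 = spectrum[freqs > 1500].sum() / total   # fraction of energy above 1500 Hz
    brightness_3000 = spectrum[freqs > 3000].sum() / total   # fraction of energy above 3000 Hz
    cumulative = np.cumsum(spectrum) / total
    rolloff = freqs[np.searchsorted(cumulative, 0.85)]        # energy above this frequency < 15%
    return brightness_1500, brightness_3000, rolloff

def zscore(feature_matrix):
    """Column-wise z-score: each of the 35 features gets zero mean and unit variance."""
    X = np.asarray(feature_matrix, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```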
C. Feature Selection

We adopted the kernel-based class separability (KBCS) measure for
feature selection, with best individual N (BIN) as the search strategy
for the KBCS criterion. In BIN, each music feature is evaluated
individually by its criterion value, and the n most significant
features are those with the n highest criterion values. For details
about KBCS, please refer to [10].
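The BIN search strategy itself is straightforward and could look like
the sketch below; the KBCS criterion is defined in [10] and is passed
in here as an abstract scoring callable, which is an assumption of
this sketch.

```python
# Minimal sketch (assumption, not the authors' code): best individual N (BIN)
# ranking, where each feature column is scored on its own and the n highest-
# scoring features are kept. `criterion` stands in for the KBCS measure of [10].
import numpy as np

def best_individual_n(X, y, n, criterion):
    """X: (samples, features) matrix; y: class labels; n: number of features to keep."""
    scores = np.array([criterion(X[:, j], y) for j in range(X.shape[1])])
    keep = np.argsort(scores)[::-1][:n]        # indices of the n largest criterion values
    return np.sort(keep)
```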
D. Feature Extraction

We employed nonparametric weighted feature extraction (NWFE) for
feature extraction. The idea of this method is to assign every sample
a different weight and to define new nonparametric between-class and
within-class scatter matrices. The goal of the NWFE method is to find
a linear transformation that maximizes the between-class scatter and
minimizes the within-class scatter. For details about NWFE, please
refer to [11].
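The projection step shared by NWFE and other scatter-matrix methods is
sketched below. For brevity the sketch uses the ordinary (parametric)
scatter matrices, not NWFE's nonparametric weighted ones defined in
[11], so it should be read as an illustration of the "maximize
between-class, minimize within-class scatter" idea rather than a
faithful NWFE implementation.

```python
# Minimal sketch (assumption, not the authors' code): find a linear transformation
# maximizing between-class scatter relative to within-class scatter, then project.
# NWFE would replace Sb and Sw with its nonparametric weighted versions [11].
import numpy as np

def scatter_projection(X, y, n_components):
    y = np.asarray(y)
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sb = np.zeros((d, d))                              # between-class scatter
    Sw = np.zeros((d, d))                              # within-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - mean_all, mc - mean_all)
        Sw += (Xc - mc).T @ (Xc - mc)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1][:n_components]
    W = eigvecs[:, order].real                         # d x n_components projection matrix
    return X @ W, W
```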
E. Classifier Construction

We used the valence and arousal axes shown in Fig. 1 to divide music
emotion into four classes. The structure of the hierarchical SVMs
shown in Fig. 4 consists of three SVM classifiers. The first
classifier, node I, separates music samples into a group of low
arousal and a group of high arousal. The second and third classifiers,
node II and node III, discriminate positive from negative valence
within each level of arousal, respectively. With the hierarchical SVM
classifiers, the four music emotions can be assigned to the
corresponding quadrants of the 2D music emotion space. Please note
that the KBCS and NWFE methods are applied at every node to find the
best feature set for the separation target of that node.

Fig. 4 The structure of the hierarchical SVMs: node I (arousal) separates high
arousal (happy and tensional) from low arousal (sad and peaceful); node II
(valence) then labels the high-arousal samples as happy (positive) or
tensional (negative), and node III (valence) labels the low-arousal samples as
peaceful (positive) or sad (negative).

An SVM is a binary classifier that relies on a nonlinear mapping of
the training set to a higher-dimensional space, wherein the
transformed data are well separated by a decision hyperplane. For
details about the SVM, please refer to [12].
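A compact way to realize the three-node structure of Fig. 4 is
sketched below using scikit-learn. It is an illustrative sketch, not
the authors' implementation: the per-node KBCS+NWFE feature
selection/extraction is omitted, and the RBF kernel and class names
are assumptions.

```python
# Minimal sketch (assumption, not the authors' code) of the hierarchical SVMs
# in Fig. 4: node I decides arousal, nodes II/III decide valence within each group.
import numpy as np
from sklearn.svm import SVC

AROUSAL = {"happy": "high", "tensional": "high", "peaceful": "low", "sad": "low"}
VALENCE = {"happy": "pos", "tensional": "neg", "peaceful": "pos", "sad": "neg"}
EMOTION = {("high", "pos"): "happy", ("high", "neg"): "tensional",
           ("low", "pos"): "peaceful", ("low", "neg"): "sad"}

class HierarchicalSVM:
    def __init__(self):
        self.node1 = SVC(kernel="rbf")     # node I: high vs. low arousal
        self.node2 = SVC(kernel="rbf")     # node II: valence within high arousal
        self.node3 = SVC(kernel="rbf")     # node III: valence within low arousal

    def fit(self, X, labels):
        arousal = np.array([AROUSAL[l] for l in labels])
        valence = np.array([VALENCE[l] for l in labels])
        self.node1.fit(X, arousal)
        self.node2.fit(X[arousal == "high"], valence[arousal == "high"])
        self.node3.fit(X[arousal == "low"], valence[arousal == "low"])
        return self

    def predict(self, X):
        arousal = self.node1.predict(X)
        out = []
        for x, a in zip(X, arousal):
            node = self.node2 if a == "high" else self.node3
            out.append(EMOTION[(a, node.predict(x.reshape(1, -1))[0])])
        return np.array(out)
```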

III. EXPERIMENTAL RESULTS

A total of 219 classical music samples from two music datasets were
used to verify the proposed music emotion recognition algorithm. In
the first dataset (dataset A), 270 music samples of 30 seconds from
211 pieces of Western classical music were selected. To verify the
music emotion perceived by a normal audience, six graduate students
were recruited to label the emotions, including happiness, tension,
sadness, and peace, based on the emotion they perceived from the music
samples. If a music sample was labelled with the same emotion by at
least five subjects, the clip was selected for use in this study. A
total of 175 music clips were obtained via this process for dataset A
(49 happy, 38 tensional, 47 peaceful, and 41 sad). In the second
dataset (dataset B), to verify the music emotion perceived by music
experts, two music therapists participated in selecting and annotating
60 representative music samples of 180 seconds, all of them Western
classical music. If a music sample was labelled with the same emotion
by both experts, it was selected. A total of 45 music samples were
selected via this process (13 happy, 11 tensional, 9 peaceful, and 12
sad).

A 5-fold cross-validation was employed to test our method and the
accuracy of each SVM classifier. Node I of the trained hierarchical
SVM classifier was used to discriminate music samples with high and
low arousal. The KBCS+NWFE methods were used to rank and reduce the
normalized music features. At node I, the best average accuracy of
dataset A is reached when 15 features are used, and the best average
accuracy of dataset B is reached when 5 features are used. Node II and
node III are both responsible for classifying positive and negative
valence; the same feature selection and extraction procedures were
performed as those used in node I. The best average accuracy is
achieved with 10 features for both node II and node III in dataset A,
and with 5 features for both nodes in dataset B. With the extracted
features for each node, the classification performance of the
hierarchical SVMs is summarized in Table I. In dataset A, the proposed
music emotion classification algorithm classifies high and low arousal
at node I with an accuracy of 90.75%, and positive and negative
valence at node II and node III with accuracies of 86.63% and 77.73%,
respectively. In dataset B, the corresponding accuracies at nodes I,
II, and III are 95.00%, 95.83%, and 83.81%, respectively. The overall
accuracy for discriminating the four music emotions based on the
corresponding quadrants of the 2D emotion space is 86.94% in dataset A
and 92.33% in dataset B.

TABLE I
PERFORMANCE COMPARISON OF THE HIERARCHICAL SVMS CLASSIFIERS

Dataset | Hierarchical SVMs Classifier             | Accuracy
A       | Node I (high vs. low arousal)            | 90.75%
A       | Node II (positive vs. negative valence)  | 86.63%
A       | Node III (positive vs. negative valence) | 77.73%
A       | Overall (4 emotions)                     | 86.94%
B       | Node I (high vs. low arousal)            | 95.00%
B       | Node II (positive vs. negative valence)  | 95.83%
B       | Node III (positive vs. negative valence) | 83.81%
B       | Overall (4 emotions)                     | 92.33%
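The 5-fold cross-validation protocol could be scripted as below,
reusing the HierarchicalSVM sketch from Section II-E; the stratified
splitting, shuffling, and random seed are assumptions rather than
details reported by the authors.

```python
# Minimal sketch (assumption, not the authors' code): 5-fold cross-validated
# overall accuracy of the hierarchical SVM classifier on one dataset.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

def cross_validated_accuracy(X, labels, n_splits=5):
    labels = np.asarray(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    accuracies = []
    for train_idx, test_idx in skf.split(X, labels):
        clf = HierarchicalSVM().fit(X[train_idx], labels[train_idx])   # sketch from Sec. II-E
        accuracies.append(accuracy_score(labels[test_idx], clf.predict(X[test_idx])))
    return float(np.mean(accuracies))
```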
IV. DISCUSSION

In this section, we compare three different structures of SVM
classifiers. Hierarchical SVMs classifier I is the structure utilized
in this study, in which node I discriminates high and low arousal,
while nodes II and III distinguish positive and negative valence. By
changing the classification sequence, we formed hierarchical SVMs
classifier II, in which node I discriminates positive and negative
valence while nodes II and III distinguish high and low arousal. The
third structure is the basic one-against-one SVM classifier [12],
which uses n × (n-1)/2 SVMs to distinguish n classes with a maximum
voting strategy. The comparison results are shown in Table II.
Hierarchical SVMs classifier I achieves the best performance among
these three structures. This indicates that, for music emotion
classification, hierarchical SVMs outperform one-against-one SVMs, and
that arousal can be discriminated more reliably than valence.

TABLE II
PERFORMANCE ANALYSIS OF DIFFERENT STRUCTURES OF SVMS CLASSIFIER

Dataset | Structure of SVMs Classifier    | Overall Accuracy
A       | Hierarchical SVMs classifier I  | 86.94%
A       | Hierarchical SVMs classifier II | 85.26%
A       | One-against-one                 | 84.89%
B       | Hierarchical SVMs classifier I  | 92.33%
B       | Hierarchical SVMs classifier II | 89.03%
B       | One-against-one                 | 83.22%
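For reference, the one-against-one baseline of Table II can be
expressed with explicit pairwise SVMs and maximum voting, as in the
sketch below; the kernel choice is an assumption (with four emotions
this yields 4 × 3 / 2 = 6 pairwise classifiers).

```python
# Minimal sketch (assumption, not the authors' code): one-against-one SVMs with
# maximum voting over the n*(n-1)/2 pairwise classifiers.
from collections import Counter
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

class OneVsOneSVM:
    def fit(self, X, y):
        y = np.asarray(y)
        self.pairs = []
        for a, b in combinations(np.unique(y), 2):       # 6 pairs for 4 emotion classes
            mask = (y == a) | (y == b)
            self.pairs.append(SVC(kernel="rbf").fit(X[mask], y[mask]))
        return self

    def predict(self, X):
        votes = np.vstack([clf.predict(X) for clf in self.pairs])   # (n_pairs, n_samples)
        # maximum voting: the class chosen by the most pairwise SVMs wins
        return np.array([Counter(votes[:, i]).most_common(1)[0][0]
                         for i in range(votes.shape[1])])
```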
V. CONCLUSIONS

In this paper, a music emotion recognition algorithm consisting of a
KBCS feature selection method, an NWFE feature extraction method, and
a hierarchical SVMs classifier has been proposed. Thirty-five features
covering the dynamics, rhythm, pitch, and timbre of music are
generated from each music audio recording. In order to improve the
classification accuracy, each node of the hierarchical SVMs extracts
important features from the 35 features individually by the KBCS+NWFE
method. The extracted features are then used as the inputs of the
hierarchical SVMs classifier to separate the four quadrants of the 2D
music emotion space. We validated the proposed algorithm using two
music datasets. In the first dataset, the emotion of each music clip
was annotated by graduate students to represent the emotions perceived
by a normal audience. In the second dataset, music emotion was labeled
by music therapists to represent the emotions perceived by music
experts. The average classification accuracies of the proposed
algorithm are 86.94% and 92.33%, respectively. These results
successfully validate the effectiveness of the proposed algorithm.

REFERENCES
[1] Y. H. Yang and H. H. Chen, "Machine recognition of music emotion: A
review," ACM Transactions on Intelligent Systems and Technology (TIST),
vol. 3, no. 3, pp. 1–30, 2012.
[2] Y. Feng, Y. Zhuang, and Y. Pan, "Popular music retrieval by detecting
mood," in Proceedings of the 26th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, pp.
375–376, 2003.
[3] Y. H. Yang, C. C. Liu, and H. H. Chen, "Music emotion classification:
A fuzzy approach," in Proceedings of the ACM International Conference on
Multimedia, pp. 81–84, 2006.
[4] T. Eerola, "Modeling listeners' emotional response to music," Topics
in Cognitive Science, vol. 4, no. 4, pp. 607–624, 2012.
[5] Y. H. Yang, Y. C. Lin, Y. F. Su, and H. H. Chen, "A regression
approach to music emotion recognition," IEEE Transactions on Audio,
Speech, and Language Processing, vol. 16, no. 2, pp. 448–457, 2008.
[6] L. Lu, D. Liu, and H. Zhang, "Automatic mood detection and tracking
of music audio signals," IEEE Transactions on Audio, Speech, and Language
Processing, vol. 14, no. 1, pp. 5–18, 2006.
[7] A. Camacho and J. G. Harris, "A sawtooth waveform inspired pitch
estimator for speech and music," The Journal of the Acoustical Society of
America, vol. 124, no. 3, pp. 1638–1652, 2008.
[8] O. Lartillot, P. Toiviainen, and T. Eerola, MIRtoolbox: An Integrated
Set of Functions Written in Matlab, Finnish Center of Excellence in
Interdisciplinary Music Research, University of Jyvaskyla, Finland.
[Online]. Available:
http://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/mirtoolbox
[9] P. N. Juslin, "Cue utilization in communication of emotion in music
performance: Relating performance to perception," Journal of Experimental
Psychology: Human Perception and Performance, vol. 26, no. 6, pp.
1797–1813, 2000.
[10] L. Wang, "Feature selection with kernel class separability," IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 9,
pp. 1534–1546, 2008.
[11] B. C. Kuo and D. A. Landgrebe, "Nonparametric weighted feature
extraction for classification," IEEE Transactions on Geoscience and
Remote Sensing, vol. 42, no. 5, pp. 1096–1105, 2004.
[12] C. Cortes and V. Vapnik, "Support vector networks," Machine
Learning, vol. 20, pp. 273–297, 1995.
