Issues Detecting Emotions
Abstract. Investigating subjective values of audio data is both an interesting and pleasant topic for research, and it has recently been gaining attention and popularity among researchers. We focus on the automatic detection of emotions in songs/audio files, using features based on spectral contents. A data set containing a few hundred music pieces was used in the experiments. The emotions are grouped into 13 or 6 classes. We compare our results with tests on human subjects. One of the main conclusions is that multi-label classification is required.
Keywords: Music information retrieval, sound analysis.
Introduction
Data Parametrization
Automatic parametrization of audio data for classification purposes is hard because of the ambiguity of labeling and the subjectivity of description. However, since a piece of music evokes similar emotions in listeners representing the same cultural background, it seems possible to obtain a parametrization that can be used for the purpose of extracting emotions. Our goal was to check how numerical parameters work for classification purposes, how good or low the classification accuracy is, and how it compares with human performance.
Objective descriptors of an audio signal characterize basic properties of the investigated sounds, such as loudness, duration, and pitch, as well as more advanced properties describing the frequency contents and its changes over time. Some descriptors come from speech processing and include prosodic and quality features, such as phonation type, articulation manner, etc. [12]. Such features can be applied to the detection of emotions in a speech signal, but not all of them can be applied to music signals, which require other descriptors. Features applied to music signals include the structure of the spectrum (timbral features), time domain features, time-frequency description, and higher-level features, such as rhythmic content features [7], [9], [13], [14].
When parameterizing music sounds for emotion classification, we assumed that emotions depend, to some extent, on harmony and rhythm. Since we deal with audio, not MIDI files, our parametrization is based on spectral contents (chords and timbre). Western music, recorded in stereo with 44100 Hz sampling frequency and 16-bit resolution, was used as audio samples. We applied a long analyzing frame of 32768 samples, taken from the left channel, in order to obtain more precise spectral bins and to describe a longer time fragment. A Hanning window was applied, and spectral components were calculated up to 12 kHz and for no more than 100 partials, since higher harmonics did not contribute significantly to the spectrum.
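As a minimal sketch, the analysis frame described above could be computed as follows, assuming NumPy and a vector of left-channel samples; the function and parameter names are illustrative and not taken from the paper.

```python
import numpy as np

def spectral_frame(samples, sr=44100, frame_len=32768, f_max=12000.0):
    """Magnitude spectrum of one long, Hanning-windowed analysis frame,
    truncated at f_max (12 kHz), following the setup described above."""
    frame = samples[:frame_len] * np.hanning(frame_len)  # left-channel samples
    spectrum = np.abs(np.fft.rfft(frame))                # magnitude spectrum
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)       # bin width ~1.35 Hz
    keep = freqs <= f_max                                # discard bins above 12 kHz
    return freqs[keep], spectrum[keep]
```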
The following set of 29 audio descriptors was calculated for our analysis
window [14]:
Frequency: dominating fundamental frequency of the sound
Level: maximal level of sound in the analyzed frame
Tristimulus1, 2, 3: Tristimulus parameters calculated for Frequency, given by [10]:

Tristimulus1 = \frac{A_1^2}{\sum_{n=1}^{N} A_n^2}    (1)

Tristimulus2 = \frac{\sum_{n=2,3,4} A_n^2}{\sum_{n=1}^{N} A_n^2}    (2)

Tristimulus3 = \frac{\sum_{n=5}^{N} A_n^2}{\sum_{n=1}^{N} A_n^2}    (3)

where A_n denotes the amplitude of the nth harmonic, N is the number of harmonics available in the spectrum, M = N/2 and L = N/2 + 1
EvenHarm and OddHarm: Contents of even and odd harmonics in the spectrum, defined as

EvenHarm = \frac{\sum_{k=1}^{M} A_{2k}^2}{\sum_{n=1}^{N} A_n^2}    (4)

OddHarm = \frac{\sum_{k=2}^{L} A_{2k-1}^2}{\sum_{n=1}^{N} A_n^2}    (5)
Brightness: brightness of the sound spectrum, defined as

Brightness = \frac{\sum_{n=1}^{N} n A_n}{\sum_{n=1}^{N} A_n}    (6)

Irregularity: irregularity of the spectrum, defined as

Irregularity = \sum_{k=2}^{N-1} \left| \log \frac{A_k}{\sqrt[3]{A_{k-1} A_k A_{k+1}}} \right|    (7)
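As a rough illustration, the sketch below computes descriptors (1)-(7) from a vector of harmonic amplitudes; the function name, the use of NumPy, and the edge-case assumptions (strictly positive amplitudes, at least five harmonics) are mine, not the paper's.

```python
import numpy as np

def harmonic_descriptors(A):
    """Descriptors (1)-(7) from harmonic amplitudes A[0]=A_1, ..., A[N-1]=A_N.
    Assumes strictly positive amplitudes and N >= 5."""
    A = np.asarray(A, dtype=float)
    N = len(A)
    total = np.sum(A ** 2)
    return {
        "Tristimulus1": A[0] ** 2 / total,                          # (1)
        "Tristimulus2": np.sum(A[1:4] ** 2) / total,                # (2): harmonics 2-4
        "Tristimulus3": np.sum(A[4:] ** 2) / total,                 # (3): harmonics 5..N
        "EvenHarm": np.sum(A[1::2] ** 2) / total,                   # (4): A_2, A_4, ...
        "OddHarm": np.sum(A[2::2] ** 2) / total,                    # (5): A_3, A_5, ...
        "Brightness": np.sum(np.arange(1, N + 1) * A) / np.sum(A),  # (6)
        "Irregularity": np.sum(np.abs(np.log(                       # (7): k = 2..N-1
            A[1:N - 1] / np.cbrt(A[:N - 2] * A[1:N - 1] * A[2:N])))),
    }
```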
Experiment Setup
The emotions were grouped into the following 13 classes (adjective groups):
1. frustrated,
2. bluesy, melancholy,
3. longing, pathetic,
4. cheerful, gay, happy,
5. dark, depressing,
6. delicate, graceful,
7. dramatic, emphatic,
8. dreamy, leisurely,
9. agitated, exciting, enthusiastic,
10. fanciful, light,
11. mysterious, spooky,
12. passionate,
13. sacred, spiritual.
Class        No. of objects
Graceful     45
Happy        36
Passionate   40
Pathetic     155
Sacred       11
Spooky       77
Results
Class        No. of objects
Agitated     16
Bluesy       18
Dark         6
Dramatic     88
Dreamy       20
Fanciful     34
Frustrated   17
Graceful     14
Happy        24
Passionate   18
Pathetic     32
Sacred       17
Spooky       7

Fig. 2. Representation of classes in the collection of 303 musical recordings for the research on automatic classification of emotions
Class                               No. of objects   k-NN   Correctness
1. happy, fanciful                  57               k=11   81.33%
2. graceful, dreamy                 34               k=5    88.67%
3. pathetic, passionate             49               k=9    83.67%
4. dramatic, agitated, frustrated   117              k=7    62.67%
5. sacred, spooky                   23               k=7    92.33%
6. dark, bluesy                     23               k=5    92.33%

Fig. 3. Results of automatic classification of emotions for the 303-element database using k-NN
Class                               No. of objects   Correctness
1. happy, fanciful                  74               95.97%
2. graceful, dreamy                 91               89.77%
3. pathetic, passionate             195              71.72%
4. dramatic, agitated, frustrated   327              64.02%
5. sacred, spooky                   88               89.88%
6. dark, bluesy                     97               88.80%

Fig. 4. Results of automatic classification of emotions for the 870-element database
We also performed experiments for the same 6 classes using the k-NN classifier, i.e., examining all 6 classes in parallel. These experiments yielded 37% correctness (and 23.05% for 13 classes), suggesting that further work was needed. Since we suspected that the uneven number of objects in the classes and the limited size of the data set could hinder classification, the 870-element data set was used in further experiments.
The results of experiments with binary classification performed on the full data set, containing 870 audio files, are presented in Figure 4. The best results were obtained with k-NN for k = 13. As we can see, the results even improved compared to the small data set. However, general classification with all classes examined in parallel remained low, comparable with the results for the 303-element data set, since we obtained 20.12% accuracy for 13 classes and 37.47% for 6 classes.
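One plausible reading of this binary setting is a one-class-versus-rest k-NN experiment, as in the minimal sketch below; scikit-learn, the 3-fold cross-validation, and all names here are my assumptions, since the paper does not state its implementation or evaluation split.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def binary_knn_accuracy(X, y, target_class, k=13):
    """Accuracy of a one-vs-rest k-NN classifier for one emotion group
    (k = 13 is the value reported as best for the 870-element set)."""
    y_bin = (np.asarray(y) == target_class).astype(int)  # 1 = target group, 0 = rest
    clf = KNeighborsClassifier(n_neighbors=k)
    return cross_val_score(clf, X, y_bin, cv=3).mean()
```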
Because of the low level of accuracy in general classification, we decided to compare the results with human performance. Two other subjects with a musical background were asked to classify a test set of 39 samples, i.e., 3 samples for each class. The results convinced us that the difficulty lies not only in the parametrization or the classification method, since the correctness of their assessments was 24.24% and 33.33%, differing essentially on particular samples. This experiment suggests that multi-class labeling by a few subjects may be needed, since various listeners may perceive various emotions while listening to the same file, even if they represent the same cultural and musical background.
Multi-class Labeling
5.1 Training Models

5.2 Testing Criteria
We assume that we build models for each base class only, and not for combinations of classes (MODEL-n), because of the sparseness of data, as discussed above. As an exemplary classifier we use Support Vector Machines (SVM) [3], as they are recognized to give very good results in text and scene classification, i.e., on multi-label data.
Now, let us see how we can obtain multiple labels from the outputs of each of the models. In a standard 2-class SVM, the positive (negative) output of the SVM for a testing object means that it is a positive (negative) example. In the case of multi-class problems, several SVMs are built, one for each class. The highest positive output of the SVMs determines the class of a testing object. However, it can happen that no SVM gives a positive output. This approach can be extended to multi-label classification.
Let us consider the following three testing (labeling) criteria.
P-criterion labels the testing object with all classes corresponding to positive outputs of the SVMs. If no output is positive, then the object is left unlabeled.
T-criterion works similarly to the P-criterion; however, if no output is positive, then the top value is used for labeling.
C-criterion evaluates top values that are close to each other, no matter whether they are positive or negative.
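A minimal sketch of how these criteria could be applied to the decision values of the per-class SVMs follows; the function name and the closeness threshold eps for the C-criterion are illustrative assumptions, since the text does not specify how closeness is measured.

```python
import numpy as np

def labels_from_svm_outputs(scores, criterion="P", eps=0.1):
    """Turn per-base-class SVM decision values into a set of labels using
    the P-, T-, or C-criterion described above."""
    scores = np.asarray(scores, dtype=float)
    positive = np.flatnonzero(scores > 0)
    if criterion == "P":          # only positive outputs; the set may be empty
        return positive.tolist()
    if criterion == "T":          # as P, but fall back to the single top value
        return positive.tolist() if positive.size else [int(np.argmax(scores))]
    if criterion == "C":          # every class close to the top value,
        top = scores.max()        # whether positive or negative
        return np.flatnonzero(scores >= top - eps).tolist()
    raise ValueError("criterion must be 'P', 'T', or 'C'")
```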
5.3
If all labels assigned to a testing object are proper, then it is classified correctly; if all are wrong, then incorrectly. However, what makes it different from single-label classification is that only some of the labels may be attached properly; this is the case of partial correctness.
Thus, besides standard measures of classification quality, like precision or accuracy, we need additional ones that also take partial correctness into account. Some examples of such measures, for example one-error, coverage, and precision, have been proposed in the literature (see, e.g., [11]). In [2], two methods of multi-label classifier evaluation are proposed, α-evaluation and base class evaluation, which make it possible to analyze the results of classification in a wide range of settings.
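For illustration, a per-object partial-correctness score in the spirit of the α-evaluation of [2] could be computed as in the sketch below; only the basic Jaccard-style form is shown, and the additional weighting parameters defined in [2] are omitted.

```python
def alpha_score(true_labels, predicted_labels, alpha=1.0):
    """Per-object score: 1.0 for a fully correct label set, 0.0 for a fully
    wrong one, and a value in between under partial correctness."""
    Y, P = set(true_labels), set(predicted_labels)
    if not Y and not P:
        return 1.0
    return (len(Y & P) / len(Y | P)) ** alpha
```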
Conclusions

The difficult task of automatic recognition of emotions in music pieces was investigated in our research. The purpose of this investigation was not only to test how numerical parameters perform in the objective description of subjective features, but also to assess the recognition accuracy and to compare the results with human subjects. The obtained accuracy is not high, but it is of the same quality as human assessment. Since humans differ in their opinions regarding emotions evoked by the same piece, inaccuracies in automatic classification are not surprising.
In the next experiments we plan to apply multi-class labeling of sounds and
develop the methodology for multi-label data classication.
We also plan to extend the set of descriptors used for sound parametrization. In particular, we want to investigate how the values of particular parameters change with time and how this is related to particular emotions.
Acknowledgements
This research was partially supported by the National Science Foundation under grant IIS-0414815, by grant 3 T11C 002 26 from the Ministry of Scientific Research and Information Technology of the Republic of Poland, and by the Research Center at the Polish-Japanese Institute of Information Technology, Warsaw, Poland.
The authors express thanks to Dr. Rory A. Lewis from the University of North
Carolina at Charlotte for elaborating the audio database for research purposes.
References
1. Berger, A.: Error-correcting output coding for text classification. IJCAI'99: Workshop on Machine Learning for Information Filtering, Stockholm, Sweden (1999). Available at http://www-2.cs.cmu.edu/~aberger/pdf/ecoc.pdf
2. Boutell, M., Shen, X., Luo, J., Brown, C.: Multi-label Semantic Scene Classification. Technical Report, Dept. of Computer Science, U. Rochester (2003).
3. Burges, C.J.: A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2) (1998) 121-167.
4. Clare, A., King, R.D.: Knowledge Discovery in Multi-label Phenotype Data. Lecture Notes in Computer Science 2168 (2001) 42-53.
5. Fujinaga, I., McMillan, K.: Realtime recognition of orchestral instruments. Proceedings of the International Computer Music Conference (2000) 141-143.
6. Kostek, B., Wieczorkowska, A.: Parametric Representation of Musical Sounds. Archives of Acoustics 22(1) (1997) 3-26.
7. Li, T., Ogihara, M.: Detecting emotion in music. 4th International Conference on Music Information Retrieval ISMIR, Washington, D.C., and Baltimore, MD (2003). Available at http://ismir2003.ismir.net/papers/Li.PDF
8. McCallum, A.: Multi-label Text Classification with a Mixture Model Trained by EM. AAAI'99 Workshop on Text Learning (1999).
9. Peeters, G., Rodet, X.: Automatically selecting signal descriptors for Sound Classification. ICMC 2002, Göteborg, Sweden (2002).
10. Pollard, H.F., Jansson, E.V.: A Tristimulus Method for the Specification of Musical Timbre. Acustica 51 (1982) 162-171.
11. Schapire, R., Singer, Y.: BoosTexter: a boosting-based system for text categorization. Machine Learning 39(2/3) (2000) 135-168.
12. Tato, R., Santos, R., Kompe, R., Pardo, J.M.: Emotional Space Improves Emotion Recognition. 7th International Conference on Spoken Language Processing ICSLP 2002, Denver, Colorado (2002).
13. Tzanetakis, G., Cook, P.: Marsyas: A framework for audio analysis. Organised Sound 4(3) (2000) 169-175. Available at http://www2.cs.cmu.edu/~gtzan/work/pubs/organised00gtzan.pdf
14. Wieczorkowska, A., Wroblewski, J., Synak, P., Slezak, D.: Application of temporal descriptors to musical instrument sound recognition. Journal of Intelligent Information Systems 21(1), Kluwer (2003) 71-93.