
Proceedings of the International MultiConference of Engineers and Computer Scientists 2010 Vol I, IMECS 2010, March 17-19, 2010, Hong Kong

Automatic Musical Pattern Feature Extraction Using Convolutional Neural Network

Tom LH. Li∗, Antoni B. Chan∗ and Andy HW. Chun∗

Abstract—Music genre classification has been a challenging yet promising task in the field of music information retrieval (MIR). Due to the highly elusive characteristics of audio musical data, retrieving informative and reliable features from audio signals is crucial to the performance of any music genre classification system. Previous work on audio music genre classification systems mainly concentrated on using timbral features, which limits the performance. To address this problem, we propose a novel approach to extract musical pattern features in audio music using the convolutional neural network (CNN), a model widely adopted in image information retrieval tasks. Our experiments show that the CNN has a strong capacity to capture informative features from the variations of musical patterns with minimal prior knowledge provided.

Keywords: music feature extractor, music information retrieval, convolutional neural network, multimedia data mining

1 Introduction

Automatic music genre classification has grown vastly in popularity in recent years as a result of the rapid development of the digital entertainment industry. As the first step of genre classification, feature extraction from musical data significantly influences the final classification accuracy. The annual international contest Music Information Retrieval Evaluation eXchange (MIREX) holds regular competitions for audio music genre classification that attract tens of participating groups each year. Most of the systems rely heavily on timbral, statistical spectral features. Feature sets pertaining to other musicological aspects such as rhythm and pitch have also been proposed, but their performance is far less reliable compared with the timbral feature sets. Additionally, there are few feature sets aiming at the variations of musical patterns. The inadequateness of musical descriptors will certainly impose a constraint on audio music genre classification systems.

In this paper we propose a novel approach to automatically retrieve musical pattern features from audio music using the convolutional neural network (CNN), a model that is adopted in image information retrieval tasks. Migrating technologies from another research field brings new opportunities to break through the current bottleneck of music genre classification. The proposed musical pattern feature extractor has advantages in several aspects. It requires minimal prior knowledge to build up, and once obtained, the process of feature extraction is highly efficient; these two advantages guarantee the scalability of our feature extractors. Moreover, our musical pattern features are complementary to the main-stream feature sets used in other classification systems. Our experiments show that musical data have very similar characteristics to image data, so that the variation of musical patterns can be captured using the CNN. We also show that the musical pattern features are informative for genre classification tasks.

∗ Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong. Email: [email protected], [email protected], [email protected]

2 Related Works

By the nature of the data involved in the analysis, the field of music genre classification is divided into two different scopes: symbolic and audio. Symbolic music genre classification studies songs in their symbolic format, such as MIDI, MusicXML, etc. Various models (Basili et al. [1], McKay et al. [2], Ponce et al. [3]) have been proposed to perform symbolic music genre classification. Feature sets representing instrumentation, musical texture, rhythm, dynamics, pitch statistics, melody, etc. are used as input for a wide variety of generic multi-class classifiers.

Identifying the music genre directly from the audio signal is more difficult because of the increased difficulty of feature extraction. In symbolic musical data, information such as instruments and note onsets is readily available in the precise musicological description of the songs. For audio music, however, only the recorded audio signal is readily available. Trying to apply the methodologies of symbolic music analysis to auto-transcribed audio data is highly impractical, since building a reliable auto-transcription system for audio music appears to be a more challenging task than audio genre classification itself. In fact, the best candidate scored only about 70% in the 2009 MIREX melody extraction contest, a simpler task than auto-transcription. Researchers therefore need to turn to alternative approaches to extract informative feature sets for genre classification, such as:


• Tzanetakis et al. [4, 5, 6]: STFT, MFCC, Pitch Histogram, Rhythm Histogram.

• Bergstra et al. [7]: STFT, RCEPS, MFCC, Zero-crossing Rate, Spectral summary, LPC.

• Ellis et al. [8]: MFCC, Chroma.

• Lidy et al. [9, 10]: Rhythm Pattern, Statistical Spectrum Descriptor, Rhythm Histogram, Symbolic Features from auto-transcribed music.

• Meng et al. [11]: MFCC, Mean and variance of MFCC, Filterbank Coefficients, Autoregressive model, Zero-crossing Rate, Short-time Energy Ratio.

Most of the proposed systems concentrate only on feature sets extracted from a short window of audio signals, using statistical measurements such as maximum value, average, deviation, etc. Such features are representative of the "musical texture" of the excerpt concerned, i.e. timbral description. Feature sets concerning other musicological aspects such as rhythm and pitch have also been proposed, but their performance is usually far worse than that of their timbral counterparts. There are few feature sets which capture the musical variation patterns. Relying only on timbral descriptors would certainly limit the performance of genre classification systems; Aucouturier et al. [12] indicate that a performance bottleneck exists if only timbral feature sets are used.

The dearth of musical pattern features can be ascribed to the elusive characteristics of musical data: it is typically difficult to handcraft musical pattern knowledge into feature extractors, and the extra effort required to encode specific knowledge into their computation processes would limit their scalability. To overcome this problem, we propose a novel approach to automatically obtain musical pattern extractors through supervised learning, migrating a widely adopted technology from image information retrieval. We believe that introducing technology from another field brings new opportunities to break through the current bottleneck of audio genre classification.

3 Methodology

In this section, we briefly review the CNN and the proposed music genre classification system.

3.1 Convolutional Neural Network

The design of the convolutional neural network (CNN) has its origin in the study of biological neural systems. The specific pattern of connections discovered in cats' visual neurons is responsible for identifying variations in the topological structure of the objects seen [13]. LeCun incorporated such knowledge into his design of the CNN [14] so that its first few layers serve as feature extractors that are automatically acquired via supervised training. Extensive experiments [14] have shown that the CNN has considerable capacity to capture topological information in visual objects.

There are few applications of the CNN in audio analysis despite its successes in vision research. The core objective of this paper is to examine and evaluate the possibility of extending the application of the CNN to music information retrieval. The evaluation can be further decomposed into the following hypotheses:

• The variations of musical patterns (after a certain form of transform, such as FFT or MFCC) are similar to those in images and can therefore be extracted with a CNN.

• The musical pattern descriptors extracted with the CNN are informative for distinguishing musical genres.

In the latter part of this paper, evidence supporting these two hypotheses will be provided.

3.2 CNN Architecture for Audio

[Figure 1: CNN to extract musical patterns in MFCC. Layers: input 1@190x13 (raw MFCC); 1st conv 3@46x1; 2nd conv 15@10x1; 3rd conv 65@1x1; output 10@1x1 (genre).]

Figure 1 shows the architecture of our CNN model. There are five layers in total, including the input and output layers. The first layer is a 190 × 13 map, which hosts the 13 MFCCs from 190 adjacent frames of one excerpt. The second layer is a convolutional layer of 3 different kernels of equal size. During convolution, the kernel surveys a fixed 10 × 13 region of the previous layer, multiplying each input value with its associated weight in the kernel, adding the kernel bias and passing the result through the squashing function. The result is saved and used as the input to the next convolutional layer. After each convolution, the kernel hops 4 steps forward along the input as a process of subsampling. The 3rd and 4th layers function very similarly to the 2nd layer, with 15 and 65 feature maps respectively; their kernel size is 10 × 1 and their hop size is 4. Each kernel of a convolutional layer has connections with all the feature maps in the previous layer. The last layer is an output layer with full connections to the 4th layer. The parameter selection process is described in Section 4.2.
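
To make the layer dimensions concrete, the following is a minimal sketch of this five-layer architecture in modern PyTorch; the framework, the tanh squashing function and the class name MusicPatternCNN are our assumptions for illustration, as the original implementation predates these tools. The 4-step hop is folded into each convolution as a stride, which reproduces the 1@190x13, 3@46x1, 15@10x1, 65@1x1 and 10-way output chain of Figure 1.

    # Minimal sketch of the five-layer CNN of Figure 1 (PyTorch and the tanh
    # squashing function are assumptions; the paper predates both).
    import torch
    import torch.nn as nn

    class MusicPatternCNN(nn.Module):
        def __init__(self, n_genres: int = 10):
            super().__init__()
            # Each kernel spans 10 frames and hops 4 frames forward, so the
            # convolution + subsampling step becomes a strided convolution.
            self.conv1 = nn.Conv2d(1, 3, kernel_size=(10, 13), stride=(4, 1))   # 1@190x13 -> 3@46x1
            self.conv2 = nn.Conv2d(3, 15, kernel_size=(10, 1), stride=(4, 1))   # -> 15@10x1
            self.conv3 = nn.Conv2d(15, 65, kernel_size=(10, 1), stride=(4, 1))  # -> 65@1x1
            self.out = nn.Linear(65, n_genres)   # full connections to the output layer
            self.squash = nn.Tanh()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, 1, 190, 13) -- 13 MFCCs over 190 adjacent frames
            x = self.squash(self.conv1(x))
            x = self.squash(self.conv2(x))
            x = self.squash(self.conv3(x))   # activations exportable as pattern features
            return self.out(x.flatten(1))    # (batch, n_genres)

    print(MusicPatternCNN()(torch.randn(2, 1, 190, 13)).shape)  # torch.Size([2, 10])

Note that nn.Conv2d connects each kernel to all feature maps of the previous layer, matching the full inter-map connections described above.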


It can be observed from the topology of the CNN that the model is a multi-layer neural network with special constraints on the connections in the convolutional layers, so that each artificial neuron only concentrates on a small region of the input, just like the receptive field of one biological neuron. Because the kernel is shared across one feature map, it becomes a pattern detector that acquires high activation when a certain pattern appears in the input. In our experimental setting, each MFCC frame spans 23 ms of the audio signal with 50% overlap with the adjacent frames. Therefore the first convolutional layer (the 2nd layer) detects basic musical patterns appearing within 127 ms, and the subsequent convolutional layers capture musical patterns in window sizes of 541 ms and 2.2 s, respectively. The CNN is trained using the stochastic gradient descent algorithm [15]. After convergence, the values in the intermediate convolutional layers can be exported as the features of the corresponding musical excerpt.
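
These window sizes follow directly from the frame geometry; the short script below (an illustrative check, not from the paper) reproduces them from the 23 ms frames, 50% overlap, 10-frame kernels and 4-frame hops:

    # Worked check of the 127 ms / 541 ms / 2.2 s windows quoted above.
    frame_ms = 23.0
    hop_ms = frame_ms * 0.5          # 50% overlap between adjacent frames
    kernel, stride = 10, 4

    rf = kernel                      # receptive field of conv layer 1, in frames
    for layer in (1, 2, 3):
        span_ms = (rf - 1) * hop_ms + frame_ms
        print(f"conv layer {layer}: {rf} frames ~ {span_ms:.1f} ms")
        rf += (kernel - 1) * stride ** layer   # widen by the next layer's kernel

    # Prints roughly 126.5 ms, 540.5 ms and 2196.5 ms for receptive fields of
    # 10, 46 and 190 frames, i.e. the 127 ms, 541 ms and 2.2 s quoted above.
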
The model we use is a modified CNN model presented in [16]. Compared with the traditional CNN model, we observed that the training is easier and the loss of capacity is negligible; in return, as much as 66.8% of the computational requirement is saved.

3.3 Music Genre Classification

[Figure 2: Overview of the classification system. Songs pass through MFCC extraction and segmentation into a convolutional neural network, yielding trained musical pattern extractors whose features feed generic classifiers and majority voting to produce the genre.]

Figure 2 shows the overview of our classification system. The first step of the process is MFCC extraction from the audio signals. MFCC is an efficient and highly informative feature set that has been widely adopted for audio analysis since its proposal. After MFCC extraction, the input song is transformed into an MFCC map 13 pixels wide, which is then segmented to fit the input size of the CNN. Provided the song label, the musical pattern extractors are automatically acquired via supervised learning. Those extractors are used to retrieve high-order, pattern-related features which later serve as the input of generic multi-class classifiers such as decision tree classifiers, support vector machines, etc. After the classification of each song segment, the results are aggregated in a majority voting process to produce the song-level label.
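
A compact sketch of this front end follows. librosa, the function names, and the STFT parameters (n_fft=512, about 23 ms at 22050 Hz, with hop_length=256 for 50% overlap) are our assumptions, chosen to match the frame geometry of Section 3.1:

    # Sketch of the pipeline: MFCC map extraction, segmentation into 190-frame
    # CNN inputs, and song-level majority voting over segment predictions.
    from collections import Counter
    import librosa

    def song_segments(path, frames=190, n_mfcc=13):
        y, sr = librosa.load(path, sr=22050)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=512, hop_length=256)   # (13, n_frames)
        n = mfcc.shape[1] // frames
        # Each segment is a 190x13 map, the CNN input size of Figure 1.
        return [mfcc[:, i * frames:(i + 1) * frames].T for i in range(n)]

    def song_label(segment_labels):
        # Majority vote aggregates per-segment genre predictions.
        return Counter(segment_labels).most_common(1)[0][0]

    print(song_label(["jazz", "jazz", "rock", "jazz", "blues"]))  # jazz
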
4 Results and Analysis

4.1 Dataset

The dataset of our experiment is the GTZAN dataset, which has been used to evaluate various genre classification systems [4, 7, 10]. It contains 1000 song excerpts of 30 seconds each, at a sampling rate of 22050 Hz and 16 bits. Its songs are distributed evenly over 10 different genres: Blues, Classical, Country, Disco, Hiphop, Jazz, Metal, Pop, Reggae and Rock.

4.2 CNN Pattern Extractor

[Figure 3: Convergence curve in 200-epoch training: training error rate versus epoch for the 3-genre, 4-genre, 5-genre and 6-genre sub-datasets.]

Figure 3 shows the convergence of the training error rate of our CNN model on four sub-datasets extracted from the GTZAN dataset. The smallest dataset contains 3 genres: Classical, Jazz and Rock. The larger datasets are obtained by adding the Disco, Pop and Blues genres. From the figure we can observe that the trend of convergence over the different datasets is similar; however, the training on the 3-genre dataset converges much faster than the training on the 6-genre dataset. This shows that the difficulty of training the CNN increases drastically with the number of genres involved. We believe this is because the CNN gets confused by the complexity of the training data and therefore never obtains suitable pattern extractors in its first few layers. Additionally, we found that the particular combination of genres in the 3-genre subset does not affect the training of the CNN: all combinations have very similar convergence curves.
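
For concreteness, a hedged sketch of such a training run is given below, reusing the MusicPatternCNN sketch from Section 3.2. Per-excerpt stochastic updates, the learning rate and the cross-entropy loss are our assumptions; the paper states only that stochastic gradient descent [15] is used.

    # Illustrative 200-epoch training loop tracking the training error rate
    # plotted in Figure 3 (loss choice and learning rate are assumptions).
    import torch
    import torch.nn.functional as F

    def train(model, segments, labels, epochs=200, lr=0.01):
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for epoch in range(epochs):
            errors = 0
            for x, y in zip(segments, labels):   # stochastic: one excerpt at a time
                logits = model(x.unsqueeze(0))   # x: (1, 190, 13) MFCC map
                loss = F.cross_entropy(logits, y.unsqueeze(0))
                opt.zero_grad()
                loss.backward()
                opt.step()
                errors += int(logits.argmax(1).item() != y.item())
            print(f"epoch {epoch}: training error rate {errors / len(labels):.3f}")
        return model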


Based on the observations above, the training of our CNN feature extractors is divided over four parallel models to cover the full 10-genre GTZAN dataset. Three models are arbitrarily selected to cover 9 non-overlapping genres, while one model is deliberately chosen to train on the 3 most difficult-to-classify genres shown in [4], i.e. Blues, Metal and Rock. Dividing the dataset into small subsets to train the CNN feature extractors may have the side-effect that features extracted to classify songs within one subset are not effective for inter-subset classification, and it may therefore seem more reasonable to select three 4-genre models instead of four 3-genre models. We observe from our experiments that such an alternative is unnecessary, since features extracted from individual subsets possess a good capacity for inter-subset distinction. Additionally, we also observe that the training of 4-genre subsets is far less effective and less efficient compared with the training of 3-genre subsets.

Extensive experiments were also performed on the selection of the CNN network parameters. The first is the number of network layers. We discovered that a CNN with more than 3 convolutional layers is exceptionally difficult to train, as the network convergence easily gets trapped in local minima. On the other hand, CNNs with fewer than 3 convolutional layers do not have sufficient capacity for music classification. The convolution/subsampling size is set at 10/4 by similar criteria: larger convolution sizes are difficult to train, while smaller ones are subject to capacity limitations. To determine the numbers of feature maps in the three convolutional layers, we first set the three parameters sufficiently large, then watched the performance of the CNN as we gradually reduced them. We discovered that 3, 15 and 65 are the optimal feature map numbers for the first three convolutional layers; reducing them further drastically constrains the capacity of the CNN feature extractors.

4.3 Evaluation

After obtaining the 4 CNNs described above, we apply the feature extractors to the full dataset to retrieve musical pattern features. We deliberately reserved 20% of the songs from the training of the CNNs so as to examine the ability of our feature extractors on unseen musical data. The musical pattern features are evaluated using various models in the WEKA machine learning system [17]. We discovered that the features scored very well in the 10-genre training evaluation, using a variety of tree classifiers such as J48, the Attribute Selected Classifier, etc. The classification accuracy is 84% before the majority voting, and gets even higher afterwards. Additionally, musical excerpts not used in CNN training show only a minor difference in classification rate compared with excerpts used to train the CNNs. This provides evidence to support our hypothesis in Section 3 that the variations of musical patterns in the form of MFCC are similar to those of images, so that a CNN can be used to extract them automatically. In addition, those patterns provide useful information for distinguishing musical genres.

However, further experiments on the held-out test dataset give very poor performance compared with the training evaluation; an accuracy below 30% is too low to make any reliable judgements. This reveals that our current musical pattern extraction model is deficient in generalizing the musical patterns learnt to unseen musical data. We studied this phenomenon further and found that the reason is two-fold: 1. musical data is typically abundant in variation, so 80 songs are hardly sufficient to represent all types of variation within one specific genre; 2. the MFCC feature is sensitive to the timbral, tempo and key variation of music, which further accentuates the shortage of training data.

One practical solution to these problems is to enlarge the training dataset by adding affine transforms of the songs, such as key elevation/lowering, slight tempo shifts, etc. The additional data smooths the variation within one genre and boosts the overall generalizability; similar work can be found in [16]. Alternatively, the MFCC input can be replaced with transforms insensitive to timbral, tempo and key variation, such as the mel-frequency spectrum or the chroma feature [8].
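
As an illustration of such augmentation (librosa and the shift amounts are our assumptions, not the paper's):

    # Illustrative affine-style audio augmentation: key shifts and slight
    # tempo changes applied to the waveform before MFCC extraction.
    import librosa

    def augment(y, sr=22050):
        variants = [y]
        for steps in (-1, 1):          # lower/raise the key by one semitone
            variants.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=steps))
        for rate in (0.95, 1.05):      # slight tempo shifts
            variants.append(librosa.effects.time_stretch(y, rate=rate))
        return variants
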
Our musical pattern extractor can be compared with the work in [18], which also applies an image model to audio music genre classification. The comparison shows that our system possesses better scalability: the texture-of-texture model used in [18] is so computationally intensive that the authors reduced the training set to 17 songs per category, whereas our CNN takes less than two hours to obtain feature extractors from a 3-genre, 240-song training set. The efficiency of the process can be raised further with parallel computing on different combinations of genres.

5 Conclusions and Future Work

In this paper we presented a methodology to automatically extract musical pattern features from audio music. Using the CNN migrated from the image information retrieval field, our feature extractors need minimal prior knowledge to construct. Our experiments show that the CNN is a viable alternative for automatic feature extraction. This discovery lends support to our hypothesis that the intrinsic characteristics of the variation of musical data are similar to those of image data. Our CNN model is also highly scalable. In addition, we presented our findings on the optimal parameter set and best practices for using the CNN in audio music genre classification.

Our experiments reveal that our current model is not robust enough to generalize the training results to unseen musical data. This can be overcome with an enlarged dataset. Furthermore, replacing the MFCCs with other feature sets such as the chroma feature would also improve the robustness of our model. Further applications of image techniques are likely to produce fruitful results for music classification.


References

[1] Basili, R., Serafini, A. and Stellato, A. Classification of musical genre: a machine learning approach. Proceedings of ISMIR, 2004.

[2] McKay, C. and Fujinaga, I. Automatic genre classification using large high-level musical feature sets. Proceedings of ISMIR, 2004.

[3] de León, P.J.P. and Inesta, J.M. Musical style identification using self-organising maps. Proceedings of the Second International Conference on Web Delivering of Music (WEDELMUSIC 2002), p82–89, 2002.

[4] Tzanetakis, G. and Cook, P. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, Volume 10, Number 5, p293–302, 2002.

[5] Li, T. and Tzanetakis, G. Factors in automatic musical genre classification of audio signals. IEEE WASPAA, p143–146, 2003.

[6] Lippens, S., Martens, J.P., De Mulder, T. and Tzanetakis, G. A comparison of human and automatic musical genre classification. IEEE International Conference on Acoustics, Speech, and Signal Processing, Volume 4, p233–236, 2004.

[7] Bergstra, J., Casagrande, N., Erhan, D., Eck, D. and Kégl, B. Aggregate features and AdaBoost for music classification. Machine Learning, Volume 65, Number 2, p473–484, 2006.

[8] Ellis, D.P.W. Classifying music audio with timbral and chroma features. Proc. ISMIR, 2007.

[9] Lidy, T. and Rauber, A. Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR05), p34–41, 2005.

[10] Lidy, T., Rauber, A., Pertusa, A. and Inesta, J.M. Improving genre classification by combination of audio and symbolic descriptors using a transcription system. Proc. ISMIR, Vienna, Austria, 2007.

[11] Meng, A., Ahrendt, P. and Larsen, J. Improving music genre classification by short-time feature integration. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.

[12] Pachet, F. and Aucouturier, J.J. Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences, 2004.

[13] Movshon, J.A., Thompson, I.D. and Tolhurst, D.J. Spatial summation in the receptive fields of simple cells in the cat's striate cortex. The Journal of Physiology, Volume 283, Number 1, p53, 1978.

[14] Bengio, Y. and LeCun, Y. Scaling learning algorithms towards AI. Large-Scale Kernel Machines, 2007.

[15] Spall, J.C. Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control. John Wiley and Sons, 2003.

[16] Simard, P.Y., Steinkraus, D. and Platt, J. Best practices for convolutional neural networks applied to visual document analysis. International Conference on Document Analysis and Recognition (ICDAR), IEEE Computer Society, Los Alamitos, p958–962, 2003.

[17] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P. and Witten, I.H. The WEKA data mining software: an update. SIGKDD Explorations, Volume 11, Issue 1, 2009.

[18] Deshpande, H., Singh, R. and Nam, U. Classification of music signals in the visual domain. Proceedings of the COST-G6 Conference on Digital Audio Effects, 2001.
