
2020 International Symposium on Computer, Consumer and Control (IS3C)

Convolutional Neural Networks Approach for Music Genre Classification

Yu-Huei Cheng, Department of Information and Communication Engineering, Chaoyang University of Technology, Taichung, Taiwan, [email protected]
Pang-Ching Chang, Department of Information and Communication Engineering, Chaoyang University of Technology, Taichung, Taiwan, [email protected]
Che-Nan Kuo*, Department of Artificial Intelligence, CTBC Financial Management College, Tainan, Taiwan, [email protected]
(* indicates corresponding author)

Abstract—In recent years, the complexity of music production has gradually decreased, leading many people to create music and upload it to streaming media. The resulting huge volume of streamed music causes people to spend much time searching for specific music, so techniques for quickly classifying music genres are very important in today's society. As machine learning and deep learning technologies mature, Convolutional Neural Networks (CNN) are being applied to many fields, and various CNN-based variants have emerged one after another. Traditional music genre classification requires relevant professional knowledge to manually extract features from time-series data, whereas deep learning has been proven effective and efficient on time-series data. In order to save users' time when searching for different styles of music, we apply the advantages and characteristics of CNNs on audio to implement a music genre classification model. In the pre-processing stage, Librosa is used to convert the original audio files into their corresponding Mel spectrums. The converted Mel spectrums are then fed into the proposed CNN model for training. Majority voting is applied to the decisions made by the 10 classifiers, and the average accuracy obtained on the GTZAN dataset is 84%.

Keywords—Convolutional Neural Networks; GTZAN; Music genre classification; Mel-spectrum

I. INTRODUCTION

After the advent of the Internet, many people uploaded music originally stored on vinyl records or discs to streaming media on the Internet. Coupled with the rise of music streaming services in recent years, people search for popular music through these services, and the huge online music library makes it difficult to find specific genres or pieces. A tool that can identify and classify music is therefore important for beginners and for specific musicians. Since most of the music on current streaming media carries only a title and an author, and most of it has no specific tags, identifying the hidden tags in songs and classifying songs by genre is a challenging task.

Two steps are required in music genre classification. The first step is to extract the audio features of the input music, and the second step is to construct a classifier from these features. In this study, we use the Mel spectrum to simulate human perception. Davis and Mermelstein [1] proposed Mel frequency analysis and cepstrum analysis in 1980. Experiments on human auditory perception show that human hearing focuses only on certain specific areas instead of on a whole piece of audio. Mel frequency analysis is an analysis method based on this observation: the human ear acts like a filter bank that is selective in frequency, paying attention only to certain frequency components. In other words, it passes signals of certain frequencies and ignores the frequency signals that it does not want to perceive. These filters are not evenly distributed along the frequency axis: more filters are allocated in the low-frequency region and fewer in the high-frequency region. Figure 1 shows a schematic diagram of the filter bank on the Mel scale, where the X-axis represents frequency and the Y-axis represents the pulse signal.

Fig. 1. Schematic diagram of Mel scale filter bank
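
This uneven allocation of filters is easy to inspect with librosa, the library used for preprocessing in this work. A minimal sketch follows; the filter count n_mels=20 is an illustrative choice, not a parameter taken from this paper:

```python
import numpy as np
import librosa

# Build a Mel filter bank for a 1,024-point FFT at the GTZAN sample rate.
sr, n_fft, n_mels = 22050, 1024, 20
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)

# Each row of mel_fb is one triangular filter. Locating each filter's peak
# on the linear frequency axis shows the centers crowding the low end.
freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
print(np.round(freqs[mel_fb.argmax(axis=1)]))
```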

For the spectrogram of an audio signal, a peak value represents a main frequency component of the audio; it is also called a formant, and the formants carry the identifying properties of the sound. In audio recognition, we need to extract the positions of the formants and their transition process. The smooth curve connecting these resonance peaks is called the spectrum envelope. The original spectrum consists of two parts, the envelope and the spectrum details; if these two parts are separated, we obtain the spectrum envelope. Cepstrum analysis is used precisely to separate the envelope from the spectrum details: the cepstrum is the spectrum obtained by taking the logarithm after the Fourier transform of a signal and then performing an inverse Fourier transform. Therefore, cepstrum analysis can decompose the signal.
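
The cepstrum computation just described (Fourier transform, logarithm, inverse Fourier transform) can be sketched in a few lines of numpy. The frame below is a random stand-in, not data from this experiment, and the envelope/detail split at coefficient 32 is illustrative:

```python
import numpy as np

def real_cepstrum(frame: np.ndarray) -> np.ndarray:
    """Cepstrum as described above: FFT, log of the magnitude,
    then inverse FFT."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-10)  # avoid log(0)
    return np.fft.ifft(log_mag).real

frame = np.random.randn(1024)   # stand-in for one windowed audio frame
c = real_cepstrum(frame)
envelope_part = c[:32]          # low quefrencies: smooth spectrum envelope
detail_part = c[32:]            # high quefrencies: spectrum details
```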


This experiment uses the GTZAN data set, established by Tzanetakis and Cook [2] to study the application of machine learning to the classification of music genres, for training and verification. It covers 10 genres, each with 100 pieces of music, for a total of 1,000 pieces. For real-time beat tracking of audio signals with music, Scheirer [3] has special insights: in that system a filter bank is coupled with a comb filter network, and the comb filters track the signal period to generate the main beat as the result. Zheng et al. [4] differ from most approaches in using pitch, duration and MIDI signals as the feature basis for classification, achieving good results. Elbir et al. [5] compared a Support Vector Machine with K-Nearest Neighbors, Naïve Bayes, Decision Tree and Random Forest on the GTZAN data set and found that the Support Vector Machine outperforms the other methods. Rajan et al. [6] combined features extracted from the main melody with the modified group delay feature (MODGDF) and used a Support Vector Machine to demonstrate the potential of group delay and melodic features in music genre classification. Kobayashi et al. [7] proposed low-level audio features based on sub-band audio signals decomposed by the undecimated wavelet transform and demonstrated the advantages of this method over traditional methods with a model built on a Support Vector Machine.

In recent years, the rise of machine learning and deep learning has brought the convolutional neural network (CNN) to prominence, and improvements based on CNNs keep emerging. CNNs have excellent effects on sequences whose arrangement must not change and whose elements must not be missing, and audio is such data: if the arrangement or the elements of an audio sequence are changed, the new audio and the original audio are different files. CNNs have been applied to various complex audio problems, for example sentiment analysis [8], feature extraction [9], genre classification [10] and prediction [11]. Since the CNN model is also widely used on material such as audio signals and word ordering, we propose a method that uses convolutional neural networks to identify different music styles.

In audio analysis, most people use the cepstrum or the Mel spectrum as input. The cepstrum is the spectrum obtained by taking the logarithm after the Fourier transform of the signal and then performing an inverse Fourier transform. The Mel spectrum maps the spectrum onto the nonlinear Mel scale, which is based on auditory perception, by passing the spectrum through the set of Mel filters, as expressed in (1). If x[k] is then subjected to cepstrum analysis, the cepstral coefficients obtained on the Mel spectrum are called Mel frequency cepstrum coefficients (MFCC). Based on the above viewpoints, and in order to simulate the human ear's perception of audio, this experiment uses the Mel spectrum as our preprocessing method.

$\log x[k] = \log(\mathrm{Mel}(f))$    (1)

II. METHODS

A. Dataset

We use GTZAN as our dataset, which was created by Tzanetakis and Cook [2]. The dataset is composed of 10 genres: Blues, Classical, Country, Disco, HipHop, Jazz, Metal, Pop, Reggae and Rock. Each genre contains 100 songs, each of which is a 30-second, 22,050 Hz, mono, 16-bit audio file in WAVE format. The dataset contains a total of 1,000 songs with a total size of 1.6 GB. We split the dataset into 700 songs for the training set and 300 songs for the test set.
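
As a sketch, the 70/30 split can be done per genre so that both sets stay balanced. The gtzan/<genre>/ directory layout below is an assumption, since the paper does not describe how the files are stored:

```python
import os
import random

GENRES = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]

def split_gtzan(root="gtzan", train_ratio=0.7, seed=42):
    """Return (train, test) lists of (path, label), split per genre."""
    rng = random.Random(seed)
    train, test = [], []
    for label, genre in enumerate(GENRES):
        files = sorted(os.listdir(os.path.join(root, genre)))
        rng.shuffle(files)
        cut = int(len(files) * train_ratio)  # 70 of the 100 songs per genre
        train += [(os.path.join(root, genre, f), label) for f in files[:cut]]
        test += [(os.path.join(root, genre, f), label) for f in files[cut:]]
    return train, test  # 700 and 300 songs, respectively
```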
learning features. This leads to the very low generalization and
B. Preprocessing

This study uses the Mel spectrum as input. Van den Oord et al. [12] have used MFCC to preprocess songs. In this experiment, each 30-second audio clip is sent to the preprocessing stage and converted into its Mel spectrum: the clip is quantized into roughly 660,000 audio samples, a fast Fourier transform is performed on 1,024-sample frames, and the hop size is set to 256. The preprocessing of this experiment is carried out with Librosa. It applies pre-emphasis, framing and windowing to the original audio, maps the amplitude and frequency of each frame after the fast Fourier transform onto the Mel scale, merges the frames according to the FFT frame length and the hop size, and finally performs cepstrum analysis to obtain the MFCC.
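
A minimal librosa sketch of this preprocessing with the stated parameters (22,050 Hz, 30-second clips, 1,024-point FFT, hop size 256). The helper names to_mel_spectrogram and to_mfcc are hypothetical, and this is an illustration rather than the authors' code:

```python
import numpy as np
import librosa

def to_mel_spectrogram(path: str) -> np.ndarray:
    # ~30 s at 22,050 Hz gives roughly 660,000 samples per clip.
    y, sr = librosa.load(path, sr=22050, duration=30.0)
    # librosa handles framing and windowing internally.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256)
    return librosa.power_to_db(mel)  # log-Mel spectrum, cf. formula (1)

def to_mfcc(path: str) -> np.ndarray:
    # MFCCs add the final cepstrum-analysis step on top of the Mel mapping.
    y, sr = librosa.load(path, sr=22050, duration=30.0)
    return librosa.feature.mfcc(y=y, sr=sr, n_fft=1024, hop_length=256)
```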

C. Architecture for Convolutional Neural Network

Convolutional neural networks are basically composed of convolutional layers, pooling layers, and fully connected layers. The convolutional layer obtains local features of the audio or image through a window of a specified size (the convolutional kernel) that slides over the input. Then, through the activation function, the feature map is generated as the input of the next layer. The function of the pooling layer is to reduce the size of the input audio or image, lowering the dimension of each feature map while retaining its important features. The fully connected layer acts as a general neural network, performing the classification after receiving the feature information from the preceding convolutional and pooling layers. Neurons in a convolutional layer are connected only to the pixels of the previous layer covered by the kernel, and the kernel weights are shared within the same layer. Figure 2 shows a schematic diagram of the convolutional neural network architecture.

Fig. 2. Schematic diagram of convolutional neural network

The function of the pooling layer is to compress the local features after convolution in order to save computing resources and time. There are two pooling methods, max pooling and average pooling. This study uses max pooling as the method of our pooling layer: the data in each block of the matrix is reduced by taking its maximum value, as shown in Figure 3 and in the sketch below.

Fig. 3. Schematic diagram of max pooling
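
A minimal numpy sketch of (2, 2) max pooling on a toy matrix:

```python
import numpy as np

def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    """Keep the maximum of each non-overlapping 2x2 block,
    halving both dimensions."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2  # trim odd edges
    x = x[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return x.max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [9, 2, 1, 0],
              [3, 4, 5, 6]])
print(max_pool_2x2(x))  # [[7 8]
                        #  [9 6]]
```
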
The purpose of using dropout is to prevent overfitting. Overfitting means that the neural network matches a specific dataset too closely or too precisely when learning features, which leads to very low generalization and recognition accuracy. Dropout is a technique currently used in deep learning to reduce overfitting: the network randomly disconnects neurons during learning, and these disconnected neurons do not participate in the current training step. Through this repeated random sampling, a sub-network is constructed from the original neural network, and the structure of the sub-network differs from that of the original network. Figure 4 is a schematic diagram of dropout.

Fig. 4. Schematic diagram of dropout
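
A minimal sketch of dropout as a random mask over activations. The rescaling by 1/(1 - rate), known as inverted dropout, is a common implementation convention and not a detail stated in this paper:

```python
import numpy as np

def dropout(activations: np.ndarray, rate: float = 0.5,
            training: bool = True) -> np.ndarray:
    """Randomly disconnect neurons during training, as described above."""
    if not training:
        return activations  # no neurons are dropped at inference time
    keep = (np.random.rand(*activations.shape) >= rate)
    # Rescale survivors so the expected output magnitude is unchanged.
    return activations * keep / (1.0 - rate)
```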

Fig. 5. Function graph of ReLU

We add an activation function after each convolutional layer. The activation function we use is the Rectified Linear Unit (ReLU): it outputs 0 when the input is negative and outputs the input value itself when the input is positive. ReLU alleviates the vanishing gradient problem and has the advantages of fast computation and fast convergence. Figure 5 and (2) respectively show the function graph of ReLU and its mathematical formula.

$f(x) = \max(0, x)$    (2)

The CNN model architecture used in this research is composed of 5 convolutional layers. The preprocessed spectrogram is used as the input to the CNN model. We set the convolutional kernel to (3, 3), the stride to (1, 1), the activation function to ReLU, the max pooling window to (2, 2), and the dropout rate to 0.5. The output of the first convolutional layer is 128, which is compressed to 64 by the max pooling layer and sent to the second layer. The output of the second convolutional layer is 64, which is compressed to 32 by the max pooling layer and sent to the third layer. The output of the third convolutional layer is 32, which is compressed to 16 by the max pooling layer and sent to the fourth layer. The output of the fourth convolutional layer is 16, which is compressed to 8 by the max pooling layer and sent to the fifth layer. The output of the fifth layer is 8, which is compressed to 4 by the max pooling layer and sent to the fully connected layer. After classification by the fully connected layer, the result is obtained, and a majority vote is taken to obtain the accuracy. Figure 6 shows the proposed CNN model architecture.

Fig. 6. Proposed CNN model architecture
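
Below is a minimal Keras sketch consistent with this description, reading the 128, 64, 32, 16, 8, 4 sequence as the feature-map size halved by each (2, 2) pooling layer starting from a 128x128 input. The input shape, padding and number of filters per layer are assumptions, since the paper does not state them; this is an illustration, not the authors' code:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(128, 128, 1), n_genres=10, n_filters=32):
    model = keras.Sequential([layers.Input(shape=input_shape)])
    # Five blocks: (3, 3) conv, stride (1, 1), ReLU, (2, 2) max pooling,
    # dropout 0.5. Pooling shrinks 128 -> 64 -> 32 -> 16 -> 8 -> 4.
    for _ in range(5):
        model.add(layers.Conv2D(n_filters, (3, 3), strides=(1, 1),
                                padding="same", activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
        model.add(layers.Dropout(0.5))
    model.add(layers.Flatten())
    model.add(layers.Dense(n_genres, activation="softmax"))
    # ADAM optimizer, as used in Section III.A.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```
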
III. RESULTS

A. Experimental Environment

First of all, this research splits the GTZAN dataset into 70% for the training set and 30% for the test set, converts all the audio in the dataset into their respective MFCCs, and sends them to the proposed CNN model for training. Librosa is a tool for audio signal processing; we use it for audio conversion in preprocessing to obtain the spectrograms we need. The experiment was performed on a GPU server running Ubuntu 18.04 with an NVIDIA RTX 2080Ti and 64 GB of memory. The total number of iterations performed in the experiment is 2,180, the batch size is set to 32, the number of epochs is set to 100, and the experiment execution time is 13.1 hours.

ADAM [13] is an optimizer that controls the learning rate; it iteratively updates the neural network weights to optimize the objective function based on the training data. Therefore, we also use ADAM in our architecture.

AUC-ROC (Area Under the Curve of the Receiver Operating Characteristic curve) is a coordinate-based analysis tool usually used as a scoring standard for audio classification, as shown in (3). Here, true positive (TP) means something is detected and it does indeed exist. False positive (FP) means something is detected but it does not actually exist. True negative (TN) means something is not detected and it also does not exist. False negative (FN) means something is not detected but it does actually exist.

$\mathrm{ROC} = \dfrac{TP/(TP+FN)}{FP/(FP+TN)}$    (3)

AUC-ROC is mostly applied to unbalanced datasets. The GTZAN dataset used in this experiment is balanced, so we adopt majority voting as our scoring indicator instead, as shown in (4). Here, T indicates that there are T classifiers and N indicates that there are N categories: if the votes of the T classifiers for category j exceed half of the total votes cast, the prediction is category j; otherwise the prediction is refused.

$H(x) = \begin{cases} C_j, & \text{if } \sum_{i=1}^{T} h_i^j(x) > \frac{1}{2} \sum_{k=1}^{N} \sum_{i=1}^{T} h_i^k(x) \\ \text{refuse}, & \text{otherwise} \end{cases}$    (4)
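
A minimal numpy sketch of the majority vote in (4): a category j wins only if its votes exceed half of all votes cast, otherwise the prediction is refused (encoded here as -1):

```python
import numpy as np

def majority_vote(predictions: np.ndarray, n_categories: int) -> int:
    """predictions holds one predicted category per classifier."""
    votes = np.bincount(predictions, minlength=n_categories)
    j = int(votes.argmax())
    # Require a strict majority of all votes cast, per formula (4).
    return j if votes[j] > votes.sum() / 2 else -1  # -1 means "refuse"

# Example: 10 classifiers, 10 genres; 6 of 10 votes go to genre 3.
preds = np.array([3, 3, 3, 1, 3, 3, 7, 3, 2, 0])
print(majority_vote(preds, 10))  # 3
```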

TABLE I. COMPARISON RESULTS OF OUR METHOD AND OTHER METHODS

Method        Accuracy
Our Method    83.30%
[5]           72.70%
[6]           71.73%
[14]          79.70%

Fig. 7. Accuracy curve of our proposed method

Fig. 8. Loss curve of our proposed method

Fig. 9. Confusion matrix of the CNN model

B. Experimental Results

The architecture proposed in this study has an accuracy rate of 77% based on the confusion matrix, and the accuracy rate with majority voting is 83.30%. Table I shows the results of our method compared with [5], [6], and [14]. Figure 7 and Figure 8 respectively show the accuracy curve and the loss curve of the model, and Figure 9 shows the confusion matrix of the CNN model. In the confusion matrix, we found that the Rock genre scores the lowest. The reason is that the Rock music included in GTZAN is more diverse, so it conflicts with other genres, causing the low score.
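
Per-genre scores like these come from normalising the rows of the confusion matrix. A short sketch, where y_true and y_pred stand for hypothetical test-set labels and model predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_genre_accuracy(y_true, y_pred, n_genres=10):
    # Rows are true genres; the row-normalised diagonal gives each
    # genre's recognition rate.
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_genres)))
    return cm.diagonal() / cm.sum(axis=1)

# The smallest entry of the returned vector identifies the weakest
# class (Rock, in the results above).
```
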
IV. CONCLUSION

Music genre classification can help users find the music they are interested in, especially specific musicians and beginners. Because beginners are new to music and relatively unfamiliar with the various music styles, it takes them a lot of time to find a specific style of music on streaming media, which causes inefficiency. For a specific musician looking for music of a desired genre, listening for a long time to judge the genre leads to fatigue and failing judgment, and also costs a lot of search time. Therefore, a music genre classification tool is a time-saving method for these people. The highest accuracy of our proposed convolutional neural network for music genre classification is 83.3%, which will help future work on music genre classification. In the future, we will continue to increase the accuracy of the model and integrate streaming media and web crawlers with our CNN architecture to make it more complete, helping music beginners and specific musicians shorten their search time and increase their efficiency.

ACKNOWLEDGMENT

This work was supported in part by the Ministry of Science and Technology (MOST) in Taiwan under grants MOST109-2221-E-324-029, MOST108-2218-E-005-021 and MOST108-2821-C-324-001-ES, and by the Chaoyang University of Technology (CYUT) and the Higher Education Sprout Project, Ministry of Education, Taiwan, under the project "The R&D and the cultivation of talent for Health-Enhancement Products."

REFERENCES

[1] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.
[2] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293-302, 2002.
[3] E. D. Scheirer, "Tempo and beat analysis of acoustic musical signals," The Journal of the Acoustical Society of America, vol. 103, no. 1, pp. 588-601, 1998.
[4] E. Zheng, M. Moh, and T.-S. Moh, "Music genre classification: A n-gram based musicological approach," in 2017 IEEE 7th International Advance Computing Conference (IACC), 2017, pp. 671-677.
[5] A. Elbir, H. B. Çam, M. E. Iyican, B. Öztürk, and N. Aydin, "Music genre classification and recommendation by using machine learning techniques," in 2018 Innovations in Intelligent Systems and Applications Conference (ASYU), 2018, pp. 1-5.
[6] R. Rajan and H. A. Murthy, "Music genre classification by fusion of modified group delay and melodic features," in 2017 Twenty-third National Conference on Communications (NCC), 2017, pp. 1-6.
[7] T. Kobayashi, A. Kubota, and Y. Suzuki, "Audio feature extraction based on sub-band signal correlations for music genre classification," in 2018 IEEE International Symposium on Multimedia (ISM), 2018, pp. 180-181.
[8] M. Roopaei, P. Rad, and M. Jamshidi, "Deep learning control for complex and large scale cloud systems," Intelligent Automation & Soft Computing, vol. 23, no. 3, pp. 389-391, 2017.
[9] T. Li, A. B. Chan, and A. H. Chun, "Automatic musical pattern feature extraction using convolutional neural network," Genre, vol. 10, p. 1x1, 2010.
[10] T. Nakashika, C. Garcia, and T. Takiguchi, "Local-feature-map integration using convolutional neural networks for music genre classification," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[11] S. Sigtia and S. Dixon, "Improved music feature learning with deep neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 6959-6963.
[12] A. Van den Oord, S. Dieleman, and B. Schrauwen, "Deep content-based music recommendation," in Advances in Neural Information Processing Systems, 2013, pp. 2643-2651.
[13] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[14] J. M. de Sousa, E. T. Pereira, and L. R. Veloso, "A robust music genre classification approach for global and regional music datasets evaluation," in 2016 IEEE International Conference on Digital Signal Processing (DSP), 2016, pp. 109-113.
