Abstract—In recent years, the complexity of music production has gradually decreased, which has led many people to create music and upload it to streaming media. The huge volume of music on streaming media causes people to spend much time searching for specific music. Therefore, techniques for quickly classifying music genres are very important in today's society. As machine learning and deep learning technologies mature, Convolutional Neural Networks (CNN) have been applied to many fields, and various CNN-based variants have emerged one after another. Traditional music genre classification requires relevant professional knowledge to manually extract features from time series data, whereas deep learning has been proven effective and efficient on time series data. In order to save users' time when searching for different styles of music, we applied CNN's advantages and its suitability for audio to implement a music genre classification model. In the pre-processing step, Librosa is used to convert the original audio files into their corresponding Mel spectrums. The converted Mel spectrum is then fed into the proposed CNN model for training. Majority voting is applied to the decisions made by the 10 classifiers, and the average accuracy obtained on the GTZAN dataset is 84%.

Keywords—Convolutional Neural Networks; GTZAN; Music genre classification; Mel-spectrum
I. INTRODUCTION

After the advent of the Internet, many people uploaded music originally stored on vinyl records or discs to streaming media on the Internet. Coupled with the rise of music streaming services in recent years, people search for popular music through these services, and the huge online music library makes it difficult to find specific genres or songs. Therefore, a tool which can identify and classify music is an important issue for beginners and specific musicians. Since most of the music on current streaming media carries only a title and an author, and most of it does not carry specific tags, identifying the hidden tags in songs and classifying songs according to genre is a challenging task.

Two steps are required in music genre classification. The first step is to extract the audio features of the input music, and the second step is to construct a classifier from these features. In this study, we use the Mel spectrum to simulate human perception. Davis and Mermelstein [1] proposed Mel frequency analysis and cepstrum analysis in 1980. Experiments on human auditory perception show that humans focus only on certain specific areas rather than on a whole piece of audio. Mel frequency analysis is an analysis method based on this property. It has been observed that the human ear acts like a filter bank and that human hearing is selective in frequency, paying attention only to certain frequency components. In other words, it passes only signals of certain frequencies and ignores the frequency signals that it does not want to perceive. However, these filters are not evenly distributed along the frequency axis: more filters are allocated in the low-frequency region and fewer in the high-frequency region. Figure 1 shows a schematic diagram of the filter bank on the Mel scale, where the X-axis represents frequency and the Y-axis represents the pulse signal. In the spectrogram of an audio signal, a peak value represents a main frequency component of the audio; such a peak is also called a formant, and formants carry the identifying properties of the sound. In audio recognition, we need to extract the positions of the formants and their transition process. The smooth curve connecting these resonance peaks is called the spectrum envelope. The original spectrum consists of two parts, the envelope and the spectrum details; if these two parts are separated, we can obtain the spectrum envelope. Cepstrum analysis serves exactly this purpose of separating the envelope from the spectrum details. The cepstrum is the spectrum obtained by taking the logarithm after a Fourier transformation of the signal and then performing an inverse Fourier transformation. Therefore, cepstrum analysis can decompose the signal.

Fig. 1. Schematic diagram of the Mel scale filter bank
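As a concrete illustration of the cepstrum described above (our own sketch, not code from the paper), the following NumPy snippet computes the real cepstrum of a signal frame and recovers the spectrum envelope by low-pass liftering; the coefficient count is an illustrative assumption.

```python
import numpy as np

def real_cepstrum(frame):
    """Cepstrum as defined above: FFT -> log magnitude -> inverse FFT."""
    spectrum = np.fft.fft(frame)
    log_magnitude = np.log(np.abs(spectrum) + 1e-10)  # guard against log(0)
    return np.fft.ifft(log_magnitude).real

def spectrum_envelope(frame, n_coeffs=20):
    """Separate the envelope from the spectrum details by keeping only
    low-quefrency cepstral coefficients (and their symmetric mirror)."""
    cep = real_cepstrum(frame)
    lifter = np.zeros_like(cep)
    lifter[:n_coeffs] = 1.0
    lifter[-(n_coeffs - 1):] = 1.0  # mirrored half of the real cepstrum
    return np.fft.fft(cep * lifter).real  # smoothed log-magnitude spectrum
```

The low-quefrency part of the cepstrum corresponds to the slowly varying envelope, while the remaining coefficients carry the fine spectrum details.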
This experiment uses the GTZAN data set for training and verification. It was established by Tzanetakis and Cook [2] to study the application of machine learning to the classification of music genres, and it contains 10 genres, each with 100 pieces of music, for a total of 1000 pieces. For real-time beat tracking of acoustic musical signals, Scheirer [3] offers special insights: a filter bank is coupled with a network of comb filters, and the comb filters track the signal period to generate the main beat. Zheng et al. [4] differ from most approaches in that they use pitch, duration and MIDI signals as the feature basis for classification, achieving good results. Elbir et al. [5] compared the Support Vector Machine with K-Nearest Neighbors, Naïve Bayes, Decision Tree, and Random Forest on the GTZAN data set and found that the Support Vector Machine outperforms the other methods. Rajan and Murthy [6] fused features extracted from the main melody with the modified group delay feature (MODGDF) and used a Support Vector Machine to demonstrate the potential of group delay and melodic features in music genre classification.
Kobayashi et al. [7] proposed low-level audio features based on sub-band signals decomposed by the undecimated wavelet transform and demonstrated the advantages of this method over traditional ones through a model built with a Support Vector Machine.
In recent years, the rise of machine learning and deep learning has made the convolutional neural network (CNN) stand out, and CNN-based improvements keep emerging. CNNs perform very well on data whose sequence must not be altered or lose elements, and audio is one such kind of data: if the arrangement or the elements of an audio sequence are changed, the new audio and the original audio are different files. CNNs have been applied to solve various complex audio problems, for example sentiment analysis [8], feature extraction [9], genre classification [10] and prediction [11]. The CNN model is also widely used on material such as audio signals and word sequences. Therefore, we propose a method that uses convolutional neural networks to identify different music styles.
In audio analysis, most people use the cepstrum or the Mel spectrum as input. The cepstrum is a spectrum obtained by taking the logarithm after a Fourier transformation of the signal and then performing an inverse Fourier transformation. The Mel spectrum maps the spectrum onto the nonlinear Mel scale based on auditory perception: the spectrum is passed through a set of Mel filters, yielding the Mel spectrum, as shown in (1). If x[k] is subjected to cepstrum analysis, the cepstral coefficients obtained on the Mel spectrum are called Mel frequency cepstrum coefficients (MFCC). Based on the above viewpoints, in order to simulate the human ear's perception of audio, this experiment uses the Mel spectrum as its preprocessing method.

x[k] = log(Mel(f[k]))    (1)
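The paper states that Librosa performs this conversion; since the exact parameters are not given in this excerpt, the values below (22050 Hz sample rate, 2048-point FFT, 128 Mel bands) and the file path are illustrative assumptions.

```python
import numpy as np
import librosa

# Load one clip (GTZAN excerpts are 30 s long); the path is hypothetical.
y, sr = librosa.load("genres/blues/blues.00000.wav", sr=22050)

# Spectrum -> Mel filter bank (more filters at low frequencies, fewer at high).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)

# Log compression, in the spirit of x[k] = log(Mel(f[k])) in (1).
log_mel = librosa.power_to_db(mel, ref=np.max)
```

The resulting log_mel array (Mel bands by time frames) is the kind of input that is fed to the CNN described below.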
II. METHODS

C. Architecture for Convolutional Neural Network

Convolutional neural networks are basically composed of convolutional layers, pooling layers, and fully connected layers. The principle of the convolutional layer is to obtain local features of the audio or image through a window of a specified size (the convolution kernel) that slides over the input. The activation function is then applied, and the resulting feature map becomes the input of the next layer. The function of the pooling layer is to reduce the size of the input audio or image, lowering the dimension of each feature map while retaining its important features. The fully connected layer acts as an ordinary neural network, performing classification after receiving the feature information from the preceding convolutional and pooling layers. Neurons in a convolutional layer are connected only to the pixels of the previous layer that fall within the kernel, and these weights are shared across the layer. Figure 2 shows a schematic diagram of the convolutional neural network architecture.

Fig. 2. Schematic diagram of a convolutional neural network

Fig. 3. Schematic diagram of max pooling
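Since the exact layer configuration is not reproduced in this excerpt, the following Keras model is only a sketch of the convolution, pooling, and fully connected pattern described above; the filter counts, kernel sizes, and input shape are our assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_genre_cnn(input_shape=(128, 128, 1), num_genres=10):
    """Convolution/pooling blocks followed by a fully connected classifier."""
    return keras.Sequential([
        layers.Input(shape=input_shape),          # log-Mel spectrogram patch
        layers.Conv2D(32, 3, activation="relu"),  # sliding kernel extracts local features
        layers.MaxPooling2D(2),                   # shrink feature maps, keep strong responses
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dropout(0.5),                      # dropout regularization (cf. Fig. 4)
        layers.Dense(128, activation="relu"),     # fully connected layer
        layers.Dense(num_genres, activation="softmax"),
    ])

model = build_genre_cnn()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Ten independently trained instances of such a model would then supply the decisions combined by the majority voting step.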
Fig. 4. Schematic diagram of dropout
TABLE I. COMPARISON OF OUR METHOD WITH OTHER METHODS

Method                    Accuracy
[5]                       72.70%
[6]                       71.73%
[14]                      79.70%
Ours (majority voting)    83.30%
Fig. 8. Loss curve of our proposed method

H(x) = c_j,    if Σ_{i=1}^{T} h_i^j(x) > (1/2) Σ_{k=1}^{N} Σ_{i=1}^{T} h_i^k(x);
       refuse, otherwise.    (4)
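Equation (4) can be read as follows: output class c_j only if it receives more than half of all votes cast by the T classifiers across the N classes, and refuse otherwise. A minimal sketch of this rule (ours; the classifiers objects and their predict interface are hypothetical stand-ins for the 10 trained CNNs):

```python
import numpy as np

def majority_vote(classifiers, x, num_classes=10):
    """Implements the voting rule in (4): return the winning class index
    if it receives more than half of all votes, otherwise None (refuse)."""
    votes = np.zeros(num_classes)
    for clf in classifiers:              # h_1, ..., h_T
        votes[clf.predict(x)] += 1       # each classifier casts one vote
    winner = int(np.argmax(votes))
    if votes[winner] > votes.sum() / 2:  # strict majority over all votes
        return winner
    return None                          # refuse, otherwise
```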
B. Experimental Results

The architecture proposed in this study reaches an accuracy of 77% based on the confusion matrix, and the accuracy of majority voting is 83.30%. Table I compares our method with [5], [6], and [14]. Figure 7 and Figure 8 show the accuracy curve and the loss curve of the model, respectively. Figure 9 shows the confusion matrix of the CNN model. In the confusion matrix, we found that the Rock genre scores lowest. The reason is that the Rock music included in GTZAN is more diverse and therefore overlaps with other genres, causing the low score.
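As a hedged sketch of how such per-genre results can be inspected (not the paper's code), scikit-learn's confusion matrix shows how clips of each genre are classified; the label arrays here are randomly generated placeholders.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Placeholder labels for a GTZAN test split (genre indices 0-9).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, size=200)
y_pred = rng.integers(0, 10, size=200)

print("accuracy:", accuracy_score(y_true, y_pred))
# Row g counts how clips of genre g were classified; heavy off-diagonal
# mass in the Rock row would reflect the diversity noted above.
print(confusion_matrix(y_true, y_pred))
```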
IV. CONCLUSION

Music genre classification can help users find the music they are interested in, especially specific musicians and beginners. Because beginners are new to music and relatively unfamiliar with the various styles, it takes them a lot of time to find a specific style of music on streaming media, which is inefficient. A specific musician looking for music of a desired genre who listens for a long time to judge the genre will suffer listening fatigue and failing judgment, and will also spend a lot of time searching. Therefore, a music genre classification tool saves time for these people. The highest accuracy of our proposed convolutional neural network for music genre classification is 83.3%, which will help future work on music genre classification. In the future, we will continue to increase the accuracy of the model and integrate streaming media and web crawlers with our CNN architecture to make it more complete, helping music beginners and specific musicians shorten their search time and increase their efficiency.

ACKNOWLEDGMENT

This work was supported in part by the Ministry of Science and Technology (MOST) in Taiwan under grants MOST109-2221-E-324-029, MOST108-2218-E-005-021, and MOST108-2821-C-324-001-ES, and by the Chaoyang University of Technology (CYUT) and Higher Education Sprout Project, Ministry of Education, Taiwan, under the project name: "The R&D and the cultivation of talent for Health-Enhancement Products."
REFERENCES

[1] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.
[2] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293-302, 2002.
[3] E. D. Scheirer, "Tempo and beat analysis of acoustic musical signals," The Journal of the Acoustical Society of America, vol. 103, no. 1, pp. 588-601, 1998.
[4] E. Zheng, M. Moh, and T.-S. Moh, "Music genre classification: A n-gram based musicological approach," in 2017 IEEE 7th International Advance Computing Conference (IACC), 2017, pp. 671-677.
[5] A. Elbir, H. B. Çam, M. E. Iyican, B. Öztürk, and N. Aydin, "Music genre classification and recommendation by using machine learning techniques," in 2018 Innovations in Intelligent Systems and Applications Conference (ASYU), 2018, pp. 1-5.
[6] R. Rajan and H. A. Murthy, "Music genre classification by fusion of modified group delay and melodic features," in 2017 Twenty-third National Conference on Communications (NCC), 2017, pp. 1-6.
[7] T. Kobayashi, A. Kubota, and Y. Suzuki, "Audio feature extraction based on sub-band signal correlations for music genre classification," in 2018 IEEE International Symposium on Multimedia (ISM), 2018, pp. 180-181.
[8] M. Roopaei, P. Rad, and M. Jamshidi, "Deep learning control for complex and large scale cloud systems," Intelligent Automation & Soft Computing, vol. 23, no. 3, pp. 389-391, 2017.
[9] T. Li, A. B. Chan, and A. H. Chun, "Automatic musical pattern feature extraction using convolutional neural network," in Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS), 2010.
[10] T. Nakashika, C. Garcia, and T. Takiguchi, "Local-feature-map integration using convolutional neural networks for music genre classification," in Thirteenth Annual Conference of the International Speech Communication Association (INTERSPEECH), 2012.
[11] S. Sigtia and S. Dixon, "Improved music feature learning with deep neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 6959-6963.
[12] A. van den Oord, S. Dieleman, and B. Schrauwen, "Deep content-based music recommendation," in Advances in Neural Information Processing Systems, 2013, pp. 2643-2651.
[13] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[14] J. M. de Sousa, E. T. Pereira, and L. R. Veloso, "A robust music genre classification approach for global and regional music datasets evaluation," in 2016 IEEE International Conference on Digital Signal Processing (DSP), 2016, pp. 109-113.