Million Song Dataset
Feature extraction with Spectral Analysis
Classification with k-Means algorithm
Data Set used (1)
MillionSongSubset from https://labrosa.ee.columbia.edu: 10,000 songs (1% of
the Million Song Dataset) selected at random.
The data are in HDF5 format, a file format designed for storing and
organizing large data arrays.
I used a MATLAB wrapper to access the data in the HDF5 files. The wrapper
is also available from https://labrosa.ee.columbia.edu.
Data Set used (2)
• The data for each song is stored in a .h5 file. It looks like the pictures below:
• There is no audio signal data, only metadata such as year,
artist…
Input set
1000 arrays like the one in the picture, each holding the ASCII codes of a
song name
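A minimal sketch of how such an input set could be built. It is written in Python rather than the MATLAB used for the slides, the song titles are placeholder examples, and the zero padding to a common row length is an assumption (the slides do not say how names of different lengths were handled).

```python
def names_to_ascii_rows(names, width=None):
    """Encode each song name as a fixed-length row of ASCII codes."""
    # Non-ASCII characters are replaced by '?' so every code stays below 128.
    encoded = [list(name.encode("ascii", errors="replace")) for name in names]
    if width is None:
        width = max(len(codes) for codes in encoded)
    rows = []
    for codes in encoded:
        codes = codes[:width]
        # Assumption: shorter names are right-padded with zeros.
        rows.append(codes + [0] * (width - len(codes)))
    return rows


# Placeholder titles, not taken from the dataset.
rows = names_to_ascii_rows(["Yesterday", "Hey Jude"])
for row in rows:
    print(row)
```

Each row then has the same length, so the rows can be stacked into one matrix with one song per row.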
Feature extraction using Spectral Analysis
• Feature extraction means projecting the input features from an M-dimensional
space onto an N-dimensional space (N < M). The new features in the
N-dimensional space should be uncorrelated.
• Spectral analysis can be done with the fast Fourier transform (FFT), which
MATLAB already implements as the function fft().
Apply fft to input data
• We observe that only the first element has a
significant value (the first FFT element is the sum of the values in the row).
• We therefore keep only the first FFT element for
each row of the input data.
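The observation above can be checked on a small example. The sketch below is Python, not the MATLAB fft() call used for the slides, and it uses a naive discrete Fourier transform so that it needs no external library. The input row is a placeholder title. Because ASCII codes are all non-negative, the first coefficient (the sum of the row) has the largest magnitude.

```python
import cmath


def dft(row):
    """Naive discrete Fourier transform; same definition as an FFT, only slower."""
    n = len(row)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(row))
        for k in range(n)
    ]


# Placeholder input row: ASCII codes of an example title.
row = [ord(ch) for ch in "Yesterday"]
magnitudes = [abs(c) for c in dft(row)]

# The first coefficient equals the sum of the row.
print(magnitudes[0], sum(row))
# The remaining coefficients are smaller for non-negative input.
print(max(magnitudes[1:]))

# Feature kept per song: the first coefficient only.
feature = magnitudes[0]
```

Keeping only this coefficient reduces each song name to a single number, so the clustering step works on one-dimensional features.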
Classification using K-means algorithm
• Classification with the K-means algorithm means grouping the input features into K
clusters using an iterative method.
• The steps of the K-means algorithm are:
• Place K centroids at random in the input feature space.
• Calculate the distance from each feature to every centroid and assign the feature to the
closest one.
• Recalculate each centroid from the features in its cluster.
• Repeat until convergence (no feature changes cluster any more).
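The steps above can be sketched for one-dimensional features as follows. This is an illustrative Python version, not the code behind the slides. The function name kmeans_1d, the seed, the iteration cap, and the sample values are all assumptions, and the initial centroids are drawn from the data points as one common way to "place them at random".

```python
import random


def kmeans_1d(values, k, seed=0, max_iter=100):
    """Cluster scalar features into k groups with plain K-means."""
    rng = random.Random(seed)
    # Step 1: random initial centroids (assumption: picked among the data points).
    centroids = rng.sample(values, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every feature to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: abs(v - centroids[j])) for v in values
        ]
        # Step 4: stop when no feature changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its cluster.
        for j in range(k):
            members = [v for v, a in zip(values, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = sum(members) / len(members)
    return centroids, assignment


# Placeholder data: two well-separated groups.
values = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
centroids, assignment = kmeans_1d(values, k=2)
print(sorted(centroids), assignment)
```

On this well-separated data the result does not depend on which points are picked as initial centroids; on real data it can, which is one of the weaknesses listed later.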
K-means Clustering
http://rossfarrelly.blogspot.ro/2012/12/k-meansclustering.html
Weakness of K-means Algorithm
• It is not robust to outliers. Data points very far from a centroid pull it away from
its true position.
• The resulting clusters are circular in shape because the algorithm is based on distance.
• It is sensitive to initial conditions. Different initial conditions may produce different
clusterings, and the algorithm may get trapped in a local optimum.
• When there are few data points, the initial grouping largely determines the
clusters.
http://people.revoledu.com/kardi/tutorial/kMean/Weakness.htm
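The outlier weakness follows directly from the centroid being a mean. A small Python illustration with made-up numbers (not data from the project):

```python
def centroid(points):
    """Centroid of one-dimensional points: their arithmetic mean."""
    return sum(points) / len(points)


# Made-up cluster and a single far-away point.
cluster = [1.0, 2.0, 3.0]
with_outlier = cluster + [100.0]

print(centroid(cluster))       # 2.0
print(centroid(with_outlier))  # 26.5
```

A single distant point moves the centroid from 2.0 to 26.5, far from the three points that make up the actual group.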
Thank you!