Urban Sound Classification
Figure 2 shows the plot of raw data for a sample from each class. Figure 3 shows the log spectrogram filterbanks.
The challenge we faced in extracting the filterbank features was the different-sized raw data: our samples include sound excerpts of differing lengths, resolutions, and numbers of channels, so the shape of the filterbank features also varies across samples. Because neither Sklearn nor Tensorflow allows a varied data shape, we needed to make the size of the extracted features identical across samples. Our first solution was zero padding, making the shorter signals as long as the longest ones. Our second solution was to cut each sample into the same number of windows (frames), so each sample has the same number of windows regardless of the length of the original sound; shorter samples get a frame partition in which each frame overlaps the others more.
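To make this concrete, here is a minimal sketch of both approaches, assuming each clip has already been loaded as a one-dimensional NumPy array of samples; the frame count and frame length below are illustrative, not the exact values from our pipeline.

    import numpy as np

    def zero_pad(signals):
        # Solution 1: pad every clip with trailing zeros so all clips match the longest one.
        max_len = max(len(s) for s in signals)
        return np.stack([np.pad(s, (0, max_len - len(s))) for s in signals])

    def fixed_frames(signal, n_frames=41, frame_len=1024):
        # Solution 2: cut a clip into the same number of windows regardless of its length.
        # Shorter clips get a smaller hop, so their frames overlap one another more.
        signal = np.pad(signal, (0, max(0, frame_len - len(signal))))
        hop = max(1, (len(signal) - frame_len) // max(1, n_frames - 1))
        starts = [min(i * hop, len(signal) - frame_len) for i in range(n_frames)]
        return np.stack([signal[s:s + frame_len] for s in starts])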
4. MODELS

4.1 Comparison of Neural Networks across Data Sets
We tested three different neural network architectures: Recurrent Neural Networks (RNN), Deep Neural Networks (DeepNN), and Convolutional Neural Networks (CNN). We kept each model simple initially, using at most 3 hidden layers for our DeepNN and 2 for our RNN and CNN. As we see in the comparison plot (Figure 4), our const_MFCC and features193 datasets gave us the best accuracy across all models, with features193 beating const_MFCC by an average of about 5 percentage points.

Figure 4. The features193 and const_MFCC sets gave us the best performance. Among those sets our CNN model gave the best accuracy, with 58% on const_MFCC and 64% on features193 (we later optimized it to 72.8% accuracy on features193).

The only model we were unable to get working for a dataset was the CNN on the flat0pad training set, which has 27,600 features per sample. Our laptop (i5 CPU, 8 GB RAM) persistently froze or gave memory errors when attempting to run that particular model. A large part of why flat0pad did not do well is simply that it took too long to run models on it. Even our DeepNN could take over two hours to train for 5,000 epochs, and the RNN could take over seven hours, which meant we did not have as much time to tune those models as we did the others. To illustrate the point: the fastest training run we got was 30 minutes, with our DeepNN and CNN on the features193 training set; the longest was over 25 hours, for our models on the flat0pad training set. That difference means we had roughly 50 more chances to turn over and tune our models on the features193 training set than on the flat0pad set. The gap is understandable considering that the flat0pad training set is over 140 times larger than the features193 set; however, its size did not bring enough new information about the samples to compensate for the slowness of training.

We did not see as much advantage in maintaining the time-series nature of the data as expected. In the comparison between const_MFCC (a filterbank method) and features193 (a signal-type isolation method), the features193 training set gave us moderately better performance. One factor is model turn-over rate: computing on 193 features versus 820 total features. Another factor may be the shape of the const_MFCC training set: the filterbank split each audio file into smaller segments, meaning that any model learning on that time series was learning the smaller patterns contained in the segments rather than the overall pattern of the audio sequence. In contrast, for the flat0pad data (also a filterbank method) we flattened the data into a single time series, which let our RNN learn the overall pattern instead of the smaller segments. As we can see in Figure 4, the RNN performed better relative to the DeepNN on the flattened dataset (flat0pad) than on the split dataset (const_MFCC).
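As a rough illustration of the difference between the two shapes (the array sizes here are only examples, not the real segment counts):

    import numpy as np

    # One clip's filterbank output, as (segments, frames, coefficients); the shape is illustrative.
    segments = np.random.rand(5, 41, 60)

    # const_MFCC-style: each small segment becomes its own training example, so a sequence
    # model only ever sees the pattern inside a single segment.
    const_mfcc_examples = [seg for seg in segments]

    # flat0pad-style: the whole clip is flattened into one long row (then zero padded to a
    # common length), so the model steps through the clip's overall pattern.
    flat0pad_example = segments.reshape(-1)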
The feature engineering we attempted with the const_LogMFCC dataset failed because the const_MFCC data, from which the const_LogMFCC dataset is produced, contains negative numbers, meaning we were losing half the data when converting the dataset. We tried to correct that by taking the log of the absolute values, yet we were still losing half the scale, which is the result seen in the figure above. Unfortunately, even correctly scaling the data by shifting it all to positive values before taking the log, log(data − min + 1), did not give results that were any better, because the function condenses the upper bounds toward the middle. While our manual attempts at feature engineering were failures, we did have success with Sci-Kit Learn's standard scaler, which added about another 3-4 percentage points of accuracy for our CNN and DeepNN (though we only had time to test this result on the features193 dataset).
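A sketch of the transforms discussed above, assuming X is a (samples x features) matrix; the variable names and the small epsilon are illustrative.

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.random.randn(100, 193)  # stand-in for a feature matrix that contains negative values

    # Log of the absolute values: every entry stays finite, but the sign (half the scale) is lost.
    X_log_abs = np.log(np.abs(X) + 1e-6)

    # Shift-then-log, log(data - min + 1): everything becomes positive first, but the
    # function condenses the upper end of the range toward the middle.
    X_log_shift = np.log(X - X.min() + 1)

    # What actually helped: Sci-Kit Learn's standard scaler (zero mean, unit variance per feature).
    X_scaled = StandardScaler().fit_transform(X)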
With only 8732 samples in total, overfitting was a major issue with this dataset. In our RNN and DeepNN models especially, we had to control it with high rates of dropout (as much as 80% on each layer) and L2 regularization (a lambda of 0.01 on each layer). Even with those controls we were unable to completely prevent the RNN from overfitting, as shown in the plot below. The complexity of the RNN made it the slowest and hardest to tune. Indeed, the RNN "cell" in Tensorflow acted much like a black box, making it challenging to add regularization inside. The RNN also needed as much as 80% dropout, and actually required a higher level of regularization than the DeepNN, using a lambda of 0.04.

Figure 5. All models reach their peak very early, before 5,000 epochs. And while the DeepNN and CNN plateau, the RNN begins to overfit despite high dropout and regularization.
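For reference, a minimal sketch of that dropout and L2 setup, written against the current tf.keras API rather than the 2016-era TensorFlow graph code we actually used; the layer widths are illustrative, while the 80% dropout and lambda of 0.01 match the settings described above.

    import tensorflow as tf

    l2 = tf.keras.regularizers.l2(0.01)  # L2 regularization with a lambda of 0.01 on each layer

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu", kernel_regularizer=l2, input_shape=(193,)),
        tf.keras.layers.Dropout(0.8),    # dropout as high as 80% per layer
        tf.keras.layers.Dense(256, activation="relu", kernel_regularizer=l2),
        tf.keras.layers.Dropout(0.8),
        tf.keras.layers.Dense(10, activation="softmax"),  # the 10 UrbanSound8K classes
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])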
We had the best results with our CNN, as can be seen in Figures 4 and 5 above. While the DeepNN was comparable, the CNN consistently beat it by a few percentage points across all datasets. Part of this success stems from the CNN sharing its weights across multiple features, so it was not as susceptible to overfitting as the other models we tried. Indeed, while both the RNN and DeepNN needed as much as 80% dropout, our CNN worked just fine with 50%. We did also add regularization to the CNN, but it did not give us as much improvement as we found with the other models.
Remarkably, even though the features in our features193 dataset are not related to one another the way image pixels (which CNNs are known to be good at classifying) are, treating the data like an image worked to an extent. We were able to use the patch size in the convolution to control how much detail from the features we wanted, and we found that a patch size combining 10 features into a single set of weights and biases provided the best results.
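A sketch of what treating the 193 engineered features "like an image" looks like, again in tf.keras for brevity; the filter count and pooling are illustrative, while the patch of 10 features matches the window size described above.

    import tensorflow as tf

    cnn = tf.keras.Sequential([
        # Treat the 193 engineered features as a one-dimensional, single-channel "image".
        tf.keras.layers.Reshape((193, 1), input_shape=(193,)),
        # Each patch of 10 neighbouring features shares one set of weights and biases.
        tf.keras.layers.Conv1D(filters=32, kernel_size=10, activation="relu"),
        tf.keras.layers.MaxPooling1D(pool_size=2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),    # 50% dropout was enough for the CNN
        tf.keras.layers.Dense(10, activation="softmax"),
    ])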
This analogy of our features being connected like an image only goes so far. In an attempt to artificially increase our data size, we tried some of Tensorflow's image distortion functions on our features193 training set, to limited effect. Unsurprisingly, these functions lengthened the time it takes to train to an accuracy similar to that reached without the distortions. However, the accuracy curves did not seem to indicate that training longer would improve accuracy.
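The distortions referred to here are TensorFlow's stock tf.image augmentation ops. Below is a hedged sketch of how they can be applied to a feature vector reshaped as a tiny image; the particular ops and parameter values are illustrative, not a record of the exact distortions we ran.

    import tensorflow as tf

    def distort(features):
        # Reshape the 193-value feature vector into a 1 x 193 single-channel "image"
        # so the stock tf.image ops will accept it.
        img = tf.reshape(tf.cast(features, tf.float32), (1, 193, 1))
        img = tf.image.random_brightness(img, max_delta=0.1)
        img = tf.image.random_contrast(img, lower=0.9, upper=1.1)
        return tf.reshape(img, (193,))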
5. FINAL RESULTS

5.1 Training Curve Results
Although image distortions did not help much with "increasing" our amount of data, an actual increase in data samples would. We plotted a training curve on the features193 dataset in increments of 1,000 samples, recording the training and test accuracies for each increment after 15,000 epochs with our DeepNN model. While not definitive, the plot does trend toward a collision of the training and test accuracies, a trend that appears to continue beyond the sample sizes we were able to capture.

Figure 7. Training set accuracy is the top line, test set accuracy the bottom line. We see a trend toward convergence between the two lines off to the right, indicating that more data samples would increase accuracy.
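A sketch of how such a curve can be generated; the classifier below is a small scikit-learn stand-in so the snippet stays self-contained, whereas the real curve used our DeepNN and 15,000 epochs per point.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Stand-ins for the features193 matrix and its labels.
    X, y = np.random.rand(8732, 193), np.random.randint(0, 10, 8732)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    curve = []
    for n in range(1000, len(X_train) + 1, 1000):
        clf = LogisticRegression(max_iter=200).fit(X_train[:n], y_train[:n])
        curve.append((n, clf.score(X_train[:n], y_train[:n]), clf.score(X_test, y_test)))
    # Plotting the (n, train accuracy, test accuracy) triples gives a curve like Figure 7.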
6. ACKNOWLEDGMENTS
Our thanks to Justin Salamon, Christopher Jacoby, and Juan Pablo Bello for creating the UrbanSound8K dataset [1].

7. REFERENCES
[1] Justin Salamon, Christopher Jacoby, and Juan Pablo Bello. 2014. A Dataset and Taxonomy for Urban Sound Research. In Proceedings of the 22nd ACM International Conference on Multimedia (MM '14). ACM, New York, NY, USA, 1041-1044. DOI=http://doi.acm.org/10.1145/2647868.2655045
[2] James Lyons. 2013. Mel Frequency Cepstral Coefficient (MFCC) tutorial. In Practical Cryptography. Online. URL=http://www.practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
[3] Haytham Fayek. 2016. Speech Processing for Machine Learning: Filter Banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What's In-Between. In http://haythamfayek.com/blog/. Online. URL=http://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
[4] Aaqib Saeed. 2016. Urban Sound Classification, Part I and Part II. URL=https://aqibsaeed.github.io/2016-09-03-urban-sound-classification-part-1/