BirdCLEF CNN Architecture
Jan Schlüter
1 Introduction
Biodiversity monitoring is important to assess the impact of human actions on
other species, and inform decisions on short-term projects and long-term policies.
However, manually observing, classifying and counting animals is limited in scale
and scope by the supply of experts, the costs of human labor and basic human
needs. Automated or semi-automated approaches for recording and analyzing
observations would allow monitoring more locations, over longer time spans and
in forbidding terrain.
For many animals, acoustic monitoring is an interesting option – it makes it
possible to detect and classify individuals even when they are difficult to see, which is
precisely the reason for some of their vocalizations. Acoustic recording devices
paired with automatic classifiers could both support experts in the field in
identifying species and enable passive acoustic monitoring.
The CLEF (Conference and Labs of the Evaluation Forum) initiative fosters
the development of automatic classifiers for the case of bird vocalizations through
the yearly BirdCLEF challenge [13,8]. In 2018, it provided 36,496 recordings
of 1,500 South American bird species to develop classifiers on, and challenged
participants in two tasks: (1) Recognizing species in 12,347 recordings mostly
captured with monodirectional microphones aimed at individual birds, and (2)
recognizing species at 5-second intervals in 51 long-term recordings taken with
multidirectional recording devices.
In the following, I describe my submission to BirdCLEF 2018, which reached
the second place among six participating teams. Section 2 details the layout and
training of a Convolutional Neural Network (CNN) processing the audio signals,
Section 3 describes a Multilayer Perceptron (MLP) predicting the species from
the metadata of each recording, and Section 4 explains how the predictions of
multiple audio and metadata networks are combined into a final result. Section 5
compares results on an internal validation set and the official test sets. Finally,
Section 6 discusses ideas that did not turn out to be successful, and Section 7 concludes
with a summary and an outlook.
[Figure: Architecture overview – a local prediction stage maps the input spectrogram (time × pitch) to a time series of class logits, a global pooling stage aggregates them over time, and a softmax yields class probabilities.]
The network combines a local prediction stage, which produces a time series of
species predictions over the input spectrogram, with
a global pooling strategy that produces a single prediction per species for the
full recording. This type of architecture can be directly trained and tested on
recordings of arbitrary length and does not require a hand-chosen averaging or
majority voting procedure to produce recording-wise predictions at test time.
In the following, I will describe a common preprocessing frontend as well as
the local prediction and global prediction stages in detail.
Local predictions: The frontend is followed by one of two variants for com-
puting a sequence of local predictions from the input spectrogram.
A) The first variant is based on submission sparrow in [9], but enlarged and
adapted to perform classification rather than detection. It starts with two convo-
lutional layers of 64 3×3 units each, followed by non-overlapping max-pooling of
3×3, and two more convolutional layers of 128 3×3 units each. All convolutions
are unpadded (“valid”). At this point, the temporal resolution has been reduced
to a third, and there are 21 frequency bands remaining. I apply a convolutional
layer of 128 3×17 units, leaving 5 frequency bands, followed by 3×5 max-pooling,
squashing all frequency bands. This combination of convolution and pooling is
meant to fuse information over all frequency bands while introducing some pitch
invariance, and has proven helpful in [20,9]. This is followed by a convolutional
layer of 1024 9×1 units, fusing information over 103 original spectrogram frames
(1.5 s), finally followed by 1024 1×1 and 1500 1×1 units for classification (the last
three layers resemble what would be fully-connected layers in a fixed-input-size
CNN, but here they apply to an arbitrarily long input sequence).
All layers except the last one are followed by batch normalization (the usual
variant with statistics shared over all spatial locations, but separated per chan-
nel, and with learned scale and shift) and leaky rectification as max(x/100, x).
The last three layers are preceded by 50% dropout [11]. The very last layer
neither has batch normalization nor a nonlinearity. While its 1500 features will
lead to predictions for the 1500 bird species, the nonlinearity will be applied
after global pooling, for reasons discussed below.
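For illustration, a minimal sketch of variant A could look as follows (the use of PyTorch, the input layout of (batch, 1, time, 80 mel bands), and all identifiers are assumptions for this sketch rather than the original implementation; 80 mel bands is chosen to be consistent with the 21 frequency bands remaining after the first four layers):

import torch.nn as nn

def conv_bn(in_ch, out_ch, kernel):
    # unpadded ("valid") convolution, batch normalization, leaky rectifier max(x/100, x)
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel),
                         nn.BatchNorm2d(out_ch),
                         nn.LeakyReLU(0.01))

# variant A local prediction stage, input of shape (batch, 1, time, 80 mel bands)
local_net_a = nn.Sequential(
    conv_bn(1, 64, (3, 3)),
    conv_bn(64, 64, (3, 3)),
    nn.MaxPool2d((3, 3)),            # non-overlapping 3x3 max-pooling
    conv_bn(64, 128, (3, 3)),
    conv_bn(128, 128, (3, 3)),
    conv_bn(128, 128, (3, 17)),      # fuses frequency information, 5 bands remain
    nn.MaxPool2d((3, 5)),            # squashes the remaining frequency bands
    nn.Dropout(0.5),
    conv_bn(128, 1024, (9, 1)),      # spans 103 original frames (about 1.5 s)
    nn.Dropout(0.5),
    conv_bn(1024, 1024, (1, 1)),
    nn.Dropout(0.5),
    nn.Conv2d(1024, 1500, (1, 1)),   # per-time-step class logits, no BN or nonlinearity
)

Since the stage is fully convolutional, it can be applied to spectrograms of arbitrary temporal length, which is what allows training and testing on complete recordings.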
B) As a variation on this architecture, I replaced the first four convolutional
layers with two residual blocks with pre-activation following [10, Fig. 1b]. Each
residual block consists of four 3×3 convolutional layers, each of which is preceded
by batch normalization and linear rectification as max(x, 0) (except for the initial
convolution following the preprocessing frontend, where pre-activation would be
redundant). Convolution inputs are padded with a leading and trailing row and
column of zeros so the output size matches the input size. Skip connections
add the input of each pair of two convolutions to its output. If needed, the
skip connection includes a 1 × 1 convolutional layer to adjust the number of
channels. The first residual block has 64 channels throughout and is followed by
3×3 max-pooling, the second residual block has 128 channels and is followed by
batch normalization, linear rectification and an unpadded convolutional layer of
128 units leaving 5 frequency bands (compared to the first CNN variant, this
requires 3×22 units instead of 3×17 units since the convolutions are padded).
3×5 max-pooling and remaining layers are adopted from the first CNN variant.
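A corresponding sketch of the pre-activation residual pairs of variant B, under the same assumptions as the previous sketch:

import torch
import torch.nn as nn

class PreActPair(nn.Module):
    # two padded 3x3 convolutions with pre-activation (BN + ReLU before each
    # convolution) and an additive skip connection over the pair
    def __init__(self, in_ch, out_ch, first=False):
        super().__init__()
        self.first = first  # the very first convolution skips its pre-activation
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        # 1x1 convolution on the skip path if the channel count changes
        self.proj = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        h = x if self.first else torch.relu(self.bn1(x))
        h = self.conv1(h)
        h = self.conv2(torch.relu(self.bn2(h)))
        return self.proj(x) + h

# two residual blocks of two pairs each, replacing the first four layers of variant A;
# afterwards follow BN, ReLU, an unpadded 3x22 convolution of 128 units,
# 3x5 max-pooling, and the remaining layers of the first variant
res_front = nn.Sequential(
    PreActPair(1, 64, first=True),
    PreActPair(64, 64),
    nn.MaxPool2d((3, 3)),
    PreActPair(64, 128),
    PreActPair(128, 128),
)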
The α_t are shared over all species, and α is produced with multi-head attention:
I extend the last convolutional layer of the local prediction stage by K units,
such that it produces 1500 + K time series. I split off the K additional time
series C and apply a softmax over time to each one:
B_{k,t} = \exp(C_{k,t}) / \sum_{i=0}^{T-1} \exp(C_{k,i})    (4)
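As an illustration of Equation (4), the following sketch splits the extended output of the local prediction stage into class logits and the K attention time series and normalizes each of the latter over time (shapes and names are illustrative; how the attention weights α are derived from B is not repeated here):

import torch

def split_and_normalize(local_out, num_classes=1500, K=4):
    # local_out: tensor of shape (num_classes + K, T) produced by the local stage
    logits, C = local_out[:num_classes], local_out[num_classes:]
    B = torch.softmax(C, dim=-1)  # B[k, t] = exp(C[k, t]) / sum_i exp(C[k, i])
    return logits, B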
2.3 Training
To deal with the fact that training recordings are usually a few minutes long,
but only sparsely populated with vocalizations of the target bird (without any
information on where in a recording the target bird is audible), several authors
employed hand-designed methods to find potential bird calls and trained on
these passages only [17,23,14,5]. Here I take a different route: I randomly choose
excerpts long enough to hopefully contain at least one vocalization of every
annotated bird for a recording, and directly train on those, relying on the global
pooling operation to distribute the gradient to the relevant local predictions.
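A minimal sketch of this excerpt sampling (the excerpt length and the handling of recordings shorter than one excerpt are assumptions, not taken from the text):

import numpy as np

def random_excerpt(spect, excerpt_len):
    # spect: spectrogram of shape (frames, bands); returns a random crop of
    # excerpt_len frames, or the full spectrogram if it is already shorter
    if spect.shape[0] <= excerpt_len:
        return spect
    start = np.random.randint(spect.shape[0] - excerpt_len + 1)
    return spect[start:start + excerpt_len]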
Loss and targets: Optimization minimizes the cross-entropy loss between predictions y and targets t:

ce(y, t) = -\sum_{s=0}^{S-1} t_s \log(y_s)    (6)
I employ three different variants for defining the target vector t for a recording.
A) The most obvious choice for a categorical classification task is to set t_s
to 1 for the annotated foreground species for the recording, and to 0 otherwise.
Denoting the foreground species as S_f, we have:

t_s = \begin{cases} 1 & \text{if } s = S_f \\ 0 & \text{otherwise} \end{cases}    (7)
B) As a second variant, I additionally include the annotated background species
of the recording in the target, setting t_s to 2 for the foreground species, to 1
for each background species, and to 0 otherwise.
Obviously, the network can never reach this target: its softmax output sums to
one, while the target does not. However, as categorical cross-entropy (Equation 6)
is linear in the targets, this is equivalent to adding the cross-entropy losses
against each background species and the foreground species, weighting the latter
twice. Furthermore, since ADAM is invariant to the scale of the gradient, it is
unimportant whether we set the foreground target to 2 and the background to 1,
or the foreground to 1 and the background to 0.5, for example.
C) As a third variant, I train a network as a Born-Again Network (BAN),
following [7]. Using a trained model, I compute predictions ŷ for the excerpt
encountered during training, and set the target as follows:
t_s = \begin{cases} \hat{y}_s + 1 & \text{if } s = S_f \\ \hat{y}_s & \text{otherwise} \end{cases}    (9)
Again, note that this is equivalent to adding the loss against the target of variant
A to the loss against ŷ (or averaging the two losses, if training with ADAM).
Even if the model computing ŷ has the same architecture as the new model to
be trained, this provides a helpful regularization, leading to faster convergence
and possibly improved generalization (or an additional model for ensembling).
Of course the same modification is possible for variant B (omitted for brevity).
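The three target variants and the loss can be summarized in a short sketch (helper names are illustrative; the values for variant B follow the equivalence discussed above, i.e., the foreground species weighted twice as much as each background species):

import numpy as np

def make_targets(num_classes, fg, bg=(), y_hat=None, variant="A"):
    # fg: index of the foreground species, bg: indices of the background species,
    # y_hat: predictions of a previously trained model (for variant C)
    t = np.zeros(num_classes)
    if variant == "A":        # Equation (7): one-hot on the foreground species
        t[fg] = 1.0
    elif variant == "B":      # foreground weighted twice, background species included
        t[fg] = 2.0
        t[list(bg)] = 1.0
    elif variant == "C":      # Equation (9): add the teacher predictions to variant A
        t = y_hat.copy()
        t[fg] += 1.0
    return t

def cross_entropy(y, t):
    # Equation (6): cross-entropy between class probabilities y and targets t
    return -np.sum(t * np.log(y))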
2.4 Testing
Since the network can process input of arbitrary size into a vector of 1500 class
probabilities, we can directly apply it to full test recordings for the first subtask
of the challenge (monodirectional recordings), and to 5-second excerpts for the
second subtask (long-term multidirectional recordings at 5-second intervals).
3 Predictions from Metadata
Not all 1500 bird species of the challenge are likely to occur across the whole
South American continent. Similarly, some migrating birds are not likely to occur
in specific seasons, and some species are more likely to be heard during dawn
than at noon. To learn about and use these dependencies, I train additional
networks on the metadata supplied with the training recordings, which is also
available for the test recordings: the date, time, geocoordinates, and elevation.
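The exact feature encoding is not repeated here; as an illustration of the circular encoding referred to in Section 3.3, a common sin/cos mapping for cyclic fields such as date and time could look as follows (an assumption for illustration, not the exact features used):

import numpy as np

def circular_encode(value, period):
    # map a cyclic quantity (e.g., day of year with period 365, minute of day with
    # period 1440) onto the unit circle, so the end of a cycle is close to its start
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])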
3.3 Training
Optimization uses the same settings as for the audio network, except that mon-
itoring for early stopping uses the mean average precision (MAP, see Section 5)
on the validation set rather than the cross-entropy loss, and mini-epochs of 5000
instead of 1000 updates.
Training targets are chosen according to strategy A or B of Section 2.3, with
strategy B turning out to be more useful for ensembling.
To obtain additional variations for ensembling, I varied training in two ways:
A) I added Gaussian noise to the metadata values (before circular encoding, if
applicable), either with small standard deviations (3 days, 3 minutes, 5 meters,
0.3 degrees), medium standard deviations (7 days, 10 minutes, 20 meters, 1
degree) or large standard deviations (14 days, 30 minutes, 100 meters, 3 degrees).
B) I trained networks on a subset of metadata fields, either leaving out one of
the fields or all but one field.
As an additional regularization, I experimented with randomly dropping sin-
gle fields (handling them like missing fields, i.e., replacing with zeros and flipping
the validity indicator variable), but this only worsened results.
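A minimal sketch of the metadata blurring of variant A above (the grouping of fields and applying the degree-valued deviation to both coordinates are assumptions):

import numpy as np

# per-field standard deviations (Section 3.3, variant A)
NOISE = {"small":  dict(days=3, minutes=3, meters=5, degrees=0.3),
         "medium": dict(days=7, minutes=10, meters=20, degrees=1.0),
         "large":  dict(days=14, minutes=30, meters=100, degrees=3.0)}

def blur_metadata(date, time, elevation, lat, lon, level="medium"):
    # add Gaussian noise to the raw metadata values, before any circular encoding
    s = NOISE[level]
    return (date + np.random.randn() * s["days"],
            time + np.random.randn() * s["minutes"],
            elevation + np.random.randn() * s["meters"],
            lat + np.random.randn() * s["degrees"],
            lon + np.random.randn() * s["degrees"])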
4 Ensembling of Predictions
Both to combine audio-based and metadata-based predictions and to benefit
from multiple variations among the audio-based and metadata-based models, it
is crucial to be able to merge predictions of multiple models.
In its simplest form, predictions by multiple models for the same recording
can be merged by averaging them. There are different stages at which this could
happen; empirically, it worked best to average predictions after global temporal
pooling, but before applying the softmax output nonlinearity.
Averaging is sufficient for a few similar models, but not optimal for larger sets of
diverse models, or for combining audio-based and metadata-based predictions:
in these cases, it pays off to weight the predictions of each model differently.
To automate the choice of weights, I employ hyperopt [2], a general-purpose
hyperparameter optimization tool. Given a set of models, I use it to search
a weight in [0, 1) for each model that optimizes the ensemble’s mean average
precision (MAP) on the validation set (specifically, the average of the MAPs for
the foreground and background species; see the next section for details on the
evaluation). When the set of models becomes large, hyperopt can also be used to
choose which models to include in the ensemble at all. To support this, I extend
the search space by a binary choice for each weight, allowing hyperopt to more
easily set it to zero. While this strategy greatly helps in finding a good ensemble,
of course it bears the risk of over-optimizing towards the validation set.
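A minimal sketch of this weight search with hyperopt (preds and evaluate_map are placeholders for the collected validation predictions and the MAP evaluation; the exact point at which predictions are combined and normalized is simplified here):

import numpy as np
from hyperopt import fmin, tpe, hp

def make_objective(preds, evaluate_map):
    def objective(params):
        # effective weight of each model: its weight times its binary on/off switch
        weights = np.array([params["use_%d" % i] * params["w_%d" % i]
                            for i in range(len(preds))])
        if weights.sum() == 0:
            return 1.0  # worse than any achievable negative MAP
        ensemble = sum(w * p for w, p in zip(weights, preds)) / weights.sum()
        return -evaluate_map(ensemble)  # hyperopt minimizes, so negate the MAP
    return objective

def search_ensemble(preds, evaluate_map, max_evals=1000):
    space = {}
    for i in range(len(preds)):
        space["w_%d" % i] = hp.uniform("w_%d" % i, 0, 1)       # weight in [0, 1)
        space["use_%d" % i] = hp.choice("use_%d" % i, [0, 1])  # include the model?
    return fmin(make_objective(preds, evaluate_map), space,
                algo=tpe.suggest, max_evals=max_evals)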
5 Results
Challenge submissions are evaluated in terms of mean average precision (MAP),
both when using only the annotated foreground species as ground truth and
when including the background species as additional correct classes. Given class
probabilities y and a set of annotated species S for a recording, the average
precision is computed as
ap(y, S) = \frac{1}{|S|} \sum_{s \in S} p(y, S, rank(y, s)),    (11)
where rank(y, s) is the number of values in y greater than or equal to y_s,

rank(y, s) = \sum_i [y_i \geq y_s],    (12)

and p(y, S, k) is the precision obtained when retrieving the k top-ranked species:

p(y, S, k) = \frac{1}{k} \sum_{s' \in S} [rank(y, s') \leq k].    (13)
The mean average precision is the mean of the average precisions of all recordings.
Note that when |S| = 1 (e.g., when only evaluating against foreground species),
the average precision is equal to the reciprocal rank of the correct species, as is
easy to see when inserting (13) into (11).
As an additional evaluation measure, I use the classification accuracy with
respect to the foreground species, and the top-5 classification accuracy (which
treats a prediction as correctly classified when the foreground species is among
the 5 largest class probabilities for a recording).
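A short sketch of these measures (assuming p(y, S, k) denotes the precision among the k top-ranked species, as in Equation (13)):

import numpy as np

def average_precision(y, species):
    # Equations (11)-(13): y are the class probabilities for one recording,
    # species is the set of annotated species indices
    ranks = {s: np.sum(y >= y[s]) for s in species}                 # Equation (12)
    def precision_at(k):                                            # Equation (13)
        return sum(1 for s in species if ranks[s] <= k) / k
    return np.mean([precision_at(ranks[s]) for s in species])       # Equation (11)

def mean_average_precision(all_y, all_species):
    # mean over all recordings; with a single annotated species per recording,
    # this reduces to the mean reciprocal rank of the foreground species
    return np.mean([average_precision(y, s) for y, s in zip(all_y, all_species)])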
Table 1 lists validation results for all candidate model variants partaking in
the ensemble search. In addition, it shows the ensembling weights, validation
and official test results for three ensembles formed from the candidate models.
We can draw several interesting insights from these results:
– All else kept equal, the residual network outperforms the plain CNN (variants
A and B in Section 2.2, “Local predictions”). It still pays off to combine them
in an ensemble.
– All else kept equal, log-mean-exp pooling tends to outperform temporal at-
tention (variants A and B in Section 2.2, “Global predictions”). Again, it
still pays off to combine them in an ensemble.
– All else kept equal, including background species in the targets tends to decrease
the MAP against the foreground species (and the classification accuracy), while
increasing the MAP against background species (variants
A and B in Section 2.3, “Loss and targets”).
– All else kept equal, retraining a model as a Born-Again Network (BAN) tends
to improve its performance (variant C in Section 2.3, “Loss and targets”).
– The metadata alone suffices to classify 21% of validation recordings correctly.
– Including all metadata fields gives better performance than omitting some.
– When training on a single metadata field, it seems location is the most
informative cue.
– There is no systematic difference between 12-dimensional and 3-dimensional
date/time encodings (but there is only a single pair of results with all else
kept equal).
– Adding Gaussian noise to metadata diminishes results. This might indicate
that the models learn to exploit unwanted correlations between training and
validation data, such as recordings done on the same day at the same loca-
tion, rather than learning robust occurrence patterns. However, when ensem-
bling with audio models, blurred (noisy) metadata is superior to unmodified
data.
Table 1. Results for several model variants on the validation set (columns “MAP-fg”
to “Top-5”), both for audio (upper part) and metadata networks (lower part). The last
three columns show model selections for three ensembles, with corresponding weights.
The last four rows display ensemble results on the validation set and official test set.
6 Dead Ends
During experimentation, I tried several ideas that did not turn out to be helpful
in the end, and were not discussed above. I will describe a few in the following.
7 Discussion
Acknowledgements
I would like to thank Hervé Glotin, Hervé Goëau, Willem-Pier Vellinga, and
Alexis Joly for organizing this challenge, supported by Xeno-Canto, Floris’Tic,
SABIOD and EADM MaDICS. This research is supported by the Vienna Science
and Technology Fund (WWTF) under grants NXT17-004 and MA14-018. I also
gratefully acknowledge the support of NVIDIA Corporation with the donation
of two Tesla K40 GPUs and a Titan Xp GPU used for this research.
References
1. Amores, J.: Multiple instance classification: Review, taxonomy and comparative
study. Artificial Intelligence 201, 81–105 (2013). https://doi.org/10.1016/j.artint.
2013.06.003, http://refbase.cvc.uab.es/files/Amo2013.pdf
2. Bergstra, J., Yamins, D., Cox, D.: Making a science of model search: Hyperparame-
ter optimization in hundreds of dimensions for vision architectures. In: Proceedings
of the 30th International Conference on Machine Learning (ICML). Proceedings of
Machine Learning Research, vol. 28, pp. 115–123. Atlanta, GA, USA (Jun 2013),
https://github.com/hyperopt/hyperopt
3. Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., Feng, J.: Dual path networks. In:
Advances in Neural Information Processing Systems 30, pp. 4467–4475. Curran
Associates, Inc. (2017), http://papers.nips.cc/paper/7033-dual-path-networks
4. Cireşan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image
classification. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR). pp. 3642–3649. Providence, RI, USA (Jun 2012),
http://www.idsia.ch/~ciresan/data/cvpr2012.pdf
5. Fazekas, B., Schindler, A., Lidy, T., Rauber, A.: A multi-modal deep neural net-
work approach to bird-song identification. In: Working Notes of CLEF. Dublin,
Ireland (Sep 2017), http://ceur-ws.org/Vol-1866/paper_179
6. Foulds, J., Frank, E.: A review of multi-instance learning assumptions.
Knowledge Engineering Review 25(1), 1–25 (Mar 2010). https://doi.org/10.
1017/S026988890999035X, http://www.cs.waikato.ac.nz/~ml/publications/2010/
FouldsAndFrankMIreview.pdf
7. Furlanello, T., Lipton, Z.C., Itti, L., Anandkumar, A.: Born again neural networks.
In: Proceedings of the 35th International Conference on Machine Learning (ICML).
Proceedings of Machine Learning Research, vol. 80. Stockholm, Sweden (Jul 2018),
https://arxiv.org/abs/1805.04770
8. Goëau, H., Glotin, H., Planqué, R., Vellinga, W.P., Kahl, S., Joly, A.: Overview of
BirdCLEF 2018: monophone vs. soundscape bird identification. In: Working Notes
of CLEF. Avignon, France (Sep 2018)
9. Grill, T., Schlüter, J.: Two convolutional neural networks for bird detection in
audio signals. In: Proceedings of the 25th European Signal Processing Conference
(EUSIPCO). Kos Island, Greece (Aug 2017), http://ofai.at/~jan.schlueter/pubs/
2017_eusipco.pdf
10. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual net-
works. In: Proceedings of the 14th European Conference on Computer Vision
(ECCV). pp. 630–645. Springer International Publishing, Amsterdam, Netherlands
(2016). https://doi.org/10.1007/978-3-319-46493-0_38, preprint http://arxiv.org/
abs/1603.05027
11. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.:
Improving neural networks by preventing co-adaptation of feature detectors. arXiv
e-prints abs/1207.0580 (Jul 2012), http://arxiv.org/abs/1207.0580
12. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In: Proceedings of the 32nd International Con-
ference on Machine Learning (ICML). Proceedings of Machine Learning Research,
vol. 37, pp. 448–456. PMLR, Lille, France (Jul 2015), http://proceedings.mlr.press/
v37/ioffe15.html
13. Joly, A., Goëau, H., Botella, C., Glotin, H., Bonnet, P., Planqué, R., Vellinga,
W.P., Müller, H.: Overview of LifeCLEF 2018: a large-scale evaluation of species
identification and recommendation algorithms in the era of AI. In: Proceedings of
CLEF. Avignon, France (Sep 2018)
14. Kahl, S., Wilhelm-Stein, T., Hussein, H., Klinck, H., Kowerko, D., Ritter, M., Eibl,
M.: Large-scale bird sound classification using convolutional neural networks. In:
Working Notes of CLEF. Dublin, Ireland (Sep 2017), http://ceur-ws.org/Vol-1866/
paper_143.pdf
15. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceed-
ings of the 3rd International Conference on Learning Representations (ICLR). San
Diego, CA, USA (May 2015), http://arxiv.org/abs/1412.6980
16. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep
convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L.,
Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems
25, pp. 1097–1105. Curran Associates, Inc. (2012), http://papers.nips.cc/paper/
4824-imagenet-classification-with-deep-convolutional-neural-networks
17. Lasseck, M.: Bird song classification in field recordings: Winning solution for
NIPS4B 2013 competition. In: NIPS Workshop on Neural Information Scaled for
Bioacoustics. Lake Tahoe, NV, USA (2013), http://www.animalsoundarchive.org/
RefSys/Nips4b2013NotesAndSourceCode/WorkingNotes_Mario.pdf
18. Lasseck, M.: Improved automatic bird identification through decision tree based
feature selection and bagging. In: Working Notes of CLEF. Toulouse, France (Sep
2015), http://ceur-ws.org/Vol-1391/160-CR.pdf
19. Pinheiro, P.O., Collobert, R.: From image-level to pixel-level labeling with con-
volutional networks. In: Proceedings of the 28th IEEE Conference on Computer
Vision and Pattern Recognition (CVPR). pp. 1713–1721. Boston, MA, USA (Jun
2015), http://openaccess.thecvf.com/content_cvpr_2015/html/Pinheiro_From_
Image-Level_to_2015_CVPR_paper.html
20. Schlüter, J.: Learning to pinpoint singing voice from weakly labeled examples.
In: Proceedings of the 17th International Society for Music Information Retrieval
Conference (ISMIR). New York City, NY, USA (Aug 2016), http://ofai.at/~jan.
schlueter/pubs/2016_ismir.pdf
21. Schlüter, J., Grill, T.: Exploring data augmentation for improved singing voice
detection with neural networks. In: Proceedings of the 16th International Society
for Music Information Retrieval Conference (ISMIR). Málaga, Spain (Oct 2015),
http://ofai.at/~jan.schlueter/pubs/2015_ismir.pdf
22. Sercu, T., Goel, V.: Dense prediction on sequences with time-dilated convolutions
for speech recognition. In: NIPS Workshop on End-to-end Learning for Speech and
Audio Processing. Barcelona, Spain (Nov 2016), http://arxiv.org/abs/1611.09288
23. Sprengel, E., Jaggi, M., Kilcher, Y., Hofmann, T.: Audio based bird species identifi-
cation using deep learning techniques. In: Working Notes of CLEF. Évora, Portugal
(Sep 2016), http://ceur-ws.org/Vol-1609/16090547.pdf
24. Veit, A., Wilber, M., Belongie, S.: Residual networks are exponential ensembles
of relatively shallow networks. arXiv e-prints 1605.06431v1 (May 2016),
https://arxiv.org/abs/1605.06431v1
25. Wang, Y., Getreuer, P., Hughes, T., Lyon, R.F., Saurous, R.A.: Trainable frontend
for robust and far-field keyword spotting. In: Proceedings of the 42nd IEEE Inter-
national Conference on Acoustics, Speech, and Signal Processing (ICASSP). pp.
5670–5674 (Mar 2017). https://doi.org/10.1109/ICASSP.2017.7953242, preprint
http://arxiv.org/abs/1607.05666
26. Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K.: Aggregated residual trans-
formations for deep neural networks. In: Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017),
http://openaccess.thecvf.com/content_cvpr_2017/html/Xie_Aggregated_
Residual_Transformations_CVPR_2017_paper.html
27. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical
risk minimization. In: Proceedings of the 6th International Conference on Learning
Representations (ICLR). Vancouver, Canada (May 2018), https://arxiv.org/abs/
1710.09412