An Efficient Model for a Vast Number of Bird Species Identification Based on Acoustic Features
Simple Summary
Abstract
1. Introduction
- We proposed a novel method for preprocessing audio samples before they are passed to the training stage.
- We fused the Mel-spectrogram and MFCCs as input features and discussed how the order of the MFCCs affects model performance.
- We introduced the coordinate attention module into bird species identification for the first time.
- We proposed a robust LSTM-based deep neural network for bird call identification and evaluated models with seven performance metrics.
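The fused input features described above (a log-Mel spectrogram concatenated with MFCCs) can be sketched in plain NumPy. This is a minimal illustration of the standard pipeline (framing, Hann window, power spectrum, mel filterbank, log, DCT-II); the function names, window size, hop length, and filter counts are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters evenly spaced on the mel scale.
    pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_and_mfcc(y, sr=22050, n_fft=1024, hop=512, n_mels=64, n_mfcc=20):
    # Frame the waveform, apply a Hann window, and take the power spectrum.
    frames = np.array([y[s:s + n_fft] * np.hanning(n_fft)
                       for s in range(0, len(y) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2           # (T, n_fft//2+1)
    log_mel = np.log(power @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)
    # DCT-II over the mel axis yields the cepstral coefficients (MFCCs).
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    mfcc = log_mel @ dct.T                                     # (T, n_mfcc)
    return log_mel, mfcc
```

In the fused setting, each feature would then be normalized (zero mean, unit variance) and the two matrices concatenated along the feature axis before entering the network.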
2. Materials and Methods
2.1. Data
Data Preprocessing
2.2. Methodology
2.2.1. Mel-Spectrogram and MFCC Generation
2.2.2. Coordinate Attention Module Embedding
2.2.3. Network Architecture
3. Results
3.1. Evaluation Metrics
3.2. Experimental Setup
3.3. Performance of the Proposed Method
3.3.1. Effectiveness of Feature Type
3.3.2. Model Comparison
3.3.3. Effectiveness of Inner Modules
3.3.4. Visualization of Features
3.3.5. ROC Illustration
4. Discussion
4.1. Revelation of the Proposed Model
4.2. Advantages of Our Work
4.3. Limitations and Future Improvements
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Vielliard, J.M. Bird community as an indicator of biodiversity: Results from quantitative surveys in Brazil. An. Acad. Bras. Ciênc. 2000, 72, 323–330. [Google Scholar] [CrossRef]
- Gregory, R. Birds as biodiversity indicators for Europe. Significance 2006, 3, 106–110. [Google Scholar] [CrossRef]
- Green, S.; Marler, P. The analysis of animal communication. In Social Behavior and Communication; Springer: Boston, MA, USA, 1979; pp. 73–158. [Google Scholar]
- Chen, G.; Xia, C.; Zhang, Y. Individual identification of birds with complex songs: The case of green-backed flycatchers ficedula elisae. Behav. Process. 2020, 173, 104063. [Google Scholar] [CrossRef]
- O’Shaughnessy, D. Speech Communications: Human and Machine; Wiley: Hoboken, NJ, USA, 1999. [Google Scholar]
- Umesh, S.; Cohen, L.; Nelson, D. Fitting the mel scale. In Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258), Phoenix, AZ, USA, 15–19 March 1999; pp. 217–220. [Google Scholar]
- Logan, B. Mel frequency cepstral coefficients for music modeling. In Proceedings of the International Symposium on Music Information Retrieval, Plymouth, MA, USA, 23–25 October 2000. [Google Scholar]
- Kingsbury, B.E.; Morgan, N.; Greenberg, S. Robust speech recognition using the modulation spectrogram. Speech Commun. 1998, 25, 117–132. [Google Scholar] [CrossRef]
- Flanagan, J.L. Speech synthesis. In Speech Analysis Synthesis and Perception; Springer: Berlin, Heidelberg, 1972; pp. 204–276. [Google Scholar]
- Nussbaumer, H.J. The fast Fourier transform. In Fast Fourier Transform and Convolution Algorithms; Springer: Berlin, Heidelberg, 1981; pp. 80–111. [Google Scholar]
- Sundararajan, D. The Discrete Fourier Transform: Theory, Algorithms and Applications; World Scientific: Singapore, 2001. [Google Scholar]
- Winograd, S. On computing the discrete Fourier transform. Math. Comput. 1978, 32, 175–199. [Google Scholar] [CrossRef]
- de Oliveira, A.G.; Ventura, T.M.; Ganchev, T.D.; de Figueiredo, J.M.; Jahn, O.; Marques, M.I.; Schuchmann, K.-L. Bird acoustic activity detection based on morphological filtering of the spectrogram. Appl. Acoust. 2015, 98, 34–42. [Google Scholar] [CrossRef]
- Suzuki, Y.; Takeshima, H. Equal-loudness-level contours for pure tones. J. Acoust. Soc. Am. 2004, 116, 918–933. [Google Scholar] [CrossRef]
- Pierre Jr, R.L.S.; Maguire, D.J.; Automotive, C.S. The impact of A-weighting sound pressure level measurements during the evaluation of noise exposure. In Proceedings of the Conference NOISE-CON, Baltimore, MD, USA, 12–14 July 2004; pp. 12–14. [Google Scholar]
- Ahmed, N.; Natarajan, T.; Rao, K.R. Discrete cosine transform. IEEE Trans. Comput. 1974, 100, 90–93. [Google Scholar] [CrossRef]
- Sahidullah, M.; Saha, G. Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition. Speech Commun. 2012, 54, 543–565. [Google Scholar] [CrossRef]
- Glotin, H.; Ricard, J.; Balestriero, R. Fast Chirplet transform to enhance CNN machine listening-validation on animal calls and speech. arXiv 2016, arXiv:1611.08749v2. [Google Scholar]
- Ramirez, A.D.P.; de la Rosa Vargas, J.I.; Valdez, R.R.; Becerra, A. A comparative between mel frequency cepstral coefficients (MFCC) and inverse mel frequency cepstral coefficients (IMFCC) features for an automatic bird species recognition system. In Proceedings of the 2018 IEEE Latin American Conference on Computational Intelligence (LA-CCI), Gudalajara, Mexico, 7–9 November 2018; pp. 1–4. [Google Scholar]
- Fine, S.; Singer, Y.; Tishby, N. The hierarchical hidden Markov model: Analysis and applications. Mach. Learn. 1998, 32, 41–62. [Google Scholar] [CrossRef] [Green Version]
- Shan-shan, X.; Hai-feng, X.; Jiang, L.; Yan, Z.; Dan-jv, L. Research on Bird Songs Recognition Based on MFCC-HMM. In Proceedings of the 2021 International Conference on Computer, Control and Robotics (ICCCR), Shanghai, China, 8–10 January 2021; pp. 262–266. [Google Scholar]
- Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
- Eddy, S.R. What is a hidden Markov model? Nat. Biotechnol. 2004, 22, 1315–1316. [Google Scholar] [CrossRef] [PubMed]
- Xu, M.; Duan, L.-Y.; Cai, J.; Chia, L.-T.; Xu, C.; Tian, Q. HMM-based audio keyword generation. In Proceedings of the Pacific-Rim Conference on Multimedia, Tokyo, Japan, 30 November–3 December 2004; pp. 566–574. [Google Scholar]
- Rabiner, L.; Juang, B.-H. Fundamentals of Speech Recognition; Prentice-Hall, Inc.: Hoboken, NJ, USA, 1993. [Google Scholar]
- Ricard, J.; Glotin, H. Bag of MFCC-based Words for Bird Identification. In Proceedings of the CLEF (Working Notes), Évora, Portugal, 5–8 September 2016; pp. 544–546. [Google Scholar]
- Neal, L.; Briggs, F.; Raich, R.; Fern, X.Z. Time-frequency segmentation of bird song in noisy acoustic environments. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 2012–2015. [Google Scholar]
- Zhao, Z.; Zhang, S.-H.; Xu, Z.-Y.; Bellisario, K.; Dai, N.-H.; Omrani, H.; Pijanowski, B.C. Automated bird acoustic event detection and robust species classification. Ecol. Inform. 2017, 39, 99–108. [Google Scholar] [CrossRef]
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
- Shinde, P.P.; Shah, S. A review of machine learning and deep learning applications. In Proceedings of the 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, 16–18 August 2018; pp. 1–6. [Google Scholar]
- Janiesch, C.; Zschech, P.; Heinrich, K. Machine learning and deep learning. Electron. Mark. 2021, 31, 685–695. [Google Scholar] [CrossRef]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
- Koops, H.V.; Van Balen, J.; Wiering, F.; Cappellato, L.; Ferro, N.; Halvey, M.; Kraaij, W. A deep neural network approach to the LifeCLEF 2014 bird task. CLEF Work. Notes 2014, 1180, 634–642. [Google Scholar]
- Tóth, B.P.; Czeba, B. Convolutional Neural Networks for Large-Scale Bird Song Classification in Noisy Environment. In Proceedings of the CLEF (Working Notes), Évora, Portugal, 5–8 September 2016; pp. 560–568. [Google Scholar]
- Xie, J.; Zhao, S.; Li, X.; Ni, D.; Zhang, J. KD-CLDNN: Lightweight automatic recognition model based on bird vocalization. Appl. Acoust. 2022, 188, 108550. [Google Scholar] [CrossRef]
- Piczak, K.J. Recognizing Bird Species in Audio Recordings using Deep Convolutional Neural Networks. In Proceedings of the CLEF (Working Notes), Évora, Portugal, 5–8 September 2016; pp. 534–543. [Google Scholar]
- Zhang, X.; Chen, A.; Zhou, G.; Zhang, Z.; Huang, X.; Qiang, X. Spectrogram-frame linear network and continuous frame sequence for bird sound classification. Ecol. Inform. 2019, 54, 101009. [Google Scholar] [CrossRef]
- Sprengel, E.; Jaggi, M.; Kilcher, Y.; Hofmann, T. Audio Based Bird Species Identification Using Deep Learning Techniques; Infoscience: Tokyo, Japan, 2016. [Google Scholar]
- Kumar, Y.; Gupta, S.; Singh, W. A novel deep transfer learning models for recognition of birds sounds in different environment. Soft Comput. 2022, 26, 1003–1023. [Google Scholar] [CrossRef]
- Effendy, N.; Ruhyadi, D.; Pratama, R.; Rabba, D.F.; Aulia, A.F.; Atmadja, A.Y. Forest quality assessment based on bird sound recognition using convolutional neural networks. Int. J. Electr. Comput. Eng. 2022, 12, 4235–4242. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Kahl, S.; Wood, C.M.; Eibl, M.; Klinck, H. BirdNET: A deep learning solution for avian diversity monitoring. Ecol. Inform. 2021, 61, 101236. [Google Scholar] [CrossRef]
- Elman, J.L. Finding structure in time. Cogn. Sci. 1990, 14, 179–211. [Google Scholar] [CrossRef]
- Qiao, Y.; Qian, K.; Zhao, Z. Learning higher representations from bioacoustics: A sequence-to-sequence deep learning approach for bird sound classification. In Proceedings of the International Conference on Neural Information Processing, Bangkok, Thailand, 18–22 November 2020; pp. 130–138. [Google Scholar]
- Zhang, F.; Zhang, L.; Chen, H.; Xie, J. Bird Species Identification Using Spectrogram Based on Multi-Channel Fusion of DCNNs. Entropy 2021, 23, 1507. [Google Scholar] [CrossRef]
- Conde, M.V.; Shubham, K.; Agnihotri, P.; Movva, N.D.; Bessenyei, S. Weakly-Supervised Classification and Detection of Bird Sounds in the Wild. arXiv 2021, arXiv:2107.04878. [Google Scholar]
- Kahl, S.; Denton, T.; Klinck, H.; Glotin, H.; Goëau, H.; Vellinga, W.-P.; Planqué, R.; Joly, A. Overview of BirdCLEF 2021: Bird call identification in soundscape recordings. In Proceedings of the CLEF (Working Notes), Évora, Portugal, 5–8 September 2021; pp. 1437–1450. [Google Scholar]
- Cakir, E.; Adavanne, S.; Parascandolo, G.; Drossos, K.; Virtanen, T. Convolutional recurrent neural networks for bird audio detection. In Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017; pp. 1744–1748. [Google Scholar]
- Gupta, G.; Kshirsagar, M.; Zhong, M.; Gholami, S.; Ferres, J.L. Comparing recurrent convolutional neural networks for large scale bird species classification. Sci. Rep. 2021, 11, 17085. [Google Scholar] [CrossRef] [PubMed]
- Xeno-Canto. Sharing Bird Sounds from around the World. 2021. Available online: https://www.xeno-canto.org/about/xeno-canto (accessed on 17 March 2021).
- Johnson, D.H. Signal-to-noise ratio. Scholarpedia 2006, 1, 2088. [Google Scholar] [CrossRef]
- Sainath, T.N.; Kingsbury, B.; Mohamed, A.-R.; Ramabhadran, B. Learning filter banks within a deep neural network framework. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, 8–12 December 2013; pp. 297–302. [Google Scholar]
- Shannon, C.E. Communication in the presence of noise. Proc. IRE 1949, 37, 10–21. [Google Scholar] [CrossRef]
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. 2018, 107, 3–11. [Google Scholar] [CrossRef]
- Dufour, O.; Artieres, T.; Glotin, H.; Giraudet, P. Clusterized mel filter cepstral coefficients and support vector machines for bird song identification. In Soundscape Semiotics—Localization and Categorization; InTech: London, UK, 2013; pp. 89–93. [Google Scholar]
- Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Fisher, R.A. The use of multiple measurements in taxonomic problems. Ann. Eugen. 1936, 7, 179–188. [Google Scholar] [CrossRef]
- Altman, N.S. An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 1992, 46, 175–185. [Google Scholar]
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
- Abdi, H.; Williams, L.J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459. [Google Scholar] [CrossRef]
- Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
- Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 843–852. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Graves, A.; Jaitly, N.; Mohamed, A.-R. Hybrid speech recognition with deep bidirectional LSTM. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, 8–12 December 2013; pp. 273–278. [Google Scholar]
- Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
Environment | Setup |
---|---|
GPU | NVIDIA GeForce RTX 3080 Ti 12 G |
CPU | Intel(R) Core(R) i9−12900 KF 3.90 GHz |
RAM | 64 G |
Operating system | Windows 11 64 bit |
Software environment | Python 3.8.10, Pytorch 1.12, CUDA 11.6 |
Hyper−Parameter | Value |
---|---|
Epoch | 70 |
Batch size | 256 |
Learning rate | 0.0001 |
Learning rate decay | 0.1 per 10 epochs |
Optimizer | AdamW |
Loss function | Cross entropy |
Feature Type | Accuracy | Precision | Recall | F1−Score | AUC | Top−5 | mAP |
---|---|---|---|---|---|---|---|
Waveform | 43.59% | 41.76% | 35.80% | 35.93% | 95.22% | 69.02% | 40.29% |
Mel−spectrogram | 68.60% | 68.98% | 67.23% | 67.12% | 97.50% | 82.19% | 71.23% |
MFCCs (20 order) | 68.84% | 66.63% | 66.92% | 66.63% | 97.58% | 82.47% | 70.81% |
Mel−spectrogram and MFCCs | 70.94% | 70.98% | 69.41% | 69.36% | 97.91% | 83.64% | 72.72% |
Normalized Mel−spectrogram | 72.39% | 72.69% | 70.68% | 70.86% | 97.96% | 84.04% | 74.68% |
Normalized MFCCs | 72.61% | 73.35% | 71.25% | 71.57% | 98.01% | 84.14% | 75.01% |
Normalized Mel and MFCCs | 74.94% | 74.51% | 73.51% | 73.45% | 98.21% | 84.65% | 77.43% |
Method | Accuracy | Precision | Recall | F1−Score | AUC | Top−5 | mAP |
---|---|---|---|---|---|---|---|
CNN (ResNet−101) | 66.47% | 66.79% | 64.32% | 64.42% | 97.53% | 81.84% | 68.67% |
SVM | 22.16% | 42.01% | 16.39% | 19.51% | 89.15% | 49.94% | 25.79% |
RF | 52.84% | 81.58% | 53.14% | 62.02% | 92.28% | 65.57% | 58.54% |
FLDA | 17.01% | 17.39% | 17.69% | 16.58% | 78.43% | 38.24% | 11.52% |
k−NN | 11.12% | 10.66% | 10.64% | 9.75% | 75.33% | 48.32% | 11.07% |
GRU | 71.72% | 71.40% | 69.77% | 70.01% | 97.66% | 83.07% | 72.52% |
The proposed method (Ours) | 74.94% | 74.51% | 73.51% | 73.45% | 98.21% | 84.65% | 77.43% |
Module | Accuracy | Precision | Recall | F1−Score | AUC | Top−5 | mAP |
---|---|---|---|---|---|---|---|
LSTM (256 hidden units) | 67.19% | 67.57% | 64.99% | 65.19% | 97.63% | 82.35% | 69.60% |
LSTM (512 hidden units) | 72.29% | 70.89% | 70.85% | 70.18% | 97.94% | 83.48% | 74.52% |
LSTM (1024 hidden units) | 70.81% | 71.11% | 69.18% | 69.03% | 97.89% | 83.75% | 73.29% |
Bi−LSTM (512 hidden units) | 68.51% | 67.66% | 67.25% | 66.38% | 97.55% | 82.38% | 70.82% |
LSTM with SiLU | 72.75% | 72.00% | 71.19% | 70.82% | 97.94% | 84.04% | 74.80% |
LSTM with CA and SiLU | 74.94% | 74.51% | 73.51% | 73.45% | 98.21% | 84.65% | 77.43% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, H.; Xu, Y.; Yu, Y.; Lin, Y.; Ran, J. An Efficient Model for a Vast Number of Bird Species Identification Based on Acoustic Features. Animals 2022, 12, 2434. https://doi.org/10.3390/ani12182434
Wang H, Xu Y, Yu Y, Lin Y, Ran J. An Efficient Model for a Vast Number of Bird Species Identification Based on Acoustic Features. Animals. 2022; 12(18):2434. https://doi.org/10.3390/ani12182434
Chicago/Turabian StyleWang, Hanlin, Yingfan Xu, Yan Yu, Yucheng Lin, and Jianghong Ran. 2022. "An Efficient Model for a Vast Number of Bird Species Identification Based on Acoustic Features" Animals 12, no. 18: 2434. https://doi.org/10.3390/ani12182434