Scientific Programming
Volume 2022, Article ID 7994191, 9 pages
https://doi.org/10.1155/2022/7994191
Research Article
Audio Segmentation Techniques and Applications Based on
Deep Learning
Correspondence should be addressed to Shruti Aggarwal; [email protected] and Geleta Negasa Binegde;
[email protected]
Received 22 May 2022; Revised 12 July 2022; Accepted 18 July 2022; Published 19 August 2022
Copyright © 2022 Shruti Aggarwal et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Audio processing has become an inseparable part of modern applications in domains ranging from health care to speech-controlled devices, and deep learning plays a vital role in automated audio segmentation. In this article, we discuss audio segmentation based on deep learning. Audio segmentation divides a digital audio signal into a sequence of segments or frames and then classifies these into classes such as speech, music, or noise; it therefore plays an important role in audio signal processing. The most important requirement when training a deep learning network is to secure a large amount of high-quality data. In this study, application areas, citation records, year-wise publication counts, and source-wise analyses are computed using the Scopus and Web of Science (WoS) databases. The analysis presented in this paper supports and establishes the significance of deep learning techniques in audio segmentation.
[Figure: Audio segmentation pipeline. The audio signal undergoes signal-level preprocessing (filtering, noise cancellation, signal enhancement), data conversion (audio-to-tensor formation, FFT, code division), and segment optimization with RNN/BiLSTM networks, followed by firm boundary detection that yields streams of equal-length segments/frames.]
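As a concrete reading of this pipeline, the sketch below frames a signal, converts the frames to FFT-magnitude tensors, scores each frame with a BiLSTM, and marks a boundary wherever the predicted class changes. It is a minimal illustration only, assuming PyTorch and NumPy; the frame length, class set, and network sizes are arbitrary choices, not the configuration of any surveyed system.

```python
# A minimal sketch of the pipeline in the figure (PyTorch/NumPy assumed;
# shapes and the three-class labelling are illustrative, not the authors').
import numpy as np
import torch
import torch.nn as nn

FRAME = 512  # samples per frame (assumed)

def frames_to_tensor(signal: np.ndarray) -> torch.Tensor:
    """Data conversion: split the signal into frames and take FFT magnitudes."""
    n = len(signal) // FRAME
    frames = signal[: n * FRAME].reshape(n, FRAME)
    spec = np.abs(np.fft.rfft(frames, axis=1))       # (n_frames, FRAME//2 + 1)
    return torch.tensor(spec, dtype=torch.float32)

class SegmenterBiLSTM(nn.Module):
    """Segment optimization: a BiLSTM scoring each frame (e.g., speech/music/noise)."""
    def __init__(self, n_feats=FRAME // 2 + 1, n_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, 64, bidirectional=True, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                            # x: (batch, n_frames, n_feats)
        h, _ = self.lstm(x)
        return self.head(h)                          # per-frame class logits

def boundaries(logits: torch.Tensor):
    """Firm boundary detection: a boundary wherever the predicted class changes."""
    labels = logits.argmax(dim=-1).squeeze(0)
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]

signal = np.random.randn(16000)                      # stand-in for a denoised signal
logits = SegmenterBiLSTM()(frames_to_tensor(signal).unsqueeze(0))
print(boundaries(logits))
```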
Features are calculated on an audio segment, a frame, or a set of samples that is a subset of the audio segment.

In recent years, audio segmentation and deep learning have received widespread research attention. Researchers in several countries have successfully applied audio segmentation techniques with different deep learning algorithms in fields such as speech recognition, music analysis, and noise removal [7]. For this study, a literature review was carried out by analyzing articles and conference papers published from 2005 to 2021 using the VOSviewer software. One hundred seventy documents were downloaded in .CSV file format from the Scopus database using the two keywords "Audio Segmentation" and "Deep Learning".
1.1. Audio Data Analysis. A sound is represented as an audio signal characterized by parameters such as frequency, bandwidth, and decibel level. A typical audio signal can be represented as a function of amplitude and time [8]. Several digital devices help in recording audio and then representing these sounds in a computer-readable way. Some instances of these formats are:
(i) mp3 (MPEG-1 Audio Layer 3) format
(ii) wav (Waveform Audio File) format
(iii) WMA (Windows Media Audio) format
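Whichever container format is used, the decoded signal is the amplitude-versus-time representation described above. A minimal sketch, assuming the librosa library (one common choice, not one prescribed by the text) and a hypothetical input file:

```python
# A minimal sketch (librosa assumed) of reading an audio file into the
# amplitude-versus-time representation described above.
import librosa
import numpy as np

signal, sr = librosa.load("recording.wav", sr=16000, mono=True)  # hypothetical file
t = np.arange(len(signal)) / sr  # time axis in seconds
print(f"duration: {t[-1]:.2f} s, amplitude range: {signal.min():.3f}..{signal.max():.3f}")
```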
A typical audio data processing procedure involves the extraction of acoustic features relevant to the task at hand, followed by decision-making techniques, including detection and classification. As a result, audio data analysis is used to analyze and comprehend audio signals captured by digital equipment, with various applications in healthcare, production, and enterprise [9]. Among these applications are customer intelligence analysis from user service calls, social-media content analysis, medical aids, patient-care systems, and public safety.
1.2. Related Work. In the task of audio segmentation, several authors have devised segmentation approaches based on neural-network classification systems. One example of a feed-forward approach is a multilayer perceptron trained with genetic algorithms to achieve multiclass audio segmentation. Large amounts of data are needed to train deep neural networks for reliable predictions [10]. Some studies have used data augmentation approaches to expand the quantity of data and overcome this problem. To tackle the data shortage problem, Raza used two approaches to enlarge the dataset. The authors suggested that noise injection effectively compensates for the data shortage: the audio data are augmented to prevent overfitting by deliberately adding random noise to the audio signal and by applying transformations that slightly deform the pitch and tempo [11]. When data augmentation is performed, the quality of the source data has a vital influence; high-quality data means a clear signal without extraneous noise. However, noise is unavoidable during recording, and every sound recording has a different length [12]. For effective analysis, noise removal is therefore very important, and normalizing and generalizing the raw dataset are also required.
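The noise-injection and pitch/tempo augmentation described above [11] can be sketched as below. The librosa effects functions are assumed stand-ins for whatever transforms the cited work used, and the noise level and shift amounts are illustrative values only:

```python
# A sketch of the augmentations described above: random noise injection plus
# slight pitch and tempo deformation (librosa assumed; parameters illustrative).
import numpy as np
import librosa

def augment(signal: np.ndarray, sr: int, noise_level: float = 0.005) -> np.ndarray:
    noisy = signal + noise_level * np.random.randn(len(signal))       # noise injection
    shifted = librosa.effects.pitch_shift(noisy, sr=sr, n_steps=0.5)  # slight pitch change
    return librosa.effects.time_stretch(shifted, rate=1.05)           # slight tempo change

# e.g., augmented = augment(signal, sr=16000)
```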
One study showed an improvement in performance by conducting denoising in the preprocessing step [13], and its authors also noted the importance and effect of data generalization [14]. To classify the data into classes, it is important to extract appropriate features for each label. There are several methods for extracting such features, for example, MFCCs, spectrograms, or a deep learning network.
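Of the feature-extraction methods just listed, MFCCs and spectrograms are the most common. A minimal sketch, again assuming librosa and a hypothetical input file:

```python
# A minimal sketch of MFCC and spectrogram feature extraction (librosa assumed).
import librosa
import numpy as np

signal, sr = librosa.load("recording.wav", sr=16000)     # hypothetical file
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)  # (13, n_frames)
spec = np.abs(librosa.stft(signal, n_fft=512))           # magnitude spectrogram
print(mfcc.shape, spec.shape)
```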
2. Research Trends in Web of Science Database for Audio Segmentation Based on Deep Learning

In Figure 2, we present the source-wise analysis of audio segmentation and deep learning research trends using the Web of Science database [15]. The analysis was conducted on data collected from the Web of Science database using the two keywords "Audio Segmentation" and "Deep Learning" over 1999–2021. Seventy-five publications selected from the Web of Science Core Collection are shown in Figure 2, which represents the sources or fields where audio segmentation is used with deep learning. Engineering Electrical Electronic has 36 publication records, the maximum in the Web of Science database. The second highest count is 22 documents in Computer Science Artificial Intelligence; Acoustics has 16 documents, and Computer Science Information Systems has 14 [16].

[Figure 2: Source-wise analysis of Web of Science records: Engineering Electrical Electronic (36), Computer Science Artificial Intelligence (22), Acoustics (16), Computer Science Information Systems (14), Computer Science Theory Methods (13), Medical Informatics (9), Computer Science Interdisciplinary Applications (7), Imaging Science Photographic Technology (5), Robotics (4), Computer Science Cybernetics (4).]

In Figure 3, we present the research work related to audio segmentation, which started in earnest in 2005. For about a decade, growth in this type of research was very slow, but post-2016 there was a sharp rise [17]. An exponential increase in audio segmentation-related research can be seen since 2017.

[Figure 3: Trend analysis of audio segmentation (publications and citations per year, 2005–2021).]
This reflects that audio segmentation and deep learning are progressively becoming an attractive research area, with citation numbers growing steadily over the previous five years [18]. The largest number of publications over the last fifteen years (2005 to 2021) in this area of research was reached in 2019. The current year, 2021, has witnessed considerable confidence among researchers regarding applications in this domain, which is why most publications related to this field have appeared recently [19]. As per the trend analysis, there is very high potential for research in this domain, as shown by the cumulatively rising pattern of research on audio segmentation methods [20].

Related keywords form a co-occurrence network, which can be created using the VOSviewer software. In this network visualization, each cluster has a different colour [21]. The analysis considers keywords that appeared in at least three of the collected documents; from 1310 keywords, only 119 met this threshold and are represented in the co-occurrence network visualization, composing the critical areas of audio segmentation and deep learning, as shown in Figure 4.

The colours red, blue, and green in Figure 4 represent co-occurrence within the related keyword clusters, while shades such as purple and orange show co-occurrence that combines two or more domains.
[Figure 4: VOSviewer keyword co-occurrence network visualization, with an overlay timescale coloured by publication year from 2006 to 2022.]
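The co-occurrence analysis itself can be reproduced outside VOSviewer. The sketch below counts keyword pairs across documents from a Scopus CSV export, applying the minimum-occurrence threshold of three used in this study; the file name and the "Author Keywords" column name follow Scopus's usual export format and are assumptions here:

```python
# A sketch of the keyword co-occurrence analysis described above, computed
# from a Scopus CSV export (pandas assumed; the file name and the
# "Author Keywords" column name are assumptions).
from collections import Counter
from itertools import combinations
import pandas as pd

df = pd.read_csv("scopus_export.csv")                 # hypothetical export file
docs = [set(k.strip().lower() for k in kw.split(";"))
        for kw in df["Author Keywords"].dropna()]

occur = Counter(k for doc in docs for k in doc)
kept = {k for k, n in occur.items() if n >= 3}        # occurrence threshold, as in the text
pairs = Counter(p for doc in docs
                for p in combinations(sorted(doc & kept), 2))
print(pairs.most_common(10))                          # strongest co-occurring keyword pairs
```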
Table 1: Publications and citations (country-wise) and global occurrence analysis.

S. no.  Country         Documents  Citations
1       United States   30         220
2       China           28         247
3       India           11         26
4       Canada          7          58
5       United Kingdom  7          19
6       France          6          74
7       Germany         6          12
8       Japan           6          87
9       South Korea     5          24
10      Spain           4          6
11      Switzerland     4          11
12      Australia       3          15
13      Taiwan          3          29
14      Austria         2          42
15      Bangladesh      2          12
16      Brazil          2          0
17      Greece          2          4
18      Iran            2          15
19      Italy           2          35
20      Netherlands     2          7

Table 2: Author-wise citations for audio segmentation research trends along with year and references.

S. no.  Document                Citations  Reference
1       Zhang S. (2018)         94         [17]
2       Huang H. (2020)         63         [32]
3       Wang Z. (2018)          43         [19]
4       Messner E. (2018)       42         [20]
5       Leglaive S. (2018)      34         [18]
6       Baraldi L. (2017)       33         [13]
7       Gwardys G. (2014)       25         [11]
8       Akbari M. (2019)        24         [24]
9       Lim M. (2018)           21         [21]
10      Leglaive S. (2019)      20         [25]
11      Lu W.-T. (2018)         20         [22]
12      Deng J. (2016)          16         [12]
13      Wu Y. (2019)            15         [26]
14      Rahmani M.H. (2017)     15         [14]
15      Min X. (2020)           14         [33]
16      Laporte C. (2018)       14         [23]
17      Valliappan C.A. (2019)  13         [28]
18      Jati A. (2017)          13         [15]
19      Guo J. (2019)           12         [27]
20      Hossain S. (2019)       12         [29]
21      Baby A. (2017)          11         [16]
22      Leglaive S. (2020)      10         [34]
23      Hesamian M.H. (2019)    10         [30]
24      Li H. (2019)            10         [31]

Countries were included by applying a filter of a minimum of three documents [30]. The United States, China, and India are the top three countries where research on audio segmentation is highest, and their document and citation counts are correspondingly high; this clearly shows how the related research is interconnected [31]. In this study, India has 11 documents and 26 citations, indicating that Indian authors are actively involved in research in the audio segmentation field [32]. Most researchers are thus from the United States, China, and India, and considerable research potential lies in countries like Canada and the United Kingdom [33].

2.4. Prominent Researchers for Audio Segmentation. The publications retrieved from the Scopus database using the two
search keywords "Audio Segmentation" and "Deep Learning" have been cited several times, as described in Table 2. By applying a filter of a minimum of 10 citations per document, we obtained 24 publications [34]; Table 2 lists the citations for these 24 publications identified using the VOSviewer software package.

Figure 7 shows the author-wise analysis of audio segmentation research content based on the Scopus database. The highest author citation count in this research area is ninety-four, so from Figure 7 we can say that Zhang S. (2018) is cited most. Huang H. (2020) is second by number of citations, with sixty-three.

3. Application Areas of Audio Segmentation

Audio segmentation is often utilized in various applications, such as Automatic Speech Recognition [35], Automatic Language Identification, and Automatic Emotion Recognition systems [36]. The audio signal is segmented into a sequence of frames, which are classified into several classes such as music [37], speech [38], and noise [39]. In this approach, the noise is filtered out of the sound signal, because audio recordings vary significantly in characteristics such as signal-to-noise ratio [40], audio encoding [41], bandwidth [42], language [43], speaking style [44], gender [45], and sound pitch [46], which are the main challenges.
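The frame-and-classify scheme just described can be illustrated as below: fixed-length frames are labelled, contiguous frames with the same label are merged into segments, and noise segments are filtered out. The classifier here is a deliberately simple energy-threshold stub standing in for a trained deep model, and the frame length is an arbitrary choice:

```python
# A sketch of the frame-and-classify scheme described above. Fixed-length
# frames are labelled, contiguous same-class frames are merged into segments,
# and noise segments are filtered out. The classifier is a stand-in stub.
import numpy as np

FRAME = 8000  # 0.5 s at 16 kHz (illustrative choice)

def classify(frame: np.ndarray) -> str:
    """Stand-in for a trained model; here, low-energy frames count as noise."""
    return "noise" if np.mean(frame ** 2) < 1e-4 else "speech"

def segment(signal: np.ndarray):
    labels = [classify(signal[i:i + FRAME])
              for i in range(0, len(signal) - FRAME + 1, FRAME)]
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        # close a segment at the end or wherever the label changes
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((start * FRAME, i * FRAME, labels[start]))
            start = i
    return [s for s in segments if s[2] != "noise"]   # filter noise segments out

print(segment(np.random.randn(16000 * 4) * 0.1))
```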
Segmentation provides the most effective method for splitting multimedia data by extracting diverse aspects of the multimedia data [47]. This segmentation yields useful information such as speaker signal and identity division, as well as automatic indexing and data retrieval of all instances of a certain speaker [48]. We can adapt automatic online speech recognition acoustic models to improve overall system performance by collecting all segments produced by the same speaker [49]. Typically, a certain set of
[16] A. Jati and P. G. Georgiou, "Speaker2Vec: unsupervised learning and adaptation of a speaker manifold using deep neural networks with an evaluation on speaker segmentation," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 3567–3571, Stockholm, Sweden, August 2017.
[17] A. Baby, J. J. Prakash, S. R. Vignesh, and H. A. Murthy, "Deep learning techniques in tandem with signal processing cues for phonetic segmentation for text to speech synthesis in Indian languages," in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 3817–3821, August 2017.
[18] E. Messner, M. Zöhrer, and F. Pernkopf, "Heart sound segmentation--an event detection approach using deep recurrent neural networks," IEEE Transactions on Biomedical Engineering, vol. 65, no. 9, pp. 1964–1974, 2018.
[19] S. Zhang, S. Zhang, T. Huang, W. Gao, and Q. Tian, "Learning affective features with a hybrid deep model for audio-visual emotion recognition," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 3030–3043, 2018.
[20] Z. Wang and S. Ji, "Smoothed dilated convolutions for improved dense prediction," in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1–27, 2018.
[21] S. Leglaive, L. Girin, and R. Horaud, "A variance modeling framework based on variational autoencoders for speech enhancement," in Proceedings of the 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), IEEE, Aalborg, Denmark, September 2018.
[22] M. Lim, D. Lee, H. Park et al., "Convolutional neural network based audio event classification," KSII Transactions on Internet and Information Systems (TIIS), vol. 12, no. 6, pp. 2748–2760, 2018.
[23] W. T. Lu and L. Su, "Vocal melody extraction with semantic segmentation and audio-symbolic domain transfer learning," in Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pp. 521–528, September 2018.
[24] C. Laporte and L. Ménard, "Multi-hypothesis tracking of the tongue surface in ultrasound video recordings of normal and impaired speech," Medical Image Analysis, vol. 44, pp. 98–114, 2018.
[25] M. Akbari, J. Liang, and J. Han, "DSSLIC: deep semantic segmentation-based layered image compression," in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2042–2046, IEEE, Brighton, UK, May 2019.
[26] S. Leglaive, U. Şimşekli, A. Liutkus, L. Girin, and R. Horaud, "Speech enhancement with variational autoencoders and alpha-stable distributions," in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 541–545, IEEE, Brighton, UK, May 2019.
[27] Y. Wu and W. Li, "Automatic audio chord recognition with MIDI-trained deep feature and BLSTM-CRF sequence decoding model," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 2, pp. 355–366, 2019.
[28] J. Guo, B. Song, P. Zhang, M. Ma, W. Luo, and J. Lv, "Affective video content analysis based on multimodal data fusion in heterogeneous networks," Information Fusion, vol. 51, pp. 224–232, 2019.
[29] C. A. Valliappan, A. Kumar, R. Mannem, G. R. Karthik, and P. K. Ghosh, "An improved air tissue boundary segmentation technique for real-time magnetic resonance imaging video using SegNet," in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Brighton, UK, May 2019.
[30] S. Hossain, S. Najeeb, A. Shahriyar, Z. R. Abdullah, and M. A. Haque, "A pipeline for lung tumor detection and segmentation from CT scans using dilated convolutional neural networks," in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Brighton, UK, May 2019.
[31] M. H. Hesamian, W. Jia, X. He, and P. J. Kennedy, "Atrous convolution for binary semantic segmentation of lung nodule," in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1015–1019, IEEE, Brighton, UK, May 2019.
[32] H. Li, D. Chen, W. H. Nailon, M. E. Davies, and D. Laurenson, "A deep dual-path network for improved mammogram image processing," in Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1224–1228, IEEE, Brighton, UK, May 2019.
[33] H. Huang, L. Lin, R. Tong et al., "UNet 3+: a full-scale connected UNet for medical image segmentation," in Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1055–1059, IEEE, Barcelona, Spain, May 2020.
[34] X. Min, G. Zhai, J. Zhou, X. P. Zhang, X. Yang, and X. Guan, "A multimodal saliency model for videos with high audio-visual correspondence," IEEE Transactions on Image Processing, vol. 29, pp. 3805–3819, 2020.
[35] S. Leglaive, X. Alameda-Pineda, L. Girin, and R. Horaud, "A recurrent variational autoencoder for speech enhancement," in Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 371–375, IEEE, Barcelona, Spain, May 2020.
[36] G. Tzanetakis and P. Cook, "Multi-feature audio segmentation for browsing and annotation," in Proceedings of the 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA'99), pp. 103–106, IEEE, New Paltz, NY, USA, October 1999.
[37] S. Venkatesh, D. Moffat, and E. R. Miranda, "Investigating the effects of training set synthesis for audio segmentation of radio broadcast," Electronics, vol. 10, no. 7, p. 827, 2021.
[38] S. A. Deevi, C. P. Kaniraja, V. D. Mani, D. Mishra, S. Ummar, and C. Satheesh, "HeartNetEC: a deep representation learning approach for ECG beat classification," Biomedical Engineering Letters, vol. 11, no. 1, pp. 69–84, 2021.
[39] S. Suyanto, K. N. Ramadhani, S. Mandala, and A. Kurniawan, "Automatic segmented-syllable and deep learning-based Indonesian audiovisual speech recognition," in Proceedings of the 6th International Conference on Interactive Digital Media (ICIDM), pp. 1–4, IEEE, Bandung, Indonesia, December 2020.
[40] F. Barata, P. Tinschert, F. Rassouli et al., "Automatic recognition, segmentation, and sex assignment of nocturnal asthmatic coughs and cough epochs in smartphone audio recordings: observational field study," Journal of Medical Internet Research, vol. 22, no. 7, Article ID e18082, 2020.
[41] O. Stephen and M. Sain, "Deep learning-based scene image detection and segmentation with speech synthesis in real-time," in Smart Healthcare Analytics in IoT Enabled Environment, pp. 163–171, Springer, Cham, 2020.
[42] C. Park, D. Kim, and H. Ko, "Dilated convolution and gated linear unit based sound event detection and tagging algorithm using weak label," The Journal of the Acoustical Society of Korea, vol. 39, no. 5, pp. 414–423, 2020.
[43] M. F. M. Esa, N. H. Mustaffa, N. H. M. Radzi, and R. Sallehuddin, "Audio deformation based data augmentation for convolution neural network in vibration analysis," IOP Conference Series: Materials Science and Engineering, vol. 551, no. 1, Article ID 012066, 2019.
[44] A. Sendrayaperumal, S. Mahapatra, S. S. Parida et al., "Energy auditing for efficient planning and implementation in commercial and residential buildings," Advances in Civil Engineering, vol. 2021, pp. 1–10, 2021.
[45] L. P. Natrayan, S. S. Sundaram, and J. Elumalai, "Analyzing the uterine physiological with MMG signals using SVM," International Journal of Pharmaceutical Research, vol. 11, no. 2, pp. 165–170, 2019.
[46] K. Seeniappan, B. Venkatesan, N. N. Krishnan et al., "A comparative assessment of performance and emission characteristics of a DI diesel engine fuelled with ternary blends of two higher alcohols with lemongrass oil biodiesel and diesel fuel," Energy & Environment, vol. 13, Article ID 0958305X2110513, 2021.
[47] K. R. Vaishali, S. R. Rammohan, L. Natrayan, D. Usha, and V. R. Niveditha, "Guided container selection for data streaming through neural learning in cloud," International Journal of System Assurance Engineering and Management, vol. 16, pp. 1–7, 2021.
[48] G. Kanimozhi, L. Natrayan, S. Angalaeswari, and P. Paramasivam, "An effective charger for plug-in hybrid electric vehicles (PHEV) with an enhanced PFC rectifier and ZVS-ZCS DC/DC high-frequency converter," Journal of Advanced Transportation, vol. 2022, Article ID 7840102, 2022.
[49] S. Kaliappan, M. D. Raj Kamal, S. Mohanamurugan, and P. K. Nagarajan, "Analysis of an innovative connecting rod by using finite element method," Taga Journal of Graphic Technology, vol. 14, pp. 1147–1152, 2018.
[50] D. K. Jain, S. K. S. Tyagi, S. Neelakandan, M. Prakash, and L. Natrayan, "Metaheuristic optimization-based resource allocation technique for cybertwin-driven 6G on IoE environment," IEEE Transactions on Industrial Informatics, vol. 18, no. 7, pp. 4884–4892, 2022.
[51] P. Asha, L. Natrayan, B. T. Geetha et al., "IoT enabled environmental toxicology for air pollution monitoring using AI techniques," Environmental Research, vol. 205, Article ID 112574, 2022.
[52] A. S. Kaliappan, S. Mohanamurugan, and P. K. Nagarajan, "Numerical investigation of sinusoidal and trapezoidal piston profiles for an IC engine," Journal of Applied Fluid Mechanics, vol. 13, no. 1, pp. 287–298, 2020.
[53] S. S. Sundaram, N. Hari Basker, and L. Natrayan, "Smart clothes with bio-sensors for ECG monitoring," International Journal of Innovative Technology and Exploring Engineering, vol. 8, no. 4, pp. 298–301, 2019.
[54] K. Nagarajan, A. Rajagopalan, S. Angalaeswari, L. Natrayan, and W. D. Mammo, "Combined economic emission dispatch of microgrid with the incorporation of renewable energy sources using improved mayfly optimization algorithm," Computational Intelligence and Neuroscience, vol. 2022, pp. 1–22, 2022.
[55] S. Magesh, V. R. Niveditha, P. S. Rajakumar, S. Radha Ram Mohan, and L. Natrayan, "Pervasive computing in the context of COVID-19 prediction with AI-based algorithms," International Journal of Pervasive Computing and Communications, vol. 16, no. 5, pp. 477–487, 2020.
[56] C. S. S. Anupama, L. Natrayan, E. Laxmi Lydia et al., "Deep learning with backtracking search optimization-based skin lesion diagnosis model," Computers, Materials & Continua, vol. 70, no. 1, pp. 1297–1313, 2021.
[57] S. Raja and A. J. Rajan, "A decision-making model for selection of the suitable FDM machine using fuzzy TOPSIS," Mathematical Problems in Engineering, vol. 2022, Article ID 7653292, 2022.
[58] S. Aggarwal, M. Suchithra, N. Chandramouli et al., "Rice disease detection using artificial and machine learning techniques to improvise agro-business," Scientific Programming, vol. 2022, Article ID 1757888, 2022.