
diagnostics

Article
Efficiently Classifying Lung Sounds through Depthwise
Separable CNN Models with Fused STFT and MFCC Features
Shing-Yun Jung 1, * , Chia-Hung Liao 1 , Yu-Sheng Wu 1 , Shyan-Ming Yuan 1,2, * and Chuen-Tsai Sun 1,2

1 Department of Computer Science, National Chiao Tung University, Hsinchu 300, Taiwan;
[email protected] (C.-H.L.); [email protected] (Y.-S.W.); [email protected] (C.-T.S.)
2 Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu 300, Taiwan
* Correspondence: [email protected] (S.-Y.J.); [email protected] (S.-M.Y.)

Abstract: Lung sounds remain vital in clinical diagnosis as they reveal associations with pulmonary
pathologies. With COVID-19 spreading across the world, it has become more pressing for medical
professionals to better leverage artificial intelligence for faster and more accurate lung auscultation.
This research aims to propose a feature engineering process that extracts the dedicated features for
the depthwise separable convolution neural network (DS-CNN) to classify lung sounds accurately
and efficiently. We extracted a total of three features for the shrunk DS-CNN model: the short-time
Fourier-transformed (STFT) feature, the Mel-frequency cepstrum coefficient (MFCC) feature, and the
fused features of these two. We observed that while DS-CNN models trained on either the STFT or
the MFCC feature achieved an accuracy of 82.27% and 73.02%, respectively, fusing both features led
to a higher accuracy of 85.74%. In addition, our method achieved 16 times higher inference speed on

an edge device and only 0.45% less accuracy than RespireNet. This finding indicates that the fusion
of the STFT and MFCC features and DS-CNN would be a model design for lightweight edge devices
to achieve accurate AI-aided detection of lung diseases.

Keywords: lung sounds; convolutional neural network; feature extraction; automatic auscultations;
depthwise separable convolution
1. Introduction
The term lung sounds refers to "all respiratory sounds heard or detected over the
chest wall or within the chest" [1]. In clinical practice, pulmonary conditions are diagnosed
through lung auscultation, which refers to using a stethoscope for hearing a patient's
lung sounds. Lung auscultation can rapidly and safely rule out severe diseases and
diagnose some pulmonary disorders' flare-ups. Therefore, a stethoscope has been an
indispensable medical device for physicians to diagnose lung disorders for centuries.
However, recognizing the subtle distinctions among various lung sounds is an acquired
skill that requires sufficient training and clinical experience. As COVID-19 sweeps the globe,
lung auscultation still stays vital for monitoring confirmed cases [2]. Remote automatic
auscultation systems may play a crucial role in lowering infection risks in medical workers.
Hence, how artificial intelligence can be leveraged to assist physicians in performing
auscultation remotely and accurately has become ever more imperative.
While a variety of lung sound types have been defined by recent research, this paper
adopts the classification suggested by Pasterkamp et al. [3]. Lung sounds can be classified
into two main categories: normal and adventitious. Normal sounds are audible through
the whole inhalation phase till the early exhalation phase. Spectral characteristics show
that these normal sounds have peaks with typical frequencies below 100 Hz, and the sound
energy steeply decreases between 100 and 200 Hz [4]. Adventitious sounds are the other
sounds usually generated by respiratory disorders and are superimposed on normal sounds.
Furthermore, adventitious sounds can be classified into two basic categories: continuous
and discontinuous. Continuous and discontinuous sounds were termed wheeze and
crackle, respectively, in 1957 [5]. In 1977, wheeze and crackle were sub-classified into more
classes according to various ranges of pitches [6]. Continuous sounds (wheeze) are typically
musical adventitious sounds with frequencies from 80 to 1600 Hz [7]. The term continuous
indicates that the sounds last longer than 250 ms [8]. These continuous sounds are caused
by the narrowing of the airway caliber [9]. Two factors determine the pitches of these
continuous sounds. One is the mass and elasticity of the airway walls, and the other is the
velocity of airflow [9]. The pitches of the continuous sounds correspond to the dominant
frequencies in the power spectrum. Continuous adventitious sounds can clinically signify
obstructive airway diseases such as asthma and chronic obstructive pulmonary disease
(COPD) [10]. Discontinuous sounds (crackle) are non-musical, short, explosive sounds that
usually last shorter than 20 ms [10]. Those discontinuous sounds are produced because
of the abrupt opening or closing of the airways, which are abnormally closed due to the
lung’s increased elastic recoil pressure [11]. The frequency of the discontinuous sounds
ranges from 100 to 2000 Hz depending on the airways’ diameter [10]. Additionally, the
discontinuous sounds can be related to the disease’s process and severity in patients with
pneumonia [12,13] and interstitial lung disorders [11]. Huang et al. [14] found that crackles
were one of the common abnormal breath sounds detected through COVID-19 patients’
auscultations.
Learning-based algorithms, particularly deep learning algorithms, have been driving
the development of remote automatic lung sound auscultation in recent years [15]. A
convolution neural network (CNN), one of the deep learning models, can automatically
learn abstract features from images [16]. The visual representations of lung sound sig-
nals such as spectrograms can be fed into a CNN as image-like features to train lung
sound recognition models [17–23]. Importantly, CNN models require large datasets for
training. The International Conference on Biomedical Health Informatics (ICBHI) Sci-
entific Challenge dataset [24], currently the largest public lung sound dataset, collected
three types of adventitious and normal lung sound records from 126 subjects. Several
inspiring deep learning-related methods have been proposed on the basis of the ICBHI
dataset. Chen et al. [21] proposed an optimized S-transform to generate spectrograms with
enhanced frequency-related features and trained a deep residual network with the special
spectrograms to classify three types of lung sounds. The deep residual network achieved
98.79% accuracy on the ICBHI dataset. García-Ordás et al. [25] proposed a variational
autoencoder (VAE)-based method to address the imbalanced issue of the ICBHI dataset,
which even reached a 0.993 F-score. RespireNet [22] has been proposed to break through
the data amount limitation of the ICBHI dataset. The authors of RespireNet augmented
training datasets by concatenating two sound signals in the same class. This data augmen-
tation technique greatly improved the accuracy of classifying the adventitious sounds. In
addition to CNN, the recurrent neural network (RNN) was proposed to predict respiratory
anomalies based on the ICBHI dataset [15]. According to those previous studies [21,25],
most CNN-based models tend to achieve high accuracy when classifying lung sounds.
However, the standard CNN models require graphics processing units (GPUs) to
support the vast convolutional operations. The depthwise separable (DS) convolution is an
approach to reduce the computational operations of the standard convolution [26]. These
CNN models with DS convolution layers (DS-CNN) then empower those edge devices
with no GPUs and limited computational power to achieve higher efficiency for CNN
model inference. The development of automatic lung auscultation systems on low-cost
hardware devices has drawn a lot of attention [19,27,28]. How to better exploit the value of
DS-CNN for developing remote automatic lung auscultation systems remains to be explored.
This paper aims to propose a feature engineering process to extract the dedicated fea-
tures for DS-CNN to classify four types of lung sounds: normal, continuous, discontinuous,
and unknown. We shrank a DS-CNN model based on MobileNet [29] to save storage space
on edge devices. Then we extracted a total of three features for the shrunk DS-CNN model:
the short-time Fourier transformed (STFT) feature, the Mel-frequency cepstrum coefficient
(MFCC) feature, and the fused features of these two. To evaluate the performance of the
extracted features and the shrunk DS-CNN model, we compared the performance in three
hierarchical levels of strategy: level 1—feature comparison; level 2—model architecture
comparison; and level 3—model performance and inference efficiency comparison. We
observed that the model trained on either the STFT feature or the MFCC feature achieved
the accuracy of 82.27% and 73.02%, respectively. Importantly, fusing both features led
to a higher accuracy of 85.74% in level 1 comparison. In level 2 comparison, the shrunk
DS-CNN outperformed other CNN-based architectures in terms of accuracy and number
of parameters. In level 3 comparison, our method achieved 16 times higher inference speed
on the edge device with a drop of only 0.45% in accuracy compared to RespireNet.

2. Materials and Methods
2.1. Dataset
The dataset for this research was prepared after preprocessing the acoustic recordings
collected by Lin et al. [30]. These WAV format recordings were 15 s long, and the sampling
rate was 4 kHz. The respiratory cycles in the recordings were segmented into clips and
independently labeled by experienced respiratory therapists and physicians as one of the
four types: normal, continuous, discontinuous, and unknown. The respiratory cycles with
inconsistent labels would be further reviewed and discussed by the annotators for
consensus labeling. The audio clips were labeled as unknown if the noise in the clinical
environment, such as vocals or equipment sounds, was too loud for the experts to label the
definite types. The average length of the respiratory cycles in this dataset was 1.25 s. The
audio clips shorter than 1.25 s were padded to this average length with zeros. The clips
longer than 1.25 s were truncated as well. After labeling and adjustment of the length, our
dataset consisted of 3605 normal, 3800 continuous, 3765 discontinuous, and 1521 unknown
lung sound audio clips. This dataset was further divided into three sub-datasets: 72%
randomly selected samples for model training, 8% for validating, and the remaining 20%
for testing.

2.2. Feature Engineering
Our feature engineering process was derived from reference [31]. Fusing of multi-
spectrogram features as one new feature has been proposed to improve sound recognition
accuracy [31]. A total of three features were extracted. One was the STFT feature, and the
second was the MFCC feature. The third feature was extracted by fusing the STFT and
MFCC features. The whole feature engineering process is presented in Figure 1.

Figure 1. Flowchart of the proposed feature engineering process for depthwise separable convolution neural network
(DS-CNN). Before the feature engineering step, each lung sound audio was padded or truncated to 1.25 s long. A series
of parameter combinations were searched in the feature engineering step, including the window size, hop length of the
short-time Fourier transformed (STFT) feature, and the number of Mel-frequency cepstrum coefficient (MFCC) features.
The DS-CNN model's width and depth were determined in the model training step. Several DS-CNN models were trained
and evaluated to extract the best features. For each feature, we selected the parameter combinations that led the DS-CNN
model to achieve the best accuracy.
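To make the preprocessing step in Figure 1 concrete (each clip is padded or truncated to the 1.25 s
average cycle length at the 4 kHz sampling rate described in Section 2.1), a minimal sketch is given
below. It assumes the librosa library; the function name and file path are illustrative and not part of
the authors' implementation.

import numpy as np
import librosa

TARGET_SR = 4000                      # 4 kHz sampling rate of the recordings (Section 2.1)
TARGET_LEN = int(1.25 * TARGET_SR)    # 1.25 s average respiratory-cycle length

def load_fixed_length_clip(path: str) -> np.ndarray:
    """Load one respiratory-cycle clip and pad or truncate it to 1.25 s."""
    signal, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    if len(signal) < TARGET_LEN:
        # zero-pad short clips at the end
        signal = np.pad(signal, (0, TARGET_LEN - len(signal)))
    else:
        # truncate long clips to the fixed length
        signal = signal[:TARGET_LEN]
    return signal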

2.2.1. STFT Feature
STFT transforms only the fast-varying part of the signal, which corresponds to the
high-frequency domain, and preserves the low-varying trend in the time domain. For a
signal sequence {x(n), n = 0, 1, . . . , N} of length N, the discrete STFT at the frequency f
and the mth short time interval is defined as

X(f, m) = ∑_{n=0}^{N−1} x(n) w(n − mR) e^{−j2πfn},  where w(n) = 1 for −L/2 ≤ n ≤ L/2 and w(n) = 0 otherwise.  (1)

Here w(n) is a window function with the window size L, L ∈ {64, 128, 256, 512}, and
R is the hop length, R ∈ {20, 30, 40, 50}. The window size, L, represents the number of
samples included in each window when computing the fast Fourier transform [32]. Both
the window size and the hop length determine how the spectrogram represents the sound
data. Generally, the window size is relevant to the frequency resolution and the time
resolution of the spectrogram. These two parameters were selected to extract the best
features for DS-CNN. Figure 2 demonstrates the continuous-sound, discontinuous-sound,
and normal-sound spectrograms.

Figure 2. The continuous-sound, discontinuous-sound, and normal-sound spectrograms are shown in (a–c), respectively.
The arrows in (a) indicate some peaks of particular frequency domains extending along with the time domain, which
implies that the continuous sounds may require high-frequency resolution to extract distinguishable features. The arrows
in (b) point out that dozens of peaks of particular frequencies go up and down alternatively in a relatively short period
along with the time domain, which implies that time resolutions are more relevant to extract recognizable features for the
discontinuous sounds. The normal-sound spectrogram (c) weighs more in the low-frequency region.
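To illustrate how the window size L and hop length R of Equation (1) enter the feature extraction,
the sketch below computes an STFT magnitude spectrogram for one preprocessed clip. It assumes the
librosa library and uses the best-performing standalone-STFT parameters reported in Section 3
(window size 512, hop length 40); note that librosa applies a Hann window by default, whereas
Equation (1) is written with a rectangular window, and any further scaling of the spectrogram values
is an assumption rather than something specified here.

import numpy as np
import librosa

def stft_feature(signal: np.ndarray, win_size: int = 512, hop_length: int = 40) -> np.ndarray:
    """Magnitude STFT spectrogram of a 1.25 s lung sound clip.

    win_size corresponds to L in Equation (1), searched over {64, 128, 256, 512};
    hop_length corresponds to R, searched over {20, 30, 40, 50}."""
    spectrum = librosa.stft(signal, n_fft=win_size, win_length=win_size,
                            hop_length=hop_length)
    return np.abs(spectrum)  # shape: (win_size // 2 + 1, number of frames)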
2.2.2. MFCC Feature
On the basis of cepstrum analysis, the Mel-frequency cepstrum analysis was developed,
where the human auditory system's response to sounds was considered. The relation
between the Mel-frequency, m, and the frequency, f, is defined as

m = 2595 log(1 + f/700).  (2)

The spectrums windowed by equally spaced Mel-frequencies seem to cause comparable
sensitivities for human auditory perception, and this motivates the usage of MFCC, which
is derived through the following steps [33]:
(1) Calculate the power spectrum, |X(f)|², of the sound signal, x(t), through Fourier transform.
(2) Map a set of equally spaced Mel-frequencies, {m_k, k = 1, 2, 3, . . .}, to the frequency
domain to obtain {f_k, k = 1, 2, 3, . . .}.
(3) Use the triangular windows centered at {f_k} to get the weighted sum of the power
spectrum and then take the logarithm of the power integral for each Mel-frequency.
(4) Use the discrete cosine transform to transform the logarithmic power to get MFCCs.

This paper adopted a short-time version of MFCC, where a period of time signal was
taken to extract the MFCC feature. The first-order and second-order differences of MFCCs
were also extracted and appended to MFCCs as one MFCC feature. The number of MFCC
coefficients, N_mfcc, N_mfcc ∈ {10, 13, 20}, was selected as a parameter for tuning the
appropriate feature. Figure 3 shows the MFCC features of continuous, discontinuous, and
normal lung sounds.

Figure 3. The MFCC features of the continuous sound, discontinuous sound, and normal sound are visualized in (a–c),
respectively. The arrows in (a) point at the dark red areas where the coefficients are positive. Those dark red areas tend to
form irregular texture patterns. The arrows in (b) indicate that the dark red areas alternate with the blue areas where the
coefficients are negative, which tends to form vertical-stripe-like patterns.
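A short-time MFCC feature with its first- and second-order differences can be assembled as sketched
below, again assuming librosa and the best-performing value N_mfcc = 20 reported in Section 3.
Because the exact fusion operator for combining the STFT and MFCC maps is not spelled out in this
section (the fusion idea follows reference [31]), the fusion function shown here is only one plausible
illustration: a row-wise concatenation of the two single-channel maps.

import numpy as np
import librosa

def mfcc_feature(signal: np.ndarray, sr: int = 4000, n_mfcc: int = 20,
                 hop_length: int = 40) -> np.ndarray:
    """MFCCs with appended first- and second-order differences (deltas)."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    delta1 = librosa.feature.delta(mfcc, order=1)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.concatenate([mfcc, delta1, delta2], axis=0)  # (3 * n_mfcc, frames)

def fused_feature(stft_map: np.ndarray, mfcc_map: np.ndarray) -> np.ndarray:
    """Illustrative fusion: stack both maps along the frequency/coefficient axis.

    Both maps share the same number of time frames when extracted with the same
    hop length, so row-wise concatenation is one straightforward option."""
    return np.concatenate([stft_map, mfcc_map], axis=0)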
2.3. DS-CNN
Factorizing standard convolution into depthwise convolution and pointwise convolution
is the key to accelerating convolution operations for DS-CNN. Figure 4 describes how the
standard and depthwise separable convolution work.

Figure 4. (a) Standard convolution; (b) factorizing standard convolution into depthwise convolution and pointwise
convolution.

In what follows, we explicitly compare the computation costs between DS-CNN and
standard CNN layers. Considering the convolutional operation, which is assumed stride
one, padding same, and applied on layer L in a neural network, the computational cost of
standard convolution in Figure 4a is

w · h · N · k · k · M,  (3)

where w, h, and N are the width, height, and channel number of the input feature map
at layer L, respectively. M is the number of square convolution kernels with k spatial
dimensions. For DS-CNN in Figure 4b, the computational cost of depthwise convolution is

w · h · N · k · k.  (4)

The computational cost of pointwise convolution is

w·h·1·1·N·M, (5)

where N is the depth of the 1 × 1 convolution kernel, which combines N channels’ features
produced by depthwise convolution. M is the number of 1 × 1 × N convolution kernels
to produce M output feature maps at layer L + 1 with width, w, and height, h. The
reduction in computation by factorizing standard convolution into depthwise convolution
and pointwise convolution is
computational cost of DS-CNN / computational cost of standard CNN
= (w · h · N · k · k + w · h · 1 · 1 · N · M) / (w · h · N · k · k · M) = 1/M + 1/k².  (6)
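As a quick numerical check of Equation (6), the sketch below compares the multiply–accumulate
counts of a standard convolution and its depthwise separable factorization for one layer; the layer
sizes are illustrative example values, not taken from the paper's architecture.

def conv_costs(w: int, h: int, n: int, m: int, k: int) -> tuple[int, int]:
    """Return (standard, depthwise separable) convolution costs for one layer."""
    standard = w * h * n * k * k * m               # Equation (3)
    separable = w * h * n * k * k + w * h * n * m  # Equations (4) + (5)
    return standard, separable

# Example: 56 x 56 feature map, N = 64 input channels, M = 128 kernels, 3 x 3 kernels.
std, sep = conv_costs(w=56, h=56, n=64, m=128, k=3)
print(sep / std)               # ~0.119
print(1 / 128 + 1 / 3 ** 2)    # Equation (6): 1/M + 1/k^2, also ~0.119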

Shrinking DS-CNN Model


To shrink the model and retain the model performance, a model selection procedure
was derived from reference [29]. The width multiplier, α, α ∈ {0.75, 0.5}, and the number of
DS blocks, β, β ∈ {12, 10, 8}, were adopted to form a simple 2 × 3 grid for model selection.
The architecture of the original MobileNet, including 13 DS blocks and approximately
2 million parameters, was taken as the reference model. The width multiplier, α, was used
to determine the width of DS-CNN by evenly reducing the number of convolution kernels
or fully connected nodes for each layer. The reduced numbers of convolution kernels were
calculated by multiplying α with the original number of convolution kernels. The number
of DS blocks, β, was used to determine the depth of DS-CNN. The numbers of parameters
of different shrunk models produced by the combinations of α and β are listed in Table 1.

Table 1. The numbers of million parameters of different shrunk models produced by combinations
of α and β.

Depth 12-DS Blocks 10-DS Blocks 8-DS Blocks


Width β = 12 β = 10 β=8
α = 0.75 1.67 1.36 1.05
α = 0.50 0.76 0.61 0.47
The number of million parameters in bold indicates that the shrunk model was finally selected.

Eventually, the model with α = 0.75 and β = 10 was selected to strike a balance
between model performance and model complexity. The DS-CNN model was trained from
scratch without pre-trained weight. No data augmentation techniques were applied to
model training.
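A depthwise separable block with the width multiplier α, in the spirit of MobileNet [29], can be
sketched in Keras as below. This is only a schematic of one DS block and of how α evenly reduces the
number of pointwise kernels; it is not the authors' exact layer configuration.

import tensorflow as tf
from tensorflow.keras import layers

def ds_block(x: tf.Tensor, filters: int, alpha: float = 0.75, stride: int = 1) -> tf.Tensor:
    """One depthwise separable block: a depthwise 3x3 convolution followed by a pointwise
    1x1 convolution, with the pointwise kernel count scaled by the width multiplier alpha
    (alpha in {0.75, 0.5} in the grid search described above)."""
    x = layers.DepthwiseConv2D(kernel_size=3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(int(filters * alpha), kernel_size=1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)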

2.4. Model Evaluation


The models were evaluated and compared in a hierarchical way as follows:
Level 1: comparison among features
Level 2: comparison among deep learning model architectures
Level 3: comparison between our method and an existing method
In level 1 comparison, the best features were selected through the feature engineering
process. The performances of a total of three features, the STFT feature, the MFCC feature,
and the fused features of these two, were compared.
In level 2 comparison, the performances of DS-CNN, standard-CNN, and RNN were
compared. VGG-16 [34], AlexNet [35], DS-AlexNet, Long Short-Term Memory (LSTM) [36],
Gated Recurrent Unit (GRU) [37], and Temporal Convolutional Network (TCN) [38] were
selected for comparison with DS-CNN. The selected models were trained using the fused
features of STFT and MFCC.
In level 3 comparison, RespireNet [22] was selected as the baseline to evaluate our
method because RespireNet is open source and can be reproduced exactly as originally
implemented. In contrast, the other methods [19,21,25] without
the publicly released codes were not selected for comparison. The best model of our
method and RespireNet were converted to TensorFlow Lite (TF Lite) models to accelerate
model inferencing. Eighty respiratory cycles, which contained 20 cycles of each lung sound
type, were selected for measuring the inference time. The inference time included the
time of feature extracting and model inferencing. The inference times of our method and
RespireNet were compared on both the edge device, Raspberry Pi 3 B+, and the cloud
server, Google Colab (CPU runtime), with TF Lite models.
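The TF Lite conversion and timing mentioned above can be sketched as follows; this assumes a
trained Keras model object and the standard TensorFlow Lite converter and interpreter, with no
quantization options since none are described in the text. The output file name and the reshape
handling are illustrative.

import time
import numpy as np
import tensorflow as tf

def convert_to_tflite(keras_model: tf.keras.Model, out_path: str = "ds_cnn.tflite") -> None:
    """Convert a trained Keras model to a TF Lite flatbuffer for edge inference."""
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    with open(out_path, "wb") as f:
        f.write(converter.convert())

def timed_inference(tflite_path: str, feature: np.ndarray) -> float:
    """Run one inference with the TF Lite interpreter and return the elapsed seconds."""
    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    # reshape the extracted feature to the model's expected input shape
    data = feature.astype(np.float32).reshape(inp["shape"])
    interpreter.set_tensor(inp["index"], data)
    start = time.perf_counter()
    interpreter.invoke()
    return time.perf_counter() - start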

3. Results
The models’ performances were evaluated by the index of F1 score, recall, precision,
and accuracy. For each sound type
i ∈ {Continuous, Discontinuous, Normal, Unknown}

Recall: REC = M[i, i] / ∑_j M[i, j],  (7)

Precision: PRC = M[i, i] / ∑_j M[j, i],  (8)

F1 score: F1 = 2 · PRC · REC / (PRC + REC).  (9)

Here an element, M[i, j], of the 4 × 4 confusion matrix, M, indicates that M[i, j] samples
are predicted to be label j but are indeed label i. The overall accuracy is defined as

Accuracy = ∑_i M[i, i] / ∑_{i,j} M[i, j].  (10)
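Equations (7)–(10) can be computed directly from the 4 × 4 confusion matrix; a small NumPy sketch
is shown below, following the convention defined here (rows are true labels, columns are predicted
labels).

import numpy as np

def classification_metrics(conf: np.ndarray):
    """Per-class recall, precision, and F1 score plus overall accuracy from a confusion
    matrix M in which M[i, j] counts samples of true label i predicted as label j."""
    diag = np.diag(conf).astype(float)
    recall = diag / conf.sum(axis=1)                      # Equation (7)
    precision = diag / conf.sum(axis=0)                   # Equation (8)
    f1 = 2 * precision * recall / (precision + recall)    # Equation (9)
    accuracy = diag.sum() / conf.sum()                    # Equation (10)
    return recall, precision, f1, accuracy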

The results of the level 1 to level 3 comparisons examined our method's performance across
features, model architectures, and inference efficiency on edge devices. Table 2
shows the results of level 1 comparison. In level 1 comparison, the best STFT feature was
extracted when the window size and the hop length were 512 and 40, respectively. The best
MFCC feature was extracted when the number of MFCCs was 20. The fused features of
STFT and MFCCs, which performed the best, were extracted when the window size and
the hop length were 256 and 40, respectively, after fine-tuning. According to Table 2, all the
indexes, including precision, recall, F1 score, and accuracy, were substantially increased
when STFT and MFCCs were fused as one feature.

Table 2. Results of level 1 comparison.

STFT MFCC Fused Features


F1 REC PRC F1 REC PRC F1 REC PRC
Continuous 0.86 0.86 0.87 0.73 0.74 0.73 0.89 0.88 0.91
Discontinuous 0.78 0.79 0.78 0.69 0.67 0.72 0.82 0.84 0.80
Normal 0.81 0.81 0.80 0.73 0.77 0.69 0.84 0.83 0.86
Unknown 0.87 0.85 0.89 0.82 0.78 0.86 0.90 0.90 0.90
Accuracy 82.27% 73.02% 85.74%

Table 3 summarizes the results of level 2 comparison. In level 2 comparison, CNN-


based models outperformed RNN-based models. Also, the shrunk DS-CNN model
achieved higher accuracy than standard CNN models. The shrunk DS-CNN model with
only 1.36 million parameters achieved the best accuracy, 85.74%. The second-best accuracy,
85.66%, was yielded by VGG-16 with 67.03 million parameters.

Table 3. Results of level 2 comparison.

Model.
Lung Sound Types DS-CNN * VGG-16 AlexNet DS-AlexNet LSTM GRU TCN
F1 Score
Continuous 0.89 0.89 0.85 0.85 0.81 0.80 0.78
Discontinuous 0.82 0.81 0.77 0.75 0.69 0.73 0.70
Normal 0.84 0.84 0.77 0.78 0.75 0.78 0.74
Unknown 0.90 0.91 0.79 0.89 0.88 0.88 0.86
Accuracy 85.74% 85.66% 79.92% 80.86% 76.92% 78.50% 75.51%
Million Parameters 1.36 67.03 32.99 1.71 0.29 0.23 0.02
* DS-CNN means the shrunk DS-CNN model.

The results of level 3 comparison are shown in Tables 4 and 5. According to Table 4,
our method performed nearly as accurately as RespireNet did. Our F1 scores of continuous
and discontinuous are equal to RespireNet’s, which are 0.89 and 0.82, respectively. Our
method achieved 85.74% accuracy, only 0.43% less than RespireNet, which achieved 86.17%.
On the other hand, our method had 16 times higher inference speed and 16 times smaller
model size than RespireNet on the edge device, according to Table 5.

Table 4. Results of level 3 comparison-1: Model performance.

Our Method RespireNet


F1 REC PRC F1 REC PRC
Continuous 0.89 0.88 0.91 0.89 0.85 0.94
Discontinuous 0.82 0.84 0.80 0.82 0.86 0.78
Normal 0.84 0.83 0.86 0.85 0.85 0.84
Unknown 0.90 0.90 0.90 0.93 0.92 0.94
Accuracy 85.74% 86.17%

Table 5. Results of level 3 comparison-2: Model Inference.

Comparison Method   Inference Time per Cycle on Edge   Inference Time per Cycle on Cloud   Model Architecture   Million Parameters   TF Lite Model Size
Our Method 0.22 s 0.038 s DS-CNN 1.36 5 MB
RespireNet 3.54 s 0.45 s Resnet 34 21.36 81 MB

The confusion matrices of level 1 and level 3 comparisons are shown in Figure 5.
For level 1 comparison in Figure 5a–c, the DS-CNN trained with fused STFT and MFCC
features had higher correct predictions for each lung sound type than the other two models.
For level 3 comparison in Figure 5c,d, our method's confusion matrix presented a trend
similar to that of RespireNet.
Figure 5. Confusion matrices of (a) DS-CNN trained with STFT feature, (b) DS-CNN trained with MFCC feature, (c) DS-CNN
trained with fused STFT and MFCC features, and (d) RespireNet. Continuous, discontinuous, normal, and unknown are
abbreviated as C, D, N, and U, respectively.

4. Discussion
The shrunk DS-CNN model performance substantially increased when the model
was trained with the fused features of STFT and MFCC. The STFT and MFCC features
may complement each other because the MFCC feature represents human auditory
perception more closely. Therefore, some acoustic distinctions between different types of
lung sounds may be enhanced by the MFCC feature. Figure 6 shows an example of the
situation mentioned earlier. Besides, the feature should also be extracted with only a few
computational costs to take advantage of DS-CNN, which accelerates convolution
operations to a great extent on edge devices. Both STFT and MFCC can be calculated
efficiently by the fast Fourier transform algorithm [32] to avoid the bottleneck in the
feature extraction step.

Figure 6. The upper part and the lower part show the MFCC feature and the STFT feature, respectively. The STFT feature
of (a) continuous and (b) discontinuous sounds shows few distinctions between the two lung sound types. On the contrary,
the MFCC feature appears to be distinguishable between the two. The STFT feature and the MFCC feature tend to be
complementary to each other.

The fused features of STFT and MFCC extracted from the proposed feature engineering
process contributed to the shrunk DS-CNN model's high accuracy compared with model
architectures. Moreover, all CNN-based models outperformed RNN-based models in terms
of accuracy. The results of level 2 comparison indicate that the fused features that we
extracted are appropriate for DS-CNN-based models. CNN-based models were originally
designed for image recognition tasks, whereas RNN-based models were designed for
learning the features of sequences. The STFT and MFCC features can resemble either
images or multi-dimensional time-series data. However, we fine-tuned the fused features
based on DS-CNN-based models rather than RNN-based models. There is inevitably a
trade-off between frequency resolution and time resolution when extracting STFT and
MFCC features. The demand for frequency or time domain resolution may depend on
the model architectures. Hence, the appropriate features for DS-CNN-based models may
not have enough time domain resolution for the RNN-based models. Additionally, the
proposed feature engineering process can be employed to extract the appropriate features
for any other model architectures. Likewise, the lung sound can be replaced by other sound
types, such as heart sounds.
Compared with RespireNet, our method provided a smaller-sized model, higher
inference speed, and comparable model performance. This result presents a trend similar
to the study of respiratory sound classification in wearable devices [19]. As observed in
reference [19], the DS-CNN-based model (MobileNet) required the least computational
complexity and had only 4.78% less F1 score than the best model they proposed on the
ICBHI dataset. When it comes to developing the automatic lung sound recognition system
on edge devices, the models should not consume too much computational power and
memory space. There should be enough hardware resources to maintain the operations
of the whole system. Through the proposed feature engineering and model-shrinking
process, a shrunk DS-CNN model may be trained to recognize lung sounds on edge devices
accurately and efficiently.
The model training process adopted by the original RespireNet is consistent with many
previous studies [19,21,25]. They used the ICBHI dataset, pre-trained weights, and used
augmented data to train their CNN-based models. The sound signals were transformed
into 3-channel color images. Those color images were preprocessed by cropping or resizing
to enhance visual patterns for the model to learn features. However, our method used
original values of STFT spectrograms and MFCCs with only one channel rather than three
channels to train all CNN-based models. We expected the model to learn the features that
reveal the direct and intuitive information of the spectrograms. The CNN-based models
were trained from scratch without pre-trained weights and data augmentation because
the dataset used in this research is different and larger than the ICBHI dataset. The results
shown in Table 4 imply that the DS-CNN model may learn the features from original
spectrograms without pre-trained weights if the dataset is large enough.
Furthermore, a possible explanation for our method achieving lower evaluation
indexes for the unknown lung sounds might be that no data augmentation was adopted
during model training. The unknown lung sound subset is not as large as those of the
other three types of lung sounds. The data augmentation technique originally proposed
by RespireNet to handle the data imbalance issue of the ICBHI dataset may lead to better
performance for recognizing the unknown lung sounds.
Autonomous stethoscopes developed by integrating AI algorithms into portable digital
stethoscopes have been proposed by Glangetas et al. [39]. Portable digital stethoscopes can
take various forms of smartphone accessories for easy mobility [40]. The fused STFT and
MFCC features with the DS-CNN model may be one appropriate AI algorithm for autonomous
stethoscopes. Autonomous stethoscopes appear to increase the accessibility of high-quality
lung auscultation for medical workers and for patients' self-monitoring. With the help of this
device, clinicians and caregivers could interpret pathological and physiological information
in the lung sounds at the first sign of a patient's abnormal condition. This information
tends to be practical for identifying the need for timely treatment or early hospitalization.

5. Conclusions
We have proposed a feature engineering process to extract dedicated features for the
shrunk DS-CNN to classify four types of lung sounds. We observed that fusing the STFT
and MFCC features led to a higher accuracy of 85.74%. In contrast, the model trained on
only one STFT or MFCC feature achieved the accuracies of 82.27% and 73.02%, respectively.
We then evaluated our method by comparing it with RespireNet. While RespireNet was
0.43% better than our method in terms of accuracy, our method achieved 16 times higher
inference speed on the edge device.
To summarize, these results support the idea that DS-CNN may perform nearly as
accurately as standard CNN by training with appropriate features. The feature engineering
process that we have proposed can be applied to the extraction of dedicated features for
other types of sound signals or for other architectures of deep learning models. However,
we did not use any data augmentation techniques in this study. Further research might
explore how data augmentation techniques affect the performance of sound recognition
models.

Author Contributions: Conceptualization, C.-H.L.; methodology, C.-H.L. and Y.-S.W.; validation,


S.-Y.J.; formal analysis, S.-Y.J.; investigation, C.-H.L.; writing—original draft preparation, S.-Y.J.;
writing—review and editing, S.-M.Y. and C.-T.S.; visualization, S.-Y.J. and Y.-S.W.; supervision, S.-M.Y.
and C.-T.S. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by Research and implementation development of a huge data
security collection system for machines based on edge computing, from the AI Center, Tung-Hai
University, grant number 109A094.
Institutional Review Board Statement: The data collection was approved by the institutional
review board at the Far East Memorial Hospital, IRB No. 107052-F, in August 2018.
Informed Consent Statement: Patient informed consent was obtained using the approved
informed consent form.
Acknowledgments: The authors of this study sincerely thank Chii-Wann Lin’s group in National
Taiwan University and the Heroic-Faith Medical Science for collecting and annotating the data.
The authors also extend their thanks to Wei-Ting Hsu and Chia-Che Tsai for reviewing grammar
and syntax.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Sovijarvi, A.; Dalmasso, F.; Vanderschoot, J.; Malmberg, L.; Righini, G.; Stoneman, S. Definition of terms for applications of
respiratory sounds. Eur. Respir. Rev. 2000, 10, 597–610.
2. Jiang, C.; Zhao, J.; Huang, B.; Zhu, J.; Yu, J. A basic investigation into the optimization of cylindrical tubes used as acoustic
stethoscopes for auscultation in COVID-19 diagnosis. J. Acoust. Soc. Am. 2021, 149, 66–69. [CrossRef] [PubMed]
3. Pasterkamp, H.; Brand, P.L.; Everard, M.; Garcia-Marcos, L.; Melbye, H.; Priftis, K.N. Towards the standardisation of lung sound
nomenclature. Eur. Respir. J. 2016, 47, 724–732. [CrossRef]
4. Gavriely, N.; Nissan, M.; Rubin, A.; Cugell, D.W. Spectral characteristics of chest wall breath sounds in normal subjects. Thorax
1995, 50, 1292–1300. [CrossRef] [PubMed]
5. Robertson, A.J.; Coope, R. Rales, rhonchi, and Laennec. Lancet (Lond. Engl.) 1957, 273, 417–423. [CrossRef]
6. Subcommittee, A.-A.A.H. Report on pulmonary nomenclature. ATS News 1977, 3, 5–6.
7. Gavriely, N.; Palti, Y.; Alroy, G.; Grotberg, J.B. Measurement and theory of wheezing breath sounds. J. Appl. Physiol. 1984, 57,
481–492. [CrossRef] [PubMed]
8. Meslier, N.; Charbonneau, G.; Racineux, J. Wheezes. Eur. Respir. J. 1995, 8, 1942–1948. [CrossRef] [PubMed]
9. Forgacs, P. Lung Sounds/Paul Forgacs; Bailliere Tindall: London, UK, 1978.
10. Sovijarvi, A. Characteristics of breath sounds and adventitious respiratory sounds. Eur. Respir. Rev. 2000, 10, 591–596.
11. Piirilä, P.; Sovijärvi, A.R. Crackles: Recording, analysis and clinical significance. Eur. Respir. J. 1995, 8, 2139–2148. [CrossRef]
[PubMed]
12. Piirilä, P. Changes in crackle characteristics during the clinical course of pneumonia. Chest 1992, 102, 176–183. [CrossRef]
[PubMed]
13. Murphy, R.L.; Vyshedskiy, A.; Power-Charnitsky, V.A.; Bana, D.S.; Marinelli, P.M.; Wong-Tse, A.; Paciej, R. Automated lung sound
analysis in patients with pneumonia. Respir. Care 2004, 49, 1490–1497. [PubMed]
14. Huang, Y.; Meng, S.; Zhang, Y.; Wu, S.; Zhang, Y.; Zhang, Y.; Ye, Y.; Wei, Q.; Zhao, N.; Jiang, J.; et al. The respiratory sound
features of COVID-19 patients fill gaps between clinical data and screening methods. medRxiv 2020. [CrossRef]

15. Pramono, R.X.A.; Bowyer, S.; Rodriguez-Villegas, E. Automatic adventitious respiratory sound analysis: A systematic review.
PLoS ONE 2017, 12, e0177926. [CrossRef] [PubMed]
16. Dara, S.; Tumma, P.; Eluri, N.R.; Kancharla, G.R. Feature extraction in medical images by using deep learning approach. Int. J.
Pure Appl. Math. 2018, 120, 305–312.
17. Bardou, D.; Zhang, K.; Ahmad, S.M. Lung sounds classification using convolutional neural networks. Artif. Intell. Med. 2018, 88,
58–69. [CrossRef] [PubMed]
18. Demir, F.; Sengur, A.; Bajaj, V. Convolutional neural networks based efficient approach for classification of lung diseases. Health
Inf. Sci. Syst. 2020, 8, 1–8. [CrossRef] [PubMed]
19. Acharya, J.; Basu, A. Deep neural network for respiratory sound classification in wearable devices enabled by patient specific
model tuning. IEEE Trans. Biomed. Circuits Syst. 2020, 14, 535–544. [CrossRef] [PubMed]
20. Aykanat, M.; Kılıç, Ö.; Kurt, B.; Saryal, S. Classification of lung sounds using convolutional neural networks. EURASIP J. Image
Video Process. 2017, 2017, 1–9. [CrossRef]
21. Chen, H.; Yuan, X.; Pei, Z.; Li, M.; Li, J. Triple-classification of respiratory sounds using optimized s-transform and deep residual
networks. IEEE Access 2019, 7, 32845–32852. [CrossRef]
22. Gairola, S.; Tom, F.; Kwatra, N.; Jain, M. RespireNet: A Deep Neural Network for Accurately Detecting Abnormal Lung Sounds
in Limited Data Setting. arXiv 2020, arXiv:2011.00196.
23. Wu, Y.-S.; Liao, C.-H.; Yuan, S.-M. Automatic auscultation classification of abnormal lung sounds in critical patients through deep
learning models. In Proceedings of the 2020 3rd IEEE International Conference on Knowledge Innovation and Invention (ICKII),
Kaohsiung, Taiwan, 21–23 August 2020; pp. 9–11.
24. Rocha, B.; Filos, D.; Mendes, L.; Vogiatzis, I.; Perantoni, E.; Kaimakamis, E.; Natsiavas, P.; Oliveira, A.; Jácome, C.; Marques, A. A
respiratory sound database for the development of automated classification. In International Conference on Biomedical and Health
Informatics; Springer: Singapore, 2017; pp. 33–37.
25. García-Ordás, M.T.; Benítez-Andrades, J.A.; García-Rodríguez, I.; Benavides, C.; Alaiz-Moretón, H. Detecting respiratory
pathologies using convolutional neural networks and variational autoencoders for unbalancing data. Sensors 2020, 20, 1214.
[CrossRef] [PubMed]
26. Sifre, L. Rigid-Motion Scattering for Image Classification. arXiv 2014, arXiv:1403.1687.
27. Reyes, B.A.; Reljin, N.; Kong, Y.; Nam, Y.; Ha, S.; Chon, K.H. Towards the development of a mobile phonopneumogram:
Automatic breath-phase classification using smartphones. Ann. Biomed. Eng. 2016, 44, 2746–2759. [CrossRef] [PubMed]
28. Azam, M.A.; Shahzadi, A.; Khalid, A.; Anwar, S.M.; Naeem, U. Smartphone based human breath analysis from respiratory
sounds. In Proceedings of the 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology
Society (EMBC), Honolulu, HI, USA, 18–21 July 2018; pp. 445–448.
29. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient
convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
30. Hsiao, C.-H.; Lin, T.-W.; Lin, C.-W.; Hsu, F.-S.; Lin, F.Y.-S.; Chen, C.-W.; Chung, C.-M. Breathing Sound Segmentation and
Detection Using Transfer Learning Techniques on an Attention-Based Encoder-Decoder Architecture. In Proceedings of the 2020
42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Montreal, QC, Canada,
20–24 July 2020; pp. 754–759.
31. Peng, N.; Chen, A.; Zhou, G.; Chen, W.; Zhang, W.; Liu, J.; Ding, F. Environment Sound Classification Based on Visual
Multi-Feature Fusion and GRU-AWS. IEEE Access 2020, 8, 191100–191114. [CrossRef]
32. Walker, J.S. Fast Fourier Transforms; CRC Press: Boca Raton, FL, USA, 1996; Volume 24.
33. Cristea, P.; Valsan, Z. New cepstrum frequency scale for neural network speaker verification. In Proceedings of the ICECS’99, 6th
IEEE International Conference on Electronics, Circuits and Systems (Cat. No. 99EX357), Paphos, Cyprus, 5–8 September 1999; pp.
1573–1576.
34. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
35. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf.
Process. Syst. 2012, 25, 1097–1105. [CrossRef]
36. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef]
37. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv
2014, arXiv:1412.3555.
38. Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling.
arXiv 2018, arXiv:1803.01271.
39. Glangetas, A.; Hartley, M.-A.; Cantais, A.; Courvoisier, D.S.; Rivollet, D.; Shama, D.M.; Perez, A.; Spechbach, H.; Trombert, V.;
Bourquin, S. Deep learning diagnostic and risk-stratification pattern detection for COVID-19 in digital lung auscultations: Clinical
protocol for a case–control and prospective cohort study. Bmc Pulm. Med. 2021, 21, 1–8. [CrossRef] [PubMed]
40. Vasudevan, R.S.; Horiuchi, Y.; Torriani, F.J.; Cotter, B.; Maisel, S.M.; Dadwal, S.S.; Gaynes, R.; Maisel, A.S. Persistent Value of the
Stethoscope in the Age of COVID-19. Am. J. Med. 2020, 133, 1143–1150. [CrossRef] [PubMed]
