
Deep Attention-based Representation Learning for Heart Sound Classification
Zhao Ren, Student Member, IEEE, Kun Qian, Member, IEEE, Fengquan Dong, Zhenyu Dai,
Yoshiharu Yamamoto, Member, IEEE, Björn W. Schuller, Fellow, IEEE

This work was partially supported by the Horizon H2020 Marie Skłodowska-Curie Actions Initial Training Network European Training Network (MSCA-ITN-ETN) project under grant agreement No. 766287 (TAPAS), Germany, the Zhejiang Lab's International Talent Fund for Young Professionals (Project HANAMI), P. R. China, the JSPS Postdoctoral Fellowship for Research in Japan (ID No. P19081) from the Japan Society for the Promotion of Science (JSPS), Japan, the Grants-in-Aid for Scientific Research (No. 19F19081 and No. 17H00878) from the Ministry of Education, Culture, Sports, Science and Technology, Japan, and the Natural Science Foundation of Shenzhen University General Hospital (No. SUGH2018QD013), P. R. China. K. Qian is the corresponding author.

Z. Ren and B. W. Schuller are with the Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany (e-mail: {zhao.ren, schuller}@informatik.uni-augsburg.de). K. Qian and Y. Yamamoto are with the Educational Physiology Laboratory, Graduate School of Education, The University of Tokyo, Japan (e-mail: {qian, yamamoto}@p.u-tokyo.ac.jp). F. Dong is with the Department of Cardiology, Shenzhen University General Hospital, P. R. China (e-mail: [email protected]). Z. Dai is with the Department of Cardiovascular, Wenzhou Medical University First Affiliated Hospital, P. R. China (e-mail: [email protected]). B. W. Schuller is also with the GLAM – Group on Language, Audio & Music, Imperial College London, UK (e-mail: [email protected]).

Abstract—Cardiovascular diseases are the leading cause of deaths and severely threaten human health in daily life. On the one hand, there have been dramatically increasing demands from both the clinical practice and the smart home application for monitoring the heart status of subjects suffering from chronic cardiovascular diseases. On the other hand, experienced physicians who can perform an efficient auscultation are still lacking in terms of number. Automatic heart sound classification leveraging the power of advanced signal processing and machine learning technologies has shown encouraging results. Nevertheless, human hand-crafted features are expensive and time-consuming. To this end, we propose a novel deep representation learning method with an attention mechanism for heart sound classification. In this paradigm, high-level representations are learnt automatically from the recorded heart sound data. Particularly, a global attention pooling layer improves the performance of the learnt representations by estimating the contribution of each unit in the feature maps. The Heart Sounds Shenzhen (HSS) corpus (170 subjects involved) is used to validate the proposed method. Experimental results show that our approach can achieve an unweighted average recall of 51.2 % for classifying three categories of heart sounds, i. e., normal, mild, and moderate/severe, annotated by cardiologists with the help of Echocardiography.

Index Terms—Computer audition, digital health, heart sound classification, deep learning, attention mechanism.

I. INTRODUCTION

As reported by the World Health Organisation (WHO), cardiovascular diseases (CVDs) are the first leading cause of death globally, causing 17.9 million deaths in 2016 (representing 31 % of all global deaths) [1]. More seriously, this number is predicted to be around 23 million per year by 2030 [2]. Early-stage diagnosis and proper management of CVDs can be very beneficial to mitigate the high costs and social burdens of coping with serious CVDs [3], [4].

Auscultation of the heart sounds, as a cheap, convenient, and non-invasive method, has been successfully used by physicians for over a century [5]. However, this clinical skill needs tremendous training and is still difficult for more than 20 % of the less experienced medical interns to use efficiently [6]. Therefore, developing an automatic auscultation framework can facilitate the early, cost-effective screening of CVDs and, at the same time, help manage the progression of the condition [5].

Computer audition (CA) and its applications in healthcare [7] have yielded encouraging results in the past decades. Due to its non-invasive and ubiquitous characteristic, CA-based methods can facilitate automatic heart sound analysis studies, which have already attracted a plethora of efforts [5]. Additionally, benefiting from the fast development of machine learning (ML), particularly its subset deep learning (DL), and the prevalent smart sensors, wearables, devices, etc., intelligent healthcare can be implemented feasibly in this era of AIoT (artificial intelligence enabled internet of things).

A systematic and comprehensive review of the existing literature on heart sound analysis via ML was provided in [5]. In the early works, designing efficient features ranging from classic Fourier transformation to multi-resolution analysis (e. g., wavelet transformation) dominated the well-documented literature in this field. In recent years, using DL models for analysing and extracting high-level representations from heart sounds automatically has increasingly been studied [8]. Furthermore, as indicated in [9], the current trend is to classify the heart sounds from the whole audio recording without any segmentation step. On the one hand, the state-of-the-art DL methods aim to build a deep end-to-end architecture that can learn high-level representations from the heart sound itself without any human hand-crafted features. On the other hand, the DL models are restrained by the generalisation of the learnt representations from a limited data set. Moreover, black-box DL models cannot produce transparent and understandable decisions that help physicians decide on the next physical examination and an appropriate treatment. Making explainable decisions via DL-based systems is a trend to enhance the trust of physicians in the systems and promote their application in the medical area [10]. In the recent study [11], a promising attention mechanism was proposed to explain the DL models via visualising the internal layers.
To this end, we propose a novel attention-based deep representation learning method for heart sound classification in this study (Fig. 1). The main contributions of the work are: First, to the best of our knowledge, it is the first time that an attention mechanism is introduced to heart sound classification. By leveraging the power of a global attention pooling layer, the DL models can learn more robust and generalised high-level representations from the heart sound. Second, we make a comprehensive investigation and comparison of the topologies of DL models, i. e., convolutional neural networks (CNNs) and recurrent neural networks (RNNs), and validate them on an open access database, i. e., HSS, hence rendering our studies reproducible and sustainable. Third, we compare the proposed method with other state-of-the-art approaches using the same database and standard processing. In addition, we explore the visualisation of the learnt high-level representations of our proposed DL models using an attention mechanism, which can contribute to an explainable AI (XAI) [12]. Last but not least, we indicate the current limitations and give our perspectives in this domain, which can be a good guidance for future work.

Fig. 1. The overview scheme of our heart sound classification procedure: heart sound → log Mel spectrogram → attention-based deep learning model → prediction (normal, mild, moderate/severe).

The remainder of this paper is structured as follows: First, a brief description of related work is given in Section II. Then, we introduce the database and methods used in Section III. The experimental results and discussion are illustrated in Section IV and Section V, respectively. Finally, we summarise our work in Section VI.

II. RELATED WORK

In the classic ML paradigm, human hand-crafted feature extraction is a prerequisite, which aims to design a series of efficient and robust features from the signals for specific tasks, e. g., heart sound classification. Among the features, wavelet transformation (WT) based representations showed efficient and excellent performance. For instance, wavelet features fed into a least square support vector machine (LSSVM) can enable recognising the cases of normal, aortic insufficiency, aortic stenosis, atrial septal defect, mitral regurgitation, and mitral stenosis [13]. Moreover, Uğuz designed entropy features of sub-bands by using discrete wavelet transformation (DWT) for classifying heart sounds [14]. Similarly, tunable-Q wavelet transformation (TQWT) based features that characterise the various types of murmurs in cardiac sound signals were introduced in [15]. Wavelet packet transformation (WPT) based features were used in [16], by which a full decomposition tree can be generated in a one-level decomposition process. Besides using the directly extracted low-level descriptors (LLDs) of the wavelet features, some high-level representations can also be derived. For example, auto-correlation features can be extracted from the sub-band envelopes that are calculated from the sub-band coefficients of the heart sound by DWT [17]. A combination of WT and WPT energy-based features with a deep RNN model was proposed in [18]. Compared with the conventional short-time Fourier transformation (STFT) based features used for heart sound classification (cf. [19]), wavelet features can provide a multi-resolution analysis of the non-stationary signals (heart sounds). This capacity helps to optimise the Heisenberg-alike time-frequency trade-off in time-frequency transformations [20]. Nevertheless, wavelet transformation still has its own drawbacks. In particular, designing a suitable wavelet function is not an easy job, which demands tremendous empirical experiments for specific tasks. Benefiting from the fast development of DL, heart sound feature extraction can be realised without domain knowledge. Higher representations of the heart sounds can be automatically extracted from (pre-trained) CNNs and be fed, e. g., into an SVM classifier [21]. Amiriparian et al. introduced an unsupervised representation learning method using an auto-encoder-based recurrent neural network in the paradigm of sequence-to-sequence (Seq2Seq) learning [22]. In a most recent work, Fernando et al. introduced an attention-based deep learning model for the heart sound segmentation task, and indicated that their model outperformed the state-of-the-art baseline methods [23].

With the generated high-level representations, most end-to-end deep representation learning methods, particularly CNNs and RNNs, use a global pooling layer to summarise the high-dimensional representations into one-dimensional vectors for later classification [24], [25]. For example, global max-pooling selects the maximum value from each two-dimensional feature map in CNNs [25]. Yet, our previous study has shown that global max-pooling loses the contribution of the other, smaller values [26]. Global attention pooling was proposed in [25] to improve the performance of CNNs by estimating the contribution of each unit in the feature maps to the classification task. An attention mechanism was also employed to explain the decisions via visualising the internal layers of DL models in [11], [26]. Inspired by global attention pooling [11], we will show the effectiveness of CNNs with attention at the time-frequency level, and RNNs with attention at the time level, respectively. The input of the deep learning models is thereby the log Mel spectrograms of heart sound signals.

III. MATERIALS AND METHODS

In this section, the HSS corpus, which was collected for heart sound classification, will firstly be introduced. Afterwards, two DL topologies, including a CNN and an RNN, are presented, and the attention mechanisms applied to each of them are described in detail. Finally, the evaluation metrics for the task of heart sound classification will be given.
TABLE I
The data partitions, i. e., train, dev(elopment), and test sets, of the HSS corpus at the three classes, i. e., normal, mild, and mod(erate)/sev(ere), and subject numbers.

        # Subject   Normal   Mild   Mod./Sev.    Σ
Train      100        84      276      142      502
Dev         35        32       98       50      180
Test        35        28       91       44      163
Σ          170       144      465      236      845

A. HSS Corpus

The HSS corpus was established by Shenzhen University General Hospital, Shenzhen, P. R. China [9]. Please note that the study [9] was approved by the ethics committee of the Shenzhen University General Hospital. During the data collection, 170 participants (female: 55, male: 115, age: 65.4 ± 13.2 years) were involved. Specifically, the heart sound signals were recorded from four positions on the body of each subject, including the auscultatory mitral area, the aortic valve auscultation area, the pulmonary valve auscultation area, and the auscultatory area of the tricuspid valve, through an electronic stethoscope (Eko CORE, USA) using Bluetooth 4.0 and a 4 kHz sampling rate. Then, experienced cardiologists annotated the data into three categories: normal, mild, and moderate/severe, using Echocardiography as the golden standard. Finally, 845 audio recordings, each of around 30 s, were obtained, i. e., approximately 7 hours. Considering subject independency and a balanced age and gender distribution, the HSS corpus was split into three data sets: train, dev(elopment), and test sets (cf. Table I). For more details on the HSS collection and further information, interested readers are suggested to refer to [9].
B. Deep Learning Models

In essence, DL is a series of non-linear transformations of the inputs, resulting in highly abstract representations which have shown effectiveness in audio classification tasks [21], [27]. For this study, two typical DL topologies, i. e., a CNN (with a strong feature extraction capacity) and an RNN (which can capture the context information from time-series data), will be investigated.

1) Convolutional Neural Network: With a strong capability of feature extraction, CNN models have been applied to heart sound classification in previous research [28], [29]. As shown in Fig. 2, a CNN model generally contains a stack of convolutional layers and local max-pooling layers to extract high-level representations. Convolutional layers capture abstract features using a set of convolutional kernels, which perform convolution operations on the input or on the feature maps from the intermediate layers. At the m-th layer, m = 1, ..., M, where M is the total number of layers, an I × P × Q feature map h_m is produced, where I is the number of channels, and P × Q stands for the size of h_m at each channel. When the (m + 1)-th layer is a convolutional layer, the j-th channel of h_{m+1} is calculated by

    h_{m+1}^j = Σ_{i=1}^{I} w_{m+1}^{ij} * h_m^i + b_{m+1}^j,    (1)

where h_m^i is the i-th channel of h_m, w_{m+1}^{ij} denotes the (i, j)-th convolutional kernel, * is the convolution operation, and b_{m+1}^j is the bias. Each two-dimensional convolutional kernel works on the feature maps at each channel; therefore, the convolutional layers can learn representations at the time-frequency level. Notably, batch normalisation and an activation function of rectified linear unit (ReLU) are applied to the output of each convolutional layer, as batch normalisation usually improves the stability of CNNs, and both of them can accelerate the convergence speed [30].

Convolutional layers with batch normalisation and a ReLU activation function are mostly followed by local pooling layers, which reduce the computational cost via downsampling the feature maps [31]. Through local pooling layers, the robustness of CNNs against input variation is also improved [31]. Since local max-pooling has been successfully employed in our previous study [25], we use local max-pooling layers following each convolutional layer.
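As a rough illustration of such a block (a sketch of ours, not the authors' released code), one convolutional layer with batch normalisation, ReLU, and 2 × 2 local max-pooling can be written in PyTorch as below. The channel sizes follow the experimental setup reported in Section IV (64, 128, 256, 256), whereas the 3 × 3 kernel size and the name `ConvBlock` are assumptions of ours.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One convolutional block: conv -> batch norm -> ReLU -> 2x2 local max-pooling.
    The 3x3 kernel size is an assumption; the paper does not state it."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.pool = nn.MaxPool2d(kernel_size=2)

    def forward(self, x):
        return self.pool(torch.relu(self.bn(self.conv(x))))

# Four stacked blocks as in the setup (output channels 64, 128, 256, 256);
# the input is a batch of log Mel spectrograms with shape (batch, 1, time, mel).
cnn_trunk = nn.Sequential(
    ConvBlock(1, 64), ConvBlock(64, 128), ConvBlock(128, 256), ConvBlock(256, 256)
)
feature_maps = cnn_trunk(torch.randn(8, 1, 936, 64))   # -> (8, 256, 58, 4)
```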
2) Recurrent Neural Network: RNNs can extract sequential representations from time-series data using a set of recurrent layers (cf. Fig. 3). Each recurrent layer contains a sequence of recurrent units, each of which is used to process the corresponding time step of the input data. The hidden states, output from each recurrent layer, are fed into the next recurrent layer. Finally, the hidden states of the final recurrent layer are used to predict the classes of the samples.

We define the number of total time steps by T. At the t-th time step, t = 1, ..., T, a traditional recurrent unit computes its output via a weighted sum of the input x_t and the hidden state h_{t-1}. Due to the vanishing gradient problem caused by the traditional recurrent unit [32], two recurrent units were proposed in the literature: Long Short-Term Memory (LSTM) cells [33], and Gated Recurrent Units (GRUs) [34].

At the t-th time step, an LSTM unit consists of an input gate i_t, an output gate o_t, a forget gate f_t, and a cell state c_t. The procedure of an LSTM unit is defined by

    i_t = σ(w_i x_t + u_i h_{t-1} + b_i),    (2)
    f_t = σ(w_f x_t + u_f h_{t-1} + b_f),    (3)
    o_t = σ(w_o x_t + u_o h_{t-1} + b_o),    (4)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(w_c x_t + u_c h_{t-1} + b_c),    (5)
    h_t = o_t ⊙ tanh(c_t),    (6)

where w and u are the weight matrices, b denotes the bias, σ stands for a logistic sigmoid function, and ⊙ means the element-wise multiplication. Compared to the traditional recurrent unit, an LSTM cell can control what information to remember using an input gate, and what to forget using a forget gate.

Different from an LSTM cell, a GRU contains a reset gate r_t and an update gate z_t at the t-th time step. The procedure of a GRU is defined by

    r_t = σ(w_r x_t + u_r h_{t-1} + b_r),    (7)
    z_t = σ(w_z x_t + u_z h_{t-1} + b_z),    (8)
    h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ tanh(w_h x_t + u_h (r_t ⊙ h_{t-1}) + b_h).    (9)
Fig. 2. The structure of the chosen CNN model with attention. The inputs are log Mel spectrograms. The CNN model consists of several convolutional layers, local max-pooling layers, an attention layer, and a log softmax layer for classification.

Fig. 3. The structure of the chosen RNN model. The RNN model learns sequential representations from the log Mel spectrograms, and the features from the final recurrent layer are then processed by an attention layer and a log softmax layer for classification. In the attention layer, A = A_1, A_2, ..., A_T is the attention vector, and C = C_1, C_2, ..., C_T is the classification vector, while T denotes the frame number of each log Mel spectrogram.

With two gates inside one unit, a GRU has fewer parameters than an LSTM cell. As both LSTM–RNNs and GRU–RNNs have been employed in audio classification tasks [9], [35], the effectiveness of both is explored in this study.

In a similar way to CNNs, layer normalisation [36] and an activation function of scaled exponential linear unit (SELU) [37] follow each recurrent layer, since layer normalisation can stabilise the hidden state dynamics in RNNs [36], and the SELU activation function has been successfully applied in a previous study [37].
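As a rough sketch of this recurrent front end (ours, not the authors' implementation), the three recurrent layers with output sizes 256, 1 024, and 256 reported in the setup, each followed by layer normalisation and a SELU activation, could be assembled in PyTorch as follows; the GRU variant is obtained by passing `nn.GRU` instead of `nn.LSTM`.

```python
import torch
import torch.nn as nn

class RecurrentTrunk(nn.Module):
    """Three recurrent layers (256, 1024, 256 units), each followed by
    layer normalisation and a SELU activation, as described in the setup."""
    def __init__(self, input_size=64, hidden_sizes=(256, 1024, 256), cell=nn.LSTM):
        super().__init__()
        self.layers, self.norms = nn.ModuleList(), nn.ModuleList()
        for hidden in hidden_sizes:
            self.layers.append(cell(input_size, hidden, batch_first=True))
            self.norms.append(nn.LayerNorm(hidden))
            input_size = hidden

    def forward(self, x):                  # x: (batch, time, mel)
        for rnn, norm in zip(self.layers, self.norms):
            x, _ = rnn(x)                  # hidden states of every time step
            x = torch.selu(norm(x))
        return x                           # (batch, time, 256)

hidden_states = RecurrentTrunk(cell=nn.GRU)(torch.randn(8, 936, 64))
```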
C. Attention Mechanism

It is essential to interpret the key parts of the input inside a deep learning model, especially in applications of medical diagnosis. As mentioned in Section II, global attention pooling can evaluate the contribution of each unit in a representation. We will now introduce the attention mechanisms in CNNs and RNNs, respectively.

1) Attention in a CNN: When a log Mel spectrogram is fed into a CNN model, the feature map h_M output by the final layer before the attention mechanism has three dimensions I′ × P′ × Q′, where I′ is the number of channels, and P′ × Q′ denotes the feature map size at the time-frequency level. To achieve the heart sound classification, the dimensions of h_M are reduced from three to one. During this procedure of dimension reduction, the global attention pooling evaluates how much each time-frequency bin in h_M contributes to the final predictions by estimating a weight value for each bin. As shown in Fig. 2, the global attention pooling consists of two components: the top one has a convolutional layer, and the bottom one is comprised of a convolutional layer and a normalisation operation. In the top component, the convolutional layer is set up with 1 × 1 kernels and an output channel number equal to the class number. In the bottom component, the convolutional layer has the same hyperparameters as that in the top one. Afterwards, to calculate the weight tensor of h_M, an activation function is employed to rectify the values of the feature map from the convolutional layer in the bottom component. Both softmax and sigmoid functions can rectify the values into the interval of [0, 1]. Further, a normalisation is applied to the rectified feature map F using

    F* = F / (Σ_{p=1}^{P′} Σ_{q=1}^{Q′} F_{pq}),    (10)

where F* is the output of the bottom component. Next, the feature map from the top component is multiplied with F*, leading to an element-wise product, which is then summed into a vector with a length equalling the number of classes. Finally, log softmax is employed to fit the chosen negative log-likelihood (NLL) loss function.
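A minimal sketch of this global attention pooling head, following our reading of Eq. (10) rather than the released implementation, is given below. Only the sigmoid variant is shown, since the axis over which the softmax variant would be applied is not stated in the text; the class name and the assumption of per-class-map normalisation are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttentionPooling2d(nn.Module):
    """Global attention pooling over a (batch, channels, P', Q') feature map h_M.
    Top branch: 1x1 conv giving class scores per time-frequency bin.
    Bottom branch: 1x1 conv + sigmoid, normalised over all bins (Eq. (10)),
    used as the contribution weight of each bin."""
    def __init__(self, in_channels, num_classes=3):
        super().__init__()
        self.cla = nn.Conv2d(in_channels, num_classes, kernel_size=1)  # top component
        self.att = nn.Conv2d(in_channels, num_classes, kernel_size=1)  # bottom component

    def forward(self, h):
        cla = self.cla(h)                               # (batch, classes, P', Q')
        att = torch.sigmoid(self.att(h))                # rectify into [0, 1]
        att = att / att.sum(dim=(2, 3), keepdim=True)   # Eq. (10): normalise each map
        logits = (cla * att).sum(dim=(2, 3))            # weighted sum over all bins
        return F.log_softmax(logits, dim=1)             # feeds the NLL loss

log_probs = GlobalAttentionPooling2d(256)(torch.randn(8, 256, 58, 4))
loss = F.nll_loss(log_probs, torch.randint(0, 3, (8,)))
```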
2) Attention in RNN: The representation from the recurrent layers has two dimensions T × Q″, where Q″ denotes the length of the feature at each time frame. While summarising the representation into a vector for classification, it is worth explaining the essential time frames using global attention pooling in RNNs. As the number of time frames is equal to that of the original log Mel spectrogram, an attention mechanism can show more details at the frame level in RNNs than in CNNs.

As shown in Fig. 3, the global attention pooling in RNNs also includes two components, as does the attention mechanism in CNNs. In a similar way to the attention mechanism in CNNs, the left component (corresponding to the top one in Fig. 2) here contains a one-dimensional convolutional layer, in which the kernel size is 1 and the output channel number is equal to the class number, leading to a classification tensor C of size T × class number. The right component (corresponding to the bottom one in Fig. 2) consists of a convolutional layer with the same setting as that in the left component, and a normalisation procedure. In the right component, the convolutional layer is also followed by an activation function (softmax or sigmoid) to rectify the values of the representation.
Then, a normalisation is applied to the rectified representation A using

    A* = A / (Σ_{t=1}^{T} A_t),    (11)

where A* is the normalised feature in the right component. The element-wise product of A* and C is then followed by a log softmax layer for the heart sound classification.
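The frame-level counterpart can be sketched analogously (again our reading, not the authors' code); the text does not spell out the reduction from T frames to a class vector, so the sum over frames below is assumed by analogy to the CNN case.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAttentionPooling1d(nn.Module):
    """Frame-level attention pooling over recurrent hidden states (batch, T, Q'').
    Left component: kernel-size-1 1-D conv -> classification tensor C (T x classes).
    Right component: same conv + sigmoid, normalised over the T frames (Eq. (11))."""
    def __init__(self, feature_size, num_classes=3):
        super().__init__()
        self.cla = nn.Conv1d(feature_size, num_classes, kernel_size=1)
        self.att = nn.Conv1d(feature_size, num_classes, kernel_size=1)

    def forward(self, h):
        h = h.transpose(1, 2)                        # (batch, Q'', T) for Conv1d
        cla = self.cla(h)                            # (batch, classes, T)
        att = torch.sigmoid(self.att(h))             # rectify into [0, 1]
        att = att / att.sum(dim=2, keepdim=True)     # Eq. (11): normalise over frames
        logits = (cla * att).sum(dim=2)              # assumed sum over time frames
        return F.log_softmax(logits, dim=1)

log_probs = GlobalAttentionPooling1d(256)(torch.randn(8, 936, 256))
```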
D. Evaluation Metrics

To evaluate the performance of the proposed models, the unweighted average recall (UAR) is employed as the main evaluation metric, taking the imbalanced characteristic of the HSS database and the inherent phenomena into account. Compared to another popular evaluation metric, the weighted average recall (WAR), aka accuracy, UAR is more reasonable for measuring the performance of a model trained on imbalanced data [38]. The value of UAR is defined as

    UAR = (Σ_{i=1}^{N_c} recall_i) / N_c,    (12)

where recall_i is the recall achieved for the i-th class, and N_c denotes the number of classes (N_c = 3 in this study).

Additionally, when comparing two methods' performances, we use a one-tailed z-test [39] to check whether a finding is significant (p < 0.05) or not.
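Equation (12) is simply the mean of the per-class recalls. As a small illustration (not from the paper), it can be computed from predictions with scikit-learn's `recall_score` using macro averaging, which is exactly the unweighted average recall.

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 0, 1, 1, 1, 2, 2])   # 0: normal, 1: mild, 2: moderate/severe
y_pred = np.array([0, 1, 1, 1, 2, 1, 2])

# UAR = mean of per-class recalls (Eq. (12)); "macro" averaging gives each of the
# N_c = 3 classes the same weight regardless of its number of samples.
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR = {uar:.3f}")                   # recalls 0.5, 0.667, 0.5 -> UAR = 0.556
```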
IV. EXPERIMENTAL RESULTS

We give a brief description of our experimental setup at first. Then, we present and discuss the results achieved in this study.

A. Setup

First, a series of 936 × 64 log Mel spectrograms are extracted from the audio signals in the HSS corpus using a Hamming window of 256 samples width with 50 % overlap and 64 Mel frequency bins.
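Under these parameters (4 kHz audio, a 256-sample Hamming window with a 128-sample hop, 64 Mel bins), a roughly 30 s recording indeed yields on the order of 936 frames. A librosa-based extraction along these lines is sketched below; the dB compression via `power_to_db` and the file name are assumptions of ours, as the exact log compression used by the authors is not specified.

```python
import librosa

def log_mel_spectrogram(path, sr=4000, n_fft=256, hop=128, n_mels=64):
    """Extract a (time, 64) log Mel spectrogram with a 256-sample Hamming
    window and 50 % overlap, matching the setup described above."""
    audio, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop,
        win_length=n_fft, window="hamming", n_mels=n_mels)
    return librosa.power_to_db(mel).T       # roughly (936, 64) for a 30 s recording

# features = log_mel_spectrogram("heart_sound.wav")   # hypothetical file name
```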
During training, all models are learnt with an Adam optimiser and a batch size of 32. The initial learning rate is empirically set to 0.0001, and is reduced to 90 % of its value at every 100-th iteration with the aim of stabilising the training process. Finally, the learnt models at the 3 000-th iteration are used to predict the audio samples in the development/test set.
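Read literally, this schedule multiplies the learning rate by 0.9 every 100 iterations and stops after 3 000 iterations. A hedged PyTorch sketch of such a training configuration is shown below; the linear placeholder model and the random stand-in batches are ours, not the authors' training code, and a real run would use the CNN or RNN described above over the HSS spectrograms.

```python
import torch
import torch.nn as nn

# Placeholder model and data: stand-ins only, to make the training loop runnable.
model = nn.Sequential(nn.Flatten(), nn.Linear(936 * 64, 3), nn.LogSoftmax(dim=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.9)
criterion = nn.NLLLoss()                    # paired with the log softmax output layer

for step in range(3000):                    # models at the 3 000-th iteration are kept
    spectrograms = torch.randn(32, 936, 64)            # stand-in for a training batch
    labels = torch.randint(0, 3, (32,))
    optimizer.zero_grad()
    loss = criterion(model(spectrograms), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()                        # lr reduced to 90 % at every 100-th iteration
```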
All models employ a flattening layer or a global pooling layer before the final log softmax layer for classification. The structures before the flattening or global pooling layer in the deep neural networks are empirically set as follows.

• The CNN models consist of four convolutional layers with output channels 64, 128, 256, and 256. Each convolutional layer is followed by a local max-pooling layer with 2 × 2 kernels.

• Both the LSTM–RNN and GRU–RNN models contain three recurrent layers with output channels 256, 1 024, and 256.

To investigate the effect of a balanced training set on the DL models, we compare the results on the original imbalanced HSS data and on balanced HSS training data produced by a random upsampling strategy aiming at class balance [40].
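One plausible reading of this random upsampling (ours, not necessarily the exact procedure of [40]) is to resample the minority classes with replacement until every class matches the largest one; with the HSS training partition (84 normal, 276 mild, 142 mod./sev.), each class would then contain 276 samples.

```python
import numpy as np

def upsample_to_balance(features, labels, seed=0):
    """Randomly repeat samples of the minority classes (with replacement)
    until each class has as many training samples as the largest class.
    `features` and `labels` are assumed to be NumPy arrays."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    largest = max(np.sum(labels == c) for c in np.unique(labels))
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        extra = rng.choice(idx, size=largest - len(idx), replace=True)
        keep.extend(idx)
        keep.extend(extra)
    keep = np.array(keep)
    return features[keep], labels[keep]
```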
TABLE II
The results comparison of different deep learning topologies.

                           w/o upsampling      w/ upsampling
UAR [%]                     Dev     Test        Dev     Test
CNN
  Flattening                35.6    37.6        35.6    39.9
  Max-pooling               41.7    38.4        39.3    38.5
  Attention-softmax         31.5    43.1        38.3    47.3
  Attention-sigmoid         40.1    51.2        39.6    50.5
LSTM–RNN
  Last-time stamp           39.3    36.1        40.7    35.7
  Max-pooling               32.9    38.9        34.6    38.1
  Attention-softmax         40.0    39.6        39.0    39.4
  Attention-sigmoid         39.6    38.9        42.0    42.6
GRU–RNN
  Last-time stamp           39.0    36.5        37.4    36.1
  Max-pooling               38.7    35.8        40.7    35.2
  Attention-softmax         30.8    44.7        35.3    46.8
  Attention-sigmoid         34.9    44.2        34.5    45.7

TABLE III
The results comparison among the state-of-the-art methods and our proposed model.

UAR [%]                                     Dev     Test
ComParE baseline (End2You) [41]             41.2    37.7
ComParE baseline (openSMILE) [41]           50.3    46.4
ComParE baseline (openXBOW) [41]            42.6    52.3
ComParE baseline (fusion) [41]               –      56.2
Ensemble of transfer learning [42]          57.9    42.1
Utterance-level feature and SVMs [43]       53.2    49.3
Seq2Seq autoencoders and SVMs [22]          35.2    47.9
Log Mel features and SVMs [9]               46.5    49.7
Our proposed approach                       40.1    51.2

B. Results

The experimental results (UARs in [%]) of all three DL topologies (CNN, LSTM–RNN, and GRU–RNN) are shown in Table II. The best result (a UAR of 51.2 %) is achieved by the CNN model with the attention mechanism (using a sigmoid function). The best results for the LSTM–RNN and the GRU–RNN are 42.6 % UAR and 46.8 % UAR, respectively. We can see that an attention-based mechanism can significantly improve the corresponding DL models in recognising heart sounds. For instance, the CNN with sigmoid-attention (a UAR of 51.2 %) performs better than a CNN with flattening (a UAR of 37.6 %) and a CNN with max-pooling (a UAR of 38.4 %) (one-tailed z-test, p < 0.01), and a GRU–RNN with softmax-attention (a UAR of 46.8 %) outperforms the GRU–RNNs without attention (UARs of 36.1 % and 35.2 %) (one-tailed z-test, p < 0.05). The upsampling strategy can slightly improve the performances of the best RNN models. Compared to other state-of-the-art studies, our proposed method performs better than most results achieved by single models (cf. Table III).

When looking at the confusion matrices (cf. Fig. 4), we find that the best CNN and GRU–RNN models outperform the best LSTM–RNN model in recognising the 'Mild' type of heart sounds. For all three models, both the 'Normal' and 'Mod./Sev.' types of heart sounds are incorrectly recognised as the 'Mild' type of heart sounds. Fig. 5 and Fig. 6 present the macro-averaged receiver operating characteristic (ROC) curves and the visualisation of the best three proposed attention-based DL models on each class, respectively.

Fig. 4. Confusion matrices (normalised) achieved by the best models on the test set. The best three models are (a) CNN (UAR: 51.2 %), (b) LSTM–RNN (UAR: 42.6 %), and (c) GRU–RNN (UAR: 46.8 %), respectively.

Fig. 5. Comparison of the macro-average receiver operating characteristic (ROC) curves of the best three models on the test set. The corresponding area under the curve (AUC) is also computed for each model (CNN: 0.73, LSTM–RNN: 0.60, GRU–RNN: 0.70).

V. DISCUSSION

In this section, we summarise the findings from this study. Afterwards, we indicate the limitations and future work by providing our perspectives.

A. Findings of this Study

In this study, simple data augmentation (upsampling) cannot yield significantly better results than using the original data set (cf. Table II). We may think that the upsampling technique cannot generate sufficiently informative instances for improving the models' performances. A CNN is found to be superior to an RNN in recognising heart sounds in this study. As shown in Fig. 5, the ROC curves of the considered CNN and GRU–RNN can yield a higher true positive rate at a given false positive rate compared to the LSTM–RNN, and the true positive rate of the CNN is superior to or comparable to that of our GRU–RNN. Finally, the area under the ROC curve (AUC) of the CNN is the highest among the three models.

As depicted in Fig. 6, compared to the 'Normal' or 'Mild' types of heart sounds, the 'Mod./Sev.' type shows more irregular waveforms and spectrograms. In addition, by checking the learnt high-level representations of the CNN models, the 'Mod./Sev.' types of heart sounds can have a higher number of higher-energy components than the other two types at similar frequency bands. Such irregular changes in frequency bands along the time axis of the heart sound might be caused by the pathological changes in the heart. When looking at the learnt representations of the RNN models (cf. Fig. 6), we can see the periodic signal's characteristics in the 'Normal' types of heart sound. It is worth exploring the fundamental mechanism of CVDs and their corresponding properties in the changes of heart sounds.

B. Limitations and Perspectives

Data size limitation is the biggest challenge in the current study. Moreover, similar to other clinical data studies, e. g., on snore sounds [44], asking experienced medical experts to annotate massive data is expensive, time-consuming, and even unavailable in practice. Even though the data augmentation did not show excellent performance in this study, it is a necessary step in improving the DL models' generalisation and robustness. More recently, some advanced data augmentation technologies, e. g., generative adversarial networks (GANs) [45], can be considered. In future work, we should explore using more sophisticated data augmentation technologies for heart sound classification. Moreover, (labelled) data scarcity is a challenging issue for almost all biomedical areas, including heart sound analysis. One should consider using unsupervised learning, semi-supervised learning, active learning, and cooperative learning paradigms in future studies.

The best model's result is encouraging but modest. In a future effort, one should consider using hybrid network architectures [46] or model fusion strategies [47]. On the one hand, we can find the promising results achieved by the deep attention-based models. On the other hand, the inherent mechanism is still unclear.
We tried to visualise the learnt representations of the hidden layers, but this failed to yield any consolidated conclusion. Another direction is to explore the learnt representations of the DL models, which aims to present the interpretations between the model architectures and the pathological meaning of the heart sound. An explainable AI is essential for intelligent medical applications.

Fig. 6. Visualisation of three examples with the classes Normal, Mild, and Moderate/Severe, respectively. Each example consists of an original audio signal, its corresponding log Mel spectrogram, the attention matrix in the CNNs, the attention vector in the LSTM–RNNs, and the attention vector in the GRU–RNNs.

VI. CONCLUSION

In this work, we proposed a novel attention-based deep representation learning method for heart sound classification. We also investigated and compared different topologies of the DL models and found the considered CNN model to be the best option in this study. The efficacy of the proposed method was successfully validated on the publicly accessible HSS corpus. We also compared the results with other state-of-the-art work and pointed out the current limitations and future directions. For a three-category classification task, the proposed approach achieved an unweighted average recall of 51.2 %, which outperformed the other models trained by traditional human hand-crafted features or other deep learning approaches. In future work, we will improve our model's generalisation and explainability for the heart sound classification task.

ACKNOWLEDGMENT

The authors would like to thank the colleagues who collected the HSS corpus.

REFERENCES

[1] World Health Organisation (WHO). (2017) Cardiovascular diseases (CVDs) key facts. [Online]. Available: https://www.who.int/en/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
[2] E. J. Benjamin, P. Muntner, and M. S. Bittencourt, "Heart disease and stroke statistics-2019 update: a report from the American Heart Association," Circulation, vol. 139, no. 10, pp. e56–e528, 2019.
[3] L. H. Schwamm, N. Chumbler, E. Brown, G. C. Fonarow, D. Berube, K. Nystrom, R. Suter, M. Zavala, D. Polsky, K. Radhakrishnan, N. Lacktman, K. Horton, M.-B. Malcarney, J. Halamka, and A. C. Tiner, "Recommendations for the implementation of telehealth in cardiovascular and stroke care: a policy statement from the American Heart Association," Circulation, vol. 135, no. 7, pp. e24–e44, 2017.
[4] J. Hu, X. Cui, Y. Gong, X. Xu, B. Gao, T. Wen, T. J. Lu, and F. Xu, "Portable microfluidic and smartphone-based devices for monitoring of cardiovascular diseases at the point of care," Biotechnology Advances, vol. 34, no. 3, pp. 305–320, 2016.
[5] A. K. Dwivedi, S. A. Imtiaz, and E. Rodriguez-Villegas, "Algorithms for automatic analysis and classification of heart sounds–A systematic review," IEEE Access, vol. 7, pp. 8316–8345, 2018.
[6] S. Mangione, "Cardiac auscultatory skills of physicians-in-training: a comparison of three English-speaking countries," The American Journal of Medicine, vol. 110, no. 3, pp. 210–216, 2001.
[7] K. Qian, X. Li, H. Li, S. Li, W. Li, Z. Ning, S. Yu, L. Hou, G. Tang, J. Lu, F. Li, S. Duan, C. Du, Y. Cheng, Y. Wang, L. Gan, Y. Yamamoto, and B. W. Schuller, "Computer audition for healthcare: Opportunities and challenges," Frontiers in Digital Health, vol. 2, no. 5, pp. 1–4, 2020.
[8] G. D. Clifford, C. Liu, B. Moody, J. Millet, S. Schmidt, Q. Li, I. Silva, and R. G. Mark, "Recent advances in heart sound analysis," Physiological Measurement, vol. 38, no. 8, pp. E10–E25, 2017.
[9] F. Dong, K. Qian, R. Zhao, A. Baird, X. Li, Z. Dai, B. Dong, F. Metze, Y. Yamamoto, and B. W. Schuller, "Machine listening for heart status monitoring: Introducing and benchmarking HSS–the heart sounds Shenzhen corpus," IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 7, pp. 2082–2092, 2020.
[10] A. Holzinger, C. Biemann, C. S. Pattichis, and D. B. Kell, "What do we need to build explainable AI systems for the medical domain?" arXiv preprint arXiv:1712.09923, 2017.
[11] Y. Xu, Q. Kong, Q. Huang, W. Wang, and M. D. Plumbley, "Attention and localization based on a deep convolutional recurrent model for weakly supervised audio tagging," in Proc. INTERSPEECH, Stockholm, Sweden, 2017, pp. 3083–3087.
[12] A. Adadi and M. Berrada, "Peeking inside the black-box: A survey on explainable artificial intelligence (XAI)," IEEE Access, vol. 6, pp. 52138–52160, 2018.
[13] S. Ari, K. Hembram, and G. Saha, "Detection of cardiac abnormality from PCG signal using LMS based least square SVM classifier," Expert Systems with Applications, vol. 37, no. 12, pp. 8019–8026, 2010.
[14] H. Uğuz, "Adaptive neuro-fuzzy inference system for diagnosis of the heart valve diseases using wavelet transform with entropy," Neural Computing and Applications, vol. 21, no. 7, pp. 1617–1628, 2012.
[15] S. Patidar, R. B. Pachori, and N. Garg, "Automatic diagnosis of septal defects based on tunable-Q wavelet transform of cardiac sound signals," Expert Systems with Applications, vol. 42, no. 7, pp. 3315–3326, 2015.
[16] Y. Zheng, X. Guo, and X. Ding, "A novel hybrid energy fraction and entropy-based approach for systolic heart murmurs identification," Expert Systems with Applications, vol. 42, no. 5, pp. 2710–2721, 2015.
[17] S.-W. Deng and J.-Q. Han, "Towards heart sound classification without segmentation via autocorrelation feature and diffusion maps," Future Generation Computer Systems, vol. 60, pp. 13–21, 2016.
[18] K. Qian, Z. Ren, F. Dong, W.-H. Lai, B. W. Schuller, and Y. Yamamoto, "Deep wavelets for heart sound classification," in Proc. ISPACS, Taipei, Taiwan, China, 2019, pp. 1–2.
[19] P. Wang, C. S. Lim, S. Chauhan, J. Y. A. Foo, and V. Anantharaman, "Phonocardiographic signal analysis method using a modified hidden Markov model," Annals of Biomedical Engineering, vol. 35, no. 3, pp. 367–374, 2007.
[20] N. De Bruijn, "Uncertainty principles in Fourier analysis," in Inequalities (Proc. Sympos. Wright-Patterson Air Force Base, Ohio, 1965). Academic Press, New York, NY, 1967, pp. 57–71.
[21] Z. Ren, N. Cummins, V. Pandit, J. Han, K. Qian, and B. W. Schuller, "Learning image-based representations for heart sound classification," in Proc. DH. Lyon, France: ACM, 2018, pp. 143–147.
[22] S. Amiriparian, M. Schmitt, N. Cummins, K. Qian, F. Dong, and B. Schuller, "Deep unsupervised representation learning for abnormal heart sound classification," in Proc. EMBC, Honolulu, HI, 2018, pp. 4776–4779.
[23] T. Fernando, H. Ghaemmaghami, S. Denman, S. Sridharan, N. Hussain, and C. Fookes, "Heart sound segmentation using bidirectional LSTMs with attention," IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 6, pp. 1601–1609, 2020.
[24] N. Akhtar and U. Ragavendran, "Interpretation of intelligence in CNN-pooling processes: A methodological survey," Neural Computing and Applications, no. 32, pp. 879–898, 2020.
[25] Z. Ren, Q. Kong, K. Qian, M. Plumbley, and B. Schuller, "Attention-based convolutional neural networks for acoustic scene classification," in Proc. DCASE, Surrey, UK, 2018, pp. 39–43.
[26] Z. Ren, Q. Kong, J. Han, M. Plumbley, and B. Schuller, "Attention-based atrous convolutional neural networks: Visualisation and understanding perspectives of acoustic scenes," in Proc. ICASSP, Brighton, UK, 2019, pp. 56–60.
[27] S. Amiriparian, M. Freitag, N. Cummins, and B. Schuller, "Sequence to sequence autoencoders for unsupervised representation learning from audio," in Proc. of the DCASE 2017 Workshop, 2017.
[28] M. Tschannen, T. Kramer, G. Marti, M. Heinzmann, and T. Wiatowski, "Heart sound classification using deep structured features," in Proc. CinC. Vancouver, Canada: IEEE, 2016, pp. 565–568.
[29] H. Ryu, J. Park, and H. Shin, "Classification of heart sound recordings using convolution neural network," in Proc. CinC. Vancouver, Canada: IEEE, 2016, pp. 1153–1156.
[30] H. Ide and T. Kurita, "Improvement of learning for CNN with ReLU activation by sparse regularization," in Proc. IJCNN, Anchorage, AK, 2017, pp. 2684–2691.
[31] T. Kobayashi, "Global feature guided local pooling," in Proc. ICCV, Seoul, Korea, 2019, pp. 3365–3374.
[32] S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 02, pp. 107–116, 1998.
[33] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[34] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," in Proc. NIPS Deep Learning and Representation Learning Workshop, Montreal, Canada, 2014, pp. 1–9.
[35] Z. Ren, K. Qian, Z. Zhang, V. Pandit, A. Baird, and B. Schuller, "Deep scalogram representations for acoustic scene classification," IEEE/CAA Journal of Automatica Sinica, vol. 5, no. 3, pp. 662–669, 2018.
[36] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[37] M. Phankokkruad and S. Wacharawichanant, "A comparison of efficiency improvement for long short-term memory model using convolutional operations and convolutional neural network," in Proc. ICOIACT. Yogyakarta, Indonesia: IEEE, 2019, pp. 608–613.
[38] B. W. Schuller, S. Steidl, and A. Batliner, "The INTERSPEECH 2009 emotion challenge," in Proc. INTERSPEECH, Brighton, UK, 2009, pp. 312–315.
[39] T. G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.
[40] Z. Zhang and B. Schuller, "Active learning by sparse instance tracking and classifier confidence in acoustic emotion recognition," in Proc. INTERSPEECH, Portland, OR, 2012, pp. 362–365.
[41] B. Schuller, S. Steidl, A. Batliner, P. B. Marschik, H. Baumeister, F. Dong, S. Hantke, F. Pokorny, E.-M. Rathner, K. D. Bartl-Pokorny, C. Einspieler, D. Zhang, A. Baird, S. Amiriparian, K. Qian, Z. Ren, M. Schmitt, P. Tzirakis, and S. Zafeiriou, "The INTERSPEECH 2018 computational paralinguistics challenge: Atypical & self-assessed affect, crying & heart beats," in Proc. INTERSPEECH, Hyderabad, India, 2018, pp. 122–126.
[42] A. Humayun, M. Khan, S. Ghaffarzadegan, Z. Feng, and T. Hasan, "An ensemble of transfer, semi-supervised and supervised learning methods for pathological heart sound classification," in Proc. INTERSPEECH, Hyderabad, India, 2018, pp. 127–131.
[43] G. Gosztolya, T. Grósz, and L. Tóth, "General utterance-level feature extraction for classifying crying sounds, atypical & self-assessed affect and heart beats," in Proc. INTERSPEECH, Hyderabad, India, 2018, pp. 531–535.
[44] K. Qian, C. Janott, M. Schmitt, Z. Zhang, C. Heiser, W. Hemmert, Y. Yamamoto, and B. W. Schuller, "Can machine learning assist locating the excitation of snore sound? A review," IEEE Journal of Biomedical and Health Informatics, pp. 1–14, 2020, in press.
[45] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. NIPS, Montreal, Canada, 2014, pp. 2672–2680.
[46] S. Yu, Y. Cheng, L. Xie, Z. Luo, M. Huang, and S. Li, "A novel recurrent hybrid network for feature fusion in action recognition," Journal of Visual Communication and Image Representation, vol. 49, pp. 192–203, 2017.
[47] K. Qian, Z. Ren, V. Pandit, Z. Yang, Z. Zhang, and B. Schuller, "Wavelets revisited for the classification of acoustic scenes," in Proc. DCASE Workshop, Munich, Germany, 2017, pp. 108–112.
