Heart Sound

Abstract—Cardiovascular diseases are the leading cause of deaths and severely threaten human health in daily life. On the one hand, there have been dramatically increasing demands from both clinical practice and smart home applications for monitoring the heart status of subjects suffering from chronic cardiovascular diseases. On the other hand, experienced physicians who can perform an efficient auscultation are still lacking in number. Automatic heart sound classification, leveraging the power of advanced signal processing and machine learning technologies, has shown encouraging results. Nevertheless, human hand-crafted features are expensive and time-consuming to design. To this end, we propose a novel deep representation learning method with an attention mechanism for heart sound classification. In this paradigm, high-level representations are learnt automatically from the recorded heart sound data. In particular, a global attention pooling layer improves the performance of the learnt representations by estimating the contribution of each unit in the feature maps. The Heart Sounds Shenzhen (HSS) corpus (170 subjects involved) is used to validate the proposed method. Experimental results show that our approach can achieve an unweighted average recall of 51.2 % for classifying three categories of heart sounds, i.e., normal, mild, and moderate/severe, as annotated by cardiologists with the help of echocardiography.

Index Terms—Computer audition, digital health, heart sound classification, deep learning, attention mechanism.

I. INTRODUCTION

As reported by the World Health Organisation (WHO), cardiovascular diseases (CVDs) are the leading cause of death globally, accounting for 17.9 million deaths in 2016 (31 % of all global deaths) [1]. More seriously, this number is predicted to rise to around 23 million per year by 2030 [2]. Early-stage diagnosis and proper management of CVDs can be very beneficial for mitigating the high costs and social burdens of coping with serious CVDs [3], [4].

Auscultation of the heart sounds, as a cheap, convenient, and non-invasive method, has been successfully used by physicians for over a century [5]. However, this clinical skill requires tremendous training and remains difficult for more than 20 % of less experienced medical interns to use efficiently [6]. Therefore, developing an automatic auscultation framework can facilitate early, cost-effective screening of CVDs and, at the same time, help to manage the progression of the condition [5].

Computer audition (CA) and its applications in healthcare [7] have yielded encouraging results in the past decades. Due to its non-invasive and ubiquitous characteristics, CA-based methods can facilitate automatic heart sound analysis, which has already attracted a plethora of efforts [5]. Additionally, benefiting from the fast development of machine learning (ML), particularly its subset deep learning (DL), and from the prevalence of smart sensors, wearables, and related devices, intelligent healthcare can be implemented feasibly in this era of AIoT (artificial intelligence enabled internet of things).

A systematic and comprehensive review of the existing literature on heart sound analysis via ML was provided in [5]. In early works, designing efficient features, ranging from the classic Fourier transformation to multi-resolution analysis (e.g., the wavelet transformation), dominated the well-documented literature in this field. In recent years, using DL models for automatically analysing and extracting high-level representations from heart sounds has been increasingly studied [8]. Furthermore, as indicated in [9], the current trend is to classify heart sounds from the whole audio recording without any segmentation step. On the one hand, state-of-the-art DL methods aim to build a deep end-to-end architecture that can learn high-level representations from the heart sound itself without any human hand-crafted features. On the other hand, DL models are restrained by the limited generalisation of representations learnt from a limited data set. Moreover, black-box DL models for heart sound analysis cannot produce transparent and understandable decisions that would help physicians to choose the next physical examination and an appropriate treatment. Making explainable decisions via DL-based systems is a trend to enhance the trust of physicians in such systems and to promote their application in the medical area [10]. In a recent study [11], a promising attention mechanism was proposed to explain DL models by visualising their internal layers.

To this end, we propose a novel attention-based deep representation learning method for heart sound classification.

This work was partially supported by the Horizon H2020 Marie Skłodowska-Curie Actions Initial Training Network European Training Network (MSCA-ITN-ETN) project under grant agreement No. 766287 (TAPAS), Germany, the Zhejiang Lab’s International Talent Fund for Young Professionals (Project HANAMI), P. R. China, the JSPS Postdoctoral Fellowship for Research in Japan (ID No. P19081) from the Japan Society for the Promotion of Science (JSPS), Japan, the Grants-in-Aid for Scientific Research (No. 19F19081 and No. 17H00878) from the Ministry of Education, Culture, Sports, Science and Technology, Japan, and the Natural Science Foundation of Shenzhen University General Hospital (No. SUGH2018QD013), P. R. China. K. Qian is the corresponding author.
Z. Ren and B. W. Schuller are with the Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, Germany (e-mail: {zhao.ren, schuller}@informatik.uni-augsburg.de).
K. Qian and Y. Yamamoto are with the Educational Physiology Laboratory, Graduate School of Education, The University of Tokyo, Japan (e-mail: {qian, yamamoto}@p.u-tokyo.ac.jp).
F. Dong is with the Department of Cardiology, Shenzhen University General Hospital, P. R. China (e-mail: [email protected]).
Z. Dai is with the Department of Cardiovascular, Wenzhou Medical University First Affiliated Hospital, P. R. China (e-mail: [email protected]).
B. W. Schuller is also with the GLAM – Group on Language, Audio & Music, Imperial College London, UK (e-mail: [email protected]).
TABLE I
The data partitions, i.e., train, dev(elopment), and test sets, of the HSS corpus for the three classes, i.e., normal, mild, and mod(erate)/sev(ere), and the subject numbers.

         # Subjects   Normal   Mild   Mod./Sev.     Σ
Train       100          84     276       142      502
Dev          35          32      98        50      180
Test         35          28      91        44      163
Σ           170         144     465       236      845

A. HSS Corpus

The HSS corpus was established by Shenzhen University General Hospital, Shenzhen, P. R. China [9]. Please note that the study [9] was approved by the ethics committee of the Shenzhen University General Hospital. During the data collection, 170 participants (female: 55, male: 115, age: 65.4 ± 13.2 years) were involved. Specifically, the heart sound signals were recorded from four auscultatory positions on the body of each subject, i.e., the mitral, aortic, pulmonary, and tricuspid valve areas, through an electronic stethoscope (Eko CORE, USA) using Bluetooth 4.0 and a 4 kHz sampling rate. Then, experienced cardiologists annotated the data into three categories, i.e., normal, mild, and moderate/severe, using echocardiography as the gold standard. Finally, 845 audio recordings of around 30 s each were obtained, i.e., approximately 7 hours of audio in total. Considering subject independence and a balanced age and gender distribution, the HSS corpus was split into three data sets: train, dev(elopment), and test (cf. Table I). For more details on the HSS collection and further information, interested readers are referred to [9].

B. Deep Learning Models

In essence, DL is a series of non-linear transformations of the inputs, resulting in highly abstract representations which have shown their effectiveness in audio classification tasks [21], [27]. For this study, two typical DL topologies, i.e., a CNN (with a strong feature extraction capacity) and an RNN (which can capture the context information in time-series data), will be investigated.

1) Convolutional Neural Network: With a strong capability of feature extraction, CNN models have been applied to heart sound classification in previous research [28], [29]. As shown in Fig. 2, a CNN model generally contains a stack of convolutional layers and local max-pooling layers to extract high-level representations. Convolutional layers capture abstract features using a set of convolutional kernels, which perform convolution operations on the input or on the feature maps from the intermediate layers. At the $m$-th layer, $m = 1, \ldots, M$, where $M$ is the total number of layers, an $I \times P \times Q$ feature map $h_m$ is produced, where $I$ is the number of channels and $P \times Q$ is the size of $h_m$ at each channel. When the $(m+1)$-th layer is a convolutional layer, the $j$-th channel of $h_{m+1}$ is calculated by

$$ h_{m+1}^{j} = \sum_{i=1}^{I} w_{m+1}^{ij} * h_{m}^{i} + b_{m+1}^{j}, \quad (1) $$

where $h_{m}^{i}$ is the $i$-th channel of $h_m$, $w_{m+1}^{ij}$ denotes the $(i, j)$-th convolutional kernel, $*$ is the convolution operation, and $b_{m+1}^{j}$ is the bias. Each two-dimensional convolutional kernel works on the feature maps at each channel, so the convolutional layers can learn representations at the time-frequency level. Notably, batch normalisation and a rectified linear unit (ReLU) activation function are applied to the output of each convolutional layer, as batch normalisation usually improves the stability of CNNs, and both of them can accelerate the convergence speed [30].

Convolutional layers with batch normalisation and a ReLU activation function are mostly followed by local pooling layers, which reduce the computational cost by downsampling the feature maps [31]. Through local pooling layers, the robustness of CNNs against input variation is also improved [31]. Since local max-pooling has been successfully employed in our previous study [25], we use local max-pooling layers following each convolutional layer.

2) Recurrent Neural Network: RNNs can extract sequential representations from time-series data using a set of recurrent layers (cf. Fig. 3). Each recurrent layer contains a sequence of recurrent units, each of which processes the corresponding time step of the input data. The hidden states output by each recurrent layer are fed into the next recurrent layer. Finally, the hidden states of the final recurrent layer are used to predict the classes of the samples.

We define the total number of time steps as $T$. At the $t$-th time step, $t = 1, \ldots, T$, a traditional recurrent unit computes its output via a weighted sum of the input $x_t$ and the hidden state $h_{t-1}$. Due to the vanishing gradient problem of the traditional recurrent unit [32], two recurrent units in particular were proposed in the literature: Long Short-Term Memory (LSTM) cells [33] and Gated Recurrent Units (GRUs) [34].

At the $t$-th time step, an LSTM unit consists of an input gate $i_t$, an output gate $o_t$, a forget gate $f_t$, and a cell state $c_t$. The procedure of an LSTM unit is defined by

$$ i_t = \sigma(w_i x_t + u_i h_{t-1} + b_i), \quad (2) $$
$$ f_t = \sigma(w_f x_t + u_f h_{t-1} + b_f), \quad (3) $$
$$ o_t = \sigma(w_o x_t + u_o h_{t-1} + b_o), \quad (4) $$
$$ c_t = f_t \odot c_{t-1} + i_t \odot \tanh(w_c x_t + u_c h_{t-1} + b_c), \quad (5) $$
$$ h_t = o_t \odot \tanh(c_t), \quad (6) $$

where $w$ and $u$ are the weight matrices, $b$ denotes the bias, $\sigma$ stands for the logistic sigmoid function, and $\odot$ means element-wise multiplication. Compared to the traditional recurrent unit, an LSTM cell can control which information to remember using the input gate and which to forget using the forget gate.

Different from an LSTM cell, a GRU contains a reset gate $r_t$ and an update gate $z_t$ at the $t$-th time step. The procedure of a GRU is defined by

$$ r_t = \sigma(w_r x_t + u_r h_{t-1} + b_r), \quad (7) $$
$$ z_t = \sigma(w_z x_t + u_z h_{t-1} + b_z), \quad (8) $$
$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tanh(w_h x_t + u_h (r_t \odot h_{t-1}) + b_h). \quad (9) $$
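To make the gated update concrete, the following minimal NumPy sketch performs one GRU step as written in Eqs. (7)-(9); the input and hidden dimensionalities and the random parameter initialisation are illustrative assumptions rather than the configuration used in this work.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, params):
    """One GRU time step following Eqs. (7)-(9)."""
    w_r, u_r, b_r = params["r"]   # reset gate parameters
    w_z, u_z, b_z = params["z"]   # update gate parameters
    w_h, u_h, b_h = params["h"]   # candidate state parameters
    r_t = sigmoid(w_r @ x_t + u_r @ h_prev + b_r)              # Eq. (7)
    z_t = sigmoid(w_z @ x_t + u_z @ h_prev + b_z)              # Eq. (8)
    h_cand = np.tanh(w_h @ x_t + u_h @ (r_t * h_prev) + b_h)   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_cand                 # Eq. (9)

# Illustrative sizes: 64 input features per frame, 32 hidden units.
rng = np.random.default_rng(0)
d_in, d_h = 64, 32
params = {k: (0.1 * rng.standard_normal((d_h, d_in)),
              0.1 * rng.standard_normal((d_h, d_h)),
              np.zeros(d_h)) for k in ("r", "z", "h")}

h = np.zeros(d_h)
for x_t in rng.standard_normal((10, d_in)):  # a toy 10-step sequence
    h = gru_step(x_t, h, params)
print(h.shape)  # (32,)
```

In practice, batched framework implementations such as torch.nn.GRU and torch.nn.LSTM provide these gated updates out of the box.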
Fig. 2. The structure of the chosen CNN model with attention. The inputs are log Mel spectrograms. The CNN model consists of several convolutional layers, local max-pooling layers, an attention layer, and a log softmax layer for classification.
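As a rough PyTorch sketch of the kind of pipeline Fig. 2 depicts (stacked convolution, batch normalisation, ReLU, and local max-pooling blocks over log Mel spectrograms, followed by a global attention pooling that weights the contribution of each unit in the final feature map, and a log softmax output), the snippet below may be helpful; the channel counts, kernel sizes, and the exact attention formulation are our own assumptions, not the configuration reported here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock(nn.Module):
    """Convolution + batch normalisation + ReLU + local max-pooling."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(c_out)

    def forward(self, x):
        return F.max_pool2d(F.relu(self.bn(self.conv(x))), kernel_size=2)

class AttentionCNN(nn.Module):
    """CNN over log Mel spectrograms with a global attention pooling head."""
    def __init__(self, n_classes=3):
        super().__init__()
        self.blocks = nn.Sequential(ConvBlock(1, 32), ConvBlock(32, 64), ConvBlock(64, 128))
        self.cla = nn.Conv2d(128, n_classes, kernel_size=1)  # class-wise scores per unit
        self.att = nn.Conv2d(128, n_classes, kernel_size=1)  # contribution of each unit

    def forward(self, x):                                    # x: (batch, 1, mel_bins, frames)
        h = self.blocks(x)
        att = torch.softmax(self.att(h).flatten(2), dim=-1)  # attention weights over units
        cla = torch.sigmoid(self.cla(h)).flatten(2)
        logits = (att * cla).sum(dim=-1)                     # attention-weighted pooling
        return F.log_softmax(logits, dim=-1)

model = AttentionCNN()
dummy = torch.randn(4, 1, 64, 300)  # 4 clips, 64 Mel bins, 300 frames (assumed sizes)
print(model(dummy).shape)           # torch.Size([4, 3])
```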
Fig. 4. Confusion matrices (normalised) achieved by the best models on the test set. The best three models are (a) CNN, (b) LSTM–RNN, and (c) GRU–RNN, respectively.
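Given per-recording test predictions, row-normalised confusion matrices like those in Fig. 4 and the unweighted average recall (UAR) used as the evaluation metric can be obtained with scikit-learn as sketched below; the label arrays are placeholders, not the actual HSS predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Placeholder labels and predictions for the three classes:
# 0 = normal, 1 = mild, 2 = moderate/severe.
y_true = np.array([0, 1, 2, 1, 2, 0, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2, 1, 1, 2])

# Row-normalised confusion matrix (each true-label row sums to 1, as in Fig. 4).
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2], normalize="true")

# UAR = macro-averaged recall, i.e., the mean of the diagonal of the
# row-normalised confusion matrix; it is robust to class imbalance.
uar = recall_score(y_true, y_pred, average="macro")
print(np.round(cm, 3))
print(f"UAR = {uar:.3f}")
```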
Fig. 6. Visualisation of three examples with the classes Normal, Mild, and Moderate/Severe, respectively. Each example consists of an original audio signal, its corresponding log Mel spectrogram, the attention matrix in the CNNs, the attention vector in the LSTM–RNNs, and the attention vector in the GRU–RNNs.
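For reference, log Mel spectrograms like those visualised in Fig. 6 can be extracted along the following lines; the frame length, hop size, and number of Mel bins are illustrative assumptions (only the 4 kHz sampling rate is taken from the HSS recording setup), and the file name is hypothetical.

```python
import librosa

def log_mel_spectrogram(wav_path, sr=4000, n_fft=256, hop_length=128, n_mels=64):
    """Load a heart sound recording and compute its log Mel spectrogram."""
    audio, _ = librosa.load(wav_path, sr=sr, mono=True)        # resample to 4 kHz
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)                            # shape: (n_mels, frames)

# Hypothetical usage on one roughly 30 s HSS recording:
# feats = log_mel_spectrogram("hss_example.wav")
# print(feats.shape)
```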
consolidated conclusion. Another direction is to explore the representations learnt by the DL models, which aims to relate the model architectures to the pathological meaning of the heart sound. Explainable AI is essential for intelligent medical applications.

VI. CONCLUSION

In this work, we proposed a novel attention-based deep representation learning method for heart sound classification. We also investigated and compared different topologies of DL models and found the considered CNN model to be the best option in this study. The efficacy of the proposed method was successfully validated on the publicly accessible HSS corpus. We also compared the results with other state-of-the-art work and pointed out the current limitations and future directions. For the three-category classification task, the proposed approach achieved an unweighted average recall of 51.2 %, which outperformed other models trained on traditional human hand-crafted features or with other deep learning approaches. In future work, we will improve our model's generalisation and explainability for the heart sound classification task.

ACKNOWLEDGMENT

The authors would like to thank the colleagues who collected the HSS corpus.

REFERENCES

[1] World Health Organisation (WHO). (2017) Cardiovascular diseases (CVDs): Key facts. [Online]. Available: https://www.who.int/en/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
[2] E. J. Benjamin, P. Muntner, and M. S. Bittencourt, "Heart disease and stroke statistics-2019 update: A report from the American Heart Association," Circulation, vol. 139, no. 10, pp. e56–e528, 2019.
[3] L. H. Schwamm, N. Chumbler, E. Brown, G. C. Fonarow, D. Berube, K. Nystrom, R. Suter, M. Zavala, D. Polsky, K. Radhakrishnan, N. Lacktman, K. Horton, M.-B. Malcarney, J. Halamka, and A. C. Tiner, "Recommendations for the implementation of telehealth in cardiovascular and stroke care: A policy statement from the American Heart Association," Circulation, vol. 135, no. 7, pp. e24–e44, 2017.
[4] J. Hu, X. Cui, Y. Gong, X. Xu, B. Gao, T. Wen, T. J. Lu, and F. Xu, "Portable microfluidic and smartphone-based devices for monitoring of cardiovascular diseases at the point of care," Biotechnology Advances, vol. 34, no. 3, pp. 305–320, 2016.
[5] A. K. Dwivedi, S. A. Imtiaz, and E. Rodriguez-Villegas, "Algorithms for automatic analysis and classification of heart sounds–A systematic review," IEEE Access, vol. 7, pp. 8316–8345, 2018.
[6] S. Mangione, "Cardiac auscultatory skills of physicians-in-training: A comparison of three English-speaking countries," The American Journal of Medicine, vol. 110, no. 3, pp. 210–216, 2001.
[7] K. Qian, X. Li, H. Li, S. Li, W. Li, Z. Ning, S. Yu, L. Hou, G. Tang, J. Lu, F. Li, S. Duan, C. Du, Y. Cheng, Y. Wang, L. Gan, Y. Yamamoto, and B. W. Schuller, "Computer audition for healthcare: Opportunities and challenges," Frontiers in Digital Health, vol. 2, no. 5, pp. 1–4, 2020.
[8] G. D. Clifford, C. Liu, B. Moody, J. Millet, S. Schmidt, Q. Li, I. Silva, and R. G. Mark, "Recent advances in heart sound analysis," Physiological Measurement, vol. 38, no. 8, pp. E10–E25, 2017.
[9] F. Dong, K. Qian, R. Zhao, A. Baird, X. Li, Z. Dai, B. Dong, F. Metze, Y. Yamamoto, and B. W. Schuller, "Machine listening for heart status monitoring: Introducing and benchmarking HSS–the heart sounds Shenzhen corpus," IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 7, pp. 2082–2092, 2020.
[10] A. Holzinger, C. Biemann, C. S. Pattichis, and D. B. Kell, "What do we need to build explainable AI systems for the medical domain?" arXiv preprint arXiv:1712.09923, 2017.
[11] Y. Xu, Q. Kong, Q. Huang, W. Wang, and M. D. Plumbley, "Attention and localization based on a deep convolutional recurrent model for weakly supervised audio tagging," in Proc. INTERSPEECH, Stockholm, Sweden, 2017, pp. 3083–3087.
[12] A. Adadi and M. Berrada, "Peeking inside the black-box: A survey on explainable artificial intelligence (XAI)," IEEE Access, vol. 6, pp. 52138–52160, 2018.
[13] S. Ari, K. Hembram, and G. Saha, "Detection of cardiac abnormality from PCG signal using LMS based least square SVM classifier," Expert Systems with Applications, vol. 37, no. 12, pp. 8019–8026, 2010.
[14] H. Uğuz, "Adaptive neuro-fuzzy inference system for diagnosis of the heart valve diseases using wavelet transform with entropy," Neural Computing and Applications, vol. 21, no. 7, pp. 1617–1628, 2012.
[15] S. Patidar, R. B. Pachori, and N. Garg, "Automatic diagnosis of septal defects based on tunable-Q wavelet transform of cardiac sound signals," Expert Systems with Applications, vol. 42, no. 7, pp. 3315–3326, 2015.
[16] Y. Zheng, X. Guo, and X. Ding, "A novel hybrid energy fraction and entropy-based approach for systolic heart murmurs identification," Expert Systems with Applications, vol. 42, no. 5, pp. 2710–2721, 2015.
[17] S.-W. Deng and J.-Q. Han, "Towards heart sound classification without segmentation via autocorrelation feature and diffusion maps," Future Generation Computer Systems, vol. 60, pp. 13–21, 2016.
[18] K. Qian, Z. Ren, F. Dong, W.-H. Lai, B. W. Schuller, and Y. Yamamoto, "Deep wavelets for heart sound classification," in Proc. ISPACS, Taipei, Taiwan, China, 2019, pp. 1–2.
[19] P. Wang, C. S. Lim, S. Chauhan, J. Y. A. Foo, and V. Anantharaman, "Phonocardiographic signal analysis method using a modified hidden Markov model," Annals of Biomedical Engineering, vol. 35, no. 3, pp. 367–374, 2007.
[20] N. De Bruijn, "Uncertainty principles in Fourier analysis," in Inequalities (Proc. Sympos. Wright-Patterson Air Force Base, Ohio, 1965). Academic Press, New York, NY, 1967, pp. 57–71.
[21] Z. Ren, N. Cummins, V. Pandit, J. Han, K. Qian, and B. W. Schuller, "Learning image-based representations for heart sound classification," in Proc. DH. Lyon, France: ACM, 2018, pp. 143–147.
[22] S. Amiriparian, M. Schmitt, N. Cummins, K. Qian, F. Dong, and B. Schuller, "Deep unsupervised representation learning for abnormal heart sound classification," in Proc. EMBC, Honolulu, HI, 2018, pp. 4776–4779.
[23] T. Fernando, H. Ghaemmaghami, S. Denman, S. Sridharan, N. Hussain, and C. Fookes, "Heart sound segmentation using bidirectional LSTMs with attention," IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 6, pp. 1601–1609, 2020.
[24] N. Akhtar and U. Ragavendran, "Interpretation of intelligence in CNN-pooling processes: A methodological survey," Neural Computing and Applications, vol. 32, pp. 879–898, 2020.
[25] Z. Ren, Q. Kong, K. Qian, M. Plumbley, and B. Schuller, "Attention-based convolutional neural networks for acoustic scene classification," in Proc. DCASE, Surrey, UK, 2018, pp. 39–43.
[26] Z. Ren, Q. Kong, J. Han, M. Plumbley, and B. Schuller, "Attention-based atrous convolutional neural networks: Visualisation and understanding perspectives of acoustic scenes," in Proc. ICASSP, Brighton, UK, 2019, pp. 56–60.
[27] S. Amiriparian, M. Freitag, N. Cummins, and B. Schuller, "Sequence to sequence autoencoders for unsupervised representation learning from audio," in Proc. DCASE 2017 Workshop, 2017.
[28] M. Tschannen, T. Kramer, G. Marti, M. Heinzmann, and T. Wiatowski, "Heart sound classification using deep structured features," in Proc. CinC. Vancouver, Canada: IEEE, 2016, pp. 565–568.
[29] H. Ryu, J. Park, and H. Shin, "Classification of heart sound recordings using convolution neural network," in Proc. CinC. Vancouver, Canada: IEEE, 2016, pp. 1153–1156.
[30] H. Ide and T. Kurita, "Improvement of learning for CNN with ReLU activation by sparse regularization," in Proc. IJCNN, Anchorage, AK, 2017, pp. 2684–2691.
[31] T. Kobayashi, "Global feature guided local pooling," in Proc. ICCV, Seoul, Korea, 2019, pp. 3365–3374.
[32] S. Hochreiter, "The vanishing gradient problem during learning recurrent neural nets and problem solutions," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 2, pp. 107–116, 1998.
[33] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[34] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," in Proc. NIPS Deep Learning and Representation Learning Workshop, Montreal, Canada, 2014, pp. 1–9.
[35] Z. Ren, K. Qian, Z. Zhang, V. Pandit, A. Baird, and B. Schuller, "Deep scalogram representations for acoustic scene classification," IEEE/CAA Journal of Automatica Sinica, vol. 5, no. 3, pp. 662–669, 2018.
[36] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.
[37] M. Phankokkruad and S. Wacharawichanant, "A comparison of efficiency improvement for long short-term memory model using convolutional operations and convolutional neural network," in Proc. ICOIACT. Yogyakarta, Indonesia: IEEE, 2019, pp. 608–613.
[38] B. W. Schuller, S. Steidl, and A. Batliner, "The INTERSPEECH 2009 emotion challenge," in Proc. INTERSPEECH, Brighton, UK, 2009, pp. 312–315.
[39] T. G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.
[40] Z. Zhang and B. Schuller, "Active learning by sparse instance tracking and classifier confidence in acoustic emotion recognition," in Proc. INTERSPEECH, Portland, OR, 2012, pp. 362–365.
[41] B. Schuller, S. Steidl, A. Batliner, P. B. Marschik, H. Baumeister, F. Dong, S. Hantke, F. Pokorny, E.-M. Rathner, K. D. Bartl-Pokorny, C. Einspieler, D. Zhang, A. Baird, S. Amiriparian, K. Qian, Z. Ren, M. Schmitt, P. Tzirakis, and S. Zafeiriou, "The INTERSPEECH 2018 computational paralinguistics challenge: Atypical & self-assessed affect, crying & heart beats," in Proc. INTERSPEECH, Hyderabad, India, 2018, pp. 122–126.
[42] A. Humayun, M. Khan, S. Ghaffarzadegan, Z. Feng, and T. Hasan, "An ensemble of transfer, semi-supervised and supervised learning methods for pathological heart sound classification," in Proc. INTERSPEECH, Hyderabad, India, 2018, pp. 127–131.
[43] G. Gosztolya, T. Grósz, and L. Tóth, "General utterance-level feature extraction for classifying crying sounds, atypical & self-assessed affect and heart beats," in Proc. INTERSPEECH, Hyderabad, India, 2018, pp. 531–535.
[44] K. Qian, C. Janott, M. Schmitt, Z. Zhang, C. Heiser, W. Hemmert, Y. Yamamoto, and B. W. Schuller, "Can machine learning assist locating the excitation of snore sound? A review," IEEE Journal of Biomedical and Health Informatics, pp. 1–14, 2020, in press.
[45] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. NIPS, Montreal, Canada, 2014, pp. 2672–2680.
[46] S. Yu, Y. Cheng, L. Xie, Z. Luo, M. Huang, and S. Li, "A novel recurrent hybrid network for feature fusion in action recognition," Journal of Visual Communication and Image Representation, vol. 49, pp. 192–203, 2017.
[47] K. Qian, Z. Ren, V. Pandit, Z. Yang, Z. Zhang, and B. Schuller, "Wavelets revisited for the classification of acoustic scenes," in Proc. DCASE Workshop, Munich, Germany, 2017, pp. 108–112.