Deep Learning for Infant Cry Recognition
Yun-Chia Liang 1, *, Iven Wijaya 1 , Ming-Tao Yang 2,3 , Josue Rodolfo Cuevas Juarez 1 and Hou-Tai Chang 1,3
1 Department of Industrial Engineering and Management, Yuan Ze University, No. 135, Yuan-Tung Rd.,
Chung-Li Dist., Taoyuan City 32003, Taiwan; [email protected] (I.W.);
[email protected] (J.R.C.J.); [email protected] (H.-T.C.)
2 Department of Chemical Engineering and Materials Science, Yuan Ze University, No. 135, Yuan-Tung Rd.,
Chung-Li Dist., Taoyuan City 32003, Taiwan; [email protected]
3 Far Eastern Memorial Hospital, No. 21, Sec. 2, Nanya S. Rd., Banciao Dist., New Taipei City 22000, Taiwan
* Correspondence: [email protected]
Abstract: Recognizing why an infant cries is challenging as babies cannot communicate verbally
with others to express their wishes or needs. This leads to difficulties for parents in identifying
the needs and the health of their infants. This study used deep learning (DL) algorithms such
as the convolutional neural network (CNN) and long short-term memory (LSTM) to recognize
infants’ necessities such as hunger/thirst, need for a diaper change, emotional needs (e.g., need for
touch/holding), and pain caused by medical treatment (e.g., injection). The classical artificial neural
network (ANN) was also used for comparison. The inputs of ANN, CNN, and LSTM were the features
extracted from 1607 10 s audio recordings of infants using mel-frequency cepstral coefficients (MFCC).
Results showed that CNN and LSTM both provided decent performance, around 95% in accuracy,
precision, and recall, in differentiating healthy and sick infants. For recognizing infants’ specific
needs, CNN reached up to 60% accuracy, outperforming LSTM and ANN in almost all measures.
These results could be applied as indicators for future applications to help parents understand their
infant’s condition and needs.
Keywords: infant cry recognition; convolutional neural network; long short-term memory; deep learning

1. Introduction

Through language, humans deliver information to express their will. However, they have to learn from scratch. Lacking language, newborn babies are unable to express their specific desires. In general, a baby's parents are its first teachers, and this interaction is the most crucial aspect of a baby's growth. Newborn babies express negative emotion or need by crying [1], often to the consternation of parents who cannot immediately ascertain the nature of this need.

Previous research has found that infants cry at fundamental frequencies that correlate with different factors, such as emotional state, health, gender, disease (abnormalities), pre-term vs. full-term birth, first cry, and identity [2]. In addition to these fundamental frequencies, infant cries have been subjected to signal analysis based on features including latency, duration, formant frequencies, pitch contour, and stop pattern [2].

Previous studies have sought to classify infant cries by type, with most focusing on using artificial intelligence approaches to predict physiological sensations such as hunger, pain, need for a diaper change, and discomfort [3]. Some previous studies used pathological classes such as normal cries, hypo-acoustic (deaf) cries, and asphyxiating cries. For instance, Reyes-Galaviz and Arch-Tirado applied linear prediction and adaptive neuro-fuzzy inference system (ANFIS) analysis to the cry sound wave, successfully distinguishing fundamental frequencies among infants aged under 6 months [4].

Yong et al. used feature extraction to analyze infant cry signals, extracting 12 orders of mel-frequency cepstral coefficient (MFCC) features for model development [3]. Their model combined the convolutional neural network (CNN) with a stacked restricted Boltzmann machine (RBM); it classified cries by the health status of the baby (sick vs. healthy) and recognized conditions labeled as hunger, need for a diaper change, emotional needs, and pain caused by medical treatment.
When artificial intelligence approaches are used to build a classification model, feature selection plays a key role in determining model accuracy. Deep learning approaches such as artificial neural networks (ANN) or multi-layer perceptrons (MLP), CNN, and long short-term memory (LSTM) often provide satisfactory classification results, while MFCC is commonly used for feature extraction in audio analysis. Therefore, this study sought to develop deep learning algorithms for infant cry classification.
The rest of this paper is organized as follows: Section 2 describes the methodology
for data collection, data cleaning, feature extraction, and data analysis. The results are
summarized and discussed in Section 3. The concluding remarks and future research are
provided in Section 4.
2. Methodology
The cry signal was analyzed to extract important signal features [5]. One such feature
was fundamental frequency in the range of 400 Hz to 500 Hz, compared with 200 Hz to
300 Hz for adults [6]. Other features for audio analysis include latency, duration, and
formant frequency and are depicted as spectrograms for ease of use [2].
Data collection, data pre-processing, feature extraction, and data analysis will be
detailed in the following subsections.
The raw audio data collected from the hospital included noise which needed to be removed before modeling. Prior to cleaning, the data were retrieved from cloud storage in Wavpack format (.wav). The original 10 s audio clips were split into five 2 s clips and converted to 16-bit .wav files with a sampling rate of 8000 Hz.
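This splitting and conversion step can be illustrated with a short sketch. This is not the authors' pipeline; it assumes Python with the librosa and soundfile packages, and the file names are hypothetical.

```python
import librosa
import soundfile as sf

SR = 8000          # target sampling rate (Hz)
CLIP_LEN = 2 * SR  # 2 s segments at 8 kHz

def split_recording(path: str, out_prefix: str) -> None:
    """Resample a 10 s recording to 8 kHz and split it into five 2 s clips."""
    audio, _ = librosa.load(path, sr=SR, mono=True)  # resamples to 8000 Hz
    for i in range(5):
        segment = audio[i * CLIP_LEN:(i + 1) * CLIP_LEN]
        if len(segment) < CLIP_LEN:
            break  # skip an incomplete trailing segment
        # write as 16-bit PCM .wav, matching the format described above
        sf.write(f"{out_prefix}_{i}.wav", segment, SR, subtype="PCM_16")

split_recording("cry_recording.wav", "cry_recording_seg")  # hypothetical file name
```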
One of the most important indicators of data cleaning performance is the removal of all bad data. A recording was considered unclean if 60% of it matched any of these criteria: sound of an adult human detected, mislabeled data, electronic/mechanical noise detected, sound of other infants detected, silence, etc.
2.3. Feature Extraction

MFCC is widely used for feature extraction in audio analysis because of its highly efficient computation scheme and its robustness in distinguishing different noises [9]. MFCC effectively captures the human ear's critical bandwidth frequencies, which are used to retrieve important speech characteristics; its procedure is shown in Figure 1.
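For reference, the mel scale underlying MFCC maps a frequency f (in Hz) onto the perceptual mel axis via the standard relation (not stated explicitly in the text):

Mel(f) = 2595 · log10(1 + f/700)

The coefficients are then obtained by applying a discrete cosine transform to the log energies of a mel-spaced filter bank.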
Figure 4. An illustrative example of the LSTM structure [14].
As shown in Figure 4, the key element of LSTM is the cell state c(t − 1). It runs as a horizontal line through the top of the diagram, like a conveyor belt straight down the entire chain. The first step in LSTM is the forget gate (ft): a sigmoid function of h(t − 1) and x(t) produces a number between 0 and 1 for each element of the cell state c(t − 1), and this output decides what is forgotten.

The next step is the input gate (it), which decides whether new information is added to the cell state. Next come the candidate values C̃t, which may or may not be added. This step determines the new input and either adds a new subject or replaces the old one.

Furthermore, the cell state c(t − 1) needs to be updated into the new cell state c(t). The old state c(t − 1) is multiplied by ft, forgetting what was decided to be forgotten earlier, and the product of it and C̃t is added. These new values are scaled to decide how much each state value is updated. The last step is to obtain the output of the cell, in which the cell state passes through the tanh function (values between −1 and 1).
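In standard notation, with σ the sigmoid function, ⊙ element-wise multiplication, and the weight matrices W and bias vectors b assumed here (they are not named in the text), the steps above can be summarized as:

ft = σ(Wf · [h(t − 1), x(t)] + bf)        (forget gate)
it = σ(Wi · [h(t − 1), x(t)] + bi)        (input gate)
C̃t = tanh(Wc · [h(t − 1), x(t)] + bc)     (candidate values)
c(t) = ft ⊙ c(t − 1) + it ⊙ C̃t            (cell state update)
ot = σ(Wo · [h(t − 1), x(t)] + bo)        (output gate)
h(t) = ot ⊙ tanh(c(t))                    (cell output)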
Figure 5. An example of the data collection device and the sample infant.
This Lollipop device was activated by the sounds of crying infants. Each sound was
recorded for a period of ten seconds. If the cry lasted longer than ten seconds, the first ten
seconds would be saved. There was then a one-minute rest period for the app to process
and upload to the cloud. Additionally, if the baby cried for fewer than ten seconds, the app
did not record their cry. Lastly, if a baby cried for more than one minute and ten seconds,
two samples were collected consecutively. In collaboration with the nurses, an application
embedded within a mobile device was used to label each recording. For the purposes of
analysis, each recording was divided into five segments (each lasting two seconds).
After cleaning the data, the infants' needs recognition was split into four classes. Table 1 shows the summary of the four-class dataset. The number of data points obtained from the healthy group was 1705, and the number of data points in the sick group was 840, nearly half of the healthy group. The hungry class had the most data points, i.e., 1171, while medical treatment had the least with only 88 samples. In order to limit the
influence of unbalanced data, some balanced datasets were also established based on the
number of samples in the smallest class (e.g., the medical treatment in the four-class, the
diaper change in the three-class, and the sick group in the two-class, respectively). The data
were randomly split into three categories: 70% for training, 15% for validation, and 15% for
testing.
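As an illustration of this split, a minimal sketch (not the authors' code) assuming scikit-learn, where X holds the extracted MFCC features and y the class labels (both placeholders):

```python
from sklearn.model_selection import train_test_split

def split_dataset(X, y, seed: int = 42):
    """Randomly split features/labels into 70% training, 15% validation, 15% testing."""
    # 70% training, 30% held out
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.30, random_state=seed)
    # split the held-out 30% in half: 15% validation, 15% testing
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=0.50, random_state=seed)
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)
```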
Parameter Value
Audio length 2s
Sampling rate 8000 Hz
Framing 25 ms
Overlapping 10 ms
Number of coefficients 12
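A minimal sketch (not the authors' code) of MFCC extraction with the parameter settings listed above, assuming Python with librosa. The 10 ms overlapping is read here as a 10 ms frame stride, which yields 201 frames per 2 s clip and matches the (201, 12) input size in Table 3; a different reading of the overlap would change the frame count.

```python
import librosa
import numpy as np

SR = 8000                 # sampling rate (Hz)
FRAME = int(0.025 * SR)   # 25 ms frame  -> 200 samples
HOP = int(0.010 * SR)     # 10 ms stride -> 80 samples

def extract_mfcc(path: str) -> np.ndarray:
    """Return a (frames x 12) MFCC matrix for one 2 s clip."""
    y, _ = librosa.load(path, sr=SR, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=12, n_fft=FRAME, hop_length=HOP)
    return mfcc.T  # approximately (201, 12) for a 2 s clip

features = extract_mfcc("cry_segment_0.wav")  # hypothetical file name
print(features.shape)
```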
Based on preliminary analysis, the parameter settings of each classification method (ANN, CNN, and LSTM) were determined and are summarized in Table 3. Five-fold cross-validation was employed to evaluate the performance of the proposed methods.
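A brief sketch (not the authors' code) of the five-fold protocol, assuming scikit-learn, NumPy arrays X and y, and a Keras-style model returned by a placeholder build_model() function:

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, build_model, epochs: int = 20) -> float:
    """Return the average test accuracy over five folds."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = build_model()                                   # fresh model per fold
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        scores.append(acc)
    return float(np.mean(scores))
```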
Table 3. Parameter settings of the ANN, CNN, and LSTM classifiers.

Method: ANN
Activation function: ReLU
Optimizer: Adam
• Input layer = (201, 12)
• Hidden layer 1 = 256
• Dropout = 50%
• Hidden layer 2 = 128
• Dropout = 50%
• Output layer = (total classes)
Epochs = 20

Method: CNN
Optimizer: Adam
• Input layer = (201, 12)
• Convolutional 1-D = 364, kernel = 3, kernel regularization = L2(0.01)
• Max-pooling 1-D (kernel = 3)
• Convolutional 1-D = 180, kernel = 3, kernel regularization = L2(0.01)
• Max-pooling 1-D (kernel = 3)
• Global average pooling 1-D
• Hidden layer 1 = 32
• Dropout = 40%
• Output layer = (total classes)
Epochs = 20

Method: LSTM
Input size = (201, 12)
Activation function: Sigmoid
Optimizer: Adam
Number of LSTM neurons:
• Hidden layer 2 = 32
Dropout = 5%
Recurrent dropout = 35%
Return sequences = False
Output layer = (total classes)
Epochs = 20
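As one concrete reading of the CNN row in Table 3, the following is a minimal sketch assuming TensorFlow/Keras; the hidden-layer activations, loss function, and softmax output are assumptions not stated in Table 3, and this is not the authors' published code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_cnn(num_classes: int) -> tf.keras.Model:
    """1-D CNN over (201 frames x 12 MFCCs), following the Table 3 settings."""
    model = models.Sequential([
        layers.Input(shape=(201, 12)),
        layers.Conv1D(364, kernel_size=3, activation="relu",
                      kernel_regularizer=regularizers.l2(0.01)),
        layers.MaxPooling1D(pool_size=3),
        layers.Conv1D(180, kernel_size=3, activation="relu",
                      kernel_regularizer=regularizers.l2(0.01)),
        layers.MaxPooling1D(pool_size=3),
        layers.GlobalAveragePooling1D(),
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn(num_classes=4)  # e.g., four-class needs recognition
model.summary()
```

The ANN and LSTM rows can be sketched analogously with Dense and LSTM layers.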
3. Results and Discussion

With the full (unbalanced) dataset, the minority classes suffered in the prediction. However, the balanced data strategy showed tremendous improvement in precision and recall for the change diaper and medical treatment classes. For example, the precision of medical treatment improved from 7% to 53%, and the recall of change diaper improved from 24% to 53% with CNN in Table 4. Similar improvements can also be observed for the other two methods in Tables 4 and 5. CNN performed the best of the three competing methods, while ANN showed inferior results. For the balanced dataset, CNN's precision and recall ranged from 46% to 60% in the four-class and from 55% to 61% in the three-class analyses.
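The per-class precision and recall reported in the following tables follow the standard definitions; as a small illustration (with made-up labels, not the study's data), they can be computed as follows, assuming scikit-learn:

```python
from sklearn.metrics import classification_report

# hypothetical true and predicted labels, for illustration only
y_true = [0, 1, 2, 3, 0, 1, 2, 3, 0, 2]
y_pred = [0, 1, 2, 1, 0, 3, 2, 3, 1, 2]
names = ["hungry", "change diaper", "emotional needs", "medical treatment"]
print(classification_report(y_true, y_pred, target_names=names, digits=2))
```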
Table 4. Precision and recall for four-class needs recognition (H = hungry, CD = change diaper, EN = emotional needs, MT = medical treatment).

Dataset    Method    Precision (H / CD / EN / MT)    Recall (H / CD / EN / MT)
Full       ANN       0.54 / 0.06 / 0.42 / 0.06       0.51 / 0.21 / 0.32 / 0.01
Full       CNN       0.57 / 0.41 / 0.55 / 0.07       0.54 / 0.24 / 0.54 / 0.01
Full       LSTM      0.52 / 0.22 / 0.42 / 0.24       0.55 / 0.13 / 0.50 / 0.03
Balanced   ANN       0.24 / 0.40 / 0.36 / 0.29       0.27 / 0.46 / 0.37 / 0.22
Balanced   CNN       0.54 / 0.54 / 0.60 / 0.53       0.46 / 0.53 / 0.59 / 0.49
Balanced   LSTM      0.34 / 0.35 / 0.43 / 0.31       0.36 / 0.29 / 0.47 / 0.35
Table 5. Precision and recall for three-class needs recognition (H = hungry, CD = change diaper, EN = emotional needs).

Dataset    Method    Precision (H / CD / EN)    Recall (H / CD / EN)
Full       ANN       0.51 / 0.11 / 0.32         0.37 / 0.21 / 0.35
Full       CNN       0.62 / 0.50 / 0.58         0.69 / 0.12 / 0.65
Full       LSTM      0.60 / 0.27 / 0.50         0.62 / 0.12 / 0.58
Balanced   ANN       0.42 / 0.44 / 0.47         0.44 / 0.46 / 0.43
Balanced   CNN       0.61 / 0.55 / 0.56         0.58 / 0.58 / 0.55
Balanced   LSTM      0.47 / 0.44 / 0.45         0.44 / 0.45 / 0.48
Table 6. Precision and recall for two-class (sick vs. healthy) recognition.

Dataset    Method    Precision (Sick / Healthy)    Recall (Sick / Healthy)
Full       ANN       0.96 / 0.90                   0.93 / 0.93
Full       CNN       0.96 / 0.95                   0.98 / 0.89
Full       LSTM      0.95 / 0.88                   0.94 / 0.91
Balanced   ANN       0.90 / 0.86                   0.83 / 0.89
Balanced   CNN       0.94 / 0.97                   0.98 / 0.94
Balanced   LSTM      0.98 / 0.93                   0.93 / 0.98
As shown in Table 6, CNN and LSTM showed competitive performance in the two-class case. CNN obtained 97% precision for the healthy class and 98% recall for the sick class in the balanced dataset, while LSTM achieved similar values in those measures. Not surprisingly, ANN again showed the weakest performance of the three methods in the two-class case. In addition, the balanced dataset also improved the classification performance of all three methods in Table 6, although the gap was not as significant as in Tables 4 and 5.
Finally, the average accuracy over all classes is summarized in Table 7. Consistent with the precision and recall results, CNN outperformed LSTM and ANN, and the balanced dataset helped improve the accuracy. For example, CNN's average accuracy reached 64% and 60% in the four-class and three-class cases, respectively, while 96% accuracy was obtained for the two-class case.
The proposed methods could all distinguish the cry of an infant in healthy condition with high accuracy, precision, and recall. However, when it came to classifying psychological or physiological needs, classification performance deteriorated. This could be attributed to labeling errors in the dataset, resulting in erroneous class predictions. For example, an infant may be in distress because it is simultaneously hungry and wants to be held. This kind of compound behavior is difficult to predict and difficult for nurses to label.
4. Conclusions
The proposed deep learning approaches, CNN and LSTM, provided reliable and
robust results for classifying sick and healthy infants based on recordings of infant cries.
Recognition accuracy was improved by using a balanced dataset, with testing results of
up to 64% on CNN for the four-class categorization. Better results were obtained in the
health needs (two-class) test, possibly because of the data collection method employed,
wherein the healthy and sick infants were diagnosed by doctors and were kept in two
different rooms. This resulted in more controlled and accurate situations for data collection,
as opposed to the emotional-state data collection, which presented the increased chance
of mislabeling. Another possible reason for mislabeling was that an infant may have
simultaneously experienced multiple stimuli resulting in crying behavior, making it difficult
to isolate the actual proximal cause.
This study involved data samples with some unique characteristics such as race, age,
residence area, and health status, as compared with other similar studies in the literature.
Moreover, good data always play a major part in recognition performance. Improving the
quality of data points is one way to get better recognition. Future work should seek to
further improve data quality by better controlling the data collection environment, and
additional feature extraction methods should be used for performance comparison against
the MFCC feature set used here. The current dataset could also be combined with data from other hospitals, and a dataset that accounts for age is another way to boost the robustness of the model. In addition, the current model only included audio signals, and future work
could integrate video signals to improve model robustness. Moreover, ensemble learning
may offer performance improvements on the algorithmic side. Research involving data
pertaining to multiple labels and experiments on different feature-extraction techniques
can also be interesting areas for future investigation.
Author Contributions: Conceptualization, Y.-C.L., I.W. and J.R.C.J.; data curation, I.W.; formal
analysis, I.W. and Y.-C.L.; funding acquisition, Y.-C.L. and M.-T.Y.; investigation, I.W. and Y.-C.L.;
methodology, I.W., J.R.C.J. and Y.-C.L.; project administration, Y.-C.L. and M.-T.Y.; resources, Y.-C.L.,
M.-T.Y., J.R.C.J. and H.-T.C.; writing—original draft, I.W.; writing—review and editing, Y.-C.L. and
M.-T.Y. All authors have read and agreed to the published version of the manuscript.
Funding: This research was partially funded by the Far Eastern Memorial Hospital and Yuan Ze
University, FEMH-YZU-2018-010.
Institutional Review Board Statement: The study was conducted in accordance with the Declaration
of Helsinki, and approved by the Institutional Review Board of the Far Eastern Memorial Hospital
(protocol code IRB 108059-F, date of approval 10 June 2019).
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
References
1. Adachi, T.; Murai, N.; Okada, H.; Nihei, Y. Acoustic properties of infant cries and maternal perception. Tohoku Psychol. Folia 1985,
44, 51–58.
2. Patil, H.A. Cry baby: Using spectrographic analysis. In Advances in Speech Recognition; Neustein, A., Ed.; Springer: New York, NY,
USA, 2010; pp. 323–348.
3. Yong, B.F.; Ting, H.; Ng, K. Baby cry recognition using deep neural networks. In World Congress on Medical; Springer: Prague, Czech Republic, 2019; pp. 809–816.
4. Reyes-Galaviz, O.F.; Arch-Tirado, E. Classification of infant crying to identify pathologies in recently born babies with ANFIS. In
International Conference on Computers Helping People with Special Needs; Research Gate: Paris, France, 2004; pp. 408–415.
5. Garcia, J.; García, C. Classification of infant cry using a scaled conjugate gradient neural network. In European Symposium on Artificial Neural Networks; d-side publications: Bruges, Belgium, 2003; pp. 349–354.
6. Guo, L.; Yu, H.Z.; Li, Y.H.; Ma, N. Pitch analysis of infant crying. Int. J. Digit. Content Technol. Its Appl. 2013, 7, 1072–1079.
7. Narang, S.; Gupta, D. Speech feature extraction techniques: A review. Int. J. Comput. Sci. Mob. Comput. 2015, 4, 107–114.
8. Yu, H.; Zhang, X.; Zhen, Y.; Jiang, G. A universal data cleaning framework based on user model. In Proceedings of the IEEE
International Colloquium on Computing, Communication, Control and Management, Sanya, China, 8–9 August 2009; pp.
200–202.
9. Prajapati, P.; Patel, M. Feature extraction of isolated Gujarati digits with Mel Frequency Cepstral Coefficients (MFCCs). Int. J. Comput. Appl. 2017, 163, 29–33. [CrossRef]
10. Miranda, I.; Diacon, A.; Niesler, T. A comparative study of features for acoustic cough detection using deep architectures. In
Proceedings of the 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Berlin,
Germany, 23–27 July 2019; pp. 2601–2605.
11. Moller, H.; Pedersen, C. Hearing at low and infrasonic frequencies. Noise Health 2004, 6, 37–57.
12. Girirajan, S.; Sangeetha, R.; Preethi, T.; Chinnappa, A. Automatic speech recognition with stuttering. Int. J. Recent Technol. Eng.
2020, 8, 1677–1681.
13. Lavner, Y. Baby cry detection in domestic environment using deep learning. In Proceedings of the ICSEE International Conference
on the Science of Electrical Engineering, Eilat, Israel, 16–18 November 2016.
14. Zan, T.; Wang, H.; Wang, M.; Liu, Z.; Gao, X. Application of Multi-Dimension Input Convolutional Neural Network in Fault
Diagnosis of Rolling Bearings. Appl. Sci. 2019, 9, 2690. [CrossRef]
15. Swedia, E.; Mutiara, A.; Subali, M. Deep learning Long-Short Term Memory (LSTM) for indonesian speech digit recognition
using LPC and MFCC Feature. In Proceedings of the 2018 Third International Conference on Informatics and Computing (ICIC),
Palembang, Indonesia, 17–18 October 2018; pp. 1–5.