
International Journal of Environmental Research and Public Health

Article
Deep Learning for Infant Cry Recognition
Yun-Chia Liang 1,*, Iven Wijaya 1, Ming-Tao Yang 2,3, Josue Rodolfo Cuevas Juarez 1 and Hou-Tai Chang 1,3

1 Department of Industrial Engineering and Management, Yuan Ze University, No. 135, Yuan-Tung Rd.,
Chung-Li Dist., Taoyuan City 32003, Taiwan; [email protected] (I.W.);
[email protected] (J.R.C.J.); [email protected] (H.-T.C.)
2 Department of Chemical Engineering and Materials Science, Yuan Ze University, No. 135, Yuan-Tung Rd.,
Chung-Li Dist., Taoyuan City 32003, Taiwan; [email protected]
3 Far Eastern Memorial Hospital, No. 21, Sec. 2, Nanya S. Rd., Banciao Dist., New Taipei City 22000, Taiwan
* Correspondence: [email protected]

Abstract: Recognizing why an infant cries is challenging as babies cannot communicate verbally
with others to express their wishes or needs. This leads to difficulties for parents in identifying
the needs and the health of their infants. This study used deep learning (DL) algorithms such
as the convolutional neural network (CNN) and long short-term memory (LSTM) to recognize
infants’ necessities such as hunger/thirst, need for a diaper change, emotional needs (e.g., need for
touch/holding), and pain caused by medical treatment (e.g., injection). The classical artificial neural
network (ANN) was also used for comparison. The inputs of ANN, CNN, and LSTM were the features
extracted from 1607 10 s audio recordings of infants using mel-frequency cepstral coefficients (MFCC).
Results showed that CNN and LSTM both provided decent performance, around 95% in accuracy,
precision, and recall, in differentiating healthy and sick infants. For recognizing infants’ specific
needs, CNN reached up to 60% accuracy, outperforming LSTM and ANN in almost all measures.
These results could be applied as indicators for future applications to help parents understand their
infant’s condition and needs.
Keywords: infant cry recognition; convolutional neural network; long short-term memory; deep learning

1. Introduction

Through language, humans deliver information to express their will. However, they have to learn from scratch. Lacking language, newborn babies are unable to express their specific desires. In general, a baby's parents are its first teachers, and this interaction is the most crucial aspect of babies' growth. Newborn babies express negative emotion or need by crying [1], often to the consternation of parents who cannot immediately ascertain the nature of this need.

Previous research has found that infants cry in fundamental frequencies that correlate to different factors, such as emotional state, health, gender, disease (abnormalities), pre-term vs. full-term, first cry, identity, etc. [2]. In addition to these fundamental frequencies, infant cries have been subjected to signal analysis based on features including latency, duration, formant frequencies, pitch contour, and stop pattern [2].

Previous studies have sought to classify infant cries by type, with most focusing on using artificial intelligence approaches to predict physiological sensations such as hunger, pain, diaper change, and discomfort [3]. Some previous studies used pathological classes such as normal cries, hypo-acoustic (deaf) cries, and asphyxiating cries. For instance, Reyes-Galaviz and Arch-Tirado applied linear prediction and adaptive neuro-fuzzy inference system (ANFIS) analysis of the cry sound wave, successfully distinguishing fundamental frequencies among infants aged under 6 months [4].

Yong et al. used feature extraction to analyze infant cry signals, extracting 12 orders of mel-frequency cepstral coefficient (MFCC) features for model development [3]. Their
developed model combined the convolutional neural network (CNN) and a stacked re-
stricted Boltzmann machine (RBM). The model classified cries by the health status of the baby (sick vs. healthy) and further recognized conditions classified as hunger, need for diaper changing, emotional needs, and pain caused by medical treatment.
In using artificial intelligence approaches for building a classification model, feature
selection plays a key role in determining model accuracy. On the other hand, deep learn-
ing approaches often provide satisfactory classification results, such as using artificial
neural networks (ANN) or multi-layer perceptrons (MLP), CNN, and long short-term
memory (LSTM); meanwhile, MFCC is commonly used for feature extraction in audio
analysis. Therefore, this study sought to develop deep learning algorithms for infant cry
classification.
The rest of this paper is organized as follows: Section 2 describes the methodology
for data collection, data cleaning, feature extraction, and data analysis. The results are
summarized and discussed in Section 3. The concluding remarks and future research are
provided in Section 4.

2. Methodology
The cry signal was analyzed to extract important signal features [5]. One such feature
was fundamental frequency in the range of 400 Hz to 500 Hz, compared with 200 Hz to
300 Hz for adults [6]. Other features for audio analysis include latency, duration, and
formant frequency and are depicted as spectrograms for ease of use [2].
Data collection, data pre-processing, feature extraction, and data analysis will be
detailed in the following subsections.

2.1. Data Collection


Audio data were collected by hospital nurses when the infants under their care began
crying. The nurses would then note the condition they discovered to be the proximal cause
of the infant crying:
• Hunger: The infant ceased crying when fed.
• Diaper change: The infant ceased crying following diaper change.
• Emotional needs: The infant ceased crying following physical touch/holding.
• Physiological pain: Infant pain was caused by invasive medical treatment, including
injection.
Infants were divided into “healthy” and “sick” datasets depending on whether they
were in the nursery or the neonatal intensive care unit (NICU), respectively. Data were
recorded at the Far Eastern Memorial Hospital, with infants in the nursery deemed healthy
and those in the NICU deemed sick. Anonymized audio signal recordings of infants
(10 healthy infants aged 2 to 27 days in the nursery, and 6 neonatal ICU infants aged up
to 4 months) were collected by an app and labeled by nursing staff at the Far Eastern
Memorial Hospital, with IRB approval and informed parental consent.
Each audio recording had a duration of 10 s. Crying incidents lasting less than 10 s
were not recorded. If an infant cried longer than 10 s, the app collected only the first 10 s
of data. There was then a 1 min resting time for the app to process
and upload the data to the cloud. Finally, if an infant’s cry lasted more than 1 min and 10 s,
the data were recorded as two separate cries.

2.2. Data Pre-Processing


Audio data quality highly depends on signal pre-processing [7,8]. Signal pre-processing
eliminates irrelevant or unwanted information like noise and channel distortion [5].
The raw audio data collected from the hospital included noise which needed to be
removed before modeling. Prior to cleaning, the data were retrieved from cloud storage
in Wavpack format (.wav). The original 10 s audio clips were split into five 2 s clips and
converted to 16-bit .wav files with a sampling rate of 8000 Hz.
One of the most important indicators of data cleaning performance is the removal of
all bad data. Data were considered unclean if 60% of the data matched these criteria:
sound of adult human detected, data mislabeled, electronic/mechanical noise detected,
sound of other infants detected, silence, etc.
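The splitting and conversion step above can be expressed compactly in code. The paper does not name its tooling; the sketch below is a minimal version assuming Python with librosa and soundfile, and the output file-naming scheme is hypothetical.

```python
import librosa
import soundfile as sf

def split_recording(path, sr=8000, clip_seconds=2, n_clips=5):
    """Split one 10 s recording into five 2 s clips at 8000 Hz, 16-bit PCM .wav."""
    y, _ = librosa.load(path, sr=sr)            # resample to 8000 Hz, mono
    clip_len = sr * clip_seconds                # 16,000 samples per 2 s clip
    for i in range(n_clips):
        clip = y[i * clip_len:(i + 1) * clip_len]
        if len(clip) < clip_len:                # guard against truncated files
            break
        out = path.rsplit(".", 1)[0] + f"_part{i}.wav"   # hypothetical naming
        sf.write(out, clip, sr, subtype="PCM_16")
```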
2.3. Feature Extraction

MFCC is widely used for feature extraction in audio analysis because of its highly efficient computation schemes and its robustness in distinguishing different noises [9]. MFCC effectively detects the human ear's critical bandwidth frequencies used to retrieve important speech characteristics, and its procedure is shown in Figure 1.

Figure 1. Block diagram of MFCC.

In the first step, the audio was passed through a filter that emphasized higher frequencies. Because information in the higher frequencies is otherwise lost, pre-emphasis was required to preserve it.

The second step, framing, involved dividing the audio signal into smaller segments. Framing was needed to get stationary information from part of the signal. In general, the width of the frame was around 20–30 ms. During this step, in order to prevent the adjacent frames from changing excessively, there was an overlap area between the two frames. The standard windows for MFCC were 25 ms frames with 10 ms overlap [10].

The next step was Hamming windowing, where each frame was multiplied by the Hamming window. This step helped to reduce discontinuity in the signal by minimizing the spectral distortion at the beginning and the end of each frame. With the Hamming window applied, the intensity of the noise was reduced, and the peak representing the target signal was more apparent.

In the fast Fourier transform (FFT) step, the frames were converted from the time domain to the frequency domain. The resulting magnitude spectrum passed through triangular bandpass filters, which smoothed the spectrum and reduced the number of features involved.

Converting the frequency spectrum to a Mel spectrum via the Mel filter bank was the primary conversion step. The Mel-scale frequencies are distributed linearly in the low range but logarithmically in the high range; human ears hear tones at frequencies lower than 1000 Hz on a linear scale [11]. The following equation was used to calculate the Mel filter bank:

f_Mel = 2595 × log10(1 + freq/700)    (1)

Next, the log Mel spectrum was transformed using the discrete cosine transform (DCT). The resulting features are similar to spectra and are commonly called Mel-scale cepstral coefficients; one vector of coefficients was obtained for each frame of audio from the output of the DCT.
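As a concrete illustration, the pipeline above maps onto a few lines of Python. The paper does not state its implementation; this is a minimal sketch assuming librosa, with the frame and coefficient settings taken from Table 2. A 10 ms hop on a 2 s clip at 8000 Hz reproduces the 201 frames per coefficient cited in Section 3.2.

```python
import numpy as np
import librosa

SR, N_FFT, HOP, N_MFCC = 8000, 200, 80, 12   # 8 kHz; 25 ms frames; 10 ms hop; 12 coefficients

def hz_to_mel(freq):
    """Equation (1): f_Mel = 2595 * log10(1 + freq / 700)."""
    return 2595.0 * np.log10(1.0 + freq / 700.0)

def extract_mfcc(path):
    y, _ = librosa.load(path, sr=SR)               # one 2 s clip -> 16,000 samples
    y = librosa.effects.preemphasis(y)             # step 1: boost high frequencies
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=N_MFCC,
                                n_fft=N_FFT, hop_length=HOP,
                                window="hamming")  # steps 2-3: framing and windowing
    return mfcc.T                                  # shape (201, 12): frames x coefficients
```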
2.4. Classification Model

The data were split into training, validation, and testing sets with a ratio of 70/15/15. The training data were used to train the model, with backpropagation in each epoch decreasing the loss/error rate by changing the weights used in the training process. Testing validated the robustness of the training process; data used in the testing process were excluded from use in training.
2.4.1. Artificial Neural Network (ANN)

The artificial neural network is used to find patterns in complex classification problems [12]. ANN is a machine learning algorithm using dense layers as perceptrons. The ready-to-use MFCC was the input of the neural network, with 201 sequences for each coefficient, giving 12 × 201 = 2412 values from each data sample as the input for the ANN layer. Figure 2 illustrates an ANN with three input neurons, two hidden layers (each with four hidden neurons), and two classes for the output.

Figure 2. Artificial neural network structure for two-class classification [12].
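A minimal Keras sketch of this network, using the layer sizes later listed in Table 3, might look as follows. TensorFlow is an assumed choice of framework, and the softmax output and loss are assumptions, since the paper does not state them for the ANN.

```python
import tensorflow as tf

def build_ann(n_classes):
    """Dense network per Table 3: flattened 201 x 12 MFCC input (2412 values)."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(201, 12)),
        tf.keras.layers.Flatten(),                      # 201 * 12 = 2412 inputs
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

model = build_ann(n_classes=2)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```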

2.4.2. Convolutional Neural Network (CNN)

The convolutional neural network is a neural network that contains many layers connecting all feature maps, allowing it to learn by its weights; each layer can become the feature layer [13,14]. CNN is widely used in image classification and audio processing due to its results and performance reliability. Figure 3 shows an illustrative example of a one-dimensional CNN structure.

Figure 3. An illustrative example of the one-dimensional CNN structure [14].

The convolutional layers retrieved feature maps for each layer, with the stride, padding, and kernel size set per layer. Kernel weights were learned during the training process in each neuron: at each position, the kernel was multiplied by the time series, plus a bias term. Stride determined the step size of moving the kernel, while padding controlled the size of the output activation map. Several activation functions, such as the sigmoid, ReLU, and hyperbolic tangent (tanh), were applied.

The pooling layer retrieved the output from the convolution based on the stride and padding that were set before. Two kinds of pooling were used: the maximum and the average of the result. Finally, the fully connected layer converted the last layer's output into a one-dimensional result, and the output took the biggest probability for the classification by using the SoftMax function.
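Translated into a Keras sketch with the settings later given in Table 3 (two Conv1D/max-pooling blocks, global average pooling, a 32-unit dense layer, and a softmax output), the 1-D CNN could be built as follows; placing ReLU on the convolutional and dense layers is an assumption.

```python
import tensorflow as tf

def build_cnn(n_classes):
    """1-D CNN per Table 3, over (201, 12) MFCC inputs."""
    l2 = tf.keras.regularizers.l2(0.01)
    return tf.keras.Sequential([
        tf.keras.Input(shape=(201, 12)),
        tf.keras.layers.Conv1D(364, 3, activation="relu", kernel_regularizer=l2),
        tf.keras.layers.MaxPooling1D(3),
        tf.keras.layers.Conv1D(180, 3, activation="relu", kernel_regularizer=l2),
        tf.keras.layers.MaxPooling1D(3),
        tf.keras.layers.GlobalAveragePooling1D(),       # collapses the time axis
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dropout(0.4),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
```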
2.4.3. Long Short-Term Memory (LSTM)

Long short-term memory is a method which avoids the vanishing gradient problem, by which a neural network never reaches its optimal weights; this problem is caused by the error value disappearing during the backward pass [15]. LSTM features three gates (input, forget, and output) that store important information, along with a one-cell state, as illustrated in Figure 4.

Figure 4. An illustrative example of the LSTM structure [14].

As shown in Figure 4, the key element of LSTM is the cell state c(t − 1). The horizontal line runs through the top of the diagram, like a conveyor belt straight down through the entire chain. The first step in LSTM is the forget gate f(t), which maps h(t − 1) and x(t) to numbers between 0 and 1 for each number in the cell state c(t − 1); the forget gate decision is made by the sigmoid function.

The next step is the input gate i(t), where the new information is either added to the cell state or not. Next is the candidate value C̃(t), which may or may not be added. This step determines the new input and either adds a new subject or replaces the old one.

Furthermore, the cell state needs to update c(t − 1) into the new cell state c(t). The old state c(t − 1) is multiplied by f(t), which was decided to be forgotten earlier, and the product of i(t) and C̃(t) is added to it. These new values are scaled to decide how to update each state value. The last step is to get the output of the cell by using the tanh function (value between −1 and 1).
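A corresponding Keras sketch using the settings later given in Table 3 is shown below. Table 3 specifies a sigmoid activation; applying it as the cell activation is an interpretation, since the paper does not say where it is used.

```python
import tensorflow as tf

def build_lstm(n_classes):
    """Stacked LSTM per Table 3 over (201, 12) MFCC sequences."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(201, 12)),
        tf.keras.layers.LSTM(128, activation="sigmoid",
                             dropout=0.05, recurrent_dropout=0.35,
                             return_sequences=True),     # pass the full sequence on
        tf.keras.layers.LSTM(32, activation="sigmoid",
                             dropout=0.05, recurrent_dropout=0.35,
                             return_sequences=False),    # keep only the last state
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
```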

2.5. Classification of Experiments


This research consisted of several experiments. Depending on the number of classes considered, two-class, three-class, and four-class classifications were investigated. The four-class case consisted of the labels of the hungry class, diaper change class, emotional needs class, and medical treatment class. The three-class case removed the medical treatment class due to its relatively small number of data points. The two-class classification was to distinguish the health condition of the infant, i.e., healthy or sick. The full (unbalanced) dataset considered all the data samples, while the balanced dataset employed a down-sampling technique based on the smallest number of data points in any class considered. For example, since the medical treatment class had only 88 data points, which made it the smallest category, 88 data points were randomly selected from each of the other three classes in the four-class classification for further analysis.
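The balancing just described, together with the 70/15/15 split from Section 2.4, can be sketched in a few lines. This is a hedged sketch assuming scikit-learn and NumPy; the stratification and fixed seed are assumptions, as the paper only states that the split was random.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def downsample_to_smallest(X, y, seed=0):
    """Balance by randomly keeping n_min samples per class (e.g., 88 in the four-class case)."""
    rng = np.random.default_rng(seed)
    n_min = np.bincount(y).min()
    keep = np.concatenate([rng.choice(np.flatnonzero(y == c), n_min, replace=False)
                           for c in np.unique(y)])
    return X[keep], y[keep]

# 70/15/15 split: hold out 30%, then cut the holdout in half.
X_bal, y_bal = downsample_to_smallest(X, y)   # X: (n, 201, 12) MFCCs, y: integer labels
X_train, X_rest, y_train, y_rest = train_test_split(
    X_bal, y_bal, test_size=0.30, stratify=y_bal, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)
```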
3. Results and Discussion

3.1. Data Summary

The infant cry data were collected from 33 babies in the neonatal intensive care unit and 26 babies from the nursery unit between October 2019 and January 2020 at the Far Eastern Memorial Hospital. A Lollipop baby monitor provided by Masterwork Aoitek Tech Corp. was used in this study as the data collection tool. Figure 5 illustrates that each baby recorded in each unit was separate from the other babies in the unit and that the device was placed approximately 30 cm from the bed.

Figure 5. An example of the data collection device and the sample infant.

This Lollipop device was activated by the sounds of crying infants. Each sound was recorded for a period of ten seconds. If the cry lasted longer than ten seconds, the first ten seconds would be saved. There was then a one-minute rest period for the app to process and upload to the cloud. Additionally, if the baby cried for fewer than ten seconds, the app did not record their cry. Lastly, if a baby cried for more than one minute and ten seconds, two samples were collected consecutively. In collaboration with the nurses, an application embedded within a mobile device was used to label each recording. For the purposes of analysis, each recording was divided into five segments (each lasting two seconds).
After cleaning the data, the infants’ needs recognition was split into four classes.
Table 1 shows the summary of the four-class dataset. The number of data points obtained
from the healthy group was 1705, and the number of data points in the sick group was
840, nearly half that of the healthy group. Also, the hungry class had the most data points,
i.e., 1171, and medical treatment was the least with only 88 samples. In order to limit the
influence of unbalanced data, some balanced datasets were also established based on the
number of samples in the smallest class (e.g., the medical treatment in the four-class, the
diaper change in the three-class, and the sick group in the two-class, respectively). The data
were randomly split into three categories: 70% for training, 15% for validation, and 15% for
testing.

Table 1. A summary of the infant cry dataset.

Group      Hungry    Diaper Change    Emotional Needs    Medical Treatment    Total
Healthy    868       301              486                50                   1705
Sick       303       74               425                38                   840

3.2. Parameter Setting


Parameter setting was divided into two parts: one for the MFCC and another for
the classification methods. Table 2 provides the parameter values for the MFCC. Every
MFCC had 201 sequences for each coefficient, and all 12 coefficients were considered;
thus, 2412 values were created (201 × 12 = 2412). These values became the input of each
model, arranged in the dimensions required by the model used.

Table 2. MFCC parameters.

Parameter Value
Audio length 2s
Sampling rate 8000 Hz
Framing 25 ms
Overlapping 10 ms
Number of coefficients 12

Through preliminary analysis, the parameter settings of each classification method
(ANN, CNN, and LSTM) were determined as summarized in Table 3. Five-fold
cross-validation was employed to evaluate the performance of the proposed methods.

Table 3. Parameter setting of each classification method.

Method   Parameters

ANN      Activation function: ReLU; Optimizer: Adam
         • Input layer = (201, 12)
         • Hidden layer 1 = 256
         • Dropout = 50%
         • Hidden layer 2 = 128
         • Dropout = 50%
         • Output layer = (total classes)
         Epochs = 20

CNN      Optimizer: Adam
         • Input layer = (201, 12)
         • Convolutional 1-D = 364, kernel = 3, kernel regularization = L2(0.01)
         • Max-Pooling 1-D (kernel = 3)
         • Convolutional 1-D = 180, kernel = 3, kernel regularization = L2(0.01)
         • Max-Pooling 1-D (kernel = 3)
         • Global Average Pooling 1-D
         • Hidden layer 1 = 32
         • Dropout = 40%
         • Output layer = (total classes)
         Epochs = 20

LSTM     Input size = (201, 12); Activation function: Sigmoid; Optimizer: Adam
         Number of LSTM neurons:
         • Hidden layer 1 = 128 (dropout = 5%, recurrent dropout = 35%, return sequences = True)
         • Hidden layer 2 = 32 (dropout = 5%, recurrent dropout = 35%, return sequences = False)
         • Output layer = (total classes)
         Epochs = 20
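The five-fold cross-validation mentioned above could be wired up as in the sketch below, reusing build_cnn from Section 2.4.2 and the X, y arrays from the earlier snippets. Stratified folds are an assumption; the paper does not say how folds were drawn.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

accs = []
for tr, va in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    model = build_cnn(n_classes=len(np.unique(y)))
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X[tr], y[tr], epochs=20, verbose=0)
    accs.append(model.evaluate(X[va], y[va], verbose=0)[1])   # held-out fold accuracy
print(f"Mean five-fold accuracy: {np.mean(accs):.3f}")
```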

3.3. Experimental Results


Tables 4–6 show the precision and recall of the four-class, three-class, and two-class
classification results, respectively. Tables 4 and 5 show clear differences between the
balanced and the full (imbalanced) datasets. The classes with the fewest data points in
the full dataset, i.e., medical treatment in Table 4 and diaper change in Table 5, failed in
prediction. However, the balancing strategy showed tremendous improvement in precision
and recall for the diaper change and medical treatment classes. For example, with CNN
in Table 4, the precision of medical treatment improved from 7% to 53%, and the recall
of diaper change improved from 24% to 53%. Similar improvement can also be observed
for the other two methods in Tables 4 and 5. CNN performed the best of the three competing
methods, while ANN showed inferior results. For the balanced dataset, CNN's precision
and recall ranged from 46% to 60% in the four-class and from 55% to 61% in the three-class
analyses.
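Per-class precision and recall like those reported in Tables 4–6 can be reproduced with, e.g., scikit-learn; this sketch assumes the fitted model and the X_test/y_test arrays from the earlier snippets, and the label names apply to the four-class case.

```python
from sklearn.metrics import classification_report

y_pred = model.predict(X_test).argmax(axis=1)   # most probable class per clip
print(classification_report(
    y_test, y_pred,
    target_names=["hungry", "diaper change", "emotional needs", "medical treatment"]))
```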

Table 4. Four-class precision and recall results.

                     Precision                                     Recall
Dataset    Method    Hungry  Diaper  Emotional  Medical    Hungry  Diaper  Emotional  Medical
                             Change  Needs      Treatment          Change  Needs      Treatment
Full       ANN       0.54    0.06    0.42       0.06       0.51    0.21    0.32       0.01
Full       CNN       0.57    0.41    0.55       0.07       0.54    0.24    0.54       0.01
Full       LSTM      0.52    0.22    0.42       0.24       0.55    0.13    0.50       0.03
Balanced   ANN       0.24    0.40    0.36       0.29       0.27    0.46    0.37       0.22
Balanced   CNN       0.54    0.54    0.60       0.53       0.46    0.53    0.59       0.49
Balanced   LSTM      0.34    0.35    0.43       0.31       0.36    0.29    0.47       0.35

Table 5. Three-class precision and recall results.

                     Precision                       Recall
Dataset    Method    Hungry  Diaper  Emotional   Hungry  Diaper  Emotional
                             Change  Needs               Change  Needs
Full       ANN       0.51    0.11    0.32        0.37    0.21    0.35
Full       CNN       0.62    0.50    0.58        0.69    0.12    0.65
Full       LSTM      0.60    0.27    0.50        0.62    0.12    0.58
Balanced   ANN       0.42    0.44    0.47        0.44    0.46    0.43
Balanced   CNN       0.61    0.55    0.56        0.58    0.58    0.55
Balanced   LSTM      0.47    0.44    0.45        0.44    0.45    0.48

Table 6. Two-class precision and recall results.

                     Precision          Recall
Dataset    Method    Sick    Healthy    Sick    Healthy
Full       ANN       0.96    0.90       0.93    0.93
Full       CNN       0.96    0.95       0.98    0.89
Full       LSTM      0.95    0.88       0.94    0.91
Balanced   ANN       0.90    0.86       0.83    0.89
Balanced   CNN       0.94    0.97       0.98    0.94
Balanced   LSTM      0.98    0.93       0.93    0.98

As shown in Table 6, CNN and LSTM showed competitive performance in the two-class
case. CNN obtained 97% precision for the healthy class and 98% recall for the sick class
on the balanced dataset, while LSTM posted similar figures in those measures. Not
surprisingly, ANN again showed the weakest performance in the two-class case. In
addition, the balanced dataset also improved the classification performance of all three
methods in Table 6, although the gap was not as significant as the ones in Tables 4 and 5.

Finally, the average accuracy over all classes is summarized in Table 7. Consistent
with the precision and recall performance, CNN outperformed LSTM and ANN, and
the balanced dataset helped improve the accuracy. For example, CNN's average accuracy
reached 64% and 60% in the four-class and three-class cases, respectively, while 96%
accuracy was obtained for the two-class case.

Table 7. Accuracy of different classes over three methods.

Class         Dataset     ANN     CNN     LSTM
Four-class    Full        0.28    0.55    0.46
Four-class    Balanced    0.33    0.64    0.37
Three-class   Full        0.38    0.60    0.54
Three-class   Balanced    0.45    0.60    0.45
Two-class     Full        0.92    0.94    0.93
Two-class     Balanced    0.87    0.96    0.95

The proposed methods could all distinguish the cry of an infant in healthy condition
with high accuracy, precision, and recall. However, when it came to classifying psychological
or physiological needs, classification performance deteriorated. This could be attributed
to labeling errors in the dataset, resulting in erroneous class predictions. For example,
an infant may be in distress because it is simultaneously hungry and wants to be held.
This kind of compound behavior is difficult to predict and tough for nurses to label.

4. Conclusions
The proposed deep learning approaches, CNN and LSTM, provided reliable and
robust results for classifying sick and healthy infants based on recordings of infant cries.
Recognition accuracy was improved by using a balanced dataset, with testing results of
up to 64% on CNN for the four-class categorization. Better results were obtained in the
health needs (two-class) test, possibly because of the data collection method employed,
wherein the healthy and sick infants were diagnosed by doctors and were kept in two
different rooms. This resulted in more controlled and accurate situations for data collection,
as opposed to the emotional-state data collection, which presented the increased chance
of mislabeling. Another possible reason for mislabeling was that an infant may have
simultaneously experienced multiple stimuli resulting in crying behavior, making it difficult
to isolate the actual proximal cause.
This study involved data samples with some unique characteristics such as race, age,
residence area, and health status, as compared with other similar studies in the literature.
Moreover, good data always play a major part in recognition performance. Improving the
quality of data points is one way to get better recognition. Future work should seek to
further improve data quality by better controlling the data collection environment, and
additional feature extraction methods should be used for performance comparison against
the MFCC feature set used here. The current dataset could also be combined with data from
other hospitals, and a dataset incorporating age considerations is another way to boost the robustness
of the model. In addition, the current model only included audio signals, and future work
could integrate video signals to improve model robustness. Moreover, ensemble learning
may offer performance improvements on the algorithmic side. Research involving data
pertaining to multiple labels and experiments on different feature-extraction techniques
can also be interesting areas for future investigation.

Author Contributions: Conceptualization, Y.-C.L., I.W. and J.R.C.J.; data curation, I.W.; formal
analysis, I.W. and Y.-C.L.; funding acquisition, Y.-C.L. and M.-T.Y.; investigation, I.W. and Y.-C.L.;
methodology, I.W., J.R.C.J. and Y.-C.L.; project administration, Y.-C.L. and M.-T.Y.; resources, Y.-C.L.,
M.-T.Y., J.R.C.J. and H.-T.C.; writing—original draft, I.W.; writing—review and editing, Y.-C.L. and
M.-T.Y. All authors have read and agreed to the published version of the manuscript.
Funding: This research was partially funded by the Far Eastern Memorial Hospital and Yuan Ze
University, FEMH-YZU-2018-010.
Institutional Review Board Statement: The study was conducted in accordance with the Declaration
of Helsinki, and approved by the Institutional Review Board of the Far Eastern Memorial Hospital
(protocol code IRB 108059-F; date of approval: 10 June 2019).
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: Available upon request.


Conflicts of Interest: The authors declare no conflict of interest.

References
1. Adachi, T.; Murai, N.; Okada, H.; Nihei, Y. Acoustic properties of infant cries and maternal perception. Tohoku Psychol. Folia 1985,
44, 51–58.
2. Patil, H.A. Cry baby: Using spectrographic analysis. In Advances in Speech Recognition; Neustein, A., Ed.; Springer: New York, NY,
USA, 2010; pp. 323–348.
3. Yong, B.F.; Ting, H.; Ng, K. Baby cry recognition using deep neural networks. In World Congress on Medical Physics and Biomedical
Engineering; Springer: Prague, Czech Republic, 2019; pp. 809–816.
4. Reyes-Galaviz, O.F.; Arch-Tirado, E. Classification of infant crying to identify pathologies in recently born babies with ANFIS. In
International Conference on Computers Helping People with Special Needs; Research Gate: Paris, France, 2004; pp. 408–415.
5. Garcia, J.; García, C. Classification of infant cry using a scaled conjugate gradient neural network. In European Symposium on Artificial
Neural Networks; d-side publications: Bruges, Belgium, 2003; pp. 349–354.
6. Guo, L.; Yu, H.Z.; Li, Y.H.; Ma, N. Pitch analysis of infant crying. Int. J. Digit. Content Technol. Its Appl. 2013, 7, 1072–1079.
7. Narang, S.; Gupta, D. Speech feature extraction techniques: A review. Int. J. Comput. Sci. Mob. Comput. 2015, 4, 107–114.
8. Yu, H.; Zhang, X.; Zhen, Y.; Jiang, G. A universal data cleaning framework based on user model. In Proceedings of the IEEE
International Colloquium on Computing, Communication, Control and Management, Sanya, China, 8–9 August 2009; pp.
200–202.
9. Prajapati, P.; Patel, M. Feature extraction of isolated Gujarati digits with Mel-frequency cepstral coefficients (MFCCs). Int. J.
Comput. Appl. 2017, 163, 29–33. [CrossRef]
10. Miranda, I.; Diacon, A.; Niesler, T. A comparative study of features for acoustic cough detection using deep architectures. In
Proceedings of the 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Berlin,
Germany, 23–27 July 2019; pp. 2601–2605.
11. Moller, H.; Pedersen, C. Hearing at low and infrasonic frequencies. Noise Health 2004, 6, 37–57.
12. Girirajan, S.; Sangeetha, R.; Preethi, T.; Chinnappa, A. Automatic speech recognition with stuttering. Int. J. Recent Technol. Eng.
2020, 8, 1677–1681.
13. Lavner, Y. Baby cry detection in domestic environment using deep learning. In Proceedings of the ICSEE International Conference
on the Science of Electrical Engineering, Eilat, Israel, 16–18 November 2016.
14. Zan, T.; Wang, H.; Wang, M.; Liu, Z.; Gao, X. Application of Multi-Dimension Input Convolutional Neural Network in Fault
Diagnosis of Rolling Bearings. Appl. Sci. 2019, 9, 2690. [CrossRef]
15. Swedia, E.; Mutiara, A.; Subali, M. Deep learning Long Short-Term Memory (LSTM) for Indonesian speech digit recognition
using LPC and MFCC Feature. In Proceedings of the 2018 Third International Conference on Informatics and Computing (ICIC),
Palembang, Indonesia, 17–18 October 2018; pp. 1–5.
