
Infant Cry Language Analysis and Recogni

This paper proposes a method to recognize and classify infant cry signals based on their acoustic features. Cry signals are analyzed in the time and frequency domains to extract features like LPC, LPCC, BFCC, and MFCC. A compressed sensing technique is then used to classify the cry signals based on these extracted features. Experiments show the approach can accurately distinguish between different meanings of both normal and abnormal cry signals, even in noisy environments, and is independent of the individual infant. The goal is to help parents and caregivers better understand infants' needs through their cries.

778 IEEE/CAA JOURNAL OF AUTOMATICA SINICA, VOL. 6, NO. 3, MAY 2019

Infant Cry Language Analysis and Recognition: An Experimental Approach
Lichuan Liu, Senior Member, IEEE, Wei Li, Senior Member, IEEE,
Xianwen Wu, Member, IEEE and Benjamin X. Zhou

Abstract: Recently, much research has been directed towards natural language processing. However, the baby's cry, which serves as the primary means of communication for infants, has not yet been extensively explored, because it is not a language that can be easily understood. Since cry signals carry information about a baby's wellbeing and can be understood by experienced parents and experts to an extent, recognition and analysis of an infant's cry is not only possible, but also has profound medical and societal applications. In this paper, we obtain and analyze audio features of infant cry signals in time and frequency domains. Based on the related features, we can classify given cry signals to specific cry meanings for cry language recognition. Features extracted from audio feature space include linear predictive coding (LPC), linear predictive cepstral coefficients (LPCC), Bark frequency cepstral coefficients (BFCC), and Mel frequency cepstral coefficients (MFCC). A compressed sensing technique was used for classification, and practical data were used to design and verify the proposed approaches. Experiments show that the proposed infant cry recognition approaches offer accurate and promising results.

Index Terms: Compressed sensing, feature extraction, infant cry signal, language recognition.

Manuscript received October 18, 2018; revised December 12, 2018; accepted December 17, 2018. This work was supported by the Gerber Foundation and the Northern Illinois University Research Foundation. Recommended by Associate Editor Mengchu Zhou. (Corresponding author: Lichuan Liu.)
Citation: L. C. Liu, W. Li, X. W. Wu, and B. X. Zhou, "Infant cry language analysis and recognition: an experimental approach," IEEE/CAA J. Autom. Sinica, vol. 6, no. 3, pp. 778-788, May 2019.
L. Liu, W. Li, and X. W. Wu are with the Department of Electrical Engineering, Northern Illinois University, DeKalb, IL 60115 USA (e-mail: {liu; weili; z1648342}@niu.edu).
B. X. Zhou is with the Department of Biology, The College of New Jersey, Ewing, NJ 08618 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JAS.2019.1911435

I. INTRODUCTION

CRYING is the primary means of communication for infants. Experts, including experienced parents, pediatricians and child care specialists, can often distinguish infant cries through training and experience [1]. However, it is difficult for new parents and inexperienced pediatricians and caregivers to interpret infant cries. Hence, differentiating cries with various meanings based on related cry audio features is of great importance [1], [2].

Prior works on infant cry analysis have either investigated the difference between normal and pathological (deaf or hearing disabled infants) cries, or they have attempted to differentiate conditional cries such as pain from immunization shots, fear from jack-in-the-box toys, or frustration from head restraints [3]. Previously, in [4], [5], we proposed a preliminary approach which can recognize cry signals of a specific infant. However, only limited normal cry signals such as hunger, a wet diaper and attention have been studied, and the algorithms work only for specific infants in the study in a controlled lab environment. Nevertheless, an abnormal cry can be associated with severe or chronic illness, so the detection and recognition of abnormal cry signals are of great importance. Compared with normal cry signals, abnormal cry signals are more intense, requiring further evaluation [6]. An abnormal cry is often related to medical problems, such as: infection, abnormal central nervous system, pneumonia, sepsis, laryngitis, pain, hypothyroidism, trauma to the hypopharynx, vocal cord paralysis, etc. Therefore, approaches which can identify and recognize both normal and abnormal cry signals in practical scenarios are of extreme importance. In this paper, we propose a novel cry language recognition algorithm which can distinguish the meanings of both normal and abnormal cry signals in a noisy environment. Additionally, the proposed algorithm is individual crier independent. Hence, this algorithm can be widely used in practical scenarios to recognize and classify various cry features.

The proposed algorithm can be used to interpret a baby's needs, providing parents with an appropriate way to soothe infants. Furthermore, it can help parents or infant caregivers avoid misunderstanding their babies' cries, thereby reducing their own stress. It also helps prevent child abuse and neglect. Moreover, analyzing infant cries provides a non-invasive diagnostic of the condition of the infant without using invasive tests [7]. Using an infant's cry as a diagnostic tool plays an important role in various situations: tackling medical problems in which there is currently no diagnostic tool available (e.g. sudden infant death syndrome (SIDS), problems in developmental outcome and colic), tackling medical problems in which early detection is possible only by invasive procedures (e.g. chromosomal abnormalities), and finally tackling medical problems which may be readily identified but would benefit from an improved ability to define prognosis (e.g. prognosis of long term developmental outcome in cases of prematurity and drug exposure [8]).

In our model, cry signals are output signals from the vocal tract system, which is also called the linear system. The stimulus signal, which excites the linear system, is the airflow from an infant's lungs [9]. Similar to digital speech signal processing, we use a time-varying Fourier transform to study the spectral properties of cry signals. Therefore, we can identify the difference between vocal tract systems and input signals,
LIU et al.: INFANT CRY LANGUAGE ANALYSIS AND RECOGNITION: AN EXPERIMENTAL APPROACH 779

which are related to different cry reasons. In this paper, the short-time Fourier transform (STFT) is used to analyze the cry signals. Recently, speech recognition and acoustic signal classification techniques have been widely used in many areas such as manufacturing, communication, consumer electronic products and medical care [10]-[12]. Speech recognition is a signal processing procedure that transfers speech signal waveforms in the time domain into a series of coefficients, called a feature, which can be recognized by the computer [10]-[13]. Infant cry signals are time-varying, non-stationary random signals similar to speech signals, and the stimulus for an infant cry signal is the same as that for a voiced speech signal. In this paper, we therefore use techniques originally designed and used in automatic speech recognition to detect and recognize the features of infant cry signals, and use compressed sensing to analyze and classify those signals. Figure 1 shows the procedure of cry signal recognition, which consists of the following steps:

Step 1: Cry unit detection
Step 2: Feature extraction
Step 3: Analysis and classification

Fig. 1. Block diagram for infant cry recognition.

This paper is organized as follows. Section II introduces the anatomy of infant-related cries and the physiology of cry signals. In Section III, short-time Fourier analysis is proposed and cry detection techniques are presented. Section IV presents feature pattern extraction algorithms, and proposes a compressed sensing model to recognize and classify infant cry signals. Experimental results are presented in Section V. Finally, in Section VI, we conclude the paper.

II. INFANT CRY MODELLING AND CATEGORIZATION

A. Physiology of Infant Cry

From a physiological point of view, increasing alertness and decreasing crying, as part of the sleep/wakefulness cycle, suggest that there may be a balanced exchange between crying and attention. The change from sleep/cry to sleep/alert/cry necessitates the development of control mechanisms to modulate arousal. An infant has to increase arousal gradually to maintain states of attention for longer periods.

The infant cry is the result of complex interactions between anatomic structures and physiologic mechanisms. These interactions involve the central nervous system, the respiratory system, the peripheral nervous system, and a variety of muscles [14].

Newborns differ from one another in their response to different stimuli. There are two main physiological states which infants can switch between: a sleep state and an awake state. Within the sleep state, infants fall under two categories: either the quiet sleep or the active sleep category. On the other hand, the awake state is characterized by four main behaviors: drowsy, quiet alert, active alert, and crying. Physiological changes can easily affect an infant's cry behavior directly.

In the first few weeks after birth, crying has a reflexive-like quality and is most likely tied to the regulation of physiological homeostasis as the neonate is balancing internal demands with external demands [15].

As physiological processes stabilize, periods of alertness and attention increase, which place additional demands on regulatory functions. Crying can occur when the system becomes overloaded due to external stimulation. Crying is also considered a mechanism for discharging energy or tension. The need for tension reduction is especially acute at times of major developmental upheavals and shifts. Unexplained fussiness and sudden increases in crying occur between 3 and 12 weeks of age due to maturational changes in brain structure and shifts in the organization of the central nervous system. Physiological and anatomical changes that occur around 1 to 2 months result in more control over vocalization, thus crying becomes more differentiated. At the age of 7-9 months there is a second bio-behavioral shift characterized by major cognitive and affective changes that are also thought to reflect central nervous system reorganization. Crying now occurs for additional reasons, such as fear and frustration [15].

B. Catalog of Cry Signal

The cry production mechanism in infants resembles the speech production process in adults [9]. First, external or internal stimuli will stimulate the infant's brain. Then, the nervous system will transmit the brain's commands to speech and respiratory muscles which control the ejection of air from the lungs to the vocal tract, changing the vocal tract status [14]. As a result, a different acoustic sound is uttered. The vibration of vocal cords and muscle movements results in a change in air pressure. The fundamental frequency of the cord vibration is called the pitch.

Similar to speech signals, infant sounds can also be defined as voiced or unvoiced excitations based on different utterance mechanisms. Voiced excitations occur in the larynx and involve vocal cord vibration, while unvoiced excitations involve air turbulence or occlusion caused by the soft palate, tongue, teeth, or lips.

Crying serves several useful purposes for infants. Crying is a way for infants to communicate when they are hungry or uncomfortable [15]. Crying helps them shut out intensive stimuli, such as sights, sounds, and other sensations. Additionally, it helps infants release tension. Sometimes, crying even helps babies get rid of excess energy. Normal cries can be due to hunger, a need for a diaper change or a need to be held. However, there are also cries associated with something more severe (abnormal cries), such as a hair tourniquet (a piece of hair wrapped very tightly around a finger or toe), an obstruction in the intestine, or pain and sickness. Understanding and identifying the different reasons for various infant cries, especially the abnormal cries, can help parents or caregivers choose the proper healthcare service and reduce the risk of health impairment for infants. The following are some common reasons for infant crying [16], [17]: hunger, stomach problems, needing to sleep, a dirty diaper, wanting to be held, etc.

III. CRY SIGNAL TIME FREQUENCY ANALYSIS AND DETECTION

After obtaining cry signals, we analyze the recorded signals by using waveform and time frequency analysis. Then we conduct signal detection and segmentation for later pattern extraction. Signal detection restricts processing to instances of voiced activity instead of spending computational time on silent periods. To accurately detect potential periods of voiced activity, two short-term signal detection techniques are used.

A. Short-Time Fourier Analysis

In this section, we use time frequency analysis to analyze the infant cry signals. It is well known that the discrete Fourier transform (DFT) of a long sequence gives an estimate of the power spectral density (PSD), called a periodogram [11]. Different cry signals from different infants would produce similar gross PSD. Therefore, we use STFT to obtain the time-varying properties of cry signals. STFT is defined as:

$$X_n(e^{j\omega}) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\, e^{-j\omega m} \tag{1}$$

where $w(n-m)$ is a real window sequence that determines the portion of the signal $x(n)$ that receives emphasis at a particular time index $n$. STFT is a time dependent complex function of time index $n$ and frequency $\omega$.

We can observe STFT as the discrete time Fourier transform (DTFT) of the sequence $x(m)w(n-m)$. An alternative interpretation of STFT is to consider $X_n(e^{j\omega})$ as a function of $n$ with a given frequency. Then it becomes a discrete-time convolution and can be considered as linear filtering.

The shape of the window sequence has an important effect on this time-dependent FT. The STFT of a given signal is

$$X_n(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{-j\tilde{\omega}})\, e^{-j\tilde{\omega} n}\, X(e^{j(\omega-\tilde{\omega})})\, d\tilde{\omega}. \tag{2}$$

The Fourier transform (FT) of the input signal sequence is convolved with the FT of the shifted window. To represent $X(e^{j\omega})$ by using the STFT $X_n(e^{j\omega})$, we choose a window function with a spectrum highly concentrated around the origin. In this paper, the Hamming window is used to conduct STFT.

B. Short-Time Energy

Short-time energy (STE) is defined as the average of the square of the sample values in a suitable time window [10]:

$$E(n) = \frac{1}{N} \sum_{m=0}^{N-1} \left[ w(m)\, x(n-m) \right]^2 \tag{3}$$

where $w(m)$ are the coefficients of the window function of length $N$, $m$ stands for the window index, and $n$ stands for the sample index. The Hamming window is a time window which minimizes the maximum side lobe in the frequency domain.

Short-time processing of crying should take place during segments between 10-30 ms in length [9]. For signals of 8 kHz sampling frequency, a time window of 128 samples (~16 ms) was used. As shown in Fig. 7, STE estimation performs well as a cry detector because there is a noticeable difference of average energy between voiced and unvoiced cry signals, and between crying and silence. This technique is usually paired with short-time zero crossing for a robust detection scheme.

C. Short-Time Zero Crossing

Short-time zero crossing (STZC) is defined as the rate of signal sign change [11]:

$$Z(n) = \frac{1}{N} \sum_{m=0}^{N-1} \left| \mathrm{sign}(x(n-m)) - \mathrm{sign}(x(n-m-1)) \right| \tag{4}$$

where

$$\mathrm{sign}(x(m)) = \begin{cases} 1, & x(m) \ge 0 \\ -1, & x(m) < 0. \end{cases}$$

STZC estimation works well with the crying detector because there are noticeably fewer zero crossings in voiced crying as compared with unvoiced crying. Short-time zero crossing can predict the start and endpoints of cry signals, as shown in Fig. 8. The STZC approach can effectively obtain the envelope of a non-silent signal, and combined with short-time energy, STZC can effectively track instances of potentially voiced signals that are the signals of interest for analysis.

Not all signals bounded by the STZC boundary contain cries. Large STZC envelopes with low energy tended to contain cry precursors such as whimpers and breathing events. Likewise, not all signals with non-negligible STE contained cries. Infant coughing could also have similar STZC envelopes and contain noticeable STE values. In this research, crying is defined as a high energy segment of sufficiently long duration.

In this research, we use both STE and STZC to detect cry units. As the normal infant cry duration is around 1.6 s, the two quantifiable threshold conditions that constitute a desired voiced cry are [14]:
1) Normalized energy > 0.05 (to eliminate non-voiced artifacts such as breathing/whimpering and to supersede cry precursors);
2) Signal envelope period > 0.1 s (to eliminate impulsive voiced artifacts such as coughing).

IV. FEATURE EXTRACTION AND RECOGNITION

A. Audio Features

Through the sense of hearing, people can distinguish similar sounds of different types. This is done through the human perception of qualitative audio features. There are four primary auditory qualities associated with sound: loudness, pitch, timbre, and the source of the sound [11]. Loudness is a quantitative measure of the amplitude of the sound compared to a reference level and can be qualitatively described from being quiet to loud. Pitch is a quantitative measure of the actual fundamental frequency of a signal and can be qualitatively described from low to high. Timbre is a qualitative measure of a sound that can be used to help differentiate between two sounds of equal loudness and pitch through the tonal quality of the sound. Essentially whenever a sound is heard, the human

brain will actively process those analog auditory qualities and make decisions regarding the sound.

Audio feature extraction hinges upon digital signal processing of audio signals to quantize acoustic information in a manner that makes classification practical and tractable. The comparison of time domain waveforms can be used as a measure for signal classification. On the other hand, time-domain signals can also be segmented and processed in smaller time windows to generate frequency domain snapshots of the segments through the Fourier transform. The frequency domain analysis of signals yields information closely tied with timbre and pitch. In this paper, we leverage both time and frequency domain analysis to cover all four primary auditory qualities.

In this section, linear predictive coding (LPC), linear predictive cepstral coefficients (LPCC), mel-frequency cepstral coefficients (MFCC), and bark-frequency cepstral coefficients (BFCC) are extracted from cry signals as features. Additionally, compressed sensing (CS) is used for cry feature recognition in this paper.

B. Linear Predictive Coding

The waveforms of two similar sounds will also be similar. If two infant cries have very similar waveforms, it indicates that they should possess the same impetus. However, it is impractical to conduct a full sample by sample comparison between cry signals due to the complexity of the sampled audio signals. For better performance of the time domain comparison of infant cry signals, linear predictive coding (LPC) is applied.

There are two acoustic sources associated with voiced and unvoiced speech. Voiced crying is produced by the vibration of the vocal cords caused by the airflow from the lungs and this vibration is periodic in nature; unvoiced crying is produced by constrictions in the air tract resulting in random airflow [12]. The basis of the source-filter model of speech is that crying can be synthesized by generating an acoustic source and passing it through an all-pole filter.

LPC produces a vector of coefficients that represent a spectral shaping filter [11]. The input signal to this filter is either a pitch train for voiced sounds, or white noise for unvoiced sounds. This shaping filter is an all-pole filter represented as [11]:

$$H(z) = \frac{1}{1 - \sum_{i=1}^{M} a_i z^{-i}} \tag{5}$$

where $a_i$ are the linear prediction coefficients and $M$ is the number of poles. The present sample of the cry signal could then be described as a linear combination of the past $M$ samples of the cry signal.

The coefficients $\{a_i\}$ can then be estimated by either autocorrelation or covariance methods [10]. Effectively, the purpose of LPC is to take a large size waveform and then compress it into coefficients, a more manageable form. Because similar waveforms will also result in similar acoustic output, LPC serves as a time domain measure of how close two different waveforms are.

Linear predictive cepstral coefficients (LPCC) represent LPC coefficients in the cepstral domain [12]. This feature reflects the difference of the biological structure of the human vocal tract [9]. LPCC derives from LPC recursively as [11]

$$\begin{cases} LPCC_1 = LPC_1 \\ LPCC_i = LPC_i + \sum_{k=1}^{i-1} \frac{k}{i}\, LPCC_k\, LPC_{i-k}, & 1 < i \le M \end{cases} \tag{6}$$

where $M$ is the order of the LPCC coefficients, $i = 2, \ldots, M$.

C. Mel Frequency Cepstral Coefficients

Mel frequency cepstral coefficients (MFCC) are coefficients that describe the mel frequency cepstrum [13], [18]. In sound processing, the mel frequency cepstrum is a representation of the short-time power spectrum of a sound based on a linear cosine transform of a log spectrum on a non-linear mel scale of frequency. The mel frequency cepstrum is obtained with the following steps. The short-time Fourier transform of the signal is taken to obtain the quasi-stationary short-time power spectrum $F(f) = \mathcal{F}\{f(t)\}$. The frequency portion is then mapped to the mel scale perceptual filter bank with 18 triangle band pass filters equally spaced on the mel range of frequency $F(m)$. These triangle band pass filters smooth the magnitude spectrum such that the harmonics are flattened to obtain the envelope of the spectrum. The log of the filtered spectrum is obtained, and then the Fourier transform of the log spectrum squared results in the power cepstrum of the signals.

$$Mel(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right). \tag{7}$$

At this point, the discrete cosine transform (DCT) of the power cepstrum is taken to obtain the MFCC, a tool commonly used to measure audio signal similarity. The DCT coefficients are retained as they represent the power amplitudes of the mel frequency cepstrum.

D. Bark Frequency Cepstral Coefficients

Similar to MFCC, BFCC warps the power cepstrum in such a way that it matches human perception of loudness. The method of obtaining BFCC is similar to that of MFCC [12]. In BFCC, frequencies are converted to the bark scale as follows:

$$Bark(f) = 13 \arctan(0.00076 f) + 3.5 \arctan\left(\left(\frac{f}{7500}\right)^2\right) \tag{8}$$

where $Bark$ denotes bark frequency and $f$ is the frequency in Hertz. The mapped bark frequency is passed through 18 triangle band pass filters. The center frequencies of these triangular band pass filters correspond to the first 18 of the 24 critical frequency bands of hearing.

BFCC is obtained by applying DCT to the bark frequency cepstrum and the 10 DCT coefficients describe the amplitudes of the cepstrum. The power cepstrum also possesses the same sampling rate as the signal, so the BFCC is obtained by performing the LPC algorithm on the power cepstrum in 128 sample frames. BFCC encodes the cepstrum waveform in a compact fashion that makes it suitable for classification schemes.
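As an illustration of the LPC and LPCC features of Section IV-B, the following is a minimal Python sketch, not the implementation used in the experiments. It assumes the autocorrelation method with the Levinson-Durbin recursion for estimating the coefficients of the all-pole model $H(z) = 1/(1 - \sum_i a_i z^{-i})$, and the standard cepstral recursion $c_1 = a_1$, $c_i = a_i + \sum_{k=1}^{i-1} (k/i)\, c_k\, a_{i-k}$. The function names `autocorrelation`, `levinson_durbin`, `lpc` and `lpc_to_lpcc` are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r[0..order].

    Returns [a_1, ..., a_order] for the predictor x[t] ~ sum_i a_i * x[t - i].
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]               # prediction error energy
    for i in range(1, order + 1):
        if err <= 0.0:       # silent or degenerate frame: stop early
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:]


def lpc(frame, order):
    """LPC coefficients of a frame by the autocorrelation method."""
    return levinson_durbin(autocorrelation(frame, order), order)


def lpc_to_lpcc(a, n_ceps):
    """Cepstral coefficients c_1..c_n_ceps from LPC coefficients a_1..a_M."""
    m = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] is unused
    for i in range(1, n_ceps + 1):
        acc = a[i - 1] if i <= m else 0.0
        for k in range(1, i):
            if i - k <= m:
                acc += (k / i) * c[k] * a[i - k - 1]
        c[i] = acc
    return c[1:]
```

A convenient sanity check is the single-pole filter: for $a_1 = 0.5$ and no other poles, the cepstrum has the closed form $c_n = a_1^n / n$, which `lpc_to_lpcc([0.5], 3)` reproduces.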
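The frequency warpings of Eqs. (7) and (8) can be sketched in a few lines of Python. The block below is illustrative only. The helper names are hypothetical, and spacing the 18 triangular filters between 0 Hz and an upper edge `f_max_hz` chosen by the caller is an assumption, since the band edges are not stated in the text.

```python
import math


def hz_to_mel(f_hz):
    """Mel(f) = 2595 * log10(1 + f / 700), as in Eq. (7)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def hz_to_bark(f_hz):
    """Bark(f) = 13 atan(0.00076 f) + 3.5 atan((f / 7500)^2), as in Eq. (8)."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))


def mel_filter_centres(n_filters, f_max_hz):
    """Centre frequencies in Hz of n_filters triangles equally spaced in mel.

    Assumption: the bank spans 0 Hz to f_max_hz, with the outer edges of the
    first and last triangles at those two frequencies.
    """
    step = hz_to_mel(f_max_hz) / (n_filters + 1)
    return [mel_to_hz(step * (i + 1)) for i in range(n_filters)]
```

For example, `mel_filter_centres(18, 7350 / 2)` places 18 centres below the Nyquist frequency of a signal sampled at 7350 Hz; the centres are dense at low frequencies and spread out towards the upper edge, which is the intended perceptual warping.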

E. Classification with Compressed Sensing

In this section, we propose a solution to cry recognition by obtaining the sparse representation of the training cry signals through compressed sensing [19]. Each cry signal can be represented by a column vector, and we organize all the training cry data into a matrix $A$. A test cry file is contained in a vector $y$. The test data can be expressed as a linear combination of the training data. The solution vector of this problem will be sparse.

To obtain the sparse solution of the linear system

$$Ax = y \tag{9}$$

where $A$ is an $m \times n$ matrix constructed from the training data vectors, $y$ is the test cry signal vector, and $x$ is the solution of the linear system. The total number of classes will be less than $n$. If $m \gg n$, the system is called an overdetermined system. Typically, one can use the randomized Kaczmarz algorithm for feature recognition [19] in a noise-free situation. The algorithm proceeds as follows:

Input: Standardized matrix $A$ and standardized vector $y$.
Output: Estimation vector $x$ of the linear system $Ax = y$.
Set $x_0$ to a random vector, with $k = 0$.
1. Randomly select $r \in \{1, 2, \ldots, m\}$ and update the vector

$$x_{k+1} = x_k + (y_r - \langle A_r, x_k \rangle) A_r. \tag{10}$$

Repeat the iteration until the solution is obtained.

Since the ambient noise level at a hospital is high, both the training data and the test data we obtained from the collaborating hospital were polluted by noise. Therefore, we used a modified Kaczmarz algorithm to improve the performance with the noisy background [20]. In the modified Kaczmarz algorithm, the training cry signal matrix $A$ was constructed with normalized columns. Matrix $A$ is an $m \times n$ matrix. The input test cry feature vectors were also normalized. Training data with various iteration cycles were chosen to verify the performance of the modified Kaczmarz algorithm [20]. The procedure is as follows:

Input: Standardized matrix $A$, index number $k$, and standardized vector $y$.
Output: Estimation vector $x$ of the linear system $Ax = y$.
Set $x_0$ to a random vector, with $k = 0$.
1) Randomly select $r \in \{1, 2, \ldots, m\}$,
2) Set $\mu_k = \langle A_r, A_s \rangle$,

$$h_k = x_{k-1} + (y_r - \langle x_{k-1}, A_s \rangle) A_s \tag{11}$$

$$v_k = \frac{A_r - \mu_k A_s}{\sqrt{1 - |\mu_k|^2}} \tag{12}$$

$$\beta_k = \frac{y_r - \mu_k y_s}{\sqrt{1 - |\mu_k|^2}} \tag{13}$$

3) Update the vector $x_{k+1} = h_k + (\beta_k - \langle h_k, v_k \rangle) v_k$.
Repeat procedures 1-3 until the solution is obtained. This algorithm converges at an expected exponential rate.

V. EXPERIMENTS AND RESULTS

All the baby cry audio data studied in this paper were recorded in the neonatal intensive care unit (NICU) of a hospital. The probable reason for each cry signal file was given by experienced neonatal nurses and is treated as a known fact. A Shure omnidirectional SM94 microphone was used to collect infant cry signals. When the baby was crying, we placed the microphone around 6-10 inches away from the infant's mouth to pick up the cry audio signal. A Sound Digital 722 digital audio recorder was used to record infant cry signals. The sampling frequency was 44.1 kHz with a resolution of 16 bits, and the recordings were then down sampled to 7350 Hz.

The probable reason for each cry signal file was given by experienced neonatal nurses and caregivers who were able to identify the reason for a baby's cries after a bit of listening. For example, there are some observed types of newborn cries associated with different audio cues:

The "neh" sound is generally related to being "hungry". Typically, when a baby has the sucking reflex, and his/her tongue is pushed to the roof of the mouth, a "neh" sound is generated.
The "owh" sound is made in the reflex of a yawn, which means "sleepy".
The "heh" sound means "I need something", such as: being too cold, being itchy, needing a new diaper, or needing a new body position, etc.
The "eair" is a deeper sound which comes from the abdomen, so it means lower gas pain. It is usually accompanied by a newborn pulling his/her knees up or pushing down his/her legs.
The "eh" sound means that a baby needs to burp. Generally speaking, it happens after feeding.

Besides listening to cry signals, experienced personnel can confirm the reasons for different cries by considering other cues, such as gesture, facial expressions, and motion. For example, some hunger signs for newborns include fussing, lip smacking, rooting (a newborn reflex that makes babies turn their head toward your hand when you stroke their cheek), and putting their fingers to their mouth [21]. Crying caused by wet diapers can be distinguished by just checking the infant's diaper. The signs for being "sleepy" are yawning, rubbing eyes and nodding. Attention crying can be easily soothed by holding infants or interacting with them. Discomfort crying, such as that caused by an injection or blood test, could be associated with a certain medical procedure.

All the babies have their own nursing logs containing information including: age, sex, temperature, blood pressure, feeding time, diaper change time, sleep time and so on. Nurses can use the data provided as well as deductive logic to then interpret an infant's cry. For instance, if a baby was fed a few minutes ago, then their crying is most likely not due to hunger, and an infant who just woke up usually does not cry because they are sleepy.

There were 48 cries obtained from 26 infants, of which 11 were female and 15 were male. Among the samples, there were 25 Asian babies and 1 Caucasian baby. The age of these infants ranged from 3 days to 6 months. None of the

infants had a hearing impairment. Cry signals were filed under five different causes: needing a diaper (6 observations), being hungry (16 observations), needing attention (8 observations), needing sleep (8 observations), and being in discomfort (10 observations), which included injection, sputum induction and blood tests. We assumed that an infant would not change his/her mood or desire within the 20−40 second recording period.

Based on the data obtained from the babies and known facts from the experts, we listed the “discomfort” cry in the abnormal category of crying and the other cries in the normal category. We used “hungry”, “diaper”, “attention”, “sleepy” and “discomfort” as pilot features, but more features, such as tired, cold or hot, or needing a burp, can easily be added.

Signal acquisition and the numbering of cry audio files with the associated infant are shown in Table I. In the Age column of the table, d stands for day, w stands for week, and m stands for month.

TABLE I
CRY SIGNAL INFORMATION
No.  Cause              Sex  Age  Race       File
1    Diaper             F    2w   Asian      T07
2    Attention          F         Asian      T10A
3    Attention                               T34
4    Attention                               T105
5    Hungry             M    1w   Asian      T11
6    Attention                               T33
7    Hungry                                  T35
8    Hungry             M    1w   Asian      T19
9    Sleepy             M    3m   Asian      T20
10   Disturbed                               T32
11   Sleepy             F         Asian      T21
12   Sleepy                                  T23
13   Diaper             F    3d   Asian      T22
14   Inject             M    1w   Asian      T24
15   Sputum induction   M    2w              T110
16   Sleepy             M    1w   Asian      T25
17   Hungry             M    2w              T113
18   Hungry             F    3d   Asian      T26
19   Attention          F    1w              T104
20   Hungry             F    1w              T122
21   Attention          F    8d   Asian      T27
22   Uncomfortable      M    2w   Asian      T28
23   Blood test         M    3w   Asian      T109
24   Diaper             F    2w   Asian      T29
25   Attention          M    9d   Asian      T30
26   Attention                               T31
27   Diaper             M    2d   Asian      T36
28   Diaper             M    6d              T116
29   Diaper             F    9d   Asian      T37
30   Hungry             F    2w              T117
31   Other              M    1w   Asian      T106
32   Hungry             M    2w              T121
33   Hungry             M    2w   Asian      T107
34   Blood test         F    2w   Asian      T108
35   Uncomfortable      F    1m   Asian      T111
36   Hungry             F                    T124
37   Hungry             M    5d   Asian      T112
38   Hungry             M    11d  Asian      T114
39   Hungry             M    2w              T115
40   Hungry             M    2w              T120
41   Sleepy             F    8d   Asian      T118
42   Sleepy             F                    T119
43   Hungry             M    1w   Caucasian  T123
44   Uncomfortable      M    14w  Asian      T125
45   Sleepy             M                    T126
46   Hungry             M                    T127
47   Sleepy             M                    T128
48   Uncomfortable      M                    T129

We analyzed the different cry signals by using time-frequency analysis. A Hamming window of length 256 with an overlap of 128 samples and a 512-point fast Fourier transform (FFT) were used to calculate the short-time Fourier transform (STFT). Figs. 2−6 show the waveforms and STFTs (spectrograms) of different cry signals. Different categories of cry signals have visibly different waveform and spectrum characteristics.

Diaper-related crying is considered a normal cry and has a pattern of crying and silence, as shown in Fig. 2. This kind of crying starts with a cry coupled with a briefer silence, which is followed by a short high-pitched inspiratory whistle. Then, there is a brief silence followed by another cry. Fig. 3 shows attention-related crying, which is also a normal cry. This type of cry is characterized by a similar temporal sequence but can be distinguished by differences in the length of the various frequency components.

Fig. 2. Diaper: signal waveform (upper) and spectrogram (lower).
Fig. 3. Attention: signal waveform (upper) and spectrogram (lower).

Hunger-related crying, which is also a normal cry, is the most general cry. The crying not only lasts longer but is also followed by a longer silence, as shown in Fig. 4. Typically, this cry is louder and more abrupt than attention-related or diaper-related crying.
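To make the STFT settings quoted in this section concrete (Hamming window of length 256, overlap of 128 samples, 512-point FFT), the following is a minimal NumPy sketch, not the authors' implementation. The sampling rate `fs`, the synthetic 500 Hz test tone and the helper name `stft_spectrogram` are assumptions introduced purely for illustration; the text does not state them.

```python
import numpy as np

def stft_spectrogram(x, win_len=256, overlap=128, n_fft=512):
    """Magnitude STFT using a Hamming window (window, overlap and FFT size as quoted in the text)."""
    hop = win_len - overlap                      # 128-sample step between frames
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    # rfft zero-pads every 256-sample frame to 512 points -> 257 frequency bins
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)).T

fs = 8000                                        # assumed sampling rate (Hz), not from the paper
t = np.arange(fs) / fs                           # one second of samples
x = np.sin(2 * np.pi * 500 * t)                  # hypothetical stand-in for a cry recording

S = stft_spectrogram(x)                          # rows: frequency bins, columns: frames
freqs = np.fft.rfftfreq(512, d=1 / fs)
peak_hz = freqs[S.mean(axis=1).argmax()]         # dominant frequency of the test tone
print(S.shape, peak_hz)
```

With these settings each frame is zero-padded from 256 to 512 points, which gives 257 frequency bins per frame and a hop of 128 samples between consecutive frames.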
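The cry-unit segmentation used in this section relies on short time energy (STE) and short time zero crossing. As a rough sketch of that idea, and only under stated assumptions, the code below frames a signal, computes both measures per frame, and marks a "cry unit" wherever STE exceeds a fraction of its maximum. The frame length, hop, relative threshold, sampling rate and the synthetic three-burst signal are all hypothetical choices; the paper does not give its thresholds.

```python
import math

def split_frames(x, frame_len=256, hop=128):
    """Cut the signal into overlapping frames (sizes are assumptions)."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def short_time_energy(frame):
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose signs differ
    crossings = sum((a < 0) != (b < 0) for a, b in zip(frame, frame[1:]))
    return crossings / (len(frame) - 1)

def detect_cry_units(ste, rel_threshold=0.1):
    """Return (start_frame, end_frame) pairs where STE stays above the threshold."""
    threshold = rel_threshold * max(ste)
    units, start = [], None
    for k, energy in enumerate(ste):
        if energy > threshold and start is None:
            start = k
        elif energy <= threshold and start is not None:
            units.append((start, k))
            start = None
    if start is not None:
        units.append((start, len(ste)))
    return units

fs = 8000                                    # assumed sampling rate (Hz)
silence = [0.0] * 1600                       # 0.2 s pause
burst = [math.sin(2 * math.pi * 400 * n / fs) for n in range(2400)]  # 0.3 s tone
x = (silence + burst) * 3 + silence          # three synthetic "cry units"

frames = split_frames(x)
ste = [short_time_energy(f) for f in frames]
zcr = [zero_crossing_rate(f) for f in frames]
units = detect_cry_units(ste)
print(units)
```

On this synthetic input the silent frames have zero energy and no zero crossings, while frames inside a burst have both, which mirrors the qualitative behaviour the text reports for Figs. 7 and 8.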

Fig. 4. Hungry: waveform (upper) and spectrogram (lower).
Fig. 5. Sleepy: waveform (upper) and spectrogram (lower).
Fig. 6. Uncomfortable: waveform (upper) and spectrogram (lower).

Fig. 5 shows sleep-related crying, which is also a normal cry. However, it is quite different from the previous normal cry signals. Its duration is longer and, starting from a low amplitude, the cry gradually increases in loudness and then drops slowly. The silent period is also a little longer than in other cry signals.

Discomfort-related crying is shown in Fig. 6. Since it is a pain-related cry, unlike the previous normal cry signals, this cry has no preliminary moaning. The pain cry is a loud cry, followed by a period of breath holding [22].

Fig. 7. Baby cry signal, short time energy and detected cry unit for cry file T19.wav.
Fig. 8. Baby cry signal, short time zero-crossing and detected cry for T19.wav.

In this study, the 48 recording files were segmented by short time processing. Each wave file was processed to obtain “cry units”. As an example, five “hungry baby cry units” were detected from file T19.wav. Fig. 7 shows the original cry signal, the short time energy (STE) analysis and the cry units detected based on STE. By using short time zero crossing, we see that a higher zero crossing rate is associated with crying (Fig. 8). The detected cry units are the segments after the low power signal
segments were removed from the cry signals. From the 48 recording files, we obtained 151 “attention cry units”, 137 “diaper change needed cry units”, 422 “hungry cry units”, 79 “sleepy cry units” and 182 “discomfort cry units”.

Fig. 9. BFCC features for attention, diaper, hungry and discomfort cry signals.

Fig. 9 shows the BFCC features for different categories from different infants. BFCC features for attention from 4 different babies are shown in (a). Features from one infant are similar to those of other infants when they had a similar reason to cry. Subplot (b) shows the BFCC features of “diaper change needed cry units” from 4 different cry files. Again, the results show similar features for needing a diaper change across different infants. Since attention-related crying and diaper-related crying are both characterized as normal crying, their intensity levels are similar but lower than that of hunger-related crying.

Fig. 9 (c) shows the BFCC features of “hungry cry units” from 4 different babies. Hunger-related crying had the highest intensity level in the normal cry category. The BFCC features obtained from “hungry” crying are quite different from those of “attention” crying and “diaper” crying. The BFCC patterns change from the low stress level cries to the high stress level cries. There is an abrupt jump from coefficient 1 to coefficient 2, which is close to the trend of abnormal cry signals. Fig. 9 (d) shows discomfort-related crying from 4 files associated with 4 different babies. The BFCC features show a similar trend among those infants. They are quite different from normal cry signals, especially the low intensity level cries, such as diaper-related and attention-related crying. Even compared with hunger-related cry signals, the values of the coefficients were higher, which means that discomfort-related crying produced higher energy cry signals. This matches the experts’ experience.

Cry units from each class (100 “draw attention cry units”, 50 “diaper change needed cry units”, 120 discomfort cry units, and 200 “hungry cry units”) were used as training signals, and the rest of the data (51 attention, 87 diaper and 222 hungry) were used for testing purposes. Fig. 10 shows the cry units for 3 different cry signals by using LPC, LPCC, MFCC and BFCC features. Different cry signals clearly have different features.

The compressed sensing (CS) technique was used to conduct recognition and classification, and the classification rate was used to evaluate the performance. It was defined as

$$P_c = \frac{N_{\text{right}}}{N_{\text{total}}} \times 100\% \tag{14}$$

where $N_{\text{right}}$ was the number of right classifications, and $N_{\text{total}}$ was the total number of test cry units. We used sleep-related crying and hunger-related crying to test the performance of CS: 79 sleepy cry units and 200 hungry cry units were used as training data, and 100 sleepy cry units and 222 hungry cry units were used as testing data.

Comparing Fig. 5 and Fig. 6, the differences between the waveforms of the two types of cries are more pronounced in the frequency domain than in the time domain. Since LPC features were obtained only in the time domain, CS cannot distinguish hungry and sleepy cries accurately based on LPC features, as shown in Table II. However, LPCC represents LPC coefficients in the cepstral domain to reflect the differences in the biological structure of the vocal tract, and the LPCC algorithm produces different features for hungry and sleepy
cry. Nonetheless, since the BFCC and MFCC algorithms capture both time and frequency domain information of cry signals, BFCC and MFCC can produce different features for hunger-related crying and sleep-related crying, as shown in Fig. 10. As a result, BFCC and MFCC features outperform LPC and LPCC features, with a classification rate of around 70% (Table II).

Fig. 10. Different features for hungry and sleepy cry.

TABLE II
INFANT CRY RECOGNITION CORRECT RATE WITH COMPRESSED SENSING TECHNIQUE AND DIFFERENT FEATURES FOR CRY SIGNALS (SLEEPY AND HUNGRY)
          The data ratio of constructing the matrix
Features  0.4     0.5     0.6     0.7     0.8     0.9
BFCC      0.6991  0.6915  0.7067  0.6842  0.7105  0.6842
LPC       0.5133  0.4681  0.4933  0.4737  0.4211  0.5789
LPCC      0.6018  0.6064  0.6267  0.5965  0.5789  0.4737
MFCC      0.6814  0.6596  0.6767  0.7018  0.7105  0.6842

Experimental results on hungry, discomfort, attention and diaper cry signals are shown in Fig. 11 and Table III. For the same reason mentioned above, BFCC and MFCC features outperform LPC and LPCC features.

Fig. 11. Features for attention, diaper, hungry and discomfort cry.

TABLE III
INFANT CRY RECOGNITION CORRECT RATE WITH COMPRESSED SENSING TECHNIQUE AND DIFFERENT FEATURES
          The data ratio of constructing the matrix
Features  0.4     0.5     0.6     0.7     0.8     0.9
BFCC      0.5701  0.5393  0.5742  0.5754  0.5111  0.6842
LPC       0.6131  0.6225  0.5854  0.5688  0.5419  0.4667
LPCC      0.5009  0.4989  0.4986  0.4944  0.5028  0.4889
MFCC      0.5907  0.5910  0.5938  0.5502  0.5140  0.5333

We also investigated the performance in terms of the recognition rate for different features and different popular classification methods, as shown in Table IV. We found that MFCC and BFCC outperform the other features, and that the ANN and CS techniques can provide higher recognition rates. Different combinations achieve different performance. For example, LPC works well with NN, while MFCC and LPCC combined with CS achieve a higher recognition rate than with ANN or NN. The highest recognition correct rate for the infant cry application, 76.47%, is achieved by using BFCC features and ANN.

TABLE IV
INFANT CRY RECOGNITION CORRECT RATE BY USING DIFFERENT FEATURES AND RECOGNITION TECHNIQUES
Features                          LPC     LPCC    MFCC    BFCC
Nearest neighborhood (NN)         0.6384  0.4795  0.6389  0.6522
Artificial neural network (ANN)   0.5455  0.5188  0.6045  0.7647
Compressed sensing (CS)           0.5789  0.6267  0.7105  0.7064

It is obvious that there are universal, individual-independent patterns for infant cry signals. Based on the time and frequency features, it is feasible to discern between different cry units. BFCC features and CS algorithms can provide reasonable and accurate recognition capabilities. Experimental results of the proposed approach match experts’ knowledge and judgments very well.

VI. CONCLUSION
This paper presents a novel detection and recognition method for individual-independent infant cries in a noisy environment. Audio features of infant cry signals were obtained in time and frequency domains, and were used to perform infant cry language recognition. Practical data from hospitals were used to design and verify the proposed approaches. Experiments proved that the proposed infant cry unit recognition models offer accurate and promising results with far-reaching applications medically and societally. Our future research includes taking multiple features into consideration and using reinforcement learning to improve the performance. We plan to collect more data and include more cry reasons as well.

REFERENCES
[1] H. Karp, The Happiest Baby on the Block; Fully Revised and Updated Second Edition: The New Way to Calm Crying, New York City, NY, USA, 2015.
[2] J. A. Green, P. G. Whitney, and M. Potegalb, “Screaming, yelling, whining and crying: categorical and intensity differences in vocal expressions of anger and sadness in children’s tantrums,” Emotion, vol. 5, no. 11, pp. 1124−1133, Oct. 2011.
[3] Y. Kheddache and C. Tadj, “Acoustic measures of the cry characteristics of healthy newborns and newborns with pathologies,” Journal of Biomedical Science and Engineering, vol. 6, no. 8, 9 pages, 2013.
[4] L. Liu, K. Kuo, and Sen M. Kuo, “Infant cry classification integrated ANC system for infant incubators,” in Proc. IEEE International Conf. on Networking, Sensing and Control, Paris, France, 2013, pp. 383−387.
[5] L. Liu and K. Kuo, “Active noise control systems integrated with infant cry detection and classification for infant incubators,” in Proc. Acoustic, pp. 1−6, 2012.
[6] L. LaGasse, A. Neal, and M. Lester, “Assessment of infant cry: acoustic cry analysis and parental perception,” Ment Retard Dev Disabil Res Rev., vol. 11, no. 1, pp. 83−93, 2005.
[7] Várallyay Jr. György, “Future prospects of the application of the infant cry in the medicine,” Periodica Polytechnica Ser. El. Eng, vol. 50, no. 1−2, pp. 47−62, 2006.
[8] G. Buonocore and C.V. Bellieni, Neonatal Pain, Suffering, Pain and Risk of Brain Damage in the Fetus and Newborn, Berlin, Germany, Springer, 2008.
[9] L. L. LaGasse, R. Neal, and B. M. Lester, “Assessment of infant cry: acoustic cry analysis and parental perception,” Mental Retardation and Developmental Disabilities Research Reviews, vol. 11, no. 1, pp. 83−93, 2005.
[10] L. Tan and J. Jiang, Digital Signal Processing: Fundamentals and Applications (3rd edition). Cambridge, MA, USA, Academic Press, 2017.
[11] Z. Ren, K. Qian, Z. X. Zhang, V. Pandit, A. Baird, and B. Schuller,
“Deep scalogram representations for acoustic scene classification,” IEEE/CAA J. Autom. Sinica, vol. 5, no. 3, pp. 662−669, May 2018.
[12] Dong Yu and Jinyu Li, “Recent progresses in deep learning based acoustic models,” IEEE/CAA J. Autom. Sinica, vol. 4, no. 3, pp. 396−409, April 2017.
[13] B. Gold and N. Morgan, Speech and Audio Signal Processing. New York, NY, USA, John Wiley & Sons, 2011.
[14] V. R. Fisichelli, S. Karelitz, C. F. Z. Boukydis, and B. M. Lester, “The cry latencies of normal infants and those with brain damage,” Infant Crying, Plenum Press, 1985.
[15] C. F. Z. Boukydis and B. M. Lester, Infant Crying: Theoretical and Research Perspectives, Berlin, Germany, Springer Science and Business Media, 2012.
[16] S. Ludington-Hoe, X. Cong, and F. Hashemi, “Infant crying: nature, physiologic consequences, and select interventions,” Neonatal Netw., vol. 21, no. 2, pp. 29−36, Mar. 2002.
[17] P. Dunstan, Calm the Crying: The Secret Baby Language That Reveals the Hidden Meaning Behind an Infant’s Cry, New York City, NY, USA, Avery, 2012.
[18] M. Sahidullah and G. K. Saha, “Design analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition,” Speech Communication, vol. 54, no. 4, pp. 543−565, May 2012.
[19] F. Katzberg, R. Mazur, M. Maass, P. Koch, and A. Mertins, “A compressed sensing framework for dynamic sound-field measurements,” IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 26, no. 11, pp. 1962−1975, Jun. 2018.
[20] D. Needell and R. Ward, “Two-subspace projection method for coherent overdetermined systems,” Journal of Fourier Analysis and Applications, vol. 19, no. 2, pp. 256−269, April 2013.
[21] C. Lau, “Development of suck and swallow mechanisms in infants,” Ann. Nutr. Metab., vol. 7, no. 5, pp. 7−14, July 2015.
[22] P. Runefors and E. Arnbjönsson, “A sound spectrogram analysis of children’s crying after painful stimuli during the first year of life,” Folia Phoniatr. Logop., vol. 2, no. 57, pp. 90−95, Mar–Apr. 2005.

Lichuan Liu (M’06–SM’11) received the B.S. and M.S. degrees in electrical engineering in 1995 and 1998, respectively, from University of Electronic Science and Technology of China, and the Ph.D. degree in electrical engineering from New Jersey Institute of Technology, Newark, NJ in 2006. She joined Northern Illinois University in 2007 and is currently an Associate Professor of Electrical Engineering and the Director of Digital Signal Processing Laboratory. Her current research includes digital signal processing, real-time signal processing, wireless communication and networking. She has over 70 publications including 30 journal papers and one book chapter. She has three patents awarded. She has led and participated in many research grants, such as: NSF, NASA and NIH.

Wei Li (M’99) received the Ph.D. degree in electrical and computer engineering from the University of Victoria, Canada in 2004. He is currently an Assistant Professor at the Northern Illinois University, USA. His research interests include computer networks, smart grid, internet of things, applications of machine learning and artificial intelligence in e-health, computer vision and natural language processing.

Xianwen Wu (M’06) received the B.S. degree in electrical engineering from North University of China in July 2005, the M.S. degree in biomedical engineering from Southeast University in June 2010, and the M.S. degree in electrical engineering from Northern Illinois University in August 2013. He received the Ph.D. degree in electrical engineering from the University of Arkansas in December 2016 and then joined Qualcomm Inc. as a System Engineer. His research focuses on communication theory, wireless sensor networks, and signal processing.

Benjamin X. Zhou is currently pursuing a B.S. in biology at the College of New Jersey as part of the 7-year B.S/M.D. program with NJMS. He currently researches at Perelman School of Medicine at the University of Pennsylvania. His current research interests include sepsis and its effects on the immune system using animal models.
