Infant Mood Prediction and Emotion Classification With Different Intelligent Models

Abstract—In this paper, we have analysed the cries of infants aged 0 to 6 months and have tried to predict the emotions they express, the cry being a tool of communication. The present work is mainly aimed at analysing infant cries to predict the emotions of hunger, discomfort, and belly pain. The system described here involves the Mel Frequency Cepstral Coefficients (MFCC) feature extraction technique and the subsequent application of various classification models, namely Decision Tree, Random Forest, Support Vector Machine (SVM), and Logistic Regression. After comparing the results from all the mentioned classifiers, we conclude that for infant cry analysis, SVM and Random Forest classification give the most accurate output, at 91%.

Index Terms—Emotion Classification, Infant Cry, Feature Extraction, Machine Learning, Support Vector Machine.

I. INTRODUCTION

Understanding what an infant really wants when it is crying can often prove to be a difficult task [13], especially for parents. However, with advancements in different artificial intelligence algorithms, it has become easier to understand them [5], [6]. Interpreting an infant's cry correctly can help us take better care of the infant, and even prevent possible health risks and misdiagnosis arising from a wrong interpretation of the cry. This inspired us to research the possibility of predicting the mood of an infant by analysing its cry. Acoustic analysis of a new-born infant's cry can aid preliminary clinical diagnosis in a cost-effective and non-invasive manner. However, signal processing of an infant cry is more challenging than that of an adult voice, owing to a high fundamental frequency with abrupt changes of very short duration. Since every infant cry sounds indistinguishable to an adult, it is useful to introduce a classification system using machine learning to deduce, in a reliable manner, the reason why an infant is crying.

Pal, Pritam, Ananth N. Iyer, and Robert E. Yantorno [1] used image processing and K-Means clustering to recognise features of the facial expression of an infant, and analysed the audio signal of its cry, to identify the reason behind the cry. The image processing accuracy was 64%, and the audio signal accuracy was 74.2%. Yamamoto, Shota, et al. [2] used Principal Component Analysis (PCA) to detect the mood of an infant, which could prove useful in robots built to take care of babies as their caregivers; the accuracy was 62.10%. Honda, Kazuki, et al. [3] used a Maximum Likelihood approach with both frame-wise and global features to classify different infant moods into clusters, with an accuracy of 75.50%. Noroozi, Fatemeh, Neda Akrami, and Gholamreza Anbarjafari [4] used a non-linear auto-regressive time-series neural network, assuming the values of the variables are defined as a data-feedback time series. The recognition accuracy was 86.25% when Random Forest classification was used, and 60.30% when neural networks were used. Tarunika, K., R. B. Pradeeba, and P. Aruna [5] used a Deep Neural Network (DNN) and K-Nearest Neighbour (KNN) to recognise emotion from speech, especially that of a scared state of mind.
[...] being 10%. The method can be seen in Figure 1.

A. Pre-Processing

In the frequency domain, the lower and upper cut-off frequencies were determined by a bandpass filter [15]. This bandpass filter was used to remove noise from the signal. The lower and upper cut-off frequencies analysed during this test were found to be 142.9107 Hz and 6620.128 Hz respectively. This step was implemented in MATLAB R2021a. After this initial step of noise removal, the signals are processed by the machine learning algorithms in the Python programming language.
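Although the original filtering was done in MATLAB R2021a, a minimal Python sketch of the same bandpass step might look as follows. Only the two cut-off frequencies come from this section; the Butterworth design, the filter order, and the input file name are our assumptions.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, filtfilt

LOW_CUT = 142.9107    # lower cut-off frequency in Hz (from this section)
HIGH_CUT = 6620.128   # upper cut-off frequency in Hz (from this section)

def bandpass_filter(signal, sample_rate, order=4):
    """Remove noise outside the infant-cry band with a Butterworth bandpass."""
    nyquist = 0.5 * sample_rate
    b, a = butter(order, [LOW_CUT / nyquist, HIGH_CUT / nyquist], btype="band")
    # filtfilt applies the filter forwards and backwards for zero phase shift
    return filtfilt(b, a, signal)

sample_rate, cry = wavfile.read("infant_cry.wav")  # hypothetical input file
clean_cry = bandpass_filter(cry.astype(np.float64), sample_rate)
```

This assumes a mono recording sampled above twice the upper cut-off (roughly 13.3 kHz or more).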
B. Feature Extraction

The accuracy of any machine learning model depends heavily on feature selection and extraction [14]. Since the proposed model deals with acoustical analysis, Mel-Frequency Cepstral Coefficients (MFCCs) are used as the primary features in this study. MFCCs are coefficients that collectively make up an MFC (Mel-Frequency Cepstrum). They are derived from a type of cepstral representation of an input audio clip. MFCC accounts for the sensitivity of human perception at the relevant frequencies by converting the conventional frequency to the Mel scale, and is hence suitable for speech recognition tasks. Other features, such as energy, spectral centroid, and spectral skewness, could also have been used. However, in our experiments the overall accuracy dropped when more features were extracted for analysis. Hence, solely the MFCC feature was extracted, as it resulted in a higher overall accuracy than adding more features for classification.
This feature extraction technique includes pre-emphasis [7], windowing of the signal using the Hamming window, applying the Discrete Fourier Transform (DFT), taking the logarithmic value of the magnitude, transforming the frequencies onto the Mel scale, and applying the Discrete Cosine Transform (DCT). The detailed description of the various steps involved in the MFCC feature extraction technique is as follows:

1) Pre-Emphasis:
Pre-emphasis refers to filtering that emphasizes the higher frequencies. Its purpose is to balance the spectrum of voiced sounds, which have a steep roll-off in the high-frequency region. Pre-emphasis removes some of the glottal effects from the vocal tract parameters. The most commonly used pre-emphasis filter is given by the following transfer function H(z):

H(z) = 1 - b·z^(-1)    (1)

Where the value of b controls the slope of the filter and is usually between 0.4 and 1.0.
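A minimal sketch of the filter in equation (1), applied in the time domain as y(n) = x(n) - b·x(n-1). The default b = 0.97 is a common choice and is our assumption; the paper only bounds b between 0.4 and 1.0.

```python
import numpy as np

def pre_emphasis(x, b=0.97):
    """Boost the high frequencies of the cry signal x before framing."""
    return np.append(x[0], x[1:] - b * x[:-1])
```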
2) Windowing of the signal using the Hamming window:
An infant cry signal is a slowly time-varying, or quasi-stationary, signal. For stable acoustic characteristics, this signal needs to be examined over a sufficiently short period of time. Therefore, the signal analysis must always be carried out on short segments across which the infant cry signal is assumed to be stationary. For this, we have used the Hamming window, which is computed as W(n):

W(n) = 0.54 - 0.46·cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1    (2)

y(n) = W(n) · x(n)    (3)

Where N is the number of samples in each frame, x(n) is the input cry signal, and y(n) is the windowed signal. This is done to enhance the harmonics, smooth the edges, and reduce the edge effect when taking the DFT of the signal.
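A minimal framing-and-windowing sketch for equations (2) and (3). The 25 ms frame length and 10 ms hop are typical speech-processing values and are our assumptions; the paper does not state its frame parameters.

```python
import numpy as np

def frame_and_window(x, sample_rate, frame_ms=25, hop_ms=10):
    """Split the signal into short frames and apply a Hamming window to each.

    Assumes len(x) is at least one frame long.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)  # 0.54 - 0.46*cos(2*pi*n/(N-1))
    n_frames = 1 + (len(x) - frame_len) // hop_len
    frames = np.stack([x[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames * window  # y(n) = W(n) * x(n), per frame
```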
3) Discrete Fourier Transform (DFT):
Each of the above windowed frames is then converted into a magnitude spectrum by performing the DFT, computed as X(k):

X(k) = Σ_{n=0}^{N-1} x(n)·e^(-j2πnk/N),  0 ≤ k ≤ N - 1    (4)

Where N is the number of points used to compute the DFT.
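A minimal sketch of equation (4) using NumPy's FFT. The 512-point DFT size is our assumption; rfft keeps only the non-redundant half of the spectrum, since each frame is real-valued.

```python
import numpy as np

def magnitude_spectrum(frames, n_fft=512):
    """Compute |X(k)| for each windowed frame."""
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=-1))
```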
4) Transforming the frequencies onto the Mel scale:
The Mel filter output is computed as M(f):

M(f) = log(H · X(:n) + e^(-20))    (5)

Where f is the frequency in Hertz, X is the signal spectrum, and H is a triangular filter bank. The Mel scale is approximately linear in frequency below 1 kHz and logarithmic above 1 kHz. The approximation of Mel from physical frequency is computed as f_Mel:

f_Mel = 2595 · log10(1 + f / 700)    (6)
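Equation (6) can be sketched directly. As a check, 1000 Hz maps to roughly 1000 Mel, reflecting the near-linear behaviour of the scale below 1 kHz.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Approximate Mel value for a frequency given in Hertz, per equation (6)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))  # ~999.99
```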
5) Taking the logarithmic value of the magnitude and performing the Discrete Cosine Transform:
Humans are less sensitive to changes in audio signal energy at higher energies than at lower energies. The log function has a similar property: at a low input value the gradient of the log function is high, while at a high input value the gradient is low. So we apply the log to the output of the Mel filter to mimic the human hearing system. Now, the DCT is applied to the transformed Mel frequency coefficients, which gives us a set of cepstral coefficients. Since most of the signal information is represented by the first few MFCC coefficients, we extract only those initial coefficients, ignoring or truncating the higher-order DCT components. Finally, the MFCCs are computed as c(n):

c(n) = Σ_{m=0}^{M-1} log10(s(m)) · cos(πn(m - 0.5) / M),  n = 0, 1, 2, ..., C - 1    (7)

Where c(n) are the cepstral coefficients, s(m) is the output of the m-th Mel filter, M is the number of Mel filters, and C is the number of MFCCs.
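A minimal sketch of this final step, combining the log compression with the DCT of equation (7). Keeping 13 coefficients and using SciPy's orthonormal DCT-II are our assumptions, not settings confirmed by the paper.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_mel(mel_energies, n_mfcc=13):
    """mel_energies: shape (n_frames, n_filters) Mel filter bank outputs s(m)."""
    log_mel = np.log10(mel_energies + np.exp(-20))  # e^(-20) offset as in eq. (5)
    # Most of the signal information sits in the first coefficients;
    # truncate the higher-order DCT components
    return dct(log_mel, type=2, axis=-1, norm="ortho")[:, :n_mfcc]
```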
C. Classification

The extracted features are classified using different machine learning algorithms and models to study the accuracy yielded by each, as listed below [4]; a brief training sketch follows the list. Python3 was used to train the models on the extracted features, while feature extraction was performed using MATLAB R2021a.
1) Logistic Regression Model:
This type of regression is used when the dataset involves a classification task. It is frequently used when the output is binary in nature, that is, either 0 or 1. The model is usually applied to classification problems with only two classes, although multiple classes [8] were used to classify the different emotions in this case.

2) Random Forest:
A supervised learning model that is based on the Decision Tree classifier. It is made up of many Decision Trees, hence the word 'Forest' in its name. The final outcome of a Random Forest classification model is based on the different outputs obtained from the individual Decision Tree predictions; the final prediction is made by aggregating the predictions of the various Decision Trees in the Random Forest model.

3) SVM Classification Method:
Known as Support Vector Machines, these models construct hyperplanes that classify differently labelled data. The aim is to find the hyperplane that separates the different classes in the best way possible.
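As a sketch of how these classifiers might be trained and scored, the following Python code uses scikit-learn; the paper does not name its library, so the library choice, the 10% test split, the model hyperparameters, and the placeholder data (28 signals per category, as reported in the Discussion, with 13 MFCCs each) are all our assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data with the shape implied by the Discussion:
# 28 cry signals per category for three categories, 13 MFCCs each.
X = np.random.randn(84, 13)
y = np.repeat(["hungry", "belly pain", "discomfort"], 28)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, pred):.2f}, "
          f"F1={f1_score(y_test, pred, average='weighted'):.2f}")
```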
4) Decision Tree Classifier:
It is built from the dataset by checking a series of conditions. For each answer (a yes/no classification), another condition is checked for an answer. This process continues until the record reaches a conclusion, forming a tree-like pattern. The Decision Tree has three types of nodes: a root node has no incoming edges but one or more outgoing edges; an internal node has exactly one incoming edge and two outgoing edges; and a leaf (terminal) node has exactly one incoming edge but no outgoing edges.

IV. RESULTS

Different classifier models were applied to the given dataset. As seen from Tables I and II, Random Forest and SVM both performed the best, with a training accuracy of 0.93 and an F1 score of 0.96. Decision Tree gave the second highest training accuracy, 0.90, with an F1 score of 0.88. When it comes to testing performance, both Random Forest and SVM again perform the best, with an accuracy of 0.91 and an F1 score of 0.88. Logistic Regression follows closely with an accuracy of 0.88 and an F1 score of 0.86. We conclude that, between Random Forest and SVM, the better choice would be Random Forest, as it reduces overfitting of the dataset and improves precision. The SVM classifier can be used as well; however, it is not suitable when there are too many training examples, and hence the better choice remains Random Forest. Figures 2 and 3 are a graphical representation of Tables I and II.

TABLE I
TRAINING PERFORMANCE: RANDOM FOREST AND SVM CLASSIFIER MODELS HAVE THE HIGHEST ACCURACY

Algorithm            Accuracy  F1 Score
Random Forest        0.93      0.96
Logistic Regression  0.89      0.91
Decision Tree        0.90      0.88
SVM                  0.93      0.96

TABLE II
TESTING PERFORMANCE: RANDOM FOREST AND SVM CLASSIFIER MODELS HAVE THE HIGHEST ACCURACY

Algorithm            Accuracy  F1 Score
Random Forest        0.91      0.88
Logistic Regression  0.88      0.86
Decision Tree        0.85      0.84
SVM                  0.91      0.88

Fig. 2. Accuracy and F1 score were plotted as performance parameters (y-axis; unitless) of models (x-axis) on training data.

Fig. 3. Accuracy and F1 score were plotted as performance parameters (y-axis; unitless) of models (x-axis) on testing data.

V. CONCLUSION

From this we can conclude that, for the chosen dataset, the Random Forest and SVM classification models are the most appropriate, as they give the highest accuracy amongst all the experimented classifier models.

Quite often, a parent is not able to correctly guess why their infant is crying and might not be able to
put them at ease by lulling them to sleep or by comforting them. This model was made in a way to detect three infant moods: hunger, belly pain, and discomfort. We believe that this would make it easier for parents to understand the reason behind their infant's cry and treat them appropriately, aiding in better infant care. Using this method, parents could correctly identify the mood of their infants, and consequently child care personnel could take care of the infant effectively.
VI. DISCUSSION

As observed, the best possible accuracy is given by both the SVM classification model and the Random Forest classification model. Yet Random Forest classification is preferred, as there is less overfitting and higher precision. Even though we had refined the audio signals from the available dataset, there were many unavoidable scenarios, such as a very low-volume adult speaking in the background, inconsistent crying, a television or other sounds playing in the background along with the infant's cry, wrongly tagged infant cry sounds in the dataset, and so on. Sometimes even the ground truth is difficult to determine, because babies cannot communicate the reason, and the ground truth depends on the subjective assessment of the parents. These drawbacks reduced our training data scope to 28 infant cry signals per category for hungry, belly pain, and discomfort. The remaining categories, such as cold or hot, lonely, needs, burping, scared, tired, and 'don't know', had even fewer samples available and hence were not considered in our analysis. Since this idea and its development are at a very preliminary stage, no other datasets were available that could be used.

ACKNOWLEDGMENT

The authors want to express their sincere thanks and gratitude to Vellore Institute of Technology, Vellore, India, for the support and the resources provided to carry out this research.

REFERENCES

[1] Pal, P., Iyer, A.N. and Yantorno, R.E., 2006, May. Emotion detection from infant facial expressions and cries. In 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings (Vol. 2, pp. II-II). IEEE.
[2] Yamamoto, S., Yoshitomi, Y., Tabuse, M., Kushida, K. and Asada, T., 2013. Recognition of a baby's emotional cry towards robotics baby caregiver. International Journal of Advanced Robotic Systems, 10(2), p.86.
[3] Honda, K., Kitahara, K., Matsunaga, S., Yamashita, M. and Shinohara, K., 2012, December. Emotion classification of infant cries with consideration for local and global features. In Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (pp. 1-4). IEEE.
[4] Noroozi, F., Akrami, N. and Anbarjafari, G., 2017, May. Speech-based emotion recognition and next reaction prediction. In 2017 25th Signal Processing and Communications Applications Conference (SIU) (pp. 1-4). IEEE.
[5] Tarunika, K., Pradeeba, R.B. and Aruna, P., 2018, July. Applying machine learning techniques for speech emotion recognition. In 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT) (pp. 1-5). IEEE.
[6] Cai, L., Jiang, C., Wang, Z., Zhao, L. and Zou, C., 2003, December. A method combining the global and time series structure features for emotion recognition in speech. In International Conference on Neural Networks and Signal Processing, 2003. Proceedings of the 2003 (Vol. 2, pp. 904-907). IEEE.
[7] Rajak, R. and Mall, R., 2019, October. Emotion recognition from audio, dimensional and discrete categorization using CNNs. In TENCON 2019-2019 IEEE Region 10 Conference (TENCON) (pp. 301-305). IEEE.
[8] Ashrafidoost, R., Setayeshi, S. and Sharifi, A., 2016, November. Recognizing Emotional State Changes Using Speech Processing. In 2016 European Modelling Symposium (EMS) (pp. 41-46). IEEE.
[9] Sharma, S. and Mittal, V.K., 2017, November. Infant cry analysis of cry signal segments towards identifying the cry-cause factors. In TENCON 2017-2017 IEEE Region 10 Conference (pp. 3105-3110). IEEE.
[10] Mittal, V.K., 2016, October. Discriminating features of infant cry acoustic signal for automated detection of cause of crying. In 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP) (pp. 1-5). IEEE.
[11] Chang, C.Y., Hsiao, Y.C. and Chen, S.T., 2015, September. Application of incremental SVM learning for infant cries recognition. In 2015 18th International Conference on Network-Based Information Systems (pp. 607-610). IEEE.
[12] Osmani, A., Hamidi, M. and Chibani, A., 2017, November. Machine learning approach for infant cry interpretation. In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI) (pp. 182-186). IEEE.
[13] Teeravajanadet, K., Siwilai, N., Thanaselanggul, K., Ponsiricharoenphan, N., Tungjitkusolmun, S. and Phasukkit, P., 2019, November. An Infant Cry Recognition based on Convolutional Neural Network Method. In 2019 12th Biomedical Engineering International Conference (BMEiCON) (pp. 1-4). IEEE.
[14] Limantoro, W.S., Fatichah, C. and Yuhana, U.L., 2016, October. Application development for recognizing type of infant's cry sound. In 2016 International Conference on Information & Communication Technology and Systems (ICTS) (pp. 157-161). IEEE.
[15] Kuo, K., 2010, May. Feature extraction and recognition of infant cries. In 2010 IEEE International Conference on Electro/Information Technology (pp. 1-5). IEEE.
[16] Sharma, K., Gupta, C. and Gupta, S., 2019, July. Infant Weeping Calls Decoder using Statistical Feature Extraction and Gaussian Mixture Models. In 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT) (pp. 1-6). IEEE.