Computer Vision-Based Assessment of Autistic Children: Analyzing Interactions, Emotions, Human Pose, and Life Skills
ABSTRACT In this paper, we implement and test computer vision applications that perform skill and emotion assessment of children with Autism Spectrum Disorder (ASD) by extracting various bio-behaviors, human activities, child-therapist interactions, and joint pose estimations from recorded videos of interactive single- or two-person play-based intervention sessions. A comprehensive dataset of 300 videos of ASD children engaged in social interaction is amassed, and three novel deep learning-based vision models are developed: (i) an activity comprehension model that analyzes child-play partner interactions; (ii) an automatic joint attention recognition framework using head and hand pose; and (iii) an emotion and facial expression recognition model. The proposed models are also tested on 68 unseen real-world videos of children captured in the clinic and on public datasets. The activity comprehension model has an overall accuracy of 72.32%, the joint attention recognition models have an accuracy of 97% for following eye gaze and 93.4% for hand pointing, and the facial expression recognition model has an overall accuracy of 95.1%. The proposed models can extract behaviors of interest, activity events, emotions, and social skills from long free-play and intervention session videos and provide temporal plots for session monitoring and assessment, thus empowering clinicians with insightful data useful in the diagnosis, assessment, treatment formulation, and monitoring of ASD children with limited supervision.
INDEX TERMS Autism spectrum disorder, activity comprehension, facial expressions, joint attention, ASD
screening, applied behavior analysis.
Behavioral methods are the gold standard for diagnosing ASD in children, entailing the physician documenting the patient's medical history, interviewing the parents, and manually observing the child's behavior. These observations are recorded as detailed in the instruction guidelines for diagnostic rating scale instruments such as the Autism Diagnostic Observation Schedule (ADOS) and the Childhood Autism Rating Scale (CARS-2) [2], [3]. The rating scale usually captures a child's skills in social engagement, joint attention, emotional expression, instruction following, play and life skills, imitation abilities, and visual attention. The diagnosis is established if the observation scores cumulatively exceed a predetermined threshold. After diagnosis, a functional assessment is conducted using instruments such as VB-MAPP [4] to build a personalized intervention program that can improve the necessary skills of ASD children for their school and societal inclusion. The functional assessment includes detailed observations and measurements of children's skills in various domains such as independent play, social communication, self-stimulatory behavior, joint attention, imitation, and understanding of emotions through facial expressions [5], as well as other necessary skills of ASD children [6].

However, conventional diagnostic and functional assessment methods have several limitations. Firstly, the interpretative coding of a child's observed behavior is manual and time-consuming. Secondly, a clinician's observations may not always be reliable or valid owing to differences in professional training, experience, available resources, and cultural backgrounds. Thirdly, there is a huge demand-supply mismatch between the number of available professionals and the nearly 2% of newborn children diagnosed with ASD [1]. These challenges are exacerbated in Low- and Middle-Income Countries (LMICs) [1], [7], [8], where there is a severe shortage of clinicians and poor infrastructure to manage ASD conditions. Therefore, new technological methods for rapid and automatic data collection and analysis can enhance clinician capacity and improve quality, affordability, and accessibility in ASD detection and assessments.

Technology has demonstrated significant benefits by employing Machine Learning (ML) and Deep Learning (DL) for early diagnosis and functional assessments of ASD [9], [10]. ML has uncovered essential and minimal features [11], [12] of ASD diagnostic instruments such as the Autism Diagnostic Observation Schedule (ADOS) [2] and the Autism Diagnostic Interview-Revised (ADI-R) [13], thereby accelerating the diagnosis procedure without compromising accuracy [14], [15]. ML and DL methods can analyze an unprecedented quantity of multimodal and multidimensional clinical data from videos, images, texts, voice messages, and sensors owing to the rapid evolution of technology and digitization [9]. The analysis can suggest patterns, aid in the development of clinical decision support systems to diagnose ASD or developmental delays, and provide suggestions for treatment and personalization, enhancing the clinician's capacity. Earlier studies on ASD screening developed a multimodal approach with video annotation performed by humans [12], [39], [45]; however, very little work has been done on the automatic extraction and classification of human actions from untrimmed videos for ASD detection [46]. State-of-the-art ML and DL methods have improved the quality of, outcomes of, and access to ASD screening, diagnosis [12], and assessments [39]. Researchers have trained supervised learning ML models on multimodal data to develop ASD screening and diagnosis [39] solutions with moderate to high psychometric outcomes in minimal time, ensuring their internal validity. These solutions have focused on detecting children with ASD and ODD [12] on cross-cultural datasets.

In the past decade, computer vision-based behavior imaging and facial analysis have shown promising results in assisting clinicians with the diagnosis of multiple medical conditions, including ASD [16], [17], [18]. Moreover, computer vision-based methods can offer an accurate, low-cost, and non-invasive alternative to traditional labor-intensive manual assessments and invasive methods such as electroencephalography (EEG) [19].

Even though computer vision has demonstrated many promising solutions, its application in assessing behavior, play, imitation and life skills, posture, and gait to evaluate the joint attention of ASD children has not yet been explored [20], [21], [22]. In addition, there are no large-scale efforts to develop facial expression recognition models or to detect the joint attention skills of young children from real-time videos. Therefore, we address these issues by developing novel computer vision models that extract and classify joint attention skills, facial expressions, and life skills from untrimmed videos of ASD children and assist the clinician in diagnosing ASD or establishing the functional assessment of ASD children.

(i) To assess children's joint attention skills automatically, we developed computer vision models that analyze postural changes in response to instructions or stimuli given by the clinician.
(ii) To recognize nine emotional expressions, namely anger, disgust, fear, happiness, sadness, surprise, laughter, crying, and neutral, for children aged 1 to 5, we developed a Facial Expression Recognition (FER) model by gathering extensive facial images from diverse ethnic and cultural backgrounds.
(iii) To perform an automatic functional assessment of children from their intervention session videos, we assess their engagement duration and frequency with clinicians, parents, or play partners across ten life-skill activities, namely run, sit, stand, engagement, instruction engagement, hit or fight someone, watch someone, hold an object or oblique toys, walk, and answer the phone.

The paper is organized as follows: Section II briefly describes state-of-the-art computer vision methods used in ASD management. Section III provides the details of the study procedure, and Sub-section III-A provides a detailed description of the problem and answers the questions raised in developing video-driven assessments. Section IV describes the data collection procedure and the technological methodology used to realize the study aims. Section V provides a detailed evaluation and the results of our models on real-time videos. In Section VI, we discuss the interpretation of the results, practical implications, limitations, and future directions, and in Section VII we conclude.

II. LITERATURE REVIEW
This section discusses relevant studies, state-of-the-art computer vision methods, implementation challenges, and improvement options. Sub-section II-A summarizes the current state of ASD assessment and intervention methods and new studies incorporating ML and DL models into ASD diagnosis and therapy. Sub-section II-B discusses state-of-the-art computer vision methods deployed for Human Pose Estimation. Sub-section II-C discusses the importance of joint attention skills and data-driven assessment techniques. Sub-section II-D describes the significance of facial expression recognition in ASD assessment and treatment planning. Finally, Sub-section II-E summarizes state-of-the-art human action recognition methods and their applications in assessment and treatment formulation for ASD.

A. ASD TREATMENT
Most evidence-based ASD intervention methods can enhance the child's ability, especially in the first three years [23]. However, the demand for professionally trained therapists has outpaced the supply; consequently, clinicians' availability and cost-effectiveness are crucial for promoting treatment accessibility. Cognitive Behavioral Therapy (CBT) is a behavioral intervention that can help individuals with ASD to achieve their goals and change their lifestyles [21], [24], [25].

Applied Behavior Analysis (ABA) is a gold-standard intervention widely used to assist ASD individuals with behavioral and communication challenges by promoting desirable social behaviors [26], such as overcoming food intolerance, improving intelligence quotient (IQ) and social communication, and teaching play and life skills using principles of reinforcement [27]. A higher quality of life for ASD children can be foreseen through early diagnosis followed by evidence-based treatment methods [20], [28]. An accurate diagnostic and functional evaluation is essential to identify the child's areas of strength so that an intervention program can be customized to the child's unique needs. ADOS [2], ADI-R [13], the Modified Checklist for Autism in Toddlers, Revised, with Follow-Up (M-CHAT-R/F) [29], and the Childhood Autism Rating Scale-2 (CARS-2) [30] are a few widely used gold-standard ASD diagnostic and screening instruments developed in Western countries [31], [32], [33]. Consequently, the outcomes of these assessments have limited efficacy when employed in LMICs due to a lack of training and cultural disparities [34], [35], [36], [37].

Artificial intelligence (AI) technology, especially ML and DL, can address these limitations owing to unique facets such as the increased processing power of computer hardware and multimodal data availability, thereby leading to faster ASD diagnosis [38]. Recently, a clinical study of multi-modular ML-based ASD diagnosis based on questionnaires and home videos demonstrated a sensitivity of 90% towards ASD detection [39]. Some of the other improvements that have been witnessed with the application of AI are: (i) detection of ASD at an early age, (ii) reduction in the number of assessment items as a result of implementing feature reduction methods, (iii) effective classification between ASD, Typically Developing (TD), and other neurodevelopmental disorders, and (iv) automatic feature extraction of bio-behaviors from multimodal data [9], [40].

Due to the availability of multimodal data from diverse bio-behavioral sources, such as videos containing ASD behavioral features [12], [41], audio [42], facial expressions [43], and Electronic Health Record (EHR) data [44], DL applications trained on unstructured data have accelerated the detection and management of ASD and can be implemented at the point of care [9], [12], [41], [45], [46], [47]. The feasibility of therapeutic intervention and prognosis leveraging AI has shown reasonable success [48] for ASD and other neurodevelopmental disorders. Furthermore, individualized socially assistive robotic intervention and automation based on engagement analysis have aided the development of a low-cost, robot-based therapeutic framework for ASD children [49].

However, most studies focus on only one of the seven key data categories, namely stereotyped behaviors, eye gaze, facial expressions, postural analyses, motor movements, auditory data, and electronic health records [9], adopting ML and DL techniques with Graphical Processing Units (GPUs) and high-performance cloud capabilities [9]. To the best of our knowledge, this is the first study to employ computer vision to extract data from various bio-behaviors, including play, engagement, facial expression, and joint-attention abilities.

B. HUMAN POSE ESTIMATION
Computer-vision-based Human Pose Estimation (HPE) methods, ranging from conventional and instance-based pose estimation models to novel deep network architectures, can detect human body poses in 2D or 3D space by regressing skeletal joint angles or critical points using single-view or multi-view cameras with monocular or depth modalities [50], [51], [52], [53], [54], [55]. In addition, developing computer vision applications for specific task measurements requires accurate measurement of both human body joints and their parts. Head pose estimation involves predicting head orientation and assessing human attention, while hand detection and tracking provide a fine-grained estimation of hand posture for regressing skeletal finger points and for gesture recognition tasks [50]. However, most pose estimation algorithms are designed for adults or pedestrians, and few solutions have focused on special-needs children or pediatric healthcare [56], [57]. Several significant constraints prevail in the development and deployment of HPE methods for various child-specific problems in managing ASD conditions, as follows: (i) data security, privacy, and ethical challenges, (ii) expensive data collection, (iii) the manual data annotation process, (iv) camera calibration and setup, and (v) single, targeted solutions to a specific problem [52]. Current human pose estimation methods are designed to track specific movements and activities and may not be able to capture a broad variety of child behaviors or activities. For instance, head pose estimation may be ineffective for monitoring social engagement or other nonverbal indicators. Human pose estimation methods are also not always accurate, particularly when monitoring the movements of children, whose movements are smaller and more rapid. Additionally, children are more likely to make sudden, unpredictable movements that can be difficult to monitor accurately. To circumvent these problems, our goal is to develop dedicated models for tracking hands and heads that work well for adults as well as toddlers.

C. JOINT ATTENTION
Joint attention (JA) is a social communication method of engaging one's attention with another person using objects and gestures. Limited JA skills are one of the earliest indicators of ASD; JA necessitates capturing, sustaining, and transferring attention and fosters the growth of essential social abilities, such as engaging with others and understanding their perspective [14]. A few works have implemented DL classification models for evaluating joint attention in individuals with ASD by utilizing short video clips of joint attention initiation; such systems for evaluating joint attention aid in the early detection of and intervention for ASD. Similarly, a vision-based joint attention detection system for ASD using eye-tracking technology showed good accuracy in detecting joint attention among non-ASD adults. An automated tool called RJAfinder has been developed to quantify responding-to-joint-attention (RJA) behaviors in ASD using eye-tracking data; RJAfinder can compare RJA events among ASD children, typically developing children, and adults, and it finds that ASD children display fewer RJA events than the other two groups. Cazzato et al. [58] examined how robot-assisted therapy affected the social interactions of children with ASD and used expensive depth cameras to aid non-invasive JA evaluation. A few studies have used eye-tracking technology to investigate eye-gaze accuracy, fixation, eye transitions, and eye movements during technology-aided JA assessments [15], [59], [60], [61].

D. FACIAL EXPRESSION RECOGNITION AND EYE CONTACT
The intensity and frequency of eye contact and facial expressions can facilitate verbal and non-verbal communication between individuals. Maintaining eye contact can be distressing for some ASD individuals, leading to social anxiety. Consequently, the capacity to imitate and comprehend facial emotions is crucial for the social functioning of any individual. ASD children have difficulty understanding and responding to nonverbal cues and recognizing and comprehending facial expressions and emotions. Carpenter et al. [43] extracted positive, neutral, and other facial landmarks from a database of 3D facial expressions utilizing a trained computer vision model and discovered that children with ASD have more neutral facial expressions, which corresponds to the fact that facial expression imitation is an essential indicator of social interaction skills. Zhao et al. [62] implemented a DL model to recognize facial expressions by utilizing multiple databases while training it with the facial expressions of sixteen Chinese children. The experimental results of Zhao et al. show that the ASD group's average imitation expression is less than 60%, a significant deterministic threshold for ASD.

Alvari et al. [63] examined facial expressions using the Facial Action Coding System (FACS) and extracted the intricate dynamics of ASD and TD children's social smiles from home recordings. The findings of Alvari et al. suggest that ASD children exhibit less happiness than TD children in their first years, confirming that ASD children have difficulty distinguishing faces and take a long time to comprehend facial expressions. Deep learning-based facial expression recognition (FER) has been explored with numerous architectures, such as convolutional neural networks, deep belief networks, autoencoders, generative adversarial networks, and ensembles of networks. These architectures performed best on a variety of benchmark datasets because they addressed the two important issues of overfitting and expression-unrelated variations.

E. ACTIVITY RECOGNITION
Activity recognition identifies significant events of interest in vast video datasets [64], [65], [66], [67]. Earlier techniques employed human posture traits [68], feature descriptors [69], and appearance-based dense trajectories compensated for camera movement [70]. However, ML and computer vision (CV) have improved upon various aspects of human visual perception to find clinically meaningful patterns in images and videos and to classify activities of interest for diagnosing and functionally assessing ASD children [12], [71], [72], [73]. One of the obstacles to applying CV in ASD detection and management, however, is the high labor cost and downtime associated with manual video annotation. Furthermore, owing to the computationally intensive description and monitoring of motion data from real-time feeds, activity monitoring can suffer from low generalizability because of potential tracking failures in non-neural-network-based systems. Hence, we propose a novel DL model that addresses these limitations by training on a limited publicly available dataset of action classes relevant to ASD diagnosis and assessments.
FIGURE 1. An overview of study procedures for model development, real-world testing, and
performance evaluation.
(by counting the number of person bounding boxes) and their associated activities. Hence, there is a need to segregate the activities of the child and the play partner to draw logical conclusions for the assessment, which is done by the activity separation method explained in the next section.

2) ACTIVITY SEPARATION
To distinguish a child from a play partner, we develop a child detector model. Knowing the child's location is necessary because we infer its actions from the features extracted from subsequent frames. The child detector model is trained using 3027 annotated images of children (see Table 2) by fine-tuning weights from Faster R-CNN with ResNet-50 (v1) [91] to obtain a mAP@0.5 score of 0.94. As a result, a child is distinguished from a play partner, and child features are extracted. Table 3 summarizes the hyperparameters used and the results of the detector model evaluation. The child detection model predicts the child's location in each video frame and produces a bounding box for the child. We store the detected bounding boxes along with their time intervals.

We compute the Intersection over Union (IoU) between the detected boxes, that is, between the person detection boxes (the accumulated child and play partner bounding boxes from subsection IV-B1) and the child detection box (from subsection IV-B2). The IoU of two detection boxes is defined as the ratio of the overlapped area to the area of the union of the two bounding boxes (Equation 1).

IoU = Area of Overlap / Area of Union    (1)

We compute the IoUs between each pair of axis-aligned bounding boxes, i.e., one IoU between the child detector bounding box and each of the bounding boxes from the person detector. We select an IoU score of >= 0.75 as a good threshold for locating the child and perform the following two checks to ensure that the located bounding box is the correct one among the several bounding boxes obtained from the activity prediction model. First, the center coordinates (in pixels) of both bounding boxes with a good IoU match are calculated. We then check whether each center coordinate lies in the left or the right half of the image (distinguished by the central vertical axis), compare the halves in which the two centers lie, and record the result. Next, we calculate the Euclidean distance dc between the center coordinates of a good IoU match (Equation 2). The smaller the distance between the two center coordinates, the higher the chance that the person bounding box encompasses the child. From the experiments, we choose a distance threshold of 20 pixels and then select the person bounding box with a good IoU match that agrees with both the quadrant rule and the distance measure. We now know the child's location and can therefore recover the child's activities from the matched bounding box.
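The IoU, quadrant, and distance checks described above can be expressed compactly in code. The following is a minimal illustrative Python sketch rather than the authors' implementation; the box format, function names, and helper structure are assumptions, while the thresholds mirror the values quoted in the text (IoU >= 0.75, 20-pixel distance).

```python
import math

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) in pixel coordinates.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def match_child_box(child_box, person_boxes, frame_width,
                    iou_thresh=0.75, dist_thresh=20.0):
    """Pick the person-detector box that most likely contains the child.

    A candidate must (i) overlap the child-detector box with IoU >= 0.75,
    (ii) fall in the same left/right half of the frame (quadrant rule), and
    (iii) have its center within 20 pixels of the child box center (Equation 2).
    """
    cx, cy = center(child_box)
    child_half = 0 if cx < frame_width / 2 else 1
    best, best_dist = None, float("inf")
    for box in person_boxes:
        if iou(child_box, box) < iou_thresh:
            continue
        px, py = center(box)
        if (0 if px < frame_width / 2 else 1) != child_half:
            continue
        dc = math.hypot(cx - px, cy - py)  # Euclidean distance between centers
        if dc <= dist_thresh and dc < best_dist:
            best, best_dist = box, dc
    return best
```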
The activity comprehension model is used to understand a child's interactions, activity level, and attention with a play partner by analyzing ten activities, including running, sitting, standing, engagement, instruction engagement, hitting/fighting someone, watching someone, holding objects/oblique toys, walking, and answering the phone. Any user who wants to assess engagement and understand an ABA session can upload an ABA video and obtain predictions. These predictions are the different activity classes predicted by a spatiotemporal action recognition model (subsections IV-B1 and IV-B2). We obtain predictions for the entire video length, with the specific activities shown as a scatter plot at time intervals of seconds. The model can efficiently assess videos recorded with a tripod-mounted camera in clinical sessions as well as home videos recorded with a hand-held mobile camera. ABA sessions analyzed with the Activity Comprehension model will help the learners concentrate on learning goals and help the clinicians or therapists augment their decisions on ABA session outcomes.

The model also assesses the play partner and the learner for the various activities shown in Table 1. A video can be uploaded to analyze any of these activities, and a scatter plot of engagement and non-engagement over time is generated for the respective inputs. This scatter plot conveys crucial information through a frame-by-frame analysis of the learner's and the play partner's activities during the session. Each point shows the action class with a timestamp. The discontinuities in the scatter plot indicate the time intervals of no particular interest to the ABA outcomes. Therefore, these scatter plot patterns can easily be used to trace the success of therapy and intervention delivery by the therapist. The assessment recordings and scatter plot predictions of the model can indicate a child's skill level, leading to the line of treatment. For instance, the therapist would be prepared to deal with a violent child if the model had predicted hitting activity (see Fig. 5). The therapist can then work closely on various attributes of the child's behavior using the prior predictions reported by the model. The scatter plot not only highlights the presence of a particular class of activity but also marks its absence (if any) in a learner's or therapist's behavior. For a child showing less attentive behavior towards the play partner's commands (i.e., lower engagement), the therapist can proactively work on the weaker behavior attributes, strengthen them in the upcoming sessions, and track progress for specific activity classes. In this way, the ABA video activity recognition model is of great clinical and therapeutic significance.

Activities such as hitting, running, walking, and repetitive behavior indicate a low level of attention by the learner towards the commands of the play partner. Thus, with a plot indicating such results, the customization and prognosis of the therapy can be decided (see Fig. 6). Similarly, if the plot of the play partner/therapist points towards using a phone, the play partner could be replaced or evaluated for their actions accordingly. Fig. 7 shows pie charts of the percentages of activities for a child and therapist after conducting the ABA session. To conclude, the model aids in diagnosis, prognosis, customization of the therapy, and evaluation of the learner's and therapist's progress and activities, and so it is vital for ABA sessions and the treatment of ASD children.
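A per-session scatter plot such as the one described above can be produced directly from the per-frame activity predictions. The snippet below is an illustrative matplotlib sketch, not the authors' tooling; the prediction record format (timestamp, actor, activity label) is an assumption.

```python
import matplotlib.pyplot as plt

def plot_session(predictions, activity_classes):
    """Scatter plot of predicted activity class vs. time for child and play partner.

    `predictions` is assumed to be a list of (timestamp_sec, actor, label) tuples,
    where actor is "child" or "play_partner" and label is one of `activity_classes`.
    """
    y_index = {label: i for i, label in enumerate(activity_classes)}
    fig, ax = plt.subplots(figsize=(10, 4))
    for actor, marker in (("child", "o"), ("play_partner", "x")):
        pts = [(t, y_index[label]) for t, a, label in predictions if a == actor]
        if pts:
            xs, ys = zip(*pts)
            ax.scatter(xs, ys, marker=marker, label=actor, s=12)
    ax.set_yticks(range(len(activity_classes)))
    ax.set_yticklabels(activity_classes)
    ax.set_xlabel("Session time (s)")
    ax.set_ylabel("Predicted activity")
    ax.legend()
    fig.tight_layout()
    return fig
```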
C. FACIAL EXPRESSION RECOGNITION MODEL
The experiments are conducted on the well-known FER2013 public dataset, which contains a collection of 35887 greyscale (48 x 48) images in total [92], and on other publicly accumulated datasets. In the FER2013 dataset, each image, gathered via the Google image search API, is labeled with one of seven categories: anger, disgust, fear, happy, sad, surprise, and neutral. However, the dataset contains several faulty samples (e.g., non-face photos or images with faces wrongly cropped), and the distribution of images among emotion categories is not uniform: there are almost 6,000 photographs depicting happiness but only about 500 depicting disgust. Additionally, as none of the datasets had images of teenagers and toddlers crying or laughing, we compiled images from the popular action recognition datasets (Kinetics [67], Moments in Time [93], HMDB [66]) and from our video dataset of 300 ASD children. We accumulated 9882 images of toddlers and teenagers crying and 10268 images of them laughing. The final enhanced dataset contains 51037 training images, 3000 images for validation, and 2000 images for testing (see Table 2). We trained a ResNet-34 backbone-based facial expression recognition model for nine output expression classes, namely anger, disgust, fear, happy, sad, surprise, laugh, cry, and neutral. We train a residual masking network of four primary residual masking blocks, where each Residual Masking Block comprises a Residual Layer and a Masking Block acting on different feature sizes. A 3x3 convolutional layer with stride 2 first processes a 224x224 input image, followed by a 2x2 max-pooling layer, resulting in a spatial size reduction to 56x56. The corresponding forward layers of the four residual masking blocks generate feature maps of four spatial sizes (56x56, 28x28, 14x14, and 7x7) from the feature maps produced by the preceding pooling layer. The network ends with an average pooling layer and a 9-way fully connected softmax layer producing outputs for the 9 facial expression classes ("Angry," "Sad," "Fear," "Happy," "Surprise," "Cry," "Disgust," "Laugh," "Neutral").
FIGURE 6. A regular output scatter plot of child and play partner to analyze per
session response.
The model is trained for 250 epochs with a batch size of 48 using the SGD optimizer, with a learning rate decay of 0.9 and a weight decay of 5e-4. The learning rate is set to 0.001. Before training, the original training images are resized to 224 x 224 and converted to RGB to support ImageNet pre-trained models. In addition, the training images are augmented to prevent overfitting; the augmentation techniques include left-right flipping, brightness variation, and rotation within the interval [-30, 30] degrees.

Accuracy is the evaluation metric for the classification tasks (see Appendix Table 13). The accuracy of the model on the validation set images is 74.4%, and the accuracy on the test images is 73.9%. The confusion matrix is shown in Fig. 8. The classes with the highest scores are Happy, Sad, Surprise, and Neutral, while those with the lowest scores are Laughing, Fear, and Disgust. The evaluation scores per class for the 20 ASD test videos are listed in Table 4.

The model is deployed on a Linux server with one NVIDIA V100 GPU to provide real-time facial expression assessment. Fig. 9 illustrates a child's monitoring and assessment with the facial expression recognition model.
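As a rough illustration of the training recipe described above (224x224 RGB inputs, SGD with a 0.001 learning rate, batch size 48, 250 epochs, and flip/brightness/rotation augmentation), the PyTorch-style sketch below shows one way such a setup could be wired together. It uses a plain torchvision ResNet-34 as a stand-in for the residual masking network, so the architecture, dataset paths, and scheduler choice are assumptions rather than the authors' exact code.

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

NUM_CLASSES = 9  # angry, disgust, fear, happy, sad, surprise, laugh, cry, neutral

train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),           # left-right flipping
    transforms.ColorJitter(brightness=0.3),      # brightness variation (factor assumed)
    transforms.RandomRotation(30),               # rotation in [-30, 30] degrees
    transforms.ToTensor(),
])

# Hypothetical folder layout: one sub-directory per expression class.
train_set = datasets.ImageFolder("fer_train", transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=48, shuffle=True, num_workers=4)

model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)   # 9-way classification head

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)  # 0.9 decay
criterion = nn.CrossEntropyLoss()

for epoch in range(250):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```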
TABLE 4. Evaluation metrics per class on the test set of the facial expression recognition model.
FIGURE 9. Session monitoring and assessment with facial expression recognition model.
FIGURE 11. Illustration of joint attention models with follow gaze and hand pointing
blocks.
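Following the pipeline summarized in Fig. 11 and later in the discussion (head detection, child detection, then head pose estimation via Euler angles), a simple rule for flagging a "follow gaze" response could look like the sketch below. This is an illustrative heuristic only; the angle convention, threshold, and function name are assumptions, not the authors' parameters.

```python
def is_following_gaze(head_center, target_center, yaw_deg, min_yaw_deg=15.0):
    """Crude 'follow gaze' check from a head-pose yaw angle (assumed in degrees,
    positive when the head turns toward the right side of the image).

    If the target indicated by the play partner lies to the child's left or right
    in the frame, the yaw should have turned in that direction by at least
    `min_yaw_deg`; otherwise the head should be roughly frontal.
    """
    dx = target_center[0] - head_center[0]
    if dx > 0:                            # target to the right in the image
        return yaw_deg >= min_yaw_deg
    if dx < 0:                            # target to the left in the image
        return yaw_deg <= -min_yaw_deg
    return abs(yaw_deg) < min_yaw_deg     # target straight ahead
```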
The play partner asks questions such as "Where is your mother?", "Where is the toy?" or "Where is the animal picture in the book?", and the child will either point with their index finger and a closed thumb or point with their index finger and an open thumb. Fig. 11 illustrates different orientations of finger pointing in the Joint Attention Pointing to someone/something block. We collected only hand/finger-pointing samples from various hand gesture and hand-pointing annotated datasets [102], [103], [104], [105]. We trained a finger-pointing detector using 24369 annotated images with their bounding boxes (see Table 2) by fine-tuning weights from Faster R-CNN with ResNet-50 (v1) [91] to obtain a mAP@0.5 score of 0.917 and an average recall of 77%. Table 5 summarizes the hyperparameters used and the results of the hand-pointing detector model evaluation. Whenever the child points to someone or something, the model detects it, and we record the time stamps of all detections in the entire video.

TABLE 5. Finger pointing detector model hyperparameters and results.

A. TEST DATA
The procedure for processing the clinical and publicly available data for all the models during testing is the same as described in Section IV-A.
1) Activity Comprehension model: 48 videos (21 videos from the clinic, 27 publicly available)
2) Facial Expression Recognition model: 10 videos
3) Joint Attention Recognition model: 10 videos
All the videos are collected and annotated at the frame level, and these human annotations are used as ground-truth labels for comparing the results.

B. FACIAL EXPRESSION RECOGNITION MODEL
Ten videos of children exhibiting any of the nine facial expressions were gathered. The facial expression recognition (FER) model, when run on these videos, stores the predictions for each video frame. For each frame, we compare the model prediction to the ground truth. A prediction is considered correct if it is of the same class as the ground truth and has a confidence level of at least 85%. The confusion matrix of the FER model is shown in Fig. 12. Table 6 shows the evaluation scores per class for all ten videos. The class "None" comprises the frames of the background and the times when no face is visible. From the experimental results, it is observed that the accuracy is above 93% for all the expression categories.
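The per-frame scoring rule described above (a prediction counts as correct only when its class matches the ground truth and its confidence is at least 85%) can be summarized in a few lines. This is an illustrative sketch; the record format is assumed.

```python
def frame_level_accuracy(records, conf_thresh=0.85):
    """records: iterable of (ground_truth_label, predicted_label, confidence), one per frame.

    A frame counts as correct only if the predicted class equals the ground-truth
    class and the prediction confidence is at least `conf_thresh` (85% in the paper).
    """
    records = list(records)
    if not records:
        return 0.0
    correct = sum(1 for gt, pred, conf in records if pred == gt and conf >= conf_thresh)
    return correct / len(records)
```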
TABLE 6. Evaluation metrics per class on unseen test videos of children using the facial expression recognition model.
TABLE 7. Evaluation metrics per class of children on the clinical test set using the activity comprehension model.
TABLE 8. Evaluation metrics per class of play partners on the clinical test set using the activity comprehension model.
TABLE 9. Evaluation metrics per class of children on the publicly available videos using the activity comprehension model.
TABLE 10. Evaluation metrics per class of play partners on the publicly available videos using the activity comprehension model.
The FER model is also compared against well-known methods on the 2000 images in our test set and on the FER2013 dataset. The baseline implementations are evaluated in TensorFlow, and the rest are run with the code provided in the respective studies. The Ensemble ResMasknet method performs best with 76.82% accuracy, followed by CNNs and BOVW with SVM at 75.42%, the Ensemble of 8 CNNs method at 75.20%, and our model at 74.15%.

TABLE 11. Evaluation metrics of joint attention models on ten publicly available videos.
TABLE 12. Evaluation of the performance of the FER model using well-known methods on 2000 images in our test set and the test set of the FER2013 dataset.

For the activity comprehension model, we only utilized 10 of the 80 action categories, and we are only interested in two-person interactions. Since popular baseline models evaluate multi-person interactions, and also due to resource constraints, we performed the evaluation only on 48 real-world videos; these results are listed in Tables 7-10. More details on competitive baselines can be found in the study [88].

Joint Attention Follow Gaze detection requires a pipeline of multiple models involving head detection, child detection, and finally head pose estimation via Euler angle prediction; therefore, there is no direct comparison in the clinical literature for a fair evaluation. We are only interested in real-time, long-duration videos, so a comparison of the individual blocks on popular image datasets is outside the scope of this paper due to hardware resource constraints. However, details on full-range head pose estimation and comparisons on specific head-image datasets are described in the studies [96], [97], [98], [101]. For Joint Attention Hand Pointing estimation, we do not compare against 2D or 3D gesture recognition datasets because our model is distinct from the majority of conventional approaches: it is trained using images collected and curated from four hand- or finger-pointing datasets (see section IV-D2).

VI. DISCUSSION
Three independent vision-based paradigms are designed as follows: a model for Activity Comprehension of child-therapist interaction, a model for FER of children, and a real-time approach for automatic joint attention recognition. The demand for low-cost diagnostics, universal screening guidelines, and the availability of research funding have prompted endeavors to create technology-based ASD screening [9], [39], [48]. Technological advances and the availability of low-cost cloud infrastructure have motivated researchers to automate the creation and processing of video data by constructing data pipelines. Integrating data pipelines with ML technology has advanced the development of cost-effective ASD detection and assessment methods [31], [38], [47]. However, ASD diagnostic services are not always accessible, cost-effective, or data-driven. Our findings indicate that technology-based ASD approaches can be generalized to the broader population with neurodevelopmental disorders with a few technological modifications and can serve wider population groups with enhanced quality, access, and affordability. In addition, technology-enabled innovations are anticipated to supplement traditional detection methods for the following reasons. Diagnostic methods based on ML and DL can be trained on a large volume of involuntary multimodal data generated from various activities to detect children at risk of ASD. A few diagnostic techniques, such as CARS-2 [30], can only diagnose children older than two years; moreover, children do not develop social communication, language, and other crucial milestones until the second or third year of life, so an untrained clinician may receive contradictory results when evaluating ASD risk in infants under two years of age [1], [33]. Abbas et al. [39], Gupta et al. [38], Kohli et al. [9], and Uddin et al. [48] have highlighted novel ML methods and the feasibility of analyzing ASD and other neurodevelopmental landmarks from behavioral, eye gaze, audio, facial expression, postural, and EHR assessment data to identify children at risk of ASD at an early age, circumventing the age restrictions and limitations of the traditional diagnostic instruments.

Researchers have collected multimodal data from hospital EHRs and constructed enormous multimodal data lakes, which enable DL and ML algorithms to discover clinically significant patterns for recognizing ASD, tracking patients over time, prescribing and tailoring treatments, and alleviating ASD severity [39], [47].

A. PRACTICAL IMPLICATIONS
The ABA treatment enhancement efforts described in this research can advance the field of neuroscience by increasing early identification and consequently expanding access to early intervention treatments, which clinicians can use via mobile or web applications, significantly enhancing their capacity and meeting the needs of children with ASD and other developmental delays (speech, developmental, and intellectual delay). By supporting the adoption of these technologies through controlled pilots with stakeholders such as parents, doctors, and schools, and by digitizing downstream detection processes, evaluations, and therapies, a computerized, human-supported ASD diagnosis and management framework can be launched and then migrated to an autonomous and personalized digital model to optimize cost, maximize scale, and fast-track access to referrals and intervention. To our knowledge, this is the first attempt to create an automated, integrated ABA assessment framework deployed on the cloud for real-time assessment using video data.

Moreover, to decrease bias and ensure the internal and external validity of the implemented models, there is an urgent need to undertake large clinical trials, including the participation of researchers and doctors from many nations with diverse backgrounds and ethnicities. The purpose of such a partnership is to validate the results, determine efficacy, address potential technology edge cases, and design approaches to incorporate children from various backgrounds into research investigations. Our experimental results from testing the joint attention, Activity Comprehension, and facial expression models with YouTube videos demonstrate robustness on chaotic, natural videos.

It is appropriate for medical experts in the ABA and clinical psychology fields to evaluate the validity of the findings. The study's strengths include the use of 68 real-world clinical videos and comparisons against ground-truth video annotations provided by clinicians. Medical professionals have the expertise and experience to interpret the findings and provide additional insights that may not be immediately apparent to those outside the clinical context. In our case, we cross-reviewed the annotations provided by a clinician to ensure a coherent interpretation of the video annotations and to minimize annotator bias. In addition, medical professionals assessed the study's methodology, identified a few possible limitations, and made suggestions for future research, such as handling multiple children with a single therapist in the Activity Comprehension model. Involving medical professionals in the evaluation of the study's findings increased the study's rigor and credibility and ensured that the results were appropriately interpreted and implemented in clinical practice.
B. LIMITATIONS AND FUTURE DIRECTIONS
In developing the Activity Comprehension model, the current study provides solutions to only a subset of the many activities performed during ABA therapy; many other action classes specific to children can be incorporated if sufficient resources are allocated to model development. As ML technology develops, it may become possible to reduce false positives and false negatives in the FER model, thereby increasing its sensitivity, precision, and specificity. Even though we have collected as many images of children's faces as possible to train an online FER model, there is still room for additional data and microexpressions. In addition, each model assumes a particular video data distribution: (i) the Activity Comprehension model assumes person-person or person-object interactions, (ii) the FER model assumes frontal face visibility, and (iii) the joint attention models perform sub-optimally in crowded scenarios. Most of the children's videos we collected lack clinical diagnosis information for ASD or other neurodevelopmental disorders. Future work should include a large clinical trial testing the models on grouped cohorts of ASD, neurotypical, and other developmental disorders, which can reveal the level of efficacy and identify areas for further development. Lastly, more real-world test cases may uncover unforeseen edge cases that hamper model performance and generalizability. Future studies can incorporate clinicians' survey responses to determine the efficacy of computer vision models that aid in accurate, timely diagnosis and treatment monitoring. Future studies can also integrate the various methods into a single pipeline or architecture with a unified model trained in a multitask fashion for analyzing human behavior, joint attention, social communication skills, facial expression and motor imitation recognition, and eye contact detection. Further, speech and auditory features, which provide rich cues about social interaction and communication, can be incorporated to develop a multimodal vision-speech model that can identify abnormalities in speech and social behaviors.

VII. CONCLUSION
The paper investigates the viability of computer vision and deep learning-based ABA treatment and assessment that experts or non-experts can use to detect important behavioral activities, emotions, and JA from videos. Experiments with 68 clinical and public videos from the real world reveal that the activity comprehension model reports an overall accuracy of 72.32%, the joint attention models show an accuracy of 97% for gaze following and 93.4% for hand pointing, and the facial expression recognition model has an overall accuracy of 95.1%. During the development of the activity comprehension and facial expression recognition models, the proposed methodology incorporates diversity and fairness towards low-income and middle-income populations by collecting videos of children of different ages, socioeconomic statuses, and ethnicities. The models' predictions help generate real-time monitoring and assessment reports that help clinicians make decisions about ABA services.

APPENDIX A
A. MODEL EVALUATION METRICS
The robustness of a machine learning model can be evaluated using the metrics listed below.
a) Accuracy is the number of correct predictions divided by the total number of predictions.
b) True Positive (TP) signifies how many positive class samples the model predicted correctly.
c) True Negative (TN) signifies how many negative class samples the model predicted correctly.
d) False Positive (FP) signifies how many negative class samples the model predicted incorrectly.
e) False Negative (FN) signifies how many positive class samples the model predicted incorrectly.
f) Precision is the ratio of true positives to the total positives predicted.
g) Recall or Sensitivity is the ratio of true positives to all the positives in the ground truth.
h) Specificity is the proportion of actual negative class samples that are predicted as true negatives.
i) F1 Score is the harmonic mean of precision and recall.
j) Negative Predictive Value is the ratio of the number of true negatives to the total number of class samples that test negative.
Table 13 lists the evaluation metrics with their mathematical notations.

TABLE 13. Machine learning model performance metrics.
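The metric definitions in Appendix A correspond to the usual confusion-matrix formulas. The short sketch below restates them in code for reference; it is a generic implementation, not taken from the authors' evaluation scripts.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp) if (tp + fp) else 0.0
    recall      = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    npv         = tn / (tn + fn) if (tn + fn) else 0.0   # negative predictive value
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "npv": npv, "f1": f1}
```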
ACKNOWLEDGMENT
The authors would like to acknowledge the clinical experts and engineers at SM Learning Skills Academy for Special Needs Pvt. Ltd. for their support with video data collection and annotation. They also thank Microsoft Azure and IBM Cloud for providing cloud credits that allowed them to store huge amounts of video data and run GPU virtual machines.

DISCLOSURE STATEMENT
The authors Manu Kohli and Dr. A. P. Prathosh report a relationship with the company SM Learning Skills Academy for Special Needs Pvt. Ltd. that includes: equity or stocks. The remaining authors declare they have no financial or non-financial conflicts of interest.

CREDIT AUTHORSHIP CONTRIBUTION STATEMENT
Conceptualization: Manu Kohli, Swati Kohli, A. P. Prathosh; Data curation: Varun Ganjigunte Prakash, Manu Kohli, Swati Kohli, Tanu Wadhera, Diptanshu Das, Debasis Panigrahi; Formal analysis: A. P. Prathosh, Tanu Wadhera, Diptanshu Das, Debasis Panigrahi; Funding acquisition: Manu Kohli, A. P. Prathosh; Methodology: Varun Ganjigunte Prakash, Manu Kohli, Swati Kohli, A. P. Prathosh; Project administration: Manu Kohli, Swati Kohli; Resources: Manu Kohli, Swati Kohli, A. P. Prathosh, Tanu Wadhera, Diptanshu Das, Debasis Panigrahi; Software: Varun Ganjigunte Prakash, Manu Kohli; Supervision: Manu Kohli, Swati Kohli, A. P. Prathosh, Tanu Wadhera, Diptanshu Das, Debasis Panigrahi; Writing—review and editing: Varun Ganjigunte Prakash, Manu Kohli, Tanu Wadhera; Visualization: Varun Ganjigunte Prakash, Manu Kohli.

ETHICAL APPROVAL
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. This article does not contain any studies with animals performed by any of the authors.

INFORMED CONSENT
Informed consent was obtained from all individual participants included in the study.

REFERENCES
[1] CDC. (2021). Data and Statistics on Autism Spectrum Disorder. [Online]. Available: https://www.cdc.gov/ncbddd/autism/data.html
[2] C. Lord, S. Risi, L. Lambrecht, E. H. Cook, B. L. Leventhal, P. C. DiLavore, A. Pickles, and M. Rutter, "The autism diagnostic observation schedule-generic: A standard measure of social and communication deficits associated with the spectrum of autism," J. Autism Develop. Disorders, vol. 30, no. 3, pp. 205-223, Jun. 2000.
[3] L. Jurek, M. Baltazar, S. Gulati, N. Novakovic, M. Núñez, J. Oakley, and A. O'Hagan, "Response (minimum clinically relevant change) in ASD symptoms after an intervention according to CARS-2: Consensus from an expert elicitation procedure," Eur. Child Adolescent Psychiatry, vol. 31, no. 8, pp. 1-10, Aug. 2022.
[4] M. L. Sundberg, VB-MAPP Verbal Behavior Milestones Assessment and Placement Program: A Language and Social Skills Assessment Program for Children with Autism or Other Developmental Disabilities. Concord, CA, USA: AVB Press, 2008.
[5] H. S. Roane, W. W. Fisher, and J. E. Carr, "Applied behavior analysis as treatment for autism spectrum disorder," J. Pediatrics, vol. 175, pp. 27-32, Sep. 2016.
[6] R. K. Dogan, M. L. King, A. T. Fischetti, C. M. Lake, T. L. Mathews, and W. J. Warzak, "Parent-implemented behavioral skills training of social skills," J. Appl. Behav. Anal., vol. 50, no. 4, pp. 805-818, Oct. 2017.
[7] D. M. Bhatia. (Aug. 2020). How to Help Low-Income Children With Autism. [Online]. Available: https://www.doctorbhatia.com/autism-research-treatment/autism-diagnosis-how-is-autism-diagnosed-clinical-screening-cars-blood-gentic-tests/
[8] A. Opar. (Jan. 2019). How to Help Low-Income Children With Autism. [Online]. Available: https://www.spectrumnews.org/features/deep-dive/help-low-income-children-autism/
[9] M. Kohli, A. K. Kar, and S. Sinha, "The role of intelligent technologies in early detection of autism spectrum disorder (ASD): A scoping review," IEEE Access, vol. 10, pp. 104887-104913, 2022.
[10] M. Kohli and S. Kohli, "Electronic assessment and training curriculum based on applied behavior analysis procedures to train family members of children diagnosed with autism," in Proc. IEEE Region 10 Humanitarian Technol. Conf. (R10-HTC), Dec. 2016, pp. 1-6.
[11] A. Ali, F. F. Negin, F. F. Bremond, and S. Thümmler, "Video-based behavior understanding of children for objective diagnosis of autism," in Proc. 17th Int. Joint Conf. Comput. Vis., Imag. Comput. Graph. Theory Appl., Feb. 2022, pp. 1-11. [Online]. Available: https://hal.inria.fr/hal-03447060
[12] Q. Tariq, S. L. Fleming, J. N. Schwartz, K. Dunlap, C. Corbin, P. Washington, H. Kalantarian, N. Z. Khan, G. L. Darmstadt, and D. P. Wall, "Detecting developmental delay and autism through machine learning models using home videos of Bangladeshi children: Development and validation study," J. Med. Internet Res., vol. 21, no. 4, Apr. 2019, Art. no. e13822.
[13] C. Lord, M. Rutter, and A. Le Couteur, "Autism diagnostic interview-revised: A revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders," J. Autism Develop. Disorders, vol. 24, no. 5, pp. 659-685, Oct. 1994.
[14] P. U. Putra, K. Shima, S. A. Alvarez, and K. Shimatani, "Identifying autism spectrum disorder symptoms using response and gaze behavior during the Go/NoGo game CatChicken," Sci. Rep., vol. 11, no. 1, p. 22012, Nov. 2021, doi: 10.1038/s41598-021-01050-7.
[15] L. Billeci et al., "Disentangling the initiation from the response in joint attention: An eye-tracking study in toddlers with autism spectrum disorders," Transl. Psychiatry, vol. 6, no. 5, p. e808, May 2016, doi: 10.1038/tp.2016.75.
[16] C. Su, Z. Xu, J. Pathak, and F. Wang, "Deep learning in mental health outcome research: A scoping review," Transl. Psychiatry, vol. 10, no. 1, p. 116, Apr. 2020, doi: 10.1038/s41398-020-0780-3.
[17] A. Esteva, K. Chou, S. Yeung, N. Naik, A. Madani, A. Mottaghi, Y. Liu, E. Topol, J. Dean, and R. Socher, "Deep learning-enabled medical computer vision," npj Digit. Med., vol. 4, no. 1, p. 5, Jan. 2021, doi: 10.1038/s41746-020-00376-2.
[18] Y. Kumar, A. Koul, R. Singla, and M. F. Ijaz, "Artificial intelligence in disease diagnosis: A systematic literature review, synthesizing framework and future research agenda," J. Ambient Intell. Humanized Comput., vol. 2022, pp. 1-28, Jan. 2022, doi: 10.1007/s12652-021-03612-z.
[19] G. Brihadiswaran, D. Haputhanthri, S. Gunathilaka, D. Meedeniya, and S. Jayarathna, "EEG-based processing and classification methodologies for autism spectrum disorder: A review," J. Comput. Sci., vol. 15, no. 8, pp. 1161-1183, Aug. 2019.
[20] L. Klintwall and S. Eikeseth, Early and Intensive Behavioral Intervention (EIBI) in Autism. New York, NY, USA: Springer, 2014, pp. 117-137, doi: 10.1007/978-1-4614-4788-7_129.
[21] J. J. Wood, A. Drahota, K. Sze, K. Har, A. Chiu, and D. A. Langer, "Cognitive behavioral therapy for anxiety in children with autism spectrum disorders: A randomized, controlled trial," J. Child Psychol. Psychiatry, vol. 50, no. 3, pp. 224-234, Mar. 2009, doi: 10.1111/j.1469-7610.2008.01948.x.
[22] M. E. Król and M. Król, "A novel machine learning analysis of eye-tracking data reveals suboptimal visual information extraction from facial stimuli in individuals with autism," Neuropsychologia, vol. 129, pp. 397-406, Jun. 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S002839321930106X
[23] R. Aishworiya, T. Valica, R. Hagerman, and B. Restrepo, "An update on psychopharmacological treatment of autism spectrum disorder," Neurotherapeutics, vol. 19, no. 1, pp. 248-262, Jan. 2022, doi: 10.1007/s13311-022-01183-1.
[24] X. Wang, J. Zhao, S. Huang, S. Chen, T. Zhou, Q. Li, X. Luo, and Y. Hao, "Cognitive behavioral therapy for autism spectrum disorders: A systematic review," Pediatrics, vol. 147, no. 5, May 2021, doi: 10.1542/peds.2020-049880.
[25] D. Ung, R. Selles, B. J. Small, and E. A. Storch, "A systematic review and meta-analysis of cognitive-behavioral therapy for anxiety in youth with high-functioning autism spectrum disorders," Child Psychiatry Hum. Develop., vol. 46, no. 4, pp. 533-547, Aug. 2015, doi: 10.1007/s10578-014-0494-y.
[26] F. Mohammadzaheri, L. K. Koegel, M. Rezaee, and S. M. Rafiee, "A randomized clinical trial comparison between pivotal response treatment (PRT) and structured applied behavior analysis (ABA) intervention for children with autism," J. Autism Develop. Disorders, vol. 44, no. 11, pp. 2769-2777, Nov. 2014, doi: 10.1007/s10803-014-2137-3.
[27] Q. Yu, E. Li, L. Li, and W. Liang, "Efficacy of interventions based on applied behavior analysis for autism spectrum disorder: A meta-analysis," Psychiatry Invest., vol. 17, no. 5, pp. 432-443, May 2020.
[28] H. Manohar, P. Kandasamy, V. Chandrasekaran, and R. P. Rajkumar, "Early diagnosis and intervention for autism spectrum disorder: Need for pediatrician-child psychiatrist liaison," Indian J. Psychol. Med., vol. 41, no. 1, pp. 87-90, 2019, doi: 10.4103/IJPSYM.IJPSYM_154_18.
[29] D. L. Robins, K. Casagrande, M. Barton, C.-M.-A. Chen, T. Dumont-Mathieu, and D. Fein, "Validation of the modified checklist for autism in toddlers, revised with follow-up (M-CHAT-R/F)," Pediatrics, vol. 133, no. 1, pp. 37-45, Jan. 2014.
[30] C. A. Vaughan, "Test review: E. Schopler, M. E. Van Bourgondien, G. J. Wellman, & S. R. Love, Childhood Autism Rating Scale. Los Angeles, CA: Western Psychological Services, 2010," J. Psychoeducational Assessment, vol. 29, no. 5, pp. 489-493, Oct. 2011.
[31] M. Marlow, C. Servili, and M. Tomlinson, "A review of screening tools for the identification of autism spectrum disorders and developmental delay in infants and young children: Recommendations for use in low- and middle-income countries," Autism Res., vol. 12, no. 2, pp. 176-199, Feb. 2019, doi: 10.1002/aur.2033.
[32] K. Bauer, K. L. Morin, T. E. Renz, and S. Zungu, "Autism assessment in low- and middle-income countries: Feasibility and usability of western tools," Focus Autism Other Develop. Disabilities, vol. 37, no. 3, pp. 179-188, Sep. 2022, doi: 10.1177/10883576211073691.
[33] R. Choueiri, W. T. Garrison, and V. Tokatli, "Early identification of autism spectrum disorder (ASD): Strategies for use in local communities," Indian J. Pediatrics, vol. 90, no. 4, pp. 377-386, May 2022, doi: 10.1007/s12098-022-04172-6.
[34] N. J. Hidalgo, L. L. McIntyre, and E. H. McWhirter, "Sociodemographic differences in parental satisfaction with an autism spectrum disorder diagnosis," J. Intellectual Develop. Disability, vol. 40, no. 2, pp. 147-155, Apr. 2015.
[35] A. J. Kumm, M. Viljoen, and P. J. de Vries, "The digital divide in technologies for autism: Feasibility considerations for low- and middle-income countries," J. Autism Develop. Disorders, vol. 52, no. 5, pp. 2300-2313, May 2022.
[36] N. Malik-Soni, A. Shaker, H. Luck, A. E. Mullin, R. E. Wiley, M. E. S. Lewis, J. Fuentes, and T. W. Frazier, "Tackling healthcare access barriers for individuals with autism from diagnosis to adulthood," Pediatric Res., vol. 91, no. 5, pp. 1028-1035, Apr. 2022.
[37] M. P. Kelly and P. Reed, "Examination of stimulus over-selectivity in children with autism spectrum disorder and its relationship to stereotyped behaviors and cognitive flexibility," Focus Autism Other Develop. Disabilities, vol. 36, no. 1, pp. 47-56, Mar. 2021.
[38] C. Gupta, P. Chandrashekar, T. Jin, C. He, S. Khullar, Q. Chang, and D. Wang, "Bringing machine learning to research on intellectual and developmental disabilities: Taking inspiration from neurological diseases," J. Neurodevelopmental Disorders, vol. 14, no. 1, p. 28, May 2022, doi: 10.1186/s11689-022-09438-w.
[39] H. Abbas, F. Garberson, S. Liu-Mayo, E. Glover, and D. P. Wall, "Multi-modular AI approach to streamline autism diagnosis in young children," Sci. Rep., vol. 10, no. 1, p. 5014, Mar. 2020, doi: 10.1038/s41598-020-61213-w.
[40] D.-Y. Song, S. Y. Kim, G. Bong, J. M. Kim, and H. J. Yoo, "The use of artificial intelligence in screening and diagnosis of autism spectrum disorder: A literature review," J. Korean Acad. Child Adolescent Psychiatry, vol. 30, no. 4, pp. 145-152, Oct. 2019, doi: 10.5765/jkacap.190027.
[41] G. S. Young, J. N. Constantino, S. Dvorak, A. Belding, D. Gangi, A. Hill, M. Hill, M. Miller, C. Parikh, A. J. Schwichtenberg, E. Solis, and S. Ozonoff, "A video-based measure to identify autism risk in infancy," J. Child Psychol. Psychiatry, vol. 61, no. 1, pp. 88-94, Jan. 2020.
[42] E. Patten, K. Belardi, G. T. Baranek, L. R. Watson, J. D. Labban, and D. K. Oller, "Vocal patterns in infants with autism spectrum disorder: Canonical babbling status and vocalization frequency," J. Autism Develop. Disorders, vol. 44, no. 10, pp. 2413-2428, Oct. 2014.
[43] K. L. H. Carpenter, J. Hahemi, K. Campbell, S. J. Lippmann, J. P. Baker, H. L. Egger, S. Espinosa, S. Vermeer, G. Sapiro, and G. Dawson, "Digital behavioral phenotyping detects atypical pattern of facial expression in toddlers with autism," Autism Res., vol. 14, no. 3, pp. 488-499, Mar. 2021.
[44] R. Rahman, A. Kodesh, S. Z. Levine, S. Sandin, A. Reichenberg, and A. Schlessinger, "Identification of newborns at risk for autism using electronic medical records and machine learning," Eur. Psychiatry, vol. 63, no. 1, p. e22, 2020.
[45] L. Ouss, G. Palestra, C. Saint-Georges, M. L. Gille, M. Afshar, H. Pellerin, K. Bailly, M. Chetouani, L. Robel, B. Golse, R. Nabbout, I. Desguerre, M. Guergova-Kuras, and D. Cohen, "Behavior and interaction imaging at 9 months of age predict autism/intellectual disability in high-risk infants with west syndrome," Transl. Psychiatry, vol. 10, no. 1, pp. 1-7, Feb. 2020.
[46] J. Hashemi, G. Dawson, K. L. H. Carpenter, K. Campbell, Q. Qiu, S. Espinosa, S. Marsan, J. P. Baker, H. L. Egger, and G. Sapiro, "Computer vision analysis for quantification of autism risk behaviors," IEEE Trans. Affect. Comput., vol. 12, no. 1, pp. 215-226, Jan. 2021.
[47] M. Kohli, A. K. Kar, A. Bangalore, and P. Ap, "Machine learning-based ABA treatment recommendation and personalization for autism spectrum disorder: An exploratory study," Brain Informat., vol. 9, no. 1, p. 16, Jul. 2022, doi: 10.1186/s40708-022-00164-6.
[48] M. Uddin, Y. Wang, and M. Woodbury-Smith, "Artificial intelligence for precision medicine in neurodevelopmental disorders," npj Digit. Med.,
[66] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: A large video database for human motion recognition," in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 2556-2563.
[67] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman, "The kinetics human action video dataset," 2017, arXiv:1705.06950.
[68] V. Choutas, P. Weinzaepfel, J. Revaud, and C. Schmid, "PoTion: Pose motion representation for action recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018,
vol. 2, no. 1, p. 112, Nov. 2019, doi: 10.1038/s41746-019-0191-0. pp. 7024–7033.
[49] S. Jain, B. Thiagarajan, Z. Shi, C. Clabaugh, and M. J. Matarić, ‘‘Modeling [69] N. Dalal and B. Triggs, ‘‘Histograms of oriented gradients for human detec-
engagement in long-term, in-home socially assistive robot interventions tion,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.
for children with autism spectrum disorders,’’ Sci. Robot., vol. 5, no. 39, (CVPR), 2005, pp. 886–893.
Feb. 2020, doi: 10.1126/scirobotics.aaz3791. [70] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, ‘‘Dense trajectories and
[50] M. Toshpulatov, W. Lee, S. Lee, and A. H. Roudsari, ‘‘Human motion boundary descriptors for action recognition,’’ Int. J. Comput. Vis.,
pose, hand and mesh estimation using deep learning: A survey,’’ vol. 103, no. 1, pp. 60–79, May 2013.
J. Supercomput., vol. 78, no. 6, pp. 7616–7654, Apr. 2022, [71] R. A. J. de Belen, T. Bednarz, A. Sowmya, and D. Del Favero, ‘‘Computer
doi: 10.1007/s11227-021-04184-7. vision in autism spectrum disorder research: A systematic review of pub-
[51] T. L. Munea, Y. Z. Jembre, H. T. Weldegebriel, L. Chen, C. Huang, and lished studies from 2009 to 2019,’’ Translational Psychiatry, vol. 10, no. 1,
C. Yang, ‘‘The progress of human pose estimation: A survey and taxonomy pp. 1–20, Sep. 2020.
of models applied in 2D human pose estimation,’’ IEEE Access, vol. 8, [72] L. Zhang, M. Wang, M. Liu, and D. Zhang, ‘‘A survey on deep learning for
pp. 133330–133348, 2020. neuroimaging-based brain disorder analysis,’’ Frontiers Neurosci., vol. 14,
[52] C. Zheng, W. Wu, T. Yang, S. Zhu, C. Chen, R. Liu, J. Shen, p. 779, Oct. 2020.
N. Kehtarnavaz, and M. Shah, ‘‘Deep learning-based human pose estima- [73] P. Pandey, P. Ap, M. Kohli, and J. Pritchard, ‘‘Guided weak
tion: A survey,’’ 2020, arXiv:2012.13392. supervision for action recognition with scarce data to assess
[53] W. Zhang, J. Fang, X. Wang, and W. Liu, ‘‘EfficientPose: Efficient human skills of children with autism,’’ Proc. AAAI Conf. Artif. Intell.,
pose estimation with neural architecture search,’’ Comput. Vis. Media, Apr. 2020, vol. 34, no. 1, pp. 463–470. [Online]. Available:
vol. 7, no. 3, pp. 335–347, Sep. 2021, doi: 10.1007/s41095-021-0214-z. https://ojs.aaai.org/index.php/AAAI/article/view/5383
[54] S. Dubey and M. Dixit, ‘‘A comprehensive survey on human pose estima- [74] J. Carreira and A. Zisserman, ‘‘Quo vadis, action recognition? A new
tion approaches,’’ Multimedia Syst., vol. 29, no. 1, pp. 167–195, Aug. 2022, model and the kinetics dataset,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
doi: 10.1007/s00530-022-00980-0. Recognit. (CVPR), Aug. 2017, pp. 4724–4733.
[55] H. Chen, R. Feng, S. Wu, H. Xu, F. Zhou, and Z. Liu, ‘‘2D human pose [75] S. F. Dos Santos, N. Sebe, and J. Almeida, CV-C3D: Action
estimation: A survey,’’ 2022, arXiv:2204.07370. Recognition on Compressed Videos with Convolutional 3D
[56] G. Sciortino, G. M. Farinella, S. Battiato, M. Leo, and C. Distante, ‘‘On Networks. Porto Alegre, RS, Brasil: SBC, 2019. [Online]. Available:
the estimation of children’s poses,’’ in Image Analysis and Processing— https://sol.sbc.org.br/index.php/sibgrapi/article/view/9782
ICIAP, S. Battiato, G. Gallo, R. Schettini, and F. Stanco, Eds. Cham, [76] H. Xu, A. Das, and K. Saenko, ‘‘R-C3D: Region convolutional 3D network
Switzerland: Springer, 2017, pp. 410–421. for temporal activity detection,’’ in Proc. IEEE Int. Conf. Comput. Vis.
[57] J. Stenum, K. M. Cherry-Allen, C. O. Pyles, R. D. Reetzke, M. F. Vignos, (ICCV), Oct. 2017, pp. 5794–5803.
and R. T. Roemmich, ‘‘Applications of pose estimation in human [77] A. Richard and J. Gall, ‘‘A bag-of-words equivalent recurrent
health and performance across the lifespan,’’ Sensors, vol. 21, no. 21, neural network for action recognition,’’ Comput. Vis. Image
p. 7315, Nov. 2021. [Online]. Available: https://www.mdpi.com/1424- Understand., vol. 156, pp. 79–91, Mar. 2017. [Online]. Available:
8220/21/21/7315 https://www.sciencedirect.com/science/article/pii/S1077314216301680
[58] D. Cazzato, P. L. Mazzeo, P. Spagnolo, and C. Distante, ‘‘Automatic joint [78] J. Tang, J. Xia, X. Mu, B. Pang, and C. Lu, Asynchronous Interaction
attention detection during interaction with a humanoid robot,’’ in Social Aggregation for Action Detection. New York, NY, USA: Springer-Verlag,
Robotics, A. Tapus, E. André, J.-C. Martin, F. Ferland, and M. Ammi, Eds. 2020, pp. 71–87, doi: 10.1007/978-3-030-58555-6_5.
Cham, Switzerland: Springer, 2015, pp. 124–134. [79] Y. Liu, L. Ma, Y. Zhang, W. Liu, and S. Chang, ‘‘Multi-granularity gener-
[59] K. Kim and P. Mundy, ‘‘Joint attention, social-cognition, and recognition ator for temporal action proposal,’’ in Proc. IEEE/CVF Conf. Comput. Vis.
memory in adults,’’ Frontiers Hum. Neurosci., vol. 6, p. 172, Jun. 2012. Pattern Recognit. (CVPR), Jun. 2019, pp. 3599–3608.
[60] P. Nyström, E. Thorup, S. Bölte, and T. Falck-Ytter, ‘‘Joint attention in [80] V. Escorcia, F. Caba Heilbron, J. C. Niebles, and B. Ghanem, ‘‘DAPs: Deep
infancy and the emergence of autism,’’ Biol. Psychiatry, vol. 86, no. 8, action proposals for action understanding,’’ in Computer Vision—ECCV,
pp. 631–638, Oct. 2019, doi: 10.1016/j.biopsych.2019.05.006. B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham, Switzerland:
[61] G. Wan, X. Kong, B. Sun, S. Yu, Y. Tu, J. Park, C. Lang, M. Koh, Z. Wei, Springer, 2016, pp. 768–784.
Z. Feng, Y. Lin, and J. Kong, ‘‘Applying eye tracking to identify autism [81] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, ‘‘Action tubelet
spectrum disorder in children,’’ J. Autism Develop. Disorders, vol. 49, detector for spatio-temporal action localization,’’ in Proc. IEEE Int. Conf.
no. 1, pp. 209–215, Jan. 2019. Comput. Vis. (ICCV), Oct. 2017, pp. 4415–4423.
[62] W. Zhao and L. Lu, ‘‘Research and development of autism diagnosis [82] M. Tomei, L. Baraldi, S. Calderara, S. Bronzin, and R. Cucchiara,
information system based on deep convolution neural network and facial ‘‘Video action detection by learning graph-based spatio-temporal
expression data,’’ Library Hi Tech, vol. 38, no. 4, pp. 799–817, Mar. 2020. interactions,’’ Comput. Vis. Image Understand., vol. 206, May 2021,
[63] G. Alvari, C. Furlanello, and P. Venuti, ‘‘Is smiling the key? Machine Art. no. 103187. [Online]. Available: https://www.sciencedirect.com/
learning analytics detect subtle patterns in micro-expressions of infants science/article/pii/S107731422100031X
with ASD,’’ J. Clin. Med., vol. 10, no. 8, p. 1776, Apr. 2021. [Online]. [83] C. Plizzari, M. Cannici, and M. Matteucci, ‘‘Skeleton-based
Available: https://www.mdpi.com/2077-0383/10/8/1776 action recognition via spatial and temporal transformer networks,’’
[64] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, Comput. Vis. Image Understand., vols. 208–209, Jul. 2021,
S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, C. Schmid, Art. no. 103219. [Online]. Available: https://www.sciencedirect.com/
and J. Malik, ‘‘AVA: A video dataset of spatio-temporally localized atomic science/article/pii/S1077314221000631
visual actions,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., [84] X. Wang and A. Gupta, ‘‘Videos as space-time region graphs,’’ in Com-
May 2018, pp. 6047–6056. puter Vision—ECCV, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss,
[65] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, Eds. Cham, Switzerland: Springer, 2018, pp. 413–431.
M. Shah, and R. Sukthankar. (2014). THUMOS Challenge: Action [85] C. Feichtenhofer, H. Fan, J. Malik, and K. He, ‘‘Slowfast networks for
Recognition With a Large Number of Classes. [Online]. Available: video recognition,’’ in Proc. IEEE Int. Conf. Comput. Vis., Jan. 2019,
http://crcv.ucf.edu/THUMOS14/ pp. 6202–6211.
VARUN GANJIGUNTE PRAKASH received the B.E. degree in electronics and communication from Sri Jayachamarajendra College of Engineering, India, in 2018. He has five years of experience developing machine learning and computer vision-based solutions for multiple industries and startups and has worked on a wide range of computer vision and robotics challenges. His research interests include problems at the intersection of computer vision and robotics, robotic manipulation, control systems, autonomous mobile robots, deep learning, deep reinforcement learning, robotic system design, and machine learning.
MANU KOHLI is currently pursuing the Ph.D. degree with the Indian Institute of Technology, Delhi (IIT Delhi). He has 18 years of experience executing large-scale business and digital transformation projects in multiple countries, has held leadership positions in Fortune 500 organizations globally, and has worked with multiple technologies, such as SAP, SaaS, and machine learning. He is the Chief Technology Officer of CogniAble, where he has developed artificial intelligence solutions with strong psychometric properties for detecting and managing developmental conditions, including autism. He has led the formation and commercialization of new technology-enabled businesses and has authored multiple publications in peer-reviewed journals and books published by SAP PRESS. His research interests include developing machine learning, deep learning, and computer vision methods to solve complex business and healthcare problems. He has received numerous honors, including the UNICEF Blue Ribbon and AI Gamechangers recognitions, and cash prizes from Lockheed Martin, Tata Trusts, Western Digital, NASSCOM, NTT Data, and the Ministry of Electronics for his innovations.
SWATI KOHLI received the Diploma degree, in 1998, the B.Ed. degree in special education, in 2002, the Postgraduate Diploma degree in early intervention, and the M.A. degree in psychology. She completed the ABA coursework at the Florida Institute of Technology. She is currently the Clinical Director of CogniAble and has more than 18 years of experience working with children with special needs and neuro-developmental delays in school-, center-, and clinic-based settings.
A. P. PRATHOSH received the B.Tech. degree, in 2011, and the Ph.D. degree in temporal data analysis from the Indian Institute of Science (IISc), Bengaluru, in 2015, submitting his thesis three years after the B.Tech. degree with several top-tier journal publications. He is also a student of the Sanskrit language and Indian philosophical sciences. He has worked with corporate research labs, including Xerox Research India, Philips Research, and a start-up in California, USA. He co-founded CogniAble, which builds learning algorithms for behavioral healthcare using video analytics and was the first-place winner of the recent AI start-up challenge by the Government of India, and he is actively engaged with several corporate industries, start-ups, and medical centers (e.g., AIIMS) in solving technical problems. He joined the Computer Technology Group of Electrical Engineering, IIT Delhi, in 2017 as an Assistant Professor, where he conducted research and taught machine learning and deep learning courses, and he is currently a Faculty Member with the Department of ECE, IISc. His industry work on healthcare analytics has generated intellectual property comprising 15 U.S. patents, of which ten have been granted and six commercialized. His current research interests include deep representational learning, cross-domain generalization, signal processing, and their applications in computer vision and speech analytics.
TANU WADHERA received the B.Tech. degree in electronics and communication from the Guru Nanak Engineering College, Ludhiana, India, the M.Tech. degree in electronics and communication from Punjabi University, Patiala, India, and the Ph.D. degree from the National Institute of Technology Jalandhar (NIT Jalandhar), Jalandhar, Punjab, India. She completed her postdoctoral research with the Indian Institute of Technology, Delhi, India. She has six years of research experience, including four years with NIT Jalandhar, and one year of teaching experience as an Assistant Professor at NIT Jalandhar. Based on her contributions to computational healthcare, especially autism spectrum disorder and other disabilities on the same spectrum, she is currently a Project Engineer with the Indian Institute of Technology in collaboration with AIIMS, Delhi, and an Assistant Professor with the School of Electronics, Indian Institute of Information Technology Una, Una, India. She has published in reputed journals and has edited and/or authored several books. Her research interests include artificial intelligence, assistive technology, behavioral modeling, biomedical signal processing, cognitive neuroscience, and machine learning.

DEBASIS PANIGRAHI received the M.B.B.S. degree from the Veer Surendra Sai Institute of Medical Sciences and Research, Sambalpur, Odisha, in 2006, and the M.D. degree in pediatrics from the Sriram Chandra Bhanj Medical College, Cuttack, in 2011. He received a fellowship in pediatric neurology from the Kanchi Kamakoti Child Trust Hospital, Chennai, in 2013. He has 16 years of experience as a Pediatrician and Pediatric Neurologist in Bhubaneswar and currently practices at the Child Neuro Clinic, Jagannath Hospital, Bhubaneswar. He is a member of the Indian Academy of Paediatrics (IAP), the Association of Child Neurologists, India, and the IAP's Developmental Pediatric Chapter.