0% found this document useful (0 votes)

96 views23 pages

Computer Vision-Based Assessment of Autistic Children Analyzing Interactions Emotions Human Pose and

This paper presents a study on using computer vision to assess skills and emotions in children with Autism Spectrum Disorder (ASD) through the analysis of video recordings from play-based interventions. Three deep learning models were developed for activity comprehension, joint attention recognition, and facial expression recognition, achieving high accuracy rates in their assessments. The findings suggest that these models can provide valuable insights for clinicians in diagnosing and monitoring ASD with reduced supervision.

Uploaded by

FARAS A

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

96 views23 pages

Computer Vision-Based Assessment of Autistic Children Analyzing Interactions Emotions Human Pose and

Uploaded by

FARAS A

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 23

Received 4 January 2023, accepted 25 March 2023, date of publication 20 April 2023, date of current version 22 May 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3269027

Computer Vision-Based Assessment of Autistic

Children: Analyzing Interactions, Emotions,
Human Pose, and Life Skills
VARUN GANJIGUNTE PRAKASH 1 , MANU KOHLI1 , SWATI KOHLI1 , A. P. PRATHOSH 2,

TANU WADHERA 3 , DIPTANSHU DAS 4 , DEBASIS PANIGRAHI5 ,

AND JOHN VIJAY SAGAR KOMMU 6
1 CogniAble, Gurugram, Haryana 122022, India
2 Department of Electrical Communication Engineering, Signal Processing Building West, Indian Institute of Science, Bengaluru 560012, India
3 Department of Electronics and Communication Engineering, Indian Institute of Information Technology Una (IIITU), Una, Himachal Pradesh 177209, India
4 Institute of NeuroDevelopment, Kolkata, West Bengal 700005, India
5 Jagannath Hospital, Bhubaneswar, Odisha 751007, India
6 Department of Child and Adolescent Psychiatry, National Institute of Mental Health and Neurosciences (NIMHANS), Bengaluru, Karnataka 560029, India

Corresponding author: Varun Ganjigunte Prakash ([email protected])

This work was supported in part by the Biotechnology Industry Research Assistance Council (BIRAC), India, under Grant
BIRAC/FITT0528/BIG-13/18; and in part by Social Alpha, India.
This work involved human subjects in its research. Approval of all ethical and experimental procedures and protocols was granted by the
Maulana Azad Medical College New Delhi, the All India Institute of Medical Sciences Jodhpur, the National Institute of Mental Health
and Neuro-Sciences (NIMHANS) Bangalore, the Vardhman Mahavir Medical College New Delhi, Latur Medical College, and the
Central Institute of Psychiatry in Ranchi.

ABSTRACT In this paper, the proposed work implements and tests the computer vision applications to
perform the skill and emotion assessment of children with Autism Spectrum Disorder (ASD) by extracting
various bio-behaviors, human activities, child-therapist interactions, and joint pose estimations from the
recorded videos of interactive single- or two-person play-based intervention sessions. A comprehensive
data set of 300 videos is amassed from ASD children engaged in social interaction, and three novel deep
learning-based vision models are developed, which are explained as follows: (i) activity comprehension
to analyze child-play partner interactions (activity comprehension model); (ii) an automatic joint attention
recognition framework using head and hand pose; and (iii) emotion and facial expression recognition. The
proposed models are also tested on children’s real-world, 68 unseen videos captured from the clinic, and
public datasets. The activity comprehension model has an overall accuracy of 72.32%, the joint attention
recognition models have an accuracy of 97% for follow eye gaze and 93.4% for hand pointing, and the
facial expression recognition model has an overall accuracy of 95.1%. The proposed models could extract
behaviors of interest, events of activities, emotions, and social skills from free-play and intervention session
videos of long duration and provide temporal plots for session monitoring and assessment, thus empowering
clinicians with insightful data useful in diagnosis, assessment, treatment formulation, and monitoring ASD
children with limited supervision.

INDEX TERMS Autism spectrum disorder, activity comprehension, facial expressions, joint attention, ASD
screening, applied behavior analysis.

I. INTRODUCTION difficulty in establishing friends, poor social communication

Children with Autism Spectrum Disorder (ASD) typically abilities, and limited understanding and expression of emo-
exhibit biobehavioral patterns such as repetitive behavior, tions [1]. Traditional diagnostic methods such as blood tests,
genetic testing, and brain imaging have limited success in
The associate editor coordinating the review of this manuscript and establishing diagnoses, severity degree, and skill assessments
approving it for publication was Dian Tjondronegoro . of ASD children.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 11, 2023 47907
V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

Behavioral methods are the gold standard for diagnosing a multimodel approach with video annotation performed by
ASD in children, which entails the physician documenting humans [12], [39], [45], however, very little work has been
the patient’s medical history, interviewing the parents, and done on the automatic extraction and classification of human
manually observing the children’s behavior. These observa- actions from untrimmed videos for ASD detection [46]. The
tions are recorded as detailed in the instruction guidelines state-of-the-art ML and DL methods have improved quality,
for diagnostic rating scale instruments such as the Autism outcomes, and access to ASD screening, diagnosis [12], and
Diagnostic Observation Schedule (ADOS) and the Childhood assessments [39]. Researchers have trained supervised learn-
Autism Rating Scale (CARS-2) [2], [3]. The rating scale ing ML models on multimodal data to develop ASD screening
usually suggests a child’s skills in social engagement, joint and diagnosis [39] solutions with moderate to high psycho-
attention, emotional expressions, instruction following, play metric outcomes in minimal time, ensuring their internal
and life skills, imitation abilities, and visual attention. The validity. These solutions have focused on detecting children
diagnosis is established if the observation scores cumula- with ASD and ODD [12] on cross-cultural datasets.
tively exceed a predetermined threshold. After diagnosis, In the past decade, computer vision-based behavior imag-
a functional assessment is conducted utilizing the instruments ing and facial analysis have shown promising results in
such as VBMAPP [4] to build a personalized intervention assisting clinicians with the diagnosis of multiple medi-
program that can improve the necessary skills of ASD chil- cal conditions including ASD [16], [17], [18]. Moreover,
dren for their school and societal inclusion. The functional computer vision-based methods can offer an accurate, low-
assessment includes detailed observations and measurements cost, and non-invasive alternative compared to traditional
of children’s skills in various domains such as independent labor-intensive manual assessments and invasive methods
play, social communication, self-stimulatory behavior, joint such as electroencephalogram (EEG) [19].
attention, imitation, and understanding of emotions through Even though computer vision has demonstrated many
facial expressions [5], and other necessary skills of ASD promising solutions, its application in assessing behavior,
children [6]. play, imitation and life skills, posture, and gait analysis to
However, there are various limitations when using conven- assess the joint attention of ASD children has not yet been
tional diagnostic and functional assessment methods. Firstly, explored [20], [21], [22]. In addition to these, there are no
the interpretative coding of a child’s behavior observed is large-scale efforts to develop facial expression recognition
manual and time-consuming. Secondly, a clinician’s obser- models or detect joint attention skills of young children
vations may not always be reliable or valid due to differences from real-time videos. Therefore, we address these issues
in professional training, experience, available resources, and by developing novel computer vision models to extract and
cultural backgrounds. Thirdly, there is a huge demand-supply classify the joint attention skills, facial expressions, and life
mismatch between the number of professionals available to skills from untrimmed videos of ASD children and assist the
treat nearly 2% of newborn children diagnosed with ASD [1]. clinician in diagnosing ASD or establishing the functional
These challenges are exaggerated in Low and Middle-Income assessment for ASD children.
Countries (LMICs) [1], [7], [8] where there is a severe short- (i) To assess children’s joint attention skills automatically,
age of clinicians and poor infrastructure to manage ASD con- we developed computer vision models by analyzing
ditions. Therefore, new technological methods for rapid and postural changes in response to instruction or stimuli
automatic data collection and analysis can enhance clinician given by the clinician.
capacity and improve quality, affordability, and accessibility (ii) To recognize nine emotional expressions, namely
in ASD detection and assessments. anger, disgust, fear, happiness, sadness, surprise, laugh-
Technology has demonstrated significant benefits by ter, crying, and neutral for children aged 1 to 5,
employing Machine Learning (ML) and Deep Learning (DL) we developed the Facial Expression Recognition (FER)
for early diagnosis and functional assessments of ASD model by gathering extensive facial images from
[9], [10]. ML has uncovered essential and minimal fea- diverse ethical and cultural backgrounds.
tures [11], [12] of ASD diagnostic instruments such as (iii) To perform an automatic functional assessment of
the Autism Diagnostic Observation Schedule (ADOS) [2], children from their intervention video sessions, their
and the Autism Diagnostic Interview-Revised (ADI-R) [13], engagement duration, and frequency with clinicians,
thereby accelerating the diagnosis procedure without com- parents, or play partners on ten life skill activities,
promising accuracy [14], [15]. ML and DL methods can namely run, sit, stand, engagement, instruction engage-
analyze an unprecedented quantity of multimodal and mul- ment, hit or fight someone, watch someone, hold an
tidimensional clinical data from videos, images, texts, voice object or oblique toys, walk, and answer the phone, are
messages, and sensors due to the rapid evolution of technol- assessed.
ogy and digitization [9]. The analysis can suggest patterns, The paper is organized as follows: Section II briefly
aid in the development of clinical decision support systems to describes state-of-the-art computer vision methods used in
diagnose ASD or developmental delays, and provide sugges- ASD management. In Section III, we provide the details
tions for treatment and personalization, enhancing the clini- of the study procedure, and Sub-section III-A provides a
cian’s capacity. Earlier studies on ASD screening developed detailed description of the problem and answers the questions

47908 VOLUME 11, 2023

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

raised in developing video-driven assessments. Section IV Artificial intelligence (AI) technology especially ML and
describes the data collection procedure and the technological DL can address these limitations due to its unique facets such
methodology to realize the study aims. Section V provides as increased processing power of computer hardware and
a detailed evaluation and results of our models on real-time multimodal data availability, thereby leading to faster ASD
videos; in Section VI, we discuss the results interpretations, diagnosis [38]. Recently, the clinical study of multi-modular
practical implications, limitations, and future directions; and ML-based ASD diagnosis based on questionnaires and home
in Section VII we provide the conclusion. videos has demonstrated a sensitivity of 90% towards ASD
detection [39]. Some of the other improvements that have
been witnessed with the application of AI are: (i) Detection of
II. LITERATURE REVIEW ASD at an early age, (ii) Reduction in the number of assess-
This section discusses relevant studies, state-of-the-art ment items as a result of implementing the feature reduction
computer vision methods, implementation challenges, and method, (iii) Effective classification between different ASD,
improvement options. Sub-section II-A summarizes the cur- Typically Developing (TD), and other neurodevelopmental
rent state of ASD assessment and intervention methods and disorders, (iv) Automatic feature extraction of bio-behaviors
new studies incorporating ML and DL models into ASD from multimodal data [9], [40].
diagnosis and therapy. Sub-section II-B discusses state-of- Due to the availability of multimodal data from diverse bio-
the-art computer vision methods deployed for Human Pose behavioral sources, such as videos including ASD behavioral
Estimation. Sub-section II-C discusses the importance of features [12], [41], audio [42], facial expressions [43], and
joint attention skills and data-driven assessment techniques. Electronic Health Record (EHR) data [44], DL applications
Sub-section II-D describes the significance of facial expres- trained on unstructured data have accelerated the detection
sion recognition in ASD assessment and treatment planning. and management of ASD and can be implemented at the
Finally, Sub-section II-E summarizes the state-of-the-art point of care [9], [12], [41], [45], [46], [47]. The feasi-
human action recognition methods and their applications in bility of the therapeutic intervention and prognosis lever-
assessment and treatment formulation for ASD. aging AI has shown reasonable success [48] for ASD and
other neurodevelopmental disorders. Furthermore, individu-
alized socially assistive robotic intervention and automation
A. ASD TREATMENT based on engagement analysis has aided in the development
Most evidence-based ASD intervention methods can enhance of a low-cost, robot-based therapeutic framework for ASD
the child’s ability, especially in the first three years [23]. children [49].
However, the demand for professionally trained therapists However, most studies focus on one of the seven key data
has outpaced the supply; consequently, clinicians’ availabil- categories, such as stereotyped behaviors, eye gaze, facial
ity and cost-effectiveness are crucial for promoting treat- expressions, postural analyses, motor movements, auditory,
ment accessibility. Cognitive Behavioral Therapy (CBT) is and electronic health records [9] adopting ML and DL tech-
a behavioral intervention that can help individuals with niques with Graphical Processing Units (GPU) and high pro-
ASD to achieve their goals and change their lifestyles cessing cloud capabilities [9]. To the best of our knowledge,
[21], [24], [25]. this is the first study to employ computer vision to extract
Applied behavioral analysis (ABA) is a gold-standard data from various bio-behaviors, including play, engagement,
intervention widely used to assist ASD individuals with facial expression, and joint-attention abilities.
behavioral and communication challenges by promoting
desirable social behaviors [26] such as overcoming food
intolerance, improving intelligence quotient (IQ), social com- B. HUMAN POSE ESTIMATION
munication, and teaching play and life skills using principles Computer-vision-based Human Pose Estimation (HPE)
of reinforcement [27]. A higher quality of life for ASD methods including conventional and instance-based pose esti-
children can be foreseen through early diagnosis followed mation models to novel deep network architectures can detect
by evidence-based treatment methods [20], [28]. An accurate human body poses in 2D or 3D space by regressing skeletal
diagnostic and functional evaluation is essential to evaluate joint angles or critical points using a single view or several
the child’s area of strength for customizing an intervention view cameras with monocular or depth modalities [50], [51],
program to the child’s unique needs. ADOS [2], ADI-R [13], [52], [53], [54], [55]. In addition, developing computer vision
The Modified Checklist for Autism in Toddlers, Revised, applications for specific task measurements involves accurate
with Follow-Up (M-CHAT-R/F) [29], and The Childhood measurements of both human body joints and their parts.
Autism Rating Scale-2 (CARS-2) [30] are a few widely used Head pose estimation involves the prediction of head orienta-
gold-standard ASD diagnostic and screening instruments tion and assessing human attention and head pose. Similarly,
developed in Western countries [31], [32], [33]. Therefore, hand detection and tracking provide a fine-grained estimation
the outcomes of these assessments have limited efficacy when of hand posture for regressing skeletal finger points and ges-
employed in LMICs due to a lack of training and cultural ture recognition tasks [50]. However, most pose estimation
disparities [34], [35], [36], [37]. algorithms are designed for adults or pedestrians, and few

VOLUME 11, 2023 47909

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

solutions have focussed on special needs children or pediatric distressing for some ASD individuals leading to social anx-
healthcare [56], [57]. Several significant constraints prevail iety. Subsequently, the capacity to imitate and comprehend
in the development and deployment of the HPE methods for facial emotions is crucial for the social functioning of any
various child-specific problems in managing ASD conditions individual. ASD children have difficulty understanding and
as follows: (i) data security, privacy, and ethical challenges, responding to nonverbal cues and recognizing and compre-
(ii) expensive data collection, (iii) manual data annotation hending facial expressions and emotions. Carpenter et al.
process, and (iv) camera calibration and setup, and (v) sin- [43] have extracted positive, neutral, and other facial land-
gle and targeted solutions to a specific problem [52]. The marks from a database of 3D facial expressions utilizing
current methods for human pose estimation are designed to a trained computer vision model and have discovered that
track specific movements and activities and may not be able children with ASD have more neutral facial expressions,
to capture a broad variety of child behaviors or activities. which corresponds to the fact that facial expression imitation
For instance, head pose estimation may be ineffective for is an essential indicator of social interaction skills. Zhao
monitoring social engagement or other nonverbal indicators. et al. [62] have implemented a DL model to recognize facial
Human pose estimation methods are not always accurate, par- expressions by utilizing multiple databases while training it
ticularly when monitoring the movements of children, who with the facial expressions of sixteen Chinese children. The
have smaller, more rapid movements. Additionally, children experimental results of Zhao et al. have shown that the ASD
are more likely to make sudden, unpredictable movements group’s average imitation expression is found to be less than
that can be difficult to accurately monitor. To circumvent 60%, a significant deterministic threshold for ASD.
these problems, our goal is to develop dedicated models for Alvari et al. [63] have examined facial expressions using
tracking hands and heads that work well for adults as well as the Facial Action Coding System (FACS) and extracted the
toddlers. intricate dynamics of ASD and TD children’s social smiles
from home recordings. The findings of Alvari et al. have
C. JOINT ATTENTION
suggested that ASD children exhibit less happiness than TD
children in their first years, confirming that ASD children
Joint attention (JA) is a social communication method of
have difficulty distinguishing faces and take a long time to
engaging one’s attention with another person using objects
comprehend facial expressions. Deep learning-based facial
and gestures. Limited JA skills are one of the earliest indi-
expression recognition (FER) has been explored in numer-
cators of ASD which JA necessitates capturing, sustain-
ous architectures such as convolutional neural networks,
ing, and transferring attention and fostering the growth of
deep belief networks, autoencoders, generative adversarial
essential social abilities, such as engaging with others and
networks, and ensembles of networks. These architectures
understanding their perspective [14]. Similarly, few works
performed the best on a variety of benchmark datasets as
implemented a DL classification model for evaluating joint
they focused on the two important issues of overfitting and
attention in individuals with ASD by utilizing short video
expression-unrelated variations.
clips of joint attention initiation. The system evaluates joint
attention aids in the early detection and intervention of ASD.
Similarly, a vision-based joint attention detection system E. ACTIVITY RECOGNITION
for ASD using eye-tracking technology showed good accu-
Activity recognition identifies significant events of interest in
racy in detecting joint attention among non-ASD adults.
vast video datasets [64], [65], [66], [67]. Earlier techniques
An automated tool called RJAfinder has been developed
employed human posture traits [68], feature descriptors [69],
that quantifies responding to joint attention behaviors in
and dense trajectories [70] based on the appearance from
ASD using eye-tracking data. RJAfinder can compare RJA
camera movement. However, ML and computer vision (CV)
events among ASD children, typically developing children,
have improved various aspects of human visual perception
and adults and finds fewer RJA events that ASD children
to find clinically meaningful patterns from the images and
display than the other two groups. Cazzato et al. [58] have
videos and classify activities of interest to diagnose and func-
examined how robot-assisted therapy affected the social inter-
tionally assess ASD children [12], [71], [72], [73]. However,
actions of children with ASD and used expensive depth
one of the obstacles associated with applying CV in ASD
cameras to aid in non-invasive JA evaluation. Few studies
detection and its management is the high labor cost and
have used eye-tracking technology to investigate eye gaze
downtime associated with the manual annotation of video.
accuracy, fixation, eye transition, and eye movement dur-
Furthermore, due to the computationally intensive descrip-
ing technology-aided JA assessments [15], [59], [60], [61]
tion and monitoring of motion data from the real-time feed,
methods.
activity monitoring can result in low generalizability because
of potential tracking failure in non-neural network-based
D. FACIAL EXPRESSION RECOGNITION AND EYE CONTACT systems. Hence, we propose a novel DL model to address
The intensity and frequency of eye contact and facial these limitations by training the model on a limited publicly
expressions can facilitate verbal and non-verbal communi- available dataset of action classes relevant to ASD diagnosis
cation between individuals. Maintaining eye contact can be and assessments.

47910 VOLUME 11, 2023

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

Researchers have developed neural network architectures A. PROBLEM FORMULATION

such as two-stream Inflated 3D [74] and C3D [75], [76], This study focuses on the following areas,
which incorporate optical flow and RGB image proper- 1) Activity Comprehension: Developmental concerns
ties for capturing person features and movements. In addi- raised by parents are the first reporting point for per-
tion to the standard bag-of-visual-words method [77], these forming diagnosis or age-level skill assessment. Clin-
architectures have strengthened the activity recognition icians engage children in play-based sessions, provide
framework. In terms of capturing the temporal patterns and them with various stimuli, and ask them to undertake
simplifying parameter learning for underlying architectures, various activities to test their age-appropriate skills
3D convolutional layers are superior to 2D ones [74]. How- while diagnosing or assessing their functional skills for
ever, these methods work only for short, trimmed videos ASD. These evaluations also include monitoring the
and do not perform well with longer untrimmed videos con- children engaged in independent play in an unsuper-
taining simultaneous multi-person actions. Temporal action vised session. During post-ASD diagnosis, the children
localization-based methods have solved this limitation by are engaged in intervention sessions where they are
slicing a long-duration video into manageable time segments. taught various motor and life skills and activities of
A single-stage end-to-end network suggests action intervals daily living, including child interaction with the parent,
and also classifies potential actions [76], [78]. toys, or clinicians during the session. These diagnostic,
In contrast, two-stage methods use distinct neural networks functional assessment, and intervention sessions are
for making suggestions on the occurrence of significant manually analyzed by clinicians, where behaviors of
human movements and then classifying those activities [79], interest and actions of interest are analyzed, spending
[80]. However, these methods perform poorly with videos significant man hours to establish the diagnosis and
similar in actions or outside the training data distribution. functional assessment. In order to circumvent manual
Also, activity recognition systems must be resistant to occlu- observation and treatment monitoring, there is a need
sions and capable of handling crowded scenarios with mul- to adopt automatic treatment monitoring and analysis.
tiple people and actions. Several contemporary paradigms Therefore, we propose to use computer vision in the
simultaneously investigate spatial-temporal characteristics to video-recorded session to capture children’s activity
identify and mark the location of activities. Tublet generation performance and skill levels. With the help of computer
[81], human- and object-centric learning [78], [82], skeleton- vision, massive engagement video data can be utilized
based method [83], and graph convolution networks [84] have to train a system to identify children’s engagement.
enabled the incorporation of various human and environmen- As part of the general engagement, we measure ten
tal features for accurate activity recognition. As part of this activities as listed in Table 1.
paper, we aim to develop a general video activity classi- 2) Facial expressions: The facial expressions of chil-
fier to detect multiple actions of interest in natural videos dren with ASD can significantly differentiate from
accurately. TD children in response to various external stimuli.
Further, children’s motivation during assessments and
III. STUDY PROCEDURE the ABA intervention session system can be tracked
1) The children are recruited from SM Learning Skills by analyzing children’s emotions during and after the
Academy Pvt. Ltd., India, a special needs clinic, and intervention. The emotion and facial expressions can
the National Institute of Mental Health and Neuro- give insights into a precursor of maladaptive behav-
Sciences (NIMHANS), Bangalore, India. All partic- iors or unusual responses to visual or auditory stimuli
ipants’ consent is recorded. The children who have and investigate the potential causes of these responses.
already been diagnosed with ASD participate in play- In addition, there are limited datasets of ASD chil-
based interactions. dren; therefore, we augment existing datasets by col-
2) The study objectives and details of data capture lecting publicly available facial expressions of children
are explained to potential participants’ parents or and adolescents of diverse ethnicity, culture, and geo-
caregivers, and any doubts regarding the same are graphic location, thereby conforming to fairness and
clarified. Finally, their consent for data usage is decreasing bias. Also, we intend to identify nine facial
recorded. expressions in children using a deep learning model.
3) Video recordings of interactive ABA therapy sessions 3) Joint attention: The JA assessment involves a human
between a child and the therapist are recorded. observer who records the child’s responses and the
The principles of ABA have demonstrated its utility for observation data leading to longer wait times, requir-
promoting learning and behavior change in children with ing additional therapists, and burdening clinicians’
ASD. In the following sections, we describe the general work. Now, it is possible to automate this process
concerns with conventional ABA sessions and outline our and provide an assessment report by developing algo-
primary goals for enhancing three areas of ABA intervention rithms to process RGB videos recorded on any cam-
namely general interaction and life skills, emotion recogni- era without requiring expensive sensors attached to
tion, and joint attention analysis. children or human observers in treatment scenes.

VOLUME 11, 2023 47911

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

The JA assessment report comprises marked times- B. ACTIVITY COMPREHENSION MODEL

tamps at which the child’s responses to stimuli are In this section, we provide an overview of our Activity Com-
observed, enabling them to conduct their intervention prehension model, and technological methods of implemen-
plan effectively. tation are explained in the subsequent subsections. We define
The study procedure of model development, real- the ABA enrollment and assessment pipeline as shown in
world testing, and performance evaluation is illustrated in Fig. 2. The flow diagram also shows the steps involved in
Fig. 1. the real-world deployment of the Activity Comprehension
model. These video sessions capture ABA intervention ses-
TABLE 1. Ten activities of interest predicted by the model.
sions in which the child is engaged with the therapist in
a play-based scenario (see Section III). The video is then
inferred into our Activity Comprehension model to provide
predictions. Fig. 3 shows the plots of temporal events of
interest (activities) and an automatic pie chart generation
where each point on the child/play partner interaction plot
belongs to one of the activity classes (see Table 1) at a given
time step.
Spatio-temporal action recognition methods [82], [83],
[85], [86], [87] have been developed to train actions simulta-
neously with spatial and temporal annotations, i.e., where the
action is and what the action is about in a given image. How-
IV. METHODS ever, these methods require dense annotations at the frame
A. DATA COLLECTION AND PROCESSING level, which is difficult and time-consuming in a clinical
The video data of 300 children with ASD diagnosis of varied setting. Therefore, we utilize open-sourced spatiotemporal
age groups (1-5 years) are collected. The average duration transformer approach [88] to analyze person-person inter-
of collected videos is 20 minutes. The model outcomes are action. We also aim to understand child-play partner inter-
blind-tested on 68 videos (see Section V-A). actions efficiently and thereby provide actionable insights
The following prerequisites are completed before data col- from the ABA video assessments. We propose our Activity
lection at each site. Comprehension model approach in Fig. 4 which comprises
1) Researchers have finalized a set of behavioral markers child-play partner activity prediction and activity separation,
exhibited prominently by ASD children such as poor described in the following sections.
eye contact, motor, and joint attention, imitation skills,
absence of play, and self-stimulatory behavior to per-
form diagnosis or functional assessments by referring 1) ACTIVITY PREDICTION
to literature and expert discussions. The Activity Comprehension model is trained on the AVA
2) We developed a web application used by expert annota- dataset [64] which contains 235 training movie videos, 64 val-
tors to traverse temporally across a video and perform idation videos, and 131 test movie videos (see Table 2) from
frame-by-frame annotations on preselected multi-level which 10 out of 80 action categories are utilized for our
video clinical landmarks. The annotators performed study. The Activity Comprehension model localizes peo-
the annotations by capturing eye contact, joint atten- ple’s actions in both space and time and recognizes actions
tion, imitation, and activities of interest, and their with a novel asynchronous interaction aggregation method
respective metadata as features. For instance, depend- [88]. The action recognition system uses the object detec-
ing on the therapist-child interaction, the annotator tion model [89] for localizing people. However, the object
would add metadata such as ‘‘yes’’ or ‘‘no’’ for detection model misses detecting several people/children in
the ‘Joint Attention’ and ‘Social Behavior’ categories the dense crowd and occlusions cases. Hence, we utilized
for the follow gaze, finger pointing, and eye con- Yolo-v5 [90] for real-time person detection in videos with
tact classes, and additional information if responses significant improvements over Yolo-v3 [89]. The input to the
are spontaneous or elicited. Similarly, the annotators Activity Comprehension model is an ABA video (approxi-
mark all the activities of interest (see Table 1) for mately 10 minutes in length) containing scenes of engage-
both the child and the play partner in the videos. ment/activity between a child and their play partner and
Each video’s annotations are saved in an XML file the model predicts the activities (see Table 1) of both the
containing the child’s age, gender, ground-truth diag- child and the play partner temporally. We record and store
nosis, audio or video annotations, metadata, and the detected bounding boxes, model predictions, and time
timestamps. intervals of the child and the play partner. However, it should
3) Annotators examine publicly available videos collected be noted that the model does not explicitly make distinc-
from YouTube and Vimeo for good resolution of at least tions concerning the people in the scene as children or play
240p before annotation. partners. We only know the number of people in the scene

47912 VOLUME 11, 2023

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

FIGURE 1. An overview of study procedures for model development, real-world testing, and
performance evaluation.

(by counting the number of person bounding boxes) and to the area of the union of two bounding boxes (Equation 1).
their associated activities. Hence, there is a need to segregate
the activities of the child and play partner to make logical Area of Overlap
IoU = (1)
conclusions for the assessment by implementing the activity Area of Union
separation method which is explained in the next section. We compute the IoUs between each pair of axis-aligned
bounding boxes which are one IoU between the child detec-
tor bounding box and one of the bounding boxes from the
2) ACTIVITY SEPARATION person detection bounding boxes. We select the IoU score
To distinguish a child from a play partner, we develop a of ≥ 0.75 as a good threshold for locating the child and
child detector model. Knowing the child’s location is neces- perform the following two checks to ensure that the located
sary because we infer its actions from the features extracted bounding box is correct out of several other bounding boxes
from subsequent frames. The child detector model is trained obtained from the activity prediction model. At first, the
using 3027 annotated images of children (see Table 2) by center coordinates pixel values of both the bounding boxes,
fine-tuning weights from Faster R-CNN with Resnet-50 (v1) having a good IoU match are calculated. Then we check
[91] to obtain a [email protected] score of 0.94. As a result, whether the center coordinate lies in either of the two quad-
a child is distinguished from a play partner, and child features rants of the image (distinguished as left and right with the
are extracted. Table 3 summarizes the hyperparameters used center axis) and compare the quadrants in which the center
and the results of the detector model evaluation. The child coordinate lies and record it. Next, we calculate the Euclidean
detection model predicts the child’s location in each video distance dc between the center coordinates of a good IoU
frame and produces a bounding box for the child. We store match (Equation 2). The lesser the distance score between
the detected bounding boxes along with the time interval. two center coordinates, the higher the chance that the person’s
We perform Intersection over Union (IoU) of detected bounding box encompasses the child. From the experiments,
boxes on the person detection boxes (accumulated child and we choose a distance threshold of 20 pixels and then select
play partner bounding boxes from subsection IV-B1) and the person bounding box with a good IoU match that agrees
child detection box (from subsection IV-B2). The IoU of two with the quadrant rule and the distance measure. We now
detection boxes is defined as the ratio of the overlapped area know the child’s location and, therefore, it is easy to recover

VOLUME 11, 2023 47913

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

TABLE 2. Data details of models for computer vision-based assessment.

TABLE 3. Child detector model hyperparameters and results.

non-child detection boxes belonging to the play partner. With

these results, we can explain an automatic ABA activity
assessment in detail in the next section.
The distance between the two center coordinates
q
(cx1 , cy1 ), (cx2 , cy2 ) = dc = (cx1 − cx2 )2 + (cy1 − cy2 )2
(2)
where (cx1 , cy1 ) and (cx2 , cy2 ) are the center coordinates of
two bounding boxes.

3) ABA VIDEO ASSESSMENT

We collected 21 videos from Clinic (No. 3.05/30th Insti-
tutional Ethics Meeting of Behavioral Sciences Division,
FIGURE 2. Flow diagram of ABA enrollment and assessment using activity National Institute of Mental Health and Neuro-Sciences
comprehension model.
(NIMHANS), Bangalore, India - Approved on 26/06/2021),
and 27 publicly available ABA videos for testing our ABA
the predictions and time intervals from the person detection Activity Comprehension model on one NVIDIA V100 GPU,
list obtained from the action prediction model. Similarly, and predictions of videos are in the form of output plots
we recover activity predictions and time intervals from all as illustrated in Fig. 4. The ABA Activity Comprehension

47914 VOLUME 11, 2023

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

FIGURE 3. Session monitoring and assessment with activity comprehension model.

FIGURE 4. Implementation of spatio-temporal activity comprehension model for ABA assessment.

model is used to understand a child’s interactions, activity The model also assesses the play partner and learner for
level, and attention with a play partner by analyzing ten various activities shown in Table 1. A video can be uploaded
activities, including running, sitting, standing, engagement, to analyze any of these activities, and a scatter plot of
instruction engagement, hitting/fighting someone, watching engagement and non-engagement with time intervals would
someone, holding objects/oblique toys, walking, and answer- be generated for the respective inputs. This scatter plot tells
ing the phone. Any user who wants to assess engagement crucial information with a frame-by-frame analysis of the
and understand the ABA session can upload an ABA video learner and the play partner’s activities during the session.
and gets predictions. These predictions are different activ- Each point shows the action class with a timestamp. The
ity classes predicted by a spatiotemporal action recognition discontinuities in the scatter plot indicate the time intervals
model (subsection IV-B1 and IV-B2). We get predictions of no particular interest to the ABA outcomes. Therefore,
for the entire video length with specific activities shown these scatter plot patterns can be used to trace the success
as a scatter plot at time intervals of seconds. The model of therapy and intervention delivery by the therapist easily.
can efficiently assess videos from a camera with a tripod The assessment recordings and scatter plot predictions of the
stand in clinical sessions and from a mobile camera with- model can predict a child’s skill level, leading to the line
out a stable tripod stands from home videos. The ABA of treatment. For instance, the therapist would be prepared
sessions with the Activity Comprehension model will help to deal with a violent child if the model had predictions
the learners concentrate on learning goals and the clinicians of hitting activity (see Fig. 5). The therapist can then work
or therapists to augment their decisions on ABA session closely on various attributes of the child’s behavior with prior
outcomes. predictions reported from the model. The scatter plot not only

VOLUME 11, 2023 47915

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

FIGURE 5. Scatter plot of a child indicating hitting and running behavior.

highlights the presence of a particular class of activity but (e.g., non-face photos or images with faces wrongly cropped),
marks the absence (if any) in a learner’s or therapist’s behav- and the distribution of images among emotion categories is
ior. In a child showing less attentive behavior towards the not uniform. There are almost 6,000 photographs depicting
play partner’s commands (i.e., lower engagement), the thera- happiness, but only about 500 depict disgust. Additionally,
pist can be proactive to work on weaker behavior attributes, as none of the datasets had images of teenagers and tod-
strengthen them in the upcoming sessions, and track progress dlers crying or laughing, we, therefore, compiled images
for specific class activities. In this way, the ABA video from the popular action recognition datasets (Kinetics [67],
activity recognition model is of great clinical and therapeutic Moments in Time [93], HMDB [66]), and from our video
significance. dataset with 300 ASD children. We accumulate 9882 images
The activities such as hitting, running, walking, and repet- of toddlers and teenagers crying and 10268 images of them
itive behavior indicate a low level of attention by the learner laughing. The final enhanced dataset contains 51037 training
towards commands of the play partner. Thus, with a plot images, 3000 images for validation, and 2000 images for
indicating such results, the customization and prognosis of testing (see Table 2). We trained a Resnet-34 backbone-
the therapy can be decided (see Fig. 6). Similarly, if the plot based facial expression recognition model for nine output
of the play partner/therapist points towards using a phone, the expression classes namely anger, disgust, fear, happy, sad,
play partner could be replaced or evaluated for their actions surprise, laugh, cry, and neutral. We train a residual mask-
accordingly. Fig. 7 shows the pie charts of percentages of ing network of four primary residual masking blocks. Each
activities for a child and therapist after conducting the ABA Residual Masking Block is comprised of a Residual Layer
session. To conclude, the model aids in diagnosis, prognosis, and a Masking Block which acts on different feature sizes.
customization of the therapy, and evaluation of the learner and A 3×3 convolutional layer will first process a 224×224 input
therapist’s progress/activities and so it is vital for the ABA image with stride 2 followed by a 2 × 2 max-pooling layer,
sessions and treatment of ASD children. resulting in a spatial size reduction to 56 × 56. The cor-
responding forward layers of four residual masking blocks
C. FACIAL EXPRESSION RECOGNITION MODEL generate feature maps of four spatial sizes (56 × 56, 28 × 28,
The experiments are conducted on the well-known FER2013 14 × 14, and 7 × 7) from the feature maps produced by the
public dataset which has a collection of 35887 greyscales preceding pooling layer. The network ends with an average
(48 × 48) images in total [92] and other publicly accumulated pooling layer and a 9-way fully-connected softmax layer
datasets. In the FER2013 dataset, each image gathered by producing outputs for 9 facial expression classes (‘‘Angry,’’
the Google image search API is labeled with one of seven ‘‘Sad,’’ ‘‘Fear,’’ ‘‘Happy,’’ ‘‘Surprise,’’ ‘‘Cry,’’ ‘‘Disgust,’’
categories: anger, disgust, fear, happy, sad, surprise, and neu- ‘‘Laugh,’’ ‘‘Neutral’’). The model is trained for 250 epochs
tral. However, the dataset contains several faulty samples with a batch size of 48, with the SGD optimizer, with a

47916 VOLUME 11, 2023

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

FIGURE 6. A regular output scatter plot of child and play partner to analyze per
session response.

FIGURE 7. Assessment output of child and play partner interactions.

learning decay rate of 0.9 and weight decay of 5e−4 . The of the model on the validation set images is 74.4% and the
learning rate is set to 0.001. Before the training process, accuracy of the model on the test images is 73.9%. The con-
the original training images are resized to 224 × 224 and fusion matrix is shown in Fig. 8. The classes with the highest
transformed to RGB to support ImageNet pre-trained models. scores are Happy, Sad, Surprise, and Neutral, while those
In addition, the training photos are enhanced to prevent over- with the lowest scores were Laughing, Fear, and Disgust. The
fitting. The dataset augmentation techniques included left- evaluation scores per class for the 20 ASD test videos are
right flipping, brightness variation, and rotation inside the listed in Table 4.
interval [-30, 30] degrees. The model is deployed on a Linux server with one NVIDIA
The accuracy measure is the evaluation metric for the V100 GPU to provide real-time facial expression assess-
classification tasks (see Appendix Table 13). The accuracy ment. Fig. 9 illustrates a child’s monitoring and assessment

VOLUME 11, 2023 47917

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

TABLE 4. Evaluation metrics per class on the test set of the facial expression recognition model.

gaze by turning their head and trying to look in the same

direction. Another critical skill assessment in JA is hand
pointing.

1) JOINT ATTENTION — FOLLOW GAZE

The task of determining the orientation of people’s heads
from images or videos is known as Head Pose Estimation
(HPE). HPE has garnered a lot of attention [94], [95] with the
development of a variety of methods such as appearance tem-
plates, detector arrays, geometric, regression models, track-
ing, and hybrid methods. In low-resource clinical settings,
the use of costly, complex sensor systems such as magnetic
sensors, inertial sensors, lasers, and optical motion capture
systems makes it difficult not only to collect ground truth
FIGURE 8. Confusion matrix scores on the test set of the facial expression
recognition model. data for developing effective HPE methods but also makes it
impossible in the deployed applications without these sensor
systems. However, RGB video cameras are cheap, easy to
approach. The FER model processes the video data stream use, and can be installed in clinics. The data collected from
and outputs predictions of emotions along with time stamps the cameras provide novel paths to explore HPE methods
of their occurrence in a scatter plot. The therapist can view that can work on real-world videos. Existing methods are
the emotional assessment of the child using a pie chart, effective for frontal views but not always for head poses
as shown in Fig. 10. During our experiments, it is observed from all angles. Additionally, we need to identify the full
that the detection of laughing or crying is more accurate for range of yaw angles for follow gaze detection. Full-range
numerous consecutive frames over a short duration than for a yaw estimation is significantly less common than narrow-
single frame. The confidence of the network predictions for range estimation, as most known HPE datasets concentrate
sad/cry and happy/laugh is visually similar if a single frame is mainly on frontal to profile views. To determine yaw, recent
processed. Hence, we grouped 20 consecutive frames of the approaches classify poses into coarse-grained bins/classes
images and determined whether the confidence of the crying [96], [97], [98]. The limitations of using these existing meth-
or laughing class increased during subsequent frame predic- ods are mentioned as follows: they do not predict pitch and
tions, indicating greater and longer persistence of sadness or roll; full-range yaw estimation is not robust to occlusions;
happiness. Given the volatile nature of emotions in children unreliable pose estimation in heavy noise and jitter environ-
with ASD, understanding of the child’s emotions enables bet- ments; low-resolution video; and utilization of multi-camera
ter planning of activities suited to the child’s needs and remote over monocular images. Therefore, we aim to implement
management and monitoring of the children with ASD. a deep network to predict Euler angles over a single RGB
image’s entire range of head yaws.
D. JOINT ATTENTION DETECTION MODELS In our clinical setting, a play partner interacts with a child.
Fig. 11 illustrates the methodology of our joint attention One of the goals of ABA therapy is to improve a child’s joint
detection model. We implemented two deep neural network attention (JA) skills. A child is considered to have good JA
models for two crucial types of joint attention that therapists follow gaze skills if they respond and actively interact with
frequently use to differentiate ASD from typically developing their play partner’s stimuli. For the JA follow gaze stimulus,
children. First, joint attention follow-gaze is an activity in the play partner performs a subtle gesture with their eyes or
which the therapist indicates to a child to look at somewhere pointing hand, asking the child to look towards a direction,
or something with his or her eyes, and the child must then typically behind or beyond the sight of the child, so that the
follow the therapist’s gaze. Successful joint attention effectiveness of the JA follow gaze can be observed. As a
indicates that the child actively responded to the therapist’s response to the stimulus, an attentive child would turn their

47918 VOLUME 11, 2023

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

FIGURE 9. Session monitoring and assessment with facial expression recognition model.

track a human rotating through a complete revolution of yaw

but do not distinguish between a child, a play partner, or other
people in the video.
Since we are primarily interested in the full-range yaw
rotation of the child, we develop a child detector model
using 3274 annotated images of children (see Table 2) by
fine-tuning weights from Faster R-CNN with Resnet-50 (v1)
[91]. Table 3 summarizes detector model training parameters
and evaluation. The child detection model predicts the child’s
location in each video frame and generates a bounding box for
the child. We store the detected bounding boxes alongside the
time interval and compare whether the x and y coordinates of
the Origin (O) of the yaw, pitch, and roll axes are within the
FIGURE 10. Assessment output of child’s emotions. child detector’s bounding box. We collect all the yaw values
with the child’s timestamps for the entire video duration.
We develop an algorithm for detecting the rotation based on
yaw values. The sign change in the yaw angles is a clear
head and look in that direction, following their play partner’s indication of the rotation of the head from left to right or
gaze. We automatically identify and mark these key JA scenes vice versa when a child follows the gaze of a play partner.
in the video. To solve the problem of detecting the follow Upon analyzing each frame of the video, a sign change from
gaze, initially, we need to identify people’s heads, track the positive to negative or vice versa indicating that the child
Euler angles predicted by the pose estimation model, and turned from one side to the other. The magnitude of the
then detect a full-range yaw rotation. Fig. 11 illustrates our sign change indicates the degree or magnitude of rotation
Joint Attention Follow Gaze block. The input video consists (a smaller magnitude of change indicates a slight head turn
of a play-based interaction of a child with a play partner. and a larger magnitude indicates a complete head turn).
First, head detection datasets [99], [100] are used to train a We record the time interval during which a child responds to
YOLO-v3 object detector [89]. With the location of the spa- a follow-gaze stimulus, and detecting a change in the child’s
tially identified heads, we track the Euler angles with a model head position indicates a successful follow-gaze response.
trained on the network that predicts pitch, yaw, and roll using
a multi-loss metric [101]. In computer vision, Euler angles
are commonly used to define any three-dimensional object’s 2) JOINT ATTENTION — POINTING TO SOMEONE OR
rotation by combining three sequential elemental rotations SOMETHING
along three distinct axes (X, Y, and Z) which represent the One of the goals of the JA assessments is to improve a
rotation by three parameters: yaw, pitch, and roll. In human child’s JA hand-pointing skills to understand how well they
head posture, roll computes the amount of X-axis rotation. recognize and interact in their environment. For the JA hand-
In head movement, it is the same as moving your head to the pointing stimulus, the play partner verbally asks the child a
left or right. The amount of rotation about the Y-axis is com- simple question to elicit an expected response of pointing
puted via pitch which is comparable to glancing up or down fingers toward a particular direction, typically in front or
at a person’s head. Yaw determines the amount of rotation behind the sight of the child, so that the effectiveness of JA
about the Z-axis. For the human head, it might be interpreted finger-pointing can be observed. As a response to the stim-
as either looking left or right. Fig. 11 illustrates head pose ulus, the child points a finger in that direction to answer the
predictions of successive images. The pose predictions help question. For example, the play partner might ask, ‘‘Where is

VOLUME 11, 2023 47919

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

FIGURE 11. Illustration of joint attention models with follow gaze and hand pointing
blocks.

your mother?’’, ‘‘Where is the toy?’’ or ‘‘Where is the animal A. TEST DATA
picture in the book?’’ and the child will either point with their The procedure for clinical and publicly available data pro-
index finger and a closed thumb finger or point with their cessing for all the models during testing is the same as
index finger and an open thumb finger. Fig. 11 illustrates described in Section IV-A.
different orientations of finger pointing in the Joint Atten- 1) Activity Comprehension model: 48 Videos (21 videos
tion Pointing to someone/something block. We collected from Clinic, 27 publicly available)
only hand/finger pointing from various hand gestures and 2) Facial Expression Recognition model: 10 Videos
hand-pointing annotated datasets [102], [103], [104], [105]. 3) Joint Attention Recognition model: 10 Videos
We trained a finger-pointing detector using 24369 annotated All the videos are collected and annotated at the frame level
images with their bounding boxes (see Table 2) by fine-tuning and these human annotations will be used as ground-truth
weights from Faster R-CNN with Resnet-50 (v1) [91] to labels for comparison of the results.
obtain a [email protected] score of 0.917 and an average recall
of 77%. Table 5 summarizes the hyperparameters used and B. FACIAL EXPRESSION RECOGNITION MODEL
the results of the hand-pointing detector model evaluation. Ten videos of children exhibiting any of the nine facial
Whenever the child points to someone or something, the expressions were gathered. The facial expression recogni-
model detects it, and we record all the time stamps of the tion (FER) model, when inferred, stores the predictions for
detection in the entire video. each video frame. For each frame, we compare the model
prediction to the ground truth. A prediction is considered
TABLE 5. Finger pointing detector model hyperparameters and results.
as correct if it is of the same class as the ground truth
and has a confidence level of at least 85%. The confusion
matrix of the FER model is shown in Fig. 12. Table 6 shows
the evaluation scores per class for all ten videos. The class
‘‘None’’ comprises the frames of the background and the
times when no face is visible. From the experimental results,
it is observed that the accuracy is above 93% for all the
expression categories.

C. ACTIVITY COMPREHENSION MODEL

1) CLINICAL AND PUBLICLY AVAILABLE ABA VIDEOS
Twenty-one recordings of children participating in ABA
sessions are collected and evaluated using computer vision
V. RESULTS (Activity Comprehension model) for the activities mentioned
In this section, we describe our experiments conducted on in Table 1. The videos are inferred utilizing the Activity
previously unseen test data, along with evaluation metrics and Comprehension model, and the predictions for each video
results. frame are compared to the ground truth. The prediction is

47920 VOLUME 11, 2023

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

TABLE 6. Evaluation metrics per class on unseen test videos of children using the facial expression recognition model.

YouTube and Vimeo video search engines. Unlike the previ-

ous videos, these videos do not conform to play-based activ-
ities and include multiperson interactions other than child
and play partner interactions, and significantly vary from
the training video samples of the AVA dataset [64]. Hence,
testing our Activity Comprehension model on these videos
will also be helpful to evaluate the robustness of adversarial
attacks and provide insights to address out-of-distribution
detection. Tables 9 and 10 show the evaluation scores for
each class of the children and play partners across all the
videos. The model identifies potentially dangerous behaviors,
such as hitting or fighting, and performs best when rec-
ognizing children’s actions, such as holding objects, sitting
down, and following instructions and is also capable of deter-
FIGURE 12. Confusion matrix scores on unseen test videos of children
mining whether a play partner uses a phone, how engaged
with the facial expression recognition model. they are with children, and how they interact with toys and
objects.

considered correct if the class prediction is similar to the

D. JOINT ATTENTION
ground truth and has a confidence level of ≥ 50 percent. The
evaluation scores per class for all 21 videos of the children are 1) PUBLICLY AVAILABLE VIDEOS
listed in Table 7. The model can detect most of the children’s We gather ten videos of children participating in ABA ses-
activities with at least 70% accuracy. Precision, recall, and sions from publicly accessible videos of children on JA
accuracy are highest for actions such as sitting and holding gaze following and finger-pointing skills. Table 11 lists the
objects or oblique toys for children. The per-class metrics evaluation scores of joint attention (follow gaze and joint
corresponding to the play partners in the videos are shown attention - pointing to someone or something) models for all
in Table 8. Since the child or the play partner may have ten videos. The experimental results reveal that both models
more than one label class (sitting and instruction engagement, of joint attention are accurate to millisecond levels in terms
sitting and holding objects) at each instant, it is necessary to of detecting the instant at which joint attention is successful.
evaluate the temporal alignment of different actions against The majority of errors are false negatives caused by different
the ground truth. We define the temporal Intersection over configurations of the finger pointing away from the camera’s
Union (t-IoU) metric [106] for evaluating the action predic- line of sight.
tion metrics considering each class’s start and end of time
instants. t-IoU is the sum of the intersections between the E. EVALUATION ON BASELINE METHODS
predicted and observed time intervals for each predicted class, We evaluate the performance of our FER model on the most
divided by the sum of their unions. The true-positive t-IoU competitive well-known baselines on our enhanced image
threshold is set at 0.30. [email protected] is the mean t-IOU data with 2000 images used in our test set and on the public
across all test videos with a threshold of 0.3 or above. The dataset FER2013 test images. Since the FER2013 does not
[email protected] metric severely penalizes any temporal mis- have the cry and laugh classes, both classes are mapped to sad
alignment when examining the model’s true performance. and happy, respectively, for a fair comparison. Our model out-
Many false positives and false negatives are observed, and performs all the models in the baselines on our test set with an
some actions, such as hitting and answering the phone, are accuracy of 73.9% followed by the Ensemble 8-CNN frame-
not captured on video, and their evaluations are represented work with 73.50% and Ensemble ResMasknet with 73.20%.
as null (-). Additionally, our model is competitive with ensemble-based
Similarly, we test the performance of twenty-seven pub- methods for the FER2013 dataset. Table 12 lists the baseline
licly available ABA videos of children collected from methods and compares the accuracy between our test set

VOLUME 11, 2023 47921

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

TABLE 7. Evaluation metrics per class of children on the clinical test set using the activity comprehension model.

TABLE 8. Evaluation metrics per class of play partners on the clinical test set using the activity comprehension model.

TABLE 9. Evaluation metrics per class of children on the publicly available videos using the activity comprehension model.

TABLE 10. Evaluation metrics per class of play partners on the publicly available videos using the activity comprehension model.

and the FER2013 dataset. The baseline implementations are followed by the method CNNs and BOVW with SVM having
evaluated in TensorFlow, and the rest of them are imple- 75.42% followed by Ensemble 8 CNN-based method with
mented with the code provided in the studies. The method an accuracy of 75.20% and our model with an accuracy
Ensemble ResMasknet outperforms with 76.82% accuracy, of 74.15%.

47922 VOLUME 11, 2023

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

TABLE 11. Evaluation metrics of joint attention models on ten publicly child-therapist interaction, a model for FER of children, and a
available videos.
real-time approach for automatic joint attention recognition.
The demand for low-cost diagnostics, universal screening
guidelines, and research funding availability have prompted
endeavors to create technology-based ASD screening [9],
[39], [48]. Technological advances and the availability of
low-cost cloud infrastructure have motivated researchers to
automate the creation and processing of video data by con-
structing data pipelines. Integrating data pipelines with ML
technology has advanced the development of cost-effective
TABLE 12. Evaluation of the performance of the FER model using ASD detection and assessment methods [31], [38], [47].
well-known methods on 2000 images in our test set and test set of However, ASD diagnostic services are not always accessible,
FER2013 dataset.
cost-effective, or data-driven. Our findings indicate that the
technology-based ASD approaches can be generalized to the
broader population with neurodevelopmental disorders along
with few technological modifications and can serve wider
population groups with enhanced quality, access, and afford-
ability. In addition, technology-enabled innovations are antic-
ipated to supplement traditional detection methods for the
following reasons: Diagnostic methods based on ML and DL
can be trained on a large volume of involuntary multimodal
data generated from various activities to detect children at
risk for ASD. Few diagnostic techniques, such as CARS-2
[30], can diagnose children older than two years. Moreover,
children do not develop social communication, language, and
other crucial milestones until the second or third year of life.
An untrained clinician may therefore receive contradictory
results when evaluating ASD risk in infants under two years
For the activity comprehension model, we only utilized of age [1], [33]. Abbas et al. [39], Gupta et al. [38], Kohli
10 of the 80 action categories, and we are only interested et al. [9], and Uddin et al. [48] have highlighted the novel ML
in two-person interactions. Since popular baseline models methods and their feasibility of analysis on ASD and other
evaluate multi-person interactions and also due to resource neurodevelopmental landmarks from behavioral, eye gaze,
constraints, we only performed the evaluation on 48 real- audio, facial expression, postural, and EHR assessment data
world videos, and these results are listed in Tables 7-10. More to identify children at risk of ASD at an early age, circum-
details on competitive baselines can be found in detail in the venting the age restrictions and limitations of the traditional
study [88]. diagnostic instruments.
Joint Attention Follow Gaze detection requires a pipeline Researchers have collected multimodal data from hospi-
of multiple models involving head detection, child detection, tal EHRs and constructed enormous multimodal data lakes
and finally head pose estimation via Euler angle prediction; which enable DL and ML algorithms to discover clinically
therefore, there is no direct comparison in the clinical liter- significant patterns for recognizing ASD, tracking patients
ature for a fair evaluation. We are only interested in real- over time, prescribing and tailoring treatments, and alleviat-
time, long-duration videos, so a comparison of individual ing ASD severity [39], [47].
blocks to popular image datasets is outside the scope of
this paper due to hardware resource constraints. However, A. PRACTICAL IMPLICATIONS
details on full-range head pose estimation and comparison The ABA treatment enhancement efforts described in this
on specific head-image datasets are described in the studies research can advance the field of neuroscience by increasing
[96], [97], [98], [101]. For Joint Attention Hand pointing early identification and consequently expanding access to
estimation, we do not compare against 2D or 3D gesture early intervention treatments which can be used by clini-
recognition datasets because our model is distinct from the cians accessible via mobile or web applications, significantly
majority of conventional approaches, where our model is enhancing their capacity and meeting the needs of children
trained using images collected and curated from four hand- with ASD and other developmental delays (speech, devel-
or finger-pointing datasets (see section IV-D2). opment, and intellectual delay). By supporting the adoption
of these technologies through controlled pilots with stake-
VI. DISCUSSION holders such as parents, doctors, and schools and digitizing
Three independent vision-based paradigms are designed downstream detection processes, evaluations, and thera-
as follows: a model for Activity Comprehension for pies, a computerized, human-supported ASD diagnosis, and

VOLUME 11, 2023 47923

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

management framework can be launched and migrated to diagnosis information for ASD or other neurodevelopmental
an autonomous and personalized digital model to optimize disorders. Future work should include a large clinical trial
cost, maximize scale, and fast-track access to referrals and testing the models on grouped cohorts of ASD, neurotypical,
intervention. According to our knowledge, this is the first and other developmental disorders that can reveal the level of
attempt to create an automated, integrated ABA assessment efficacy and identify areas for further development. Lastly,
framework deployed on the cloud for real-time assessment more real-world test cases may uncover unforeseen edge
using video data. cases that hamper model performance and generalizability.
Moreover, to decrease bias and ensure the internal and Future studies can incorporate clinicians’ survey responses to
external validity of the implemented models, there is an determine the efficacy of computer vision models that aid in
urgent need to undertake large clinical trials, including the accurate, timely diagnosis and treatment monitoring. Future
participation of researchers and doctors from many nations studies can integrate various methods into a single pipeline or
with diverse backgrounds, and ethnicity. The purpose of architecture with a unified model that is trained in a multitask
the partnership is to validate the results, determine effi- fashion for analyzing human behavior, joint attention, social
cacy, address potential technology edge cases, and design communication skills, facial expression and motor imitation
approaches to incorporate children from various backgrounds recognition, and eye contact detection. Further, speech and
into research investigations. Our experimental results from auditory features can be incorporated that provide rich fea-
testing joint attention, Activity Comprehension, and facial tures in the case of social interaction and communication to
expression models with YouTube videos demonstrate robust- develop a multimodal vision-speech model that can identify
ness in chaotic and natural videos. abnormalities in speech and social behaviors.
It is appropriate for medical experts in the ABA and clini-
cal psychology fields to evaluate the findings’ validity. The VII. CONCLUSION
study’s strengths include the use of 68 real-world clinical The paper investigates the viability of a computer vision
videos and comparisons against ground-truth video annota- and deep learning-based ABA treatment and assessment that
tions provided by clinicians. However, medical professionals experts or non-experts can use to detect important behavioral
have the expertise and experience to interpret the findings activities, emotions, and JA using videos. Experiments with
and provide additional insights that may not be immediately 68 clinical and public videos from the real world reveal that
apparent to those outside the clinical context. In our case, the activity comprehension model reports an overall accu-
we cross-reviewed the annotations provided by a clinician racy of 72.32%, the joint attention models show an accuracy
to ensure a coherent interpretation of video annotations and of 97% for gaze following and 93.4% for hand pointing,
minimize annotator bias. In addition, medical professionals and the facial expression recognition model has an overall
assessed the study’s methodology, identified a few possible accuracy of 95.1%. During the development of the activity
limitations, and made suggestions for future research, such as comprehension and facial expression recognition model, the
in the case of multiple children handled by a single therapist proposed methodology incorporates diversity and fairness
for Activity Comprehension models. Involving medical pro- to low-income and middle-income populations by collecting
fessionals in the evaluation of the study’s findings increased videos of children of different ages, socioeconomic statuses,
the study’s rigor and credibility and ensured that the results and ethnicity. The models’ predictions help to make real-time
were appropriately interpreted and implemented in clinical monitoring and assessment reports that help clinicians to
practice. make decisions about ABA services.
B. LIMITATIONS AND FUTURE DIRECTIONS
APPENDIX A
While developing the Activity Comprehension model, the
A. MODEL EVALUATION METRICS
current study provides solutions to only a subset of the many
activities performed during ABA therapy; many other action The robustness of the machine learning model can be evalu-
classes specific to children can be incorporated if sufficient ated on various metrics items listed below.
resources are allocated to model development. As ML tech- a) Accuracy is the number of correct predictions divided
nology develops, it is possible to reduce false positives and by the total number of predictions.
negatives in the FER model, thereby increasing its sensitivity, b) True Positive (TP) signifies how many positive class
precision, and specificity. Even though we have collected samples the model predicted correctly.
as many images of children’s faces as possible to train an c) True Negative (TN) signifies how many negative class
online FER model, there is still room for additional data and samples the model predicted correctly.
microexpressions. In addition, each model assumes a partic- d) False Positive (FP) signifies how many negative class
ular video data distribution: (i) the Activity Comprehension samples the model predicted incorrectly.
model assumes person-person or person-object interactions, e) False Negative (FN) signifies how many positive class
(ii) the FER model assumes frontal face visibility, and joint samples the model predicted incorrectly.
attention models perform sub-optimally in crowded scenar- f) Precision is the ratio of true positives and total positives
ios. Most of the children’s videos we collected lack clinical predicted.

47924 VOLUME 11, 2023

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

g) Recall or Sensitivity is the ratio of true positives to all Swati Kohli, A. P. Prathosh, Tanu Wadhera, Diptanshu Das,
the positives in the ground truth. Debasis Panigrahi; Software: Varun Ganjigunte Prakash,
h) Specificity is defined as the proportion of actual neg- Manu Kohli; Supervision: Manu Kohli, Swati Kohli,
ative class samples, which got predicted as the true A. P. Prathosh, Tanu Wadhera, Diptanshu Das, Debasis
negatives. Panigrahi; Writing—review and editing: Varun Ganjigunte
i) F1 Score is the harmonic mean of precision and recall. Prakash, Manu Kohli, Tanu Wadhera; Visualization: Varun
j) Negative Predictive Value is the ratio of the number of Ganjigunte Prakash, Manu Kohli.
true negatives to the total number of class samples that
test negative. ETHICAL APPROVAL
Table 13 lists evaluation metrics with mathematical notations. All procedures performed in studies involving human par-
ticipants were in accordance with the ethical standards of
TABLE 13. Machine learning model performance metrics. the institutional and/or national research committee and with
the 1964 Helsinki declaration and its later amendments or
comparable ethical standards. This article does not contain
any studies with animals performed by any of the authors.

INFORMED CONSENT
Informed consent was obtained from all individual partici-
pants included in the study.

DATA AVAILABILITY STATEMENT

Due to the nature of the research and participants’ consent
agreements, data is not available. However, case-by-case
analysis can be made as per the request and data access
requirements.

REFERENCES
[1] CDC. (2021). Data and Statistics on Autism Spectrum Disorder. [Online].
Available: https://www.cdc.gov/ncbddd/autism/data.html
[2] C. Lord, S. Risi, L. Lambrecht, E. H. Cook, B. L. Leventhal, P. C. DiLavore,
A. Pickles, and M. Rutter, ‘‘The autism diagnostic observation schedule-
generic: A standard measure of social and communication deficits associ-
ated with the spectrum of autism,’’ J. Developmental Disorders, vol. 30,
ACKNOWLEDGMENT no. 3, pp. 205–223, Jun. 2000.
The authors would like to acknowledge the clinical experts [3] L. Jurek, M. Baltazar, S. Gulati, N. Novakovic, M. Núñez, J. Oakley, and
A. O’Hagan, ‘‘Response (minimum clinically relevant change) in ASD
and engineers at SM Learning Skills Academy for Special symptoms after an intervention according to CARS-2: Consensus from an
Needs Pvt. Ltd. for their support with video data collection expert elicitation procedure,’’ Eur. Child Adolescent Psychiatry, vol. 31,
and annotation. They also thank Microsoft Azure and IBM no. 8, pp. 1–10, Aug. 2022.
[4] M. L. Sundberg, VB-MAPP Verbal Behavior Milestones Assessment and
Cloud for giving them cloud credits that allowed them to store
Placement Program: A Language and Social Skills Assessment Program
huge amounts of video data and run GPU virtual machines. for Children with Autism or Other Developmental Disabilities. Concord,
CA, USA: AVB Press, 2008.
DISCLOSURE STATEMENT [5] H. S. Roane, W. W. Fisher, and J. E. Carr, ‘‘Applied behavior analysis as
treatment for autism spectrum disorder,’’ J. Pediatrics, vol. 175, pp. 27–32,
The authors Manu Kohli and Dr. A. P. Prathosh report a Sep. 2016.
relationship with the company SM Learning Skills Academy [6] R. K. Dogan, M. L. King, A. T. Fischetti, C. M. Lake, T. L. Math-
for Special Needs Pvt. Ltd. that includes: equity or stocks. ews, and W. J. Warzak, ‘‘Parent-implemented behavioral skills training
of social skills,’’ J. Appl. Behav. Anal., vol. 50, no. 4, pp. 805–818,
The remaining authors declare they have no financial or Oct. 2017.
non-financial conflicts of interest. [7] D. M. Bhatia. (Aug. 2020). How to Help Low-Income Children With
Autism. [Online]. Available: https://www.doctorbhatia.com/autism-
research-treatment/autism-diagnosis-how-is-autism-diagnosed-clinical-
CREDIT AUTHORSHIP CONTRIBUTION STATEMENT screening-cars-blood-gentic-tests/
Conceptualization: Manu Kohli, Swati Kohli, A. P. Prathosh; [8] A. Opar. (Jan. 2019). How to Help Low-Income Children With Autism.
Data curation: Varun Ganjigunte Prakash, Manu Kohli, Swati [Online]. Available: https://www.spectrumnews.org/features/deep-
dive/help-low-income-children-autism/
Kohli, Tanu Wadhera, Diptanshu Das, Debasis Panigrahi;
[9] M. Kohli, A. K. Kar, and S. Sinha, ‘‘The role of intelligent technologies
Formal analysis: A. P. Prathosh, Tanu Wadhera, Diptanshu in early detection of autism spectrum disorder (ASD): A scoping review,’’
Das, Debasis Panigrahi; Funding acquisition: Manu Kohli, IEEE Access, vol. 10, pp. 104887–104913, 2022.
A. P. Prathosh; Methodology: Varun Ganjigunte Prakash, [10] M. Kohli and S. Kohli, ‘‘Electronic assessment and training curriculum
based on applied behavior analysis procedures to train family members of
Manu Kohli, Swati Kohli, A. P. Prathosh; Project adminis- children diagnosed with autism,’’ in Proc. IEEE Region 10 Humanitarian
tration: Manu Kohli, Swati Kohli; Resources: Manu Kohli, Technol. Conf. (R10-HTC), Dec. 2016, pp. 1–6.

VOLUME 11, 2023 47925

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

[11] A. Ali, F. F. Negin, F. F. Bremond, and S. Thümmler, ‘‘Video-based behav- [29] D. L. Robins, K. Casagrande, M. Barton, C.-M.-A. Chen,
ior understanding of children for objective diagnosis of autism,’’ in Proc. T. Dumont-Mathieu, and D. Fein, ‘‘Validation of the modified checklist
17th Int. Joint Conf. Comput. Vis., Imag. Comput. Graph. Theory Appl., for autism in toddlers, revised with follow-up (M-CHAT-R/F),’’ Pediatrics,
Feb. 2022, pp. 1–11. [Online]. Available: https://hal.inria.fr/hal-03447060 vol. 133, no. 1, pp. 37–45, Jan. 2014.
[12] Q. Tariq, S. L. Fleming, J. N. Schwartz, K. Dunlap, C. Corbin, P. Washing- [30] C. A. Vaughan, ‘‘Test review: E. Schopler, ME Van Bourgondien,
ton, H. Kalantarian, N. Z. Khan, G. L. Darmstadt, and D. P. Wall, ‘‘Detect- GJ Wellman, & SR Love childhood autism rating scale. Los Angeles, CA:
ing developmental delay and autism through machine learning models Western psychological services, 2010,’’ J. Psychoeducational Assessment,
using home videos of Bangladeshi children: Development and validation vol. 29, no. 5, pp. 489–493, Oct. 2011.
study,’’ J. Med. Internet Res., vol. 21, no. 4, Apr. 2019, Art. no. e13822. [31] M. Marlow, C. Servili, and M. Tomlinson, ‘‘A review of screening tools for
[13] C. Lord, M. Rutter, and A. Le Couteur, ‘‘Autism diagnostic interview- the identification of autism spectrum disorders and developmental delay in
revised: A revised version of a diagnostic interview for caregivers of infants and young children: Recommendations for use in low- and middle-
individuals with possible pervasive developmental disorders,’’ J. Autism income countries,’’ Autism Res., vol. 12, no. 2, pp. 176–199, Feb. 2019,
Develop. Disorders, vol. 24, no. 5, pp. 659–685, Oct. 1994. doi: 10.1002/aur.2033.
[14] P. U. Putra, K. Shima, S. A. Alvarez, and K. Shimatani, ‘‘Identifying [32] K. Bauer, K. L. Morin, T. E. Renz, and S. Zungu, ‘‘Autism assessment
autism spectrum disorder symptoms using response and gaze behavior in low- and middle-income countries: Feasibility and usability of west-
during the Go/NoGo game CatChicken,’’ Sci. Rep., vol. 11, no. 1, p. 22012, ern tools,’’ Focus Autism Other Develop. Disabilities, vol. 37, no. 3,
Nov. 2021, doi: 10.1038/s41598-021-01050-7. pp. 179–188, Sep. 2022, doi: 10.1177/10883576211073691.
[15] L. Billeci et al., ‘‘Disentangling the initiation from the response in [33] R. Choueiri, W. T. Garrison, and V. Tokatli, ‘‘Early identification of
joint attention: An eye-tracking study in toddlers with autism spectrum autism spectrum disorder (ASD): Strategies for use in local commu-
disorders,’’ Transl. Psychiatry, vol. 6, no. 5, p. e808, May 2016, doi: nities,’’ Indian J. Pediatrics, vol. 90, no. 4, pp. 377–386, May 2022,
10.1038/tp.2016.75. doi: 10.1007/s12098-022-04172-6.
[16] C. Su, Z. Xu, J. Pathak, and F. Wang, ‘‘Deep learning in mental health [34] N. J. Hidalgo, L. L. McIntyre, and E. H. McWhirter, ‘‘Sociodemographic
outcome research: A scoping review,’’ Transl. Psychiatry, vol. 10, no. 1, differences in parental satisfaction with an autism spectrum disorder diag-
p. 116, Apr. 2020, doi: 10.1038/s41398-020-0780-3. nosis,’’ J. Intellectual Develop. Disability, vol. 40, no. 2, pp. 147–155,
[17] A. Esteva, K. Chou, S. Yeung, N. Naik, A. Madani, A. Mottaghi, Y. Apr. 2015.
Liu, E. Topol, J. Dean, and R. Socher, ‘‘Deep learning-enabled medical [35] A. J. Kumm, M. Viljoen, and P. J. de Vries, ‘‘The digital divide in tech-
computer vision,’’ npj Digit. Med., vol. 4, no. 1, p. 5, Jan. 2021, doi: nologies for autism: Feasibility considerations for low- and middle-income
10.1038/s41746-020-00376-2. countries,’’ J. Autism Develop. Disorders, vol. 52, no. 5, pp. 2300–2313,
[18] Y. Kumar, A. Koul, R. Singla, and M. F. Ijaz, ‘‘Artificial intelligence in May 2022.
disease diagnosis: A systematic literature review, synthesizing framework
[36] N. Malik-Soni, A. Shaker, H. Luck, A. E. Mullin, R. E. Wiley,
and future research agenda,’’ J. Ambient Intell. Humanized Comput., vol.
M. E. S. Lewis, J. Fuentes, and T. W. Frazier, ‘‘Tackling healthcare access
2022, pp. 1–28, Jan. 2022, doi: 10.1007/s12652-021-03612-z.
barriers for individuals with autism from diagnosis to adulthood,’’ Pedi-
[19] G. Brihadiswaran, D. Haputhanthri, S. Gunathilaka, D. Meedeniya, and atric Res., vol. 91, no. 5, pp. 1028–1035, Apr. 2022.
S. Jayarathna, ‘‘EEG-based processing and classification methodologies
[37] M. P. Kelly and P. Reed, ‘‘Examination of stimulus over-selectivity in
for autism spectrum disorder: A review,’’ J. Comput. Sci., vol. 15, no. 8,
children with autism spectrum disorder and its relationship to stereotyped
pp. 1161–1183, Aug. 2019.
behaviors and cognitive flexibility,’’ Focus Autism Other Develop. Disabil-
[20] L. Klintwall and S. Eikeseth, Early and Intensive Behavioral Intervention
ities, vol. 36, no. 1, pp. 47–56, Mar. 2021.
(EIBI) in Autism. New York, NY, USA: Springer, 2014, pp. 117–137, doi:
10.1007/978-1-4614-4788-7_129. [38] C. Gupta, P. Chandrashekar, T. Jin, C. He, S. Khullar, Q. Chang, and
D. Wang, ‘‘Bringing machine learning to research on intellectual and
[21] J. J. Wood, A. Drahota, K. Sze, K. Har, A. Chiu, and D. A. Langer, ‘‘Cogni-
developmental disabilities: Taking inspiration from neurological diseases,’’
tive behavioral therapy for anxiety in children with autism spectrum disor-
J. Neurodevelopmental Disorders, vol. 14, no. 1, p. 28, May 2022, doi:
ders: A randomized, controlled trial,’’ J. Child Psychol. Psychiatry, vol. 50,
10.1186/s11689-022-09438-w.
no. 3, pp. 224–234, Mar. 2009, doi: 10.1111/j.1469-7610.2008.01948.x.
[22] M. E. Król and M. Król, ‘‘A novel machine learning analysis [39] H. Abbas, F. Garberson, S. Liu-Mayo, E. Glover, and D. P. Wall,
of eye-tracking data reveals suboptimal visual information ‘‘Multi-modular AI approach to streamline autism diagnosis in
extraction from facial stimuli in individuals with autism,’’ young children,’’ Sci. Rep., vol. 10, no. 1, p. 5014, Mar. 2020,
Neuropsychologia, vol. 129, pp. 397–406, Jun. 2019. [Online]. Available: doi: 10.1038/s41598-020-61213-w.
https://www.sciencedirect.com/science/article/pii/S002839321930106X [40] D.-Y. Song, S. Y. Kim, G. Bong, J. M. Kim, and H. J. Yoo, ‘‘The use of
[23] R. Aishworiya, T. Valica, R. Hagerman, and B. Restrepo, ‘‘An update artificial intelligence in screening and diagnosis of autism spectrum dis-
on psychopharmacological treatment of autism spectrum disorder,’’ order: A literature review,’’ J. Korean Acad. Child Adolescent Psychiatry,
Neurotherapeutics, vol. 19, no. 1, pp. 248–262, Jan. 2022, doi: vol. 30, no. 4, pp. 145–152, Oct. 2019, doi: 10.5765/jkacap.190027.
10.1007/s13311-022-01183-1. [41] G. S. Young, J. N. Constantino, S. Dvorak, A. Belding, D. Gangi,
[24] X. Wang, J. Zhao, S. Huang, S. Chen, T. Zhou, Q. Li, X. Luo, A. Hill, M. Hill, M. Miller, C. Parikh, A. J. Schwichtenberg, E. Solis, and
and Y. Hao, ‘‘Cognitive behavioral therapy for autism spectrum dis- S. Ozonoff, ‘‘A video-based measure to identify autism risk in infancy,’’
orders: A systematic review,’’ Pediatrics, vol. 147, no. 5, May 2021, J. Child Psychol. Psychiatry, vol. 61, no. 1, pp. 88–94, Jan. 2020.
doi: 10.1542/peds.2020-049880. [42] E. Patten, K. Belardi, G. T. Baranek, L. R. Watson, J. D. Labban, and
[25] D. Ung, R. Selles, B. J. Small, and E. A. Storch, ‘‘A systematic D. K. Oller, ‘‘Vocal patterns in infants with autism spectrum disorder:
review and meta-analysis of cognitive-behavioral therapy for anxi- Canonical babbling status and vocalization frequency,’’ J. Autism Develop.
ety in youth with high-functioning autism spectrum disorders,’’ Child Disorders, vol. 44, no. 10, pp. 2413–2428, Oct. 2014.
Psychiatry Hum. Develop., vol. 46, no. 4, pp. 533–547, Aug. 2015, [43] K. L. H. Carpenter, J. Hahemi, K. Campbell, S. J. Lippmann, J. P. Baker,
doi: 10.1007/s10578-014-0494-y. H. L. Egger, S. Espinosa, S. Vermeer, G. Sapiro, and G. Dawson, ‘‘Digital
[26] F. Mohammadzaheri, L. K. Koegel, M. Rezaee, and S. M. Rafiee, ‘‘A ran- behavioral phenotyping detects atypical pattern of facial expression in
domized clinical trial comparison between pivotal response treatment toddlers with autism,’’ Autism Res., vol. 14, no. 3, pp. 488–499, Mar. 2021.
(PRT) and structured applied behavior analysis (ABA) intervention for [44] R. Rahman, A. Kodesh, S. Z. Levine, S. Sandin, A. Reichenberg, and
children with autism,’’ J. Autism Develop. Disorders, vol. 44, no. 11, A. Schlessinger, ‘‘Identification of newborns at risk for autism using
pp. 2769–2777, Nov. 2014, doi: 10.1007/s10803-014-2137-3. electronic medical records and machine learning,’’ Eur. Psychiatry, vol. 63,
[27] Q. Yu, E. Li, L. Li, and W. Liang, ‘‘Efficacy of interventions based on no. 1, p. e22, 2020.
applied behavior analysis for autism spectrum disorder: A meta-analysis,’’ [45] L. Ouss, G. Palestra, C. Saint-Georges, M. L. Gille, M. Afshar, H. Pellerin,
Psychiatry Invest., vol. 17, no. 5, pp. 432–443, May 2020. K. Bailly, M. Chetouani, L. Robel, B. Golse, R. Nabbout, I. Desguerre,
[28] H. Manohar, P. Kandasamy, V. Chandrasekaran, and R. P. Rajkumar, M. Guergova-Kuras, and D. Cohen, ‘‘Behavior and interaction imaging
‘‘Early diagnosis and intervention for autism spectrum disorder: Need for at 9 months of age predict autism/intellectual disability in high-risk
pediatrician-child psychiatrist liaison,’’ Indian J. Psychol. Med., vol. 41, infants with west syndrome,’’ Transl. Psychiatry, vol. 10, no. 1, pp. 1–7,
no. 1, pp. 87–90, 2019, doi: 10.4103/IJPSYM.IJPSYM_154_18. Feb. 2020.

47926 VOLUME 11, 2023

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

[46] J. Hashemi, G. Dawson, K. L. H. Carpenter, K. Campbell, Q. Qiu, [66] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, ‘‘HMDB: A
S. Espinosa, S. Marsan, J. P. Baker, H. L. Egger, and G. Sapiro, ‘‘Computer large video database for human motion recognition,’’ in Proc. IEEE Int.
vision analysis for quantification of autism risk behaviors,’’ IEEE Trans. Conf. Comput. Vis., Nov. 2011, pp. 2556–2563.
Affect. Comput., vol. 12, no. 1, pp. 215–226, Jan. 2021. [67] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier,
[47] M. Kohli, A. K. Kar, A. Bangalore, and P. Ap, ‘‘Machine learning-based S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman,
ABA treatment recommendation and personalization for autism spectrum and A. Zisserman, ‘‘The kinetics human action video dataset,’’ 2017,
disorder: An exploratory study,’’ Brain Informat., vol. 9, no. 1, p. 16, arXiv:1705.06950.
Jul. 2022, doi: 10.1186/s40708-022-00164-6. [68] V. Choutas, P. Weinzaepfel, J. Revaud, and C. Schmid, ‘‘PoTion:
[48] M. Uddin, Y. Wang, and M. Woodbury-Smith, ‘‘Artificial intelligence for Pose motion representation for action recognition,’’ in Proc.
precision medicine in neurodevelopmental disorders,’’ npj Digit. Med., IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018,
vol. 2, no. 1, p. 112, Nov. 2019, doi: 10.1038/s41746-019-0191-0. pp. 7024–7033.
[49] S. Jain, B. Thiagarajan, Z. Shi, C. Clabaugh, and M. J. Matarić, ‘‘Modeling [69] N. Dalal and B. Triggs, ‘‘Histograms of oriented gradients for human detec-
engagement in long-term, in-home socially assistive robot interventions tion,’’ in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.
for children with autism spectrum disorders,’’ Sci. Robot., vol. 5, no. 39, (CVPR), 2005, pp. 886–893.
Feb. 2020, doi: 10.1126/scirobotics.aaz3791. [70] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, ‘‘Dense trajectories and
[50] M. Toshpulatov, W. Lee, S. Lee, and A. H. Roudsari, ‘‘Human motion boundary descriptors for action recognition,’’ Int. J. Comput. Vis.,
pose, hand and mesh estimation using deep learning: A survey,’’ vol. 103, no. 1, pp. 60–79, May 2013.
J. Supercomput., vol. 78, no. 6, pp. 7616–7654, Apr. 2022, [71] R. A. J. de Belen, T. Bednarz, A. Sowmya, and D. Del Favero, ‘‘Computer
doi: 10.1007/s11227-021-04184-7. vision in autism spectrum disorder research: A systematic review of pub-
[51] T. L. Munea, Y. Z. Jembre, H. T. Weldegebriel, L. Chen, C. Huang, and lished studies from 2009 to 2019,’’ Translational Psychiatry, vol. 10, no. 1,
C. Yang, ‘‘The progress of human pose estimation: A survey and taxonomy pp. 1–20, Sep. 2020.
of models applied in 2D human pose estimation,’’ IEEE Access, vol. 8, [72] L. Zhang, M. Wang, M. Liu, and D. Zhang, ‘‘A survey on deep learning for
pp. 133330–133348, 2020. neuroimaging-based brain disorder analysis,’’ Frontiers Neurosci., vol. 14,
[52] C. Zheng, W. Wu, T. Yang, S. Zhu, C. Chen, R. Liu, J. Shen, p. 779, Oct. 2020.
N. Kehtarnavaz, and M. Shah, ‘‘Deep learning-based human pose estima- [73] P. Pandey, P. Ap, M. Kohli, and J. Pritchard, ‘‘Guided weak
tion: A survey,’’ 2020, arXiv:2012.13392. supervision for action recognition with scarce data to assess
[53] W. Zhang, J. Fang, X. Wang, and W. Liu, ‘‘EfficientPose: Efficient human skills of children with autism,’’ Proc. AAAI Conf. Artif. Intell.,
pose estimation with neural architecture search,’’ Comput. Vis. Media, Apr. 2020, vol. 34, no. 1, pp. 463–470. [Online]. Available:
vol. 7, no. 3, pp. 335–347, Sep. 2021, doi: 10.1007/s41095-021-0214-z. https://ojs.aaai.org/index.php/AAAI/article/view/5383
[54] S. Dubey and M. Dixit, ‘‘A comprehensive survey on human pose estima- [74] J. Carreira and A. Zisserman, ‘‘Quo vadis, action recognition? A new
tion approaches,’’ Multimedia Syst., vol. 29, no. 1, pp. 167–195, Aug. 2022, model and the kinetics dataset,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
doi: 10.1007/s00530-022-00980-0. Recognit. (CVPR), Aug. 2017, pp. 4724–4733.
[55] H. Chen, R. Feng, S. Wu, H. Xu, F. Zhou, and Z. Liu, ‘‘2D human pose [75] S. F. Dos Santos, N. Sebe, and J. Almeida, CV-C3D: Action
estimation: A survey,’’ 2022, arXiv:2204.07370. Recognition on Compressed Videos with Convolutional 3D
[56] G. Sciortino, G. M. Farinella, S. Battiato, M. Leo, and C. Distante, ‘‘On Networks. Porto Alegre, RS, Brasil: SBC, 2019. [Online]. Available:
the estimation of children’s poses,’’ in Image Analysis and Processing— https://sol.sbc.org.br/index.php/sibgrapi/article/view/9782
ICIAP, S. Battiato, G. Gallo, R. Schettini, and F. Stanco, Eds. Cham, [76] H. Xu, A. Das, and K. Saenko, ‘‘R-C3D: Region convolutional 3D network
Switzerland: Springer, 2017, pp. 410–421. for temporal activity detection,’’ in Proc. IEEE Int. Conf. Comput. Vis.
[57] J. Stenum, K. M. Cherry-Allen, C. O. Pyles, R. D. Reetzke, M. F. Vignos, (ICCV), Oct. 2017, pp. 5794–5803.
and R. T. Roemmich, ‘‘Applications of pose estimation in human [77] A. Richard and J. Gall, ‘‘A bag-of-words equivalent recurrent
health and performance across the lifespan,’’ Sensors, vol. 21, no. 21, neural network for action recognition,’’ Comput. Vis. Image
p. 7315, Nov. 2021. [Online]. Available: https://www.mdpi.com/1424- Understand., vol. 156, pp. 79–91, Mar. 2017. [Online]. Available:
8220/21/21/7315 https://www.sciencedirect.com/science/article/pii/S1077314216301680
[58] D. Cazzato, P. L. Mazzeo, P. Spagnolo, and C. Distante, ‘‘Automatic joint [78] J. Tang, J. Xia, X. Mu, B. Pang, and C. Lu, Asynchronous Interaction
attention detection during interaction with a humanoid robot,’’ in Social Aggregation for Action Detection. New York, NY, USA: Springer-Verlag,
Robotics, A. Tapus, E. André, J.-C. Martin, F. Ferland, and M. Ammi, Eds. 2020, pp. 71–87, doi: 10.1007/978-3-030-58555-6_5.
Cham, Switzerland: Springer, 2015, pp. 124–134. [79] Y. Liu, L. Ma, Y. Zhang, W. Liu, and S. Chang, ‘‘Multi-granularity gener-
[59] K. Kim and P. Mundy, ‘‘Joint attention, social-cognition, and recognition ator for temporal action proposal,’’ in Proc. IEEE/CVF Conf. Comput. Vis.
memory in adults,’’ Frontiers Hum. Neurosci., vol. 6, p. 172, Jun. 2012. Pattern Recognit. (CVPR), Jun. 2019, pp. 3599–3608.
[60] P. Nyström, E. Thorup, S. Bölte, and T. Falck-Ytter, ‘‘Joint attention in [80] V. Escorcia, F. Caba Heilbron, J. C. Niebles, and B. Ghanem, ‘‘DAPs: Deep
infancy and the emergence of autism,’’ Biol. Psychiatry, vol. 86, no. 8, action proposals for action understanding,’’ in Computer Vision—ECCV,
pp. 631–638, Oct. 2019, doi: 10.1016/j.biopsych.2019.05.006. B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham, Switzerland:
[61] G. Wan, X. Kong, B. Sun, S. Yu, Y. Tu, J. Park, C. Lang, M. Koh, Z. Wei, Springer, 2016, pp. 768–784.
Z. Feng, Y. Lin, and J. Kong, ‘‘Applying eye tracking to identify autism [81] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, ‘‘Action tubelet
spectrum disorder in children,’’ J. Autism Develop. Disorders, vol. 49, detector for spatio-temporal action localization,’’ in Proc. IEEE Int. Conf.
no. 1, pp. 209–215, Jan. 2019. Comput. Vis. (ICCV), Oct. 2017, pp. 4415–4423.
[62] W. Zhao and L. Lu, ‘‘Research and development of autism diagnosis [82] M. Tomei, L. Baraldi, S. Calderara, S. Bronzin, and R. Cucchiara,
information system based on deep convolution neural network and facial ‘‘Video action detection by learning graph-based spatio-temporal
expression data,’’ Library Hi Tech, vol. 38, no. 4, pp. 799–817, Mar. 2020. interactions,’’ Comput. Vis. Image Understand., vol. 206, May 2021,
[63] G. Alvari, C. Furlanello, and P. Venuti, ‘‘Is smiling the key? Machine Art. no. 103187. [Online]. Available: https://www.sciencedirect.com/
learning analytics detect subtle patterns in micro-expressions of infants science/article/pii/S107731422100031X
with ASD,’’ J. Clin. Med., vol. 10, no. 8, p. 1776, Apr. 2021. [Online]. [83] C. Plizzari, M. Cannici, and M. Matteucci, ‘‘Skeleton-based
Available: https://www.mdpi.com/2077-0383/10/8/1776 action recognition via spatial and temporal transformer networks,’’
[64] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, Comput. Vis. Image Understand., vols. 208–209, Jul. 2021,
S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, C. Schmid, Art. no. 103219. [Online]. Available: https://www.sciencedirect.com/
and J. Malik, ‘‘AVA: A video dataset of spatio-temporally localized atomic science/article/pii/S1077314221000631
visual actions,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., [84] X. Wang and A. Gupta, ‘‘Videos as space-time region graphs,’’ in Com-
May 2018, pp. 6047–6056. puter Vision—ECCV, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss,
[65] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, Eds. Cham, Switzerland: Springer, 2018, pp. 413–431.
M. Shah, and R. Sukthankar. (2014). THUMOS Challenge: Action [85] C. Feichtenhofer, H. Fan, J. Malik, and K. He, ‘‘Slowfast networks for
Recognition With a Large Number of Classes. [Online]. Available: video recognition,’’ in Proc. IEEE Int. Conf. Comput. Vis., Jan. 2019,
http://crcv.ucf.edu/THUMOS14/ pp. 6202–6211.

VOLUME 11, 2023 47927

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

[86] H. Xu, L. Yang, S. Sclaroff, K. Saenko, and T. Darrell, ‘‘Spatio- VARUN GANJIGUNTE PRAKASH received the
temporal action detection with multi-object interaction,’’ 2020, B.E. degree in electronics and communication
arXiv:2004.00180. from Sri Jayachamarajendra College of Engineer-
[87] M. Tapaswi, V. Kumar, and I. Laptev, ‘‘Long term spatio- ing, India, in 2018. He has five years of experi-
temporal modeling for action detection,’’ Comput. Vis. Image ence in developing machine learning and computer
Understand., vol. 210, Sep. 2021, Art. no. 103242. [Online]. Available: vision-based technological solutions for multiple
https://www.sciencedirect.com/science/article/pii/S1077314221000862 industries and startups. He has spent time working
[88] J. Tang, J. Xia, X. Mu, B. Pang, and C. Lu, ‘‘Asynchronous interac- on many computer vision and robotics challenges.
tion aggregation for action detection,’’ in Proc. Eur. Conf. Comput. Vis. His research interests include solving problems at
(ECCV), 2020, pp. 71–87. the intersection of computer vision and robotics,
[89] J. Redmon and A. Farhadi, ‘‘YOLOv3: An incremental improvement,’’ computer vision, robotic manipulation, control systems, autonomous mobile
2018, arXiv:1804.02767. robots, deep learning, deep reinforcement learning, robotic system design,
[90] G. Jocher, A. Stoken, J. Borovec, L. Changyu, A. Hogan, and machine learning.
L. Diaconu, F. Ingham, J. Poznanski, J. Fang, L. Yu, M. Wang,
N. Gupta, O. Akhtar, and P. Rai, ‘‘ultralytics/yolov5: V3.1—Bug fixes MANU KOHLI is currently pursuing the Ph.D.
and performance improvements,’’ Ultralytics, Los Angeles, CA, USA,
degree with the Indian Institute of Technology,
Tech. Rep. 7.0, Oct. 2020, doi: 10.5281/zenodo.4154370.
Delhi (IIT Delhi). He has 18 years of experi-
[91] S. Ren, K. He, R. Girshick, and J. Sun, ‘‘Faster R-CNN: Towards real-
ence in executing large-scale business and digi-
time object detection with region proposal networks,’’ IEEE Trans. Pattern
tal transformation projects in multiple countries.
Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2015.
He has undertaken leadership positions in Fortune
[92] I. J. Goodfellow et al., ‘‘Challenges in representation learning: A report
500 organizations globally and has worked on mul-
on three machine learning contests,’’ Neural Netw., vol. 64, pp. 59–63,
Apr. 2015. tiple technologies, such as SAP, Saas, and machine
learning. He is the Chief Technology Officer of
[93] M. Monfort, A. Andonian, B. Zhou, K. Ramakrishnan, S. A. Bargal,
T. Yan, L. Brown, Q. Fan, D. Gutfruend, C. Vondrick, and A. Oliva, CogniAble, where he has developed innovative
‘‘Moments in time dataset: One million videos for event understand- artificial intelligence solutions for detecting and managing developmen-
ing,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 1–8, tal challenges, including autism, with outstanding psychometric proper-
Feb. 2019. ties. He has led the formation of new technology-enabled businesses and
[94] J. Bidwell, A. Rozga, I. Essa, and G. Abowd, ‘‘Measuring child visual ensured their commercialization. He has authored multiple publications in
attention using markerless head tracking from color and depth sensing peer-reviewed journals and books by SAP PRESS. His research interests
cameras,’’ in Proc. 16th Int. Conf. Multimodal Interact., Nov. 2014, include developing cutting-edge machine learning, deep learning, and com-
pp. 447–454. puter vision methods to solve complex business and healthcare problems.
[95] E. Murphy-Chutorian and M. M. Trivedi, ‘‘Head pose estimation in com- He has received numerous honors, including the UNICEF Blue Ribbon,
puter vision: A survey,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, AI Gamechangers, and cash prizes from Lockheed Martin, Tata-Trusts,
no. 4, pp. 607–626, Apr. 2009. Western Digital, NASSCOM, NTT-Data, and the Ministry of Electronics for
[96] M. Raza, Z. Chen, S.-U. Rehman, P. Wang, and P. Bao, ‘‘Appearance his innovations.
based pedestrians’ head pose and body orientation estimation using
deep learning,’’ Neurocomputing, vol. 272, pp. 647–659, Jan. 2018, doi: SWATI KOHLI received the Diploma degree,
10.1016/j.neucom.2017.07.029. in 1998, the B.Ed. degree in special education,
[97] E. Rehder, H. Kloeden, and C. Stiller, ‘‘Head detection and orientation in 2002, the Postgraduate Diploma degree in early
estimation for pedestrian safety,’’ in Proc. 17th Int. IEEE Conf. Intell. intervention, and the M.A. degree in psychology.
Transp. Syst. (ITSC), Oct. 2014, pp. 2292–2297. She completed the ABA coursework from the
[98] D. Heo, J. Nam, and B. Ko, ‘‘Estimation of pedestrian pose ori- Florida Institute of Technology. She is currently
entation using soft target training based on teacher–student frame- the Clinical Director of CogniAble. She has more
work,’’ Sensors, vol. 19, no. 5, p. 1147, Mar. 2019. [Online]. Available: than 18 years of experience in working for special
https://www.mdpi.com/1424-8220/19/5/1147 needs children with neuro-developmental delays in
[99] S. Shao, Z. Zhao, B. Li, T. Xiao, G. Yu, X. Zhang, and J. Sun, school, center, and clinic-based settings.
‘‘CrowdHuman: A benchmark for detecting human in a crowd,’’ 2018,
arXiv:1805.00123. A. P. PRATHOSH received the B.Tech. degree,
[100] T.-H. Vu, A. Osokin, and I. Laptev, ‘‘Context-aware CNNs for person in 2011, and the Ph.D. degree in temporal data
head detection,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, analysis from the Indian Institute of Science (IISc),
pp. 2893–2901. Bengaluru, in 2015. He submitted the Ph.D. thesis
[101] Y. Zhou and J. Gregson, ‘‘WHENet: Real-time fine-grained estimation three years after the B.Tech. degree, with many
for wide range head pose,’’ 2020, arXiv:2005.10353. top-tier journal publications. He also happens to
[102] Y. Huang, X. Liu, L. Jin, and X. Zhang, ‘‘DeepFinger: A cascade convolu- be a Student of the Sanskrit language and Indian
tional neuron network approach to finger key point detection in egocentric Philosophical Sciences. He was with corporate
vision with mobile camera,’’ in Proc. IEEE Int. Conf. Syst., Man, Cybern.,
research labs, including Xerox Research India,
Oct. 2015, pp. 2944–2949.
Philips Research, and a start-up in CA, USA.
[103] Y. Huang, X. Liu, X. Zhang, and L. Jin, ‘‘A pointing gesture based egocen-
He has co-founded CogniAble which builds learning algorithms for behav-
tric interaction system: Dataset, approach and application,’’ in Proc. IEEE
ioral healthcare using video analytics (first-place winner of the recent AI
Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2016,
pp. 16–23. startup challenge by the Government of India) and also actively engaged with
several corporate industries, start-ups, and medical centers (e.g., AIIMS) in
[104] D. Shukla, O. Erkent, and J. Piater, ‘‘Probabilistic detection of
pointing directions for human–robot interaction,’’ in Proc. Int. solving interesting technical problems. He joined the Computer Technology
Conf. Digit. Image Comput., Techn. Appl. (DICTA), Nov. 2015, Group of Electrical Engineering, IIT Delhi, in 2017, as an Assistant Profes-
pp. 1–8. sor, where he was engaged in research and teaching of machine and deep
[105] G. Benitez-Garcia, J. Olivares-Mercado, G. Sanchez-Perez, and K. Yanai, learning courses. He is currently a Faculty Member with the Department
‘‘IPN hand: A video dataset and benchmark for real-time continuous hand of ECE, IISc. His work in the industry, focusing on healthcare analytics,
gesture recognition,’’ in Proc. 25th Int. Conf. Pattern Recognit. (ICPR), led to the generation of several IPs, comprising 15 (U.S.) patents of which
Milan, Italy, Jan. 2021, pp. 4340–4347. ten are granted and six are commercialized. His current research interests
[106] H. Xia and Y. Zhan, ‘‘A survey on temporal action localization,’’ IEEE include deep-representational learning, cross-domain generalization, signal
Access, vol. 8, pp. 70477–70487, 2020. processing, and their applications in computer vision and speech analytics.

47928 VOLUME 11, 2023

V. G. Prakash et al.: Computer Vision-Based Assessment of Autistic Children

TANU WADHERA received the B.Tech. degree DEBASIS PANIGRAHI received the M.B.B.S.
in electronics and communication from the Guru degree from the Veer Surendra Sai Institute
Nanak Engineering College, Ludhiana, India, the of Medical Sciences and Research, Sambalpur,
M.Tech. degree in electronics and communica- Odisha, in 2006, and the M.D. degree in pediatrics
tion from Punjabi University, Patiala, India, and from the Sriram Chandra Bhanj Medical College,
the Ph.D. degree from the National Institute of Cuttack, in 2011. He received a fellowship in pedi-
Technology Jalandhar (NIT Jalandhar), Jalandhar, atric neurology from the Kanchi Kamakoti Child
Punjab, India. She completed her postdoctoral Trust Hospital, Chennai, in 2013. He has 16 years
research with the Indian Institute of Technology, of experience as a Pediatrician and a Pediatric
Delhi, India. She has a total of six years of research Neurologist in Bhubaneswar. He currently prac-
experience, including four years with NIT Jalandhar. She has one year of tices with the Child Neuro Clinic, Jagannath Hospital, Bhubaneswar. He is
teaching experience as an Assistant Professor at NIT Jalandhar. Based on a member of the Indian Academy of Paediatrics (IAP), the Association of
her contribution to the field of computational healthcare, especially autism Child Neurologists, India, and the IAP’s Developmental Pediatric Chapter.
spectrum disorder and other disabilities lying on the same spectrum, she
is currently a Project Engineer with the Indian Institute of Technology in
collaboration with AIIMS, Delhi. She is also an Assistant Professor with the
School of Electronics, Indian Institute of Information Technology Una, Una,
India. She has experience publishing work in reputed journals and editing
and/or authoring books for several journals. Her research interests include
artificial intelligence, assistive technology, behavioral modeling, biomedical
signal processing, cognitive neuroscience, and machine learning.

JOHN VIJAY SAGAR KOMMU received the

bachelor’s degree in medical training from the
DIPTANSHU DAS received the M.B.B.S. degree S. V. Medical College, Tirupati, and the Doctor
from the Medical College, Kolkata, the M.H.Sc. of Medicine (M.D.) degree in psychiatry from the
degree in clinical child development from the National Institute of Mental Health and Neuro-
University of Kerala, and the M.D. degree in pedi- sciences (NIMHANS), Bengaluru. He has exten-
atric neurology (level 2 master’s) from the Uni- sive academic, clinical, and research experience of
versity of Rome, Tor Vergata. He is a Pediatric over 20 years, having worked in reputed institu-
Neurologist by education and profession. He was tions, such as Christian Medical College, Vellore,
formerly a Consultant with Medica Superspecialty and JIPMER, Pondicherry. He was trained with the
Hospital (MSH), Kolkata, and the Institute of McKnight Brain Institute, University of Florida. He was recently appointed
Neurosciences Kolkata (I-NK). He is currently to Affiliate Faculty with the School of Biological and Population Health Sci-
a Consultant with the Institute of Child Health, Kolkata. He is the ences, Oregon State University. He has been a Consultant Child Psychiatrist
Founder/Co-Founder of the Institute of NeuroDevelopment, Kolkata, India. with the Department of Child and Adolescent Psychiatry, NIMHANS, for
He is also a long-term Wikipedian and an Advocate for open access and open over ten years. He is currently a Professor and the Head of the Department of
education. He is a General Practitioner and an eminent Pediatric Neurologist. Child and Adolescent Psychiatry, NIMHANS. He has authored 60 research
He is proficient in treating patients with epilepsy or neurodevelopmental articles and seven book chapters. His current research interests include neu-
disorders, with 16 years of experience. He has several academic publications. rodevelopmental disorders especially ASD, pediatric psychopharmacology,
He is a member of the Education Council and the Publication Council of the adolescent mental health, child abuse, and neurobiology of child psychiatric
International League Against Epilepsy (ILAE). He is an Executive Board disorders. He was awarded the prestigious NIH (USA) funded Indo–U.S.
Member of the Association of Child Neurology (AOCN), India, and the Fogarty Fellowship, in 2014.
Editor-in-Chief of the ILAE Project.

VOLUME 11, 2023 47929

Review 3
No ratings yet
Review 3
83 pages
Facial Image-Based Autism Detection A Comparative Study of Deep Neural
100% (1)
Facial Image-Based Autism Detection A Comparative Study of Deep Neural
22 pages
Advanced Hybrid Transfer Learning Approaches For Autism Spectrum Disorder Detection Using Facial Features
No ratings yet
Advanced Hybrid Transfer Learning Approaches For Autism Spectrum Disorder Detection Using Facial Features
5 pages
A Decision Support System To Identify Disorder Symptoms in Children - Final
No ratings yet
A Decision Support System To Identify Disorder Symptoms in Children - Final
85 pages
Team-14 Major Project Presentation
No ratings yet
Team-14 Major Project Presentation
42 pages
2.ASD Transformer
No ratings yet
2.ASD Transformer
24 pages
Leveraging Artificial Intelligence For Diagnosis of Children Autism Through Facial Expressions
No ratings yet
Leveraging Artificial Intelligence For Diagnosis of Children Autism Through Facial Expressions
20 pages
Labeling Images With Facial Emotion and The Potential For Pediatric Healthcare - PMC
No ratings yet
Labeling Images With Facial Emotion and The Potential For Pediatric Healthcare - PMC
23 pages
ASD DLrevZ24
No ratings yet
ASD DLrevZ24
33 pages
Beechjet 400A - BE400A Training Supplement - Revision 2 03-17 PDF
100% (1)
Beechjet 400A - BE400A Training Supplement - Revision 2 03-17 PDF
138 pages
Autyolo-Att: An Attention-Based Yolov8 Algorithm For Early Autism Diagnosis Through Facial Expression Recognition
No ratings yet
Autyolo-Att: An Attention-Based Yolov8 Algorithm For Early Autism Diagnosis Through Facial Expression Recognition
21 pages
Applying Eye Tracking With Deep Learning Techniques For Early-Stage Detection of Autism Spectrum Disorders - 2023
No ratings yet
Applying Eye Tracking With Deep Learning Techniques For Early-Stage Detection of Autism Spectrum Disorders - 2023
27 pages
Ambiguous Facial Expression Detection For Autism S
No ratings yet
Ambiguous Facial Expression Detection For Autism S
15 pages
Project Phase I
No ratings yet
Project Phase I
24 pages
Deep Learning Algorithms To Identify Autism Spectrum
No ratings yet
Deep Learning Algorithms To Identify Autism Spectrum
21 pages
Springer Book Chapter
No ratings yet
Springer Book Chapter
26 pages
Zeroth Review
No ratings yet
Zeroth Review
15 pages
PIIS2405844023039701
No ratings yet
PIIS2405844023039701
14 pages
Development and Validation of A Joint Attention-Based Deep Learning System For Detection and Symptom Severity Assessment of Autism Spectrum Disorder
No ratings yet
Development and Validation of A Joint Attention-Based Deep Learning System For Detection and Symptom Severity Assessment of Autism Spectrum Disorder
13 pages
Pone.0302236 A Hybrid CNN-SVM Model For Enhanced Autism Diagnosis
No ratings yet
Pone.0302236 A Hybrid CNN-SVM Model For Enhanced Autism Diagnosis
20 pages
Bbe, 13 (1), 557-572
No ratings yet
Bbe, 13 (1), 557-572
16 pages
Wa0110
No ratings yet
Wa0110
15 pages
Mohanty 2021 J. Phys. Conf. Ser. 1921 012006
No ratings yet
Mohanty 2021 J. Phys. Conf. Ser. 1921 012006
18 pages
Asd 18
No ratings yet
Asd 18
19 pages
Real-Time Facial Emotion Recognition Model Based On Kernel Autoencoder and Convolutional Neural Network For Autism Children
No ratings yet
Real-Time Facial Emotion Recognition Model Based On Kernel Autoencoder and Convolutional Neural Network For Autism Children
14 pages
Kim 2023 Oi 231394 1702050499.36439
No ratings yet
Kim 2023 Oi 231394 1702050499.36439
12 pages
3 - 2024 Paper
No ratings yet
3 - 2024 Paper
12 pages
Autism Spectrum Disorder Diagnosis
No ratings yet
Autism Spectrum Disorder Diagnosis
12 pages
Autism Spectrum Disorder Identification With Multi-Site Functional Magnetic Resonance Imaging
No ratings yet
Autism Spectrum Disorder Identification With Multi-Site Functional Magnetic Resonance Imaging
12 pages
Asd 14
No ratings yet
Asd 14
12 pages
Recognition of Atypical Behavior in Autism
No ratings yet
Recognition of Atypical Behavior in Autism
6 pages
Definitions of CEC2017 Benchmark Suite Final Version Updated
100% (1)
Definitions of CEC2017 Benchmark Suite Final Version Updated
34 pages
Deep Learning For Neuroimaging-Based Diagnosis and Rehabilitation of Autism Spectrum Disorder: A Review
No ratings yet
Deep Learning For Neuroimaging-Based Diagnosis and Rehabilitation of Autism Spectrum Disorder: A Review
17 pages
Made Asd
No ratings yet
Made Asd
10 pages
ASD Classification For Children Using Deep Nueral Network
No ratings yet
ASD Classification For Children Using Deep Nueral Network
11 pages
Early Screening of Autism in Toddlers Via Response-To-Instructions Protocol
No ratings yet
Early Screening of Autism in Toddlers Via Response-To-Instructions Protocol
11 pages
Appendix D 2
No ratings yet
Appendix D 2
7 pages
2 - 2024 Paper
No ratings yet
2 - 2024 Paper
10 pages
Proposal
No ratings yet
Proposal
11 pages
Journal
No ratings yet
Journal
8 pages
Autism Detection of MRI Brain Images Using Hybrid Deep CNN With DM-Resnet Classifier
No ratings yet
Autism Detection of MRI Brain Images Using Hybrid Deep CNN With DM-Resnet Classifier
11 pages
Early Onset Autism Detection Using Deep Learning Technique
No ratings yet
Early Onset Autism Detection Using Deep Learning Technique
5 pages
Autism Spectrum Disorder
No ratings yet
Autism Spectrum Disorder
10 pages
On Automatically Assessing Children's Facial Expressions Quality: A Study, Database, and Protocol
No ratings yet
On Automatically Assessing Children's Facial Expressions Quality: A Study, Database, and Protocol
11 pages
Sailaja Paper
No ratings yet
Sailaja Paper
6 pages
Model For Autism Disorder Detection Using Deep Learning
No ratings yet
Model For Autism Disorder Detection Using Deep Learning
8 pages
Multi Label Classification For Emotion Analysis of Autism Spectrum Disorder Children Using Deep Neural Networks
No ratings yet
Multi Label Classification For Emotion Analysis of Autism Spectrum Disorder Children Using Deep Neural Networks
5 pages
ECCE Presentation
No ratings yet
ECCE Presentation
4 pages
System For Supporting Children With Autism Spectrum Disorder Based On Deep Learning Techniques
No ratings yet
System For Supporting Children With Autism Spectrum Disorder Based On Deep Learning Techniques
6 pages
Research Article: Classification and Detection of Autism Spectrum Disorder Based On Deep Learning Algorithms
No ratings yet
Research Article: Classification and Detection of Autism Spectrum Disorder Based On Deep Learning Algorithms
10 pages
Autism Detection in High-Functioning Adults With The Application of Eye-Tracking Technology and Machine Learning
No ratings yet
Autism Detection in High-Functioning Adults With The Application of Eye-Tracking Technology and Machine Learning
4 pages
Facial Emotion Recognition System To Predict Autism Inhumans
No ratings yet
Facial Emotion Recognition System To Predict Autism Inhumans
6 pages
Paper For Conf1
No ratings yet
Paper For Conf1
6 pages
A Deep Convolutional Neural Network Based Detection System For Autism Spectrum Disorder in Facial Images
No ratings yet
A Deep Convolutional Neural Network Based Detection System For Autism Spectrum Disorder in Facial Images
5 pages
Detecting High-Functioning Autism in Adults Using Eye Tracking and Machine Learning
No ratings yet
Detecting High-Functioning Autism in Adults Using Eye Tracking and Machine Learning
8 pages
Predicting Autism Diagnosis Using Image With Fixations and Synthetic Saccade Patterns
No ratings yet
Predicting Autism Diagnosis Using Image With Fixations and Synthetic Saccade Patterns
4 pages
Abstract
No ratings yet
Abstract
2 pages
IEEE Conference Template 2
No ratings yet
IEEE Conference Template 2
5 pages
Machine Learning Classifiers For Autism Spectrum Disorder A Review
No ratings yet
Machine Learning Classifiers For Autism Spectrum Disorder A Review
6 pages
IJCRT2209468
No ratings yet
IJCRT2209468
8 pages
BCM84891L Broadcom
No ratings yet
BCM84891L Broadcom
181 pages
Flexible Manufacturing System (FMS) and Automated Guided Vehicle System (Agvs)
No ratings yet
Flexible Manufacturing System (FMS) and Automated Guided Vehicle System (Agvs)
97 pages
Video Gesture Analysis For Autism Spectrum Disorder Detection 2018
No ratings yet
Video Gesture Analysis For Autism Spectrum Disorder Detection 2018
6 pages
Prediction and Comparison Using Adaboost and ML Algorithms With Autistic Children Dataset IJERTV9IS070091
No ratings yet
Prediction and Comparison Using Adaboost and ML Algorithms With Autistic Children Dataset IJERTV9IS070091
4 pages
Handbook of Procedures 2023
No ratings yet
Handbook of Procedures 2023
212 pages
Lecture # 6: Computer Organization and Assembly Language
100% (1)
Lecture # 6: Computer Organization and Assembly Language
31 pages
HIPOT Testing Dielectric Strength Test
No ratings yet
HIPOT Testing Dielectric Strength Test
4 pages
Assignment On Grameenphone 5c, PESTLE & SWOT Analysis: Submitted To: Jeta Majumder Assistant Professor
No ratings yet
Assignment On Grameenphone 5c, PESTLE & SWOT Analysis: Submitted To: Jeta Majumder Assistant Professor
6 pages
Komatsu fb20m-3
No ratings yet
Komatsu fb20m-3
6 pages
Customer Marketing Analysis 1738244935
No ratings yet
Customer Marketing Analysis 1738244935
42 pages
Multiplexer and Demultiplexer
No ratings yet
Multiplexer and Demultiplexer
11 pages
Screenshot 2024-08-11 at 10.14.30 AM
No ratings yet
Screenshot 2024-08-11 at 10.14.30 AM
28 pages
Old Gcse Coursework Tasks
100% (2)
Old Gcse Coursework Tasks
8 pages
Aedt 04 02 2023
No ratings yet
Aedt 04 02 2023
15 pages
Improving Bug Detection Via Context-Based Code Rep
No ratings yet
Improving Bug Detection Via Context-Based Code Rep
30 pages
Answer Key: A. B. C. A
No ratings yet
Answer Key: A. B. C. A
7 pages
E-Commerce CH 3 (Part 3)
No ratings yet
E-Commerce CH 3 (Part 3)
33 pages
Viplav Awasthi-DataScientist
No ratings yet
Viplav Awasthi-DataScientist
6 pages
Asd PPT
No ratings yet
Asd PPT
36 pages
Mastering Pandas With 103 Practical Questions and Solution 1731584558
No ratings yet
Mastering Pandas With 103 Practical Questions and Solution 1731584558
48 pages
1 PR3 Chap 1 2 3
No ratings yet
1 PR3 Chap 1 2 3
22 pages
Collaborative Desktop Publishing
No ratings yet
Collaborative Desktop Publishing
18 pages
Poly Studio p21 Data Sheet
No ratings yet
Poly Studio p21 Data Sheet
3 pages
DocScanner 15-Nov-2024 07-36 PM
No ratings yet
DocScanner 15-Nov-2024 07-36 PM
7 pages
Atterburg Limits Tests: Liquid Limit L.L T - 89
No ratings yet
Atterburg Limits Tests: Liquid Limit L.L T - 89
7 pages
Package Aware R
No ratings yet
Package Aware R
98 pages
DWT-PCA (EVD) Based Copy-Move Image Forgery Detection
No ratings yet
DWT-PCA (EVD) Based Copy-Move Image Forgery Detection
8 pages
The Ultimate Endo Farming Guide
No ratings yet
The Ultimate Endo Farming Guide
39 pages
Catelog Arista Vault Latest
No ratings yet
Catelog Arista Vault Latest
39 pages
SSRN Id4032020
No ratings yet
SSRN Id4032020
27 pages
Intelligent Algorithmic Trading Strategy Using Reinforcement Learning and Directional Change
No ratings yet
Intelligent Algorithmic Trading Strategy Using Reinforcement Learning and Directional Change
13 pages
Lecture 13: Locks: Mythili Vutukuru IIT Bombay
No ratings yet
Lecture 13: Locks: Mythili Vutukuru IIT Bombay
12 pages
Fortnite Review Academic English - Winchell Cijntje
No ratings yet
Fortnite Review Academic English - Winchell Cijntje
2 pages
Logistic Binary Classification
No ratings yet
Logistic Binary Classification
3 pages
Solution of Homework#2
No ratings yet
Solution of Homework#2
8 pages

Computer Vision-Based Assessment of Autistic Children Analyzing Interactions Emotions Human Pose and

Uploaded by

Computer Vision-Based Assessment of Autistic Children Analyzing Interactions Emotions Human Pose and

Uploaded by

Received 4 January 2023, accepted 25 March 2023, date of publication 20 April 2023, date of current version 22 May 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3269027

Computer Vision-Based Assessment of Autistic

TANU WADHERA 3 , DIPTANSHU DAS 4 , DEBASIS PANIGRAHI5 ,

Corresponding author: Varun Ganjigunte Prakash ([email protected])

I. INTRODUCTION difficulty in establishing friends, poor social communication

47908 VOLUME 11, 2023

VOLUME 11, 2023 47909

47910 VOLUME 11, 2023

Researchers have developed neural network architectures A. PROBLEM FORMULATION

VOLUME 11, 2023 47911

The JA assessment report comprises marked times- B. ACTIVITY COMPREHENSION MODEL

47912 VOLUME 11, 2023

VOLUME 11, 2023 47913

TABLE 2. Data details of models for computer vision-based assessment.

TABLE 3. Child detector model hyperparameters and results.

non-child detection boxes belonging to the play partner. With

3) ABA VIDEO ASSESSMENT

47914 VOLUME 11, 2023

FIGURE 3. Session monitoring and assessment with activity comprehension model.

FIGURE 4. Implementation of spatio-temporal activity comprehension model for ABA assessment.

VOLUME 11, 2023 47915

FIGURE 5. Scatter plot of a child indicating hitting and running behavior.

47916 VOLUME 11, 2023

FIGURE 7. Assessment output of child and play partner interactions.

VOLUME 11, 2023 47917

gaze by turning their head and trying to look in the same

1) JOINT ATTENTION — FOLLOW GAZE

47918 VOLUME 11, 2023

track a human rotating through a complete revolution of yaw

VOLUME 11, 2023 47919

C. ACTIVITY COMPREHENSION MODEL

47920 VOLUME 11, 2023

YouTube and Vimeo video search engines. Unlike the previ-

considered correct if the class prediction is similar to the

VOLUME 11, 2023 47921

47922 VOLUME 11, 2023

VOLUME 11, 2023 47923

47924 VOLUME 11, 2023

DATA AVAILABILITY STATEMENT

VOLUME 11, 2023 47925

47926 VOLUME 11, 2023

VOLUME 11, 2023 47927

47928 VOLUME 11, 2023

JOHN VIJAY SAGAR KOMMU received the

VOLUME 11, 2023 47929

You might also like