Ultimate Final Report Phase 2 Sem 7
A Report on
“Analysis and Detection of Autism Spectrum Disorder Using Machine
Learning”
Submitted to RVITM Affiliated to Visvesvaraya Technological University (VTU Belagavi) in partial
fulfillment of the requirements for the award of degree of
BACHELOR OF ENGINEERING
in
ELECTRONICS AND COMMUNICATION ENGINEERING
By
Project Team No. PT15
RV Educational Institutions
RV Institute of Technology and Management, Bengaluru
Department of Electronics and Communication Engineering
2023-24
RV INSTITUTE OF TECHNOLOGY AND MANAGEMENT®
(Affiliated to Visvesvaraya Technological University, Belagavi & Approved by AICTE, New Delhi)
Bengaluru-560076
DEPARTMENT OF
ELECTRONICS AND COMMUNICATION ENGINEERING
CERTIFICATE
Certified that the project work titled “ANALYSIS AND DETECTION OF AUTISM
SPECTRUM DISORDER USING MACHINE LEARNING” is carried out by
AKSHAT GUPTA(1RF20EC003), ARCHITH P(1RF20EC011), MOHAMMED
NADEEM(1RF20EC028), and VIKRANT RANA(1RF20EC053), who are
bonafide students of RV Institute of Technology and Management, Bangalore, in partial
fulfillment for the award of degree of Bachelor of Engineering in Electronics and
Communication Engineering of the Visvesvaraya Technological University, Belagavi during
the year 2023-2024. It is certified that all corrections/suggestions indicated for the internal
Assessment have been incorporated in the report deposited in the departmental library. The
project report has been approved as it satisfies the academic requirements in respect of project
work prescribed by the institution for the said degree.
External Viva
DECLARATION
3. MOHAMMED NADEEM
4. VIKRANT RANA
ACKNOWLEDGEMENT
We would like to thank our Project Guide, Dr. Vikash Kumar, Assistant Professor,
Department of Electronics and Communication Engineering, RV Institute of
Technology and Management, Bengaluru, for his constant guidance and inputs.
We would like to thank all the Teaching and Non-Teaching Staff for their
cooperation.
Finally, we extend our heartfelt gratitude to our family for their encouragement and
support without which we wouldn’t have come so far. Moreover, we thank all our
friends for their invaluable support and cooperation.
AKSHAT GUPTA-1RF20EC003
ARCHITH P-1RF20EC011
MOHAMMED NADEEM-1RF20EC028
VIKRANT RANA-1RF20EC053
ABSTRACT
The proposed workflow involves pre-processing of the data, training and testing with various ML
models, namely Decision Tree (DT), K-Nearest Neighbour (KNN), Naïve Bayes (NB), Random Forest
(RF), Logistic Regression (LR) and Support Vector Classifier (SVC), followed by comparison of the
results and prediction of ASD. The proposed method is evaluated on a publicly available dataset. The
dataset is collected based on the evaluation of 31 attributes that are found to be common in patients
suffering from ASD. Data pre-processing transforms the raw data into a meaningful and
understandable format. The pre-processed data is then used to train the various ML models, and the
models are evaluated on metrics such as sensitivity, specificity and accuracy.
The proposed project aims to provide a significant step toward advancing the early detection and
assessment of ASD. The manual process of ASD diagnosis is unreliable due to the unavailability of
resources and expert opinion. Therefore, computerized diagnostic systems that use machine learning
architectures are proposed to learn the patterns in the provided data and to identify the severity of the
disease. The proposed ML model can achieve high performance on ASD detection compared with the
conventional approach.
MOTIVATION
The motivation for using machine learning algorithms to analyze and detect ASD is driven by several
important factors:
1. Early Intervention: Early detection of ASD is crucial for providing effective interventions and
support to people with ASD. The earlier the patient is diagnosed, the more effective
interventions can be, which can significantly improve their long-term outcomes.
2. Diagnostic Challenges: ASD is a complex neurodevelopmental disorder with a wide range of
symptoms and varying degrees of severity. Diagnosing ASD based solely on clinical observation can be
challenging and time-consuming, and there is a need for more objective and accurate diagnostic tools.
3. Reduction of Healthcare Costs: Early and accurate diagnosis of ASD can lead to cost savings in
healthcare. It can help avoid misdiagnoses, unnecessary tests, and delays in accessing appropriate
interventions, thereby reducing the overall burden on healthcare systems.
4. Research Advancements: Machine learning can assist researchers in uncovering the underlying
mechanisms and causes of ASD by analyzing diverse datasets. This can lead to a deeper understanding
of the disorder and potentially the development of more targeted treatments.
5. Remote Screening: Machine learning models can be applied to remote screening, allowing for the
assessment of ASD risk factors and symptoms in individuals who may not have easy access to
specialized clinical facilities.
The motivation for using machine learning in the analysis and detection of Autism Spectrum Disorder
is driven by the potential to improve early diagnosis, reduce healthcare costs, advance research, and
ultimately enhance the quality of life for individuals with ASD and their families. Machine learning
offers the promise of more objective and data-driven approaches to understanding and addressing this
complex neurodevelopmental disorder.
LITERATURE REVIEW
Title: Muhammad Shuaib Qureshi et al. “Prediction and Analysis of Autism Spectrum Disorder Using
Machine Learning Techniques” (2023).
Feature: Support Vector Machine (SVM). SVM is a supervised classification technique that uses a line
to distinguish between two separate groups.
• SVM works relatively well when there is a clear margin of separation between classes.
• SVM is more effective in high dimensional spaces.
• SVM is effective in cases where the number of dimensions is greater than the number of samples.
• SVM is relatively memory efficient.
Limitation: The hyperplane dimension must be altered from one to the Nth dimension in this scenario,
which is called the Kernel. The SVM algorithm is not suitable for large data sets. SVM does not
perform very well when the data set has more noise, i.e. when the target classes are overlapping.

• The disadvantage of DT, model overfitting, can be overcome with Random Forest.
• Using voting, the best scored tree will be selected from the forest randomly on subtrees.
Title: Shirajul Islam et al. “Autism Spectrum Disorder Detection in Toddlers for Early Diagnosis Using
Machine Learning” (2021).
Feature: Logistic Regression (LR). Logistic Regression’s primary aim is to find the model with the
best fit that describes the relationship between the binomial character of interest and a set of
independent variables. It makes use of a logistic function to find an optimal curve to fit the data points.
• It makes no assumptions about distributions of classes in feature space.
• It can easily extend to multiple classes (multinomial regression) and a natural probabilistic view of
class predictions.
Limitation: If the number of observations is less than the number of features, Logistic Regression
should not be used; otherwise, it may lead to overfitting. It can only be used to predict discrete
functions.

Title: Sushama Rani Dutta et al. “A Machine Learning-based Method for Autism Diagnosis Assistance
in Children” (2021).
Feature: Naive Bayes (NB). Based around conditional probability (Bayes theorem) and counting, the
name “naïve” comes from its assumption of conditional independence of all input features. If this
assumption is considered true.
Limitation: NB only works well with a limited number of features. Moreover, there is a high bias when
there is a small amount of data.
Title: Haibin Cai, Yinfeng Fang, Zhaojie Ju et al. “Sensing-enhanced Therapy System for Assessing
Children with Autism Spectrum Disorders: A Feasibility Study” (2020).
Features: Decision trees can provide information about the importance of different features
(questions, variables, or symptoms) in the classification process. This can be valuable for
understanding which factors contribute most to autism detection.
Limitation:
Overfitting: Decision trees are prone to overfitting.
Instability: Small changes in the data can lead to different tree structures, making the model unstable.
PROBLEM STATEMENT
Autism spectrum disorder (ASD) is a disorder in which patients have difficulty expressing themselves
and interacting with others. It is a growing concern: about one in 59 children has been identified as
having ASD, and according to recent reports about 20 million people in India are diagnosed with
autism. ASD begins in childhood, but its symptoms are sometimes detected only in adulthood. As a
result, many children do not receive proper treatment at an early age, which complicates their health
further. Research shows that a diagnosis of autism at an earlier age can be more reliable and stable.
Therefore, the proposed project aims to detect ASD as early as possible, improve on the accuracy of
previous research and reduce medical costs.
Early detection and treatment are the most important steps for reducing the symptoms of ASD and
improving the quality of life of people with ASD. However, there is no medical test for detecting
autism; ASD symptoms are usually recognized by observation. Although human genes are assumed to
be responsible, scientists have not yet identified the exact causes of ASD. Genes are thought to affect
development in interaction with the environment.
1. Accurate Detection: Develop machine learning models that can accurately detect ASD from the
collected data, with a focus on achieving high sensitivity and specificity.
2. Early Detection: If applicable, design models that can identify signs of ASD in early childhood to
facilitate early intervention and support.
3. Interpretability: Ensure that the machine learning models provide interpretable results, enabling
healthcare professionals to understand the basis for ASD diagnosis.
4. Reduced Misdiagnosis: Minimize the risk of misdiagnosis and improve the reliability of ASD
diagnosis compared to traditional assessment methods.
METHODOLOGY
The proposed workflow, shown in Fig 1, involves pre-processing of the data, training and testing with
the specified models, evaluation of the results and prediction of ASD.
PREPROCESSING
Data pre-processing is a technique that transforms raw data into a meaningful and understandable
format. A dataset can contain a large number of irrelevant and missing components, and well
pre-processed data generally yields better results. Various data pre-processing methods are used to
handle incomplete and inconsistent data, such as handling missing values, outlier detection, data
discretization and data reduction (dimension and numerosity reduction).
The dataset is saved in .csv format, a file type for storing tabular data such as spreadsheets. Because
the dataset contains several attributes in text format, the model becomes unduly complex. Therefore,
all of the attributes are converted to numeric values in order to decrease the training period and
enhance the model's performance.
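The conversion of text attributes to numeric values can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual pre-processing code; the sample rows and the column names (jaundice, who_completed, class) are assumptions made for the example.

```python
import csv
import io

# Hypothetical sample rows; the column names are assumptions for illustration.
RAW_CSV = """A1,A2,jaundice,who_completed,class
1,0,yes,Parent,YES
0,0,no,Self,NO
1,1,no,Relative,YES
"""

def encode_rows(rows):
    """Convert text attributes to integers; returns (encoded rows, category codes)."""
    yes_no = {"yes": 1, "no": 0}
    codes = {}  # column name -> {category label: integer code}
    encoded = []
    for row in rows:
        out = {}
        for name, value in row.items():
            text = value.strip()
            if text.lower() in yes_no:        # Boolean text -> 0/1
                out[name] = yes_no[text.lower()]
            elif text.lstrip("-").isdigit():  # already an integer
                out[name] = int(text)
            else:                             # other text -> integer code per category
                mapping = codes.setdefault(name, {})
                out[name] = mapping.setdefault(text, len(mapping))
        encoded.append(out)
    return encoded, codes

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
encoded, codes = encode_rows(rows)
```

Here every yes/no attribute becomes 1/0 and each remaining text category receives an integer code in order of first appearance; other encodings (for example one-hot encoding) are equally possible.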
CLASSIFICATION
After the data has been split, it is classified using a variety of algorithms: the Logistic Regression
Algorithm, the Support Vector Classifier Algorithm, the Naive Bayes Algorithm, the Decision Tree
Algorithm, the K-Nearest Neighbor Algorithm and the Random Forest Algorithm. Each algorithm
yields different metrics.
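The split into training and testing data mentioned above can be sketched as below. The report does not state the split ratio, so the 80/20 ratio, the fixed seed and the toy data are assumptions for illustration only.

```python
import random

def train_test_split(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle the row indices with a fixed seed and split them into train and test parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(samples, train_idx), pick(samples, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy data (hypothetical): ten rows whose label equals the second feature.
X = [[i, i % 2] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

Each classifier is then fitted on X_train and y_train and evaluated on the held-out X_test and y_test, so that the reported metrics reflect unseen data.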
ALGORITHM
DATASET
The dataset considers 31 attributes based on which the models are trained. The attributes are:
Attribute: Born with jaundice
Type: Boolean (yes or no)
Description: Whether the case was born with jaundice

Attribute: Who is completing the test (User)
Type: String
Description: Parent, relative, self

Attribute: Question 3 (A3)
Type: Binary (0, 1)
Description:
Child, Adolescent: In a social group, s/he can easily keep track of several different people's
conversations.
Adult: I find it easy to do more than one thing at once.
Toddler: Does your child point to indicate that s/he wants something? (e.g. a toy that is out of reach)

Attribute: Question 4 (A4)
Type: Binary (0, 1)
Description:
Child, Adolescent: S/he finds it easy to go back and forth between different activities.
Adult: If there is an interruption, s/he can switch back to what s/he was doing very quickly.
Toddler: Does your child point to share interest with you? (e.g. pointing at an interesting sight)

Attribute: Question 5 (A5)
Type: Binary (0, 1)
Description:
Child, Adolescent: S/he does not know how to keep a conversation going with his/her peers.
Adult: I find it easy to read between the lines when someone is talking to me.
Toddler: Does your child pretend? (e.g. care for dolls, talk on a toy phone)

Attribute: Question 6 (A6)
Type: Binary (0, 1)
Description:
Child, Adolescent: S/he is good at social chit-chat.
Adult: I know how to tell if someone listening to me is getting bored.
Toddler: Does your child follow where you are looking?

Attribute: Question 7 (A7)
Type: Binary (0, 1)
Description:
Child: When s/he is read a story, s/he finds it difficult to work out the characters' intentions or
feelings.
Adolescent: When s/he was younger, s/he used to enjoy playing games involving pretending with
other children.
Adult: When I am reading a story, I find it difficult to work out the characters' intentions.
Toddler: If you or someone else in the family is visibly upset, does your child show signs of wanting
to comfort them? (e.g. stroking hair, hugging them)

Attribute: Question 8 (A8)
Type: Binary (0, 1)
Description:
Child: When s/he was in preschool, s/he used to enjoy playing games involving pretending with other
children.
Adolescent: S/he finds it difficult to imagine what it would be like to be someone else.
Adult: I like to collect information about categories of things (e.g. types of car, types of bird, types
of train, types of plant, etc.).
Toddler: Would you describe your child's first words as:

Attribute: Question 9 (A9)
Type: Binary (0, 1)
Description:
Child: S/he finds it easy to work out what someone is thinking or feeling just by looking at their face.
Adolescent: S/he finds social situations easy.
Adult: I find it easy to work out what someone is thinking or feeling just by looking at their face.
Toddler: Does your child use simple gestures? (e.g. wave goodbye)

Attribute: Question 10 (A10)
Type: Binary (0, 1)
Description:
Child, Adolescent: S/he finds it hard to make new friends.
Adult: I find it difficult to work out people's intentions.
Toddler: Does your child stare at nothing with no apparent purpose?
FLOWCHART
PLATFORM/SOFTWARE USED
JUPYTER NOTEBOOK
Jupyter Notebook stands as a versatile, interactive platform that seamlessly integrates live code,
visualizations, and explanatory text within a single document. This open-source web-based tool
facilitates data analysis, scientific exploration, and machine learning by enabling users to combine
executable code cells with markdown-based text, allowing for an intuitive blend of computation and
storytelling. Its adaptable environment supports various programming languages, fostering
collaborative work and the creation of comprehensive reports that showcase code, results, and
descriptive insights all in one accessible space. Jupyter Notebook lets anyone create and execute
arbitrary Python code in the browser. It is ideal for machine learning, data analysis, and education. A
notebook is saved with an .ipynb extension.
PYTHON 3.10.0
Python libraries and frameworks offer a reliable environment which reduces software development
time significantly. Python is consistent, simple, flexible and platform independent, and has a wide
community, which makes it well suited to machine learning. Python offers machine learning and
numerical libraries such as PyBrain, TensorFlow, Keras and NumPy, which provide many algorithms
for machine learning tasks.
MATLAB
MATLAB is a programming and numeric computing environment used by millions of engineers and
scientists to analyze data, develop algorithms, and create models. MATLAB provides professionally
developed toolboxes for signal and image processing, control systems, wireless communications,
computational finance, robotics, deep learning and AI and more. MATLAB combines a desktop
environment tuned for iterative analysis and design processes with a high-level programming
language. It includes the Live Editor for creating scripts that combine code, output, and formatted text
in an executable notebook. Prebuilt apps allow you to interactively perform iterative tasks.
DESIGN AND DEVELOPMENT
Decision Tree Algorithm:
fitctree(X,Y) returns a fitted binary classification decision tree based on the input variables contained in
matrix X and output Y. The returned binary tree splits branching nodes based on the values of a column
of X.
KNN Algorithm:
for k = 1:numNeighbors
    model = fitcknn(dataTrain(:, 1:end-1), dataTrain.category_encoded, 'NumNeighbors', k);
end
fitcknn returns a k-nearest neighbor classification model based on the predictor data X and response Y.
numNeighbors holds the largest number of neighbors to be tried, and k holds the number of neighbors
considered in each iteration, which is passed to fitcknn through the 'NumNeighbors' argument.
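The same idea of sweeping the number of neighbors can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not the project's MATLAB code: the toy data, the squared Euclidean distance and the function names knn_predict and accuracy are all hypothetical.

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict the majority label among the k training points nearest to query."""
    by_distance = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    votes = Counter(train_y[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_X)

# Toy binary-answer data (hypothetical, not the report's dataset).
train_X = [[0, 0, 0], [0, 0, 1], [0, 1, 0], [1, 1, 1], [1, 1, 0], [1, 0, 1]]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [[0, 0, 0], [1, 1, 1], [1, 0, 0]]
test_y = [0, 1, 0]

num_neighbors = 5  # largest k to try, mirroring numNeighbors in the loop above
scores = {k: accuracy(train_X, train_y, test_X, test_y, k)
          for k in range(1, num_neighbors + 1)}
```

The dictionary scores maps each k to the test accuracy obtained with that many neighbors, which is the quantity plotted when accuracy is shown against the number of neighbors.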
Naïve Bayes Algorithm:
fitcnb(X,Y) returns a multiclass naive Bayes model (Mdl), trained by predictors X and class
labels Y. Additional options can be specified by one or more Name,Value pair arguments, using any of
the previous syntaxes. For example, you can specify a distribution to model the data, prior probabilities
for the classes, or the kernel smoothing window bandwidth.
Random Forest Algorithm:
fitensemble(Tbl,ResponseVarName,Method,NLearn,Learners) returns a trained ensemble model object
that contains the results of fitting an ensemble of NLearn classification or regression learners
(Learners) to all variables in the table Tbl. ResponseVarName is the name of the response variable
in Tbl. Method is the ensemble-aggregation method.
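from sklearn import metrics

# Note: roc_auc_score reports the area under the ROC curve, which is printed
# here under the "Accuracy" labels used in the report.
print(f'{model} : ')
print('Training Accuracy : ', metrics.roc_auc_score(Y, model.predict(X)))
print('Validation Accuracy : ', metrics.roc_auc_score(Y_val, model.predict(X_val)))
print()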
The confusion matrices of all the machine learning algorithms are shown below.
CONFUSION MATRICES
TRAINING ACCURACY AND TESTING ACCURACY PLOTS
By displaying the true and false predictions for each class, the confusion matrix goes beyond
classification accuracy. For a binary classification task, the confusion matrix is a 2x2 matrix with the
following entries:
True Positive (TP): the number of cases in which both the predicted and the actual values are true.
True Negative (TN): the number of cases in which both the predicted and the actual values are false.
False Positive (FP): the number of cases in which the prediction is true while the actual value is false.
False Negative (FN): the number of cases in which the prediction is false while the actual value is true.
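As a small worked example, the four counts and the three metrics used in this report (sensitivity, specificity and accuracy) can be computed as follows. The label vectors are made up for illustration and are not results from the project; the function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p != positive and a != positive:
            tn += 1
        elif p == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def summary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy (assumes both classes are present)."""
    return {
        "sensitivity": tp / (tp + fn),   # share of actual positives found
        "specificity": tn / (tn + fp),   # share of actual negatives found
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Made-up labels: 1 = ASD, 0 = no ASD.
actual    = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

counts = confusion_counts(actual, predicted)
result = summary_metrics(*counts)
```

For these vectors the counts are TP = 5, TN = 3, FP = 1 and FN = 1, giving a sensitivity of 5/6, a specificity of 0.75 and an accuracy of 0.8.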
The figures Fig.2, Fig.3, Fig.4, Fig.5, Fig.6 and Fig.7 represent the confusion matrices of the KNN,
Naïve Bayes, Decision Tree, Random Forest, Support Vector Classifier and Logistic Regression
algorithms respectively.
The figures Fig.8, Fig.9 and Fig.11 illustrate the model accuracy on training and testing data.
These figures serve as crucial diagnostic tools, encapsulating the model's learning progress and
generalization capabilities. These visualizations offer insights into overfitting, hyperparameter
optimization, and the generalizing capabilities of the models by showcasing the evolution of
performance metrics across iterations or parameter variations. They enable us to assess model behavior,
detect issues like bias or variance and to compare different models, and communicate succinct
summaries of model performance, guiding the refinement and selection of optimal machine learning
models.
Each graph has two curves: the training-set accuracy, shown in blue, and the test-set accuracy, shown in
orange. Both curves rise as the swept parameter increases, which indicates that the accuracy of the
models improves with it. The swept parameter is the number of neighbors in the KNN algorithm, the
number of features in the Naïve Bayes algorithm and the number of trees in the Random Forest
algorithm.
Accuracy compares the model predictions with the actual values and expresses the share of correct
predictions as a percentage, which shows how well the model predicts. Fig.10 and Fig.12 represent
the model accuracy on training and testing data for the Decision Tree, Logistic Regression and
Support Vector Classifier algorithms.
Fig.13 Importance of each feature in the constructed Decision Tree
Fig.13 illustrates the importance of each feature. The x-axis corresponds to the features, and the y-axis
shows their importance estimates. Feature importance is calculated based on how often a feature is used
for splitting nodes in the tree: the more frequently a feature is chosen for splitting nodes, the higher its
importance is considered. Such estimates are often computed after training an ensemble or tree-based
model (in this case a Decision Tree) to understand which features contribute the most to the model's
predictive performance.
Fig.14 : Tree structure obtained using Decision Tree Algorithm based on the above features
Fig.15 Importance of each feature in the Random Forest Algorithm
Random Forests explore more features compared to individual decision trees due to their ensemble
design. By building numerous trees on bootstrapped subsets of the data while considering a different
random subset of features for each tree, Random Forests encourage diversity among the trees. This
diversity promotes a broader exploration of the feature space, allowing the ensemble to capture a richer
representation of the relationships within the data. This approach often leads to improved
generalization, robustness against overfitting, and better overall performance compared to a single
decision tree.
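The ensemble design described above, with bootstrapped rows, a random feature subset per tree and a vote over the trees, can be sketched as follows. The sketch only prepares the sampling plan and the vote; it does not train real trees, and the sizes, the seed and the function names are assumptions for illustration.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Sample n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, subset_size, rng):
    """Pick a random subset of feature indices without replacement."""
    return sorted(rng.sample(range(n_features), subset_size))

def majority_vote(predictions):
    """Combine per-tree class predictions into one ensemble prediction."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)         # fixed seed, chosen only for repeatability
n_samples, n_features = 8, 10   # hypothetical dataset size
plans = [
    (bootstrap_indices(n_samples, rng), random_feature_subset(n_features, 3, rng))
    for _ in range(5)           # five trees in this toy forest
]

# Hypothetical predictions of the five trees for one test case.
ensemble_label = majority_vote([1, 0, 1, 1, 0])
```

Each entry of plans holds the rows and the features that one tree would be trained on, so different trees see different views of the data; the final label is the class predicted by most trees.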
Some of the Decision Trees are shown below:
Fig.16 Few Tree Structures obtained using Random Forest Algorithm
ACCURACY SCORES
Fig.17 Accuracy of K Nearest Neighbour
Model                           Sensitivity   Specificity   Accuracy
1. K-Nearest Neighbor           0.9696        0.9000        0.9505
2. Naïve Bayes                  0.9805        0.9130        0.9597
3. Decision Tree                0.8266        0.8490        0.8325
4. Logistic Regression          0.9345        0.6415        0.8375
5. Support Vector Classifier    0.9224        0.7272        0.8687
6. Random Forest                0.9568        1.0000        0.9670
CONCLUSION AND FUTURE SCOPE
Bibliography/Reference
[1] Benjamin Gesundheit and Joshua P. Rosenzweig, “Editorial: Autism Spectrum Disorders (ASD) -
Searching for the Biological Basis for Behavioral Symptoms and New Therapeutic Targets”, published
online January 2023.
[2] Arodami Chorianopoulou, Efthymios Tzinis, Elias Iosif, Asimenia Papoulidi, Christina Papailiou,
Alexandros Potamianos, “Engagement detection for children with autism spectrum disorder”, 2023.
[3] Siriwan Sunsirikul and Tiranee Achalakul, “Associative Classification Mining in the Behavior
Study of Autism Spectrum Disorder”, vol.3, 2022.
[4] Beibin Li, Sachin Mehta, Deepali Aneja, Claire Foster, Pamela Ventola, Frederick Shic, Linda
Shapiro, “A Facial Affect Analysis System for Autism Spectrum Disorder”, 2022.
[5] Pratibha Vellanki, Thi Duong, Svetha Venkatesh, Dinh Phung, “Nonparametric Discovery of
Learning Patterns and Autism Subgroups from Therapeutic Data”, 2022.
[6] Paul Fergus, Basma Abdulaimma, Chris Carter, Sheena Round, “Interactive Mobile Technology for
Children with Autism Spectrum Condition (ASC)”, 2021.
[8] Daiki Mitsumoto, Takeshi Hori, Shigeki Sagayama, Hidenori Yamasue, Keiho Owada, Masaki
Kojima, Keiko Ochi, Nobutaka Ono, “Autism Spectrum Disorder Discrimination Based on Voice
Activities Related to Fillers and Laughter”, 2021.
[9] Tarannum Zaki, Muhammad Nazrul Islam, Md. Sami Uddin, Sanjida Nasreen Tumpa, Md. Jubair
Hossain, Maksuda Rahman Anti, Md. Mahedi Hasan, “Towards Developing a Learning Tool for
Children with Autism”, 2021.
[10] Ardiana Sula, Evjola Spaho, Keita Matsuo, Leonard Barolli, Rozeta Miho and Fatos Xhafa, “An
IoT-based System for Supporting Children with Autism Spectrum Disorder”, 2020.
[11] Haibin Cai, Yinfeng Fang, Zhaojie Ju, Cristina Costescu, Daniel David, Erik Billing, Tom Ziemke,
Serge Thill, Tony Belpaeme, Bram Vanderborght, David Vernon, Kathleen Richardson and Honghai
Liu, “Sensing-enhanced Therapy System for Assessing Children with Autism Spectrum Disorders: A
Feasibility Study”, 2020.
[12] Akshay Vijayan, S Janmasree, C Keerthana, L Baby Syla, “A Framework for Intelligent Learning
Assistant Platform Based on Cognitive Computing for Children with Autism Spectrum Disorder”, July
2019.
[13] Sushama Rani Dutta, Sujoy Datta, Monideepa Roy, “Using Cogency and Machine Learning for
Autism Detection from a Preliminary Symptom”, July 2019.
[14] Che Zawiyah Che Hasan, Rozita Jailani and Nooritawati Md Tahir, “ANN and SVM Classifiers in
Identifying Autism Spectrum Disorder Gait Based on Three-Dimensional Ground Reaction Forces”,
October 2019.
[15] D. P. Wall, R. Dally, R. Luyster, J.-Y. Jung, and T. F. DeLuca, “Use of artificial intelligence to
shorten the behavioral diagnosis of autism,” PLoS ONE, vol. 7, no. 8, p. e43855, 2012.
[17] J. Kosmicki, V. Sochat, M. Duda, and D. Wall, “Searching for a minimal set of behaviors for
autism detection through feature selection-based machine learning,” Translational Psychiatry, vol. 5,
no. 2, p. e514, 2015.
[18] W. Liu, M. Li, and L. Yi, “Identifying children with autism spectrum disorder based on their face
processing abnormality: A machine learning framework,” Autism Research, vol. 9, no. 8, pp. 888–898,
2016.
[19] Kazi Shahrukh Omar, Prodipta Mondal, Nabila Shahnaz Khan, “A Machine Learning Approach to
Predict Autism Spectrum Disorder”, 7-9 February, 2019.
[20] Akter, Tania, Md Shahriare Satu, Md Imran Khan, Mohammad Hanif Ali, Shahadat Uddin, Pietro
Lio, Julian MW Quinn, and Mohammad Ali Moni. "Machine learning-based models for early stage
detection of autism spectrum disorders." 2019.