
Using Supervised Machine Learning Algorithms for Autism

Spectrum Disorder Recognition

DISSERTATION

Submitted in partial fulfillment of the


Requirements for the award of the degree

of

Bachelor of Technology
in
Computer Science & Engineering

By:
Ananya Singh (08/CSE1/2020)
Arshleen Bhandari (12/CSE1/2020)
Harleen Kaur Aarora (43/CSE1/2020)
Himika Prabhat (52/CSE1/2020)

Under the guidance of: Dr. Aashish Bhardwaj

Department of Computer Science & Engineering


Guru Tegh Bahadur Institute of Technology

Guru Gobind Singh Indraprastha University


Dwarka, New Delhi
Year 2020-24
DECLARATION

We hereby declare that all the work presented in the dissertation entitled “Autism Spectrum
Disorder screening in adults using Supervised Machine Learning Algorithms” in the partial
fulfillment of the requirements for the award of the degree of Bachelor of Technology in
Computer Science & Engineering, Guru Tegh Bahadur Institute of Technology, affiliated to
Guru Gobind Singh Indraprastha University Delhi is an authentic record of our own work carried
out under the guidance of Dr. Aashish Bhardwaj.

Date: 08-12-2023
Ananya Singh (08/CSE1/2020)
Arshleen Bhandari (12/CSE1/2020)
Harleen Kaur Aarora (43/CSE1/2020)
Himika Prabhat (52/CSE1/2020)

CERTIFICATE

This is to certify that the dissertation entitled “Autism Spectrum Disorder screening in adults
using Machine Learning Algorithms”, which is submitted by Ms. Ananya Singh, Ms.
Arshleen Bhandari, Ms. Harleen Kaur and Ms. Himika Prabhat in partial fulfillment of the
requirements for the award of the degree of Bachelor of Technology in Computer Science &
Engineering, Guru Tegh Bahadur Institute of Technology, New Delhi is an authentic record of
the candidate’s own work carried out by them under our guidance. The matter embodied in this
thesis is original and has not been submitted for the award of any other degree.

Date: 08-12-2023

Aashish Bhardwaj
(Head of Department)
Computer Science & Engineering

ACKNOWLEDGEMENT

We express our sincere appreciation to our mentor, Dr. Aashish Bhardwaj, who serves as the
Head of the Department of Computer Science and Engineering (CSE). His guidance and
support have been invaluable throughout our dissertation process, and we are grateful for the
leadership that has played a crucial role in raising the quality of our work to its current
standard.

Without his expertise and encouragement, this achievement would not have been possible; he
has acted as a guiding light throughout the project. We are extremely grateful.

Date: 08-12-2023 Ananya Singh


(08/CSE1/2020)
[email protected]

Arshleen Bhandari
(12/CSE1/2020)
[email protected]

Harleen Kaur Aarora


(43/CSE1/2020)
[email protected]

Himika Prabhat
(52/CSE1/2020)
[email protected]

ABSTRACT

This project delves into the intricate challenge of identifying Autism Spectrum Disorder (ASD)
in adults through the strategic application of Supervised Machine Learning Algorithms. ASD,
marked by difficulties in social interaction, communication, and repetitive behaviors,
underscores the imperative for early detection to facilitate timely intervention. Employing a
diverse ensemble of algorithms, including Decision Trees, Random Forest, Support Vector
Machines, K Nearest Neighbors, Logistic Regression, Linear Discriminant Analysis, and
Quadratic Discriminant Analysis, the project aims to craft an advanced ASD screening tool with
transformative potential for diagnostic processes.

The primary goal is the construction of a screening mechanism adept at leveraging the unique
strengths of each algorithm, ultimately yielding a more accurate and robust identification model.
To achieve this, we meticulously curated a dataset that incorporates information from both ASD
and non-ASD adult subjects. This dataset serves as the training and evaluation ground for the
machine learning algorithms, ensuring a diverse and representative sample that authentically
mirrors real-world scenarios.

The project places significant emphasis on algorithmic diversity, undertaking the exploration of
various modeling approaches to attain a nuanced understanding of the intricate features
associated with ASD. Each algorithm contributes a distinct perspective, collectively enriching
the screening process. Interpretability becomes a focal point, with the overarching objective of
unraveling complex relationships between features and ASD identification.

Integral to the project are ethical considerations that underscore privacy, fairness, and the
responsible deployment of machine learning technologies within healthcare contexts. Ensuring
the ethical use of ASD screening tools is paramount not only for building public trust but also for
safeguarding the well-being of individuals undergoing assessments.

In the synthesis of the unique capabilities inherent in diverse machine learning algorithms, this
project aspires to redefine ASD screening. The anticipated outcomes hold promise for
significantly enhancing the quality of life for individuals on the autism spectrum by enabling
more accurate and efficient early detection. This aligns seamlessly with the broader objectives of
precision medicine and exemplifies the transformative potential of advanced technologies in
positively impacting the identification of neurodevelopmental disorders. In essence, this project
represents a concerted effort to seamlessly blend cutting-edge machine learning methodologies
with a compassionate approach, thereby enhancing ASD diagnostic processes and nurturing a
future where early intervention becomes not only more accessible but also more effective in
improving outcomes for those affected by ASD.
LIST OF FIGURES

S.No Figure Name Page No.


1.1 Supervised Machine Learning Algorithm 8
1.2 Random Forest Algorithm 8
1.3 Schematic diagram of SVM algorithm 9
1.4 K-Nearest Neighbors Algorithm 10
1.5 Logistic Curve 10
1.6 Multi-Layer Perceptron Learning 11
1.7 Model Complexity Graphs 19
1.8 Learning Curve 20
3.1 ER Diagram 32
3.2 Use Case Diagram 34
4.1 Support Vector Regression score 38
4.2 k Nearest Neighbours Regression score 38
4.3 Logistic Regression score 38
4.4 XGB Classifier Regression score 38
4.5 Random Forest Regression score 38
4.6 MLP Regression score 39
A-1 First five rows of the dataset 54
A-2 Value counts of unique values in the ethnicity column 54
A-3 Value counts of unique values in the relation column 55
A-4 Pie chart of the number of data points for each target 55
A-5 Plot of scores indicating the number of “yes” answers to the set of 10 questions 55
A-6 Count plot of the number of males and females 56
A-7 Count plot of the number of people from different ethnicities 56
A-8 Plot with 0 for people with autism and 1 for people without autism 57
A-9 Plots for the different countries in the dataset 57
A-10 Pair plot for A1-A10 58
A-11 Heat map for the correlation matrix of numeric columns 59
A-12 Count plot of the number of cases for each age group 59
A-13 Comparison between scores and the number of positive and negative cases 60
A-14 Normal Distribution of age values after log transformation 60
LIST OF TABLES

S.No Table Name Page No.


1.1 Attribute features and their description 13
5.1 Set of Questions 41
6.1 Results 44
CONTENTS

Chapter Page No.

Title page i
Declaration ii
Certificate iii
Acknowledgement iv
Abstract v
List of Tables and Figures vii
1. Introduction 1
1.1 Introduction to Autism Spectrum Disorder (ASD) 1
1.2 Significance of Early Detection in ASD 2
1.3 Role of Supervised Machine Learning in Healthcare 5
1.4 Libraries used 7
1.4.1 Numpy 7
1.4.2 Pandas 7
1.4.3 Matplotlib 7
1.4.4 Seaborn 7
1.4.5 SciKit Learn 7
1.5 Overview of Machine Learning Algorithms 8
1.5.1 Random Forests 8
1.5.2 Support Vector Machines (SVM) 9
1.5.3 k-Nearest Neighbors (kNN) 9
1.5.4 Logistic Regression 10
1.5.5 Multi-Layer Perceptron(MLP) 11
1.6 Crafting an Advanced ASD Screening Tool 12
1.7 Dataset Compilation and Characteristics 13
1.8 Emphasis on Algorithmic Diversity 17
1.9 Interpretability in ASD Identification 18
1.10 Ethical Considerations in Machine Learning for Healthcare 21
1.11 Anticipated Outcomes and Future Implications 23
2. Requirements Analysis with SRS 26
2.1 System Overview 26
2.1.1 System Description 26
2.1.2 System Features 26
2.2 Functional Requirements 26
2.2.1 Data Preprocessing 27
2.2.2 Algorithm Implementation 27
2.2.3 Model Evaluation 28
2.3 Non-Functional Requirements 28
2.3.1 Performance 28
2.3.2 Usability 28
2.3.3 Security 29
2.4 Constraints 29
2.4.1 Dataset Limitations 29
2.4.2 Availability of Algorithm 29
2.5 Appendix 29
2.5.1 Some References 29
2.5.2 Glossary 29
3. System Design 38
4. Test Plan
5. Body of Thesis
6. Results and Observations 61
7. Summary and Conclusions
8. Future Scope 66
References 68
Appendix A 55
Appendix B 63
Chapter One

INTRODUCTION


Autism Spectrum Disorder (ASD) is a complex neurodevelopmental condition that significantly


impacts an individual's social interaction, communication skills, and behavioral patterns. The
spectrum nature of ASD encompasses a wide range of symptoms and severity, making each
person's experience unique. Although ASD has historically been linked primarily with
childhood, it is increasingly acknowledged that it often extends into adulthood, bringing forth a
distinctive set of challenges for identification, diagnosis, and intervention.

1.1 Introduction to Autism Spectrum Disorder (ASD)

Autism Spectrum Disorder (ASD) is the name for a group of developmental disorders impacting
the nervous system. ASD symptoms range from mild to severe and mainly comprise language
impairment, challenges in social interaction, and repetitive behaviors. Many other possible
symptoms include anxiety, mood disorders, and Attention-Deficit/Hyperactivity Disorder (ADHD).
The core features of ASD include difficulties in social interaction, marked by challenges in
understanding and responding to social cues. Individuals with ASD may struggle with
establishing and maintaining relationships, interpreting nonverbal communication, and grasping
the nuances of social reciprocity. Additionally, communication difficulties are a hallmark of
ASD, with variations ranging from delayed language development to a complete absence of
spoken language. The manifestation of repetitive behaviors, such as stereotypical movements or
adherence to strict routines, further contributes to the intricate tapestry of ASD symptoms.
Recognizing the prevalence of ASD in adults has become increasingly crucial in recent years.
The traditional perception of autism as a childhood disorder has given way to a broader
understanding that acknowledges the persistence of symptoms into adulthood. The transition
from childhood to adulthood introduces new challenges in identifying and diagnosing ASD in a
population that may have developed coping mechanisms or masked their symptoms over time.
This shift in perspective underscores the critical need for effective screening tools and diagnostic
criteria tailored to the unique characteristics of adults with ASD.
Despite advancements in our comprehension of ASD, the diagnostic journey for adults
remains a complex and nuanced process. Unlike childhood diagnosis, which often involves
observations of early developmental markers, adult diagnosis relies on retrospective assessments
and an analysis of lifelong patterns of behavior. The subtleties of adult presentation necessitate a
comprehensive evaluation that considers not only the core symptoms of ASD but also the
individual's adaptive functioning and quality of life. Clinicians face the challenge of
distinguishing ASD from other mental health conditions that may share overlapping features,
further emphasizing the need for a nuanced approach to diagnosis.
The importance of timely intervention for adults with ASD cannot be overstated. Early
identification and intervention have been shown to improve outcomes and enhance the
individual's overall quality of life. However, the challenges of diagnosing ASD in adults may
lead to delayed access to appropriate support and services. This delay can impact various aspects
of an individual's life, including education, employment, and social relationships.

1.2 Significance of Early Detection in ASD

The significance of early detection in Autism Spectrum Disorder (ASD) cannot be overstated,
particularly when considering its profound implications for individuals, society, and healthcare
systems. Early identification of ASD in adults serves as a pivotal gateway to timely and targeted
interventions, laying the foundation for improved outcomes across various domains of life.
At an individual level, early detection acts as a catalyst for enhancing the quality of life for those
with ASD. The developmental trajectory of individuals with ASD is markedly influenced by
early intervention, impacting key areas such as social integration, communication skills, and
adaptive behaviors. The malleability of the human brain during early developmental stages
makes this period particularly receptive to therapeutic interventions. Therefore, identifying and
addressing ASD in its early stages can lead to more effective interventions, potentially mitigating
the impact of core symptoms and promoting the development of essential life skills.
Social integration stands out as a critical domain affected by early detection and intervention.
Individuals with ASD often encounter challenges in forming and maintaining social
relationships. Early identification allows for the implementation of targeted social skills training
and support, fostering improved social interactions and relationships. As a result, the individual's
ability to navigate social situations, understand social cues, and engage in reciprocal
communication can be significantly enhanced, positively influencing their overall well-being.
Communication skills, another core aspect of ASD, benefit immensely from early intervention.
Speech and language difficulties, ranging from delayed language acquisition to challenges in
pragmatic communication, are common among individuals with ASD. Early identification
enables the initiation of speech and language therapy tailored to the individual's specific needs,
promoting the development of effective communication strategies. This, in turn, has cascading
effects on various aspects of life, including academic achievement, vocational success, and
independent living.
Adaptive behaviors, encompassing daily living skills and functional independence, also
experience positive outcomes with early detection. Interventions targeting adaptive behaviors,
such as self-care, organization, and time management, contribute to the individual's ability to
lead a more independent and fulfilling life. Early identification allows for the implementation of
personalized strategies and support systems, empowering individuals with ASD to navigate the
challenges of daily living more effectively.
Beyond the individual realm, the societal and economic significance of early ASD detection is
substantial. A society that actively promotes early identification and intervention contributes to a
more inclusive environment. By recognizing and accommodating the diverse needs of
individuals with ASD, societal structures become more accessible, allowing for greater
participation and contribution from this population. This inclusivity has ripple effects, fostering a
society that values neurodiversity and promotes the well-being of all its members.
From an economic perspective, early detection holds the potential to reduce the long-term
burden on healthcare systems. Timely interventions that address the core symptoms of ASD may
decrease the need for extensive and costly support services in later stages of life. Additionally,
individuals who receive early intervention are more likely to develop skills that enhance their
independence, potentially reducing the demand for long-term care and support services.

1.3 Role of Supervised Machine Learning in Healthcare

The role of supervised machine learning in healthcare represents a groundbreaking frontier that
has the potential to revolutionize various facets of medical practice. The integration of machine
learning technologies into healthcare systems signifies a paradigm shift, ushering in
unprecedented opportunities for enhanced diagnosis, personalized treatment strategies, and
improved patient care. In the realm of neurodevelopmental disorders, such as Autism Spectrum
Disorder (ASD), supervised machine learning stands out as a particularly promising avenue,
offering the potential to significantly advance diagnostic accuracy and efficiency.

In the context of ASD, a multifaceted neurodevelopmental condition characterized by
challenges in social interaction, communication difficulties, and repetitive behaviors, the
traditional diagnostic process has often been intricate and time-consuming. The reliance on
clinical observations, behavioral assessments, and subjective evaluations has led to variations in
diagnostic outcomes and, at times, delayed identification. Here, supervised machine learning
brings forth a transformative approach by harnessing the power of computational algorithms to
analyze and interpret complex patterns within vast datasets.
One of the primary advantages of supervised machine learning in the context of ASD lies in
its ability to learn from existing datasets. These datasets may include a diverse range of
information, such as clinical assessments, neuroimaging data, genetic profiles, and behavioral
observations. Through a process of supervised training, machine learning algorithms can discern
intricate patterns and relationships within these datasets, ultimately creating models that can
generalize and apply their learning to new, unseen data. This capacity to recognize subtle
patterns is particularly relevant in the case of ASD, where the disorder's spectrum nature and
variability in symptom presentation pose challenges for conventional diagnostic approaches.
The application of supervised machine learning in ASD diagnosis involves training
algorithms on labeled datasets, where each instance is associated with a known outcome or
diagnosis. These algorithms learn to identify patterns indicative of ASD based on the features
present in the training data. Once the training phase is complete, the machine learning model can
be applied to new, unseen data to predict whether an individual may have ASD, providing a
valuable tool for clinicians in the diagnostic process.
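The labeled-data workflow described above can be sketched with scikit-learn on synthetic stand-in data. The dataset, features, and choice of model here are illustrative only, not the project's actual pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a labeled screening dataset:
# X holds feature vectors, y holds the known ASD / non-ASD labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Supervised training on labeled instances...
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ...then application to new, unseen data.
preds = model.predict(X_test)
print(model.score(X_test, y_test))
```

The held-out test set stands in for the "new, unseen data" of the text: the model never sees those labels during training.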
The potential benefits of incorporating supervised machine learning into ASD diagnosis are
manifold. Firstly, the accuracy of diagnoses may be significantly improved, as machine learning
models can analyze a wide array of data points simultaneously and identify subtle patterns that
may not be immediately apparent to human observers. This enhanced accuracy has the potential
to facilitate earlier and more precise identification of ASD, leading to timely interventions and
improved outcomes for individuals.
Moreover, the efficiency of the diagnostic process stands to benefit from the role of
supervised machine learning. The rapid analysis of diverse datasets allows for a streamlined and
objective assessment, reducing the time and resources traditionally required for a comprehensive
diagnosis. This efficiency is particularly critical in the context of neurodevelopmental disorders,
where early intervention has been shown to have a substantial impact on long-term outcomes.
However, it is essential to acknowledge the challenges and considerations associated with the
integration of machine learning in healthcare, including ethical concerns, data privacy, and the
interpretability of complex algorithms. Ensuring that machine learning models are transparent,
interpretable, and ethically sound is crucial for building trust among healthcare professionals and
the broader public.

Fig 1.1 Supervised Machine Learning Algorithm

1.4 Libraries used

1.4.1 Numpy
NumPy (Numerical Python) is an open source Python library that’s used in almost every field of
science and engineering. It’s the universal standard for working with numerical data in Python,
and it’s at the core of the scientific Python and PyData ecosystems. NumPy users include
everyone from beginning coders to experienced researchers doing state-of-the-art scientific and
industrial research and development. The NumPy API is used extensively in Pandas, SciPy,
Matplotlib, scikit-learn, scikit-image and most other data science and scientific Python packages.
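As a small illustration of NumPy's vectorised arithmetic, hypothetical binary answers to the ten screening questions can be summed row-wise into per-respondent scores (the values below are invented):

```python
import numpy as np

# Hypothetical binary answers (1 = "yes") of three respondents
# to the ten screening questions A1-A10.
answers = np.array([
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
])

# A vectorised row-wise sum gives each respondent's screening score.
scores = answers.sum(axis=1)
print(scores)  # [7 2 9]
```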

1.4.2 Pandas
Pandas is mainly used for data analysis and associated manipulation of tabular data in
DataFrames. Pandas allows importing data from various file formats such as comma-separated
values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel.
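A minimal sketch of this workflow, reading a toy CSV (the columns and values are illustrative, not the project's dataset) and filtering rows:

```python
import io
import pandas as pd

# A tiny CSV in the general shape of the screening data.
csv_text = """age,gender,jaundice,result
25,m,no,7
34,f,yes,2
29,f,no,9
"""
df = pd.read_csv(io.StringIO(csv_text))

# Basic inspection and filtering, the bread and butter of preprocessing.
adults_over_28 = df[df["age"] > 28]
print(adults_over_28.shape)  # (2, 4)
```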

1.4.3 Matplotlib
Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots into
applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK.
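A short example of the object-oriented API, using the non-interactive Agg backend so no GUI toolkit is required (the scores plotted are invented):

```python
import matplotlib
matplotlib.use("Agg")  # render to file without a display
import matplotlib.pyplot as plt

scores = [7, 2, 9, 5, 6]
fig, ax = plt.subplots()
ax.bar(range(len(scores)), scores)
ax.set_xlabel("respondent")
ax.set_ylabel("screening score")
fig.savefig("scores.png")
```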

1.4.4 Seaborn
Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and
integrates closely with pandas data structures. Seaborn helps you explore and understand your
data.
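For instance, a count plot over a toy DataFrame (invented values) takes a single call and works directly with pandas columns:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"gender": ["m", "f", "f", "m", "f"],
                   "score": [7, 2, 9, 5, 6]})

# One bar per category, counted automatically from the column.
ax = sns.countplot(data=df, x="gender")
ax.figure.savefig("counts.png")
```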

1.4.5 SciKit Learn


Scikit-learn, a widely-used open-source machine learning library in Python, serves as a powerful
and versatile tool for developers, researchers, and data scientists. Designed to be user-friendly
and accessible, Scikit-learn provides a comprehensive set of functionalities for various machine
learning tasks, including classification, regression, clustering, and dimensionality reduction.
Developed on the principles of simplicity and efficiency, it seamlessly integrates with popular
data science libraries such as NumPy, SciPy, and Matplotlib, fostering a cohesive ecosystem for
machine learning and data analysis. With a vast array of algorithms and tools, Scikit-learn
empowers users to efficiently implement and experiment with machine learning models, making
it an indispensable resource for both beginners and seasoned practitioners in the field of data
science and artificial intelligence.

1.5 Overview of Machine Learning Algorithms

1.5.1 Random Forests Algorithm


Random Forest grows multiple decision trees which are merged together for a more accurate
prediction.
The logic behind the Random Forest model is that multiple uncorrelated models (the
individual decision trees) perform much better as a group than they do alone. When using
Random Forest for classification, each tree gives a classification or a “vote.” The forest chooses
the classification with the majority of the “votes.” When using Random Forest for regression, the
forest picks the average of the outputs of all trees.
The key here lies in the fact that there is low (or no) correlation between the individual
models—that is, between the decision trees that make up the larger Random Forest model. While
individual decision trees may produce errors, the majority of the group will be correct, thus
moving the overall outcome in the right direction.

Fig 1.2 Random Forest Algorithm
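The voting behaviour can be inspected directly in scikit-learn by querying each fitted tree (toy data; note that scikit-learn's RandomForestClassifier actually averages per-tree class probabilities, i.e. soft voting, which usually coincides with the hard majority vote collected here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Collect each individual tree's "vote" for one sample.
sample = X[:1]
votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
majority = max(set(votes), key=votes.count)
print(majority, int(forest.predict(sample)[0]))
```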

1.5.2 Support Vector Machines (SVM)
The best way to understand the SVM algorithm is by focusing on its primary type, the SVM
classifier. The idea behind the SVM classifier is to find a hyperplane in an N-dimensional
space that divides the data points belonging to different classes. The hyperplane is chosen
based on margin: the hyperplane providing the maximum margin between the two classes is
selected. These margins are calculated using data points known as support vectors, the data
points nearest to the hyperplane, which help in orienting it.

Fig 1.3: Schematic diagram of SVM algorithm
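A minimal linear-SVM sketch on two separable 2-D clusters (toy data); the fitted classifier exposes the support vectors that orient the hyperplane:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters in 2-D.
X = np.array([[0, 0], [0.5, 0.5], [1, 0], [4, 4], [4.5, 5], [5, 4]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The points nearest the separating hyperplane are the support vectors.
print(clf.support_vectors_)
print(clf.predict([[0.2, 0.1], [4.8, 4.6]]))
```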


1.5.3 k-Nearest Neighbors (kNN)
K-nearest neighbors (KNN) is a supervised learning algorithm used for both regression and
classification. KNN predicts the class of a test point by calculating the distance between the
test point and all the training points, then selecting the K training points closest to it. The
algorithm estimates the probability of the test point belonging to each class represented among
the ‘K’ neighbors, and the class with the highest probability is selected. In the case of
regression, the prediction is the mean of the ‘K’ selected training points.

Fig 1.4 k-Nearest Neighbors algorithm
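A one-dimensional toy example makes the neighbour voting visible: a query point near the low cluster collects three class-0 neighbours, one near the high cluster collects three class-1 neighbours:

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[1], [2], [3], [10], [11], [12]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Query at 2.5: nearest neighbours are 2, 3 and 1, all class 0.
print(knn.predict([[2.5]]), knn.predict_proba([[2.5]]))
```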
1.5.4 Logistic Regression
Logistic regression models the probability of a discrete outcome given an input variable. The
most common form of logistic regression models a binary outcome: something that can take two
values such as true/false, yes/no, and so on. Multinomial logistic regression can model
scenarios where there are more than two possible discrete outcomes. Logistic regression is a
useful analysis method for classification problems, where you are trying to determine which
category a new sample fits best; ASD screening, which asks whether an individual does or does
not meet the screening threshold, is exactly such a binary classification problem.

Fig 1.5 Logistic curve
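The logistic curve can be seen through predict_proba on a toy one-dimensional dataset (invented values): predicted probabilities rise smoothly from near 0 to near 1 across the class boundary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1, 1])

lr = LogisticRegression().fit(X, y)

# predict_proba traces the logistic curve: probabilities, not hard labels.
probs = lr.predict_proba(X)[:, 1]
print(probs.round(2))
```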

1.5.5 Multi-Layer Perceptron(MLP)
A multilayer perceptron (MLP) is a class of feedforward artificial neural network. An MLP
consists of at least three layers of nodes. Except for the input nodes, each node is a neuron that
uses a nonlinear activation function. MLP utilizes a supervised learning technique called
backpropagation for training. Its multiple layers and non-linear activation distinguish MLP from
a linear perceptron. It can distinguish data that is not linearly separable. Multilayer perceptrons
are sometimes colloquially referred to as ‘vanilla’ neural networks, especially when they have a
single hidden layer.

Fig 1.6 Multi-layer perceptron learning
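XOR is the classic dataset that a linear perceptron cannot separate but an MLP can. A small sketch with scikit-learn follows; whether the tiny network fits XOR exactly can depend on the random initialisation:

```python
from sklearn.neural_network import MLPClassifier

# XOR: not linearly separable, so a hidden layer is required.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    solver="lbfgs", max_iter=5000,
                    random_state=0).fit(X, y)
print(mlp.n_layers_)   # input + one hidden + output = 3 layers
print(mlp.predict(X))
```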

1.5.6 XGB Classifier Algorithm

XGBoost, short for eXtreme Gradient Boosting, is a powerful and versatile machine learning
algorithm renowned for its efficiency and performance in predictive modeling. As a supervised
learning method, XGBoost belongs to the ensemble learning category, combining the strengths
of multiple weak learners to create a robust and accurate classifier. Widely utilized in various
domains, from finance to healthcare, XGBoost excels in handling diverse data types, providing
exceptional predictive accuracy, and effectively managing complex relationships within datasets.
With its innovative gradient boosting framework and regularization techniques, XGBoost stands
as a go-to choice for tackling classification challenges and pushing the boundaries of predictive
analytics.
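xgboost ships a scikit-learn-compatible XGBClassifier; since that package may not be installed everywhere, the underlying gradient-boosting idea is sketched here with scikit-learn's own GradientBoostingClassifier, a related but simpler implementation (toy data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# Boosting fits weak learners sequentially, each one correcting
# the errors of its predecessors.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=2).fit(X_tr, y_tr)
print(gbm.score(X_te, y_te))
```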

1.6 Crafting an Advanced ASD Screening Tool

At the core of this ambitious project lies the fundamental goal of crafting an advanced Autism
Spectrum Disorder (ASD) screening tool that transcends the capabilities of individual
algorithms. The primary objective is to orchestrate a symphony of diverse machine learning
algorithms, strategically weaving together their distinctive strengths to form a screening
mechanism that surpasses the limitations inherent in any single approach. This endeavor seeks to
redefine the landscape of ASD diagnosis, with a focus on the unique challenges presented in the
context of adult ASD.
The significance of this undertaking is underscored by the recognition that the complexity of
ASD demands a multifaceted and adaptive approach. Individual algorithms, while powerful in
their own right, may exhibit limitations in capturing the intricate nuances and variability within
ASD data. Thus, the strategic amalgamation of a variety of algorithms becomes imperative, as it
holds the promise of synergistically enhancing the accuracy, sensitivity, and specificity of the
screening tool.
In the pursuit of constructing this groundbreaking screening mechanism, each algorithm is
chosen for its specific attributes and strengths. Decision Trees, known for their ability to unravel
complex decision-making processes, contribute by capturing patterns within ASD data that might
be challenging for human observers to discern. The ensemble then incorporates Random Forest,
which aggregates multiple Decision Trees, mitigating overfitting and enhancing the model's
generalization capacity. Support Vector Machines (SVM) add a layer of sophistication, excelling
in classifying data with intricate boundaries, a characteristic particularly relevant in the diverse
symptomatology of ASD.
The inclusion of K Nearest Neighbors (KNN) introduces a localized perspective, recognizing
the potential clustering of ASD symptoms within specific groups. Gaussian Naive Bayes, with its
probabilistic nature, accommodates the high-dimensional nature of ASD data, providing a robust
framework for classification. Logistic Regression, a classic yet powerful algorithm, contributes
its simplicity and interpretability to the ensemble.
The statistical rigor brought by Linear Discriminant Analysis (LDA) and Quadratic
Discriminant Analysis (QDA) enhances the tool's understanding of the underlying data
distributions associated with ASD. This diverse ensemble, collectively guided by each
algorithm's unique strengths, forms a screening mechanism that transcends the limitations of any
single methodology.
The project's commitment to overcoming the challenges associated with adult ASD diagnosis
is evident in its emphasis on innovation and adaptability. Adult ASD presents a distinct set of
complexities compared to childhood ASD, with individuals potentially developing coping
mechanisms or masking symptoms over time. The screening tool aims to address these nuances
by integrating algorithms that can discern patterns within retrospective data, offering a
comprehensive evaluation of an individual's lifelong behavioral patterns.
As this screening tool takes shape, rigorous training and validation processes are integral
components of its development. The iterative refinement of algorithmic parameters ensures that
the ensemble performs optimally, achieving a delicate balance between sensitivity and
specificity. The overarching goal is not merely the creation of a diagnostic tool but the
establishment of a pioneering solution that revolutionizes the diagnostic landscape for adult
ASD.
The potential impact of this advanced ASD screening tool extends beyond the individual level,
resonating in healthcare systems, research, and societal understanding. By providing a more
accurate and efficient means of identifying adult ASD, the tool has the potential to reduce the
diagnostic journey's intricacies, leading to earlier interventions and improved outcomes.
Additionally, the insights gained from the screening process contribute to the growing body of
knowledge surrounding adult ASD, fostering a more nuanced understanding of this complex
neurodevelopmental condition.
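One concrete way to combine several of the algorithms named above into a single screening model is scikit-learn's VotingClassifier. The sketch below uses synthetic data and a subset of the estimators and is illustrative only, not the project's final ensemble:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

# Hard voting: each base model casts one vote per sample,
# and the ensemble reports the majority class.
ensemble = VotingClassifier(estimators=[
    ("rf", RandomForestClassifier(random_state=3)),
    ("knn", KNeighborsClassifier()),
    ("lr", LogisticRegression(max_iter=1000)),
], voting="hard").fit(X_tr, y_tr)
print(ensemble.score(X_te, y_te))
```

Switching to voting="soft" averages predicted probabilities instead, which often helps when the base models are well calibrated.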

1.7 Dataset Compilation and Characteristics

Table 1.1 Attribute features and their description


Attribute | Type | Description
Age | Number | Age in years
Gender | String | Male or female
Ethnicity | String | List of common ethnicities in text format
Born with jaundice | Boolean (yes or no) | Whether the case was born with jaundice
Family member with PDD | Boolean (yes or no) | Whether any immediate family member has a PDD
Who is completing the test | String | Parent, self, caregiver, medical staff, clinician, etc.
Country of residence | String | List of countries in text format
Used the screening app before | Boolean (yes or no) | Whether the user has used a screening app before
Screening method type | Integer (0, 1, 2, 3) | The type of screening method chosen based on age category (0 = toddler, 1 = child, 2 = adolescent, 3 = adult)
Question 1 answer | Binary (0, 1) | The answer code of the question based on the screening method used
Question 2 answer | Binary (0, 1) | The answer code of the question based on the screening method used
Question 3 answer | Binary (0, 1) | The answer code of the question based on the screening method used
Question 4 answer | Binary (0, 1) | The answer code of the question based on the screening method used
Question 5 answer | Binary (0, 1) | The answer code of the question based on the screening method used
Question 6 answer | Binary (0, 1) | The answer code of the question based on the screening method used
Question 7 answer | Binary (0, 1) | The answer code of the question based on the screening method used
Question 8 answer | Binary (0, 1) | The answer code of the question based on the screening method used
Question 9 answer | Binary (0, 1) | The answer code of the question based on the screening method used
Question 10 answer | Binary (0, 1) | The answer code of the question based on the screening method used
Screening score | Integer | The final score obtained based on the scoring algorithm of the screening method used; computed in an automated manner

At the heart of every machine learning endeavor, the foundation upon which the entire system
rests is the quality and representativeness of the dataset. In the context of this ambitious project,
the dataset compilation is a meticulous process, carefully crafted to be the linchpin that shapes
the efficacy and reliability of the developed ASD screening tool. The dataset is not merely a
collection of data points but a comprehensive assembly, drawing upon a diverse pool of
information from both individuals with Autism Spectrum Disorder (ASD) and those without,
specifically focusing on adult subjects. This strategic compilation ensures that the training and
evaluation processes of the machine learning algorithms unfold on a rich canvas, reflecting the
complexity and diversity inherent in real-world scenarios.
The assembly of this dataset is characterized by a rigorous curation process, guided by a
commitment to inclusivity and authenticity. Information is collated from a spectrum of sources,
capturing the nuances of adult ASD from various perspectives. The inclusion of both ASD and
non-ASD subjects establishes a balanced and representative foundation, mimicking the
intricacies encountered in clinical settings. This deliberate approach is crucial for the machine
learning algorithms to develop a nuanced understanding of the patterns and features associated
with ASD, enhancing their ability to discriminate between individuals with and without the
disorder.

A key consideration in dataset compilation is the incorporation of comprehensive
demographic details. These details encompass a broad spectrum, including age, gender,
socioeconomic status, and educational background. Such demographic factors play a pivotal role
in understanding the multifaceted nature of ASD, considering its potential influence on the
manifestation and presentation of symptoms. The inclusion of this demographic diversity ensures
that the developed screening tool is sensitive to variations across different subgroups,
contributing to its robustness in real-world applications.
Behavioral traits, another integral component of the dataset, offer a granular view of the
individuals under consideration. The diverse and complex nature of ASD symptoms necessitates
a detailed examination of behavioral patterns, encompassing social interactions, communication
skills, and repetitive behaviors. By incorporating this multifaceted information, the dataset
captures the heterogeneity within the ASD population, allowing the machine learning algorithms
to discern subtle variations and tailor their predictions accordingly.

Relevant clinical indicators further enrich the dataset, providing a holistic perspective on the
individuals involved. These indicators may include diagnostic assessments, medical history, and
other clinical observations. By integrating such clinically significant information, the dataset
aligns closely with the diagnostic reality faced by healthcare professionals. This alignment is
crucial for the screening tool to be not only accurate but also clinically relevant, enhancing its
utility in real-world healthcare settings.
The meticulous nature of dataset compilation is underscored by a commitment to ethical
considerations and privacy protection. Anonymization and de-identification protocols are
rigorously applied to safeguard the privacy and confidentiality of the individuals contributing to
the dataset. This ethical approach is paramount in maintaining the integrity of the project and
ensuring that the dataset is handled responsibly in accordance with privacy regulations and
standards.
As the dataset takes shape, it becomes more than a collection of variables and values; it
becomes a dynamic representation of the diverse landscape of adult ASD. The richness of the
dataset is not just in its size but in its ability to encapsulate the complexity, variability, and
individuality of each participant. This depth ensures that the machine learning algorithms, when

exposed to this diverse array of information, are equipped to generalize and adapt to new, unseen
data, a crucial aspect for the screening tool's success in real-world applications.

1.8 Emphasis on Algorithmic Diversity

In the pursuit of constructing an advanced Autism Spectrum Disorder (ASD) screening tool, this
project places a deliberate and strategic emphasis on algorithmic diversity, recognizing the
nuanced and intricate nature of ASD. The research team, comprising multidisciplinary experts,
acknowledges that ASD manifests as a complex and heterogeneous spectrum of
neurodevelopmental conditions. In response to this complexity, the project adopts a
forward-thinking approach by exploring a diverse array of modeling approaches, seeking a
comprehensive and nuanced understanding of the myriad features associated with ASD.
The decision to employ a diverse ensemble of machine learning algorithms is rooted in the
fundamental acknowledgment that no single algorithm can encapsulate the multifaceted
dimensions of ASD. Each algorithm, whether Decision Trees, Random Forest, Support Vector
Machines, K Nearest Neighbors, or Logistic Regression, brings a unique set of strengths and
perspectives to the table. This diversity of approaches is not arbitrary but a strategic choice to
enhance the screening process, ensuring a well-rounded and adaptable tool that can effectively
navigate the intricacies of ASD diagnosis.
The utilization of Decision Trees is emblematic of the project's commitment to capturing
complex decision-making processes inherent in ASD. Decision Trees dissect intricate patterns
within the dataset, offering insights into the relationships between various features and their
influence on ASD identification. This methodological choice is particularly valuable in handling
the diversity of symptoms and presentations within the ASD population.
The inclusion of Random Forest as part of the ensemble underscores the commitment to
mitigating the risk of overfitting and enhancing generalization. By aggregating the outputs of
multiple Decision Trees, Random Forest brings a robustness to the screening tool, making it
adaptable to the heterogeneity observed in ASD data. This ensemble approach acknowledges the
variability in symptom presentation and ensures that the screening tool remains accurate across a
broad spectrum of ASD manifestations.
Support Vector Machines (SVM) contribute their unique perspective by excelling in
classifying data with intricate boundaries. Given the complex and variable nature of ASD

symptoms, SVM enhances the screening tool's capacity to discern subtle patterns and boundaries
within the dataset. This methodological choice reflects a proactive approach to addressing the
challenges posed by the diversity of ASD symptoms and manifestations.
K Nearest Neighbors (KNN), operating on the principle of similarity, offers a localized
perspective to the ensemble. This algorithm excels in identifying patterns that may exist in
localized clusters within the dataset. In the context of ASD, where symptom clusters may emerge
in specific subgroups, KNN ensures that the screening tool remains sensitive to localized
variations, fostering a more nuanced understanding of ASD in diverse populations.
Logistic Regression, a classic yet powerful algorithm, adds interpretability and simplicity to
the ensemble. Its suitability for binary classification tasks aligns well with the nature of ASD
identification, providing a foundational element in the amalgamation of diverse algorithms. The
transparency and interpretability of Logistic Regression complement the complexity introduced
by other algorithms, ensuring a cohesive and understandable ensemble.
The deliberate emphasis on algorithmic diversity in this project goes beyond a mere technical
choice; it is a strategic response to the inherent challenges posed by the intricate nature of ASD.
By embracing a variety of modeling approaches, the research team aims to cast a wide net,
capturing the diverse manifestations and presentations of ASD within the dataset. This holistic
and inclusive approach enhances the screening tool's adaptability to the rich and varied landscape
of ASD, ensuring that it remains effective across a spectrum of scenarios.
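The five algorithms named above can be compared side by side on a common dataset. The sketch below does this with scikit-learn and 5-fold cross-validation; the synthetic data and default hyperparameters are placeholders, not the project's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded screening dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Mean 5-fold cross-validated accuracy for each algorithm.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
```

A loop of this shape makes it easy to swap algorithms in and out while holding the data and evaluation protocol fixed, which is the practical payoff of algorithmic diversity.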

1.9 Interpretability in ASD Identification

In the intricate landscape of Autism Spectrum Disorder (ASD) identification, the spotlight in this
project falls distinctly on interpretability. Recognizing the inherent complexity of ASD, the
research team places a significant emphasis on unraveling the intricate relationships between
features and the identification of ASD. This endeavor is not merely a scientific pursuit; it is a
pragmatic approach that acknowledges the practical implications for clinicians, stakeholders, and
individuals navigating the ASD diagnostic process. By prioritizing interpretability, the project
strives not only to contribute to the scientific understanding of ASD but also to enhance the
transparency of the screening tool, making it more accessible and comprehensible for healthcare
professionals and individuals undergoing assessments.

Interpretability, in the context of machine learning algorithms applied to ASD identification,
refers to the ability to understand and explain the decisions made by the model. The complex
interplay of features contributing to ASD can be challenging to decipher, making interpretability
a crucial aspect of ensuring that the screening tool is not only accurate but also understandable
for those relying on its outcomes.
In the pursuit of interpretability, the project acknowledges the multifaceted nature of ASD
symptoms and the need for transparency in the decision-making process of the screening tool.
Unraveling the complex relationships between various features and ASD identification becomes
a scientific imperative, providing valuable insights into the patterns and markers that contribute
to the algorithm's predictions. This scientific endeavor extends beyond the confines of
algorithmic intricacies; it aspires to enhance our broader understanding of the factors influencing
ASD identification.

Fig 1.7 Model Complexity graph

Fig 1.8 Learning curve
Practically, interpretability holds immense value for healthcare professionals who utilize the
screening tool in clinical settings. Transparent models empower clinicians to comprehend and
trust the decisions made by the algorithm, fostering a collaborative and informed approach to
diagnosis. Interpretability becomes a bridge between advanced computational methodologies and
the clinical expertise of healthcare professionals, ensuring that the screening tool aligns
seamlessly with the nuanced realities of ASD assessments.
Moreover, interpretability addresses a critical aspect of ethical considerations in healthcare.
Transparent models provide individuals undergoing assessments with a clearer understanding of
the factors influencing their diagnosis. This transparency fosters trust and confidence in the
diagnostic process, empowering individuals to actively engage in discussions about their
healthcare journey. The project's commitment to interpretability aligns with the broader
movement toward patient-centered care, where individuals are not passive recipients of
diagnoses but active participants in their healthcare decisions.
The emphasis on interpretability also extends to the broader stakeholder community, including
educators, policymakers, and researchers. Transparent models enable stakeholders to
comprehend the factors influencing ASD identification, facilitating informed decision-making
and policy development. The project's commitment to interpretability reflects an awareness of

the collaborative nature of addressing complex societal challenges such as ASD, where diverse
stakeholders play pivotal roles in shaping interventions, support systems, and public policies.
The implementation of interpretable machine learning models, such as Logistic Regression
and Decision Trees, forms a strategic component of the project's approach. These models offer
not only accurate predictions but also a clear and understandable rationale for their decisions.
Logistic Regression, with its simplicity and interpretability, provides a foundational element in
the ensemble of algorithms. Decision Trees, by nature hierarchical and structured, unravel
complex decision-making processes in a visually intuitive manner, offering insights into the
features that contribute to ASD identification.
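As a small illustration of this interpretability (the feature names and data below are invented for the sketch), scikit-learn exposes Logistic Regression coefficients directly and can print a fitted Decision Tree's rules verbatim:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: one informative feature ("screening_score") and one
# mostly irrelevant one ("age") -- both names are hypothetical.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.2 * rng.normal(size=200) > 0).astype(int)
feature_names = ["screening_score", "age"]

# Logistic Regression: signed coefficients show each feature's
# direction of influence on the predicted probability.
lr = LogisticRegression().fit(X, y)
coefs = dict(zip(feature_names, lr.coef_[0]))

# Decision Tree: the fitted decision rules can be printed as text.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=feature_names)
```

A clinician can read the coefficient sign or the printed rule thresholds directly, which is exactly the transparency this section argues for.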
The project recognizes that the pursuit of interpretability does not entail a compromise on
accuracy or sophistication. Instead, it is an integrative approach that harmonizes the advanced
capabilities of machine learning with the necessity for transparency in the diagnostic process.
The interpretability of the screening tool is a feature, not a limitation, enriching the overall utility
and acceptance of the tool in real-world healthcare scenarios.
1.10 Ethical Considerations in Machine Learning for Healthcare

The seamless integration of machine learning into healthcare is an epochal advancement, but it
mandates an unwavering commitment to ethical considerations. Within this transformative
landscape, the current project stands as a beacon, emphasizing the fundamental principles of
privacy, fairness, and the responsible deployment of machine learning technologies within
healthcare contexts. Ethical considerations are not merely an ancillary concern; they are the
bedrock upon which the integrity and success of this project rest. The conscientious approach to
ethics is not only a moral imperative but is also instrumental in building and sustaining public
trust, while simultaneously safeguarding the well-being and privacy of individuals undergoing
ASD assessments. Striking an equilibrium between technological innovation and ethical
responsibility emerges as an indispensable facet for the successful implementation of ASD
screening tools that genuinely contribute to healthcare advancements.
Privacy, as a cornerstone of ethical considerations, takes precedence in the integration of
machine learning for healthcare applications. The project recognizes the sensitivity of health
data, especially in the context of neurodevelopmental disorders like ASD. Rigorous
anonymization and de-identification protocols are meticulously applied to the dataset to shield
the identities of individuals contributing to the project. By adhering to robust privacy measures,

the research team not only upholds ethical standards but also addresses concerns related to data
security and confidentiality, fostering a sense of trust among individuals participating in the
assessment process.
Fairness, another pivotal ethical principle, is explicitly acknowledged in the project's
approach. The machine learning algorithms are meticulously trained and validated to ensure that
they do not perpetuate biases or discriminate against specific demographic groups. The project
team actively addresses issues related to algorithmic fairness, striving to mitigate the risk of
unintended consequences and disparities in ASD identification. The commitment to fairness
aligns with the broader societal goal of creating healthcare technologies that are equitable and
accessible to diverse populations.
Beyond privacy and fairness, the responsible deployment of machine learning technologies is
a guiding ethical tenet. This involves careful consideration of the potential impact on individuals'
lives and the broader healthcare ecosystem. The project team critically evaluates not only the
accuracy and efficiency of the ASD screening tool but also its real-world implications.
Responsible deployment encompasses considerations of the tool's usability in clinical settings,
the interpretability of its outcomes, and the potential consequences of its recommendations. This
holistic approach ensures that the integration of machine learning into healthcare goes beyond
technical proficiency, actively considering its ethical implications in the service of the
individuals it aims to assist.
The ethical framework established by the project is pivotal for building and maintaining public
trust. Trust is foundational in healthcare, and its erosion can have far-reaching consequences. By
prioritizing privacy, fairness, and responsible deployment, the project aims to instill confidence
in individuals undergoing ASD assessments, healthcare professionals utilizing the screening tool,
and the broader public. Transparency about the ethical considerations and measures taken further
strengthens the bond of trust between the project and its stakeholders.
Furthermore, the project team recognizes the dynamic nature of ethical considerations in the
rapidly evolving field of machine learning and healthcare. Continuous monitoring and adaptation
of ethical protocols are integral to staying abreast of emerging challenges and ensuring that the
project remains aligned with evolving ethical standards. The commitment to ongoing ethical
evaluation reflects a proactive stance, acknowledging the dynamic interplay between technology
and ethical responsibilities.

In essence, the ethical considerations within the project extend far beyond compliance; they
embody a proactive and comprehensive approach to safeguarding the rights, privacy, and
well-being of individuals involved. The fusion of technological innovation with ethical
responsibility reflects a commitment to the highest standards of integrity, ensuring that the
benefits of machine learning in healthcare are realized without compromising fundamental
ethical principles. As the project advances, the ethical considerations embedded within its
framework not only guide its trajectory but also contribute to shaping a responsible and
trustworthy paradigm for the integration of machine learning into neurodevelopmental disorder
assessments.

1.11 Anticipated Outcomes and Future Implications

The anticipated outcomes of this visionary project extend far beyond the realms of technological
innovation, holding the promise of significantly enhancing the quality of life for individuals on
the autism spectrum. At the heart of these anticipated outcomes is the recognition that the early
detection of Autism Spectrum Disorder (ASD) is a pivotal gateway to improving long-term
outcomes for affected individuals. The screening tool, forged through the amalgamation of
diverse machine learning algorithms, is poised to revolutionize the landscape of ASD
identification by offering more accurate and efficient early detection capabilities.
The core objective of the screening tool aligns seamlessly with the broader aspirations of
precision medicine, heralding a future where healthcare interventions are tailored to the
individual characteristics and needs of each person. By harnessing the power of advanced
technologies, the project aims to usher in a new era in the identification of neurodevelopmental
disorders, particularly ASD. Precision medicine, characterized by personalized and targeted
approaches, becomes a tangible reality as the screening tool fine-tunes its predictions based on
the unique patterns and features exhibited by individuals on the autism spectrum.
The transformative potential of advanced technologies, particularly machine learning,
becomes evident through the lens of this project. The integration of cutting-edge methodologies
in the identification of ASD transcends traditional diagnostic paradigms, offering a more
nuanced and adaptive approach. The research team envisions a future where the diagnostic
journey for ASD is not only accurate but also characterized by efficiency, accessibility, and
compassion.

In essence, the project aspires to redefine ASD screening by seamlessly blending the prowess
of machine learning methodologies with a compassionate and person-centered approach. The
envisioned future is one where the screening process goes beyond a mere diagnostic tool,
becoming a conduit for early intervention that is not only more accessible but also more
effective. The compassion embedded in the approach acknowledges the unique challenges faced
by individuals on the autism spectrum and strives to create a screening tool that is not only
technologically advanced but also considerate of the diverse ways in which ASD may manifest.
The ultimate vision of the project is to foster a future where early intervention becomes a
transformative force in the lives of individuals affected by ASD. Early detection, facilitated by
the screening tool, opens avenues for timely and targeted interventions that can mitigate
challenges associated with ASD. These interventions may include specialized therapies,
educational support, and tailored strategies to enhance social integration and communication
skills.
Furthermore, the impact of the anticipated outcomes extends to the societal and economic
dimensions. By streamlining the identification process and facilitating early interventions, the
project holds the potential to reduce the long-term burden on healthcare systems. The economic
implications are profound, as early interventions can lead to improved outcomes, potentially
reducing the need for long-term care and support services.
As the project progresses, the insights gained from the screening tool contribute to the
evolving body of knowledge surrounding ASD. The data generated becomes a valuable resource
for researchers and clinicians, fostering a deeper understanding of the complex nature of
neurodevelopmental disorders. This, in turn, paves the way for future advancements in the field,
shaping new diagnostic and therapeutic strategies for ASD and potentially informing research in
other related domains.

Chapter Two

REQUIREMENTS ANALYSIS WITH SRS


2.1 System Overview


2.1.1 System Description

The ASD screening system will process the provided dataset, perform data preprocessing,
implement six supervised machine learning algorithms, and evaluate their performance. The
algorithms include Decision Trees, Random Forest, Support Vector Machines (SVM), K-Nearest
Neighbors (KNeighbors), Gaussian Naive Bayes (GaussianNB), and Logistic Regression.

2.1.2 System Features

2.1.2.1 Data Preprocessing

Clean and preprocess the provided dataset, handling missing or ill-formatted entries.

2.1.2.2 Algorithm Implementation

Implement the following supervised machine learning algorithms for ASD screening:
a. Decision Trees
b. Random Forest
c. Support Vector Machines (SVM)
d. K-Nearest Neighbors (KNeighbors)
e. Gaussian Naive Bayes (GaussianNB)
f. Logistic Regression

2.1.2.3 Model Evaluation

Evaluate the performance of each algorithm using appropriate metrics (e.g., accuracy, precision, recall, F1 score).

2.2 Functional Requirements

2.2.1 Data Preprocessing
2.2.1.1 Data Cleaning
The system shall clean the dataset by removing records with missing or ill-formatted entries.
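A minimal sketch of this cleaning step, assuming the dataset is loaded into a pandas DataFrame and that missing or ill-formatted entries are marked with '?' (the convention used in the UCI ASD screening CSV); the column values below are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative records; '?' marks a missing or ill-formatted entry,
# mirroring the convention used in the UCI ASD screening data.
raw = pd.DataFrame({
    "age":       ["26", "35", "?", "40"],
    "gender":    ["f", "m", "f", "?"],
    "ethnicity": ["White-European", "Latino", "Asian", "Black"],
})

# Normalize the missing-value marker, then drop incomplete records.
clean = raw.replace("?", np.nan).dropna().reset_index(drop=True)
clean["age"] = clean["age"].astype(int)  # restore the numeric dtype
```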

2.2.2 Algorithm Implementation


2.2.2.1 Random Forest

The system shall implement the Random Forest algorithm for ASD screening. Implemented
using scikit-learn in Python, Random Forest combines multiple decision trees to enhance
classification performance. This algorithm excels in handling diverse features such as age,
gender, ethnicity, and medical history, contributing to a comprehensive ASD screening solution.
With the flexibility to adjust hyperparameters like the number of trees and maximum depth,
Random Forest ensures adaptability to varying datasets. Through training on labeled data and
rigorous testing on a separate validation set, Random Forest enhances the predictive capabilities
of our ASD screening system.
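A hedged sketch of this setup with scikit-learn: the synthetic features stand in for the encoded screening attributes, and the hyperparameter values shown (number of trees, maximum depth) are placeholders rather than the tuned configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded screening features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# n_estimators (number of trees) and max_depth are the adjustable
# hyperparameters mentioned above; these values are placeholders.
clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # accuracy on the held-out set
```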

2.2.2.2 Support Vector Machines (SVM)

The Support Vector Machines (SVM) algorithm is integral to our Autism Spectrum Disorder
(ASD) screening system for adults. Utilizing the scikit-learn library in Python, SVM employs
supervised learning to classify individuals into ASD-positive or ASD-negative categories based
on diverse features, including age, gender, ethnicity, and medical history.
The SVM implementation allows for customization, enabling the selection of different kernels
and regularization parameters. Training involves using a labeled dataset, while testing assesses
the model's predictive performance on a separate validation set. Hyperparameter tuning
optimizes SVM performance, and seamless integration within the overall system ensures a
cohesive approach to ASD screening for adults.
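The kernel and regularization choices mentioned above map to `SVC` parameters in scikit-learn. A sketch on synthetic data follows; feature scaling is included because SVMs are sensitive to feature magnitudes, and the parameter values are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded screening features.
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# kernel and C correspond to the kernel selection and regularization
# parameter discussed above; these values are illustrative.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X, y)
train_accuracy = model.score(X, y)
```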

2.2.2.3 K-Nearest Neighbors (KNeighbors)

The system shall implement the K-Nearest Neighbors (KNeighbors) algorithm for ASD
screening. KNN determines ASD likelihood by identifying the k-nearest data points in the
feature space. This algorithm accommodates diverse input features like age, gender, ethnicity,
and medical history, ensuring flexibility in our screening process. With customizable parameters

such as the number of neighbors (k), KNN offers adaptability to different datasets. The training
phase involves learning the relationships between features, and testing evaluates the model's
performance on a separate validation set, enriching our ASD screening system with a
proximity-based classification strategy.
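A toy sketch of the proximity-based idea: with k = 3, a query point is assigned the majority class of its three nearest training points. The coordinates below are illustrative, not real screening features.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters of already-encoded feature vectors.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

# n_neighbors is the customizable k described above.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Each query point takes the majority label of its 3 nearest points.
pred = knn.predict([[5, 4], [0, 2]])  # -> [1, 0]
```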

2.2.2.4 Logistic Regression


The system shall implement the Logistic Regression algorithm for ASD screening.
Logistic Regression stands as a key component in our Autism Spectrum Disorder (ASD)
screening system for adults, offering a robust method for binary classification. Implemented
using the scikit-learn library in Python, Logistic Regression models the probability of ASD
occurrence based on input features like age, gender, ethnicity, and medical history. The algorithm
employs a logistic function to make predictions, providing a clear distinction between
ASD-positive and ASD-negative outcomes. Training involves using a labeled dataset, and the
model's predictive accuracy is assessed on a separate validation set during testing. Logistic
Regression contributes to the overall accuracy and interpretability of our ASD screening system.
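A minimal sketch of this probabilistic behavior: `predict_proba` exposes the logistic function's output, and `predict` thresholds it at 0.5. The single feature and labels below are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One synthetic feature (e.g. a raw screening score) with labels
# 1 = ASD-positive, 0 = ASD-negative; values are illustrative.
X = np.array([[1], [2], [3], [7], [8], [9]])
y = np.array([0, 0, 0, 1, 1, 1])

lr = LogisticRegression().fit(X, y)

# predict_proba exposes the modelled probability of ASD occurrence;
# predict thresholds that probability at 0.5.
prob_positive = lr.predict_proba([[8]])[0, 1]   # well above 0.5
label = lr.predict([[2]])[0]                    # -> 0
```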

2.2.3 Model Evaluation


2.2.3.1 Performance Metrics
The system shall evaluate the performance of each algorithm using accuracy, precision, recall,
and F1 score.
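These four metrics are available directly in scikit-learn; the labels below are hypothetical, with 1 denoting ASD-positive.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical outcomes: 1 = ASD-positive, 0 = ASD-negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of P and R
```

Here three true positives, one false positive, and one false negative give 0.75 for all four metrics.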

2.3 Non-Functional Requirements


2.3.1 Performance
The system shall provide timely responses to user inputs, with a maximum processing time of 10
seconds for screening.

2.3.2 Usability

The user interface shall be intuitive and user-friendly, requiring minimal training for users to
navigate and input data.

2.3.3 Security

User data input and screening results shall be securely handled, and no sensitive information
shall be stored.

2.4 Constraints
2.4.1 Dataset Limitations
The system's performance is dependent on the quality and representativeness of the provided
dataset.

2.4.2 Availability of Algorithms


The availability of machine learning libraries for the chosen algorithms is a prerequisite for
implementation.

2.5 Appendix
2.5.1 Some References
- Prof. Fadi Thabtah. "Autism Spectrum Disorder Screening: Machine Learning
Adaptation and DSM-5 Fulfillment."
- UCI Machine Learning Repository. "Autistic Spectrum Disorder Screening Data for
Adult."

2.5.2 Glossary
ASD: Autism Spectrum Disorder
SRS: Software Requirements Specification

Chapter Three

SYSTEM DESIGN


3.1 ER Diagram

Entity-Relationship (ER) diagrams serve as a blueprint for database systems, elucidating the
connections between entities and their attributes. Entities are depicted as rectangles, representing
distinct data objects, while attributes within entities are illustrated as ovals. Diamonds signify
relationships between entities, elucidating how they interact.

Cardinality, often drawn in crow's foot notation, denotes the numerical associations between entities: the "one" side is drawn as a straight line, while the "many" side shows the crow's foot, conveying the one-to-many relationships crucial for database integrity.

ER diagrams empower database designers to conceptualize and communicate intricate data structures, ensuring efficient data organization and retrieval. Mastering these visualizations is paramount for effective database development and management.

Entities:
● Person:
Attributes: PersonID (Primary Key), Age, Gender, Ethnicity, BornWithJaundice,
FamilyMemberWithPDD, Completer (Who is completing the test), CountryOfResidence,
UsedAppBefore, ScreeningMethodType.

● ScreeningTest:
Attributes: TestID (Primary Key), DateConducted, Results.

● Question:
Attributes: QuestionID (Primary Key), QuestionText.

Relationships:
● Person takes Screening Test:
1. Relationship: One-to-Many (One person can take multiple tests; each test is taken
by one person).
2. Foreign Key: PersonID (in ScreeningTest).

● Screening Test includes Question:


1. Relationship: Many-to-Many (One test can include multiple questions; one
question can be part of multiple tests).

2. Associative Entity: TestQuestionAssociation (with attributes such as TestID,
QuestionID, and Answer).

Attributes and Data Types:

● Person: Age (Integer), Gender (String), Ethnicity (String), BornWithJaundice (Boolean), FamilyMemberWithPDD (Boolean), Completer (String), CountryOfResidence (String), UsedAppBefore (Boolean), ScreeningMethodType (String).
● ScreeningTest: DateConducted (Date), Results (String).
● Question: QuestionText (String).
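The entities, keys, and relationships above can be expressed as relational DDL. The sketch below runs the schema through Python's built-in sqlite3 (SQLite has no native Boolean or Date types, so those are approximated); it is an illustration of the design, not part of the screening system itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (
    PersonID            INTEGER PRIMARY KEY,
    Age                 INTEGER,
    Gender              TEXT,
    Ethnicity           TEXT,
    BornWithJaundice    INTEGER,  -- Boolean stored as 0/1
    FamilyMemberWithPDD INTEGER,  -- Boolean stored as 0/1
    Completer           TEXT,
    CountryOfResidence  TEXT,
    UsedAppBefore       INTEGER,  -- Boolean stored as 0/1
    ScreeningMethodType TEXT
);
CREATE TABLE ScreeningTest (      -- one Person takes many tests
    TestID        INTEGER PRIMARY KEY,
    DateConducted TEXT,           -- Date stored as ISO-8601 text
    Results       TEXT,
    PersonID      INTEGER REFERENCES Person(PersonID)
);
CREATE TABLE Question (
    QuestionID   INTEGER PRIMARY KEY,
    QuestionText TEXT
);
CREATE TABLE TestQuestionAssociation (  -- many-to-many link
    TestID     INTEGER REFERENCES ScreeningTest(TestID),
    QuestionID INTEGER REFERENCES Question(QuestionID),
    Answer     INTEGER,
    PRIMARY KEY (TestID, QuestionID)
);
""")
tables = {row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
```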

Fig 3.1 ER diagram

2. Use Case Diagram

A Use Case Diagram is a visual representation that illustrates the functional requirements and
interactions between actors and a system. It provides a high-level view of how users (actors)
interact with a system and the system's responses.

Actors:
1. Actors are external entities (e.g., users, systems) that interact with the system.
2. Representation: Depicted as stick figures or labeled boxes outside the system boundary.
3. Purpose: Identify and define the roles of entities interacting with the system.

Use Cases:
1. Use cases represent the functionalities or actions that the system performs in response to
interactions from actors.
2. Representation: Depicted as ovals within the system boundary.
3. Purpose: Capture and visualize the system's behavioral aspects from a user's perspective.

Associations:
1. Associations connect actors with use cases, representing interactions or relationships.
2. Representation: Arrows connecting actors to use cases, indicating the flow of
communication.
3. Purpose: Illustrate how actors initiate and participate in specific system functionalities.

System Boundary:
1. The system boundary encapsulates all use cases and actors, defining the scope of the
system.
2. Representation: A box surrounding use cases and actors.
3. Purpose: Clearly demarcate the system's boundaries, separating internal functionalities
from external interactions.

Actors:

● User/Person: Represents individuals taking the screening test.


● ML Engineer: Analyzes screening results and ML predictions.

Use Cases:

● User/Person:
1. Provide personal data.

33
2. Take the screening test.
● ML Engineer/Developer:
1. View screening results.
2. Analyze ML predictions.

Fig 3.2 Use Case Diagram

Chapter Four

Test Plan

TEST PLAN

Autism Spectrum Disorder (ASD) has a profound impact on individuals' lives, necessitating
accurate and efficient screening methods. This test plan delves into the evaluation of various
machine learning algorithms applied to an ASD screening dataset for adults. The algorithms
under scrutiny include Logistic Regression, Support Vector Machine (SVM), k-Nearest
Neighbors (KNN), XGBoost Classifier, Random Forest, and Multi-Layer Perceptron (MLP).

Objective:
The primary objective is to assess the performance of these algorithms based on key metrics such
as training score, testing score, Mean Absolute Error (MAE), Mean Squared Error (MSE), and
Root Mean Squared Error (RMSE). Through rigorous testing, we aim to determine the efficacy
of each algorithm in accurately predicting ASD.
4.1 Data Preparation:
Input Data: Ensure the dataset is properly preprocessed: handle missing and NaN values,
encode categorical variables, and scale features appropriately.
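These preprocessing steps can be sketched as follows. The toy frame and its column names (age, ethnicity, A1_Score) are illustrative stand-ins, not the exact dataset schema:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy frame standing in for the screening data (illustrative columns).
df = pd.DataFrame({
    'age': [25.0, None, 40.0, 31.0],          # one missing value to handle
    'ethnicity': ['White', '?', 'Asian', 'White'],
    'A1_Score': [1, 0, 1, 1],
})

df['age'] = df['age'].fillna(df['age'].median())            # impute NaN values
df['ethnicity'] = df['ethnicity'].replace({'?': 'Others'})  # clean placeholder
df['ethnicity'] = LabelEncoder().fit_transform(df['ethnicity'])  # encode categorical

df[['age']] = StandardScaler().fit_transform(df[['age']])   # scale numeric feature
print(df.isna().sum().sum())  # 0 — no missing values remain
```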

4.2 Algorithm Configuration:


Parameter Tuning: Identify optimal hyperparameters for each algorithm using techniques like
grid search or random search, tailored to the unique characteristics of ASD screening data.
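A minimal grid-search sketch follows; the synthetic data and the parameter grid shown are illustrative, not the tuned values actually used in the project:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the screening features/labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Candidate hyperparameters for SVM (illustrative grid).
grid = GridSearchCV(
    SVC(kernel='rbf'),
    param_grid={'C': [0.1, 1, 10], 'gamma': ['scale', 0.01]},
    cv=5, scoring='accuracy',
)
grid.fit(X, y)
print(grid.best_params_)
```

`GridSearchCV` exhaustively cross-validates every parameter combination; `RandomizedSearchCV` can be swapped in when the grid is large.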

4.3 Training and Testing:


Splitting Data: Divide the dataset into training and testing sets, ensuring a balanced
representation of ASD and non-ASD cases.
Model Training: Train each algorithm using the training dataset, monitoring convergence and
addressing overfitting concerns.
Model Testing: Apply trained models to the testing dataset, evaluating performance using
accuracy, precision, recall, and F1-score.
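The split/train/test cycle described above can be sketched as follows, with synthetic imbalanced data standing in for the ASD dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Synthetic imbalanced data standing in for ASD / non-ASD cases.
X, y = make_classification(n_samples=400, weights=[0.7, 0.3], random_state=0)

# stratify=y keeps the positive/negative ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)
print(accuracy_score(y_test, pred), f1_score(y_test, pred))
```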

4.4 Evaluation Metrics:


4.4.1 Mean Squared Error (MSE):
Mean Squared Error is a commonly used metric to evaluate the performance of a regression
model. It provides a measure of the average squared difference between predicted and actual
values. MSE is calculated by taking the average of the squared differences between each
predicted and actual value.
● MSE penalizes larger errors more significantly due to the squaring of differences.
● It provides a measure of the model's precision in predicting values.

MSE = (1/n) Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²

Where Yᵢ is the actual value and Ŷᵢ is the predicted value.

4.4.2 Mean Absolute Error (MAE):

Mean Absolute Error is another metric for assessing the accuracy of a regression model. It
calculates the average absolute difference between predicted and actual values. Unlike MSE,
MAE does not penalize large errors as significantly and provides a more direct measure of the
model's accuracy.
● MAE is less sensitive to outliers compared to MSE.
● It gives an average measure of how much the predictions deviate from the actual values.

MAE = (1/n) Σᵢ₌₁ⁿ |Yᵢ − Ŷᵢ|

Where Yᵢ is the actual value and Ŷᵢ is the predicted value.

4.4.3 Root Mean Squared Error (RMSE):

Root Mean Squared Error is a modification of MSE that takes the square root of the average
squared differences between predicted and actual values. RMSE is often preferred when the
errors are expected to be in a similar unit as the target variable, providing a more interpretable
metric.
● RMSE shares similar characteristics with MSE but is presented in the original unit of the
target variable, making it easier to comprehend.
● It penalizes larger errors while still providing a measure of the model's accuracy.

RMSE = √MSE

Where RMSE stands for Root Mean Squared Error and MSE for Mean Squared Error.
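The three error metrics can be computed directly from their definitions; a small numeric sketch with illustrative labels:

```python
import numpy as np
from math import sqrt

# Illustrative actual and predicted values (one wrong prediction).
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

mae = np.mean(np.abs(y_true - y_pred))   # average absolute difference
mse = np.mean((y_true - y_pred) ** 2)    # average squared difference
rmse = sqrt(mse)                         # back in the original unit

print(mae, mse, rmse)  # 0.2 0.2 0.447...
```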

4.5 Final Values:

1. Support Vector Machine Algorithm

Fig 4.1: Support Vector Regression score

2. K Nearest Neighbours Algorithm

Fig 4.2: k Nearest Neighbours Regression score


3. Logistic Regression Algorithm

Fig 4.3: Logistic Regression score


4. XGB Classifier Algorithm

Fig 4.4: XGB Classifier Regression score


5. Random Forest Algorithm

Fig 4.5: Random Forest Regression score

6. MLP algorithm

Fig 4.6: MLP Regression score

Chapter Five

BODY OF THESIS

The dataset used in this project is based on the Quantitative Checklist for Autism in adults
screening method devised by Baron-Cohen. A set of 10 questions, listed in the table below, has
been used. The answers to these questions are mapped to binary values as the class type. These
values are assigned during the data collection process by means of answering the questionnaire.
The class value "Yes" is assigned if the questionnaire score is greater than 3, that is, there are
potential ASD traits. Otherwise, the class value "No" is assigned, implying no ASD traits.
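The scoring rule just described can be written directly; a minimal sketch (the answers list stands for the ten binary responses A1–A10):

```python
def classify_screening(answers):
    """Map a list of ten 0/1 answers to the class label.

    The class is "Yes" (potential ASD traits) when the total
    questionnaire score is greater than 3, otherwise "No".
    """
    score = sum(answers)
    return "Yes" if score > 3 else "No"

print(classify_screening([1, 1, 1, 1, 0, 0, 0, 0, 0, 0]))  # score 4 -> "Yes"
print(classify_screening([1, 1, 0, 0, 0, 0, 0, 0, 0, 0]))  # score 2 -> "No"
```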

Table 5.1 Set of Questions

Dataset Variable    Description

A1     Person responding to you calling his/her name
A2     Ease of getting eye contact from the person
A3     Person pointing to objects he/she wants
A4     Person pointing to draw your attention to his/her interests
A5     If the person shows pretense
A6     Ease of the person to follow where you point/look
A7     If the person wants to comfort someone who is upset
A8     First words
A9     If the person uses basic gestures
A10    If the person daydreams/stares at nothing

Chapter Six

RESULTS

RESULTS AND OBSERVATIONS

The entire system rests on the quality and representativeness of the dataset. In the context of this
ambitious project, the dataset compilation is a meticulous process, carefully crafted to be the
linchpin that shapes the efficacy and reliability of the developed ASD screening tool. The dataset
is not merely a collection of data points but a comprehensive assembly, drawing upon a diverse
pool of information from both individuals with Autism Spectrum Disorder (ASD) and those
without, specifically focusing on adult subjects. This strategic compilation ensures that the
training and evaluation processes of the machine learning algorithms unfold on a rich canvas,
reflecting the complexity and diversity inherent in real-world scenarios.
The assembly of this dataset is characterized by a rigorous curation process, guided by a
commitment to inclusivity and authenticity. Information is collated from a spectrum of sources,
capturing the nuances of adult ASD from various perspectives. The inclusion of both ASD and
non-ASD subjects establishes a balanced and representative foundation, mimicking the
intricacies encountered in clinical settings. This deliberate approach is crucial for the machine
learning algorithms to develop a nuanced understanding of the patterns and features associated
with ASD, enhancing their ability to discriminate between individuals with and without the
disorder.

Accuracy = (TP + TN) / (TP + FP + TN + FN)


Where,
TP = True Positive
TN = True Negative
FP = False Positive
FN = False Negative
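As a quick sketch, the accuracy formula above in code; the confusion-matrix counts used here are illustrative, not the project's actual results:

```python
def accuracy(tp, tn, fp, fn):
    # Accuracy = (TP + TN) / (TP + FP + TN + FN)
    return (tp + tn) / (tp + fp + tn + fn)

# Illustrative confusion-matrix counts.
print(accuracy(tp=50, tn=40, fp=6, fn=4))  # 0.9
```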
We have calculated the accuracy using the above formula; the following table shows the
accuracy obtained for each algorithm.

Table 6.1 Results
Learning Algorithm Accuracy

Multi Layer Perceptron 73.80640 %

Random Forests 95.89874 %

SVM 88.36082 %

KNN 82.87602 %

Logistic Regression 92.45491 %

XGB Classifier 90.19123 %

Chapter Seven

SUMMARY AND CONCLUSION


The project focuses on predicting Autism Spectrum Disorder (ASD) screening results for adults
using machine learning algorithms. This analysis aims to provide valuable insights into the
potential effectiveness of various algorithms for ASD screening.

Dataset Overview:
The dataset comprises binary ASD scores (A1 to A10), demographic features (gender, ethnicity),
and other relevant attributes such as jaundice at birth, autism history, country of residence, and
family relationship. The selection of these features was driven by their potential impact on ASD
screening outcomes based on domain knowledge and preliminary exploratory data analysis
(EDA).

Data Preprocessing:
Data preprocessing steps included handling missing values, encoding categorical variables, and
ensuring the dataset is suitable for machine learning algorithms. The choice of preprocessing
techniques aimed at maintaining the integrity of the dataset and preparing it for effective model
training and evaluation.

Exploratory Data Analysis (EDA):


EDA highlighted patterns in ASD scores, distributions of demographic features, and potential
correlations. The insights gained during EDA informed the subsequent model selection and
evaluation, guiding the identification of features that could significantly influence ASD
screening outcomes.

Model Building and Evaluation

Model Selection:
The choice of algorithms (Logistic Regression, Support Vector Machine (SVM), K-Nearest
Neighbors (KNN), XGBoost Classifier, Random Forest, and Multi-Layer Perceptron (MLP))
was driven by their suitability for binary classification tasks. Logistic Regression and SVM are
well-established algorithms for binary classification, while KNN is known for its simplicity and
effectiveness. XGBoost and Random Forest are ensemble methods known for handling complex
relationships in data, and MLP is a versatile neural network architecture suitable for non-linear
patterns.

Feature Selection:
Features such as ASD scores, demographic information, and relevant medical history were
chosen based on their potential to contribute to ASD screening outcomes. ASD scores are direct
indicators, while demographic features provide contextual information that may influence

screening results. The inclusion of medical history features aims to capture additional factors that
could impact ASD.

Training and Testing:


The dataset was split into training and testing sets to train and evaluate the models, ensuring a
robust assessment of their predictive capabilities.

Model Evaluation:
Each algorithm's performance was evaluated using standard metrics such as accuracy, precision,
recall, and F1 score. The choice of these metrics was motivated by the need to strike a balance
between overall accuracy and the ability to correctly identify positive cases, which is crucial in
medical screening scenarios.
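These four metrics can be obtained together from scikit-learn; a minimal sketch on illustrative true labels and predictions (1 marks an ASD-positive case):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative labels and predictions (1 = ASD positive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print('Accuracy :', accuracy_score(y_true, y_pred))   # 0.75
print('Precision:', precision_score(y_true, y_pred))  # 0.75
print('Recall   :', recall_score(y_true, y_pred))     # 0.75
print('F1 score :', f1_score(y_true, y_pred))         # 0.75
```

Precision and recall matter most here: a false negative (missed ASD case) carries a different cost from a false positive in a screening setting.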

Visualizations:
Visualizations, including box plots, bar graphs, etc, provided a comprehensive view of the
models' performance, facilitating informed decision-making. These visualizations were chosen to
effectively communicate the trade-offs between true positive rates and false positive rates.

Conclusion

The project demonstrated the effectiveness of machine learning models in ASD screening.
Among the algorithms tested, XGBoost Classifier and Random Forest displayed superior
performance in terms of accuracy and precision.

Feature Importance:
Feature importance analysis revealed key predictors influencing ASD screening outcomes.
Notably, specific ASD scores, age, and family relationships emerged as significant contributors.
The choice of these features was validated by their impact on the models' predictive power.
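Feature importances of this kind can be read directly from a fitted Random Forest; a sketch on synthetic data (the feature names listed are illustrative stand-ins for the dataset's columns):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 5 features, of which 2 are informative.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X, y)

names = ['A1_Score', 'A2_Score', 'age', 'jaundice', 'relation']  # illustrative
for name, imp in sorted(zip(names, forest.feature_importances_),
                        key=lambda t: -t[1]):
    print(f'{name}: {imp:.3f}')
```

The importances sum to 1, so each value is the fraction of the forest's impurity reduction attributable to that feature.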

Practical Implications:
The developed models have practical implications for streamlining ASD screening processes,
potentially aiding healthcare professionals in early and accurate identification of ASD in adults.
The inclusion of specific features ensures the models capture relevant information for actionable
insights.

Limitations and Future Work:


Limitations, such as data constraints and potential biases, were acknowledged. Suggestions for
future work include collecting additional data and refining models for enhanced predictive
accuracy. The continuous exploration of additional features and algorithms may further improve
the models' performance.

FUTURE SCOPE

In the future, this project has the potential for improvement through the incorporation of novel
machine learning algorithms. Additionally, fine-tuning the existing algorithm by adjusting

various parameters crucial to its accuracy could further enhance the project. This optimization
aims to generate more precise models capable of effectively identifying Autism Spectrum
Disorder (ASD) in adults.
Furthermore, the project's advancement could involve the integration of a Deep Learning
neural network. This addition would contribute to uncovering additional concealed information
within the dataset. The advantages of refining existing algorithms or introducing new ones
extend beyond practical applications and can also be valuable for research purposes.
In addition to introducing new predictive techniques, the project will incorporate
advanced and more efficient visualization methods. These enhancements are geared towards
facilitating a clearer and more comprehensive understanding of the data, promoting better
visualization, and fostering in-depth discussions.

REFERENCES

[1] A. A. Abdullah, S. Rijal, and S. R. Dash, “Evaluation on Machine learning Algorithms for
Classification of Autism Spectrum Disorder (ASD),” Journal of Physics, vol. 1372, no. 1, p.
012052, Nov. 2019.

[2] A. N. Krishna, K. C. Srikantaiah, and C. Naveena, Integrated Intelligent Computing,


Communication and Security. Springer, 2018.

[3] “A Review on Predicting Autism Spectrum Disorder(ASD) meltdown using Machine
Learning Algorithms,” IEEE Conference Publication | IEEE Xplore, Nov. 18, 2021.

[4] “A supervised machine learning algorithm for arrhythmia analysis,” IEEE Conference
Publication | IEEE Xplore, 1997.

[5] B. Deepa and K. S. J. Marseline, “Exploration of Autism Spectrum Disorder using


Classification Algorithms,” Procedia Computer Science, vol. 165, pp. 143–150, Jan. 2019.

[6] C. Küpper et al., “Identifying predictive features of autism spectrum disorders in a clinical
sample of adolescents and adults using machine learning,” Scientific Reports, vol. 10, no. 1, Mar.
2020.

[7] C. M. Bishop, Pattern recognition and machine learning. Springer, 2016.

[8] D. Nguyen and J. Patrick, “Supervised machine learning and active learning in classification
of radiology reports,” Journal of the American Medical Informatics Association, vol. 21, no. 5,
pp. 893–901, Sep. 2014.

[9] D. K. S. R. et al., "Machine Learning based novel Autism Spectrum Disorder Screening,"
TURCOMAT, vol. 12, no. 3, pp. 4866–4879, May 2021.

[10] F. Thabtah and D. Peebles, “A new machine learning model based on induction of rules for
autism detection,” Health Informatics Journal, vol. 26, no. 1, pp. 264–286, Jan. 2019.

[11] H. Abdi, L. J. Williams, and D. Valentin, “Multiple factor analysis: principal component
analysis for multitable and multiblock data sets,” WIREs Computational Statistics, vol. 5, no. 2,
pp. 149–179, Feb. 2013.

[12] H. Bhavsar and A. Ganatra, “A Comparative Study of Training Algorithms for Supervised
Machine Learning,” International Journal of Soft Computing and Engineering (IJSCE), Jan.
2012.

[13] K. Hyde et al., “Applications of Supervised Machine Learning in Autism Spectrum


Disorder Research: a Review,” Review Journal of Autism and Developmental Disorders, vol. 6,
no. 2, pp. 128–146, Feb. 2019.

[14] K. S. Oma, P. Mondal, N. S. Khan, Md. R. K. Rizvi, and M. N. Islam, “A Machine Learning
Approach to Predict Autism Spectrum Disorder,” IEEE, Feb. 2019.

[15] M. Alteneiji, “Autism Spectrum Disorder Diagnosis using Optimal Machine Learning
Methods,” 2020.

[16] Md. M. Rahman, O. L. Usman, R. C. Muniyandi, S. Sahran, S. Mohamed, and R. A. Razak,
“A review of Machine learning Methods of feature selection and Classification for autism
Spectrum Disorder,” Brain Sciences, vol. 10, no. 12, p. 949, Dec. 2020.

[17] M. J. Maenner, M. Yeargin‐Allsopp, K. Van Naarden Braun, D. Christensen, and L. A.


Schieve, “Development of a machine learning algorithm for the surveillance of autism spectrum
disorder,” PLOS ONE, vol. 11, no. 12, p. e0168224, Dec. 2016.

[18] M. N. Murty and R. Raghava, Support vector machines and perceptrons: Learning,
Optimization, Classification, and Application to Social Networks. Springer, 2016.

[19] A. C. Müller and S. Guido, Introduction to Machine Learning with Python. O'Reilly Media,
2016.

[20] M. W. Berry, A. Mohamed, and B. W. Yap, Supervised and unsupervised learning for data
science. Springer Nature, 2019.

[21] N. Kühl, R. Hirt, L. Baier, B. Schmitz, and G. Satzger, “How to conduct rigorous supervised
machine learning in information Systems research: The Supervised Machine Learning Report
Card,” Communications of the Association for Information Systems, vol. 48, no. 1, pp. 589–615,
Jan. 2021.

[22] S. H. Lee, M. J. Maenner, and C. M. Heilig, “A comparison of machine learning algorithms


for the surveillance of autism spectrum disorder,” PLOS ONE, vol. 14, no. 9, p. e0222907, Sep.
2019.

[23] S. J. Moon, J. Hwang, R. K. Kana, J. Torous, and J. W. Kim, “Accuracy of Machine


Learning Algorithms for the diagnosis of Autism Spectrum Disorder: Systematic Review and
Meta-Analysis of Brain Magnetic Resonance Imaging Studies,” JMIR Mental Health, vol. 6, no.
12, p. e14108, Dec. 2019.

[24] S. Raj and S. Masood, “Analysis and detection of autism spectrum disorder using machine
learning techniques,” Procedia Computer Science, vol. 167, pp. 994–1004, Jan. 2020.

[25] S. R, “A machine learning way to classify autism spectrum Disorder,” Learning &
Technology Library (LearnTechLib), Mar. 30, 2021.

[26] S. Raschka, Python Machine Learning. Packt Publishing Ltd, 2015.

[27] S. Raschka, Y. Liu, V. Mirjalili, and D. Dzhulgakov, Machine Learning with PyTorch and
Scikit-Learn: Develop machine learning and deep learning models with Python. Packt
Publishing Ltd, 2022.

[28] S. Uddin, A. Khan, E. Hossain, and M. A. Moni, “Comparing different supervised machine
learning algorithms for disease prediction,” BMC Medical Informatics and Decision Making, vol.
19, no. 1, Dec. 2019.

[29] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning: Data Mining,
Inference, and Prediction. Springer Science & Business Media, 2013.

[30] V. Nasteski, “An overview of the supervised machine learning methods,” Horizonti. Serija
B. Prirodno-matematički, Tehničko-tehnološki, Biotehnički, Medicinski Nauki I Zdravstvo, vol. 4,
pp. 51–62, Dec. 2017.

APPENDIX A

SCREEN SHOTS


Fig A-1 First Five rows of the dataset

Fig A-2 value_count of each unique value in the column ethnicity

Fig A-3 value_count of each unique value in the column relation

Fig A-4 Pie chart for the number of data for each target

Fig A-5 Plot where the scores indicate no. of yes for the set of 10 questions asked

Fig A-6 Plot where the count indicates no. of males and females

Fig A-7 Plot where count indicates no. of people from different ethnicities

Fig A-8 Plot with 0 for people having autism and 1 for people not having autism

Fig A-9 plots for different countries given in the dataset

Fig A-10 pair plot for A1-A10

Fig A-11 heat map for correlation matrix of numeric columns

Fig A-12 Count plot where count indicates number of cases for each age group

Fig A-13 Comparison between scores and number of positive and negative cases

Fig A-14 Normal distribution of the age values after log transformations

APPENDIX B

SOURCE CODE

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn import metrics
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler

import warnings
warnings.filterwarnings('ignore')

from google.colab import files


uploaded = files.upload()

df = pd.read_csv('new_excel.csv')
print(df.head())

df['ethnicity'].value_counts()

df['relation'].value_counts()

df = df.replace({'yes':1, 'no':0, '?':'Others', 'others':'Others'})

plt.pie(df['Class/ASD'].value_counts().values, autopct='%1.1f%%')
plt.show()

ints = []
objects = []
floats = []
for col in df.columns:
    if df[col].dtype == int:
        ints.append(col)
    elif df[col].dtype == object:
        objects.append(col)
    else:
        floats.append(col)

# Results: number of 'yes' answers (score 1) for each of the 10 questions

scores_df = df.loc[:, 'A1_Score':'A10_Score']
scores_df.sum().plot(kind='bar', color='black')
plt.title('Distribution of Scores (A1 to A10)')
plt.xlabel('Question Number')
plt.ylabel('Total Score')
plt.show()

#Gender
df['gender'].value_counts().plot(kind='bar', color='black')
plt.title('Distribution of Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()

#ethnicity graph
df['ethnicity'].value_counts().plot(kind='bar', color='black')
plt.title('Distribution of Ethnicity')
plt.xlabel('Ethnicity')
plt.ylabel('Count')
plt.show()

# jaundice at birth
df['jundice'].value_counts().plot(kind='bar', color='black')
plt.title('Distribution of Jundice')
plt.xlabel('Jundice')
plt.ylabel('Count')
plt.show()

# autism result
df['austim'].value_counts().plot(kind='bar', color='black')
plt.title('Distribution of Autism')
plt.xlabel('Autism')
plt.ylabel('Count')
plt.show()

#country result

df['contry_of_res'].value_counts().plot(kind='bar', color='black')
plt.title('Distribution of Country of Residence')
plt.xlabel('Country of Residence')
plt.ylabel('Count')
plt.show()

# pair plot for A1-A10


scores_df = df.loc[:, 'A1_Score':'A10_Score']
scores_df['Class/ASD'] = df['Class/ASD']
sns.pairplot(scores_df, hue='Class/ASD', palette='gray')
plt.title('Pair Plot of Scores')
plt.show()

# heat map for correlation matrix (seaborn already imported above)

# Selecting numeric columns for correlation matrix


numeric_columns = df.select_dtypes(include=['int64']).columns
correlation_matrix = df[numeric_columns].corr()

sns.heatmap(correlation_matrix, cmap='Greys', annot=True, fmt='.2f')


plt.title('Correlation Matrix Heatmap')
plt.show()

df = df[df['result']>-5]
df.shape

# This function makes groups by taking
# the age as a parameter
def convertAge(age):
    if age < 4:
        return 'Toddler'
    elif age < 12:
        return 'Kid'
    elif age < 18:
        return 'Teenager'
    elif age < 40:
        return 'Young'
    else:
        return 'Senior'

df['ageGroup'] = df['age'].apply(convertAge)

sns.countplot(x=df['ageGroup'], hue=df['Class/ASD'])
plt.show()

def add_feature(data):
    # Creating a column with all values zero
    data['sum_score'] = 0
    for col in data.loc[:, 'A1_Score':'A10_Score'].columns:
        # Updating the 'sum_score' value with scores
        # from A1 to A10
        data['sum_score'] += data[col]
    # Creating a combined indicator from the below three columns
    data['ind'] = data['austim'] + data['used_app_before'] + data['jundice']
    return data

df = add_feature(df)

sns.countplot(x=df['sum_score'], hue=df['Class/ASD'])
plt.show()

# Applying log transformations to remove the skewness of the data.


df['age'] = df['age'].apply(lambda x: np.log(x))

# Assuming df is your DataFrame


sns.histplot(df['age'], color='gray', kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

def encode_labels(data):
    for col in data.columns:
        # Here we check if the datatype
        # is object; if so, we encode it
        if data[col].dtype == 'object':
            le = LabelEncoder()
            data[col] = le.fit_transform(data[col])
    return data

df = encode_labels(df)

# Making a heatmap to visualize the correlation matrix
plt.figure(figsize=(10, 10))
sns.heatmap(df.corr() > 0.8, annot=True, cbar=False)
plt.show()

removal = ['age_desc', 'used_app_before', 'austim']


features = df.drop(removal + ['Class/ASD'], axis=1)
target = df['Class/ASD']

# Split the data into training and validation sets


X_train, X_val, Y_train, Y_val = train_test_split(features, target, test_size=0.2, random_state=10, stratify=target)

# Apply oversampling to balance the data


oversampler = RandomOverSampler(sampling_strategy='minority', random_state=0)
X_train_resampled, Y_train_resampled = oversampler.fit_resample(X_train, Y_train)

# Display the shapes of the resampled data


print("Shape of X_train_resampled:", X_train_resampled.shape)
print("Shape of Y_train_resampled:", Y_train_resampled.shape)

# Normalizing the features for stable and fast training.
# Using the oversampled training data as X / Y from here on
# (assumption: the resampled set is what the later cells train on).
X, Y = X_train_resampled, Y_train_resampled

scaler = StandardScaler()
X = scaler.fit_transform(X)
X_val = scaler.transform(X_val)

# Check for NaN values in X
nan_columns = np.isnan(X).any(axis=0)
nan_columns = np.where(nan_columns)[0].tolist()

print("Columns with NaN values:", nan_columns)

# Assuming X is your NumPy array


column_means = np.nanmean(X, axis=0) # Compute mean of each column ignoring NaN values

# Replace NaN values with column means


X = np.where(np.isnan(X), column_means, X)

from sklearn.linear_model import LogisticRegression


from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn import metrics # Make sure to import metrics

# Assuming X, Y, X_val, Y_val are defined and represent your data

# Scaling the features for stable and fast training.


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_val_scaled = scaler.transform(X_val)

# List of models
models = [LogisticRegression(), XGBClassifier(), SVC(kernel='rbf')]

# Training and evaluation loop
for model in models:
    model.fit(X_scaled, Y)

    print(f'{model} : ')
    print('Training AUC-ROC Score : ', metrics.roc_auc_score(Y, model.predict(X_scaled)))
    print('Validation AUC-ROC Score : ', metrics.roc_auc_score(Y_val, model.predict(X_val_scaled)))
    print()

from sklearn.neural_network import MLPClassifier


from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Assuming X, Y, X_val, Y_val are defined and represent your data

# Split the data into training and validation sets
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=10, stratify=Y)

# Scaling the features for stable and fast training.


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# List of models
models = [
LogisticRegression(),
XGBClassifier(),
SVC(kernel='rbf'),
MLPClassifier(random_state=42),
RandomForestClassifier(random_state=42),
KNeighborsClassifier()
]

# Training and evaluation loop
for model in models:
    model.fit(X_train_scaled, Y_train)

    print(f'{model} : ')
    print('Training AUC-ROC Score : ', metrics.roc_auc_score(Y_train, model.predict(X_train_scaled)))
    print('Validation AUC-ROC Score : ', metrics.roc_auc_score(Y_val, model.predict(X_val_scaled)))
    print()

from sklearn.metrics import mean_absolute_error, mean_squared_error


from math import sqrt

# Assuming X_train, X_test, Y_train, Y_test are defined and represent your training and testing data

# Training the models and reporting scores and error metrics
for model in models:
    model.fit(X_train_scaled, Y_train)

    # Training and testing scores
    training_score = model.score(X_train_scaled, Y_train)
    testing_score = model.score(X_val_scaled, Y_val)

    # Mean Absolute Error (MAE)
    y_pred_train = model.predict(X_train_scaled)
    y_pred_test = model.predict(X_val_scaled)
    mae_train = mean_absolute_error(Y_train, y_pred_train)
    mae_test = mean_absolute_error(Y_val, y_pred_test)

    # Mean Squared Error (MSE)
    mse_train = mean_squared_error(Y_train, y_pred_train)
    mse_test = mean_squared_error(Y_val, y_pred_test)

    # Root Mean Squared Error (RMSE)
    rmse_train = sqrt(mse_train)
    rmse_test = sqrt(mse_test)

    # Print results
    print(f'{model} : ')
    print('Training Score:', training_score)
    print('Testing Score:', testing_score)
    print('Mean Absolute Error (MAE) - Training:', mae_train, ' Testing:', mae_test)
    print('Mean Squared Error (MSE) - Training:', mse_train, ' Testing:', mse_test)
    print('Root Mean Squared Error (RMSE) - Training:', rmse_train, ' Testing:', rmse_test)
    print()

from sklearn.linear_model import LogisticRegression


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# Assuming X, Y, X_val, Y_val are defined and represent your data

# Split the data into training and validation sets


X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=10, stratify=Y)

# Scaling the features for stable and fast training.


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Logistic Regression model


logreg_model = LogisticRegression()
logreg_model.fit(X_train_scaled, Y_train)

# Prediction on validation set


y_pred_val = logreg_model.predict(X_val_scaled)

# Performance metrics
accuracy = metrics.accuracy_score(Y_val, y_pred_val)
sensitivity = metrics.recall_score(Y_val, y_pred_val)
roc_auc = metrics.roc_auc_score(Y_val, logreg_model.predict_proba(X_val_scaled)[:, 1])

# Print results
print('Logistic Regression : ')
print('Accuracy:', accuracy)
print('Sensitivity:', sensitivity)
print('AUC:', roc_auc)
