Minor Project
DISSERTATION
of
Bachelor of Technology
in
Computer Science & Engineering
By:
Ananya Singh(08/CSE1/2020)
Arshleen Bhandari(12/CSE1/2020)
Harleen Kaur Aarora(43/CSE1/2020)
Himika Prabhat(52/CSE1/2020)
We hereby declare that all the work presented in the dissertation entitled “Autism Spectrum
Disorder screening in adults using Supervised Machine Learning Algorithms” in partial
fulfillment of the requirements for the award of the degree of Bachelor of Technology in
Computer Science & Engineering, Guru Tegh Bahadur Institute of Technology, affiliated to
Guru Gobind Singh Indraprastha University Delhi is an authentic record of our own work carried
out under the guidance of Dr. Aashish Bhardwaj.
Date: 08-12-2023
Ananya Singh(08/CSE1/2020)
Arshleen Bhandari(12/CSE1/2020)
Harleen Kaur Aarora(43/CSE1/2020)
Himika Prabhat(52/CSE1/2020)
CERTIFICATE
This is to certify that the dissertation entitled “Autism Spectrum Disorder screening in adults
using Supervised Machine Learning Algorithms”, which is submitted by Ms. Ananya Singh, Ms.
Arshleen Bhandari, Ms. Harleen Kaur Aarora and Ms. Himika Prabhat in partial fulfillment of the
requirements for the award of the degree of Bachelor of Technology in Computer Science &
Engineering, Guru Tegh Bahadur Institute of Technology, New Delhi, is an authentic record of
the candidates’ own work carried out under my guidance. The matter embodied in this
dissertation is original and has not been submitted for the award of any other degree.
Date: 08-12-2023
Aashish Bhardwaj
(Head of Department)
Computer Science & Engineering
ACKNOWLEDGEMENT
We express our sincere appreciation to our mentor, Dr. Aashish Bhardwaj, Head of the
Department of Computer Science and Engineering (CSE). His guidance and
support have been invaluable throughout our dissertation process, and we are grateful for his
leadership, which has played a crucial role in raising the quality of our work to its current
standard.
Without his expertise and encouragement, this achievement would not have been possible. He
has been a guiding light throughout the project, and we are extremely grateful.
Arshleen Bhandari
(12/CSE1/2020)
Himika Prabhat
(52/CSE1/2020)
ABSTRACT
This project delves into the intricate challenge of identifying Autism Spectrum Disorder (ASD)
in adults through the strategic application of Supervised Machine Learning Algorithms. ASD,
marked by difficulties in social interaction, communication, and repetitive behaviors,
underscores the imperative for early detection to facilitate timely intervention. Employing a
diverse ensemble of algorithms, including Decision Trees, Random Forest, Support Vector
Machines, K Nearest Neighbors, Logistic Regression, Linear Discriminant Analysis, and
Quadratic Discriminant Analysis, the project aims to craft an advanced ASD screening tool with
transformative potential for diagnostic processes.
The primary goal is the construction of a screening mechanism adept at leveraging the unique
strengths of each algorithm, ultimately yielding a more accurate and robust identification model.
To achieve this, we meticulously curated a dataset that incorporates information from both ASD
and non-ASD adult subjects. This dataset serves as the training and evaluation ground for the
machine learning algorithms, ensuring a diverse and representative sample that authentically
mirrors real-world scenarios.
The project places significant emphasis on algorithmic diversity, undertaking the exploration of
various modeling approaches to attain a nuanced understanding of the intricate features
associated with ASD. Each algorithm contributes a distinct perspective, collectively enriching
the screening process. Interpretability becomes a focal point, with the overarching objective of
unraveling complex relationships between features and ASD identification.
Integral to the project are ethical considerations that underscore privacy, fairness, and the
responsible deployment of machine learning technologies within healthcare contexts. Ensuring
the ethical use of ASD screening tools is paramount not only for building public trust but also for
safeguarding the well-being of individuals undergoing assessments.
In the synthesis of the unique capabilities inherent in diverse machine learning algorithms, this
project aspires to redefine ASD screening. The anticipated outcomes hold promise for
significantly enhancing the quality of life for individuals on the autism spectrum by enabling
more accurate and efficient early detection. This aligns seamlessly with the broader objectives of
precision medicine and exemplifies the transformative potential of advanced technologies in
positively impacting the identification of neurodevelopmental disorders. In essence, this project
represents a concerted effort to seamlessly blend cutting-edge machine learning methodologies
with a compassionate approach, thereby enhancing ASD diagnostic processes and nurturing a
future where early intervention becomes not only more accessible but also more effective in
improving outcomes for those affected by ASD.
TABLE OF CONTENTS
Title page
Declaration
Certificate
Acknowledgement
Abstract
List of Tables and Figures
1. Introduction
1.1 Introduction to Autism Spectrum Disorder (ASD)
1.2 Significance of Early Detection in ASD
1.3 Role of Supervised Machine Learning in Healthcare
1.4 Libraries used
1.4.1 NumPy
1.4.2 Pandas
1.4.3 Matplotlib
1.4.4 Seaborn
1.4.5 SciKit Learn
1.5 Overview of Machine Learning Algorithms
1.5.1 Random Forests
1.5.2 Support Vector Machines (SVM)
1.5.3 k-Nearest Neighbors (kNN)
1.5.4 Logistic Regression
1.5.5 Multi-Layer Perceptron (MLP)
1.6 Crafting an Advanced ASD Screening Tool
1.7 Dataset Compilation and Characteristics
1.8 Emphasis on Algorithmic Diversity
1.9 Interpretability in ASD Identification
1.10 Ethical Considerations in Machine Learning for Healthcare
1.11 Anticipated Outcomes and Future Implications
2. Requirements Analysis with SRS
2.1 System Overview
2.1.1 System Description
2.1.2 System Features
2.2 Functional Requirements
2.2.1 Data Preprocessing
2.2.2 Algorithm Implementation
2.2.3 Model Evaluation
2.3 Non-Functional Requirements
2.3.1 Performance
2.3.2 Usability
2.3.3 Security
2.4 Constraints
2.4.1 Dataset Limitations
2.4.2 Availability of Algorithm
2.5 Appendix
2.5.1 Some References
2.5.2 Glossary
3. System Design
4. Test Plan
5. Body of Thesis
6. Results and Observations
7. Summary and Conclusions
8. Future Scope
References
Appendix A
Appendix B
Chapter One
INTRODUCTION
1.1 Introduction to Autism Spectrum Disorder (ASD)
Autism Spectrum Disorder (ASD) is the name for a group of developmental disorders impacting
the nervous system. ASD symptoms range from mild to severe and mainly include language impairment,
challenges in social interaction, and repetitive behaviors. Other conditions that may accompany ASD include
anxiety, mood disorders and Attention-Deficit/Hyperactivity Disorder (ADHD).
The core features of ASD include difficulties in social interaction, marked by challenges in
understanding and responding to social cues. Individuals with ASD may struggle with
establishing and maintaining relationships, interpreting nonverbal communication, and grasping
the nuances of social reciprocity. Additionally, communication difficulties are a hallmark of
ASD, with variations ranging from delayed language development to a complete absence of
spoken language. The manifestation of repetitive behaviors, such as stereotypical movements or
adherence to strict routines, further contributes to the intricate tapestry of ASD symptoms.
Recognizing the prevalence of ASD in adults has become increasingly crucial in recent years.
The traditional perception of autism as a childhood disorder has given way to a broader
understanding that acknowledges the persistence of symptoms into adulthood. The transition
from childhood to adulthood introduces new challenges in identifying and diagnosing ASD in a
population that may have developed coping mechanisms or masked their symptoms over time.
This shift in perspective underscores the critical need for effective screening tools and diagnostic
criteria tailored to the unique characteristics of adults with ASD.
Despite advancements in our comprehension of ASD, the diagnostic journey for adults
remains a complex and nuanced process. Unlike childhood diagnosis, which often involves
observations of early developmental markers, adult diagnosis relies on retrospective assessments
and an analysis of lifelong patterns of behavior. The subtleties of adult presentation necessitate a
comprehensive evaluation that considers not only the core symptoms of ASD but also the
individual's adaptive functioning and quality of life. Clinicians face the challenge of
distinguishing ASD from other mental health conditions that may share overlapping features,
further emphasizing the need for a nuanced approach to diagnosis.
The importance of timely intervention for adults with ASD cannot be overstated. Early
identification and intervention have been shown to improve outcomes and enhance the
individual's overall quality of life. However, the challenges of diagnosing ASD in adults may
lead to delayed access to appropriate support and services. This delay can impact various aspects
of an individual's life, including education, employment, and social relationships.
1.2 Significance of Early Detection in ASD
The significance of early detection in Autism Spectrum Disorder (ASD) cannot be overstated,
particularly when considering its profound implications for individuals, society, and healthcare
systems. Early identification of ASD in adults serves as a pivotal gateway to timely and targeted
interventions, laying the foundation for improved outcomes across various domains of life.
At an individual level, early detection acts as a catalyst for enhancing the quality of life for those
with ASD. The developmental trajectory of individuals with ASD is markedly influenced by
early intervention, impacting key areas such as social integration, communication skills, and
adaptive behaviors. The malleability of the human brain during early developmental stages
makes this period particularly receptive to therapeutic interventions. Therefore, identifying and
addressing ASD in its early stages can lead to more effective interventions, potentially mitigating
the impact of core symptoms and promoting the development of essential life skills.
Social integration stands out as a critical domain affected by early detection and intervention.
Individuals with ASD often encounter challenges in forming and maintaining social
relationships. Early identification allows for the implementation of targeted social skills training
and support, fostering improved social interactions and relationships. As a result, the individual's
ability to navigate social situations, understand social cues, and engage in reciprocal
communication can be significantly enhanced, positively influencing their overall well-being.
Communication skills, another core aspect of ASD, benefit immensely from early intervention.
Speech and language difficulties, ranging from delayed language acquisition to challenges in
pragmatic communication, are common among individuals with ASD. Early identification
enables the initiation of speech and language therapy tailored to the individual's specific needs,
promoting the development of effective communication strategies. This, in turn, has cascading
effects on various aspects of life, including academic achievement, vocational success, and
independent living.
Adaptive behaviors, encompassing daily living skills and functional independence, also
experience positive outcomes with early detection. Interventions targeting adaptive behaviors,
such as self-care, organization, and time management, contribute to the individual's ability to
lead a more independent and fulfilling life. Early identification allows for the implementation of
personalized strategies and support systems, empowering individuals with ASD to navigate the
challenges of daily living more effectively.
Beyond the individual realm, the societal and economic significance of early ASD detection is
substantial. A society that actively promotes early identification and intervention contributes to a
more inclusive environment. By recognizing and accommodating the diverse needs of
individuals with ASD, societal structures become more accessible, allowing for greater
participation and contribution from this population. This inclusivity has ripple effects, fostering a
society that values neurodiversity and promotes the well-being of all its members.
From an economic perspective, early detection holds the potential to reduce the long-term
burden on healthcare systems. Timely interventions that address the core symptoms of ASD may
decrease the need for extensive and costly support services in later stages of life. Additionally,
individuals who receive early intervention are more likely to develop skills that enhance their
independence, potentially reducing the demand for long-term care and support services.
1.3 Role of Supervised Machine Learning in Healthcare
The role of supervised machine learning in healthcare represents a groundbreaking frontier that
has the potential to revolutionize various facets of medical practice. The integration of machine
learning technologies into healthcare systems signifies a paradigm shift, ushering in
unprecedented opportunities for enhanced diagnosis, personalized treatment strategies, and
improved patient care. In the realm of neurodevelopmental disorders, such as Autism Spectrum
Disorder (ASD), supervised machine learning stands out as a particularly promising avenue,
offering the potential to significantly advance diagnostic accuracy and efficiency.
In the context of ASD, a multifaceted neurodevelopmental condition characterized by
challenges in social interaction, communication difficulties, and repetitive behaviors, the
traditional diagnostic process has often been intricate and time-consuming. The reliance on
clinical observations, behavioral assessments, and subjective evaluations has led to variations in
diagnostic outcomes and, at times, delayed identification. Here, supervised machine learning
brings forth a transformative approach by harnessing the power of computational algorithms to
analyze and interpret complex patterns within vast datasets.
One of the primary advantages of supervised machine learning in the context of ASD lies in
its ability to learn from existing datasets. These datasets may include a diverse range of
information, such as clinical assessments, neuroimaging data, genetic profiles, and behavioral
observations. Through a process of supervised training, machine learning algorithms can discern
intricate patterns and relationships within these datasets, ultimately creating models that can
generalize and apply their learning to new, unseen data. This capacity to recognize subtle
patterns is particularly relevant in the case of ASD, where the disorder's spectrum nature and
variability in symptom presentation pose challenges for conventional diagnostic approaches.
The application of supervised machine learning in ASD diagnosis involves training
algorithms on labeled datasets, where each instance is associated with a known outcome or
diagnosis. These algorithms learn to identify patterns indicative of ASD based on the features
present in the training data. Once the training phase is complete, the machine learning model can
be applied to new, unseen data to predict whether an individual may have ASD, providing a
valuable tool for clinicians in the diagnostic process.
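The train-then-predict workflow described above can be illustrated with a deliberately small, library-free sketch. The nearest-centroid rule, the function names `fit_centroids` and `predict`, and the toy feature values are assumptions made purely for illustration; they are not the models or the data used in this project.

```python
from math import dist


def fit_centroids(samples, labels):
    """Learn one mean vector (centroid) per class from labeled training data."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Assign an unseen sample to the class whose centroid is closest."""
    return min(centroids, key=lambda y: dist(centroids[y], x))


# Hypothetical training set: two screening-style scores per person,
# labeled 1 (ASD) or 0 (non-ASD).
train_x = [[8, 7], [9, 6], [7, 8], [2, 3], [1, 2], [3, 1]]
train_y = [1, 1, 1, 0, 0, 0]

# Training phase: the model is just the two class centroids.
model = fit_centroids(train_x, train_y)

# Prediction phase: apply the trained model to new, unseen samples.
unseen = [[8, 8], [2, 2]]
predictions = [predict(model, x) for x in unseen]
print(predictions)
```

The same two-phase pattern (fit on labeled data, then predict on unseen data) applies to every algorithm discussed in this chapter; only the form of the learned model changes.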
The potential benefits of incorporating supervised machine learning into ASD diagnosis are
manifold. Firstly, the accuracy of diagnoses may be significantly improved, as machine learning
models can analyze a wide array of data points simultaneously and identify subtle patterns that
may not be immediately apparent to human observers. This enhanced accuracy has the potential
to facilitate earlier and more precise identification of ASD, leading to timely interventions and
improved outcomes for individuals.
Moreover, the efficiency of the diagnostic process stands to benefit from the role of
supervised machine learning. The rapid analysis of diverse datasets allows for a streamlined and
objective assessment, reducing the time and resources traditionally required for a comprehensive
diagnosis. This efficiency is particularly critical in the context of neurodevelopmental disorders,
where early intervention has been shown to have a substantial impact on long-term outcomes.
However, it is essential to acknowledge the challenges and considerations associated with the
integration of machine learning in healthcare, including ethical concerns, data privacy, and the
interpretability of complex algorithms. Ensuring that machine learning models are transparent,
interpretable, and ethically sound is crucial for building trust among healthcare professionals and
the broader public.
1.4 Libraries used
1.4.1 NumPy
NumPy (Numerical Python) is an open source Python library that’s used in almost every field of
science and engineering. It’s the universal standard for working with numerical data in Python,
and it’s at the core of the scientific Python and PyData ecosystems. NumPy users include
everyone from beginning coders to experienced researchers doing state-of-the-art scientific and
industrial research and development. The NumPy API is used extensively in Pandas, SciPy,
Matplotlib, scikit-learn, scikit-image and most other data science and scientific Python packages.
1.4.2 Pandas
Pandas is mainly used for data analysis and associated manipulation of tabular data in
DataFrames. Pandas allows importing data from various file formats such as comma-separated
values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel.
1.4.3 Matplotlib
Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots into
applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK.
1.4.4 Seaborn
Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and
integrates closely with pandas data structures. Seaborn helps you explore and understand your
data.
1.4.5 SciKit Learn
Scikit-learn empowers users to efficiently implement and experiment with machine learning models, making
it an indispensable resource for both beginners and seasoned practitioners in the field of data
science and artificial intelligence.
1.5.2 Support Vector Machines (SVM)
The best way to understand the SVM algorithm is by focusing on its primary type, the SVM
classifier. The idea behind the SVM classifier is to find a hyperplane in an
N-dimensional space that separates the data points belonging to different classes. Among all
separating hyperplanes, the one providing the maximum margin between the two classes is
chosen. The margin is determined by data points known as
Support Vectors: the data points that lie closest to the hyperplane and help
in orienting it.
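For a linear SVM, the hyperplane can be written as w·x + b = 0, and the margin between the two classes has width 2/||w||. The sketch below assumes a weight vector and bias that have already been found; the values of `w`, `b` and the three sample points are hypothetical and chosen by hand rather than learned from data.

```python
from math import sqrt


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_width(w):
    """Distance between the margin boundaries w.x + b = +1 and w.x + b = -1."""
    return 2.0 / sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane x1 + x2 - 3 = 0 in two dimensions.
w, b = [1.0, 1.0], -3.0

points = {"A": [1.0, 1.0], "B": [2.0, 2.0], "C": [3.0, 3.0]}
scores = {name: decision_value(w, b, x) for name, x in points.items()}
classes = {name: 1 if s >= 0 else -1 for name, s in scores.items()}

# Points whose score is exactly +1 or -1 lie on the margin boundaries:
# these play the role of support vectors.
support_vectors = [name for name, s in scores.items() if abs(abs(s) - 1.0) < 1e-9]

print(scores, classes, support_vectors, margin_width(w))
```

Here points A and B sit on opposite margin boundaries, while C lies further inside its class and does not influence where the hyperplane is placed.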
Fig 1.4 k-Nearest Neighbors algorithm
1.5.4 Logistic Regression
Logistic regression is a process of modeling the probability of a discrete outcome given an input
variable. The most common logistic regression models a binary outcome; something that can
take two values such as true/false, yes/no, and so on. Multinomial logistic regression can model
scenarios where there are more than two possible discrete outcomes. Logistic regression is a
useful analysis method for classification problems, where you are trying to determine if a new
sample fits best into a category. Because ASD screening is itself a binary classification problem
(ASD or non-ASD), logistic regression is a natural analytic technique for it.
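To make the idea of "modeling the probability of a discrete outcome" concrete, the following minimal sketch passes a weighted sum of the inputs through the logistic (sigmoid) function. The coefficients and feature values are hypothetical and are not fitted to any dataset; `predict_proba` is an illustrative helper name.

```python
from math import exp


def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def predict_proba(weights, bias, x):
    """Probability of the positive class under a logistic regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)


# Hypothetical coefficients for two input features (not fitted to real data).
weights, bias = [0.8, 0.5], -6.0

p_high = predict_proba(weights, bias, [8, 7])  # linear score is about +3.9
p_low = predict_proba(weights, bias, [2, 1])   # linear score is about -3.9

# A 0.5 threshold turns each probability into a yes/no decision.
labels = [int(p >= 0.5) for p in (p_high, p_low)]
print(round(p_high, 3), round(p_low, 3), labels)
```

Because the output is a probability rather than a bare label, the decision threshold can be moved away from 0.5 when missing a case is considered more costly than raising a false alarm.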
1.5.5 Multi-Layer Perceptron (MLP)
A multilayer perceptron (MLP) is a class of feedforward artificial neural network. An MLP
consists of at least three layers of nodes. Except for the input nodes, each node is a neuron that
uses a nonlinear activation function. MLP utilizes a supervised learning technique called
backpropagation for training. Its multiple layers and non-linear activation distinguish MLP from
a linear perceptron. It can distinguish data that is not linearly separable. Multilayer perceptrons
are sometimes colloquially referred to as ‘vanilla’ neural networks, especially when they have a
single hidden layer.
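The claim that an MLP can separate data that is not linearly separable can be checked on the classic XOR problem. The sketch below is a forward pass only: the 2-2-1 architecture and its weights are hand-picked assumptions for illustration, not values learned by backpropagation.

```python
def relu(z):
    """Rectified linear unit, a common nonlinear activation function."""
    return max(0.0, z)


def mlp_forward(x1, x2):
    """Forward pass through a 2-2-1 network with hand-picked weights."""
    # Hidden layer: two neurons, each applying ReLU to a weighted sum.
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)
    # Output layer: a weighted sum of the hidden activations.
    return 1.0 * h1 - 2.0 * h2


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [mlp_forward(a, b) for a, b in inputs]
print(outputs)
```

The output is 1 only when exactly one input is 1. No single straight line through the input plane produces that pattern; the nonlinear hidden layer is what makes it possible.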
XGBoost, short for eXtreme Gradient Boosting, is a powerful and versatile machine learning
algorithm renowned for its efficiency and performance in predictive modeling. As a supervised
learning method, XGBoost belongs to the ensemble learning category, combining the strengths
of multiple weak learners to create a robust and accurate classifier. Widely utilized in various
domains, from finance to healthcare, XGBoost excels in handling diverse data types, providing
exceptional predictive accuracy, and effectively managing complex relationships within datasets.
With its innovative gradient boosting framework and regularization techniques, XGBoost stands
as a go-to choice for tackling classification challenges and pushing the boundaries of predictive
analytics.
1.6 Crafting an Advanced ASD Screening Tool
At the core of this ambitious project lies the fundamental goal of crafting an advanced Autism
Spectrum Disorder (ASD) screening tool that transcends the capabilities of individual
algorithms. The primary objective is to orchestrate a symphony of diverse machine learning
algorithms, strategically weaving together their distinctive strengths to form a screening
mechanism that surpasses the limitations inherent in any single approach. This endeavor seeks to
redefine the landscape of ASD diagnosis, with a focus on the unique challenges presented in the
context of adult ASD.
The significance of this undertaking is underscored by the recognition that the complexity of
ASD demands a multifaceted and adaptive approach. Individual algorithms, while powerful in
their own right, may exhibit limitations in capturing the intricate nuances and variability within
ASD data. Thus, the strategic amalgamation of a variety of algorithms becomes imperative, as it
holds the promise of synergistically enhancing the accuracy, sensitivity, and specificity of the
screening tool.
In the pursuit of constructing this groundbreaking screening mechanism, each algorithm is
chosen for its specific attributes and strengths. Decision Trees, known for their ability to unravel
complex decision-making processes, contribute by capturing patterns within ASD data that might
be challenging for human observers to discern. The ensemble then incorporates Random Forest,
which aggregates multiple Decision Trees, mitigating overfitting and enhancing the model's
generalization capacity. Support Vector Machines (SVM) add a layer of sophistication, excelling
in classifying data with intricate boundaries, a characteristic particularly relevant in the diverse
symptomatology of ASD.
The inclusion of K Nearest Neighbors (KNN) introduces a localized perspective, recognizing
the potential clustering of ASD symptoms within specific groups. Gaussian Naive Bayes, with its
probabilistic nature, accommodates the high-dimensional nature of ASD data, providing a robust
framework for classification. Logistic Regression, a classic yet powerful algorithm, contributes
its simplicity and interpretability to the ensemble.
The statistical rigor brought by Linear Discriminant Analysis (LDA) and Quadratic
Discriminant Analysis (QDA) enhances the tool's understanding of the underlying data
distributions associated with ASD. This diverse ensemble, collectively guided by each
algorithm's unique strengths, forms a screening mechanism that transcends the limitations of any
single methodology.
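The text does not state how the outputs of the individual algorithms are combined. One simple possibility, assumed here purely for illustration, is hard majority voting, in which each model casts one vote per person. The per-model predictions below are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the base models' predictions.

    Ties are broken in favour of the label that appears first in the list.
    """
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical per-model outputs for three people
# (1 = ASD traits flagged, 0 = not flagged).
model_outputs = {
    "decision_tree": [1, 0, 1],
    "random_forest": [1, 0, 0],
    "svm": [1, 1, 0],
    "knn": [0, 0, 1],
    "logistic_regression": [1, 0, 0],
}

# Transpose so that each tuple holds all models' votes for one person.
votes_per_person = list(zip(*model_outputs.values()))
ensemble = [majority_vote(votes) for votes in votes_per_person]
print(ensemble)
```

Using an odd number of voters avoids ties in a binary task; other combination schemes, such as averaging predicted probabilities, are equally plausible readings of the description above.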
The project's commitment to overcoming the challenges associated with adult ASD diagnosis
is evident in its emphasis on innovation and adaptability. Adult ASD presents a distinct set of
complexities compared to childhood ASD, with individuals potentially developing coping
mechanisms or masking symptoms over time. The screening tool aims to address these nuances
by integrating algorithms that can discern patterns within retrospective data, offering a
comprehensive evaluation of an individual's lifelong behavioral patterns.
As this screening tool takes shape, rigorous training and validation processes are integral
components of its development. The iterative refinement of algorithmic parameters ensures that
the ensemble performs optimally, achieving a delicate balance between sensitivity and
specificity. The overarching goal is not merely the creation of a diagnostic tool but the
establishment of a pioneering solution that revolutionizes the diagnostic landscape for adult
ASD.
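Sensitivity is the fraction of true ASD cases that the tool flags, and specificity is the fraction of non-ASD cases that it correctly clears. A minimal sketch of both metrics follows; the label vectors are hypothetical and do not represent results from this project.

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute sensitivity (true positive rate) and specificity (true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical ground truth and predictions (1 = ASD, 0 = non-ASD).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(sensitivity, round(specificity, 3))
```

In this toy example three of four positive cases are caught and four of six negative cases are cleared. Tuning a model to raise one of the two numbers usually lowers the other, which is the balance referred to above.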
The potential impact of this advanced ASD screening tool extends beyond the individual level,
resonating in healthcare systems, research, and societal understanding. By providing a more
accurate and efficient means of identifying adult ASD, the tool has the potential to reduce the
diagnostic journey's intricacies, leading to earlier interventions and improved outcomes.
Additionally, the insights gained from the screening process contribute to the growing body of
knowledge surrounding adult ASD, fostering a more nuanced understanding of this complex
neurodevelopmental condition.
1.7 Dataset Compilation and Characteristics
Attribute | Type | Description
Born with jaundice | Boolean (yes or no) | Whether the case was born with jaundice.
Family member with PDD | Boolean (yes or no) | Whether any immediate family member has a PDD.
Used the screening app before | Boolean (yes or no) | Whether the user has used a screening app.
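Boolean (yes or no) attributes such as those listed above must be converted to numbers before most learning algorithms can use them. A minimal sketch of that preprocessing step follows; the dictionary keys and the helper `encode_yes_no` are hypothetical names, not the column names of the actual dataset.

```python
def encode_yes_no(value):
    """Map a yes/no answer to 1/0, tolerating case and surrounding spaces."""
    normalized = value.strip().lower()
    if normalized not in ("yes", "no"):
        raise ValueError(f"expected 'yes' or 'no', got {value!r}")
    return 1 if normalized == "yes" else 0


# Hypothetical record; the keys are illustrative names for the attributes above.
record = {
    "born_with_jaundice": "no",
    "family_member_with_pdd": "Yes",
    "used_app_before": " no ",
}

encoded = {name: encode_yes_no(answer) for name, answer in record.items()}
print(encoded)
```

Rejecting unexpected values instead of silently mapping them to 0 keeps data-entry errors from being mistaken for a "no" answer.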
At the heart of every machine learning endeavor, the foundation upon which the entire system
rests is the quality and representativeness of the dataset. In the context of this ambitious project,
the dataset compilation is a meticulous process, carefully crafted to be the linchpin that shapes
the efficacy and reliability of the developed ASD screening tool. The dataset is not merely a
collection of data points but a comprehensive assembly, drawing upon a diverse pool of
information from both individuals with Autism Spectrum Disorder (ASD) and those without,
specifically focusing on adult subjects. This strategic compilation ensures that the training and
evaluation processes of the machine learning algorithms unfold on a rich canvas, reflecting the
complexity and diversity inherent in real-world scenarios.
The assembly of this dataset is characterized by a rigorous curation process, guided by a
commitment to inclusivity and authenticity. Information is collated from a spectrum of sources,
capturing the nuances of adult ASD from various perspectives. The inclusion of both ASD and
non-ASD subjects establishes a balanced and representative foundation, mimicking the
intricacies encountered in clinical settings. This deliberate approach is crucial for the machine
learning algorithms to develop a nuanced understanding of the patterns and features associated
with ASD, enhancing their ability to discriminate between individuals with and without the
disorder.
A key consideration in dataset compilation is the incorporation of comprehensive
demographic details. These details encompass a broad spectrum, including age, gender,
socioeconomic status, and educational background. Such demographic factors play a pivotal role
in understanding the multifaceted nature of ASD, considering its potential influence on the
manifestation and presentation of symptoms. The inclusion of this demographic diversity ensures
that the developed screening tool is sensitive to variations across different subgroups,
contributing to its robustness in real-world applications.
Behavioral traits, another integral component of the dataset, offer a granular view of the
individuals under consideration. The diverse and complex nature of ASD symptoms necessitates
a detailed examination of behavioral patterns, encompassing social interactions, communication
skills, and repetitive behaviors. By incorporating this multifaceted information, the dataset
captures the heterogeneity within the ASD population, allowing the machine learning algorithms
to discern subtle variations and tailor their predictions accordingly.
Relevant clinical indicators further enrich the dataset, providing a holistic perspective on the
individuals involved. These indicators may include diagnostic assessments, medical history, and
other clinical observations. By integrating such clinically significant information, the dataset
aligns closely with the diagnostic reality faced by healthcare professionals. This alignment is
crucial for the screening tool to be not only accurate but also clinically relevant, enhancing its
utility in real-world healthcare settings.
The meticulous nature of dataset compilation is underscored by a commitment to ethical
considerations and privacy protection. Anonymization and de-identification protocols are
rigorously applied to safeguard the privacy and confidentiality of the individuals contributing to
the dataset. This ethical approach is paramount in maintaining the integrity of the project and
ensuring that the dataset is handled responsibly in accordance with privacy regulations and
standards.
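One common de-identification step is to drop direct identifiers and replace the participant ID with a keyed hash. The sketch below is an assumption about what such a step could look like, not a description of the protocol actually applied; the field names, the key and the sample record are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(identifier, secret_key):
    """Replace an identifier with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record, secret_key, direct_identifiers=("name", "email")):
    """Drop direct identifiers and pseudonymize the participant ID."""
    cleaned = {k: v for k, v in record.items() if k not in direct_identifiers}
    cleaned["participant_id"] = pseudonymize(str(record["participant_id"]), secret_key)
    return cleaned


# Hypothetical key and record, for illustration only.
key = b"replace-with-a-secret-key"
raw = {
    "participant_id": "P-0017",
    "name": "Jane Doe",
    "email": "jane@example.org",
    "age": 29,
    "born_with_jaundice": "no",
}

cleaned = deidentify(raw, key)
print(sorted(cleaned))
```

A keyed hash keeps records for the same participant linkable while preventing anyone without the key from recomputing the mapping. Remaining attributes such as age can still act as quasi-identifiers and may need separate treatment.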
As the dataset takes shape, it becomes more than a collection of variables and values; it
becomes a dynamic representation of the diverse landscape of adult ASD. The richness of the
dataset is not just in its size but in its ability to encapsulate the complexity, variability, and
individuality of each participant. This depth ensures that the machine learning algorithms, when
exposed to this diverse array of information, are equipped to generalize and adapt to new, unseen
data, a crucial aspect for the screening tool's success in real-world applications.
1.8 Emphasis on Algorithmic Diversity
In the pursuit of constructing an advanced Autism Spectrum Disorder (ASD) screening tool, this
project places a deliberate and strategic emphasis on algorithmic diversity, recognizing the
nuanced and intricate nature of ASD. The research team, comprising multidisciplinary experts,
acknowledges that ASD manifests as a complex and heterogeneous spectrum of
neurodevelopmental conditions. In response to this complexity, the project adopts a
forward-thinking approach by exploring a diverse array of modeling approaches, seeking a
comprehensive and nuanced understanding of the myriad features associated with ASD.
The decision to employ a diverse ensemble of machine learning algorithms is rooted in the
fundamental acknowledgment that no single algorithm can encapsulate the multifaceted
dimensions of ASD. Each algorithm, whether Decision Trees, Random Forest, Support Vector
Machines, K Nearest Neighbors, or Logistic Regression, brings a unique set of strengths and
perspectives to the table. This diversity of approaches is not arbitrary but a strategic choice to
enhance the screening process, ensuring a well-rounded and adaptable tool that can effectively
navigate the intricacies of ASD diagnosis.
The utilization of Decision Trees is emblematic of the project's commitment to capturing
complex decision-making processes inherent in ASD. Decision Trees dissect intricate patterns
within the dataset, offering insights into the relationships between various features and their
influence on ASD identification. This methodological choice is particularly valuable in handling
the diversity of symptoms and presentations within the ASD population.
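A decision tree is built by repeatedly choosing the feature threshold that best separates the classes. The sketch below shows that single step for one numeric feature, using Gini impurity as the split criterion. The criterion, the helper names and the toy scores are assumptions for illustration; the text does not state which criterion was used.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_threshold(values, labels):
    """Find the split 'value <= threshold' with the lowest weighted Gini impurity."""
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Hypothetical questionnaire totals and labels (1 = ASD, 0 = non-ASD).
scores = [2, 3, 4, 5, 7, 8, 9, 10]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, impurity = best_threshold(scores, labels)
print(threshold, impurity)
```

A full tree applies the same search recursively to each side of the chosen split, and the resulting chain of threshold tests is what makes a tree's decisions easy to read.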
The inclusion of Random Forest as part of the ensemble underscores the commitment to
mitigating the risk of overfitting and enhancing generalization. By aggregating the outputs of
multiple Decision Trees, Random Forest brings a robustness to the screening tool, making it
adaptable to the heterogeneity observed in ASD data. This ensemble approach acknowledges the
variability in symptom presentation and ensures that the screening tool remains accurate across a
broad spectrum of ASD manifestations.
Support Vector Machines (SVM) contribute their unique perspective by excelling in
classifying data with intricate boundaries. Given the complex and variable nature of ASD
symptoms, SVM enhances the screening tool's capacity to discern subtle patterns and boundaries
within the dataset. This methodological choice reflects a proactive approach to addressing the
challenges posed by the diversity of ASD symptoms and manifestations.
K Nearest Neighbors (KNN), operating on the principle of similarity, offers a localized
perspective to the ensemble. This algorithm excels in identifying patterns that may exist in
localized clusters within the dataset. In the context of ASD, where symptom clusters may emerge
in specific subgroups, KNN ensures that the screening tool remains sensitive to localized
variations, fostering a more nuanced understanding of ASD in diverse populations.
Logistic Regression, a classic yet powerful algorithm, adds interpretability and simplicity to
the ensemble. Its suitability for binary classification tasks aligns well with the nature of ASD
identification, providing a foundational element in the amalgamation of diverse algorithms. The
transparency and interpretability of Logistic Regression complement the complexity introduced
by other algorithms, ensuring a cohesive and understandable ensemble.
The deliberate emphasis on algorithmic diversity in this project goes beyond a mere technical
choice; it is a strategic response to the inherent challenges posed by the intricate nature of ASD.
By embracing a variety of modeling approaches, the research team aims to cast a wide net,
capturing the diverse manifestations and presentations of ASD within the dataset. This holistic
and inclusive approach enhances the screening tool's adaptability to the rich and varied landscape
of ASD, ensuring that it remains effective across a spectrum of scenarios.
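A comparison of this kind could be set up as in the following minimal sketch. It is illustrative
only: the synthetic data from `make_classification`, the hyperparameter values, and the helper
name `compare_models` are assumptions, not the project's actual dataset or configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, cv=5):
    """Return the mean cross-validated accuracy of each candidate classifier."""
    models = {
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
        # Distance- and margin-based models are scaled first (an assumed design choice).
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
        "Logistic Regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }


# Synthetic stand-in for the screening data (assumption, not the real dataset).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
scores = compare_models(X, y)
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {acc:.3f}")
```

Evaluating every model with the same cross-validation folds keeps the comparison fair; the
ranking on synthetic data says nothing about the ranking on real screening data.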
In the intricate landscape of Autism Spectrum Disorder (ASD) identification, the spotlight in this
project falls distinctly on interpretability. Recognizing the inherent complexity of ASD, the
research team places a significant emphasis on unraveling the intricate relationships between
features and the identification of ASD. This endeavor is not merely a scientific pursuit; it is a
pragmatic approach that acknowledges the practical implications for clinicians, stakeholders, and
individuals navigating the ASD diagnostic process. By prioritizing interpretability, the project
strives not only to contribute to the scientific understanding of ASD but also to enhance the
transparency of the screening tool, making it more accessible and comprehensible for healthcare
professionals and individuals undergoing assessments.
Interpretability, in the context of machine learning algorithms applied to ASD identification,
refers to the ability to understand and explain the decisions made by the model. The complex
interplay of features contributing to ASD can be challenging to decipher, making interpretability
a crucial aspect of ensuring that the screening tool is not only accurate but also understandable
for those relying on its outcomes.
In the pursuit of interpretability, the project acknowledges the multifaceted nature of ASD
symptoms and the need for transparency in the decision-making process of the screening tool.
Unraveling the complex relationships between various features and ASD identification becomes
a scientific imperative, providing valuable insights into the patterns and markers that contribute
to the algorithm's predictions. This scientific endeavor extends beyond the confines of
algorithmic intricacies; it aspires to enhance our broader understanding of the factors influencing
ASD identification.
Fig 1.8 Learning curve
Practically, interpretability holds immense value for healthcare professionals who utilize the
screening tool in clinical settings. Transparent models empower clinicians to comprehend and
trust the decisions made by the algorithm, fostering a collaborative and informed approach to
diagnosis. Interpretability becomes a bridge between advanced computational methodologies and
the clinical expertise of healthcare professionals, ensuring that the screening tool aligns
seamlessly with the nuanced realities of ASD assessments.
Moreover, interpretability addresses a critical aspect of ethical considerations in healthcare.
Transparent models provide individuals undergoing assessments with a clearer understanding of
the factors influencing their diagnosis. This transparency fosters trust and confidence in the
diagnostic process, empowering individuals to actively engage in discussions about their
healthcare journey. The project's commitment to interpretability aligns with the broader
movement toward patient-centered care, where individuals are not passive recipients of
diagnoses but active participants in their healthcare decisions.
The emphasis on interpretability also extends to the broader stakeholder community, including
educators, policymakers, and researchers. Transparent models enable stakeholders to
comprehend the factors influencing ASD identification, facilitating informed decision-making
and policy development. The project's commitment to interpretability reflects an awareness of
the collaborative nature of addressing complex societal challenges such as ASD, where diverse
stakeholders play pivotal roles in shaping interventions, support systems, and public policies.
The implementation of interpretable machine learning models, such as Logistic Regression
and Decision Trees, forms a strategic component of the project's approach. These models offer
not only accurate predictions but also a clear and understandable rationale for their decisions.
Logistic Regression, with its simplicity and interpretability, provides a foundational element in
the ensemble of algorithms. Decision Trees, by nature hierarchical and structured, unravel
complex decision-making processes in a visually intuitive manner, offering insights into the
features that contribute to ASD identification.
The project recognizes that the pursuit of interpretability does not entail a compromise on
accuracy or sophistication. Instead, it is an integrative approach that harmonizes the advanced
capabilities of machine learning with the necessity for transparency in the diagnostic process.
The interpretability of the screening tool is a feature, not a limitation, enriching the overall utility
and acceptance of the tool in real-world healthcare scenarios.
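To make the idea of an interpretable model concrete, the sketch below prints the coefficients of
a Logistic Regression model and the learned rules of a shallow Decision Tree. The data are
synthetic and the feature names (`A1_Score` to `A5_Score`) are hypothetical placeholders, so
the output illustrates the mechanism only, not any finding of this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical feature names and synthetic data (assumptions for illustration).
feature_names = [f"A{i}_Score" for i in range(1, 6)]
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Logistic Regression: each coefficient shows the direction and strength of a feature.
log_reg = LogisticRegression(max_iter=1000).fit(X, y)
coefficients = dict(zip(feature_names, log_reg.coef_[0]))
for name, weight in sorted(coefficients.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {weight:+.3f}")

# Decision Tree: the fitted tree can be printed as nested if/else rules.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Limiting the tree depth trades some accuracy for rules short enough to read at a glance.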
1.10 Ethical Considerations in Machine Learning for Healthcare
The seamless integration of machine learning into healthcare is an epochal advancement, but it
mandates an unwavering commitment to ethical considerations. Within this transformative
landscape, the current project stands as a beacon, emphasizing the fundamental principles of
privacy, fairness, and the responsible deployment of machine learning technologies within
healthcare contexts. Ethical considerations are not merely an ancillary concern; they are the
bedrock upon which the integrity and success of this project rest. The conscientious approach to
ethics is not only a moral imperative but is also instrumental in building and sustaining public
trust, while simultaneously safeguarding the well-being and privacy of individuals undergoing
ASD assessments. Striking an equilibrium between technological innovation and ethical
responsibility emerges as an indispensable facet for the successful implementation of ASD
screening tools that genuinely contribute to healthcare advancements.
Privacy, as a cornerstone of ethical considerations, takes precedence in the integration of
machine learning for healthcare applications. The project recognizes the sensitivity of health
data, especially in the context of neurodevelopmental disorders like ASD. Rigorous
anonymization and de-identification protocols are meticulously applied to the dataset to shield
the identities of individuals contributing to the project. By adhering to robust privacy measures,
the research team not only upholds ethical standards but also addresses concerns related to data
security and confidentiality, fostering a sense of trust among individuals participating in the
assessment process.
Fairness, another pivotal ethical principle, is explicitly acknowledged in the project's
approach. The machine learning algorithms are meticulously trained and validated to ensure that
they do not perpetuate biases or discriminate against specific demographic groups. The project
team actively addresses issues related to algorithmic fairness, striving to mitigate the risk of
unintended consequences and disparities in ASD identification. The commitment to fairness
aligns with the broader societal goal of creating healthcare technologies that are equitable and
accessible to diverse populations.
Beyond privacy and fairness, the responsible deployment of machine learning technologies is
a guiding ethical tenet. This involves careful consideration of the potential impact on individuals'
lives and the broader healthcare ecosystem. The project team critically evaluates not only the
accuracy and efficiency of the ASD screening tool but also its real-world implications.
Responsible deployment encompasses considerations of the tool's usability in clinical settings,
the interpretability of its outcomes, and the potential consequences of its recommendations. This
holistic approach ensures that the integration of machine learning into healthcare goes beyond
technical proficiency, actively considering its ethical implications in the service of the
individuals it aims to assist.
The ethical framework established by the project is pivotal for building and maintaining public
trust. Trust is foundational in healthcare, and its erosion can have far-reaching consequences. By
prioritizing privacy, fairness, and responsible deployment, the project aims to instill confidence
in individuals undergoing ASD assessments, healthcare professionals utilizing the screening tool,
and the broader public. Transparency about the ethical considerations and measures taken further
strengthens the bond of trust between the project and its stakeholders.
Furthermore, the project team recognizes the dynamic nature of ethical considerations in the
rapidly evolving field of machine learning and healthcare. Continuous monitoring and adaptation
of ethical protocols are integral to staying abreast of emerging challenges and ensuring that the
project remains aligned with evolving ethical standards. The commitment to ongoing ethical
evaluation reflects a proactive stance, acknowledging the dynamic interplay between technology
and ethical responsibilities.
In essence, the ethical considerations within the project extend far beyond compliance; they
embody a proactive and comprehensive approach to safeguarding the rights, privacy, and
well-being of individuals involved. The fusion of technological innovation with ethical
responsibility reflects a commitment to the highest standards of integrity, ensuring that the
benefits of machine learning in healthcare are realized without compromising fundamental
ethical principles. As the project advances, the ethical considerations embedded within its
framework not only guide its trajectory but also contribute to shaping a responsible and
trustworthy paradigm for the integration of machine learning into neurodevelopmental disorder
assessments.
The anticipated outcomes of this visionary project extend far beyond the realms of technological
innovation, holding the promise of significantly enhancing the quality of life for individuals on
the autism spectrum. At the heart of these anticipated outcomes is the recognition that the early
detection of Autism Spectrum Disorder (ASD) is a pivotal gateway to improving long-term
outcomes for affected individuals. The screening tool, forged through the amalgamation of
diverse machine learning algorithms, is poised to revolutionize the landscape of ASD
identification by offering more accurate and efficient early detection capabilities.
The core objective of the screening tool aligns seamlessly with the broader aspirations of
precision medicine, heralding a future where healthcare interventions are tailored to the
individual characteristics and needs of each person. By harnessing the power of advanced
technologies, the project aims to usher in a new era in the identification of neurodevelopmental
disorders, particularly ASD. Precision medicine, characterized by personalized and targeted
approaches, becomes a tangible reality as the screening tool fine-tunes its predictions based on
the unique patterns and features exhibited by individuals on the autism spectrum.
The transformative potential of advanced technologies, particularly machine learning,
becomes evident through the lens of this project. The integration of cutting-edge methodologies
in the identification of ASD transcends traditional diagnostic paradigms, offering a more
nuanced and adaptive approach. The research team envisions a future where the diagnostic
journey for ASD is not only accurate but also characterized by efficiency, accessibility, and
compassion.
In essence, the project aspires to redefine ASD screening by seamlessly blending the prowess
of machine learning methodologies with a compassionate and person-centered approach. The
envisioned future is one where the screening process goes beyond a mere diagnostic tool,
becoming a conduit for early intervention that is not only more accessible but also more
effective. The compassion embedded in the approach acknowledges the unique challenges faced
by individuals on the autism spectrum and strives to create a screening tool that is not only
technologically advanced but also considerate of the diverse ways in which ASD may manifest.
The ultimate vision of the project is to foster a future where early intervention becomes a
transformative force in the lives of individuals affected by ASD. Early detection, facilitated by
the screening tool, opens avenues for timely and targeted interventions that can mitigate
challenges associated with ASD. These interventions may include specialized therapies,
educational support, and tailored strategies to enhance social integration and communication
skills.
Furthermore, the impact of the anticipated outcomes extends to the societal and economic
dimensions. By streamlining the identification process and facilitating early interventions, the
project holds the potential to reduce the long-term burden on healthcare systems. The economic
implications are profound, as early interventions can lead to improved outcomes, potentially
reducing the need for long-term care and support services.
As the project progresses, the insights gained from the screening tool contribute to the
evolving body of knowledge surrounding ASD. The data generated becomes a valuable resource
for researchers and clinicians, fostering a deeper understanding of the complex nature of
neurodevelopmental disorders. This, in turn, paves the way for future advancements in the field,
shaping new diagnostic and therapeutic strategies for ASD and potentially informing research in
other related domains.
Chapter Two
REQUIREMENTS ANALYSIS WITH SRS
The ASD screening system will process the provided dataset, perform data preprocessing,
implement six supervised machine learning algorithms, and evaluate their performance. The
algorithms include Decision Trees, Random Forest, Support Vector Machines (SVM), K-Nearest
Neighbors (KNeighbors), Gaussian Naive Bayes (GaussianNB), and Logistic Regression.
Data Preprocessing: Clean and preprocess the provided dataset, handling missing or ill-formatted
entries.
Algorithm Implementation: Implement the following supervised machine learning algorithms for
ASD screening:
a. Decision Trees
b. Random Forest
c. Support Vector Machines (SVM)
d. K-Nearest Neighbors (KNeighbors)
e. Gaussian Naive Bayes (GaussianNB)
f. Logistic Regression
Model Evaluation: Evaluate the performance of each algorithm using appropriate metrics (e.g.,
accuracy, precision, recall, F1 score).
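The four metrics named above can be computed directly from the counts of true and false
positives and negatives. The following is a minimal sketch; the function name
`classification_metrics` and the example labels are illustrative assumptions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = ASD traits flagged, 0 = not flagged.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In a screening setting recall deserves particular attention, because a false negative means a
person with potential ASD traits is not referred for further assessment.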
2.2.1 Data Preprocessing
2.2.1.1 Data Cleaning
The system shall clean the dataset by removing records with missing or ill-formatted entries.
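A cleaning step of this kind might look like the sketch below. The set of missing-value
markers, the choice of `age` as the numeric field to validate, and the function names are
assumptions made for illustration; the actual rules depend on how the dataset encodes missing
entries.

```python
# Assumed placeholders that mark a missing entry.
MISSING_MARKERS = {"", "?", "NA", "NaN"}


def is_valid(record, numeric_fields=("age",)):
    """Return True if no field is missing and numeric fields parse as non-negative numbers."""
    for value in record.values():
        if value is None or str(value).strip() in MISSING_MARKERS:
            return False
    for field in numeric_fields:
        try:
            if float(record[field]) < 0:
                return False
        except (KeyError, TypeError, ValueError):
            return False  # field absent or ill-formatted
    return True


def clean_records(records, numeric_fields=("age",)):
    """Keep only the records that pass validation."""
    return [r for r in records if is_valid(r, numeric_fields)]


raw = [
    {"age": "27", "gender": "f", "ethnicity": "Asian"},
    {"age": "?", "gender": "m", "ethnicity": "White-European"},  # missing age
    {"age": "35", "gender": "m", "ethnicity": "?"},              # missing ethnicity
    {"age": "abc", "gender": "f", "ethnicity": "Latino"},        # ill-formatted age
]
cleaned = clean_records(raw)
print(len(raw), "->", len(cleaned))
```

Dropping records is the simplest policy; when many rows are affected, imputing values instead
may preserve more of the data.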
The system shall implement the Random Forest algorithm for ASD screening. Implemented
using scikit-learn in Python, Random Forest combines multiple decision trees to enhance
classification performance. This algorithm excels in handling diverse features such as age,
gender, ethnicity, and medical history, contributing to a comprehensive ASD screening solution.
With the flexibility to adjust hyperparameters like the number of trees and maximum depth,
Random Forest ensures adaptability to varying datasets. Through training on labeled data and
rigorous testing on a separate validation set, Random Forest enhances the predictive capabilities
of our ASD screening system.
The Support Vector Machines (SVM) algorithm is integral to our Autism Spectrum Disorder
(ASD) screening system for adults. Utilizing the scikit-learn library in Python, SVM employs
supervised learning to classify individuals into ASD-positive or ASD-negative categories based
on diverse features, including age, gender, ethnicity, and medical history.
The SVM implementation allows for customization, enabling the selection of different kernels
and regularization parameters. Training involves using a labeled dataset, while testing assesses
the model's predictive performance on a separate validation set. Hyperparameter tuning
optimizes SVM performance, and seamless integration within the overall system ensures a
cohesive approach to ASD screening for adults.
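The kernel and regularization choices described above can be explored with a grid search, as in
this minimal sketch. The synthetic data, the parameter grid, and the 75/25 split are
assumptions chosen for illustration, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the labeled screening data (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling inside the pipeline keeps validation data out of the fitted scaler.
pipeline = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
param_grid = {
    "svm__kernel": ["linear", "rbf"],  # candidate kernels
    "svm__C": [0.1, 1.0, 10.0],        # candidate regularization strengths
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
val_accuracy = search.score(X_val, y_val)  # accuracy on the held-out validation set
print(best_params, round(val_accuracy, 3))
```

The same pattern applies to Random Forest by swapping the estimator and searching over
`n_estimators` and `max_depth`.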
The system shall implement the K-Nearest Neighbors (KNeighbors) algorithm for ASD
screening. KNN estimates ASD likelihood from the labels of the k nearest data points in the
feature space. The algorithm accommodates diverse input features such as age, gender, ethnicity,
and medical history, provided they are encoded numerically so that distances can be computed.
With customizable parameters such as the number of neighbors (k), KNN adapts to different
datasets. KNN has no explicit training phase: it stores the labeled training examples, and testing
evaluates the model's performance on a separate validation set, enriching our ASD screening
system with a proximity-based classification strategy.
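The proximity-based idea can be shown without any library. The sketch below classifies a query
point by majority vote among its k nearest neighbors; the two-dimensional points, the labels,
and the function name `knn_predict` are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every stored training point.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters (illustrative data only).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["No", "No", "No", "Yes", "Yes", "Yes"]

prediction_near_first = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
prediction_near_second = knn_predict(train_X, train_y, (5.0, 5.1), k=3)
print(prediction_near_first, prediction_near_second)
```

Because the prediction depends on distances, features on larger scales dominate unless the
inputs are scaled first.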
2.3.2 Usability
The user interface shall be intuitive and user-friendly, requiring minimal training for users to
navigate and input data.
2.3.3 Security
User data input and screening results shall be securely handled, and no sensitive information
shall be stored.
2.4 Constraints
2.4.1 Dataset Limitations
The system's performance is dependent on the quality and representativeness of the provided
dataset.
2.5 Appendix
2.5.1 Some References
- Prof. Fadi Thabtah. "Autism Spectrum Disorder Screening: Machine Learning
Adaptation and DSM-5 Fulfillment."
- UCI Machine Learning Repository. "Autistic Spectrum Disorder Screening Data for
Adult."
2.5.2 Glossary
ASD: Autism Spectrum Disorder
SRS: Software Requirements Specification
Chapter Three
SYSTEM DESIGN
1. ER Diagram
Entity-Relationship (ER) diagrams serve as a blueprint for database systems, elucidating the
connections between entities and their attributes. Entities are depicted as rectangles, representing
distinct data objects, while attributes within entities are illustrated as ovals. Diamonds signify
relationships between entities, elucidating how they interact.
Cardinality, expressed as crow's feet, denotes the numerical associations between entities. The
"one" side features a straight line, while the "many" side exhibits crow's feet, conveying the
one-to-many relationships crucial for database integrity.
Entities:
● Person:
Attributes: PersonID (Primary Key), Age, Gender, Ethnicity, BornWithJaundice,
FamilyMemberWithPDD, Completer (Who is completing the test), CountryOfResidence,
UsedAppBefore, ScreeningMethodType.
● ScreeningTest:
Attributes: TestID (Primary Key), DateConducted, Results.
● Question:
Attributes: QuestionID (Primary Key), QuestionText.
Relationships:
● Person takes Screening Test:
1. Relationship: One-to-Many (One person can take multiple tests; each test is taken
by one person).
2. Foreign Key: PersonID (in ScreeningTest).
2. Associative Entity: TestQuestionAssociation (with attributes such as TestID,
QuestionID, and Answer).
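The entities and relationships listed above could be realized as relational tables. The sketch
below uses SQLite from the Python standard library; the column types, the composite primary
key on the associative table, and the sample rows are assumptions, since the design specifies
only entity and attribute names.

```python
import sqlite3

# Table and column names follow the ER description; the types are assumed.
SCHEMA = """
CREATE TABLE Person (
    PersonID            INTEGER PRIMARY KEY,
    Age                 INTEGER,
    Gender              TEXT,
    Ethnicity           TEXT,
    BornWithJaundice    INTEGER,
    FamilyMemberWithPDD INTEGER,
    Completer           TEXT,
    CountryOfResidence  TEXT,
    UsedAppBefore       INTEGER,
    ScreeningMethodType TEXT
);
CREATE TABLE ScreeningTest (
    TestID        INTEGER PRIMARY KEY,
    PersonID      INTEGER NOT NULL REFERENCES Person(PersonID),
    DateConducted TEXT,
    Results       TEXT
);
CREATE TABLE Question (
    QuestionID   INTEGER PRIMARY KEY,
    QuestionText TEXT NOT NULL
);
CREATE TABLE TestQuestionAssociation (
    TestID     INTEGER NOT NULL REFERENCES ScreeningTest(TestID),
    QuestionID INTEGER NOT NULL REFERENCES Question(QuestionID),
    Answer     INTEGER,
    PRIMARY KEY (TestID, QuestionID)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One person taking two tests demonstrates the one-to-many relationship.
conn.execute("INSERT INTO Person (PersonID, Age, Gender) VALUES (1, 29, 'f')")
conn.executemany(
    "INSERT INTO ScreeningTest (TestID, PersonID, DateConducted, Results) "
    "VALUES (?, ?, ?, ?)",
    [(1, 1, "2023-10-01", "No"), (2, 1, "2023-11-15", "No")],
)
conn.execute(
    "INSERT INTO Question (QuestionID, QuestionText) VALUES (1, 'Example question text')"
)
conn.execute(
    "INSERT INTO TestQuestionAssociation (TestID, QuestionID, Answer) VALUES (1, 1, 0)"
)

tests_for_person = conn.execute(
    "SELECT COUNT(*) FROM ScreeningTest WHERE PersonID = 1"
).fetchone()[0]
print(tests_for_person)
```

The associative table stores one answer per test and question pair, which resolves the
many-to-many link between tests and questions.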
A Use Case Diagram is a visual representation that illustrates the functional requirements and
interactions between actors and a system. It provides a high-level view of how users (actors)
interact with a system and the system's responses.
Actors:
1. Actors are external entities (e.g., users, systems) that interact with the system.
2. Representation: Depicted as stick figures or labeled boxes outside the system boundary.
3. Purpose: Identify and define the roles of entities interacting with the system.
Use Cases:
1. Use cases represent the functionalities or actions that the system performs in response to
interactions from actors.
2. Representation: Depicted as ovals within the system boundary.
3. Purpose: Capture and visualize the system's behavioral aspects from a user's perspective.
Associations:
1. Associations connect actors with use cases, representing interactions or relationships.
2. Representation: Arrows connecting actors to use cases, indicating the flow of
communication.
3. Purpose: Illustrate how actors initiate and participate in specific system functionalities.
System Boundary:
1. The system boundary encapsulates all use cases and actors, defining the scope of the
system.
2. Representation: A box surrounding use cases and actors.
3. Purpose: Clearly demarcate the system's boundaries, separating internal functionalities
from external interactions.
Actors and their use cases:
● User/Person:
1. Provide personal data.
2. Take the screening test.
● ML Engineer/Developer:
1. View screening results.
2. Analyze ML predictions.
Chapter Four
Test Plan
Autism Spectrum Disorder (ASD) has a profound impact on individuals' lives, necessitating
accurate and efficient screening methods. This test plan delves into the evaluation of various
machine learning algorithms applied to an ASD screening dataset for adults. The algorithms
under scrutiny include Logistic Regression, Support Vector Machine (SVM), k-Nearest
Neighbors (KNN), XGBoost Classifier, Random Forest, and Multi-Layer Perceptron (MLP).
Objective:
The primary objective is to assess the performance of these algorithms based on key metrics such
as training score, testing score, Mean Absolute Error (MAE), Mean Squared Error (MSE), and
Root Mean Squared Error (RMSE). Through rigorous testing, we aim to determine the efficacy
of each algorithm in accurately predicting ASD.
4.1 Data Preparation:
Input Data: Ensure the dataset is properly preprocessed by handling missing (NaN) values,
encoding categorical variables, and scaling features appropriately.
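The three preparation steps can be sketched with the standard library alone. Median
imputation, one-hot encoding, and z-score scaling are assumed choices here, one reasonable
option for each step rather than the project's confirmed method, and the sample values are
made up.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace missing entries (None) with the median of the known values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Scale values to zero mean and unit variance (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def one_hot(values):
    """Encode a categorical column as 0/1 indicator vectors, one per category."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


ages = [20, None, 30, 40]        # None marks a missing value
genders = ["f", "m", "m", "f"]

ages_filled = impute_median(ages)
ages_scaled = standardize(ages_filled)
categories, gender_encoded = one_hot(genders)
print(ages_filled, [round(a, 3) for a in ages_scaled], categories, gender_encoded)
```

In practice the imputation value and scaling statistics should be computed on the training
split only and then reused on the test split, so that no test information leaks into training.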
values. MSE is calculated by taking the average of the squared differences between each
predicted and actual value.
● MSE penalizes larger errors more significantly due to the squaring of differences.
● It provides a measure of the model's precision in predicting values.
\[ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 \]
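The error metrics named in the objective (MAE, MSE, and RMSE) can be computed in a few lines.
This is a minimal sketch; the function name `regression_errors` and the sample labels are
illustrative assumptions.

```python
import math


def regression_errors(y_true, y_pred):
    """Return MAE, MSE and RMSE between actual and predicted values."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


# Made-up binary labels: two of the five predictions are wrong.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
errors = regression_errors(y_true, y_pred)
print(errors)
```

With 0/1 class labels every error has magnitude 1, so MAE and MSE coincide and both equal the
misclassification rate; the metrics differ only when predictions are continuous scores.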
4.5 Final Values:
6. MLP algorithm
Chapter Five
BODY OF THESIS
The dataset used in this project is based on the Quantitative Checklist for Autism in adults
screening method devised by Baron-Cohen. A set of 10 questions is used, as listed in the
table below. The answers to these questions are mapped to binary values, from which the class
type is derived. These values are assigned during the data collection process, as the
questionnaire is answered. The class value "Yes" is assigned if the questionnaire score is
greater than 3, that is, there are potential ASD traits. Otherwise, class value "No" is assigned,
implying no ASD traits.
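The labeling rule described above amounts to summing the ten binary answers and comparing the
total with the threshold. The sketch below assumes each answer is already coded as 0 or 1; the
function name `assign_class` and the example answer vectors are illustrative.

```python
def assign_class(answers, threshold=3):
    """Return (score, label) for ten binary answers A1..A10.

    The label is "Yes" when the score is strictly greater than the threshold,
    matching the rule stated in the text.
    """
    if len(answers) != 10 or any(a not in (0, 1) for a in answers):
        raise ValueError("expected ten binary answers (A1..A10)")
    score = sum(answers)
    return score, ("Yes" if score > threshold else "No")


score_a, label_a = assign_class([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])  # four positive answers
score_b, label_b = assign_class([1, 0, 1, 0, 0, 0, 1, 0, 0, 0])  # three positive answers
print(score_a, label_a, score_b, label_b)
```

Because the class label is a deterministic function of the ten answers, a model that sees all
ten scores can in principle reproduce the label exactly, which is worth keeping in mind when
interpreting accuracy figures.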
A8 First words
Chapter Six
RESULTS
RESULTS AND OBSERVATIONS
The entire system rests on the quality and representativeness of the dataset. In the context of this
ambitious project, the dataset compilation is a meticulous process, carefully crafted to be the
linchpin that shapes the efficacy and reliability of the developed ASD screening tool. The dataset
is not merely a collection of data points but a comprehensive assembly, drawing upon a diverse
pool of information from both individuals with Autism Spectrum Disorder (ASD) and those
without, specifically focusing on adult subjects. This strategic compilation ensures that the
training and evaluation processes of the machine learning algorithms unfold on a rich canvas,
reflecting the complexity and diversity inherent in real-world scenarios.
The assembly of this dataset is characterized by a rigorous curation process, guided by a
commitment to inclusivity and authenticity. Information is collated from a spectrum of sources,
capturing the nuances of adult ASD from various perspectives. The inclusion of both ASD and
non-ASD subjects establishes a balanced and representative foundation, mimicking the
intricacies encountered in clinical settings. This deliberate approach is crucial for the machine
learning algorithms to develop a nuanced understanding of the patterns and features associated
with ASD, enhancing their ability to discriminate between individuals with and without the
disorder.
Table 6.1 Results
Learning Algorithm    Accuracy
SVM                   88.36082 %
KNN                   82.87602 %
Chapter Seven
SUMMARY AND CONCLUSION
The project focuses on predicting Autism Spectrum Disorder (ASD) screening results for adults
using machine learning algorithms. This analysis aims to provide valuable insights into the
potential effectiveness of various algorithms for ASD screening.
Dataset Overview:
The dataset comprises binary ASD scores (A1 to A10), demographic features (gender, ethnicity),
and other relevant attributes such as jaundice at birth, autism history, country of residence, and
family relationship. The selection of these features was driven by their potential impact on ASD
screening outcomes based on domain knowledge and preliminary exploratory data analysis
(EDA).
Data Preprocessing:
Data preprocessing steps included handling missing values, encoding categorical variables, and
ensuring the dataset is suitable for machine learning algorithms. The choice of preprocessing
techniques aimed at maintaining the integrity of the dataset and preparing it for effective model
training and evaluation.
Model Selection:
The chosen algorithms, namely Logistic Regression, Support Vector Machine (SVM), K-Nearest
Neighbors (KNN), XGBoost Classifier, Random Forest, and Multi-Layer Perceptron (MLP),
were selected for their suitability for binary classification tasks. Logistic Regression and SVM are
well-established algorithms for binary classification, while KNN is known for its simplicity and
effectiveness. XGBoost and Random Forest are ensemble methods known for handling complex
relationships in data, and MLP is a versatile neural network architecture suitable for non-linear
patterns.
Feature Selection:
Features such as ASD scores, demographic information, and relevant medical history were
chosen based on their potential to contribute to ASD screening outcomes. ASD scores are direct
indicators, while demographic features provide contextual information that may influence
screening results. The inclusion of medical history features aims to capture additional factors that
could impact ASD.
Model Evaluation:
Each algorithm's performance was evaluated using standard metrics such as accuracy, precision,
recall, and F1 score. The choice of these metrics was motivated by the need to strike a balance
between overall accuracy and the ability to correctly identify positive cases, which is crucial in
medical screening scenarios.
Visualizations:
Visualizations, including box plots and bar graphs, provided a comprehensive view of the
models' performance, facilitating informed decision-making. These visualizations were chosen to
effectively communicate the trade-offs between true positive rates and false positive rates.
Conclusion
The project demonstrated the effectiveness of machine learning models in ASD screening.
Among the algorithms tested, XGBoost Classifier and Random Forest displayed superior
performance in terms of accuracy and precision.
Feature Importance:
Feature importance analysis revealed key predictors influencing ASD screening outcomes.
Notably, specific ASD scores, age, and family relationships emerged as significant contributors.
The choice of these features was validated by their impact on the models' predictive power.
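One common way to obtain such a ranking is the impurity-based importance exposed by a fitted
Random Forest. The sketch below runs on synthetic data with placeholder feature names, so it
shows the mechanism only; it does not reproduce the project's analysis, and the choice of this
particular importance measure is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names and synthetic data (assumptions for illustration).
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ holds one non-negative weight per feature, summing to 1.
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda kv: kv[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances can favor features with many distinct values; permutation
importance on held-out data is a frequently used cross-check.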
Practical Implications:
The developed models have practical implications for streamlining ASD screening processes,
potentially aiding healthcare professionals in early and accurate identification of ASD in adults.
The inclusion of specific features ensures the models capture relevant information for actionable
insights.
FUTURE SCOPE
In the future, this project could be improved by incorporating novel machine learning
algorithms. Additionally, fine-tuning the existing algorithms by adjusting the hyperparameters
most crucial to their accuracy could further enhance the project. This optimization aims to
produce more precise models capable of effectively identifying Autism Spectrum Disorder
(ASD) in adults.
Furthermore, the project's advancement could involve the integration of a Deep Learning
neural network. This addition would contribute to uncovering additional concealed information
within the dataset. The advantages of refining existing algorithms or introducing new ones
extend beyond practical applications and can also be valuable for research purposes.
In addition to introducing new predictive techniques, the project will incorporate
more advanced and efficient visualization methods. These enhancements are intended to give
a clearer and more comprehensive view of the data and to support in-depth discussion of the
results.
REFERENCES
[1] A. A. Abdullah, S. Rijal, and S. R. Dash, “Evaluation on Machine learning Algorithms for
Classification of Autism Spectrum Disorder (ASD),” Journal of Physics, vol. 1372, no. 1, p.
012052, Nov. 2019.
[3] “A Review on Predicting Autism Spectrum Disorder(ASD) meltdown using Machine
Learning Algorithms,” IEEE Conference Publication | IEEE Xplore, Nov. 18, 2021.
[4] “A supervised machine learning algorithm for arrhythmia analysis,” IEEE Conference
Publication | IEEE Xplore, 1997.
[6] C. Küpper et al., “Identifying predictive features of autism spectrum disorders in a clinical
sample of adolescents and adults using machine learning,” Scientific Reports, vol. 10, no. 1, Mar.
2020.
[8] D. Nguyen and J. Patrick, “Supervised machine learning and active learning in classification
of radiology reports,” Journal of the American Medical Informatics Association, vol. 21, no. 5,
pp. 893–901, Sep. 2014.
[9] D. K. S. R. et al., “Machine Learning based novel Autism Spectrum Disorder Screening,”
TURCOMAT, vol. 12, no. 3, pp. 4866–4879, May 2021.
[10] F. Thabtah and D. Peebles, “A new machine learning model based on induction of rules for
autism detection,” Health Informatics Journal, vol. 26, no. 1, pp. 264–286, Jan. 2019.
[11] H. Abdi, L. J. Williams, and D. Valentin, “Multiple factor analysis: principal component
analysis for multitable and multiblock data sets,” WIREs Computational Statistics, vol. 5, no. 2,
pp. 149–179, Feb. 2013.
[12] H. Bhavsar and A. Ganatra, “A Comparative Study of Training Algorithms for Supervised
Machine Learning,” International Journal of Soft Computing and Engineering (IJSCE), Jan.
2012.
[14] K. S. Oma, P. Mondal, N. S. Khan, Md. R. K. Rizvi, and M. N. Islam, “A Machine Learning
Approach to Predict Autism Spectrum Disorder,” IEEE, Feb. 2019.
[15] M. Alteneiji, “Autism Spectrum Disorder Diagnosis using Optimal Machine Learning
Methods,” 2020.
50
[16] Md. M. Rahman, O. L. Usman, R. C. Muniyandi, S. Sahran, S. Mohamed, and R. A. Razak,
“A review of Machine learning Methods of feature selection and Classification for autism
Spectrum Disorder,” Brain Sciences, vol. 10, no. 12, p. 949, Dec. 2020.
[18] M. N. Murty and R. Raghava, Support vector machines and perceptrons: Learning,
Optimization, Classification, and Application to Social Networks. Springer, 2016.
[19] Müller, A. C., & Guido, S. (2016). "Introduction to Machine Learning with Python."
O'Reilly Media.
[20] M. W. Berry, A. Mohamed, and B. W. Yap, Supervised and unsupervised learning for data
science. Springer Nature, 2019.
[21] N. Kühl, R. Hirt, L. Baier, B. Schmitz, and G. Satzger, “How to conduct rigorous supervised
machine learning in information Systems research: The Supervised Machine Learning Report
Card,” Communications of the Association for Information Systems, vol. 48, no. 1, pp. 589–615,
Jan. 2021.
[24] S. Raj and S. Masood, “Analysis and detection of autism spectrum disorder using machine
learning techniques,” Procedia Computer Science, vol. 167, pp. 994–1004, Jan. 2020.
[25] S. R, “A machine learning way to classify autism spectrum Disorder,” Learning &
Technology Library (LearnTechLib), Mar. 30, 2021.
[27] S. Raschka, Y. Liu, V. Mirjalili, and D. Dzhulgakov, Machine Learning with PyTorch and
Scikit-Learn: Develop machine learning and deep learning models with Python. Packt
Publishing Ltd, 2022.
51
[28] S. Uddin, A. Khan, E. Hossain, and M. A. Moni, “Comparing different supervised machine
learning algorithms for disease prediction,” BMC Medical Informatics and Decision Making, vol.
19, no. 1, Dec. 2019.
[29] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning: Data Mining,
Inference, and Prediction. Springer Science & Business Media, 2013.
[30] V. Nasteski, “An overview of the supervised machine learning methods,” Horizonti. Serija
B. Prirodno-matematički, Tehničko-tehnološki, Biotehnički, Medicinski Nauki I Zdravstvo, vol. 4,
pp. 51–62, Dec. 2017.
52
APPENDIX A
SCREEN SHOTS
Fig A-1 First Five rows of the dataset
Fig A-4 Pie chart of the proportion of records in each target class
Fig A-5 Plot in which the scores indicate the number of 'yes' answers for the set of 10 questions asked
Fig A-6 Plot in which the count indicates the number of males and females
Fig A-7 Plot in which the count indicates the number of people from each ethnicity
Fig A-8 Plot with 0 for people having autism and 1 for people not having autism
Fig A-10 Pair plot for A1-A10
Fig A-11 Heat map for the correlation matrix of numeric columns
Fig A-12 Count plot in which the count indicates the number of cases for each age group
Fig A-13 Comparison between scores and number of positive and negative cases
Fig A-14 Normal distribution of the age values after log transformations
APPENDIX B
SOURCE CODE
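import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# The listing calls seaborn through the alias `sb`
import seaborn as sb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from imblearn.over_sampling import RandomOverSampler
import warnings
warnings.filterwarnings('ignore')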
df = pd.read_csv('new_excel.csv')
print(df.head())
df['ethnicity'].value_counts()
df['relation'].value_counts()
plt.pie(df['Class/ASD'].value_counts().values, autopct='%1.1f%%')
plt.show()
ints = []
objects = []
floats = []
for col in df.columns:
    if df[col].dtype == int:
        ints.append(col)
    elif df[col].dtype == object:
        objects.append(col)
    else:
        floats.append(col)
plt.title('Distribution of Scores (A1 to A10)')
plt.xlabel('Question Number')
plt.ylabel('Total Score')
plt.show()
#Gender
df['gender'].value_counts().plot(kind='bar', color='black')
plt.title('Distribution of Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
#ethnicity graph
df['ethnicity'].value_counts().plot(kind='bar', color='black')
plt.title('Distribution of Ethnicity')
plt.xlabel('Ethnicity')
plt.ylabel('Count')
plt.show()
# jaundice at birth
df['jundice'].value_counts().plot(kind='bar', color='black')
plt.title('Distribution of Jaundice')
plt.xlabel('Jaundice')
plt.ylabel('Count')
plt.show()
# autism result
df['austim'].value_counts().plot(kind='bar', color='black')
plt.title('Distribution of Autism')
plt.xlabel('Autism')
plt.ylabel('Count')
plt.show()
#country result
df['contry_of_res'].value_counts().plot(kind='bar', color='black')
plt.title('Distribution of Country of Residence')
plt.xlabel('Country of Residence')
plt.ylabel('Count')
plt.show()
# heat map for correlation matrix
df = df[df['result']>-5]
df.shape
sb.countplot(x=df['ageGroup'], hue=df['Class/ASD'])
plt.show()
def add_feature(data):
    # Creating a column with all values zero
    data['sum_score'] = 0
    for col in data.loc[:, 'A1_Score':'A10_Score'].columns:
        # Updating the 'sum_score' value with scores
        # from A1 to A10
        data['sum_score'] += data[col]
    # Combined indicator: sum of the three columns below
    # (assumes they already hold numeric 0/1 values)
    data['ind'] = data['austim'] + data['used_app_before'] + data['jundice']
    return data
df = add_feature(df)
sb.countplot(x=df['sum_score'], hue=df['Class/ASD'])
plt.show()
def encode_labels(data):
    for col in data.columns:
        # Here we will check if datatype
        # is object then we will encode it
        if data[col].dtype == 'object':
            le = LabelEncoder()
            data[col] = le.fit_transform(data[col])
    return data
df = encode_labels(df)
# Making a heatmap to visualize the correlation matrix
plt.figure(figsize=(10,10))
sb.heatmap(df.corr() > 0.8, annot=True, cbar=False)
plt.show()
# Check for NaN values in X
nan_columns = np.isnan(X).any(axis=0)
nan_columns = np.where(nan_columns)[0].tolist()
# List of models
models = [LogisticRegression(), XGBClassifier(), SVC(kernel='rbf')]
# Fit each model, then report AUC-ROC on the training and validation sets
for model in models:
    model.fit(X_scaled, Y)
    print(f'{model} : ')
    print('Training AUC-ROC Score : ', metrics.roc_auc_score(Y, model.predict(X_scaled)))
    print('Validation AUC-ROC Score : ', metrics.roc_auc_score(Y_val, model.predict(X_val_scaled)))
    print()
# Split the data into training and validation sets
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.2, random_state=10, stratify=Y)
# List of models
models = [
    LogisticRegression(),
    XGBClassifier(),
    SVC(kernel='rbf'),
    MLPClassifier(random_state=42),
    RandomForestClassifier(random_state=42),
    KNeighborsClassifier()
]
# Fit each model, then report AUC-ROC on the training and validation sets
for model in models:
    model.fit(X_train_scaled, Y_train)
    print(f'{model} : ')
    print('Training AUC-ROC Score : ', metrics.roc_auc_score(Y_train, model.predict(X_train_scaled)))
    print('Validation AUC-ROC Score : ', metrics.roc_auc_score(Y_val, model.predict(X_val_scaled)))
    print()
# Assuming X_train, X_test, Y_train, Y_test are defined and represent your training and testing data
mse_test = mean_squared_error(Y_val, y_pred_test)
# Print results
print(f'{model} : ')
print('Training Score:', training_score)
print('Testing Score:', testing_score)
print('Mean Absolute Error (MAE) - Training:', mae_train, ' Testing:', mae_test)
print('Mean Squared Error (MSE) - Training:', mse_train, ' Testing:', mse_test)
print('Root Mean Squared Error (RMSE) - Training:', rmse_train, ' Testing:', rmse_test)
print()
# Performance metrics
accuracy = metrics.accuracy_score(Y_val, y_pred_val)
sensitivity = metrics.recall_score(Y_val, y_pred_val)
roc_auc = metrics.roc_auc_score(Y_val, logreg_model.predict_proba(X_val_scaled)[:, 1])
# Print results
print('Logistic Regression : ')
print('Accuracy:', accuracy)
print('Sensitivity:', sensitivity)
print('AUC:', roc_auc)
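The listing scores most models with metrics.roc_auc_score(..., model.predict(...)), that is, on hard 0/1 labels, while the logistic regression block passes predict_proba(...)[:, 1]. The sketch below is not part of the project code; it only illustrates why the two inputs can give different AUC values. The helper pairwise_auc and the labels and probabilities are made up for illustration, and the AUC is computed by hand so that the example runs without any third-party library.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up validation labels and predicted probabilities (illustrative only).
y_val = [0, 0, 0, 1, 1, 1]
proba = [0.10, 0.40, 0.60, 0.45, 0.80, 0.90]

# Hard labels obtained by thresholding at 0.5, similar to what predict() returns.
hard = [1 if p >= 0.5 else 0 for p in proba]

auc_from_proba = pairwise_auc(y_val, proba)
auc_from_labels = pairwise_auc(y_val, hard)
print("AUC from probabilities:", auc_from_proba)
print("AUC from hard labels:  ", auc_from_labels)
```

With these made-up numbers the probability-based AUC is 8/9 and the label-based AUC is 2/3. Thresholding discards the ranking information inside each predicted class, so scores reported from predict() may understate how well a model separates the two classes.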