TARP Final Report
Team Members
Sreshta D – 20BCE2437
The project embarks on a noble mission, recognizing the profound impact of unaddressed mental
health issues in today's society. Its core objective is to unravel the intricate web of factors
contributing to mental health challenges while emphasizing public awareness and effective coping
mechanisms. Through surveys and data collection, the project applies machine learning
algorithms, including Random Forest and Gradient Boosting, to examine the multifaceted
landscape of mental health.
At the heart of this initiative lies a strong motivation to elevate mental health awareness, eliminate
stigma, and mitigate the far-reaching consequences of unattended mental health issues on various
aspects such as employment, safety, and societal well-being. The project seeks to predict levels of
mental health awareness and extend support by exploring treatment options, all while unraveling
the underlying determinants that impact an individual's mental well-being.
By employing sophisticated data analysis techniques, the project endeavors to uncover intricate
patterns and correlations between various factors—such as age, occupation, and mental health
status. These insights are essential in understanding how different variables intersect and influence
mental well-being, potentially paving the way for preventive measures and tailored interventions.
The emphasis on enhancing mental health awareness and combating associated stigmas aligns with
the broader mission of reshaping societal attitudes toward mental health. By shedding light on the
repercussions of unaddressed mental health challenges on employment opportunities, safety in
communities, and overall societal cohesion, this initiative aims to drive positive change.
Ultimately, the comprehensive approach taken by this project—integrating big data analytics,
machine learning capabilities, advocacy for public awareness, and the provision of support
strategies—holds promise in transforming societal perceptions of mental health. By striving to
create a more empathetic and supportive environment, this initiative endeavors to positively impact
the understanding and treatment of mental health issues.
Introduction
The conversation surrounding mental health has changed significantly, moving beyond secrecy
and stigma to become a pressing worldwide issue that requires a swift and decisive response.
This shift has been driven by a sharp increase in the availability and application of digital
data, which has renewed awareness of the importance of mental health concerns in communities
across the globe.
The emergence of Big Data Analytics has changed how we understand and manage the complex
web of mental health issues. The inflow of vast and varied data sources has opened up new
avenues for investigation, enabling scholars and clinicians to study interconnected mental
health conditions, complex patterns, nuanced risk factors, and multifaceted interventions at a
previously unattainable level of detail. The ability to examine enormous datasets methodically
offers a chance to better characterize mental health issues and to investigate previously
undiscovered relationships and correlations.
Related Work and Motivation
Understanding our mental health is crucial for our overall well-being, happiness, and how well we
can contribute to our community. That's why our project is all about making more people aware
of mental health. We want to help everyone understand the signs when something might be wrong,
find the right help from qualified professionals, and most importantly, stop the unfair treatment
and judgment that often comes with mental illness. This judgment makes a lot of people suffer
quietly, and that's not okay.
When mental health issues go untreated, they can cause big problems in our communities. They
can affect local businesses, safety, how many people are without homes, and even job
opportunities. For young people, it might make it harder to do well in school. It can also increase
how much we spend on healthcare. Families and communities can feel the strain too. That's why
it's really important for us to understand mental health better and do something about it before it
gets worse.
Literature Survey
1. Paper: "Overview of the role of big data in mental health: A scoping review", Arfan Ahmed et al. (2022), Computer Methods and Programs in Biomedicine Update, Elsevier.
Summary: The review explores recent works in utilizing Big Data for textual and 1D sensor data in the context of mental health. It offers a detailed taxonomy of related technologies and assesses their maturity levels in this specific domain.
Findings: From 600 initial studies, 406 were excluded after screening for relevance, population and intervention criteria, and incompatible publication types. An additional 171 were removed from the remaining 194 results, resulting in 23 included in the scoping review.

2. Paper: "iHealth: The ethics of artificial intelligence and big data in mental healthcare", Giovanni Rubeis (2022), Internet Interventions, Elsevier.
Summary: The article explores the ethical aspects of intelligent health (iHealth) in mental healthcare, combining AI and Big Data. It specifically addresses three key points mentioned in existing literature: self-monitoring, ecological momentary assessment (EMA), and data mining.
Findings: Ethical analysis highlights iHealth's opportunities and challenges. Self-monitoring, EMA, and data mining could prevent, predict, and personalize mental health treatment, involving patients. Further research on diverse patient groups is crucial to explore the tested potential of digital mental health technologies.

3. Paper: "Digital health data-driven approaches to understand human behavior", Lisa A. Marsch (2021), Neuropsychopharmacology, Nature.com.
Summary: This paper offers a proof-of-concept review of digital health data's role in understanding health behavior, focusing on psychiatric disorders. It synthesizes scientific literature on how empirically derived digital data informs assessment, diagnosis, and clinical trajectories in this context.
Findings: Digital health spans from measuring to intervening, offering value with digital biomarkers and insights into psychiatric diagnoses and outcome measurement over time. These applications can complement each other, measuring behavior and guiding responsive interventions.

4. Paper: "Big Data analytics and artificial intelligence in mental healthcare", Ariel Rosenfeld et al. (2021), Applications of Big Data in Healthcare, Elsevier.
Summary: This paper discusses the potential of AI and Big Data in mental health for personalized treatment, predicting relapse, preventing conditions before they escalate, and delivering treatments.
Findings: Mental health lacks widely accepted biomarkers, relying heavily on patient and clinician questionnaires and emerging digital phenotyping signals. This chapter explores opportunities, limitations, and techniques for enhancing mental healthcare with AI and big data.

...
Summary: ... employing natural language processing to detect suicidal thoughts in unstructured counseling session transcripts.
Findings: ... pertaining to mental health applications have been discerned: detection and diagnosis, prognosis, treatment and support, public health applications, and research and clinical administration.

...
Summary: ... Boosting, Random Forest, and Naïve Bayes. This structured information provides a concise overview of the field's key topics and researchers.
Findings: ... choice, despite their distinct advantages. Many experiments employ these algorithms without a deep grasp of the data properties.

10. Paper: "Challenges and Opportunities of Big Data in Health Care: A Systematic Review", Clemens Scott Kruse et al. (2016), JMIR (Journal of Medical Internet Research) Medical Informatics.
Summary: Electronic Health Records (EHR) adopted globally offer vast, complex data resources. In the United States, EHR usage has surged since 2009, spurred by incentives for "meaningful use." This growth aligns with the evolution of big data, shaping the future of health information technology implementation and management.
Findings: Examining the literature on big data in healthcare shows common concerns about data organization and security, as well as opportunities like improved quality and disease diagnosis. These findings call for further research in these areas.
Proposed Methodology
The initial step in the project involved data collection through survey forms distributed to
participants.
Thoroughly planning the survey's design, especially the way the questions are formulated, is
essential to collecting accurate and thorough data on mental health. We used a thorough process
to create survey questions for our study, making sure they were:
● Clear and Concise: To minimise participant confusion, all questions were written in an
easy-to-understand style without resorting to jargon or technical vocabulary.
● Objective and Unbiased: Leading or suggestive wording that can sway respondents'
responses was avoided in the framing of the questions.
● Relevant and Specific: The questions were designed to elicit information directly
connected to the study goals, without delving into unrelated or superfluous subjects.
● Culturally Sensitive: To ensure that the questions were appropriate for the target
demographic, they were phrased with consideration for potential biases and cultural
differences.
When creating the survey questions, we took psychometric factors into account in addition to these
broad guidelines. This required making certain that the inquiries were:
● Reliable: The same questions measured the same construct consistently across repeated
administrations and circumstances.
● Valid: The desired mental health constructs were correctly assessed by the questions.
● Extensive: The questions collectively covered every facet of mental health that was
pertinent to the goals of the investigation.
● Sensitive: The questions were able to identify minute differences in the state of mental
health.
Pilot Testing and Improvement -
We ran a pilot test with a limited sample of respondents to further improve the caliber of the survey
questions. This let us get input on the questions' efficacy, relevance, and clarity so we could
improve them before distributing them widely.
We made sure that our survey instrument collected high-quality data on mental health by taking a
thorough approach to survey design, which gave our research project a solid foundation.
The collected data was then saved as a CSV file.
Google Form Link: https://forms.gle/pzHKv2vmetLDuUju8
Upon reading the dataset, exploratory analysis was conducted to unveil patterns, detect anomalies,
test hypotheses, and validate assumptions. This process relied on summary statistics and graphical
representations. The initial phase was particularly challenging, requiring the identification of
pertinent questions and factors for analysis, such as occupation and education.
Following data collection, we proceeded to the data preprocessing phase. The initial task involved
scrutinizing the dataset for missing values (NaN) and counting the number of missing values in
each column. To handle missing values, a decision was made to either omit the respective data
points or manually input values, with a preference for the former. Additionally, we proposed
creating a heatmap to visualize the distribution of missing values during the initial preprocessing
stage.
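The missing-value check described above can be sketched as follows. This is a minimal illustration with pandas, not the project's actual code: the three-column frame and its shortened column names are hypothetical stand-ins for the survey CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the survey responses; the real CSV has more columns.
df = pd.DataFrame({
    "Age": ["18-25", "26-35", None, "18-25"],
    "Occupation": ["Student", None, "Employee", "Student"],
    "Rank": [3.0, np.nan, 4.0, 2.0],
})

# Number of missing values (NaN/None) in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Boolean mask of missing cells. This is the matrix a heatmap would render,
# for example with seaborn.heatmap(missing_mask, cbar=False).
missing_mask = df.isna()

# Preferred option in the text: omit rows that contain any missing value.
df_complete = df.dropna()
print(len(df), "rows before,", len(df_complete), "rows after dropping")
```

Dropping rows is simple but shrinks an already small dataset, which is presumably why imputation is considered for numerical fields in the steps that follow.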
Several preprocessing steps were outlined based on the collected data, including:
1. Data Cleaning - Missing Values: For categorical data, rows with missing values were to be
dropped. We aimed to minimize missing values by making all relevant fields mandatory
in the data collection form. For numerical missing values, the plan was to use a KNN imputer. As
part of the preprocessing, a heatmap was to be generated to visualize the distribution of
missing values.
2. Data Integration: Data collected in different environments, such as students and office
workers, was slated to be merged into a unified dataset.
3. Data Encoding: Several columns in the dataset contain categorical labels. To make the data
machine-readable, we adopted label encoding, which converts each label into a numeric code so
that machine learning algorithms can operate on it. Label encoding was applied to various columns,
including age group, gender, occupation, and the importance of mental health. The LabelEncoder
was used for this purpose, with its `fit_transform` method applied to each column in turn.
4. Data Normalization: The team considered the use of two attributes to divide the dataset into
key-value pairs.
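The cleaning and encoding steps above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the project's code: the data frame is invented, `Hours` is a hypothetical numeric field added only so the imputer has a second column to measure distance on, and `n_neighbors=2` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample; column names are shortened stand-ins for survey fields.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F"],
    "Occupation": ["Student", "Employee", "Student", "Student", "Employee"],
    "Rank": [2.0, 4.0, np.nan, 3.0, 5.0],
    "Hours": [6.0, 8.0, 7.0, np.nan, 9.0],  # hypothetical numeric field
})
categorical_cols = ["Gender", "Occupation"]
numeric_cols = ["Rank", "Hours"]

# Step 1a: drop rows whose categorical answers are missing.
df = df.dropna(subset=categorical_cols).reset_index(drop=True)

# Step 1b: fill numeric gaps with the mean of the nearest neighbours.
imputer = KNNImputer(n_neighbors=2)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# Step 3: label-encode each categorical column separately, keeping the
# fitted encoder so that codes can be mapped back to the original labels.
encoders = {}
for col in categorical_cols:
    encoder = LabelEncoder()
    df[col] = encoder.fit_transform(df[col])
    encoders[col] = encoder

print(df)
print({col: list(enc.classes_) for col, enc in encoders.items()})
```

LabelEncoder works on one column at a time, which is why the loop fits a separate encoder per column. The integer codes follow the sorted order of the labels and carry no meaning of their own, which tree-based models tolerate better than linear ones.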
Subsequently, the project moved to the model prediction phase, employing two algorithms: the
Random Forest Classifier and Gradient Boosted Trees. These algorithms were implemented
both in Python and PySpark. Furthermore, Lasso was applied, leveraging MapReduce
techniques, to identify key determining factors. The final step involved data visualization, where
various graphs were used to present the results.
Algorithms/Models Used
Random Forest:
Random Forest, an ensemble learning method, is used for tasks like classification and regression.
It operates by constructing numerous decision trees during training. For classification tasks, the
output of the Random Forest is determined by the class selected by the majority of the constituent
trees. In our dataset, the following columns are employed as independent variables: 'Which age
group do you belong to?', 'Gender', 'Occupation', 'Do you find mental health important?', 'Do you
or anyone you know face any mental health issues?', 'Do you think therapy helps during mental
illness?', and 'How would you rank the awareness of mental health in our Indian society? (1 being
the least and 5 the most)'. The column 'Are they seeking any kind of treatment for it?' is the
target variable.
The dataset is divided into training and testing sets, with a test size of 30%.
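A 70/30 split with a Random Forest classifier can be sketched as below. This sketch uses scikit-learn rather than PySpark, and the data is synthetic: 340 rows with seven integer-coded predictors and a made-up binary target, chosen only to mirror the shape of the survey. The accuracy it prints says nothing about the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the label-encoded survey: 340 rows, 7 predictors.
X = rng.integers(0, 5, size=(340, 7))
# Hypothetical binary target, loosely tied to two predictors plus noise.
y = ((X[:, 0] + X[:, 3] + rng.integers(0, 3, size=340)) > 5).astype(int)

# Hold out 30% of the rows for testing, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Each of the 100 trees votes; the majority class is the prediction.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy on synthetic data: {accuracy:.2f}")
```

Fixing `random_state` makes the split and the forest reproducible, which helps when comparing this model against Gradient Boosting on the same rows.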
Parallel processing is employed, wherein a Spark data frame is partitioned into smaller, distributed
data sets. These smaller sets are converted into Pandas objects, functions are applied, and the
results are consolidated back into a large Spark data frame.
Gradient Boosting:
Gradient Boosting is a machine learning technique utilized for regression and classification tasks,
among others. It constructs a prediction model as an ensemble of weak prediction models, often
decision trees. The training process in Gradient Boosting is typically carried out sequentially, with
one tree trained at a time. Parallelization occurs at the single tree level.
In Decision Trees, all features are usually considered at each decision node in the tree. However,
in Random Forests, a random subset of features is selected at each node. The implementation in
MLlib capitalizes on this subsampling to reduce communication. For example, if only one-third of
the features are used at each node, communication is reduced to roughly one-third.
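The one-third example can be worked through in a few lines. This is a simplified sketch of per-node feature subsampling, not MLlib's implementation; the feature count of nine is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(42)

n_features = 9                  # hypothetical total number of features
subset_size = n_features // 3   # one-third of the features per node

# Features a single tree node would evaluate under random subsampling.
node_features = rng.choice(n_features, size=subset_size, replace=False)

# Split statistics are only gathered for the sampled features, so the data
# exchanged per node scales with subset_size / n_features.
communication_ratio = subset_size / n_features

print("features considered at this node:", sorted(node_features.tolist()))
print(f"share of statistics communicated: {communication_ratio:.2f}")
```

In scikit-learn this behaviour is controlled by the `max_features` argument of RandomForestClassifier, and in PySpark by `featureSubsetStrategy`, which accepts values such as "onethird" and "sqrt".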
Feature Selection by LASSO:
LASSO (Least Absolute Shrinkage and Selection Operator) is employed for feature selection,
providing a systematic approach to reduce the number of features in a model. LASSO introduces
a penalty factor that determines how many features are retained. Cross-validation is used to choose
the penalty factor, ensuring that the model generalizes well to future data samples.
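A minimal sketch of LASSO feature selection with a cross-validated penalty is shown below, using scikit-learn's LassoCV. The data is synthetic and the feature names are shortened, hypothetical versions of the survey fields; the target is constructed to depend on only two of them so that the selection effect is visible.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical short names for the seven encoded predictors.
feature_names = ["age", "gender", "occupation", "mh_imp",
                 "mh_issues", "therapy", "rank"]
X = rng.normal(size=(340, len(feature_names)))
# Synthetic target that depends only on mh_issues and therapy, plus noise.
y = 2.0 * X[:, 4] - 1.5 * X[:, 5] + rng.normal(scale=0.5, size=340)

# The L1 penalty is scale-sensitive, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

# 5-fold cross-validation picks the penalty strength (alpha).
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)

# Features whose coefficients were not shrunk to zero are retained.
selected = [name for name, coef in zip(feature_names, lasso.coef_)
            if abs(coef) > 1e-6]
print("chosen penalty (alpha):", lasso.alpha_)
print("retained features:", selected)
```

A larger penalty drives more coefficients to exactly zero, so the cross-validated `alpha_` is effectively what decides how many features survive.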
One notable aspect of this project is its novelty, as the chosen topic is relatively under-researched.
Additionally, the project involves processing data from scratch, distinct from many machine
learning projects that use pre-processed data. As a big data implementation, the project leverages
Spark's parallel data processing model to divide the data into smaller sections and then applies
machine learning models like Random Forest and Gradient Boosting, resulting in improved
efficiency and faster results.
Results and Discussion
Project Components:
The project is structured around three main components: data engineering, applying analytical
models to the data, and deriving results. The initial phase focuses on collecting data through an
online questionnaire, resulting in approximately 340 responses. Each response records the
following fields:
1. Name: Identity of the respondent (Personal information, possibly omitted for privacy).
2. Age: Provides information about the age group of the user.
3. Gender: Categorizes the respondent's gender as Male (M), Female (F), Prefer not to say, or
Other.
4. Occupation: Indicates the respondent's profession or job.
5. MH-imp: Identifies if the user considers mental health important.
6. MH-issues: Indicates whether the user faces any mental health issues.
7. Activities: If mental health issues are present, this field assesses whether they affect the
user's day-to-day activities.
8. Treatment: Identifies if the user is seeking treatment. This serves as the dependent variable.
9. MH-causes: Records factors contributing to mental health issues.
10. Pandemic: Assesses whether the pandemic has caused mental health problems.
11. Therapy: Examines whether therapy helps in curing mental illnesses.
12. MH-recover: Measures the estimated time needed for recovery from a mental illness.
13. Improve: Gathers insights on how individuals believe they can improve their mental health.
14. Stigma: Explores how a family responds when a person has mental health issues.
15. Rank: Asks respondents to rank the awareness of mental health in Indian society (1 being the
least and 5 the most).
Code and Implementation:
● Data Visualization: The project employs a correlation plot to depict the relationships
between dataset features.
● Handling Missing Values with KNN Imputer: K-nearest neighbors (KNN) imputer is used
to address missing values in the dataset.
● Label Encoding: The data is quantified using label encoding to facilitate the application of
various machine learning models.
● Lasso Model: The Lasso model estimates sparse coefficients to determine the most relevant variables.
● Feature Selection in Pyspark Random Forest: Feature selection is conducted, and the
accuracy for Pyspark's Random Forest module is calculated at 72%.
● Pyspark Gradient Boosted Trees: Utilizes Gradient Boosted Trees for classification and
accuracy measurement in the Pyspark module.
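The correlation plot listed first above is built from a pairwise correlation matrix, which can be sketched with pandas as follows. The data is synthetic and the column names are hypothetical short forms of the encoded survey fields; `treatment` is generated to agree with `mh_issues` most of the time so that one clear correlation appears.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

mh_issues = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "mh_issues": mh_issues,
    # Hypothetical: treatment matches mh_issues in about 80% of rows.
    "treatment": np.where(rng.random(200) < 0.8, mh_issues, 1 - mh_issues),
    # Independent 1-to-5 rating, expected to be roughly uncorrelated.
    "rank": rng.integers(1, 6, size=200),
})

# Pearson correlation between every pair of columns.
corr = df.corr()
print(corr.round(2))
```

The resulting matrix is what a plotting call such as `seaborn.heatmap(corr, annot=True)` would draw as the correlation plot.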
Conclusion
The use of Pyspark modules for big data processing showcases the advantages of batch processing,
even when dealing with relatively small datasets, resulting in similar accuracy levels as
conventional models (72% for the Random Forest model). As the volume of data increases, we
expect the accuracy to improve. To strengthen our results, we are actively seeking ways to
expand our data set to approximately 1000 rows, since a larger training set should yield more
robust results. Additionally, we plan to explore a broader range of machine learning models,
which will enable a comparative study between different models.
Our ultimate goal is to predict the awareness rate among the population and provide suggestions
for various treatments and therapy options. By employing a series of algorithms, we can also
pinpoint the most influential factors affecting an individual's mental health. Presenting this
information in visual formats simplifies understanding for the general public, making it accessible
and meaningful beyond technical jargon.
The uniqueness and novelty of our project lie in the creation and utilization of a comprehensive
dataset specifically tailored to cater to the needs of students and young professionals struggling
with mental health concerns. We've meticulously gathered and curated information that not only
sheds light on mental health issues but also provides a crucial link between individuals seeking
help and qualified psychiatrists within their local regions.
What sets our dataset apart is its focus on the intersection between mental health support and
accessibility. We've compiled a diverse range of data points, including demographics, and specific
mental health needs. By amalgamating this information, we've established a platform that bridges
the gap between those seeking mental health assistance and the relevant professionals capable of
providing specialized support.
This initiative introduces a novel approach to addressing mental health challenges among students
and young professionals by leveraging technology to create a direct connection between those in
need and mental health experts within their localities. The dataset serves as a valuable resource,
offering personalized and localized assistance to individuals grappling with mental health issues.
This tailored approach is instrumental in breaking down barriers to access mental health care,
facilitating prompt intervention and support.
Our project's innovation lies not only in the aggregation of this unique dataset but also in its
practical application. By providing a platform that matches individuals with suitable psychiatrists
based on their specific needs and geographical location, we empower students and young
professionals to seek timely and appropriate mental health support. This tailored matching system
enhances the efficiency and effectiveness of mental health care delivery, ensuring that individuals
receive the help they need precisely when and where they need it the most.
We hope that our work contributes to helping individuals cope with mental health issues and lead
healthier lives. It's crucial for people to understand the potential consequences of mental illness
and prioritize mental well-being, just as they do for their physical health.
Future Scope
Marketing Strategies:
To introduce our revolutionary mental health awareness app, we will implement a strategic
marketing plan centered on compassion, accessibility, and innovation. Emphasizing the
user-friendly interface and the app's comprehensive resources, our strategy will focus on social media
campaigns targeting specific demographics. These campaigns will highlight features like mood
tracking, guided meditation, and expert insights.
Collaborations with mental health influencers and testimonials will showcase the positive impact
of our app, while free trial periods and community engagement initiatives will encourage
widespread adoption. Our goal is to redefine self-care and destigmatize mental health by making
comprehensive support readily available to all individuals at the touch of a button. Together, let's
champion mental health awareness and reshape how society perceives and addresses mental
well-being.
References
[1] Ahmed, Arfan, et al. "Overview of the role of big data in mental health: A scoping
review." Computer Methods and Programs in Biomedicine Update (2022): 100076.
[2] Rubeis, Giovanni. "iHealth: The ethics of artificial intelligence and big data in mental
healthcare." Internet Interventions 28 (2022): 100518.
[3] Marsch, Lisa A. "Digital health data-driven approaches to understand human behavior."
Neuropsychopharmacology 46.1 (2021): 191-196.
[4] Rosenfeld, Ariel, et al. "Big Data analytics and artificial intelligence in mental
healthcare." Applications of Big Data in Healthcare. Academic Press, 2021. 137-171.
[5] Tate, Ashley E., et al. "Predicting mental health problems in adolescence using machine
learning techniques." PloS one 15.4 (2020): e0230389.
[6] Liang, Yunji, Xiaolong Zheng, and Daniel D. Zeng. "A survey on big data-driven digital
phenotyping of mental health." Information Fusion 52 (2019): 290-307.
[7] Shatte, Adrian BR, Delyse M. Hutchinson, and Samantha J. Teague. "Machine learning
in mental health: a scoping review of methods and applications." Psychological medicine 49.9
(2019): 1426-1448.
[8] Cho, Gyeongcheol, et al. "Review of machine learning algorithms for diagnosing
mental illness." Psychiatry investigation 16.4 (2019): 262.
[9] Sumathi, M. R., and B. Poorna. "Prediction of mental health problems among children
using machine learning techniques." International Journal of Advanced Computer Science and
Applications 7.1 (2016).
[10] Kruse, Clemens Scott, et al. "Challenges and opportunities of big data in healthcare:
a systematic review." JMIR medical informatics 4.4 (2016): e5359.