Dsai Report
MENTAL HEALTH CONDITION MONITORING
FROM SOCIAL MEDIA
DATA SCIENCE IN AI FOR HEALTHCARE PROJECT
Group Members:
KARIMI UPENDRA B121023
KETHAVATH RAJA NAYAK B121024
THUTTA JAHNAVI B121067
Submitted to:
Bhubaneswar – 751003
Contents
1. Abstract
2. Project Description
2.1 Project Title
2.2 Project Objective
2.3 Introduction
3. Materials and Methods
3.1 Dataset
3.2 Preprocessing
3.3 Word Embedding
3.3.1 TF-IDF (Term Frequency and Inverse Document Frequency)
3.3.2 Bag of Words (BoW)
3.4 Traditional ML Models
3.5 Deep Learning Model
3.6 Evaluation Metrics
4. Experimental Results
4.1 Word Cloud
5. Discussion
6. Conclusion
References
1.Abstract
Mental health detection from social media platforms has emerged as a vital area of research,
offering insights into individuals' psychological well-being and facilitating early intervention.
In this study, we explore the task of detecting mental health-related content from Reddit
posts. Leveraging a dataset curated by Turcan and McKeown, we employed both traditional
machine learning algorithms and deep learning techniques to develop accurate classifiers.
Our experiments revealed promising results across various methodologies. Among traditional
machine learning algorithms, logistic regression and support vector machines (SVM)
demonstrated competitive performances, achieving accuracies of 74.83% and 73.85%,
respectively. These findings underscore the efficacy of conventional approaches in handling
textual data for mental health detection tasks.
Our study contributes to the growing body of research aimed at leveraging computational
methods for mental health analysis. The findings suggest that both traditional and deep
learning techniques hold promise for automating the detection of mental health-related
content on social media platforms, thereby facilitating timely interventions and support for
individuals in need.
2.Project Description
2.1.Project Title
MENTAL HEALTH CONDITION MONITORING FROM SOCIAL MEDIA
2.2.Project Objective
The primary objective of this project is to develop a predictive model that assesses the
mental health status of individuals from natural language input, such as the text of social
media posts. By analysing linguistic patterns, sentiment, and other relevant features, the
model aims to predict an individual's emotional state and so provide insight into their
current mental well-being. Anticipating emotions in this way can strengthen mental health
support: our goal is to help people take proactive measures to care for their mental
well-being, leading to an enhanced quality of life and a stronger emotional foundation.
2.3.Introduction
In today's fast-paced world, stress, anxiety, and burnout are increasingly common.
Individuals who suffer from anxiety and depression frequently express their views and ideas
on social media, and several studies have found that people who are experiencing stress can
be identified by analysing their social media posts. However, finding and comprehending
patterns of stress is a challenging task. It is therefore essential to develop a machine
learning system for the automated early detection of depression, or of any abrupt change in
a user's behaviour, by analysing their posts on social media. In this report, we propose a
methodology based on experimental research for building a stress detection system using
publicly available Reddit datasets, word-embedding approaches such as TF-IDF and
Word2Vec for text representation, and hybrid deep learning and machine learning algorithms
for classification.
Research shows that promoting happiness and well-being can lead to numerous benefits,
including improved mental health, better relationships, and increased productivity. The
project aims to address this need by offering practical guidance, motivational content, and a
sense of community to support individuals in their journey toward greater happiness and
fulfilment.
3.Materials and Methods
3.1.Data
In this study, we acquired data from multiple sources to construct a comprehensive
dataset for the task of detecting mental health-related content from Reddit posts. Initially, we
examined four research papers to identify potential datasets suitable for our objectives.
Ultimately, the dataset obtained from the research paper authored by Elsbeth Turcan and
Kathleen McKeown was selected for its relevance and suitability.
The dataset provided by Turcan and McKeown consisted of 3000 rows and 116 columns,
containing a wealth of information extracted from Reddit posts. These features encompassed
various textual, temporal, and user-related attributes, offering a rich foundation for our
analysis.
To augment our dataset and enhance its diversity, we explored additional sources, including
Kaggle. While we discovered another dataset on Kaggle, it lacked labels for the posts,
presenting a challenge for supervised learning tasks. To address this limitation, we employed
K-means clustering to automatically categorize the posts based on their textual content.
Subsequently, we manually verified the clustering results for approximately 100-150 posts to
ensure accuracy.
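The clustering-based labelling step can be sketched as follows. This is only a minimal illustration, assuming scikit-learn's `TfidfVectorizer` and `KMeans` with two clusters; the sample posts, the number of clusters, and the helper name `cluster_posts` are hypothetical and are not taken from the project code.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_posts(posts, n_clusters=2, seed=0):
    """Group unlabeled posts into n_clusters using TF-IDF features and K-means."""
    vectorizer = TfidfVectorizer(stop_words="english")
    features = vectorizer.fit_transform(posts)  # sparse matrix, one row per post
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return kmeans.fit_predict(features)  # one cluster id per post


# Hypothetical sample posts, for illustration only.
posts = [
    "I feel anxious and overwhelmed by deadlines at work",
    "Work deadlines make me so anxious and stressed",
    "We had a lovely picnic at the beach today",
    "Sunny beach picnic with friends today was lovely",
]
cluster_ids = cluster_posts(posts)
for post, cluster_id in zip(posts, cluster_ids):
    print(cluster_id, post)
```

Cluster ids produced this way are arbitrary integers, not labels. Each cluster still has to be inspected and mapped to a category by hand, which is the role of the manual verification described above.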
The manual verification process involved assessing the assigned labels against the content of
the posts to ascertain their alignment with mental health-related themes. Through this
meticulous validation process, we verified the correctness of the assigned labels, thereby
enhancing the reliability of our dataset.
By integrating data from multiple sources, employing rigorous labelling approaches, and
leveraging insights from existing research, we curated a robust dataset tailored to our specific
research objectives. This comprehensive dataset served as the foundation for training and
evaluating our machine learning models, enabling us to effectively address the task of mental
health detection from Reddit posts.
3.2.Preprocessing
We focused our analysis on a subset of the dataset: two of the 116 available columns.
These are the text of the social media post and a binary label indicating whether or not
the post implies a stressed state. The text was preprocessed using standard Natural
Language Processing techniques. Our preprocessing pipeline entailed the following steps:
URL Removal: Eliminating any URLs present in the text to remove irrelevant
information.
HTML Tag Removal: Stripping off HTML tags, as they do not contribute to the textual
content.
Digits and Single Characters Removal: Omitting digits and single characters to
enhance readability and focus on meaningful words.
Punctuation Removal: Discarding punctuation marks to isolate words for analysis.
Tokenization and Stop Word Removal: Splitting the text into individual words
(tokens) and removing common English stop words, such as "the," "is," and "and," to
eliminate noise.
Word Stemming: Applying the Porter Stemmer algorithm to reduce words to their root
form, thereby consolidating similar words and reducing redundancy.
Recomposition: Reassembling the processed words into a coherent, cleaned-up text.
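The steps above can be sketched in Python as follows. This is a minimal sketch rather than the project's actual code: the regular expressions, the short `STOP_WORDS` set, the function name `clean_text`, and the sample post are all assumptions made for illustration. Stemming uses NLTK's `PorterStemmer` when NLTK is installed and is skipped otherwise.

```python
import re
import string

try:
    from nltk.stem import PorterStemmer  # Porter stemmer; needs no corpus download
    _stem = PorterStemmer().stem
except ImportError:
    # Assumption: if NLTK is unavailable, skip stemming rather than fail.
    def _stem(word):
        return word

# Small illustrative stop-word set (an assumption; a full English stop-word
# list would normally be used instead).
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to", "in", "at",
              "am", "so", "about"}

# Translation table that maps every punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Apply the preprocessing steps in order and return the cleaned string."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URL removal
    text = re.sub(r"<[^>]+>", " ", text)                  # HTML tag removal
    text = re.sub(r"\d+", " ", text)                      # digit removal
    text = re.sub(r"\b\w\b", " ", text)                   # single-character removal
    text = text.translate(_PUNCT_TABLE)                   # punctuation removal
    tokens = text.lower().split()                         # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    tokens = [_stem(t) for t in tokens]                   # stemming
    return " ".join(tokens)                               # recomposition


# Hypothetical sample post, for illustration only.
sample = "<p>I am SO stressed about the 3 exams!</p> Details at https://example.com/help"
cleaned = clean_text(sample)
print(cleaned)
```

URLs and HTML tags are removed before punctuation so that their delimiters (`://`, `<`, `>`) are still present when the corresponding patterns are applied.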
3.3.Word Embedding
3.3.1. TF-IDF
TF-IDF is a widely used technique in natural language processing (NLP) for converting
textual data into numerical vectors. It represents the importance of a word in a document
relative to a corpus of documents. TF-IDF is calculated as the product of two components:
Term Frequency (TF) and Inverse Document Frequency (IDF).
$$\text{TF-IDF} = \text{TF} \times \text{IDF} \tag{1}$$
Let $D$ denote a set of documents and $d$ a single document, $d \in D$. Each document is a
sequence of words $w$, and $n_w(d)$ is the number of times word $w$ occurs in document $d$.
The size of document $d$ is $|d| = \sum_{w \in d} n_w(d)$, so the term frequency is:
$$\text{TF}(w, d) = \frac{n_w(d)}{|d|} \tag{2}$$
The frequency at which a word appears in the document is expressed in Equation (2).
IDF, the second component of TF-IDF, measures how rare a word is across the corpus, based
on the number of documents in which that word appears:
$$\text{IDF}(w, D) = \log \frac{|D|}{|\{d \in D : w \in d\}|} \tag{3}$$
The TF-IDF for word $w$ associated with document $d$ and corpus $D$ can be calculated as:
$$\text{TF-IDF}(w, d, D) = \text{TF}(w, d) \times \text{IDF}(w, D) \tag{4}$$
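As a worked illustration, the sketch below computes TF-IDF weights for a tiny tokenized corpus. It assumes the textbook definitions, with term frequency taken as the word count divided by the document length and IDF taken as the logarithm of the number of documents divided by the number of documents containing the word. The corpus and the function name `tf_idf` are hypothetical. Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their values will differ from these.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: weight} dict per document in a tokenized corpus."""
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)   # occurrences of each word in this document
        size = len(doc)         # document size (total number of tokens)
        weights.append({
            word: (count / size) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical corpus of already-preprocessed (stemmed) tokens.
corpus = [
    ["feel", "stress", "exam"],
    ["feel", "happi", "today"],
    ["exam", "stress", "stress", "deadlin"],
]
weights = tf_idf(corpus)
for doc_weights in weights:
    print({word: round(value, 3) for word, value in doc_weights.items()})
```

In the second document, "happi" occurs in only one document and so receives a higher weight than "feel", which occurs in two of the three documents.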
3.3.2. Bag of Words (BoW)
The Bag of Words (BoW) model is a fundamental technique in Natural Language Processing
(NLP) for converting textual data into numerical vectors. It represents the occurrence of
words in a document without considering the order in which they appear.
Tokenization: The first step in the Bag of Words model involves tokenization, where the text
is split into individual words or tokens.
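A small sketch may make the representation concrete. It tokenizes on whitespace, then builds a sorted vocabulary and one count vector per document; the vocabulary-building and counting steps are the standard continuation of the model and are assumed here, since only tokenization is described above. The function name `bag_of_words` and the sample documents are hypothetical.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]      # tokenization
    vocabulary = sorted({word for doc in tokenized for word in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])   # 0 if absent
    return vocabulary, vectors


# Hypothetical preprocessed documents, for illustration only.
docs = ["stress exam stress", "exam today"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['exam', 'stress', 'today']
print(vectors)     # [[1, 2, 0], [1, 0, 1]]
```

Because only counts are kept, "stress exam stress" and "stress stress exam" map to the same vector, which is what ignoring word order means in practice.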
After training and evaluation, the Logistic Regression and SVM classifiers achieved the highest
accuracy among the seven traditional models, at 74.83% and 73.85% respectively, in predicting
the mental health-related category of Reddit posts. This indicates that these two models are
effective at capturing the underlying patterns in the TF-IDF weighted features.
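A comparison of this kind can be sketched as follows. The sketch assumes scikit-learn, a linear-kernel `SVC`, and k-fold cross-validation, with the standard deviation taken across folds. The report does not state the SVM kernel, the validation scheme, or how the standard deviation was obtained, so all of these are assumptions, and the toy data and the helper name `evaluate` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def evaluate(model, texts, labels, folds=2):
    """Return the mean and standard deviation of cross-validated accuracy."""
    # The vectorizer sits inside the pipeline so it is refit on each training fold.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores = cross_val_score(pipeline, texts, labels, cv=folds, scoring="accuracy")
    return scores.mean(), scores.std()


# Hypothetical toy data (1 = stressed, 0 = not stressed), for illustration only.
texts = [
    "i cannot sleep because of exam stress",
    "deadlines at work make me anxious",
    "i feel overwhelmed and stressed every day",
    "so much pressure and anxiety lately",
    "had a relaxing walk in the park",
    "enjoyed a nice dinner with friends",
    "watched a fun movie this weekend",
    "the weather was lovely for a picnic",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

models = [
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
    ("SVM", SVC(kernel="linear")),
]
results = {name: evaluate(model, texts, labels) for name, model in models}
for name, (mean_acc, std_acc) in results.items():
    print(f"{name}: accuracy={mean_acc:.2f}, std={std_acc:.3f}")
```

Keeping the vectorizer inside the pipeline means the TF-IDF vocabulary is learned from the training folds only, which avoids leaking information from the held-out fold into the features.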
3.4.Evaluation Metrics
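To evaluate the performance of the models in classifying post content as stressed or not
stressed, we used common evaluation metrics, with a focus on the number of false-positive
(FP) and false-negative (FN) classifications obtained from the confusion matrix. The
performance metrics used were Accuracy, Precision, Recall, and Standard Deviation. With TP
and TN denoting the true-positive and true-negative counts, the first three were calculated
as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$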
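To make the definitions concrete, the sketch below derives accuracy, precision, and recall from the four confusion-matrix counts using their standard formulas. The label vectors and function names are hypothetical and serve only as a worked example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # no predicted positives
        "recall": tp / (tp + fn) if tp + fn else 0.0,      # no actual positives
    }


# Hypothetical labels (1 = stressed, 0 = not stressed), for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)  # {'accuracy': 0.625, 'precision': 0.6, 'recall': 0.75}
```

Here TP = 3, TN = 2, FP = 2, and FN = 1, so the two false positives lower precision (3/5) while the single false negative lowers recall (3/4).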
4.Experimental Results
The LSTM model achieved an accuracy of 78.01% using bag-of-words features, and the
Logistic Regression model achieved an accuracy of 74.83% using TF-IDF features. The
following results were obtained:
Table 4.1: Results of traditional ML models
Classifier            Accuracy (%)   Precision (%)   Recall (%)   Standard Deviation
Logistic Regression   74.83          79.95           73.57        0.030
KNN                   64.48          72.36           63.72        0.20
Random Forest         68.95          82.11           66.01        0.036
Decision Tree         60.28          46.61           66.41        0.021
Naïve Bayes           54.55          68.56           54.76        0.020
AdaBoost              66.99          69.11           67.64        0.022
SVM                   73.85          77.51           73.33        0.006
4.1.Word Cloud
Word clouds are widely used in NLP to visualize the most important and frequent words in a
textual corpus. Here, we used a word cloud to visualize the most frequent words in the Reddit
dataset, shown below.
5.Discussion
The project focused on detecting mental health-related content from Reddit posts using
machine learning and deep learning techniques. By leveraging a diverse dataset and
employing various models such as logistic regression, support vector machines, and LSTM
networks, the study aimed to automate the identification of mental health discussions on
social media platforms.
The findings of the project highlighted the effectiveness of both traditional machine learning
algorithms and deep learning architectures in classifying mental health-related content. The
logistic regression and SVM models demonstrated competitive performance (74.83% and
73.85% accuracy), while the LSTM model achieved the highest accuracy (78.01%), suggesting
that it captures more of the nuanced linguistic patterns inherent in mental health discussions.
6.Conclusion
In conclusion, this project signifies the pivotal role of machine learning and deep learning
techniques in automating the detection of mental health-related content on social media
platforms. By leveraging diverse datasets and employing a range of models, including logistic
regression, support vector machines, and LSTM networks, the study showcases promising
results in accurately identifying and flagging mental health discussions. The findings not only
highlight the effectiveness of these computational methods but also underscore their potential
impact in facilitating early interventions, destigmatizing mental health discourse, and
ultimately improving the well-being of individuals in online communities. This research
contributes to the ongoing efforts to harness technology for addressing societal challenges,
particularly in the domain of mental health, and underscores the importance of
interdisciplinary collaborations between computer science, psychology, and healthcare
domains.
References
Buddhitha P and Inkpen D (2023) Multi-task learning to detect suicide ideation and mental
disorders among social media users. Front. Res. Metr. Anal. 8:1152535. doi:
10.3389/frma.2023.1152535
Saylam, B.; İncel, Ö.D. Multitask Learning for Mental Health: Depression, Anxiety, Stress
(DAS) Using Wearables. Diagnostics 2024, 14, 501.
https://doi.org/10.3390/diagnostics14050501
Victor Ruiz, Lingyun Shia, Jorge Guerra, Wei Quan, Neal Ryan, Candice Biernesser, David
Brent, and Fuchiang Tsui. CLPsych2019 Shared Task: Predicting Users’ Suicide Risk Levels
from Their Reddit Posts on Multiple Forums.