Suicidal Thought Detection Using NLP (Natural Language Processing) on Reddit Data
Abstract—Our study applies NLP to the identification of suicidal ideation in text data. In the realm of mental health, identifying suicidal ideation at an early stage is of paramount importance for suicide prevention efforts. This paper presents an approach to suicide ideation detection using Natural Language Processing (NLP) techniques. Faced with a dearth of publicly available datasets for this critical task, we curated a dataset from the "SuicideWatch" and "depression" subreddits on the Reddit platform, collected via the Pushshift API. Specifically, we employ an LSTM and a Random Forest classifier separately and obtain promising results. This work not only advances NLP-based suicide ideation detection but also contributes a dataset for future investigations, potentially saving time and resources for researchers and professionals dedicated to the prevention of suicide and the improvement of mental health. We achieved up to 93% accuracy in suicidal thought analysis using NLP techniques.

Index Terms—NLP, Reddit, Pushshift, API, LSTM, Random Forest Classifier.

I. INTRODUCTION
In the contemporary digital landscape, social networking sites have emerged as a global agora, reshaping communication dynamics and offering unprecedented insights into human emotions [1]–[4]. Among these platforms, Reddit stands out as a rich source for sentiment research due to its diverse subreddits, which foster candid discussions, including those related to mental health [5]. Addressing the pressing issue of suicide, this research employs Natural Language Processing (NLP) techniques, specifically leveraging the TF-IDF model, to discern between posts expressing suicidal ideation and non-suicidal sentiments. With over 47,000 lives lost annually to suicide in the United States alone, understanding and preventing suicidal thoughts is critical [6].

This study aims to contribute to mental health research by utilizing Reddit data for sentiment analysis. Section 2 reviews related studies on suicidal thought prediction, while Section 3 outlines the methodology, emphasizing the use of NLP approaches. In Section 4, the paper details the implementation of the TF-IDF model. Section 5 discusses results, highlighting the significance of NLP in addressing mental health challenges through social media analysis. In conclusion (Section 6), the research underscores the value of this approach for suicide prevention initiatives in the digital age.

II. RELATED WORKS
In [7], TF-IDF and BoW are used in tandem to differentiate between positive and negative tweets. The authors found that using the TF-IDF vectorizer greatly improves the precision of sentiment analysis, and their simulation results demonstrate the efficiency of the suggested method. Using this NLP approach, they achieved 85.25% accuracy in sentiment analysis. In [8], the authors provided an automated conversational platform that was used to identify depression-related risks as a preliminary strategy. The platform was designed to interpret conversations through NLP and machine learning. The initial phase of the suggested two-phased platform would examine a discussion and sort the related emotions into four categories: "happy," "neutral," "depressive," and "suicidal." In [9], the authors suggested a classification model that combines deep neural networks, Bi-LSTM, CNN, and self-attention, which they demonstrated on several datasets. Furthermore, they compare three pre-trained word embeddings for word encoding. The promising results achieved on state-of-the-art datasets allowed them to test the model's validity and examine the best word embeddings to use for emotion identification. They suggest their model as a starting point for further research on the issue, given the significance of deep learning in the academic world.

In [10], the authors provided the findings of a thorough mapping study to organize the available published information. They looked for studies completed between 2015 and 2020 in electronic research databases of the scientific literature using a
Authorized licensed use limited to: Zhejiang University. Downloaded on December 21,2024 at 10:34:22 UTC from IEEE Xplore. Restrictions apply.
8. Generate a word cloud:
• Generate a word cloud visualization to gain insights into the most frequent words in the dataset.
Step 6: Data Splitting and Label Encoding
9. Split the data:
• Split the data into training and testing sets.
10. Encode labels:
• Encode the class labels ("Suicide" and "Not Suicide") using LabelEncoder.
Step 7: Word Embedding with Pretrained Vectors
11. Load pretrained word embeddings:
• Load pretrained word embeddings (GloVe) to create an embedding matrix for words in the dataset.
Step 8: Model Building and Training
12. Define the model architecture:

• Remove the 'Unnamed: 0' column if present.
• Check and print basic information about the dataset.
Step 4: Data Visualization
• Visualize the class distribution using a countplot.
Step 5: Text Length Analysis
• Calculate the length (in terms of words) of each text.
• Analyze the text length statistics, including quantiles.
Step 6: Filter Texts Based on Length
• Remove texts with a length exceeding a certain threshold (317 words in this case).
• Visualize the class distribution after filtering.
Step 7: Word Frequency Analysis
• Tokenize and count the frequency of words in the text data.
• Filter out rare and common words based on quantiles
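Several of the steps listed above (length filtering, word frequency analysis, label encoding, and data splitting) can be illustrated with a short sketch. This is not the implementation used in this study: it uses only the Python standard library, the helper names (filter_by_length, word_frequencies, keep_mid_frequency, encode_labels, split_rows) are hypothetical, the toy posts are invented for illustration, and fixed count cut-offs stand in for the quantile bounds, whose values are not given here. Loading GloVe vectors and building the model are omitted.

```python
import random
from collections import Counter

MAX_WORDS = 317  # length threshold quoted in the steps above


def filter_by_length(rows, max_words=MAX_WORDS):
    """Keep (text, label) pairs whose text has at most max_words whitespace tokens."""
    return [(text, label) for text, label in rows if len(text.split()) <= max_words]


def word_frequencies(rows):
    """Count how often each lower-cased token occurs across all texts."""
    counts = Counter()
    for text, _ in rows:
        counts.update(text.lower().split())
    return counts


def keep_mid_frequency(counts, min_count, max_count):
    """Drop rare and very common words. The cut-offs are assumptions that
    stand in for the quantile-based bounds mentioned in the steps."""
    return {word for word, c in counts.items() if min_count <= c <= max_count}


def encode_labels(labels):
    """Map class names to integers in sorted order, which is the ordering
    scikit-learn's LabelEncoder also uses."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return [index[label] for label in labels], classes


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle reproducibly, then hold out test_fraction of the rows for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy data, invented for illustration only.
rows = [
    ("i feel hopeless and alone", "Suicide"),
    ("great day at the park", "Not Suicide"),
    ("word " * 400, "Not Suicide"),  # 400 tokens, exceeds the threshold
    ("i feel fine today", "Not Suicide"),
    ("alone again and hopeless", "Suicide"),
]

kept = filter_by_length(rows)
counts = word_frequencies(kept)
vocabulary = keep_mid_frequency(counts, min_count=2, max_count=2)
encoded, classes = encode_labels([label for _, label in kept])
train_rows, test_rows = split_rows(kept, test_fraction=0.25)
```

With sorted class names, "Not Suicide" maps to 0 and "Suicide" maps to 1. In practice the cut-offs passed to keep_mid_frequency would be derived from the observed frequency distribution rather than fixed by hand.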
Fig. 2. Dataset Used for Suicidal Thought Detection

B. Data pre-processing
1) Data Extraction: The collected data is extracted from the API and organized for further analysis.
2) Removal of E-mails and Conversion to Lower-Case: The text dataset was first preprocessed by removing e-mail addresses and converting the entire dataset to lowercase.
3) Tokenization: Text data is tokenized into individual words or tokens to facilitate subsequent processing.

C. Feature Extraction
In a departure from the conventional TF-IDF method, our feature extraction process employed a custom approach to enhance the interpretability and relevance of features. Specifically, we crafted binary features based on the frequency of selected words within the text data. This method involved encoding the 'class' column, assigning 'suicide' a numerical value of 1 and 'non-suicidal' a value of 0. Subsequently, we calculated the Pearson correlation coefficient between the binary features and the target 'class'. This feature extraction technique aimed to capture the distinctive patterns associated with suicidal and non-suicidal posts, providing a nuanced representation of the dataset. The binary features generated through this process were then used for training and evaluating machine learning models. This tailored methodology can reveal relationships between specific word occurrences and the classification of posts, supporting a better understanding of the dataset and improved model performance.

Fig. 6. Separate Variables for Words Based on Their Presence

E. Classifier Model
In this study, we investigate the performance of the Random Forest classifier and LSTM (Long Short-Term Memory) neural networks as two different classifier models for detecting suicidal thoughts. We also applied a Bi-LSTM in this research.
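The pre-processing and feature extraction described in Sections B and C (e-mail removal, lower-casing, tokenization, binary word-presence features, and their Pearson correlation with the encoded class) can be sketched as follows. This is a minimal illustration under stated assumptions and not the code used in this study: the regular expressions, the helper names (preprocess, binary_feature, pearson), and the toy posts are all hypothetical.

```python
import math
import re

# Assumption: an e-mail address is any whitespace-free run containing "@".
EMAIL_RE = re.compile(r"\S+@\S+")


def preprocess(text):
    """Remove e-mail addresses, lower-case the text, and split it into word tokens."""
    text = EMAIL_RE.sub(" ", text).lower()
    return re.findall(r"[a-z']+", text)


def binary_feature(token_lists, word):
    """Return 1 for each post that contains the word and 0 otherwise."""
    return [1 if word in tokens else 0 for tokens in token_lists]


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Toy posts, invented for illustration only.
posts = [
    ("I want to die, mail me at a@b.com", "suicide"),
    ("Lovely weather today", "non-suicidal"),
    ("I just want to die", "suicide"),
    ("I want pizza today", "non-suicidal"),
]

token_lists = [preprocess(text) for text, _ in posts]
# Encode the class as in Section C: 'suicide' -> 1, 'non-suicidal' -> 0.
target = [1 if label == "suicide" else 0 for _, label in posts]

correlations = {
    word: pearson(binary_feature(token_lists, word), target)
    for word in ("die", "today", "want")
}
```

On this toy data, "die" correlates perfectly with the suicidal class, "today" correlates perfectly with the non-suicidal class, and "want" is only weakly informative. Words could then be ranked by the absolute value of their correlation to choose which binary features to keep.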
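The classifiers are compared in Section V on accuracy, precision, recall, and F1 score. As a minimal sketch of how these four metrics follow from a binary confusion matrix, the code below defines them in plain Python. The function names and the confusion-matrix counts are hypothetical. Only the precision and recall values in the final check are taken from Table I.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 for the positive (suicidal) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f1_from(precision, recall)


# Hypothetical counts, chosen only to illustrate the calculation.
accuracy, precision, recall, f1 = classification_metrics(tp=90, fp=5, fn=10, tn=95)

# Consistency check on the precision and recall reported in Table I.
reported = {
    "Bi-LSTM": (0.9502, 0.9050, 92.71),
    "LSTM": (0.9473, 0.9026, 92.44),
    "Random Forest Classifier": (0.8615, 0.7673, 81.17),
}
recomputed_f1 = {name: 100 * f1_from(p, r) for name, (p, r, _) in reported.items()}
```

The recomputed F1 scores agree with the tabulated ones to within rounding of the published precision and recall values.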
V. RESULTS & DISCUSSION
This section presents the results of our experiments and analyzes the findings obtained by using the LSTM, Bi-LSTM and Random Forest Classifier models to detect suicidal ideation in text data. We first give a brief assessment of each model's performance in terms of accuracy, precision, recall, and F1 score, followed by a discussion of the implications of these findings.

The Bi-LSTM model performs best overall, with an accuracy of 92.88%. This statistic represents the proportion of correctly classified cases in the test dataset. The model's precision of 95.02% demonstrates its ability to correctly classify posts as expressing suicidal ideation while limiting false positives. A recall of 90.50% demonstrates the model's ability to recognize true suicidal ideation posts, limiting false negatives. The F1 score of 92.71% reflects a good balance between precision and recall, indicating a solid and well-rounded model. We attribute the Bi-LSTM model's performance to its capacity to recognize sequential dependencies within textual input. By leveraging Bidirectional Long Short-Term Memory (Bi-LSTM) units, the model evaluates the context and complex patterns in the text, allowing for accurate predictions.

Fig. 7. Accuracy Curve

While the Random Forest Classifier achieves a decent accuracy of 83.75%, there are significant discrepancies when compared to the Bi-LSTM model. The precision of 86.15% indicates that the classifier has a moderately strong ability to limit false positives. The recall of 76.73%, on the other hand, indicates a modest ability to detect true suicidal ideation posts, resulting in a lower F1 score of 81.17%.

The Random Forest Classifier works on the ensemble learning concept, combining predictions from numerous decision trees. While it is useful for a variety of tasks, its efficacy in identifying complicated sequential relationships within text data may be limited when compared to Bi-LSTM. This could account for the differences in recall and overall F1 score.

The performance metrics obtained for the Bi-LSTM, LSTM and Random Forest Classifier models are summarized in Table I.

TABLE I
COMPARISON BETWEEN PARAMETERS
Model                     Accuracy  Precision  Recall  F1 Score
Bi-LSTM                   92.88%    95.02%     90.50%  92.71%
LSTM                      92.62%    94.73%     90.26%  92.44%
Random Forest Classifier  83.75%    86.15%     76.73%  81.17%

Table II outlines the feature extraction, machine learning, and embedding techniques, along with the deep learning algorithms, applied in different studies. The first entry employs TF-IDF for feature extraction, SVM for machine learning, and Word2Vec for embedding, with LSTM and CNN as deep learning algorithms, achieving an accuracy of 90.3% as reported in [5]. The second study utilizes TF-IDF, LIWC and Sentiment for feature extraction and a range of machine learning algorithms such as RF, SVM, LR, and ZeroR, resulting in an accuracy of 92%, as documented in [13]. Another approach combines TF-IDF, N-Gram, and LIWC for feature extraction with various machine learning algorithms, reaching an accuracy of 73.6%, as reported in [9]. Finally, the current research employs binary feature correlation for feature extraction, RF for machine learning, and Word2Vec and GloVe for embedding, with Bi-LSTM as the deep learning algorithm, achieving an accuracy of 93%.

TABLE II
COMPARISON WITH PREVIOUS WORKS
Feature Extraction Techniques  Machine Learning Algorithms  Embedding Techniques  Deep Learning Algorithms  Best Performing Model  Metric & Result  Ref
TF-IDF                         SVM                          Word2Vec              LSTM, CNN                 LSTM-Attention CNN     90.3%            [5]
TF-IDF, LIWC, Sentiment        RF, SVM, LR, ZeroR           NA                    NA                        SVM                    92%              [13]
TF-IDF, N-Gram, LIWC           NB, SVM, KNN, RF             NA                    NA                        NA                     73.6%            [9]
Binary Feature Correlation     RF                           Word2Vec, GloVe       LSTM, Bi-LSTM             Bi-LSTM                93%              This paper

The findings highlight the importance of NLP approaches, particularly LSTM-based models, in detecting suicidal ideation in textual data. The LSTM model's high accuracy, precision, and recall reveal its ability to detect subtle verbal cues suggestive of suicidal ideation.

Future studies could look into combining multiple NLP models, creating hybrid models, or including domain-specific features to improve the efficacy of suicide ideation detection systems. Efforts to reduce the model's false positives and false negatives should also be prioritized, as they have substantial ramifications in real-world applications. Finally, our research highlights the significant prospects of NLP approaches, particularly LSTM models, in the detection of suicidal ideation. While the Random Forest Classifier produces decent results, the capacity of the LSTM model to capture intricate textual patterns is a significant step forward for mental health research and suicide prevention initiatives.

VI. CONCLUSIONS
In a world dominated by the silent battle of countless people dealing with suicide ideation, our study shines as a beacon of hope and creativity. Our path was distinguished by unwavering determination, driven by a desire to apply NLP to detect, comprehend and assist people in need. The
scope of our investigation immediately highlighted a daunting challenge: the scarcity of publicly available datasets necessary for our work. In response, we diligently curated a one-of-a-kind dataset drawn from the candid and heartfelt expressions published on the Reddit platform's "SuicideWatch" and "depression" subreddits. This dataset, which spans over a decade of human emotions, is more than just a resource; it demonstrates our persistent commitment to the goals of suicide prevention and mental health promotion. Our research was built around a variety of Natural Language Processing approaches. We set out on a journey to grasp the complexities of human language, the nuances of emotion, and the terrible reality of suicidal ideation. Our method avoided the one-size-fits-all philosophy, instead evaluating three models: the LSTM (Long Short-Term Memory), the Bi-LSTM, and the Random Forest Classifier. We obtained strong results with these machine learning algorithms. In sentiment analysis, the Bi-LSTM model, which is known for its ability to capture sequential patterns and contextual information, achieved an accuracy of up to 93%. The Random Forest Classifier, on the other hand, also produced encouraging results, with a test accuracy of 83.75%. Finally, this work represents more than just a collection of studies; it represents the indomitable spirit of human compassion and inventiveness. It demonstrates the power of NLP approaches, as shown by the LSTM and Random Forest Classifier, in addressing today's most important concerns. As we embark on an unknown future, may our work serve as a spark for greater inquiry, collaboration, and the creation of enhanced suicide prevention measures.

REFERENCES
[1] T. Nasukawa and J. Yi, “Sentiment analysis: Capturing favorability using natural language processing,” in Proceedings of the 2nd International Conference on Knowledge Capture, pp. 70–77, 2003.
[2] A. C. Fernandes, R. Dutta, S. Velupillai, J. Sanyal, R. Stewart, and D. Chandran, “Identifying suicide ideation and suicidal attempts in a psychiatric clinical research database using natural language processing,” Scientific Reports, vol. 8, no. 1, p. 7426, 2018.
[3] E. Yeskuatov, S.-L. Chua, and L. K. Foo, “Leveraging Reddit for suicidal ideation detection: A review of machine learning and natural language processing techniques,” International Journal of Environmental Research and Public Health, vol. 19, no. 16, p. 10347, 2022.
[4] M. Guidère, “NLP applied to online suicide intention detection,” in HealTAC 2020, 2020.
[5] A. Rajput, “Natural language processing, sentiment analysis, and clinical analytics,” in Innovation in Health Informatics, pp. 79–97, Elsevier, 2020.
[6] K. Brindha, S. Senthilkumar, A. K. Singh, and P. M. Sharma, “Sentiment analysis with NLP on Twitter data,” in 2022 International Conference on Smart Generation Computing, Communication and Networking (SMART GENCON), pp. 1–5, IEEE, 2022.
[7] E. J. Diniz, J. E. Fontenele, A. C. de Oliveira, V. H. Bastos, S. Teixeira, R. L. Rabêlo, D. B. Calçada, R. M. Dos Santos, A. K. de Oliveira, and A. S. Teles, “Boamente: A natural language processing-based digital phenotyping tool for smart monitoring of suicidal ideation,” in Healthcare, vol. 10, p. 698, MDPI, 2022.
[8] S. B. Hassan, S. B. Hassan, and U. Zakia, “Recognizing suicidal intent in depressed population using NLP: a pilot study,” in 2020 11th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 0121–0128, IEEE, 2020.
[9] M. Polignano, P. Basile, M. de Gemmis, and G. Semeraro, “A comparison of word-embeddings in emotion detection from text using BiLSTM, CNN and self-attention,” in Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization, pp. 63–68, 2019.
[10] M. Kanakaraj and R. M. R. Guddeti, “NLP based sentiment analysis on Twitter data using ensemble classifiers,” in 2015 3rd International Conference on Signal Processing, Communication and Networking (ICSCN), pp. 1–5, IEEE, 2015.
[11] G. Coppersmith, R. Leary, P. Crutchley, and A. Fine, “Natural language processing of social media as screening for suicide risk,” Biomedical Informatics Insights, vol. 10, p. 1178222618792860, 2018.
[12] M. Cusick, P. Adekkanattu, T. R. Campion Jr, E. T. Sholle, A. Myers, S. Banerjee, G. Alexopoulos, Y. Wang, and J. Pathak, “Using weak supervision and deep learning to classify clinical notes for identification of current suicidal ideation,” Journal of Psychiatric Research, vol. 136, pp. 95–102, 2021.
[13] A. E. Aladağ, S. Muderrisoglu, N. B. Akbas, O. Zahmacioglu, and H. O. Bingol, “Detecting suicidal ideation on forums: proof-of-concept study,” Journal of Medical Internet Research, vol. 20, no. 6, p. e9840, 2018.