Stress Detection Using Natural Language Processing and Machine Learning
ORG
ABSTRACT:
Many users now post tweets on social media platforms that reflect their mental condition and the things
that happen in their day-to-day lives. It is very important to detect and manage stress before it
becomes a severe problem. A huge number of informal messages are posted every day on social networking
sites, blogs and discussion forums. This paper describes an approach to detecting stress using information
from social networking sites such as Twitter. The project covers data collection,
data cleaning, model training, and the prediction of stressed and non-stressed users. It uses
Natural Language Processing (NLP) and Machine Learning algorithms, including KNN, Naïve Bayes
(BernoulliNB), Random Forest, Decision Tree and SVM. Psychological stress is threatening people’s health, and
it is non-trivial to detect stress in time for proactive care. With the popularity of social media, people are used to
sharing their daily activities and interacting with friends on social media platforms, making it feasible to
leverage online social network data for stress detection. In this paper, we find that a user's stress state is closely
related to that of his/her friends on social media, and we employ a large-scale dataset from real-world social
platforms to systematically study the correlation between users' stress states and social interactions. We first define a
set of stress-related textual attributes; the data is then used to train the machine learning algorithms
for better results. Thus the proposed system takes tweets as input and decides whether each one is stressed or
non-stressed.
1. INTRODUCTION:
Social media services such as Facebook, Twitter, and WhatsApp are popular social networking platforms
that users rely on to convey messages and to express their
individual feelings and thoughts to friends and family on different subjects. Social media applications
strongly influence people of all age groups and are changing their lifestyles. Because
users engage with social media continuously, it has become far more feasible to identify a user's psychological
state by gathering their social media messages and communication data in a timely manner and analyzing the content.
Psychological stress is a medical and physical condition that threatens people's health. Stress levels
have been rising year by year, and overwhelming stress sometimes leads to suicidal
ideation. In 2018, the suicide rate in India was about 15.7 per 100,000 people. Stress has become
common in day-to-day life even though it is harmful to human life, so it is important to predict
people's stress levels before stress turns into a severe problem. Some traditional methods of detecting
stress are reactive and have limitations: they are hysteretic (they lag behind the onset of stress), time- and
labor-consuming, and computationally expensive. Given the increased use of social media applications by all age
groups, it is now much more feasible to find the emotional or stress state of a user at an early stage using
machine learning techniques, which compares favorably with the traditional methods.
2. LITERATURE SURVEY:
Many notable contributions have been made in the field of sentiment analysis in the past few
years. Initially, sentiment analysis was proposed as a simple binary classification that assigns evaluations to
two opposing classes. Alexander Pak and Patrick Paroubek [5] came up with a model that categorizes tweets into
three classes: objective, positive and negative. In their research, they started by
building a corpus of collected tweets. They used the Twitter API and
automatically labeled the tweets based on the emoticons used. Using that Twitter corpus, they were able to construct
a sentiment classifier. This classifier was built on Naive Bayes with N-gram and
POS-tag features. One drawback was that the training set was of limited use, since it only
contained tweets with emoticons. The papers [6–10] discuss effective data pre-processing techniques for
social media content, specifically tweets. The data contains elements that occur often in a
sentence but do not contribute to the analysis, such as stop words, symbols and punctuation marks. Removing
these and converting different forms of a word to its base form is an essential step.
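The pre-processing steps described above (removing stop words, symbols and punctuation, then reducing words to a base form) can be sketched roughly as follows. This is a minimal illustration rather than the pipeline of the cited papers: the stop-word list, the `naive_stem` helper and the sample tweet are all assumptions made for the example, and a real system would more likely use an established stemmer or lemmatizer.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "am", "i", "to", "of", "and", "so", "about", "my"}


def naive_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Lowercase, strip URLs, mentions and punctuation, drop stop words, then stem."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop symbols
    tokens = [t for t in text.split() if t.isalpha() and t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


tokens = preprocess("I am SO stressed about my exams!!! http://t.co/xyz #finals")
print(tokens)  # ['stress', 'exam', 'final']
```

The resulting token list is what a classifier would consume after vectorization; the choice to keep hashtag words while dropping mentions is a design decision of this sketch only.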
Sentiment analysis
Apoorv Agarwal et al. [11] proposed a 3-way model for categorizing sentiment into
three classes: positive, negative, and neutral. They tested a unigram model, a feature-based
model, and a tree kernel-based model. In the tree kernel-based
model, tweets were represented in the form of a tree. The feature-based
model took over 100 features into consideration, whereas the unigram model used
about 10,000 features. They concluded that features combining the prior polarity of words with
their parts-of-speech (POS) tags are the most important. In terms of results, the tree kernel-based model
performed better than the other two models. A few researchers have attempted to
classify public opinion about movies, news, etc. from Twitter posts. V.M. Kiran Peddinti et al. [12] used
data from other widely accessible databases such as IMDB and Blippr, after appropriate adaptation, to benefit
Twitter sentiment analysis in the movie domain. Dmitry Davidov et al. [13] proposed a method that uses
Twitter user-defined hashtags in tweets as sentiment labels, with punctuation, single words,
and patterns as distinct feature types. These are then combined into a single feature vector for the task of
sentiment classification. They used the K-Nearest Neighbor approach to assign sentiment labels by
constructing a feature vector for each example in the training and test sets. Tagging [14] has recently
developed as a common way to organize vast and dynamic web content. It usually refers to the act of associating
or assigning some keyword or term to a piece of data. Tagging helps to describe an item and lets it be
found again by browsing. Scholars have established diverse methods and procedures for tagging corpora for
numerous uses. Xiance Si and Maosong Sun [15] offered a flexible and practical technique for
tag recommendation. They modeled documents and tags using the tag-LDA model. Krestel
et al. [16] proposed a method to personalize tag recommendation. Their method
combines a probabilistic model of tags from the resource with tags extracted from the
user. They examined basic language models and, additionally, performed LDA experiments on a
real-world dataset. The dataset was crawled from a large tagging system and showed that personalization
improves tag recommendation. The researchers in [17-27] have made significant contributions
to stress detection and analysis through their innovations in natural language processing and sentiment analysis.
Peters and Neumann [17] introduced deep contextualized word representations in 2018, enhancing the
understanding of stress-related language. Radford and Narasimhan's generative pre-training [18] in the same year
improved language comprehension, enabling more accurate stress detection. Devlin, Chang, Lee, and
Toutanova's BERT [19], presented in 2019, has been instrumental in advancing stress analysis by pre-training deep
bidirectional transformers for language understanding. Jin, Lai, and Cao's work in 2020 applied BERT and
modified TF-IDF for multi-label sentiment analysis, aiding in nuanced stress assessment in text data.
Stress/depression analysis
Arya and Mishra [28] present a review of the application of machine learning in the health sector, covering its
limitations, predictive analysis, challenges in the area, and the need for advanced research and technologies. The
authors reviewed papers on mental stress detection using ML that drew on social networking sites, blogs,
discussion forums, questionnaire techniques, clinical datasets, real-time data, bio-signal technology (ECG, EEG),
wireless devices, and suicidal tendency. The study shows the high potential of ML algorithms in mental health
[28]. Aldarwish et al. used the machine learning algorithms SVM and Naïve Bayes to predict stress from
user-generated content (UGC) on social media sites (Facebook, Twitter, LiveJournal). They used social
interaction stress datasets based on mood and negativism, together with the BDI questionnaire, comprising 6773
textual posts: 2073 depressed and 4700 non-depressed. They achieved an accuracy of 57% with SVM and 63% with
Naïve Bayes. They also emphasized stress detection using big data techniques [29]. Cho G et al. presented
an analysis of ML algorithms for diagnosing mental illness. They studied the properties of mental health conditions,
techniques to identify them, their limitations, and how ML algorithms are implemented. The authors considered
SVM, GBM, KNN, Naïve Bayes and Random Forest, and achieved 75% with the SVM classifier
[30]. Reshma et al. proposed a TensiStrength framework for sentiment analysis on Twitter [31]. The
authors considered SVM, NB, WSD, and n-gram techniques on large social media text for sentiment analysis
and applied a lexicon approach to detect stress and relaxation in a large dataset. The authors achieved 65%
precision and 67% recall. Deshpande and Rao presented an emotion artificial intelligence technique to detect
depression [32]. The authors collected 10,000 tweets using the Twitter API.
3. METHODOLOGY:
Figure 1: SVM
3.2 RANDOM FOREST
Random Forest is a classifier that builds a number of decision trees on various subsets of the given
dataset and combines their outputs to improve predictive accuracy. Instead of relying on one
decision tree, the random forest takes the prediction from each tree and predicts the final output based on
the majority vote of those predictions. In our project, the random forest model used 21 estimators,
that is, 21 decision trees.
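The bagging and majority-vote idea can be illustrated with a small pure-Python sketch. It is not the project's implementation: one-level trees (stumps) stand in for full decision trees, and the toy binary features, labels and helper names are assumptions made for the example. If the project used scikit-learn (an assumption; the text does not name the library), the corresponding call would presumably be `RandomForestClassifier(n_estimators=21)`.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a list (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, feature):
    """One-level 'tree': map each value of one feature to the majority label."""
    fallback = majority(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[feature], []).append(label)
    rule = {value: majority(group) for value, group in by_value.items()}
    return lambda row: rule.get(row[feature], fallback)


def train_forest(rows, labels, n_estimators=21, seed=0):
    """Train n_estimators stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw len(rows) indices with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.randrange(len(rows[0]))  # random feature for this stump
        forest.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest


def predict(forest, row):
    """Final output is the majority vote over the individual tree predictions."""
    return majority([tree(row) for tree in forest])


# Toy binary features: [contains a stress word, contains a relaxation word]
rows = [[1, 0], [0, 1]] * 4
labels = ["stressed", "non-stressed"] * 4
forest = train_forest(rows, labels)
print(len(forest), predict(forest, [1, 0]), predict(forest, [0, 1]))
```

An odd number of estimators such as 21 avoids tied votes in a two-class problem, which may be one reason for that choice.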
3.3 KNN
The K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values of new data points,
which means that a new data point is assigned a value based on how closely it matches the points
in the training set. Its working can be understood through the following steps −
Step 1 − Load the data set. Any implementation needs data, so the first phase of KNN loads the training
data as well as the test data.
Step 2 − Choose the value of K, i.e., the number of nearest data points to consider. K can be any integer.
Step 3 − For each point in the test data, do the following −
− Calculate the distance between the test point and each row of training data using any of the methods
Manhattan, Euclidean or Hamming distance. The most commonly used method for distance calculation is
Euclidean.
− Now, based on the distance value, sort the rows in ascending order.
− Next, choose the top K rows from the sorted list.
− Now, assign a class to the test point based on the most frequent class of
those rows.
Step 4 − End
In our project, we generated KNN models with n = 1, 3, 5, 7 and 9 neighbors.
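The steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the project's code: the two-dimensional toy features (counts of stress words and relaxation words), the labels and the function names are hypothetical, and Euclidean distance is used because the text names it as the most common choice.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_rows, train_labels, query, k):
    # Step 3a: distance from the query point to every training row.
    distances = [
        (euclidean(row, query), label)
        for row, label in zip(train_rows, train_labels)
    ]
    # Step 3b: sort in ascending order of distance.
    distances.sort(key=lambda pair: pair[0])
    # Step 3c: keep the top K rows.
    nearest = [label for _, label in distances[:k]]
    # Step 3d: assign the most frequent class among those rows.
    return Counter(nearest).most_common(1)[0][0]


# Toy features: [number of stress words, number of relaxation words]
train_rows = [[3, 0], [2, 1], [4, 0], [3, 1], [4, 1], [0, 3], [1, 2], [0, 4], [1, 3]]
train_labels = ["stressed"] * 5 + ["non-stressed"] * 4

# Same neighbor counts as in the text: 1, 3, 5, 7, 9.
for k in (1, 3, 5, 7, 9):
    print(k, knn_predict(train_rows, train_labels, [2, 0], k))
```

Odd values of K are a common choice in two-class problems because they rule out tied votes.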
Figure 3: KNN
3.4 DECISION TREES:
Decision Tree is a supervised learning technique that can be used for both classification and regression
problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier, where
internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node
represents the outcome. In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs
of those decisions and do not contain any further branches. The decisions or tests are performed on the basis
of the features of the given dataset.
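To make the node structure concrete, the sketch below hand-builds a tiny tree and walks it from the root to a leaf. The tree, its feature names (`mentions_deadline`, `mentions_sleep_loss`) and its labels are hypothetical and chosen only for illustration; a trained tree would learn its decision rules from the data.

```python
# Hypothetical hand-built tree over binary features (not learned from data).
tree = {
    "feature": "mentions_deadline",            # decision node
    "branches": {
        True: {"label": "stressed"},           # leaf node
        False: {
            "feature": "mentions_sleep_loss",  # decision node
            "branches": {
                True: {"label": "stressed"},       # leaf node
                False: {"label": "non-stressed"},  # leaf node
            },
        },
    },
}


def classify(node, sample):
    """Follow the decision rules from the root until a leaf node is reached."""
    while "label" not in node:
        node = node["branches"][sample[node["feature"]]]
    return node["label"]


sample = {"mentions_deadline": False, "mentions_sleep_loss": False}
print(classify(tree, sample))  # non-stressed
```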
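Bayes' Theorem:
Bayes' theorem, also known as Bayes' Rule or Bayes' law, is used to determine the probability
of a hypothesis given prior knowledge. It depends on conditional probability.
The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) P(A) / P(B)
where P(A|B) is the posterior probability of hypothesis A given the observed evidence B, P(B|A) is the
likelihood of the evidence given the hypothesis, P(A) is the prior probability of the hypothesis, and P(B) is
the marginal probability of the evidence.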
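As a worked example of Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B), the sketch below computes the probability that a tweet is stressed given that it contains a particular word. All numbers are made up for illustration (a prior of 0.3 and word likelihoods of 0.4 and 0.05); they are not results from this project.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """P(A|B) = P(B|A) * P(A) / P(B), with P(B) from the law of total probability."""
    evidence = (
        likelihood_b_given_a * prior_a
        + likelihood_b_given_not_a * (1 - prior_a)
    )
    return likelihood_b_given_a * prior_a / evidence


# A = "tweet is stressed", B = "tweet contains the word 'exhausted'" (hypothetical values)
p = posterior(prior_a=0.3, likelihood_b_given_a=0.4, likelihood_b_given_not_a=0.05)
print(round(p, 3))  # 0.774
```

Here P(B) = 0.4 × 0.3 + 0.05 × 0.7 = 0.155, so the posterior is 0.12 / 0.155 ≈ 0.774: seeing the word raises the probability of the stressed class from 0.3 to about 0.77.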
Figure 7: Accuracy Graph (y-axis: accuracy in percent; x-axis: classifier; bar values shown: 74.06, 71.29,
63.54, 62.54 and 50.586)
5. CONCLUSION:
In today’s world, where the youth in particular, and much of the population in general, suffer from mounting
stress caused by peer pressure, workload or domestic tensions, it is crucial to have a reality
check on how stressed a person really is. For this reason, timely detection and prevention of stress is
a dire need. We have developed this project to assist people in examining the problem of stress. It
will be especially beneficial for those who are not comfortable opening up about their problems to
others. It will help these people get a reality check and may prompt them to reach out and get medical help,
just based on their social interactions. We have used both human and machine learning and applied
the concepts of sentiment analysis. The main characteristics of this system are its non-invasiveness and its
fast implementation in detecting stress compared with previous approaches.
References
1. Liang Y, Zheng X, Zeng DD. A survey on big data-driven digital phenotyping of mental health. Inform
Fusion. 2019;52(1):290–307.
2. Liu B, Zhang L. A survey of opinion mining and sentiment analysis. Boston: Springer US. 2012; p. 415–
463.
3. Munikar M, Shakya S, Shrestha A. Fine-grained sentiment classification using BERT. Artif Intell
Transform Business Society. 2019;2019:1–5. https://doi.org/10.1109/AITB48515.2019.8947435.
4. Wang B, Liu Y, Liu Z, Li M, Qi M. Topic selection in latent Dirichlet allocation, 2014 11th International
Conference on Fuzzy Systems and Knowledge Discovery (FSKD). 2014. p. 756–760.
https://doi.org/10.1109/FSKD.2014.6980931.
5. Pak A, Paroubek P. Twitter as a corpus for sentiment analysis and opinion mining. Proceedings of
LREC. 2010.
6. Jianqiang Z, Xiaolin G. Comparison research on text pre-processing methods on Twitter sentiment
analysis. IEEE Access. 2017;5:2870–9. https://doi.org/10.1109/ACCESS.2017.2672677.
7. Pradha S, Halgamuge MN, Vinh NQT. Effective text data preprocessing technique for sentiment analysis
in social media data, 2019 11th International Conference on Knowledge and Systems Engineering (KSE).
2019. p. 1–8. https://doi.org/10.1109/KSE.2019.8919368.
8. Deepa DR, Tamilarasi A. Sentiment analysis using feature extraction and dictionary-based approaches,
2019 Third International conference on I-SMAC (IoT in Social, Mobile, Analytics, and Cloud) (I-SMAC).
2019. p. 786–790. https://doi.org/10.1109/I-SMAC47947.2019.9032456.
9. Chaturvedi S, Mishra V, Mishra N. Sentiment analysis using machine learning for business intelligence,
2017 IEEE International Conference on power, control, signals, and instrumentation engineering
(ICPCSI). 2017. p. 2162–2166. https://doi.org/10.1109/ICPCSI.2017.8392100.
10. Ho J, Ondusko D, Roy B, Hsu DF. Sentiment analysis on tweets using machine learning and combinatorial
fusion, 2019 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive
Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science
and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech). 2019. p. 1066–1071.
https://doi.org/10.1109/DASC/PiCom/CBDCom/CyberSciTech.2019.00191.
11. Agarwal A, Xie B, Vovsha I, Rambow O, Passonneau R. Sentiment analysis of Twitter data. Proceedings of the
Workshop on Languages in Social Media. 2011.
12. Peddinti VMK, Chintalapoodi P. Domain adaptation in sentiment analysis of Twitter, in Analyzing
Microtext Workshop, AAAI, 2011.
13. Davidov D, Tsur O, Rappoport A. Enhanced sentiment learning using Twitter hashtags and smileys. Coling 2010:
23rd International Conference on Computational Linguistics, Proceedings of the Conference. 2010;2:
241–249.
14. Anupriya P, Karpagavalli S. LDA based topic modeling of journal abstracts. Int Conf Adv Comput
Commun Syst. 2015;2015:1–5. https://doi.org/10.1109/ICACCS.2015.7324058.
15. Si X, Sun M. Tag-LDA for scalable real-time tag recommendation. J Comput Inform Syst.
2008;6:23.
16. Krestel R, Fankhauser P. Personalized topic-based tag recommendation. Neurocomputing. 2012;76:61–
70. https://doi.org/10.1016/j.neucom.2011.04.034.
17. Peters ME, Neumann M. Deep contextualized word representations. 2018.
18. Radford A, Narasimhan K. Improving language understanding by generative pre-training. 2018.
19. Devlin J, Chang M, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for
language understanding, in Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, vol 1.