Depression Analysis From Social Media Data in Bangla Language Applying Deep Recurrent Neural Network
Presented By
Sentiment Analysis
Classification
Motivation
According to the WHO, depression was ranked as the third leading cause of the
global burden of disease in 2004 and is projected to move into first place by
2030. In Bangladesh, a national survey on mental health documented that
depression was found in:
4.6% of the adult population.
1% of children. [9]
People suffering from depression may commit acts ranging from suicide to
killing others.
Depression Statistic graph
[Figure: bar chart of Percentage (y-axis, 0 to 60) by Age (x-axis: 20 and over, 20-39, 40-54, 55 and over), with separate series for Men and Women.]
Our Contribution
Related works
Related works (continued)

Name of the paper: A Depression Detection Model Based on Sentiment Analysis in Micro-blog Social Network [2]
Used models: Machine learning.
Dataset: 6013 micro-blog posts from Sina Micro-blog, segmented into 50000 sub-sentences.
Result: Precision 80%.

Name of the paper: Exploring human emotion via Twitter [10]
Used models: Unigram model and Unigram model with POS tagging; Multinomial Naïve Bayes Classifier.
Dataset: 4232 tweets from Sentiment140, manually labelled.
Result: Unigram (81%) and Unigram with POS (79.5%) for 4-way classification. Unigram (66%) and Unigram with POS (64.8%) for 5-way classification.

Name of the paper: Sentiment Analysis on Bangla and Romanized Bangla Text (BRBT) using Deep Recurrent models [4]
Used models: Long Short Term Memory (LSTM).
Dataset: 9337 posts (Facebook, Twitter, YouTube, news portals, product reviews). Among them 6698 (72%) Bangla and 2639 (28%) Romanized Bangla.
Result: Accuracy for Bangla (78%) and Romanized Bangla (55%).
Related works (continued)

Name of the paper: Multilingual Sentiment Analysis: An RNN-Based Framework for Limited Data [5]
Used models: Recurrent Neural Network (LSTM and GRU).
Dataset: 9,478,095 Amazon, 8539 Yelp, and 68170 Competition restaurant reviews as training dataset. 2045 Spanish, 932 Turkish, 1635 Dutch, and 2529 Russian restaurant reviews as testing datasets.
Result: English 87.06%, Spanish 84.21%, Turkish 74.36%, Dutch 81.77%, Russian 85.61%.

Name of the paper: BUSEM at SemEval-2017 Task 4 Sentiment Analysis with Word Embedding and Long Short Term Memory RNN Approaches [6]
Used models: Support Vector Machine (SVM), Random Forest (RF), Naïve Bayes (NB), and Long Short Term Memory (LSTM).
Dataset: SemEval-2016 Task 4 Subtask A's Twitter train and test dataset.
Result: LSTM model 62.6%, SVM model 62.8%.
Proposed method
Our Proposed Method consists of two main steps.
Creating dataset
The Creating Dataset section is divided into six sub-sections:
1. Collect raw data (Twitter, Google sheet)
2. Data preprocessing (letter, number, stop character)
3. Data labelling (manually labeled 5000 tweets)
4. Data post-processing (removing redundancy, stratifying)
5. Data vectorization (integer-level encoding)
6. Split dataset (training data 80%, validation data 10%, test data 10%)
Collecting Raw Data
Data Labelling
Our dataset was labeled manually by a Sociology student into two types:
1. Depressed
2. Non-depressed
After labeling, we obtained:
1. 984 depressive tweets
2. 27 negative but non-depressive tweets
3. 195 neutral tweets
4. 2,708 positive tweets
Data post-processing
Removing redundancies.
Down-sampling the non-depressed data to balance it with the 588 depressed samples.
Stratifying depressed and non-depressed data.
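The redundancy removal and down-sampling steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `dedupe_and_downsample`, the toy tweets, and the choice of random down-sampling to the size of the smallest class are all assumptions.

```python
import random

def dedupe_and_downsample(samples, labels, seed=0):
    """Drop exact duplicate (text, label) pairs, then randomly shrink
    every class to the size of the smallest class."""
    # dict.fromkeys keeps the first occurrence of each pair, in order.
    pairs = list(dict.fromkeys(zip(samples, labels)))

    # Group texts by their label.
    by_label = {}
    for text, label in pairs:
        by_label.setdefault(label, []).append(text)

    # Every class is cut down to the size of the smallest one.
    target = min(len(texts) for texts in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label, texts in by_label.items():
        for text in rng.sample(texts, target):
            balanced.append((text, label))
    rng.shuffle(balanced)
    return balanced

# Toy data (hypothetical): 3 depressed tweets, 7 non-depressed, 1 duplicate.
tweets = [f"tweet_{i}" for i in range(10)] + ["tweet_0"]
labels = ["depressed"] * 3 + ["non-depressed"] * 7 + ["depressed"]

balanced = dedupe_and_downsample(tweets, labels)
counts = {}
for _, label in balanced:
    counts[label] = counts.get(label, 0) + 1
print(counts)
```

With the toy data, both classes end up with three samples each, mirroring how the non-depressed class is reduced to match the depressed class.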
Data vectorization
For vectorizing our dataset, we applied sentence-level integer encoding.
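One plausible reading of integer encoding is that each sentence becomes a fixed-length sequence of integer word IDs. The sketch below assumes whitespace tokenization, ID 0 for padding, and ID 1 for unknown words; the slides do not specify these details, and the helper names `build_vocab` and `encode` are illustrative.

```python
PAD_ID = 0  # assumed: reserved for padding short sentences
UNK_ID = 1  # assumed: reserved for words not seen when building the vocabulary

def build_vocab(sentences):
    """Assign each distinct whitespace-separated token an integer ID, starting at 2."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab) + 2
    return vocab

def encode(sentence, vocab, max_len):
    """Map a sentence to a list of IDs, truncated or padded to max_len."""
    ids = [vocab.get(token, UNK_ID) for token in sentence.split()]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))

# Two toy Bangla sentences ("I am fine", "I am not fine").
corpus = ["আমি ভালো আছি", "আমি ভালো নেই"]
vocab = build_vocab(corpus)

encoded = [encode(s, vocab, max_len=5) for s in corpus]
print(encoded)  # [[2, 3, 4, 0, 0], [2, 3, 5, 0, 0]]
```

Padding to a common length is what lets the encoded sentences be batched for a recurrent network.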
Data splitting
For the hyper-parameter tuning steps, we split our entire dataset into three parts:
i. Training dataset (80%)
ii. Validation dataset (10%)
iii. Testing dataset (10%)
While applying 10-fold cross-validation, our entire dataset was split into
two parts on each fold:
i. Training dataset (90%)
ii. Validation dataset (10%)
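Both splitting schemes above can be illustrated with a short sketch. It assumes a simple shuffled split and ignores stratification for brevity; the function names and the toy data of 100 integers are hypothetical.

```python
import random

def split_80_10_10(items, seed=0):
    """Shuffle, then cut into training (80%), validation (10%), and test (rest)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

def k_fold(items, k=10):
    """Yield (train, validation) pairs; each item is in validation exactly once."""
    items = list(items)
    for fold in range(k):
        validation = items[fold::k]
        train = [x for i, x in enumerate(items) if i % k != fold]
        yield train, validation

data = list(range(100))  # stand-in for 100 encoded tweets

train, val, test = split_80_10_10(data)
print(len(train), len(val), len(test))  # 80 10 10

folds = list(k_fold(data, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 90 10
```

In the 10-fold case each fold trains on 90% of the data and validates on the remaining 10%, so every sample is used for validation exactly once across the ten folds.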
Applied methods
For our thesis work, we applied two distinct models, LSTM and GRU, for
analyzing depression from Bangla social media data. The methodology is:
1. Tune hyper-parameters for LSTM and GRU (size, batch size, number of epochs, number of LSTM and GRU layers).
2. Measure accuracy on the test dataset (10%).
3. If hyper-parameter tuning is not complete, return to step 1.
4. Once tuning is complete, compare the test accuracies and select the best model.
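The tuning loop described above can be sketched as an exhaustive grid search. Everything concrete here is an assumption for illustration: the candidate values in `grid`, the helper `grid_search`, and especially `toy_evaluate`, which is a placeholder standing in for training an LSTM or GRU model and measuring its accuracy. It does not produce real accuracies.

```python
from itertools import product

def grid_search(grid, evaluate):
    """Try every hyper-parameter combination and keep the best-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:  # ties keep the earlier combination
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical candidate values; the slides list only the parameter names.
grid = {
    "cell": ["LSTM", "GRU"],
    "size": [64, 128],
    "batch_size": [16, 32],
    "epochs": [5, 10],
    "layers": [1, 2],
}

def toy_evaluate(params):
    """Placeholder score that peaks at size=128, layers=2.
    In practice this would train the model and return its measured accuracy."""
    return 1.0 - abs(params["size"] - 128) / 128 - abs(params["layers"] - 2) / 2

best_params, best_score = grid_search(grid, toy_evaluate)
print(best_params, best_score)
```

Replacing `toy_evaluate` with a function that builds, trains, and scores a model for the given parameters turns the sketch into the compare-and-select step from the flowchart.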
12. “National Center for Health Statistics.” Centers for Disease Control and Prevention, Centers for Disease
Control and Prevention, 16 Oct. 2014, www.cdc.gov/nchs/products/databriefs/db167.htm.
End of the Presentation
Thank you…