Stressor Classification of Filipino Political Tweets Using LDA, SVM, XGBoost, and Logistic Regression
Abstract
With the advancement of technology, Filipinos increasingly use social media to share what they are doing or how
they feel. As a result, people often vent their stress on platforms such as Twitter. Politics is one of the topics that
causes people considerable stress, and many social media users express their political opinions on Twitter in a stressed
manner. This paper focuses on detecting the reasons for stress, called stressors, from tweets. Tweets are collected
by hashtag, and an NLP topic modeling technique, specifically Latent Dirichlet Allocation (LDA), is used to form the
stress topics. The machine learning algorithms SVM, XGBoost, and Logistic Regression are then trained on the tweets
and on the topics produced by the topic model to build a model that predicts the stressor of a tweet.
Keywords
Political Tweets, Stressors, LDA, Machine Learning Algorithms, Stress
1. Introduction
1.1 Stress
Stress is a form of mental illness that many people suffer from today (Guntuku et al. 2019). With the advancement of
technology, Filipinos increasingly use social media to share what they are doing or how they feel, which can lead to
people venting on platforms such as Twitter. Stress affects a person's mental and, at times, physical health, although
some people cope with it better than others. A person's performance can therefore be greatly affected when they are
under stress. According to Seaward (2018), stress should be managed and monitored because most people, especially
adults, are not aware that they are under heavy stress.
Pillai et al. (2018) used topic modeling to form five stressors. They applied it to two groups of tweets: political
tweets and tweets about transportation. They used a tool to identify which tweets were stressed and also annotated
the tweets manually. Only the tweets considered stressed were used in topic modeling to create the topics.
This study focuses only on the social media platform Twitter and relies solely on NLP methods to analyze tweets from
Filipinos, since Twitter is used by most Filipinos (Van der Schuur et al. 2018). The model centers on the prediction
of stressors. Although similar studies included relaxation in their method (Pillai et al. 2018), this study excludes
that factor and focuses solely on stress and its possible stressors. The mined tweets are only about politics, since
the majority of stress-related tweets are about politics (Pillai et al. 2018).
1.4 Objectives
Social media networking sites are commonly used by people to share everyday details with the world. In this paper,
we determine the stressors in the tweets of Filipinos, given the challenge that current lexicon tools are not well
suited to text with grammatical problems.
With the platform as the bridge for the study, the researchers aim to identify the stressors of Filipino tweets from
Twitter using topic modeling methods, to find which of the three machine learning algorithms provides the highest
performance in terms of F1 measure and accuracy, and to discover the commonly used words that identify the
stressors.
2. Literature Review
2.1 Stress
Stress is one of the most common things humans experience in their lifetime. Inevitably, there will come a time
when we, as social beings, encounter social media. Seaward (2018) suggests that stress should be managed carefully
because, even though it is a regular emotion, most people do not know that they are experiencing heavy stress.
Almost everybody uses social media, and many users are so addicted to it that they become stressed and lose sleep
because of it, which can be harmful. Mental health conditions can be monitored through the language people use on
social media (Guntuku et al. 2019).
2018). Since Twitter is a social media platform used globally, most tweets collected in multilingual places exhibit
code-switching (Rijhwani et al. 2017).
2.7 Framework
As shown in Figure 1, the process starts by collecting enough tweets to form a dataset. After the data is collected,
the pre-processing stage begins. This stage commonly uses techniques such as tokenization, stemming, and stop-word
removal. For Twitter data, pre-processing can be improved further by removing special characters, URLs, numbers,
punctuation, and retweets. After pre-processing, a domain expert labels the tweets to identify which of them can be
considered to contain a stressor. The dataset is then split for model training and testing, most commonly in a
70%/30% ratio.
Model training uses a classification method. Different classification algorithms (SVM, XGBoost, and Logistic
Regression) are trained to find which one produces the model with the highest accuracy.
Model testing is then done on the remaining portion of the dataset to check whether the method is consistent and
which classifier gives the best accuracy. Evaluation is the last step, in which the chosen model is applied to
different datasets.
3. Methodology
The Twitter API is the main tool used to collect tweets. The researchers focus primarily on Filipino political tweets
with hashtags such as #duterte, #dilawan, #BBMIsMyPresident2022, #PHVotes, #DOH, #Presidency, and #COVID19.
These hashtags capture people's points of view on eventful political topics and are therefore more likely to yield
stressed tweets.
The domain experts annotate at least 3,500 to 5,000 tweets with a score of 0 (not stressed) or 1 (stressed), so that
enough annotated data is available for topic modeling to find potential stressors.
Labeling of tweets
This process results in a unified dataset from the domain experts for both the English and the Tagalog dataset. The
researchers then dropped all tweets that the domain experts scored 0 ("not stressed"). The reason is that the words
clustered in the next step should come from stressed tweets only; keeping non-stressed tweets could mix words that
are not associated with stress into the clusters. The resulting Tagalog dataset has 2,354 rows and the English
dataset has 3,207 rows. An example of a pre-processed tweet is shown in Table 3.
The following pre-processing steps are also applied to make the tweets more suitable as data (Negara et al. 2019):
• Removal of URLs, since links are not the kind of data the researchers are looking for.
• Retweets are ignored, since they are not considered personal tweets from the user.
• Twitter-specific symbols are ignored (hashtags, “@username” mentions).
• All tweets are converted to lowercase.
• Removal of stop words such as “are”, “as”, “a”, and “am” for both English and Tagalog, since they carry little
meaning in a tweet.
• Tokenization - each tweet is split into words or phrases, each referred to as a token.
• Lemmatization - each token is reduced to its base (dictionary) form, such as its root word.
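The cleaning steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' actual code: the stop-word set is a tiny placeholder, the "RT @" retweet check is an assumed convention, the sample tweets are invented, and lemmatization is left out because it needs a language-specific resource.

```python
import re

# Tiny illustrative stop-word list; a real run would need full English and Tagalog lists.
STOP_WORDS = {"are", "as", "a", "am", "the", "is", "ang", "ng", "sa", "mo", "ka", "ni"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\s?\w+")      # "@username" and "@ username"
HASHTAG_RE = re.compile(r"#\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # numbers, punctuation, special characters


def is_retweet(tweet: str) -> bool:
    """Treat tweets starting with 'RT @' as retweets (an assumed convention)."""
    return tweet.lstrip().startswith("RT @")


def preprocess(tweet: str) -> list[str]:
    """Return the cleaned tokens of one tweet (lemmatization is not included)."""
    text = URL_RE.sub(" ", tweet)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = text.lower()
    text = NON_LETTER_RE.sub(" ", text)
    tokens = text.split()  # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]


# Invented sample tweets for illustration only.
raw_tweets = [
    "RT @someone: this retweet is skipped",
    "@ABSCBNNews The policy is a PROBLEM!!! https://example.com #PHVotes",
]
cleaned = [preprocess(t) for t in raw_tweets if not is_retweet(t)]
print(cleaned)
```

Note that the letter filter in this sketch keeps only unaccented a-z, so it would also strip characters such as "ñ"; a production version would need a wider character class.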
Most Common Words per Topic and Stressor (Tagalog)
Political Stance: address, say, testing, kayo, people, ask, face, think, lawyere, class, test, pilipina, start,
supporter, doh, apologist
Government Policies: Duterte, government, election, drug, people, corruption, administration, run, president,
country, support, covid, try, year, allege, kill
President: dilawan, neverforget, never again, bbm, law, country, aquino, presidency, try, forget, bbmforpresident,
endbayan, hope, budget, politic, dictatorship
Most Common Words per Topic and Stressor (English)
Election: respect, vote, country, dilawan, use, test, run, health, help, feel, covid, follow, leni, ask, know, leader
Government Policies: presidency, kayo, election, face, hope, case, file, problem, make, drug, year, come, war,
shield, doh, class
Political Stance: duterte, say, candidate, support, need, think, muna, oust, opposition, philippine, supporter,
thank, corruption, work
After the domain experts evaluated the word clusters, we asked them to label the dataset with the evaluated
stressors, using the initial dataset from before pre-processing. A sample labelling is shown in Table 5.
Tweet: when I find myself in times of trouble, mother mary comes to me. speaking words of wisdom, "oust Duterte"
Stressor: Political Stance
Tweet: @ABSCBNNews sinara mo track record mo nung naging tuta ka ni duterte
Stressor: Filipino Political News
After receiving the dataset labeled by the domain experts, the researchers pre-processed it again in the same way as
before the LDA topic modeling step. Table 6 shows the result of the final pre-processing.
Stressor: Definition
Political Stance: A side where an individual agrees based on the ideology, party, and policies
Government Policies: The government's action or intent to solve a problem/issue within the country
Election: An individual preference of candidate for the political representation of the country
Filipino Political News: Reports about the latest information revolving around politics like government policies,
corruption, elections, etc.
President: Information about the President's speeches, actions, and policies.
Recollection: Origins of political events that are remembered due to their impact and relevance.
Figure 2. Numbers of Tweets for Tagalog (Left) and English (Right) Dataset
As shown in Figure 2, the number of tweets per class is unbalanced, so the researchers took 280 tweets per class from
the Tagalog dataset, for a total of 1,400 rows, and 390 per class from the English dataset, for a total of 1,950 rows,
to train the model. The purpose is to avoid bias toward the larger classes during training.
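Balancing by taking a fixed number of tweets per class is a form of random under-sampling. The following sketch shows one way it could be done with the standard library; the function name, the seed, and the placeholder rows are assumptions, and the per-class size of 4 merely stands in for the 280 and 390 used in the study.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, per_class, seed=0):
    """Keep a random sample of `per_class` rows for every label."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) < per_class:
            raise ValueError(f"class {label!r} has only {len(group)} rows")
        balanced.extend(rng.sample(group, per_class))
    rng.shuffle(balanced)  # avoid leaving the rows grouped by class
    return balanced


# Hypothetical unbalanced dataset: (tweet, stressor) pairs with placeholder text.
rows = (
    [(f"tweet {i}", "President") for i in range(9)]
    + [(f"tweet {i}", "Election") for i in range(4)]
    + [(f"tweet {i}", "Political Stance") for i in range(6)]
)
balanced = undersample(rows, per_class=4)
print(Counter(label for _, label in balanced))
```

With five classes, the same call with per_class=280 yields 1,400 rows and with per_class=390 yields 1,950 rows, matching the totals reported above.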
The data model is created in Jupyter Notebook using Python 3. Model training focuses on three algorithms: SVM,
XGBoost, and Logistic Regression. The data is split 70/30 into training and test sets. The researchers run ten
cross-validations for each classifier. They also use standard statistical performance metrics such as accuracy,
F-score, and the kappa statistic to compare the models, but focus mainly on accuracy.
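To make the three metrics concrete, the sketch below computes accuracy, a macro-averaged F-score, and Cohen's kappa from scratch. The paper does not state which F-score averaging or which kappa variant it uses, so macro averaging and Cohen's kappa are assumptions here, and the label vectors are invented.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n**2
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical encoded labels for ten test tweets (three classes for brevity).
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 0]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred), cohen_kappa(y_true, y_pred))
```

In this toy case seven of ten predictions are correct, so accuracy is 0.7, while kappa is lower (6/11, about 0.545) because part of that agreement would be expected by chance.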
The dataset uses the tweets as the features (X) for model training and the stressors as the predicted target label
(Y). The X feature undergoes word vectorization because machine learning models do not accept strings (tweets) as
input; vectorizing the words turns them into numerical values. The vectorization method used is TF-IDF. Its term
frequency (TF) component summarizes how often a given word appears in a tweet, and its inverse document frequency
(IDF) component downscales words that appear in many tweets. After TF-IDF vectorization, the words are stored in a
vocabulary in which each word is assigned a unique integer index. A snippet of the TF-IDF vocabulary is shown in
Table 7.
Filipino:
{'oust': 4101, 'duterte': 1450, 'kita': 2779, 'perception': 4419, 'fact': 1663, 'checker': 904, 'dilawan': 1312,
'legit': 2947, 'problem': 4659, 'might': 3498, 'addressed': 179, 'across': 165, 'industry': 2375}
English:
{'ph': 4179, 'utot': 5584, 'mo': 3261, 'nyo': 3749, 'kayo': 2458, 'lng': 2745, 'pwde': 4469, 'magsalita': 2897,
'magtanggol': 2904, 'pag': 3849, 'bayaran': 539, 'agad': 160, 'gago': 1654, 'utak': 5581}
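A small worked example may help show how a vocabulary and TF-IDF weights of this kind arise. The sketch uses the textbook definitions tf = count / tweet length and idf = ln(N / df), which is an assumption; library implementations usually apply smoothed and normalized variants, so the exact numbers differ. The three toy tweets are invented.

```python
import math
from collections import Counter

# Hypothetical pre-processed tweets, already tokenized.
docs = [
    ["oust", "duterte"],
    ["duterte", "government", "problem"],
    ["fact", "checker", "dilawan", "problem"],
]

# Vocabulary: each distinct word gets a unique integer index (alphabetical order).
vocabulary = {word: i for i, word in enumerate(sorted({w for d in docs for w in d}))}

# Document frequency: in how many tweets each word appears.
doc_freq = Counter(word for d in docs for word in set(d))
n_docs = len(docs)


def tfidf_vector(doc):
    """Dense TF-IDF vector using tf = count / len(doc) and idf = ln(N / df)."""
    vector = [0.0] * len(vocabulary)
    for word, count in Counter(doc).items():
        tf = count / len(doc)
        idf = math.log(n_docs / doc_freq[word])
        vector[vocabulary[word]] = tf * idf
    return vector


matrix = [tfidf_vector(d) for d in docs]
print(vocabulary)
print([round(x, 3) for x in matrix[0]])
```

In the first toy tweet, "oust" receives a higher weight than "duterte" because "duterte" also appears in another tweet and is therefore downscaled by the IDF term.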
The Y labels consist of the five stressors, which are also strings, so they are passed through a label encoder. This
transforms the categorical string data into numerical values that the models accept. Since there are five stressors,
the Y labels take the values 0, 1, 2, 3, and 4, where 0 represents the first stressor, 1 the second, 2 the third,
3 the fourth, and 4 the fifth. Table 8 shows the encoded stressor labels.
Table 8. Encoded stressor labels (Filipino and English datasets)
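Label encoding itself is a simple mapping from class names to integers. The sketch below assigns the integers in alphabetical order of the class names, which is an assumed convention (it is what common label encoders do) and not something the paper states; the sample label list is invented.

```python
# Hypothetical stressor labels for six tweets.
stressors = [
    "Political Stance",
    "President",
    "Election",
    "Government Policies",
    "Filipino Political News",
    "President",
]

# Assign integers in alphabetical order of the class names (an assumed convention).
classes = sorted(set(stressors))
label_to_id = {name: i for i, name in enumerate(classes)}
id_to_label = {i: name for name, i in label_to_id.items()}

# Encoded target vector Y, plus the inverse mapping for reading predictions back.
y = [label_to_id[name] for name in stressors]
print(label_to_id)
print(y)
print([id_to_label[i] for i in y])
```

Keeping the inverse mapping around matters in practice, because the model outputs integers and the results are easier to interpret once they are translated back into stressor names.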
SVM
The researchers chose SVM because it performs well on data classification tasks. SVM is natively a binary
classifier, but the researchers use a multiclass SVM because five target classes were fitted as Y in the data
modeling process for this algorithm.
XGBoost
XGBoost is a decision-tree-based machine learning algorithm that uses a gradient boosting framework. It suits the
researchers' dataset because tree-based algorithms tend to work well on small to medium-sized data.
Logistic Regression
Logistic Regression is one of the best-known algorithms for classification problems. It predicts the probability
that a sample belongs to a class based on its features. This study uses multinomial Logistic Regression because the
model must predict five classes.
SVM and XGBoost underwent hyperparameter tuning using grid search to improve the performance of the models.
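The training setup described in this section can be sketched with scikit-learn as follows. Everything specific here is an assumption for illustration: the toy data, the grid values, the 3-fold search, and the use of scikit-learn itself. XGBoost is omitted to keep the sketch dependent on one library; it could be tuned in the same way through its scikit-learn-compatible `XGBClassifier`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data: 10 tweets per stressor, built from placeholder keywords.
keywords = {
    "Election": "vote candidate election",
    "Government Policies": "policy drug covid",
    "President": "duterte president speech",
}
texts, labels = [], []
for label, words in keywords.items():
    for i in range(10):
        texts.append(f"{words} tweet{i}")
        labels.append(label)

# 70/30 split, stratified so every class appears in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=0
)

# TF-IDF features feed a multiclass SVM; the grid values are illustrative only.
svm_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", SVC())])
param_grid = {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]}
search = GridSearchCV(svm_pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# Untuned logistic regression on the same kind of features, for comparison.
logreg = Pipeline(
    [("tfidf", TfidfVectorizer()), ("clf", LogisticRegression(max_iter=1000))]
)
logreg.fit(X_train, y_train)

svm_accuracy = search.score(X_test, y_test)
logreg_accuracy = logreg.score(X_test, y_test)
print(search.best_params_, svm_accuracy, logreg_accuracy)
```

Putting the vectorizer inside the pipeline means it is refit on each training fold during the grid search, so no vocabulary or IDF statistics leak from the validation folds.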
Based on Table 9, the accuracy scores show acceptable predictive results. The models reach around 70% accuracy. All
three models show nearly identical accuracy and F-score, because each of these machine learning algorithms handles
multiclass prediction well. The accuracy may not be higher because the tweets themselves are used as the X feature
and the tweets in the researchers' dataset vary widely.
In Table 10, the results on the English dataset are roughly the same as on the Tagalog dataset, with accuracy around
70%. The two datasets appear to produce similar results because both use tweets as the X feature and these tweets
vary widely.
Confusion Matrix: Figure 3 shows the results for the Confusion Matrix of the best model.
The confusion matrix shows that for the 1st topic (Elections), the best model predicted 51 true positives, while the
majority of the false negatives were predicted as Government Policies and President. This may be because the
presidency is closely related to elections. The 2nd topic (Filipino Political News) had 71 true positives, with the
majority of the false negatives again falling under President. The 3rd topic (Government Policies) had 61 true
positives, with the false negatives spread evenly among the other topics. The 4th topic (Political Stance) had 60
true positives, with the majority of the false negatives falling under President and Filipino Political News. The
last topic (President) had 52 true positives, with the majority of the false negatives falling under Elections and
Political Stance.
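A reading of this kind can be derived mechanically from the matrix: the diagonal holds the true positives, and the largest off-diagonal entry in a row shows which class absorbs most of that row's false negatives. The sketch below illustrates this with invented labels; the numbers are not the paper's results, and the class order is an assumption.

```python
CLASSES = ["Election", "Filipino Political News", "Government Policies",
           "Political Stance", "President"]


def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[i][j] counts tweets of true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def summarize(matrix):
    """Per class: true positives and the most frequent wrong prediction."""
    summary = {}
    for i, row in enumerate(matrix):
        errors = {j: n for j, n in enumerate(row) if j != i and n > 0}
        worst = max(errors, key=errors.get) if errors else None
        summary[CLASSES[i]] = (row[i], CLASSES[worst] if worst is not None else None)
    return summary


# Hypothetical encoded labels, not the paper's results.
y_true = [0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 4]
y_pred = [0, 4, 4, 1, 4, 2, 2, 3, 1, 4, 0, 4]
matrix = confusion_matrix(y_true, y_pred, len(CLASSES))
for name, (tp, confused_with) in summarize(matrix).items():
    print(f"{name}: {tp} true positives, most often confused with {confused_with}")
```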
Misleading Tweets: Some of the collected tweets are satire, which can make the word pooling of the topic models
inconsistent. Verbal irony, such as sarcastic tweets, may also cause inconsistencies in the collected words.
Inconsistent Topic Models: Some of the topic models created have very similar sets of words, which can make it hard
for the model to predict the stressor accurately.
Code-Switching: Both datasets contain the language of the other (both contain Taglish tweets), which is the most
influential factor in the process and renders the two datasets virtually the same.
Prediction from other tweets: The model cannot accurately predict tweets on topics outside of politics, since the
stressors are only political.
4. Conclusion
This research shows that stressors can be found from a collection of Tagalog or English tweets that carry a
Filipino-based hashtag. It fulfills its role of detecting a suitable stressor for a particular tweet based on the
hashtag using topic modeling. It also provides a method for creating a data model for stressor detection and
prediction, with 70-71% performance across the three models. The models perform similarly because all three
algorithms handle multiclass prediction well. The best-performing model in our study was Logistic Regression (71%),
because the algorithm models the posterior class probability directly and is very effective on linear
classification problems.
For future work, the researchers recommend mining tweets on the same general topic in different countries, if
possible, since mining the same area-specific topic to compare two languages leads to code-switching. The
researchers also suggest using a better cleaning method for Tagalog text if one becomes available, because data
cleaning support for Tagalog seems minimal compared to other languages and has no dedicated function. They also
suggest other similarity-based methods for finding stressors and other NLP techniques for clustering words into
stressors. Finally, different methods such as neural networks could be tried for creating the data model.
References
Abastillas, G., You Are What You Tweet: A Divergence in Code-Switching Practices in Cebuano and English
Speakers in Philippines, Mehta S. (eds) Language and Literature in a Glocal World, Springer, Singapore, 2018.
Armstrong, S., Philippine Tsismis: Gossip and the Politics of Representation in Jessica Hagedorn’s Dogeaters,
Postcolonial Text, Vol 16, No 4, 2021.
Cooper, C. L. and Quick, J. C., The Handbook of Stress and Health: A Guide to Research and Practice, John Wiley &
Sons Inc., 2017.
De Leon, G. F. and Lintao, R., The Rise of Meme Culture: Internet Political Memes as Tools for Analysing Philippine
Propaganda, Journal of Critical Studies in Language and Literature, 2(4), 1-13, 2021.
Gollapalli, S. D. and Li, X., Using PageRank for Characterizing Topic Quality in LDA, In Proceedings of the 2018
ACM SIGIR International Conference on Theory of Information Retrieval (ICTIR '18), Association for
Computing Machinery, 2018.
Guntuku, S. G., Buffone, A., Jaidka, K., Eichstaedt, J. C., & Ungar, L. H., Understanding and Measuring Psychological
Stress Using Social Media, Proceedings of the International AAAI Conference on Web and Social Media, 13(01),
214-225, 2019.
Hussein, D., A survey on sentiment analysis challenges, Journal of King Saud University - Engineering Sciences,
Volume 30, Issue 4, Pages 330-338, 2018.
Jose, N., Chakravarthi, B. R., Suryawanshi, S., Sherly, E. and McCrae, J. P., A Survey of Current Datasets for Code-
Switching Research, Proceedings of 2020 6th International Conference on Advanced Computing and
Communication Systems (ICACCS), pp. 136-141, 2020.
Negara, E.S., Triadi, D., and Andryani, R., Topic Modelling Twitter Data with Latent Dirichlet Allocation Method,
Proceeding of the 2019 International Conference on Electrical Engineering and Computer Science (ICECOS),
386-390, 2019.
Pillai, G. R., Thelwall, M. and Orasan, C., Detection of Stress and Relaxation Magnitudes for Tweets, Proceedings of
The Web Conference 2018 (WWW '18), International World Wide Web Conferences Steering Committee,
Republic and Canton of Geneva, CHE, 1677–1684, 2018.
Rijhwani, S., Sequiera, R., Choudhury, M., Bali, K., and Maddila, C. S., Estimating Code-Switching on Twitter with
a Novel Generalized Word-Level Language Detection Technique, Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics, Association for Computational Linguistics, 1971–1982, 2017.
Sjöström, S., Labelling theory in Routledge International Handbook of Critical Mental, Taylor & Francis Group,
2017.
Seaward, B. L., Managing stress: Principles and strategies for Health and Well Being. Burlington, MA: Jones &
Bartlett Learning, 2018.
Van der Schuur, W. A., Baumgartner, S. E., and Sumter, S. R., Social Media Use, Social Media Stress, and Sleep:
Examining Cross-Sectional and Longitudinal Relationships in Adolescents, Health Commun, 34(5):552-559,
2019.
Wang, X., Zhang, H., Cao, L., and Feng, L., Leverage Social Media for Personalized Stress Detection, Proceedings
of the 28th ACM International Conference on Multimedia, Association for Computing Machinery, New York,
NY, USA, 2710–2718, 2020.
Biographies
Mark Gabriel E. Edaño is a Computer Science student at Mapua University who is currently in the final year of his
course. He specialized in Artificial Intelligence and also studied pattern recognition and technopreneurship as his
electives. He was a software/data analyst intern at NCSI Philippines. He strives to be a data scientist, particularly
in sports-related big data.
Ryan Joseph S. Gonzales is an undergraduate student of Mapua University who is taking a Bachelor of Science in
Computer Science. He specializes in Artificial Intelligence and is currently in his final year. He completed his
internship at Chimes Consulting as a Backend Developer and is planning to pursue a career as a software engineer.
His interest revolves around the automation of non-supervised functions/projects that can be applied to larger-scale
tasks.
Raphael Carlo B. Laguda is an undergraduate student of Mapua University, taking BS Computer Science. He
specializes in Application Development and also studied pattern recognition and technopreneurship as his electives.
He was a Software Engineer intern at Realtair Inc. His interests are Web and Application Development, Artificial
Intelligence, and Game Development.
Joel de Goma is a student of Mapua University, taking a PhD in Computer Science. He is also an instructor at the
same university.