Data Mining of Public Opinion: An Overview
Abstract. The United Nations recently published the “E-government survey 2020” with the main aim of assessing the
e-government development status of all United Nations member states. The survey identifies 14 leading countries in
e-government development (out of 193 member states), some of which claim to use technologies such as artificial
intelligence (AI), big data and blockchain. Moreover, with the outbreak of the COVID-19 pandemic, the development
and implementation of e-government services has become an even more pressing topic. However, alongside research on
the digitalization of public services, it is important to develop tools that measure how these rapid changes are
perceived by users. Consequently, this paper examines the most recent research devoted to public opinion data
mining. On the basis of an extensive literature review, we outline the latest developments and trends in the field of
public opinion data mining, with a special focus on sentiment analysis. Our main goal is to provide a self-contained,
comprehensive summary that might be used as a basis for the design and development of AI systems aimed at mining
public opinion.
INTRODUCTION
The big data era has brought a great deal of research interest and attention to the scientific fields of data mining
and machine learning. The combination of available data and advanced analytics techniques leads to smart,
data-driven solutions for business, the economy, and many other aspects of modern life. More and more
researchers and practitioners are adopting data mining methods in the governmental domain as well (1). This
emerging field has many new applications which facilitate the digitalization of government services and
the development of e-government. The United Nations recently published the “E-government
survey 2020” with the main aim of assessing the e-government development status of all United Nations member
states (2). The survey identifies 14 leading countries in e-government development (out of 193 member states), some
of which claim to use technologies such as artificial intelligence (AI), big data and blockchain. Moreover, with the
outbreak of the COVID-19 pandemic, the development and implementation of e-government services has become an
even more pressing topic. However, researchers should consider not only the digitalization of public
services but also how these rapid changes are perceived by users.
Consequently, this paper examines the most recent research devoted to public opinion data mining. On the basis
of an extensive literature review, we outline the latest developments and trends in the field of public opinion data
mining, with a special focus on sentiment analysis. Our main goal is to provide a self-contained, comprehensive
summary that might be used as a basis for the design and development of AI systems aimed at mining public opinion.
For this purpose, we deliver a comparative tabular summary of the reviewed papers in terms of the text processing
techniques, machine learning (ML) algorithms and data sources used, as well as the language resources used for text
preprocessing and analysis.
Our paper contributes to the body of literature devoted to data mining applications in the governmental domain
in the following ways. We outline the general methodologies and techniques used for analyzing public opinions
and sentiments, as well as the main sources of such data used by researchers. Furthermore, our analysis sheds light
on the general topics of public interest (digital services, political issues, healthcare etc.) analyzed through the
application of data mining and machine learning techniques. Last but not least, our review has a special focus on
sentiment analysis applications for the evaluation of public opinions; hence, in the surveyed papers we also pay
attention to the language of the text data, the language resources used for text processing and analysis, and other
important aspects of studies in the field.
The rest of the paper is organized as follows. Section 2 describes the research methodology. Section 3
provides an outline of the general methodologies for sentiment analysis and opinion mining. Section 4 presents a
review of recent research devoted to data mining and sentiment analysis of public opinion in the governmental
domain. Section 5 discusses the main findings of the current research.
RESEARCH METHODOLOGY
The research methodology was chosen to address the specific goals of the study. We first shed
light on the general methodologies for sentiment analysis and opinion mining. Second, we pay special attention to
recent approaches to the analysis of public opinion in the governmental domain through computational methods.
Many studies apply qualitative approaches to the task (3), (4), (5). However, the application of machine
learning and natural language processing (NLP) to opinions freely expressed on social networks may lead to major
advances in public opinion analysis. The current research is part of a larger project for mining citizen
opinions on e-government in Bulgaria; hence, in our review we also investigate applications of sentiment
analysis and citizen opinion mining to text data in Bulgarian. Figure 1 displays a diagram of the research
methodology.
In a nutshell, the current overview focuses on the following strands of papers:
1. General methodologies for opinion mining and sentiment analysis - we examine general trends in the field
and approaches towards the task.
2. Sentiment analysis of public opinions in the governmental domain - we examine the most recent research,
published in the last 5 years.
Opinion mining and sentiment analysis play a central role in the analysis of public opinion on various topics (6).
Furthermore, as stated by Alexopoulos et al. (1), sentiment analysis is a method which improves the communication
and relationship between modern governments and citizens. Sentiment analysis is a scientific field which combines
NLP and statistical techniques with the main aim of identifying, extracting and analyzing the opinions, emotions, and
subjectivity expressed in text. It focuses on the feelings, evaluations, attitudes, and emotions that
people express towards various targets (7): goods, services, events, topics of social importance, individuals etc.
Sentiment analysis has numerous applications in various industries: business and finance (see (8) and (9)),
healthcare (10), politics (11), education (12) etc. Since the beginning of the 21st century there has been an
ever-increasing research interest in the field. The studies of Turney (13) and Pang et al. (14) are fundamental to the
development of the field. In 2012 Liu (15) provided a comprehensive study that outlines the state-of-the-art
approaches, application areas and sub-tasks of sentiment analysis. The study laid important groundwork for future
advancements and new applications in the field. Liu described sentiment analysis as “the most active
research area in NLP”. According to Liu, the field quickly started to expand beyond computer science to the social
and management sciences due to its significance and its numerous applications in the business,
political and social aspects of our lives.
In the context of sentiment analysis, the most commonly tackled task is determining the sentiment polarity of a
given text. As the name suggests, polarity analysis categorizes opinions/emotions as positive, negative, or neutral (16).
Depending on the task and data under study, different levels of these three general sentiment categories might be
considered. For example, the analysis might differentiate between these three polarity levels plus a
“mixed” sentiment category (17). Sentiments might also be considered on a finer scale, for example “very
negative, negative, neutral, positive, very positive” (see (11) and (18)). Some of the pioneering research in the field
of sentiment analysis focuses on determining the sentiment polarity (13), (14), (19). Recent research sheds light
on the development of “emotion sentiment analysis” (20). Yadollahi et al. provide a refined taxonomy of sentiment
analysis, dividing it into two main types: opinion mining and emotion mining (20). While polarity classification is
defined as a subtask of “opinion mining”, emotion classification falls into the “emotion mining” category of
sentiment analysis. The latter deals with the task of determining which emotions, from a predefined set, are
expressed in text. An example application is the study of Gupta et al. (21), which analyzes 8 different emotions
(anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) expressed by citizens in the context of the
COVID-19 pandemic and other socially important topics. Emotion mining has received increasing attention in
recent years since it enables a deeper analysis of human experiences.
Three main approaches can be used for sentiment analysis (22). The first involves the application
of sentiment lexicons (the lexicon-based approach). Lexicons contain lists of sentiment words and phrases, that is,
words/phrases that convey positive or negative sentiments. Such language resources are generated using a
dictionary-based or corpus-based approach (23). It should be pointed out that lexicons are domain-specific since
words have different meanings when used in different contexts. This means that not every lexicon can be
applied to every dataset. There are also some “general-purpose” lexicons but, still, attention must be paid to the
specific characteristics of the data under study. The advantage of using such language resources (compared to other
approaches for sentiment analysis) is that applying them to text data is quick, interpretable, and straightforward.
Furthermore, such resources do not require training data. The lack of training data is a major obstacle to the
application of purely statistical approaches to sentiment analysis and, for this reason, the lexicon-based
approach is preferred in many use cases. Among the most frequently used sentiment lexicons are SentiWordNet (24),
AFINN (25), VADER (26), SocialSent (a collection of domain-specific lexicons) (27) and TextBlob (28). Most lexicons
are suitable mainly for social media texts. The languages supported by the above-mentioned lexicons do not include
Bulgarian. There is one sentiment lexicon for Bulgarian which is available for research purposes. It was developed by
Kapukaranov and Nakov (29) and is based on movie review data.
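The lexicon-based approach described above can be illustrated with a minimal sketch. The lexicon entries, the
negation rule and the zero threshold are assumptions made for illustration; they are not taken from any of the cited
resources, and a real study would load a published or domain-specific lexicon instead.

```python
import re

# Toy lexicon: word -> polarity weight. Hypothetical entries for illustration only.
TOY_LEXICON = {
    "good": 2.0, "great": 3.0, "fast": 1.0, "helpful": 2.0,
    "bad": -2.0, "slow": -1.0, "broken": -2.0, "useless": -3.0,
}
# Hypothetical negators; the weight of the word that follows one is flipped.
NEGATORS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the weights of lexicon words, flipping the sign directly after a negator."""
    tokens = tokenize(text)
    score = 0.0
    for i, token in enumerate(tokens):
        weight = lexicon.get(token, 0.0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            weight = -weight
        score += weight
    return score


def polarity(text, threshold=0.0):
    """Map the aggregate score to positive, negative, or neutral."""
    score = lexicon_score(text)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


print(polarity("The new portal is great and fast"))                 # positive
print(polarity("The service is not good and the form is broken"))  # negative
print(polarity("I submitted the application yesterday"))            # neutral
```

No training data is needed, which is the main appeal noted above, but the result depends entirely on how well the
lexicon matches the domain and language of the texts.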
The second approach to sentiment analysis uses machine learning methods, while the third
is a hybrid: a combination of the lexicon-based and machine learning approaches. The most frequently tackled task
in the field, sentiment polarity analysis, is usually treated as a text classification task. There are various
machine learning models that can be applied to such tasks (30). The main idea behind such predictive models is to
learn how to detect sentiment directly from data. The classical methodology in machine learning
involves feature engineering prior to the application of a prediction model. Some of the most popular text
classification models are support vector machines (SVM), Naïve Bayes and Logistic Regression (31). However,
classical methods have some limitations. Feature engineering is crucial for obtaining good performance and this
phase can be quite time-consuming. Furthermore, as data size increases, methods relying on hand-crafted
features might become unreliable due to the increasing volume of new observations. If the text is written in a
“low-resource language” (like Bulgarian, for example), a further problem is the reliance on language
resources for feature engineering and text pre-processing tasks such as part-of-speech (POS) tagging,
stemming and lemmatization. Without claiming to be exhaustive, we mention some recent research employing a
machine learning or hybrid approach to sentiment analysis: (9), (32), (33).
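To make the classical pipeline concrete, the following is a minimal sketch of a multinomial Naïve Bayes polarity
classifier over bag-of-words features with add-one (Laplace) smoothing. The class name, the four training sentences
and their labels are hypothetical; a real study would train on a manually annotated corpus and would normally use an
established library rather than this hand-written version.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (the only feature engineering here)."""
    return re.findall(r"\w+", text.lower())


class NaiveBayesSentiment:
    """Multinomial Naive Bayes with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()            # number of documents per label
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_log_prob = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            log_prob = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocabulary:
                    continue  # words never seen in training carry no evidence
                count = self.word_counts[label][token]
                log_prob += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            if log_prob > best_log_prob:
                best_label, best_log_prob = label, log_prob
        return best_label


# Hypothetical, hand-labelled training examples.
TRAIN = [
    ("the online portal is fast and easy to use", "positive"),
    ("great service and helpful staff", "positive"),
    ("the website is slow and keeps crashing", "negative"),
    ("terrible experience the form is broken", "negative"),
]
model = NaiveBayesSentiment().fit([t for t, _ in TRAIN], [y for _, y in TRAIN])
print(model.predict("the portal is easy and fast"))
print(model.predict("the form is broken and slow"))
```

Even in this toy form, the dependence on labelled examples is visible: the model knows nothing about words that do
not occur in the training set.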
To overcome some of the problems of classical machine learning methodologies, neural models can be used
for sentiment analysis. Deep learning addresses some of the described limitations and does not rely on hand-crafted
features; instead, text is represented by embeddings: low-dimensional, learned continuous vector representations.
Embeddings capture semantic relationships in texts and might remove the need for feature engineering.
One of the most popular word embedding techniques is Word2Vec, proposed by Mikolov et al. in 2013 and useful in
many NLP tasks (34). Neural models now receive more and more attention in the field of NLP since they are
applicable in many use cases. The most recent major breakthrough in NLP, introduced in 2017, is the Transformer.
The Transformer is a network architecture based on attention mechanisms and useful in various natural
language understanding and generation tasks. It was introduced by Vaswani et al. (35) and opened a whole new path
for advancements in the field of text mining and NLP. The core idea behind the Transformer is to avoid
convolutions and time-consuming recurrent neural networks and to rely
entirely on attention-based mechanisms.
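The attention mechanism at the heart of the Transformer can be sketched in a few lines. The block below computes
scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, for plain Python lists. It is a single-head illustration
only: the learned projection matrices, multiple heads, masking and positional encodings of the full architecture are
omitted, and the example vectors are made up.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
    return outputs


# Made-up two-dimensional vectors standing in for token representations.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
print(scaled_dot_product_attention(Q, K, V))
```

Each output is a mixture of all value vectors, so every token can draw on every other token in one step, without the
sequential processing that makes recurrent networks slow.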
The Transformer architecture combined with transfer learning (36) led to the construction of pre-trained models
that enable the development of models for various NLP tasks such as machine translation, text summarization,
question answering, text classification etc. Transformer models have also become a popular method for sentiment
analysis in different domains (8), (37), (38). Among the most popular examples of pre-trained model architectures are
BERT (39), Facebook’s XLM model (40) and OpenAI GPT. Several related models followed Google’s BERT, among them
XLNet (41), RoBERTa (42) and DistilBERT (43). The “HuggingFace” library is extremely popular in the deep learning
community since it is specifically designed to enable the development and deployment of such state-of-the-art
models for natural language processing (44). According to the official documentation, as of version 4.7.0 there are
over 62 supported model architectures and over 960 datasets. The library covers over 190 different languages, among
which English and Spanish are the best represented. Resources for Bulgarian are also available, but
to a far lesser degree than for other European languages. There are 38 datasets and 41 models applicable to
Bulgarian in version 4.7.0. However, few of them are language-specific models or datasets; most
are multilingual.
This section provides a summary of recent research focused specifically on sentiment analysis of public
opinion in the governmental domain. We focus on computational methods rather than qualitative approaches
to the task. Table 1 provides a self-contained comparative summary of all reviewed papers in terms of
the data sources, text processing techniques and machine learning algorithms used, as well as the language
resources used for text preprocessing and analysis. We consider all these aspects of the studies important in the
analyzed domain. The first subsection summarizes studies with a focus on digital public services, while the
second subsection summarizes studies applying sentiment analysis in other governmental domains (for example,
healthcare, politics and others).
TABLE 1. Summary of recent research devoted to citizen opinion mining
In (50) and (51) Kowalski et al. outline the benefits of text mining and machine learning for analyzing public
opinion and the implications for public administration. Both studies focus on public healthcare: the authors
analyze a sample of citizen reviews of primary care practices in England. The authors frame the determinants of user
satisfaction along several dimensions of service quality by applying LDA (Latent Dirichlet Allocation) (50) or
structural topic modelling (STM) (51) in combination with a Random Forest model for predicting the user
satisfaction level. The study provides important insights into the key drivers of patient satisfaction and recommends
concrete government actions needed to improve the services provided. The results suggest that patient
satisfaction levels are influenced by factors which are not captured by surveys. The authors emphasize the important
role of comments/reviews in free-text format, noting that “while surveys are reliable, they cover narrow
sub-samples of citizen experiences” (51).
Another study aimed at citizen opinion mining is that of Dandannavar et al. (52). The authors argue that
comments on social networks can be used efficiently to help and guide governments in the development of
successful and sustainable government initiatives and innovations. According to them, social sentiment analysis
can help identify citizens’ feelings and concerns about government programs and policies and their various
aspects. A framework for social sentiment analysis is proposed. The system has several phases: data collection,
preprocessing, feature extraction, sentiment analysis and polarity classification. For sentiment analysis, the authors
propose a combination of sentiment lexicons and machine learning methods (a hybrid approach). As in other related
work, the main data source of public opinion considered is Twitter.
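The early phases of such a framework (preprocessing and feature extraction) can be sketched as follows. This is an
illustration of the general idea, not the implementation of (52): the cleaning rules, the tiny stop-word list and the
bag-of-words features are assumptions, and the example tweet and account name are invented.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "for", "in"}


def preprocess(tweet):
    """Strip URLs and mentions, keep hashtag words, lowercase, drop stop words."""
    text = URL_RE.sub(" ", tweet)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def extract_features(tokens):
    """Bag-of-words term counts, usable by a lexicon scorer or a classifier."""
    return Counter(tokens)


tweet = "@gov_example The new #eID portal is great! https://example.org/info"
tokens = preprocess(tweet)
print(tokens)  # ['new', 'eid', 'portal', 'great']
print(extract_features(tokens))
```

The resulting token counts could then feed either a sentiment lexicon, a trained classifier, or both, which is what
makes the approach hybrid.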
Hubert et al. (53) propose a methodology for the analysis of government-citizen interactions on Twitter. The focus
is on interactions through official government accounts on the social network. The study aims to analyze
government activity, resources shared between government and citizens as part of interactions, citizen responses and
sentiments to government announcements etc. The interactions studied are in the fields of healthcare, social
development, education, environment, and others. The authors’ methodology consists of various visualization tools
used to reveal patterns and trends in government-citizen interactions on Twitter. For sentiment analysis the authors
use the NRC Affect Intensity Lexicon (the Spanish version, since the data is in Spanish) and examine eight
primary emotions: joy, trust, fear, surprise, sadness, disgust, anger, and anticipation. The aim is to assess the
general mood of citizens in response to government tweets. The authors consider the restriction to Twitter data a
limitation and plan to include other social networks.
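Assessing the general mood of replies with an emotion lexicon can be sketched as below. The lexicon entries and
intensities are invented for illustration and do not come from the NRC Affect Intensity Lexicon; the aggregation
rule (average summed intensity per reply) is likewise an assumption rather than the procedure of (53).

```python
import re
from collections import defaultdict

# Hypothetical entries: word -> {emotion: intensity in [0, 1]}.
TOY_EMOTION_LEXICON = {
    "thanks": {"joy": 0.6, "trust": 0.5},
    "afraid": {"fear": 0.8},
    "delay": {"anger": 0.4, "sadness": 0.3},
    "outrage": {"anger": 0.9, "disgust": 0.5},
}


def emotion_profile(replies, lexicon=TOY_EMOTION_LEXICON):
    """Sum emotion intensities over all replies and divide by the number of replies."""
    totals = defaultdict(float)
    for reply in replies:
        for token in re.findall(r"\w+", reply.lower()):
            for emotion, intensity in lexicon.get(token, {}).items():
                totals[emotion] += intensity
    n = max(len(replies), 1)
    return {emotion: total / n for emotion, total in totals.items()}


def dominant_emotion(profile):
    """Return the emotion with the highest average intensity, or None if empty."""
    return max(profile, key=profile.get) if profile else None


replies = ["Thanks for the update", "Another delay, this is an outrage"]
profile = emotion_profile(replies)
print(profile)
print(dominant_emotion(profile))  # anger
```

Computing such a profile per government tweet, or per time window, gives the kind of series that visualization
tools can then display.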
Another study using Twitter data as the main source of citizens’ opinions is that of Mendez et al. (54). The
study focuses on the public transportation system in Santiago, Chile. The authors’ aim is to overcome the limitations
of traditional surveys and inspect citizens’ opinions freely expressed on social networks. One of the interesting
research questions posed in the study is whether satisfaction surveys might be replaced by information reported on
Twitter, which has massive coverage and is free. To answer this question, the authors combine
sentiment analysis techniques with topic modeling. For sentiment analysis, the authors first experiment with the
Spanish version of SentiStrength. However, after manual annotation and comparison with the results of the
dictionary-based approach, it becomes clear that only 41% of tweets are correctly classified by SentiStrength.
Empirical results suggest that the level of detail and variety of answers in surveys are higher than those obtained by
analyzing comments in free-text format. However, the latter cover many topics and can be used to effectively
diagnose problems in a timely manner. The authors suggest combining the proposed methodology
with surveys as an effective way of mining public opinion.
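The comparison between a dictionary-based tool and manual annotation comes down to a simple agreement measure. The
sketch below computes accuracy and a confusion table from two label lists; the labels are invented and do not
reproduce the data or the 41% figure of (54).

```python
from collections import Counter


def accuracy(gold_labels, predicted_labels):
    """Share of items where the tool's label matches the manual annotation."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not gold_labels:
        raise ValueError("at least one labelled item is required")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


def confusion_counts(gold_labels, predicted_labels):
    """Count (manual label, tool label) pairs to see where the tool goes wrong."""
    return Counter(zip(gold_labels, predicted_labels))


# Hypothetical labels for five tweets.
manual = ["negative", "negative", "positive", "neutral", "negative"]
tool = ["negative", "neutral", "positive", "negative", "positive"]

print(accuracy(manual, tool))  # 0.4
print(confusion_counts(manual, tool))
```

The confusion table is often more informative than the single accuracy figure, since it shows, for instance, whether
complaints are systematically read as neutral.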
The focus of (55) is the risk communication management carried out by the government and the main health
organizations during the COVID-19 pandemic in Spain. Using web scraping techniques, the authors analyze
citizens’ interactions on various social media: Twitter, YouTube, Instagram, official press websites, and internet
forums. The study also investigates citizens’ emotions expressed on social media during the pandemic. The authors
use the Natural Language Understanding service of the IBM Watson system to mine the following citizens’ emotions:
anger, fear, disgust, and sadness. The IBM platform enables an analysis of syntactic characteristics and provides
information on concepts, emotions, entities, keywords, relationships, and semantic roles found in text data. The
study reveals interesting insights into public emotions regarding different aspects of the COVID-19 pandemic and
identifies the main issues in the government’s handling of the crisis (for example, a lack of dialogue between the
government and other social actors, as well as contradictory communication).
In (56) an ML framework for mining public sentiments from microblogs is proposed. The authors study public
opinions regarding the China Pakistan Economic Corridor (CPEC). The contributions of the study include the
development of: 1. a database of tweets on CPEC; 2. an ML-based sentiment analysis system for classifying public
tweets regarding CPEC; 3. a domain-specific sentiment lexicon. As in many other studies in the field, Twitter data is
used. Manual annotation is performed to categorize tweets as positive, negative, or neutral. An algorithm for
the automatic generation of a domain-specific lexicon is provided. During the construction of the sentiment lexicon,
the authors use POS tagging to extract adjectives and adverbs from raw textual data. The authors motivate
the choice of these parts of speech by claiming that they are more likely to convey public sentiments. After manual
annotation and sentiment lexicon generation, the sentiment analysis task is approached as a supervised problem and
three popular algorithms are used for polarity classification: k-NN, SVM and Logistic Regression. The authors plan
to include data from other social networks and to incorporate topic models into the sentiment analysis system.
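One way a domain-specific lexicon could be derived from annotated, POS-tagged tweets is sketched below. This is not
the algorithm of (56), whose details are not given here: the smoothed log-odds scoring, the function name and the
example data are assumptions. The sketch also assumes that POS tags in Penn Treebank style (JJ* for adjectives, RB*
for adverbs) have already been produced by an external tagger.

```python
import math
from collections import Counter

# Adjective and adverb tag prefixes in Penn Treebank notation.
SENTIMENT_TAG_PREFIXES = ("JJ", "RB")


def build_domain_lexicon(tagged_tweets, min_count=1, smoothing=1.0):
    """Score adjectives/adverbs by smoothed log-odds of positive vs negative tweets.

    tagged_tweets: iterable of (list of (word, pos_tag), label) pairs,
    where label is "positive", "negative" or "neutral".
    """
    pos_counts, neg_counts = Counter(), Counter()
    for tagged_tokens, label in tagged_tweets:
        # Count each candidate word once per tweet.
        words = {
            word.lower()
            for word, tag in tagged_tokens
            if tag.startswith(SENTIMENT_TAG_PREFIXES)
        }
        if label == "positive":
            pos_counts.update(words)
        elif label == "negative":
            neg_counts.update(words)
        # Neutral tweets carry no polarity evidence and are skipped.
    lexicon = {}
    for word in set(pos_counts) | set(neg_counts):
        p, n = pos_counts[word], neg_counts[word]
        if p + n < min_count:
            continue
        lexicon[word] = math.log((p + smoothing) / (n + smoothing))
    return lexicon


# Hypothetical annotated and tagged tweets.
EXAMPLE_TWEETS = [
    ([("promising", "JJ"), ("project", "NN")], "positive"),
    ([("highly", "RB"), ("promising", "JJ"), ("route", "NN")], "positive"),
    ([("costly", "JJ"), ("debt", "NN")], "negative"),
    ([("route", "NN"), ("opens", "VBZ")], "neutral"),
]
domain_lexicon = build_domain_lexicon(EXAMPLE_TWEETS)
print(sorted(domain_lexicon.items()))
```

Words with positive scores appear mostly in positively annotated tweets and vice versa; nouns and verbs never enter
the lexicon because of the tag filter.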
In (57) a straightforward approach to ML-based sentiment analysis is proposed by Andoh et al. Tweets
regarding political life in Ghana are collected and manually annotated as positive, negative, or neutral. Three
popular algorithms for text classification are tested: Random Forest, SVM and Naïve Bayes. In (60) a sentiment
tracking system using data generated from verified Twitter news accounts (news agencies, newspapers,
organizations etc.) is developed. One of the possible applications of the system is to facilitate the decision-making
processes of governments. After manual annotation of a subset of the sample followed by automatic annotation of
the rest of the sample, an ML-based approach to sentiment classification is used. Several text classification
algorithms are tested on both the manually and automatically annotated samples: Logistic Regression, Multilayer
Perceptron, Naïve Bayes, and SVM.
DISCUSSION
The field of citizen opinion mining has received a lot of research attention in recent years. In the current
review, 75% of the papers were published in the last two years. Undoubtedly, the field
will continue to expand as a result of government digitalization and the introduction of artificial intelligence
technologies in this domain. In terms of the main areas of public sentiment under analysis, the current review
includes mainly, but not only, papers in the area of e-government. Our review reveals that other areas in the
governmental domain which could benefit from citizen opinion mining include healthcare, policy making, political
issues and transportation. Studies in the area of e-government cover a wide range of topics, among them mining
opinions regarding mobile government apps and online public payment services, introducing KPI
measurements of e-government services, tracking opinions regarding digitalization campaigns in the public
sector, and detecting hidden social networks and propaganda against e-government. Some of the studies
are devoted to a particular digital public service, while others focus on capturing “the general picture” in
e-government development. Our review, supported by findings in other research, reveals a research gap in the field:
there are no studies using NLP and machine learning techniques to analyze Bulgarian citizens’ opinions on
the provision of electronic public services in Bulgaria.
From the review, we observe that sentiment analysis has been applied most frequently at the document level.
However, there are examples of studies carrying out the analysis at the sentence or even aspect level: (46), (59). It
is not surprising that sentiment analysis of citizen opinions has been applied mainly at the document level since the
main data source of such opinions appears to be Twitter, a social network in which people express
themselves by posting short messages called “tweets”. More than half of the reviewed articles use Twitter data,
but there are also studies considering other social media: (46), (55). Some authors cite the use of only
Twitter data as a research limitation. More attention should be paid to discussion forums as a source of public
opinions. However, such data has a rather noisy structure and poses some challenges for text processing and analysis.
Only one article in the current review claims to use forum data: (55). In terms of text language, along with the
most frequently used English and Spanish, there is interest in public opinion mining applications for low-resource
languages such as Arabic, Hindi and Jordanian.
In terms of general methodology for sentiment analysis, our study reveals that the use of sentiment lexicons or
manual annotation (or both) is almost inevitable. Among the sentiment analysis tools used in the domain of
citizen opinion mining are the NRC Affect Intensity Lexicon, SentiStrength and the IBM Watson system. Some authors
even develop domain-specific lexicons. Many studies apply hybrid approaches to the task of public opinion
mining because of the lack of labeled data. Among the most frequently used machine learning algorithms are
logistic regression and SVM. Classical machine learning methodologies are preferred and only two of the reviewed
articles apply deep learning techniques to public opinion mining. This might be due to several factors. The first
is the lack of labeled data and the usually small samples: such problems are not well suited to deep learning
applications. Another reason is that the field is still emerging, and researchers prefer to test more
straightforward, interpretable, and well-studied methodologies for sentiment analysis since all these aspects are
important in the governmental domain. However, in the future we expect more research effort to be put into the
application of deep learning and transfer learning to public opinion mining. Finally, it is important to mention that
some studies suggest that the most effective approach to public opinion mining is a combination of
traditional methods such as surveys and sentiment analysis of comments posted on social media: (51), (54).
ACKNOWLEDGMENTS
The presentation and dissemination of these research results are supported in part by the National Science Fund,
Project КП-06-Н45/3/30.11.2020 “Identifying citizens' attitudes and assessments about access, quality, and usage of
electronic public services”.
REFERENCES