
2018 3rd International Electrical Engineering Conference (IEEC 2018)

Feb, 2018 at IEP Centre, Karachi, Pakistan

An Overview of Lexicon-Based Approach For Sentiment Analysis


Azeema Sadia1, Fariha Khan2 and Fatima Bashir3
1,3 Department of Computer Science, Bahria University Karachi Campus, Pakistan
(1 [email protected], 3 [email protected])
2 Department of Electrical Engineering, Bahria University Karachi Campus, Pakistan
(2 [email protected])

Abstract: Sentiment analysis is the extraction of the thoughts, attitudes and subjectivity expressed in a text in order to identify its polarity, i.e. positive, negative or neutral. Three methods are available for sentiment analysis: supervised, lexicon-based and hybrid. The supervised method outperforms the lexicon-based method, and the hybrid method is a combination of both. The performance of the supervised method depends heavily on the quality and size of the training data. In lexicon-based analysis, on the other hand, a lexical item that appears positive in the text of one domain may appear negative in another domain; its accuracy is therefore not yet high, and improving it remains an interesting research topic in sentiment analysis. This paper provides a comprehensive overview of recent work on lexicon-based sentiment analysis along with its limitations, and also compares the results of our own method for binary and multiclass classification as a basis for our future work.

Keywords: Sentiment Analysis, Lexicon, Polarity, Opinion mining

I. INTRODUCTION

With huge volumes of data pouring in from every domain, such as engineering, medicine, management sciences, social media and others, there is a constant need for automated systems that classify this data according to different aspects. Sentiment analysis (SA) falls into the category of computational linguistics, where the aim is to determine the attitude of the author towards a particular topic; broadly speaking, different attitudes may result from people's beliefs, desires and feelings. Sentiment analysis is attracting considerable interest from public and corporate organizations that want to mine customer reviews and social media content for customer sentiment and opinion about their products and services. Researchers are currently developing techniques for different kinds of sentiment analysis. A fundamental kind is sentiment classification: classifying pieces of text into positive and negative polarity. Researchers have studied sentiment classification at the document level, at the sentence level and even at the level of smaller text segments.
Different algorithms have been applied so far, but the bottleneck still lies in achieving high accuracy. The existing techniques are successful in identifying the polarity of a sentence from its words but not from its context, i.e. a sentence can include positive words without necessarily being positive, and this confuses the classifier.

II. METHODS IN SENTIMENT ANALYSIS

Approaches to sentiment analysis can be broadly divided into three categories: (a) machine learning based algorithms, (b) the lexicon-based approach and (c) the hybrid approach, as shown in Fig. 1.

[Fig. 1 diagram: Sentiment Analysis divides into the Machine Learning Based Approach (Linear Classifiers: SVM, Neural Network; Decision Tree Classifier; Probabilistic Classifier: Naive Bayes, Bayesian Network), the Lexicon-Based Approach (Dictionary Based Approach; Corpus Based Approach) and the Hybrid Approach]
Fig.1 Sentiment Analysis Methods

A. Machine Learning Based Approach
Machine learning based algorithms train the classifier from manually labeled data. However, the quality and coverage of the training data strongly influence the performance of the classifier, i.e. it requires a large labeled dataset to be effective, which is its only drawback. This approach has better accuracy than the lexicon-based approach.

B. Lexicon-Based Approach
This approach uses a sentiment lexicon to determine the polarity (positive, negative or neutral) of a textual content. It is more understandable and easier to implement than machine learning based algorithms. However, the drawback is that it requires
the involvement of human beings in the process of text analysis.
The larger the volume of data, the greater the challenge of filtering out noise, identifying the sentiment and distinguishing useful information from the various content sources. The lexicon-based approach can be further divided into two categories: the dictionary-based approach (based on dictionary words, i.e. WordNet or other entries) and the corpus-based approach (using corpus data; it can be further divided into statistical and semantic approaches).

C. Hybrid Approach
This approach is the amalgamation of both machine learning and lexicon-based methods.
This overview can be valuable for newcomers to this field, as it includes a survey of different work on lexicon-based sentiment analysis.

III. EMPIRICAL STUDY
Before the advent of the World Wide Web, people used to ask friends or family for product recommendations, but now the internet helps us find out about the experience of those who have used different products. In this era more individuals are sharing their views with others through the internet. This huge volume of opinions on the internet has started the trend of learning about others' opinions. The year 2001 can be marked as the proper beginning of research on sentiment analysis of people's opinions. The term sentiment first appeared in 2001, used by Das et al. [2] and Tong [3]; Dave et al. published a paper in conference proceedings in 2003 and used the term opinion mining for the first time [4]. According to them, the ideal opinion-mining implementation would generate results for a particular product's attributes through search and then categorize the results as good, bad or mixed. Fig. 2 shows a typical process of lexicon-based sentiment analysis.

[Fig. 2 diagram: Input Dataset; Preprocess Data Set; Sentiment Identification (using the Lexicon Dataset); Sentiment Classification; Polarized Results]
Fig. 2 General process of Lexicon based Sentiment Analysis

Sentiment analysis is a perplexing task. A few of the challenges are as follows. Identifying the subjective part, i.e. the part that contains the sentiment, and deciding whether a word is subjective or objective is difficult. For example: A. “The customer’s language was very crude” (crude as opinion); B. “Crude oil is being imported” (crude is objective). Domain dependence is also of major importance, as a word can be positive in one domain and negative in another. For example, the word “unpredictable” is positive in the context of a movie but not when describing the amount of spices in a dish. Detecting sarcasm and contextual meaning is also difficult, as sarcasm means expressing negative comments in an indirect way using positive words. For example: “The restaurant is too good when it comes to bill”. Here the opinion is contextually negative but the words used are positive.
By 2002, different text classification techniques such as Naive Bayes and Support Vector Machines were in use. A sophisticated solution focused on binary classification was proposed, which handled the problem of model misfit apparent in some existing text categorization techniques [5]. Similarly, in 2004, Hu, M. and Liu, B. published the paper “Mining and Summarizing Customer Reviews”. Their research differed from traditional work because they mined only those aspects of the product on which customers had expressed opinions, and identified whether these opinions were negative or positive by using WordNet, which helped them determine the semantic orientation of opinion words [6].
Kanayama, H. and Nasukawa, T. proposed the idea of clause-level sentiment analysis in 2006. Their paper, titled “Fully Automatic Lexicon Expansion for Domain-oriented Sentiment Analysis”, described a methodology for clause-level sentiment analysis. The input document was first separated into sentences. After proposition detection, the final step was polarity assignment, which was done by comparing the polar items of their lexicon with the detected propositions [7].
Another lexicon-based method was given by Ding, X. et al. in 2008; the research proposed a technique to identify the orientation of opinions in product reviews that were context dependent. Previous research had considered only unambiguous opinions expressed by adjectives and adverbs [8].
In 2011, Taboada, M. et al. proposed a lexicon-based approach for the sentiment analysis of text. The idea proposed was the Semantic Orientation Calculator (SO-CAL), which used dictionaries of words annotated with their semantic orientation. The purpose of SO-CAL was to assign polarities (positive/negative) to the text. They also described the development of the dictionaries and used different dictionaries in order to assess the performance of SO-CAL [9].
Florian Wogenstein et al. in 2013 presented a paper in which a sentence-based opinion lexicon was used for the German language; they worked on phrases from the insurance domain and observed a large difference in accuracy between positive and negative statements [10]. In the same year Prabu Palanisamy et al. proposed the Serendio lexicon, consisting of positive words, negative words, stop words and expressions, and also proposed their own sentiment calculation technique [11].
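The general process of Fig. 2 (preprocess the input, identify sentiment words with a lexicon, classify the polarity) can be illustrated with a minimal Python sketch. It is not the method of any of the cited papers: the lexicon `TOY_LEXICON`, the stop-word list `STOP_WORDS` and the function names are hypothetical and chosen only for illustration, and the score is a plain sum of word polarities.

```python
import re

# Toy lexicon and stop-word list, invented for illustration only; a real
# system would load an established sentiment lexicon instead.
TOY_LEXICON = {"good": 1, "nice": 1, "great": 1,
               "bad": -1, "crude": -1, "poor": -1}
STOP_WORDS = {"the", "is", "a", "an", "and", "was", "it", "to"}


def preprocess(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def sentiment_score(tokens, lexicon):
    """Sum the lexicon polarity of every token; unknown words count as 0."""
    return sum(lexicon.get(t, 0) for t in tokens)


def classify(text, lexicon=TOY_LEXICON):
    """Map the summed score to positive, negative or neutral."""
    score = sentiment_score(preprocess(text), lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("The food was good and the service was great"))
print(classify("Crude oil is being imported"))
print(classify("The restaurant is too good when it comes to bill"))
```

With this toy lexicon the objective sentence “Crude oil is being imported” is labeled negative and the sarcastic restaurant sentence is labeled positive, which reproduces in miniature the subjectivity and sarcasm problems described in this section.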
Alexander Hogenboom et al. in 2014 presented the idea of multilingual sentiment analysis using a lexicon-based approach. Input text was translated into a reference language. Sentiment scores were mapped from the sentiment lexicon in the reference language to a new target sentiment lexicon by traversing the relations between language-specific semantic lexicons [12]. In the same year Gaurangi Patil et al. described data preprocessing and information retrieval using a support vector machine. It was indicated that the Support Vector Machine accommodates particular properties of text, for example a high-dimensional feature space and sparse instance vectors [13].
Sara Rosenthal et al. presented SemEval-2015 Task 10: Sentiment Analysis in Twitter. The input dataset consisted of tweets about general topics. The collected tweets were heavily skewed towards the neutral class, so the class imbalance was reduced by removing tweets that contained no sentiment words; for this purpose SentiWordNet was used as the database of sentiment words. The degree of polarity was assigned to the input tweets by using automatically created sentiment lexicons, i.e. the Hashtag Sentiment Lexicon and the Sentiment140 Lexicon [14].
In 2016, new meta-level features for sentiment analysis were proposed. Three classifiers, VADER, SentiStrength and SentiWordNet, were used to estimate the sentiment value of every message. Specifically, the positive and negative sentiment scores of every message were extracted by these methods, along with the compound and neutral scores provided by VADER [15]. Another study followed these steps to perform SA. The first step was preprocessing, which is basically data cleansing: noise in the data is removed by eliminating stop words, punctuation marks, etc. The second step was to calculate the probability of every term in a sentence separately using a unigram language model. The third step was to find the sentiment of every word, i.e. positive, negative or neutral, using a standard lexicon (the National Research Council Canada (NRC) lexicons), as shown in Figs. 3 and 4.

Fig. 4 Basic Block Diagram for SA Processes [16]

A comparison of different lexicons was performed in 2017. It introduced a lexicon named the WKWSCI Sentiment Lexicon and compared it to five prevailing lexicons: the Multi-Perspective Question Answering (MPQA) Subjectivity Lexicon, the National Research Council Canada (NRC) Word-Sentiment Association Lexicon, the Hu & Liu Opinion Lexicon, the Semantic Orientation Calculator (SO-CAL) lexicon and the General Inquirer. The effectiveness of the lexicons for sentiment classification at the sentence and document level was assessed on a news headlines dataset and an Amazon product review dataset. The MPQA, Hu & Liu, WKWSCI and SO-CAL lexicons resulted in precision rates of 75%–77%. Hu & Liu obtained the highest accuracy with a naive method of counting positives and negatives. The WKWSCI lexicon obtained a precision of 69% [17].
Aung, K. Z. et al. used a lexicon-based approach to predict teaching effectiveness. A database of English sentiment words was created as a lexical source to get the polarity of words [18]. This approach relied on bootstrapping using seed opinion words and an online thesaurus. First, a small set of opinion words with known orientations is collected manually; this set is then grown by searching WordNet for their synonyms and antonyms. The newly found words are added to the seed list, and the next cycle begins. The iterative process stops when no new words are found. The semantic orientation scores of the words in all sentences are then added to obtain the final polarity result [18].
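The bootstrapping loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions: a small in-memory dictionary, `TOY_THESAURUS`, stands in for WordNet, its entries are invented, and the function name `expand_lexicon` is hypothetical. Synonyms inherit the polarity of the word they were found from, antonyms receive the opposite polarity, and the loop ends when a cycle finds no new word.

```python
# Toy thesaurus standing in for WordNet; all entries are invented for
# illustration and do not come from any real lexical resource.
TOY_THESAURUS = {
    "good": {"synonyms": {"fine", "nice"}, "antonyms": {"bad"}},
    "nice": {"synonyms": {"pleasant"}, "antonyms": {"nasty"}},
    "bad": {"synonyms": {"poor"}, "antonyms": {"good"}},
    "poor": {"synonyms": set(), "antonyms": {"fine"}},
}


def expand_lexicon(seeds, thesaurus):
    """Grow a {word: polarity} lexicon from seed words until no new word appears."""
    lexicon = dict(seeds)
    frontier = list(seeds)            # words whose neighbours are still unexplored
    while frontier:                   # one pass of this loop is one bootstrapping cycle
        next_frontier = []
        for word in frontier:
            entry = thesaurus.get(word, {})
            polarity = lexicon[word]
            for synonym in entry.get("synonyms", ()):
                if synonym not in lexicon:
                    lexicon[synonym] = polarity       # same orientation as the source word
                    next_frontier.append(synonym)
            for antonym in entry.get("antonyms", ()):
                if antonym not in lexicon:
                    lexicon[antonym] = -polarity      # opposite orientation
                    next_frontier.append(antonym)
        frontier = next_frontier      # an empty list ends the loop
    return lexicon


expanded = expand_lexicon({"good": 1}, TOY_THESAURUS)
print(sorted(expanded.items()))
```

Starting from the single seed “good”, the sketch reaches seven words in three cycles. A word keeps the first polarity assigned to it, which is a simplification; a practical system would need a rule for words reached with conflicting orientations.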
Fig. 3 Dataset Refinement Flow

Table 1 below summarizes some of the papers that have contributed to the area of lexicon-based sentiment analysis, along with the limitations of their techniques.
Table 1 Different approaches used, along with classification type (polarity), data scope / dataset and limitations

| Year & Approach | Polarity | Data Scope & Dataset / source | Limitation |
|---|---|---|---|
| 2002: Approach to Handling Model Misfit in Text Categorization [5] | Positive, Negative | Data Scope: 1. Reuters-21578 dataset, 2. Usenet articles. Dataset / source: Reuters newswire, Lang (1995) | Improved ways are required to find the performance of the base classifier during the training phase. |
| 2002: Semantic Orientation for Unsupervised Classification of Reviews [19] | Positive, Negative | Dataset / source: Product reviews | Time limitation for queries, low level of accuracy for some applications. |
| 2004: Using Opinion Words [6] | Positive, Negative | Data Scope: Customer reviews. Dataset / source: Amazon.com | The algorithm does not cater to pronoun resolution, defining the strength of opinions, and scrutinizing opinions expressed with adverbs, verbs and nouns. |
| 2006: Automatic Lexicon Expansion for Domain-Oriented Sentiment Analysis [7] | Positive, Negative, Neutral | Data Scope: Japanese reviews data set. Dataset / source: Movie review data set (Turney, 2002); the human evaluation result, digital camera domain (Kanayama et al., 2004) | The approach cannot deal with the complexity of the words people use when presenting their opinions about a product. |
| 2008: Holistic Lexicon Based Approach [8] | Positive, Negative | Data Scope: Customer reviews. Dataset / source: http://www.cs.uic.edu/~liub/FBS/FBS.htm/ | The work is not able to find synonyms. |
| 2011: Lexicon Based Approach [9] | Positive, Negative | Data Scope: Review text. Dataset / source: 1. epinions.com, 2. Texts from the Polarity Dataset (Pang and Lee 2004), 3. Text used in Bloom, Garg, and Argamon (2007) | This technique cannot analyze sarcasm. |
| 2013: Simple and Practical Lexicon based Approach [11] | Positive, Negative, Neutral | Data Scope: Tweets. Dataset / source: Twitter.com | Not suitable for word sense disambiguation: for example, the word good is identified as positive, but it can be used in a non-positive sense, as in “Good mile from here”. |
| 2013: Aspect Based Opinion Mining [10] | Positive, Negative | Data Scope: German phrases. Dataset / source: http://nlp.stanford.edu/software/lex-parser.shtml | Incapable of dealing with verb-based phrases. |
| 2013: Sentiment Analysis of Movie Reviews [20] | Positive, Negative | Data Scope: Movie review dataset. Dataset / source: www.imdb.com | The only restriction is that it is domain specific and it is difficult to update the dictionary. |
| 2014: Lexicon Based Approach [12] | Positive, Negative | Data Scope: Micro blogs. Dataset / source: 1. www.cs.york.ac.uk/semeval-2013/task2/, 2. https://dev.twitter.com | --- |
| 2014: Lexicon Based Approach [13] | Positive, Negative | Data Scope: German phrases. Dataset / source: http://www.teezir.com | Misinterpretation of text can cause the failure of the algorithm. |
| 2015: Lexicon based approach [14] | Positive, Negative, Neutral | Data Scope: Tweets. Dataset / source: dev.twitter.com | --- |
| 2016: Sentiment Based Meta-lexicon Based Approach [15] | Positive, Negative, Neutral | Data Scope: Short messages. Dataset / source: 1. aisopos tw, 2. debate, 3. narr tw, 4. pappas ted, 5. pang movie, 6. sanders tw, 7. ss bbc, 8. ss digg, 9. ss myspace, 10. ss rw, 11. ss twitter, 12. ss youtube, 13. stanford tw, 14. msemeval tw, 15. vader amzn, 16. vader movie, 17. vader nyt, 18. vader tw, 19. yelp review | --- |
| 2017: Comparative study [17] | Positive, Negative, Neutral | Data Scope: Amazon product review data set, news headlines data set | --- |
| 2017: Lexicon-Based Approach for Students’ Comments [18] | Positive, Negative, Neutral | Dataset / source: Department of Languages, the University of Computer Studies, Mandalay | This technique cannot analyze sarcasm. |
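For the approach of [16], in which the probability of each term under a unigram language model is multiplied by its lexicon score, a minimal Python sketch might look as follows. It is not the implementation of [16]: estimating unigram probabilities by relative frequency, summing the per-word products, the neutral `margin` of 0.05, and every name and number in the toy data are assumptions made here for illustration.

```python
from collections import Counter


def unigram_probs(corpus_tokens):
    """Estimate P(word) as the relative frequency of the word in the corpus."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def polarity_score(tokens, probs, lexicon):
    """Sum P(word) * lexicon score over the tokens; unknown words add 0."""
    return sum(probs.get(t, 0.0) * lexicon.get(t, 0) for t in tokens)


def classify_binary(score):
    """Two classes: every input is forced to be positive or negative."""
    return "positive" if score >= 0 else "negative"


def classify_three_way(score, margin=0.05):
    """Three classes: scores within +/- margin of zero are called neutral."""
    if score > margin:
        return "positive"
    if score < -margin:
        return "negative"
    return "neutral"


# Toy data, invented for illustration only.
probs = unigram_probs(["nice", "nice", "killer", "ok"])
lexicon = {"nice": 1, "killer": -1}

review = "the steak was nice it had killer flavor".split()
score = polarity_score(review, probs, lexicon)
print(score, classify_binary(score), classify_three_way(score))
```

In the binary setting a review with a score of exactly zero still has to be assigned to one of the two classes, whereas the three-way setting needs an additional decision boundary around zero; the `margin` parameter makes that extra choice explicit.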

IV. CONCLUSION

This survey paper presents an overview of recent updates in lexicon-based sentiment analysis. The articles discussed describe contributions to many areas related to sentiment analysis that use lexicon-based analysis. After analyzing these articles, it is clear that lexicon-based sentiment analysis is still an open field for research. Most of the research is on the English language, but interest in other languages is now increasing because they lack resources and studies. WordNet is the most common lexicon source. In almost all applications it is of utmost importance to consider the context of the text rather than just plain polarity, and for that we still need enhancements in our algorithms.

V. FUTURE WORK
In previous research work related to lexicon-based sentiment analysis on restaurant reviews, a unigram language model was combined with the NRC lexicon in order to obtain a polarity score [16]. The polarity of the input depended on the result of the unigram language model multiplied by the score from the lexicon dictionary, as shown in Fig. 4 [16]. Classification is generally of two types: binary, i.e. dividing the reviews into two groups, positive and negative; and multiclass, i.e. dividing the reviews into more than two groups, in our case three: positive, negative and neutral.
The results showed that 85.5% accuracy was achieved for binary classification, which decreased to 48% for multiclass classification. The inclusion of one more class, i.e. neutral, increases the difficulty, because the reviews now have to be divided into three groups, and differentiating between positive and neutral, and between negative and neutral, becomes challenging. An example of a positive review that can be mistaken for negative is “The steak was nice it had killer flavor”; here the word killer can mislead the sentiment because the algorithm is unable to identify the reviewer's context. An example of a neutral sentence that can be mistaken for positive is “The ambiance is good and the food is ok”.
We intend to use bigrams and trigrams to analyze whether the accuracy improves for multiclass classification or whether it affects the accuracy for binary classification. We also aim to check and compare the accuracy of the proposed lexicon model using our own dataset with machine learning algorithms in future research.

REFERENCES
[1] Medhat, W., Hassan, A., & Korashy, H., “Sentiment Analysis Algorithms and Applications: A Survey,” Ain Shams Engineering Journal, pp. 1093-1113, 2014.
[2] Sanjiv Das, Mike Chen, “Extracting Market Sentiment From Stock Message Boards,” Proceedings of the Asia Pacific Finance Association Annual Conference (APFA), 2001.
[3] Richard M. Tong, “An Operational System for Detecting and Tracking Opinions in On-Line Discussion,” Proceedings of the Workshop on Operational Text Classification (OTC), 2001.
[4] Dave, K., Lawrence, S., & Pennock, D. M., “Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews,” Proceedings of the 12th International Conference on World Wide Web, ACM, pp. 519-528, 2003.
[5] Wu, H., Phang, T. H., Liu, B., & Li, X., “A Refinement Approach to Handling Model Misfit in Text Categorization,” Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 207-216, 2002.
[6] Hu, M., & Liu, B., “Mining and Summarizing Customer Reviews,” Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 168-177, 2004.
[7] Kanayama, H., & Nasukawa, T., “Fully Automatic Lexicon Expansion for Domain-Oriented Sentiment Analysis,” Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 355-363, 2006.
[8] Ding, X., Liu, B., & Yu, P. S., “A Holistic Lexicon-Based Approach to Opinion Mining,” Proceedings of the 2008 International Conference on Web Search and Data Mining, ACM, pp. 231-240, 2008.
[9] Taboada, M., Brooke, J., Tofiloski, M., Voll, K., & Stede, M., “Lexicon-Based Methods for Sentiment Analysis,” Computational Linguistics, pp. 267-307,
2011.
[10] Wogenstein, F., Drescher, J., Reinel, D., Rill, S., & Scheidt, J., “Evaluation of an Algorithm for Aspect-Based Opinion Mining using a Lexicon-Based Approach,” Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining, ACM, pp. 5, 2013.
[11] Palanisamy, P., Yadav, V., & Elchuri, H., “Serendio: Simple and Practical Lexicon Based Approach to Sentiment Analysis,” Proceedings of Second Joint Conference on Lexical and Computational Semantics, pp. 543-548, 2013.
[12] Hogenboom, A., Heerschop, B., Frasincar, F., Kaymak, U., & de Jong, F., “Multi-lingual Support for Lexicon-Based Sentiment Analysis Guided by Semantics,” Decision Support Systems, pp. 43-53, 2014.
[13] Gaurangi Patil, Varsha Galande, Vedant Kekan, Kalpana Dange, “Sentiment Analysis Using Support Vector Machine,” International Journal of Innovative Research in Computer and Communication Engineering, 2014.
[14] Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif M. Mohammad, Alan Ritter, Veselin Stoyanov, “SemEval-2015 Task 10: Sentiment Analysis in Twitter,” Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 451–463, 2015.
[15] Canuto, S., Gonçalves, M. A., & Benevenuto, F., “Exploiting New Sentiment-Based Meta-Level Features for Effective Sentiment Analysis,” Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pp. 53-62, 2016.
[16] Sadia, A., “Sentiment Analysis using Language Model,” International Conference on Emerging Trends in Engineering, Sciences and Technology, pp. 50-53, 2016.
[17] Khoo, C. S., & Johnkhan, S. B., “Lexicon-Based Sentiment Analysis: Comparative Evaluation of Six Sentiment Lexicons,” Journal of Information Science, 2017.
[18] Aung, K. Z., & Myo, N. N., “Sentiment Analysis of Students' Comment using Lexicon Based Approach,” IEEE/ACIS 16th International Conference on Computer and Information Science (ICIS), IEEE, pp. 149-154, 2017.
[19] P. D. Turney, “Thumbs Up or Thumbs Down?: Semantic Orientation Applied to Unsupervised Classification of Reviews,” Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL), pp. 417–424, 2002.
[20] V. K. Singh, R. Piryani, A. Uddin, P. Waila, “Sentiment Analysis of Movie Reviews: A New Feature-Based Heuristic for Aspect-Level Sentiment Classification,” International Multi-Conference on Automation, Computing, Communication, Control and Compressed Sensing (iMac4s), 2013.
