Unit III Sentiment Analysis
Unit III
Syllabus
SENTIMENT ANALYSIS
We humans are social beings, adept at using a variety of means to
communicate. We often consult financial discussion forums before making an
investment decision; ask our friends for their opinions on a newly opened
restaurant or a newly released movie; and conduct Internet searches and read
consumer reviews and expert reports before making a big purchase like a
house, a car, or an appliance.
We rely on others' opinions to make better decisions, especially in areas
where we lack knowledge.
SENTIMENT ANALYSIS APPLICATIONS
text analytics, tapping into data sources like tweets, Facebook
posts, online communities, discussion boards, Web logs,
product reviews, call center logs and recordings, product
rating sites, chat rooms, price comparison portals, search
engine logs, and newsgroups. The following applications of
POLITICS
Sentiment analysis can help understand what voters are thinking and can
clarify a candidate's position on issues.
Sentiment analysis can help political organizations, campaigns, and news
analysts to better understand which issues and positions matter the most to
voters.
The technology was successfully applied by both parties to the 2008 and
2012 American presidential election campaigns.
GOVERNMENT INTELLIGENCE
Government intelligence is another application that has been
used by intelligence agencies.
For example, it has been suggested that one could monitor
sources for increases in hostile or negative communications.
STEP 2: N-P POLARITY CLASSIFICATION
either an overall positive or an overall negative opinion (e.g., thumbs up
or thumbs down).
In addition to the identification of N-P polarity, one should also be
interested in identifying the strength of the sentiment (as opposed to just
positive, it may be expressed as mildly, moderately, strongly, or very
strongly positive).
Most of this research was done on product or movie reviews where the
definitions of "positive" and "negative" are quite clear. Other tasks, such
as classifying news as "good" or "bad," present some difficulty. For
instance, an article may contain negative news without explicitly using any
subjective words or terms.
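A minimal sketch of the polarity-plus-strength idea from this step, assuming each text has already been given a signed score in [-1, 1] by some earlier process. The function name `classify_polarity`, the threshold values, and the `neutral` outcome for a score of exactly zero are illustrative assumptions, not something the material above prescribes.

```python
def classify_polarity(score, thresholds=(0.25, 0.5, 0.75)):
    """Map a signed score in [-1, 1] to a (polarity, strength) pair.

    The thresholds are hypothetical cut points chosen for illustration.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie in [-1, 1]")
    if score == 0:
        # Assumption: a zero score carries no N-P polarity at all.
        return ("neutral", None)

    polarity = "positive" if score > 0 else "negative"
    strengths = ("mildly", "moderately", "strongly", "very strongly")
    # Count how many thresholds the magnitude exceeds (0 to 3).
    level = sum(abs(score) > t for t in thresholds)
    return (polarity, strengths[level])


if __name__ == "__main__":
    for s in (0.9, 0.1, -0.3):
        print(s, classify_polarity(s))
```

In practice the cut points would have to be calibrated against labeled data for the domain; fixed values like these are only a starting point.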
STEP 3: TARGET IDENTIFICATION
The goal of this step is to accurately identify the target of the expressed
sentiment (e.g., a person, a product, an event, etc.).
The difficulty of this task depends largely on the domain of the analysis.
Even though it is usually easy to accurately identify the target for product
or movie reviews, because the review is directly connected to the target, it
may be quite challenging in other domains. For instance, lengthy,
general-purpose text such as Web pages, news articles, and blogs does not
always have a predefined topic assigned to it, and often mentions many
objects, any of which may be deduced as the target.
Sometimes there is more than one target in a sentiment sentence, which is
the case in comparative texts. A subjective comparative sentence ranks
objects in order of preference; for example, "This laptop computer is better
than my desktop PC." These sentences can be identified using comparative
adjectives and adverbs (more, less, better, longer), superlative adjectives
(most, least, best), and other words (such as same, differ, win, prefer, etc.).
Once the sentences have been retrieved, the objects can be put in an order
that is most representative of their merits, as described in the text.
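The cue-word approach to spotting comparative sentences can be sketched as follows. The word lists are the examples given in the text; the function names and the simple exact-match tokenization are assumptions made for illustration. This sketch only flags candidate sentences; it does not attempt the later ordering of objects.

```python
import re

# Cue words taken from the examples in the text above.
COMPARATIVE_CUES = {"more", "less", "better", "longer"}
SUPERLATIVE_CUES = {"most", "least", "best"}
OTHER_CUES = {"same", "differ", "win", "prefer"}


def comparative_cues(sentence):
    """Return (word, cue_type) pairs for every cue word found, in order."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    found = []
    for tok in tokens:
        if tok in COMPARATIVE_CUES:
            found.append((tok, "comparative"))
        elif tok in SUPERLATIVE_CUES:
            found.append((tok, "superlative"))
        elif tok in OTHER_CUES:
            found.append((tok, "other"))
    return found


def is_comparative(sentence):
    """Flag a sentence as a comparative candidate if any cue word appears."""
    return bool(comparative_cues(sentence))


if __name__ == "__main__":
    print(comparative_cues("This laptop computer is better than my desktop PC."))
```

Exact matching misses inflected forms such as "prefers" and will also flag non-comparative uses of words like "more", so a real system would likely add stemming and part-of-speech information.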
STEP 4: COLLECTION AND AGGREGATION
Once the sentiments of all text data points in the document are
identified and calculated, in this step they are aggregated and
converted to a single sentiment measure for the whole
document.
This aggregation may be as simple as summing up the
polarities and strengths of all texts, or as complex as using
semantic aggregation techniques from natural language
processing to come up with the ultimate sentiment.
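The simplest aggregation mentioned above, summing polarities and strengths, might look like the sketch below. It assumes each text data point has been reduced to a pair of polarity (+1 or -1) and a non-negative strength; the function name and the extra mean value it returns are illustrative choices, not part of the source.

```python
def aggregate_sentiment(scored_texts):
    """Combine (polarity, strength) pairs into document-level measures.

    polarity is +1 or -1; strength is a non-negative number.
    Returns (total, mean), where total is the signed sum and mean is
    total divided by the number of data points (0.0 if there are none).
    """
    total = 0.0
    count = 0
    for polarity, strength in scored_texts:
        if polarity not in (1, -1):
            raise ValueError("polarity must be +1 or -1")
        if strength < 0:
            raise ValueError("strength must be non-negative")
        total += polarity * strength
        count += 1
    mean = total / count if count else 0.0
    return total, mean


if __name__ == "__main__":
    # Hypothetical scores for four sentences of one document.
    sentences = [(1, 0.75), (-1, 0.25), (1, 0.5), (-1, 0.5)]
    print(aggregate_sentiment(sentences))
```

Dividing by the count keeps long documents from dominating purely because they contain more sentences; whether the sum or the mean is the better measure depends on the application.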
Methods for Polarity Identification
As mentioned in the previous section, polarity identification (identifying
the polarity of a text) can be made at the word, term, sentence, or document
level. The most granular level for polarity identification is the word level.
Once the polarity identification is made at the word level, it can be
aggregated to the next higher level, and then the next, until the level of
aggregation desired from the sentiment analysis is reached.
There seem to be two dominant techniques used for identification of
polarity at the word/term level, each having its advantages and
disadvantages:
1. Using a lexicon as a reference library (either developed manually or
automatically, by an individual for a specific task or developed by an
institution for general use)
2. Using a collection of training documents as the source of knowledge
about the polarity of terms within a specific domain (i.e., inducing
predictive models from opinionated textual documents)
Using a Lexicon
A lexicon is essentially the catalog of words, their synonyms, and their meanings for a given
language. In addition to lexicons for many other languages, there are several general-purpose lexicons
created for English.
Often general-purpose lexicons are used to create a variety of special-purpose lexicons for use in
sentiment analysis projects.
Perhaps the most popular general-purpose lexicon is WordNet, created at Princeton University, which
has been extended and used by many researchers and practitioners for sentiment analysis purposes.
As described on the WordNet Web site (wordnet.princeton.edu), it is a large lexical database of
English, including nouns, verbs, adjectives, and adverbs grouped into sets of cognitive synonyms (i.e.,
synsets), each expressing a distinct concept.
Synsets are interlinked by means of conceptual-semantic and lexical relations. An interesting
extension of WordNet was created by Esuli and Sebastiani (2006), who added polarity
(Positive-Negative) and objectivity (Subjective-Objective) labels for each term in the lexicon.
To label each term, they classify the synset (a group of synonyms) to which the term belongs using a
set of ternary classifiers (a measure that attaches to each object exactly one out of three labels), each
of them capable of deciding whether a synset is Positive, Negative, or Objective.
The resulting scores range from 0.0 to 1.0, giving a graded evaluation of opinion-related properties of
the terms.
These can be summed up visually as in Figure 7.11. Each edge of the triangle represents one of the
three classifications (positive, negative, and objective).
A term can be located in this space as a point, representing the extent to which it belongs to each of
the classifications. A similar extension methodology is used to create SentiWordNet, a publicly
available lexicon specifically developed for opinion mining (sentiment analysis) purposes.
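A small sketch of how graded positive, negative, and objective scores could be stored and read. It assumes, as is the case in SentiWordNet, that the three scores of an entry sum to 1.0, so the objective score can be derived from the other two. The entries in `TOY_LEXICON` are invented for illustration and are not taken from any real lexicon; the function names are likewise hypothetical.

```python
# Hypothetical (positive, negative) scores; NOT real lexicon values.
TOY_LEXICON = {
    "excellent": (0.75, 0.0),
    "terrible": (0.0, 0.875),
    "table": (0.0, 0.0),
}


def opinion_scores(term, lexicon=TOY_LEXICON):
    """Return the three graded scores for a term as a dict.

    Assumes positive + negative + objective == 1.0 for every entry.
    """
    pos, neg = lexicon[term]
    if not (0.0 <= pos <= 1.0 and 0.0 <= neg <= 1.0 and pos + neg <= 1.0):
        raise ValueError("scores must lie in [0, 1] and sum to at most 1")
    return {"positive": pos, "negative": neg, "objective": 1.0 - (pos + neg)}


def dominant_label(term, lexicon=TOY_LEXICON):
    """Return the classification with the highest score.

    Ties are broken by dict order (positive, negative, objective).
    """
    scores = opinion_scores(term, lexicon)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for word in TOY_LEXICON:
        print(word, opinion_scores(word), dominant_label(word))
```

The three scores together play the role of the point inside the triangle: a term with a high objective score sits far from both the positive and the negative corner.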
SentiWordNet
SentiWordNet assigns to each synset of WordNet three sentiment scores:
positivity, negativity, objectivity. More about SentiWordNet can be found
at sentiwordnet.isti.cnr.it. Another extension to WordNet is
WordNet-Affect, developed by Strapparava and Valitutti (2004).
They label WordNet synsets using affective labels representing different
affective categories like emotion, cognitive state, attitude, feeling, and so
on.
WordNet has also been directly used in sentiment analysis. For example,
Kim and Hovy (2004) and Hu and Liu (2005)
generate lexicons of positive and negative terms by starting with a small
list of "seed" terms of known polarities (e.g., love, like, nice, etc.) and
then using the antonymy and synonymy properties of terms to group them
into either of the polarity categories.
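The seed-expansion idea can be sketched as a breadth-first walk over synonym and antonym links: synonyms inherit a word's polarity, antonyms receive the opposite. This is a simplified illustration of the general approach, not the specific procedure of the cited authors. The `SYNONYMS` and `ANTONYMS` tables are tiny hand-made stand-ins for WordNet relations, and `expand_seeds` is a hypothetical name.

```python
from collections import deque

# Hypothetical toy relations standing in for WordNet synonymy/antonymy.
SYNONYMS = {
    "love": {"adore"},
    "nice": {"good", "pleasant"},
    "good": {"nice", "fine"},
    "bad": {"poor", "awful"},
    "hate": {"detest"},
}
ANTONYMS = {
    "love": {"hate"},
    "good": {"bad"},
}


def expand_seeds(seeds, synonyms, antonyms):
    """Propagate polarity (+1 / -1) from seed words through the relations.

    A word keeps the first polarity it is assigned; conflicts that a
    real lexicon would produce are not resolved in this sketch.
    """
    polarity = dict(seeds)
    queue = deque(polarity)
    while queue:
        word = queue.popleft()
        label = polarity[word]
        for syn in synonyms.get(word, ()):
            if syn not in polarity:
                polarity[syn] = label       # synonyms share polarity
                queue.append(syn)
        for ant in antonyms.get(word, ()):
            if ant not in polarity:
                polarity[ant] = -label      # antonyms flip polarity
                queue.append(ant)
    return polarity


if __name__ == "__main__":
    lexicon = expand_seeds({"love": 1, "nice": 1}, SYNONYMS, ANTONYMS)
    for word in sorted(lexicon):
        print(word, lexicon[word])
```

Starting from only two positive seeds, the walk also reaches negative terms through the antonym links, which is what lets a short seed list grow into a two-sided lexicon.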
Using a Collection of Training Documents
It is possible to perform sentiment classification using statistical analysis and
machine learning tools that take advantage of the vast resources of labeled
documents available (labeled manually by annotators or through a star/point
system).
Product review Web sites like Amazon, C-NET, eBay, RottenTomatoes, and
the Internet Movie Database (IMDB) have all been extensively used as
sources of annotated data.
The star (or tomato, as it were) system provides an explicit label of the
overall polarity of the review, and it is often taken as a gold standard in
algorithm evaluation.
A variety of manually labeled textual data is available through evaluation
efforts such as the Text REtrieval Conference (TREC), NII Test Collection
for IR Systems (NTCIR), and Cross Language Evaluation Forum (CLEF).
The data sets these efforts produce often serve as a standard in the text
mining community, including for sentiment analysis researchers. Individual
researchers and research groups have also produced many interesting data
sets.
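To illustrate how star ratings can serve as training labels, here is a minimal sketch that turns ratings into positive/negative labels and trains a small multinomial Naive Bayes classifier using only the standard library. Naive Bayes is chosen here purely as a familiar example; the text does not name a specific algorithm. The rating cutoffs, the function names, and the `SAMPLE_REVIEWS` data are all hypothetical.

```python
import math
import re
from collections import Counter

# Invented reviews with star ratings, for illustration only.
SAMPLE_REVIEWS = [
    ("Great movie, loved the acting", 5),
    ("Wonderful and moving story", 4),
    ("Terrible plot and awful acting", 1),
    ("Boring, a waste of time", 2),
    ("It was okay", 3),
]


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def label_from_stars(stars):
    """Assumed mapping: 4-5 stars positive, 1-2 negative, 3 left out."""
    if stars >= 4:
        return "pos"
    if stars <= 2:
        return "neg"
    return None


def train(reviews):
    """Count words per class; returns (word_counts, doc_counts, vocab)."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, stars in reviews:
        label = label_from_stars(stars)
        if label is None:
            continue  # skip ambiguous middle ratings
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    if not doc_counts["pos"] or not doc_counts["neg"]:
        raise ValueError("need at least one review of each class")
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    return word_counts, doc_counts, vocab


def predict(text, model):
    """Return the class with the higher log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("pos", "neg"):
        score = math.log(doc_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                # add-one (Laplace) smoothing
                score += math.log((word_counts[label][tok] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    model = train(SAMPLE_REVIEWS)
    print(predict("loved this wonderful story", model))
    print(predict("awful boring plot", model))
```

Five reviews are far too few for a meaningful model; the point is only that the rating supplies the label, so no manual annotation step is needed.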
Once the semantic orientation of individual words has been
determined, it is often desirable to extend this to the phrase or
sentence the word appears in.