Text and Sentiment Analysis
• Text Analytics
• Sentiment Analysis
• Web Mining
• Information Retrieval
Text Analytics
• The amount of information available on the Web has increased
rapidly (Information-explosion era)
– World’s data doubles every 18 months
• Users demand useful and reliable information from the Web in the
shortest time possible
• Obstacles to fulfilling this demand include:
– Language barriers, diversified users
– Users may provide only vague specifications of the information
they want
• We must perform searching and extracting information from the
Web texts using NLP technologies
Text Analytics
• Data-mining: Extraction of interesting information (or patterns) from
structured data.
• 80-90% of all data is held in various unstructured formats
• Useful information can be derived from this unstructured data
• Intelligence in text mining is based on NLP techniques
• NLP can be used as a preprocessing technique to harvest data and
get an initial understanding of the patterns that exist in the data
• Text Mining = Statistical NLP (turning text into structured data) + Data
mining (pattern discovery)
Text Analytics
• Text Preprocessing
– Syntactic/Semantic text analysis
• Features Generation
– Bag of words
• Features Selection
– Simple counting
– Statistics
• Data Mining
– Classification (Supervised) / Clustering (Unsupervised)
• Analyzing results
Text Analytics: Text Preprocessing
Removal of punctuation
Removal of numbers
Stemming
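The three preprocessing steps listed above can be sketched in Python; the sample sentence and the toy suffix list are invented for illustration (a real pipeline would use a proper stemmer such as Porter's):

```python
import re
import string

def preprocess(text):
    """Remove punctuation and numbers, then lowercase the text."""
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", "", text)                                   # remove numbers
    return text.lower()

def stem(word):
    """Toy stemmer: strip a few common English suffixes (illustration only)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

cleaned = preprocess("In 2023, sales grew 40% -- a record!")
stemmed = [stem(w) for w in cleaned.split()]
```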
Text Analytics
• Feature Generation
– Text document is represented by the words it contains
(and their occurrences)
- Order of words is not that important for certain applications
(Bag of words)
– Stemming: identifies a word by its root
- Reduce dimensionality
• Stop words: The common words unlikely to help text mining
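A bag-of-words feature generator, with stopword removal, can be sketched as follows; the stopword list here is a small invented sample, not a standard list:

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "is", "and", "to"}  # small illustrative list

def bag_of_words(text):
    """Represent a document by its word counts, ignoring word order and stopwords."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t not in STOPWORDS)

bow = bag_of_words("the cat sat and the dog sat")
```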
Text Analytics
• Feature Selection
– Reduce dimensionality
- Learners have difficulty addressing tasks with high dimensionality
- Only interested in the information relevant to what is being analyzed
– Irrelevant features
- Not all features help
Text Analytics
• Supervised learning (classification)
– The training data is labeled indicating the class
– New data is classified based on the training set
– Correct classification: the known label of a test sample matches the class
predicted by the classification model
• Unsupervised learning (clustering)
– The class labels of training data are unknown
– Establish the existence of classes or clusters in the data
– Good clustering method: high intra-cluster similarity and low inter-cluster
similarity
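The "high intra-cluster, low inter-cluster similarity" criterion can be illustrated with cosine similarity on toy word-count vectors (all documents and counts below are invented):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy word-count vectors over the vocabulary (game, team, recipe, oven)
sports1 = [3, 2, 0, 0]
sports2 = [2, 3, 0, 1]
cooking = [0, 0, 4, 2]

intra = cosine(sports1, sports2)  # similarity within the "sports" cluster
inter = cosine(sports1, cooking)  # similarity across clusters
```

A good clustering should give a clearly larger `intra` than `inter`.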
Text Analytics
• Descriptive: understanding underlying processes or behavior
– Web Mining (Opinion extraction, Sentiment analysis)
– Clustering (Blogs, Patterns and trends)
• Predictive: predict an unseen or unmeasured value
– Classification (Spam detection)
– Information Retrieval (Searching)
– Pattern and trend forecasting, Knowledge Acquisition from query logs
Text Analytics
• Statistical NLP
– POS Tagging
– Ambiguity
– Tokenization / Sentence Detection / Parsing
– Context
– Stemming
– Synonymy and polysemy
• Data Mining
– Massive amounts of data
– No training data available
Overview of Text Analytics
• In the data preparation section we discuss five steps to prepare texts for
analysis.
• The first step, importing text, covers the functions for reading texts
from various types of file formats (e.g., txt, csv, pdf) into a raw text
corpus in R.
• The steps string operations and preprocessing cover techniques for
manipulating raw texts and processing them into tokens (i.e., units of
text, such as words or word stems).
• The tokens are then used for creating the document-term matrix
(DTM), which is a common format for representing a bag-of-words type
corpus, that is used by many R text analysis packages.
• Finally, it is a common step to filter and weight the terms in the DTM.
Overview of Text Analytics
• Importing text
• To map all known characters to a single scheme, the Unicode standard was
proposed; it still requires a digital encoding format, such as UTF-8 (or
UTF-16 or UTF-32).
• String operations
• Digital text is represented as a sequence of characters, called a string.
• Strings are represented as objects called “character” types, which are vectors of
strings.
• The group of string operations refers to the low-level operations for working
with textual data.
• The most common string operations are joining, splitting, and extracting parts of
strings (collectively referred to as parsing) and the use of regular expressions to
find or replace patterns.
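The common string operations named above (joining, splitting, extracting, and regular-expression find/replace) can be sketched in Python; the example sentence is invented:

```python
import re

s = "Text mining, at scale, needs string operations."

parts = s.split(", ")                    # splitting
joined = " | ".join(parts)               # joining
words = re.findall(r"[A-Za-z]+", s)      # extracting parts with a regular expression
fixed = re.sub(r"\bneeds\b", "uses", s)  # find-and-replace by pattern
```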
Overview of Text Analytics
• Preprocessing
• Full texts must be tokenized into smaller, more specific text features, such as
words or word combinations.
• Also, the computational performance and accuracy of many text analysis
techniques can be improved by normalizing features, or by removing
“stopwords”: words designated in advance to be of no interest, and which are
therefore discarded prior to analysis.
• Tokenization
• Tokenization is the process of splitting a text into tokens.
• Most often tokens are words, because these are the most common semantically
meaningful components of texts.
• For many languages, splitting texts by words can mostly be done with low-level
string processing due to clear indicators of word boundaries, such as white
spaces, dots and commas.
• A good tokenizer, however, must also be able to handle certain exceptions, such
as the period in the title “Dr.”, which can be confused for a sentence boundary.
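The "Dr." exception can be handled with a small abbreviation list, as in this sketch of a sentence splitter (the abbreviation set is illustrative, not exhaustive):

```python
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof."}  # illustrative, not exhaustive

def split_sentences(text):
    """Split text on sentence-final periods, but not after known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

split_sentences("Dr. Smith arrived. The talk began.")
```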
Overview of Text Analytics
• Normalization: Lowercasing and stemming
• The process of normalization broadly refers to the transformation of words into a
more uniform form.
• This can be important if for a certain analysis a computer has to recognize when
two words have (roughly) the same meaning, even if they are written slightly
differently.
• Another advantage is that it reduces the size of the vocabulary (i.e., the full range
of features used in the analysis).
• A simple but important normalization technique is to make all text lower case.
• If we do not perform this transformation, then a computer will not recognize that
two words are identical if one of them was capitalized because it occurred at the
start of a sentence.
Overview of Text Analytics
• Normalization: stemming and lemmatization
• Another argument for normalization is that a base word might have different
morphological variations, such as the suffixes from conjugating a verb, or
making a noun plural. Example: break, breaks, breaking, broken, broke
• For purposes of analysis, we might wish to consider these variations as
equivalent because of their close semantic relation, and because reducing the
feature space is generally desirable when multiple features are in fact closely
related.
• A technique for achieving this is stemming, which is essentially a rule-based
algorithm that converts inflected forms of words into their base forms (stems).
• A more advanced technique is lemmatization, which uses a dictionary to replace
words with their morphological root form.
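The contrast between the two techniques can be sketched with toy versions of each; the suffix rules and the lemma dictionary below are invented miniatures of what Porter-style stemmers and real lemmatizers provide:

```python
def stem(word):
    """Rule-based: strip common inflectional suffixes (crude approximation)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Dictionary-based lemmatization maps irregular forms the rules cannot reach
LEMMA_DICT = {"broke": "break", "broken": "break",
              "breaks": "break", "breaking": "break"}

def lemmatize(word):
    """Dictionary-based: look up the morphological root; fall back to the word."""
    return LEMMA_DICT.get(word, word)
```

Note that the stemmer handles regular suffixes ("breaking" → "break") but leaves the irregular form "broke" untouched, while the dictionary lookup resolves it.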
Overview of Text Analytics
• Removing stopwords
• Common words such as “the” in the English language are rarely informative
about the content of a text.
• Filtering these words out has the benefit of reducing the size of the data,
reducing computational load, and in some cases also improving accuracy.
• Document-term matrix (DTM)
• DTM is one of the most common formats for representing a text corpus (i.e. a
collection of texts) in a bag-of-words format.
• A DTM is a matrix in which rows are documents, columns are terms, and cells
indicate how often each term occurred in each document.
• The advantage of this representation is that it allows the data to be analyzed with
vector and matrix algebra, effectively moving from text to numbers.
• Furthermore, with the use of special matrix formats for sparse matrices, text data
in a DTM format is very memory efficient and can be analyzed with highly
optimized operations.
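Constructing a (dense, toy-sized) DTM takes only a few lines; the corpus below is invented, and real packages would use a sparse matrix format instead:

```python
from collections import Counter

docs = ["the cat sat", "the dog sat on the cat", "dogs bark"]

# Vocabulary = columns; one row of term counts per document
tokenized = [d.split() for d in docs]
vocab = sorted(set(t for doc in tokenized for t in doc))
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]
```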
Overview of Text Analytics
• Filtering and weighting
• Not all terms are equally informative for text analysis.
• One way to deal with this is to remove these terms from the DTM.
• We have already discussed stopword lists for removing very common terms, but
a corpus will typically contain other highly frequent words, and these differ
from corpus to corpus.
• Furthermore, it can be useful to remove very rare terms for many tasks.
• This is especially useful for improving efficiency, because it can greatly reduce the
size of the vocabulary (i.e., the number of unique terms), but it can also improve
accuracy.
• A simple but effective method is to filter on document frequency (the number
of documents in which a term occurs)
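Document-frequency filtering can be sketched as follows; the tokenized corpus and the `min_df` threshold are invented for illustration:

```python
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "rare", "word"]]

# Document frequency: in how many documents does each term occur?
df = {}
for doc in docs:
    for term in set(doc):          # set() so a term counts once per document
        df[term] = df.get(term, 0) + 1

# Keep only terms that appear in at least min_df documents
min_df = 2
kept = {term for term, freq in df.items() if freq >= min_df}
```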
Overview of Text Analytics
• Filtering and weighting
• Instead of removing less informative terms, an alternative approach is to assign
them variable weights.
• Many text analysis techniques perform better if terms are weighted to take an
estimated information value into account, rather than directly using their
occurrence frequency.
• Given a sufficiently large corpus, we can use information about the distribution
of terms in the corpus to estimate this information value.
• A popular weighting scheme that does so is term frequency-inverse document
frequency (tf-idf), which down-weights terms that occur in many documents in
the corpus
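The tf-idf idea can be sketched directly from its definition; the toy corpus is invented, and real implementations often add smoothing terms:

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]
N = len(docs)

def tf_idf(term, doc):
    """tf-idf = (term frequency in doc) * log(N / document frequency of term)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(N / df)

tf_idf("the", docs[0])  # 0.0: "the" occurs in every document, so idf = log(1) = 0
tf_idf("cat", docs[0])  # positive: "cat" is distinctive for the first document
```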
Overview of Text Analytics
• Term frequency: It tells us the number of occurrences of a word in a document.
It can be computed as
TF(t, d) = (count of term t in document d) / (total number of terms in d)
Limitations of Sentiment Analysis
• Sarcasm:
• Sarcasm is a popular form of mockery used to ridicule or convey insult.
• Sentiment analysis often fails to recognize this form of emotion and can prove
ineffective in such cases
• Efforts are being made to address this problem through the extensive use of
machine learning and artificial intelligence, so we may see an improved
version of sentiment analysis in the near future
• Dependency:
• Sentiment analysis largely depends on predefined words and their individual
scores,
• which leads to problems such as ambiguity in the context of a sentence
• A sentence that includes 'good' might carry no emotion at all, yet it will be
scored as positive by the analysis
Sentiment Analysis
• Despite its limitations, sentiment analysis is an extremely popular and
widely used analytical tool in business intelligence for social media
monitoring, brand health examination, measuring the effects of ad campaigns
or new product launches, and various research purposes
• It is frequently applied to Twitter data and customer reviews by
marketers and customer service teams to identify the feelings of
consumers
• Sentiment analysis has also started to gain popularity in areas such as
psychology, political science, and similar fields where textual data is
obtained and explored from books, transcripts, and reports