Text and Sentiment Analysis

The document discusses text analytics and preprocessing techniques for text mining. It covers topics like text importing, string operations, preprocessing steps like tokenization, normalization through lowercasing and stemming, removing stopwords, creating a document-term matrix, and filtering and weighting terms. The goal of these preprocessing steps is to prepare raw text data for analysis using techniques like classification, clustering, and information extraction.


Agenda

• Text Analytics
• Sentiment Analysis
• Web Mining
• Information Retrieval
Text Analytics
• The amount of information available on the Web has increased
rapidly (Information-explosion era)
– World’s data doubles every 18 months
• Users demand useful and reliable information from the Web in the
shortest time possible
• Obstacles to fulfilling this demand include:
– Language barriers, diversified users
– Users may provide only vague specifications of the information
they want
• We must search for and extract information from Web texts using NLP technologies
Text Analytics
• Data-mining: Extraction of interesting information (or patterns) from
structured data.
• 80-90% of all data is held in various unstructured formats
• Useful information can be derived from this unstructured data
• Intelligence in text mining is based on NLP techniques
• NLP can be used as a preprocessing technique to harvest data and
get an initial understanding of the patterns that exist in the data
• Text Mining = Statistical NLP (structured data) + Data mining
(pattern discovery)
Text Analytics
• Text Preprocessing
– Syntactic/Semantic text analysis
• Features Generation
– Bag of words
• Features Selection
– Simple counting
– Statistics
• Data Mining
– Classification (Supervised) / Clustering (Unsupervised)
• Analyzing results
Text Analytics: Text Preprocessing
• Removal of punctuation
• Removal of numbers
• Change to lower case
• Stop word removal
• Extra whitespace removal
• Stemming
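The steps above can be sketched as a small pipeline. This is a minimal illustration in Python; the stopword list and the suffix-stripping "stemmer" are toy stand-ins for real resources such as curated stopword lists and the Porter stemmer:

```python
import re
import string

# Tiny illustrative stopword list; real analyses use larger curated lists.
STOPWORDS = {"the", "is", "are", "a", "an", "and", "of", "to", "in"}

def preprocess(text):
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", "", text)                                   # remove numbers
    text = text.lower()                                               # lowercase
    tokens = text.split()                                # split on whitespace, dropping extras
    tokens = [t for t in tokens if t not in STOPWORDS]   # stop word removal
    # Naive suffix stripping as a stand-in for stemming, for illustration only.
    tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]
    return tokens

print(preprocess("The 2 movies are amazing and the acting is good!"))
# ['movie', 'amaz', 'act', 'good']
```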
Text Analytics
• Feature Generation
– A text document is represented by the words it contains (and their occurrences)
– The order of words is not that important for certain applications (bag of words)
– Stemming: identifies a word by its root, reducing dimensionality
• Stop words: common words that are unlikely to help text mining
Text Analytics
• Feature Selection
– Reduce dimensionality
- Learners have difficulty addressing tasks with high dimensionality
- Only interested in the information relevant to what is being analyzed
– Irrelevant features
- Not all features help
Text Analytics
• Supervised learning (classification)
– The training data is labeled indicating the class
– New data is classified based on the training set
– Correct classification: the known label of the test sample is identical to the class
predicted by the classification model
• Unsupervised learning (clustering)
– The class labels of training data are unknown
– Establish the existence of classes or clusters in the data
– Good clustering method: high intra-cluster similarity and low inter-cluster
similarity
Text Analytics
• Descriptive: understanding underlying processes or behavior
– Web Mining (Opinion extraction, Sentiment analysis)
– Clustering (Blogs, Patterns and trends)
• Predictive: predict an unseen or unmeasured value
– Classification (Spam detection)
– Information Retrieval (Searching)
– Pattern and trend forecasting, Knowledge Acquisition from query logs
Text Analytics
• Statistical NLP
– POS Tagging
– Ambiguity
– Tokenization \ Sentence Detection \ Parsing
– Context
– Stemming
– Synonymy and polysemy
• Data Mining
– Massive amounts of data
– No training data available
Overview of Text Analytics
• In the data preparation section we discuss five steps to prepare texts for
analysis.
• The first step, importing text, covers the functions for reading texts
from various types of file formats (e.g., txt, csv, pdf) into a raw text
corpus in R.
• The steps string operations and preprocessing cover techniques for
manipulating raw texts and processing them into tokens (i.e., units of
text, such as words or word stems).
• The tokens are then used for creating the document-term matrix (DTM), which is a
common format for representing a bag-of-words corpus and is used by many R text
analysis packages.
• Finally, it is a common step to filter and weight the terms in the DTM.
Overview of Text Analytics
• Importing text
• In order to map all known characters to a single scheme, the Unicode standard was
proposed,
• although Unicode text still requires a digital encoding format (such as UTF-8, but
also UTF-16 or UTF-32).
• String operations
• Digital text is represented as a sequence of characters, called a string.
• Strings are represented as objects called “character” types, which are vectors of
strings.
• The group of string operations refers to the low-level operations for working
with textual data.
• The most common string operations are joining, splitting, and extracting parts of
strings (collectively referred to as parsing) and the use of regular expressions to
find or replace patterns.
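These operations map directly onto standard library calls. A sketch in Python, used here purely for illustration (the slide's R context applies the same concepts with functions such as paste, strsplit, and gsub):

```python
import re

s = "Text mining, v2.0: from raw strings to tokens"

joined = " | ".join(["doc1", "doc2"])      # joining strings
parts = s.split(": ")                      # splitting on a delimiter
prefix = s[:11]                            # extracting part of a string (parsing)
versions = re.findall(r"\d+\.\d+", s)      # regular expression: find a pattern
cleaned = re.sub(r"[^\w\s]", "", s)        # regular expression: replace a pattern

print(joined)    # doc1 | doc2
print(prefix)    # Text mining
print(versions)  # ['2.0']
```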
Overview of Text Analytics
• Preprocessing
• Full texts must be tokenized into smaller, more specific text features, such as
words or word combinations.
• Also, the computational performance and accuracy of many text analysis
techniques can be improved by normalizing features, or by removing
“stopwords”: words designated in advance to be of no interest, and which are
therefore discarded prior to analysis.
• Tokenization
• Tokenization is the process of splitting a text into tokens.
• Most often tokens are words, because these are the most common semantically
meaningful components of texts.
• For many languages, splitting texts by words can mostly be done with low-level
string processing due to clear indicators of word boundaries, such as white
spaces, dots and commas.
• A good tokenizer, however, must also be able to handle certain exceptions, such
as the period in the title “Dr.”, which can be confused for a sentence boundary.
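A tokenizer with such an exception list can be sketched as follows; the abbreviation set here is a toy assumption (real tokenizers ship much larger lists and language-specific rules):

```python
import re

ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "e.g.", "i.e."}  # toy list, not exhaustive

def tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)  # keep the period: it is not a sentence boundary
        else:
            # split trailing/leading punctuation off ordinary words
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(tokenize("Dr. Smith arrived, smiled."))
# ['Dr.', 'Smith', 'arrived', ',', 'smiled', '.']
```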
Overview of Text Analytics
• Normalization: Lowercasing and stemming
• The process of normalization broadly refers to the transformation of words into a
more uniform form.
• This can be important if for a certain analysis a computer has to recognize when
two words have (roughly) the same meaning, even if they are written slightly
differently.
• Another advantage is that it reduces the size of the vocabulary (i.e., the full range
of features used in the analysis).
• A simple but important normalization technique is to make all text lower case.
• If we do not perform this transformation, then a computer will not recognize that
two words are identical if one of them was capitalized because it occurred at the
start of a sentence.
Overview of Text Analytics
• Normalization: stemming and lemmatization
• Another argument for normalization is that a base word might have different
morphological variations, such as the suffixes from conjugating a verb, or
making a noun plural. Example: break, breaks, breaking, broken, broke
• For purposes of analysis, we might wish to consider these variations as
equivalent because of their close semantic relation, and because reducing the
feature space is generally desirable when multiple features are in fact closely
related.
• A technique for achieving this is stemming, which is essentially a rule-based
algorithm that converts inflected forms of words into their base forms (stems).
• A more advanced technique is lemmatization, which uses a dictionary to replace
words with their morphological root form.
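The difference can be illustrated with a toy rule-based stemmer and a toy lemma dictionary; both are deliberately tiny sketches (real systems use algorithms like the Porter stemmer and full morphological dictionaries):

```python
# Rule-based stemming: a few illustrative suffix rules, far simpler than Porter's algorithm.
def stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Lemmatization: dictionary lookup mapping inflected forms to a root form (toy dictionary).
LEMMAS = {"broke": "break", "broken": "break", "breaks": "break", "breaking": "break"}

def lemmatize(word):
    return LEMMAS.get(word, word)

for w in ["breaking", "breaks", "broke", "broken"]:
    print(w, "->", stem(w), "/", lemmatize(w))
```

Note how the suffix rules miss the irregular forms "broke" and "broken", which the dictionary-based lemmatizer still maps to "break".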
Overview of Text Analytics
• Removing stopwords
• Common words such as “the” in the English language are rarely informative
about the content of a text.
• Filtering these words out has the benefit of reducing the size of the data,
reducing computational load, and in some cases also improving accuracy.
• Document-term matrix (DTM)
• DTM is one of the most common formats for representing a text corpus (i.e. a
collection of texts) in a bag-of-words format.
• A DTM is a matrix in which rows are documents, columns are terms, and cells
indicate how often each term occurred in each document.
• The advantage of this representation is that it allows the data to be analyzed with
vector and matrix algebra, effectively moving from text to numbers.
• Furthermore, with the use of special matrix formats for sparse matrices, text data
in a DTM format is very memory efficient and can be analyzed with highly
optimized operations.
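A DTM can be built from tokenized documents in a few lines. This dense-list sketch ignores the sparse-matrix formats mentioned above, which real packages use for memory efficiency:

```python
from collections import Counter

docs = {
    "d1": ["text", "mining", "is", "fun"],
    "d2": ["text", "analysis", "of", "text"],
}

# Vocabulary gives the columns; each document becomes a row of term counts.
terms = sorted({t for tokens in docs.values() for t in tokens})
dtm = {doc: [Counter(tokens)[t] for t in terms] for doc, tokens in docs.items()}

print(terms)       # ['analysis', 'fun', 'is', 'mining', 'of', 'text']
print(dtm["d2"])   # [1, 0, 0, 0, 1, 2]  -> "text" occurred twice in d2
```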
Overview of Text Analytics
• Filtering and weighting
• Not all terms are equally informative for text analysis.
• One way to deal with this is to remove the less informative terms from the DTM.
• We have already discussed the use of stopword lists to remove very common
terms, but other overly common words will remain, and these differ between
corpora.
• Furthermore, it can be useful to remove very rare terms for many tasks.
• This is especially useful for improving efficiency, because it can greatly reduce the
size of the vocabulary (i.e., the number of unique terms), but it can also improve
accuracy.
• A simple but effective method is to filter on document frequencies (the number
of documents in which a term occurs)
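Document-frequency filtering can be sketched as follows; the threshold of 2 is an arbitrary illustration:

```python
# Document frequency: the number of documents in which each term occurs.
docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]

df = {}
for tokens in docs:
    for term in set(tokens):          # set(): count each document at most once
        df[term] = df.get(term, 0) + 1

# Keep only terms occurring in at least 2 documents (toy threshold).
kept = {t for t, n in df.items() if n >= 2}
print(sorted(kept))  # ['good', 'movie']
```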
Overview of Text Analytics
• Filtering and weighting
• Instead of removing less informative terms, an alternative approach is to assign
them variable weights.
• Many text analysis techniques perform better if terms are weighted to take an
estimated information value into account, rather than directly using their
occurrence frequency.
• Given a sufficiently large corpus, we can use information about the distribution
of terms in the corpus to estimate this information value.
• A popular weighting scheme that does so is term frequency-inverse document
frequency (tf-idf), which down-weights terms that occur in many documents in the
corpus
Overview of Text Analytics
• Term frequency: It tells us how often a word occurs in a document. It can be
computed as:

  tf(t, d) = (number of times term t appears in document d) / (total number of terms in d)

• Inverse document frequency: It tells us how important a term is. While
calculating tf, all terms are considered equally important, but we know that some
terms like ‘is’, ‘are’, ‘the’ are frequent in every document and thus are less
important. It can be calculated as:

  idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the
  number of documents containing t

• Then tf-idf can be calculated by multiplying these two: tf-idf(t, d) = tf(t, d) × idf(t)
Sentiment Analysis
• The world is moving towards a fully digitalized economy at an incredible
pace
• As a result, an enormous amount of data is produced each day by the
internet, social media, smartphones, tech equipment and many other
sources
• This has led to the evolution of Big Data management and analytics
• Sentiment analysis is one such tool and the most popular branch of
textual analytics; with the help of statistics and natural language
processing, it examines and classifies unorganized textual data into various
sentiments
• It is also known as opinion mining, as it largely focuses on the opinions and
attitudes of people by analyzing their texts
Sentiment Analysis
• At its simplest, sentiment analysis quantifies the mood of a tweet or
comment by counting the number of positive and negative words.
• The sentiment score is generated by subtracting the negative count from
the positive count.
• For example, a comment containing two positive words and no negative
words generates an overall sentiment score of 2.
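This word-counting approach is a few lines of code; the lexicons here are toy lists (real analyses use curated lexicons such as AFINN or SentiWordNet):

```python
# Toy sentiment lexicons, for illustration only.
POSITIVE = {"good", "great", "awesome", "love", "like"}
NEGATIVE = {"bad", "terrible", "hate", "boring"}

def sentiment_score(text):
    words = text.lower().split()
    # positives minus negatives
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("great movie awesome acting"))  # 2
```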
Sentiment Analysis
• You can push this simple approach a bit further by looking for
negations, or words which reverse the sentiment in a section of the
text
• The presence of the word don’t before like produces a negative
score rather than a positive one, giving an overall sentiment score
of -2
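Negation handling can be added by flipping the polarity of a word preceded by a negator. The single-word window used here is a simplification (real systems track longer negation scopes), and the lexicons are toy lists:

```python
POSITIVE = {"good", "great", "like", "awesome"}
NEGATIVE = {"bad", "terrible", "hate"}
NEGATIONS = {"not", "don't", "never", "no"}

def sentiment_score(text):
    words = text.lower().split()
    score = 0
    for i, w in enumerate(words):
        s = (w in POSITIVE) - (w in NEGATIVE)   # +1, -1, or 0
        if i > 0 and words[i - 1] in NEGATIONS:
            s = -s                              # a preceding negation flips polarity
        score += s
    return score

print(sentiment_score("i don't like this terrible movie"))  # -2
```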
Sentiment Analysis
• Social media is no longer used only for chatting and file sharing; it
has gone far beyond that.
• Many organizations use social media as a tool to understand the
likes and dislikes of their customers.
• This can be done through sentiment analysis or opinion mining.
• Sentiment analysis involves many tasks such as subjectivity
detection, text preprocessing, feature extraction and sentiment
classification.
• Subjectivity/objectivity:
• Text which holds some sentiment is called subjective text. For example: “3
idiots is an awesome movie”.
• On the other hand objective text does not hold any sentiment. For
example- “Raj Kumar Hirani is the director of the movie”.
Sentiment Analysis
• For sentiment analysis, we only require subjective text, which can
be further classified into positive or negative.
• If objective text is also included, then we need three classes:
positive, negative, and neutral.
• Polarity: Subjective text can be positive or negative; this is called
the polarity of the text.
• Sentiment level: Sentiment analysis can be performed at three
levels-
• Document level in which the whole document is given positive or negative
polarity.
Sentiment Analysis
• Sentence level in which each sentence is analyzed to give positive or
negative polarity. Overall polarity is computed by counting the positive and
negative comments. Majority comments decide the overall sentiment.
• Phrase level in which phrases or aspects in a sentence are analyzed to
classify as positive or negative.
Sentiment Analysis Process
• Data gathering phase: From tweets, movie reviews, product
reviews, blog data, news data, etc.
• Text preprocessing: Involves stop word removal and stemming.
• Stop words are the words in the text which do not contribute to
any sentiment. For example: In “This is a good movie”, (this, is, a)
are the stop words.
• Stemming is the process of removing prefixes or suffixes. For
example: ‘enjoying’ or ‘enjoyed’ can be stemmed to ‘enjoy’.
• Feature extraction: Involves converting the text dataset into feature
vectors or other representations: unigram, bigram or n-gram
models, term frequency, POS tagging and tf-idf (term frequency-
inverse document frequency).
Sentiment Analysis Process
• Unigram feature set takes one word at a time. For example: “This is a good
movie” will be taken as ( this, is, a, good, movie).
• In bigrams, we take pairs of adjacent words, and so on. For example: (this is, is a,
a good, good movie).
• Term frequency feature set takes into account the number of occurrences (or
frequency) of a term.
• POS is Part-Of-Speech tagging. As we know, adjectives (in the above example
‘good’) and adverbs contribute most of the sentiment, so POS tagging helps us
identify adjectives and adverbs in a sentence.
• Tf-idf is the most informative feature set. It tells us how important a word is in a
document. It increases proportionally as the term’s frequency in a document
increases, but decreases if the term occurs frequently across all the documents
(document frequency). For example, stop words occur in all documents
and do not facilitate any classification.
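Unigram and bigram features from the example sentence can be generated with a generic n-gram helper:

```python
def ngrams(tokens, n):
    # consecutive windows of n tokens, joined into single feature strings
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is a good movie".split()
print(ngrams(tokens, 1))  # unigrams: ['this', 'is', 'a', 'good', 'movie']
print(ngrams(tokens, 2))  # bigrams: ['this is', 'is a', 'a good', 'good movie']
```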
Sentiment Analysis Process
• Sentiment classification: After feature extraction, the final phase is the
sentiment classification
• The text is classified into positive and negative classes. There are mainly
two approaches for this:
• Subjective lexicon: we have scores for each word indicating its positive,
negative or neutral nature. For a given text, we sum all the positive,
negative and neutral scores separately; in the end, the highest total gives
us the overall polarity. This approach can be further classified into a
dictionary-based approach and a corpus-based approach.
• The dictionary-based approach involves creating a seed list of opinion words
from the dataset and then expanding it with the help of dictionaries or thesauri.
• The corpus-based approach is similar; the only difference is that the seed list is
prepared from a domain-oriented corpus.
Sentiment Analysis Process
• For example, if our work is on movie reviews dataset, seed list will
be prepared from movie domain text only.
• Corpus-based classification can be done in two ways. The first uses a
statistical technique based on the co-occurrence of words in the
corpus:
• i.e. if a word occurs mostly in positive text, its polarity is
positive; otherwise it is negative.
• The other technique is the semantics-based approach; WordNet is an
example. It works on the principle of similarity between
words.
• If a word in our dataset matches a word in WordNet,
we can use its score from SentiWordNet to find its polarity.
Sentiment Analysis Process
• Machine learning: It is an automatic classification process.
Classification is performed using features which are extracted from
the text (as explained above).
• It is of two types- Supervised and Unsupervised learning.
• Supervised learning involves training of classifier with labeled
training data. Labeling means that class labels are known for each
term in training dataset. Once the classifier is trained, it can be
used to classify the testing data.
• Unsupervised learning: no class labels are known in advance. The
model makes inferences from incoming data and clusters it.
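A supervised classifier of the kind described can be sketched as a tiny multinomial Naive Bayes trained from scratch. The four training texts and the Laplace smoothing are toy illustrations; real systems train on large labeled corpora (e.g. with scikit-learn's MultinomialNB):

```python
import math
from collections import Counter, defaultdict

# Toy labeled training data: (text, class label).
train = [
    ("good great movie", "pos"),
    ("awesome plot great acting", "pos"),
    ("bad boring movie", "neg"),
    ("terrible bad acting", "neg"),
]

# Count documents per class and word frequencies per class.
class_docs = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    best, best_lp = None, -math.inf
    for label in class_docs:
        lp = math.log(class_docs[label] / len(train))   # log prior
        total = sum(word_counts[label].values())
        for w in text.split():
            # Laplace smoothing avoids zero probabilities for unseen words.
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict("great movie"))  # pos
```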
Sentiment Analysis Process
• Popular supervised learning classification algorithms are
• SVM(Support Vector Machine),
• NBC(Naïve Bayes classifier),
• ANN(Artificial Neural Network), etc.
• SVM is the most popular classification algorithm. It creates a
hyperplane which is used to classify data.
• It is a non-probabilistic classifier. If not properly trained, the
problem of overfitting may arise, i.e., excessive training leads the
classifier to learn noise in the data as if it were concepts.
• Bayesian Network (BN) uses Directed acyclic graphs to represent
dependencies between two variables. For example, a network
depicting dependencies between symptoms and diseases can be
used to find a disease for a given symptom.
Sentiment Analysis Process
• ANNs work similarly to the way the brain solves problems: with
neurons connected by axons. These are self-learning, self-trained
systems.
• They require less data for training.
• But the system acts as a black box, and we cannot inspect the
relationships it learns.
Sentiment Analysis Process
• The sentiment analysis task is usually modeled as a classification
problem where a classifier is fed with a text and returns the
corresponding category, e.g. positive, negative, or neutral (in case
polarity analysis is being performed).
• Such a machine learning classifier can usually be implemented with
the following steps and components:
Sentiment Analysis Process
• In the training process (a), our model learns to associate a
particular input (i.e. a text) with the corresponding output (tag)
based on the samples used for training.
• The feature extractor transforms the text input into a feature vector.
Pairs of feature vectors and tags (e.g. positive, negative, or neutral)
are fed into the machine learning algorithm to generate a model.
• In the prediction process (b), the feature extractor is used to
transform unseen text inputs into feature vectors. These feature
vectors are then fed into the model, which generates predicted
tags (again, positive, negative, or neutral).
Limitations: Sentiment Analysis
• Though sentiment analysis has been one of the most popular
textual analysis tools among businesses, scholars and analysts for
decision-making and research purposes,
• it has its own limitations: language is very
complex, and the meaning of each word changes over
time and from person to person
• Also, the accuracy of the analysis cannot be precisely measured or
compared with how human beings analyze emotions.
Limitations: Sentiment Analysis
• The problem can be classified into three main factors:

• Sarcasm:
• It is a popular form of mockery used to ridicule or convey insult.
• Sentiment analysis often fails to recognize this form of emotion and may prove
ineffective in such cases
• Though efforts are being made to address this problem through the extensive
use of machine learning and artificial intelligence, and we might see an improved
version of sentiment analysis in the near future

• Example: “I am so proud of your stupidity, you make me feel good about myself.”
Limitations: Sentiment Analysis
• Multiple Meanings:
• A word can have many meanings, and it may represent different emotions as we
move from one geography to another, or even from one person to another
• Many English words used in the UK mean something different in American English

• For example: “I think you’ve been playing horribly dope.”

• Dependency:
• Sentiment analysis largely depends on predefined words and their individual
scores,
• which leads to problems such as ambiguity in the context of a sentence
• A sentence that includes ‘good’ might not have any emotion attached to it, but
will still be scored as positive by the analysis
Sentiment Analysis
• Despite its limitations, sentiment analysis is an extremely popular and
widely used analytical tool in business intelligence for social media
monitoring, brand health examination, measuring the effects of ad
campaigns or new product launches, and various research purposes
• It is frequently applied to Twitter data and customer reviews by
marketers and customer service teams to identify the feelings of
consumers
• Sentiment analysis has also started to gain popularity in areas like
psychology, political science and other similar fields, where textual data is
obtained and explored from books, transcripts, and reports