TEXT MINING AND
SENTIMENT ANALYSIS
Extracting textual information to draw insights
Jeroen VK Rombouts
Topics for the Session
1. Introduction
2. Process of Text Analytics
3. Text Analytics Techniques
4. Practical Applications
5. Use Case Discussion
1. INTRODUCTION
Introduction | Process of Text Analytics | Text Analytics Techniques | Practical Applications | Use Case Discussion
What is Text Mining?
u Text Mining is the process of deriving high-quality information from text
through statistical pattern learning
u Types: text categorization, text clustering, concept/entity extraction,
production of granular taxonomies, sentiment analysis, document
summarization, and entity relation modelling
Need for Text Mining (1/2)
u The global text analytics market was valued at USD 3.95 billion and is
expected to reach USD 10.38 billion by 2023, with an expected Compound
Annual Growth Rate (CAGR) of 17.3% during the forecast period of 2018–2023
u Text analytics tools are increasingly used by organizations to aid their
decision-making process by offering actionable insights from various forms of
text sources, such as client interactions, emails, blogs, product reviews,
tweets, etc.
u The primary objective of text analytics is to gather different forms of
data, both structured and unstructured, and analyse them to inform
the organization’s business decisions
Need for Text Mining (2/2)
u In marketing: analytical customer relationship management, predictive
model for customer attrition, sentiment analysis of a brand (benchmarking,
market analysis, competitive analysis …)
u Determine the identity of a brand, the way it communicates to its audience,
which emotional triggers it uses for its marketing campaigns …
u Ultimately, text mining allows a brand to readjust its communication and
strategy by identifying how audience/partners/competitors perceive it.
User      Valence   Volume   …
Peter     +5        500
Sarah     +3        400
Comp. 1   -10       5000
…         …         …
Text mining & Social Media Data – The Questions
u Volume:
u How much?
u Examples of metrics?
u Valence:
u How to measure?
u Examples of metrics?
u Heterogeneity:
u How different?
u Examples of metrics?
Text mining & Social Media Data – The Answers
u Volume:
u Amount of data scraped – measured in terms of kilobytes/gigabytes
u Number of records in the given data
u Valence:
u Measure the amount of positivity or negativity of a sentence
u Polarity and subjectivity
u Heterogeneity:
u Similarity of words in the text corpus
u Clustering based on the term frequencies
What is our Prime Focus?
External and non-structured Data: Network, UGC, etc.
External structured Data: Panel, Survey, Tests, etc.
Internalized Data: Datawarehouse, ERP, CRM, etc.
Data for organizations and businesses directly usable for business solutions
2. PROCESS OF TEXT
ANALYTICS
Process of Text Analytics
u Collection of Text Data
u Pre-processing
u Feature Extraction
u Feature Selection
u Text Analysis and Modelling
u Natural Language Processing
u Sentiment Analysis
u Text Grouping and Classification
Text Mining – Classification tree
Text Data
Data for text analytics can be of many forms
such as:
u Structured – Survey forms, Tests, Word
docs
u Semi-structured – Job listings, Retail
invoices, Reports
u Unstructured – Blogs, Tweets, Comments
Pre-processing
u Case Conversion
u Punctuation removal
u Stopwords removal – Common words that carry little meaning on their own (e.g. “the”, “is”)
u Rare words removal – Words that occur too rarely in the corpus to be informative
u Spelling correction
u Tokenization – Breaking down a sentence into a list of words
u Stemming – Stripping suffixes to reduce a word to its stem, which need not be a real word
u Lemmatization – Mapping a word to its dictionary form (lemma), e.g. “went” → “go”
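The pre-processing steps can be chained as in the minimal Python sketch below. It is an illustration under stated assumptions, not the notebook's code: the stopword list is a tiny hand-picked sample, `naive_stem` is a hypothetical crude suffix stripper standing in for a real stemmer, and spelling correction, rare-word removal and lemmatization are left out because they need a dictionary or corpus statistics.

```python
import string

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for"}


def naive_stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    text = text.lower()                                               # case conversion
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    tokens = text.split()                                             # tokenization on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]                # stopword removal
    return [naive_stem(t) for t in tokens]                            # stemming


print(preprocess("The customers are LOVING the new phones!"))
# ['customer', 'lov', 'new', 'phon']
```

The output shows why a stem need not be a dictionary word: "loving" becomes "lov" and "phones" becomes "phon".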
Feature Extraction
u Number of words
u Number of characters
u Average word length
u Number of stopwords
u Number of special characters
u Number of numeric characters
u Number of uppercase words
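These counts can be computed with the Python standard library alone. The sketch below is one possible reading of each feature; the stopword list and the choice to treat `string.punctuation` as the set of "special characters" are assumptions made for the example.

```python
import string

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for"}


def basic_features(text: str) -> dict:
    """Compute simple surface features of a text, one per bullet above."""
    words = text.split()
    return {
        "n_words": len(words),
        "n_chars": len(text),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "n_stopwords": sum(w.lower() in STOPWORDS for w in words),
        "n_special": sum(ch in string.punctuation for ch in text),
        "n_numeric": sum(ch.isdigit() for ch in text),
        "n_uppercase_words": sum(w.isupper() for w in words),
    }


features = basic_features("WOW the #sale is on: 50% off!")
print(features)
```

Note that words are split on whitespace only, so "on:" keeps its colon and is not counted as the stopword "on".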
Feature Selection
u Feature selection refers to filtering the useful information out of the
features extracted through the methods discussed before.
u Feature selection can be done either with the ‘Bag of Words’ method or with
Machine Learning
u Other feature selection techniques include N-grams, Term Frequency –
Inverse Document Frequency (TF-IDF), and word embeddings
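To make two of these representations concrete, here is a small Python sketch of a bag of words and of N-grams. The function names are hypothetical and the input is assumed to be already tokenized.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-token sequences."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens: list[str]) -> Counter:
    """Map each distinct token to its count, ignoring word order."""
    return Counter(tokens)


tokens = "not good not bad".split()
bow = bag_of_words(tokens)    # counts: not=2, good=1, bad=1
bigrams = ngrams(tokens, 2)   # ('not', 'good'), ('good', 'not'), ('not', 'bad')
```

The bag of words loses the fact that "not" precedes "good", while the bigrams keep it, which is why N-grams are often useful for sentiment.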
3. TEXT ANALYTICS
TECHNIQUES
Natural Language Processing
u The foremost function of NLP in Text Mining is Part-of-Speech tagging
(commonly referred to as POS tagging), which identifies the grammatical
role of each word in a sentence and tags the word accordingly.
u Other features of NLP include:
u Text summarization
u Machine Translation
u Optical Character Recognition
u Document to Information
Sentiment Analysis
u Brand perception among customers is one of the key factors to be considered
before making any critical decisions in the current market
u Sentiment analysis of text data that has been collected, cleaned and
processed helps us to better understand the consumer market
u The data for sentiment analysis is usually tweets, social media posts, blog
comments, product reviews, etc.
u Sentiment Analysis can also be carried out on large paragraphs to perceive the
emotion of the given text
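One simple way to score sentiment is a lexicon lookup, sketched below in Python. This is an illustration only: the lexicon, its scores and the negation rule are assumptions made for the example, and a library such as TextBlob relies on a far larger lexicon and more elaborate rules.

```python
# Toy polarity lexicon; the words and scores are illustrative assumptions.
LEXICON = {"love": 1.0, "great": 0.8, "good": 0.5, "bad": -0.5, "awful": -0.8, "hate": -1.0}
NEGATIONS = {"not", "no", "never"}


def polarity(text: str) -> float:
    """Average lexicon score of matched words, in [-1, 1].

    A negation flips the sign of the word that immediately follows it.
    """
    scores = []
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?;:")
        if word in NEGATIONS:
            negate = True
            continue
        if word in LEXICON:
            score = LEXICON[word]
            scores.append(-score if negate else score)
        negate = False
    return sum(scores) / len(scores) if scores else 0.0


def label(text: str) -> str:
    """Turn a polarity score into a positive / negative / neutral label."""
    score = polarity(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `label("not good")` gives "negative" because the negation flips the score of "good", while a text with no lexicon words is labelled "neutral".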
Text Classification
u Types:
u Supervised document classification
u Unsupervised document classification
u Semi-supervised document classification
u Techniques:
u K-nearest neighbour algorithms
u Naïve Bayes classifier
u Support Vector Machines
u Artificial Neural Networks
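As an illustration of supervised document classification, the Python sketch below implements a multinomial Naïve Bayes classifier with add-one smoothing. The class name, the toy training documents and the spam/not-spam labels are assumptions made for the example; in practice a library implementation would normally be used.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs: list[list[str]], labels: list[int]) -> "NaiveBayes":
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for tokens, y in zip(docs, labels):
            self.word_counts[y].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens: list[str]) -> int:
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for y, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[y].values())
            score = math.log(n_docs / total_docs)  # log prior
            for w in tokens:
                if w not in self.vocab:
                    continue  # skip words never seen in training
                count = self.word_counts[y][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = y, score
        return best_label


train_docs = [
    "win free prize now".split(),
    "free money win".split(),
    "meeting at noon".split(),
    "lunch meeting tomorrow".split(),
]
train_labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam (toy data)
model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("free prize".split()))        # 1
print(model.predict("meeting tomorrow".split()))  # 0
```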
POS Tagging – Parts Of Speech
u POS tagging assigns a single part-of-speech tag to each word (and each
symbol or punctuation mark) in a text.
u This is very useful for finding grammatical patterns in N-grams and for
calculating distance metrics between different POS tags
TF-IDF (1/2)
u TF-IDF refers to Term Frequency – Inverse Document Frequency. It gives us the
importance of a particular word found in a text corpus
u The value of TF-IDF increases proportionally to the number of times a word
appears in the document and is offset by the number of documents in the
corpus that contain the word
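u The formula for Term Frequency is given by:
$$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$
where $f_{t,d}$ is the number of times term $t$ appears in document $d$ (the standard definition, assumed here).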
TF-IDF (2/2)
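u The Inverse Document Frequency is given by:
$$\mathrm{idf}(t) = \log \frac{N}{n_t}$$
where $N$ is the number of documents in the corpus and $n_t$ is the number of documents that contain term $t$ (the standard definition, assumed here).
u Finally TF-IDF:
$$\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$
where $\mathrm{tf}(t, d)$ is the term frequency of $t$ in document $d$.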
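A minimal Python sketch of the computation follows. It assumes the common textbook variant: term frequency is the count of a term divided by the document length, and inverse document frequency is the natural logarithm of the number of documents divided by the number of documents containing the term. Libraries often use smoothed variants, so their numbers can differ.

```python
import math
from collections import Counter


def tf(term: str, doc: list[str]) -> float:
    """Occurrences of term in doc divided by the number of terms in doc."""
    return Counter(doc)[term] / len(doc)


def idf(term: str, corpus: list[list[str]]) -> float:
    """Natural log of (number of documents / number of documents containing term).

    Assumes the term occurs in at least one document of the corpus.
    """
    n_containing = sum(term in doc for doc in corpus)
    return math.log(len(corpus) / n_containing)


def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    return tf(term, doc) * idf(term, corpus)


corpus = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred loudly".split(),
]
print(tf_idf("the", corpus[0], corpus))  # 0.0: "the" occurs in every document
print(tf_idf("dog", corpus[1], corpus))  # (1/3) * ln(3): "dog" occurs in one document only
```

A word that appears in every document gets a weight of zero, which is the "offset" described above.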
Similarity – Levenshtein Distance
u The minimum number of edits (insertion, deletion, substitution) needed to
change a string of characters into another
u For example, the Levenshtein distance between kitten and sitting is 3, since
the following three edits change one into the other, and there is no way to do
it with fewer than three edits:
kitten → sitten (substitution of "s" for "k")
sitten → sittin (substitution of "i" for "e")
sittin → sitting (insertion of "g" at the end).
u Applications: spell checkers, fuzzy string searching, and assisting natural
language translation based on translation memory
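The distance can be computed with dynamic programming. The Python sketch below uses the standard two-row formulation, in which each cell holds the distance between a prefix of the first string and a prefix of the second; the function name is chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```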
4. PRACTICAL
APPLICATIONS
Practical Applications
u Spam mail Classification
u Brand perception in current Market
u Competitor Analysis
u Contextual Advertising
u Business Intelligence
u Prediction and Prevention of Crime
u Customer Care services
u Fraud detection by Insurance Companies
5. USE CASE
DISCUSSION
Hands on Text Analytics session
u Open the “Text Analytics - Accenture Strategic Business Analytics Chair”
Python notebook
u Type ‘pip install’ followed by the library name to download the required
packages or dependencies
u pip install textblob
u Set working directory to the location of the “train_E6oV3lV” CSV file
Motivation behind the Use Case
u Hate speech is an unfortunately common occurrence
on the Internet. Often social media sites
like Facebook and Twitter face the problem of
identifying and censoring problematic posts while
weighing the right to freedom of speech.
u The importance of detecting and moderating
hate speech is evident from the strong connection
between hate speech and actual hate crimes.
u Early identification of users promoting hate speech
could enable outreach programs that attempt to
prevent an escalation from speech to action.
About the Data set
u This data consists of tweets extracted from Twitter and is publicly available
through the Analytics Vidhya contest “Twitter Sentiment Analysis”
u The data is a CSV file containing 31,962 unique tweets scraped from Twitter,
with a mix of hate, neutral and positive tweets
u Each tweet has a corresponding tweet ID and a sentiment label
u The hate tweets are labelled ‘1’ and the others ‘0’
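A first look at the label balance could be sketched as follows in Python. The column name `label` and the `.csv` file extension are assumptions; check the header row of the actual file before running it.

```python
import csv
from collections import Counter


def label_counts(csv_path: str, label_column: str = "label") -> Counter:
    """Count how many rows carry each label value (values are read as strings)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row[label_column] for row in csv.DictReader(f))


def hate_share(counts: Counter) -> float:
    """Fraction of rows labelled '1' (hate)."""
    total = sum(counts.values())
    return counts["1"] / total if total else 0.0


# Example call, assuming the file sits in the working directory:
# counts = label_counts("train_E6oV3lV.csv")
# print(counts, hate_share(counts))
```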
Let’s explore the data
Text Pre-processing
Feature Selection
TF-IDF N-grams
Sentiment Analysis - Output
This analysis gives us a general opinion about
the set of tweets we took into consideration.
From the pie chart, we can see that around
80% of the tweets are either neutral or
positive, so there is relatively little
hate/negative content in this text corpus.
Word Cloud
What conclusions can we
draw based on the resulting
word cloud?
We can refine the graph by
removing certain words
from the original corpus,
e.g.:
• Remove “go”
• Use Spelling Checks
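A word cloud is driven by word frequencies, so removing a word from the corpus simply means excluding it from the counts. The Python sketch below illustrates this with toy tweets; the function name and the data are assumptions made for the example.

```python
from collections import Counter


def word_frequencies(tweets: list[str], drop: set[str]) -> Counter:
    """Count words across tweets, skipping any word listed in drop."""
    counts = Counter()
    for tweet in tweets:
        counts.update(w for w in tweet.lower().split() if w not in drop)
    return counts


tweets = ["go team go", "love this team", "go home"]
freqs = word_frequencies(tweets, drop={"go"})
top = freqs.most_common(1)  # [('team', 2)]
```

Without the `drop` set, "go" would dominate the cloud with three occurrences.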
K-means Clustering
Through K-Means clustering we
can now identify the group of
people who have a higher positive
sentiment than the rest, which is
cluster 2.
By clustering the tweets through
the sentiments instead, we can
classify the users according to
their emotions expressed in their
posts.
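As a rough illustration of clustering by sentiment, the Python sketch below runs K-means on one-dimensional sentiment scores. The scores, the value of k and the deterministic initialisation are assumptions made for the example, and the features used in the notebook may differ.

```python
def kmeans_1d(values: list[float], k: int, iterations: int = 20) -> tuple[list[float], list[int]]:
    """Cluster numbers into k groups; returns (centroids, cluster index per value)."""
    ordered = sorted(values)
    # Deterministic initialisation: spread starting centroids across the sorted values.
    centroids = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    assignments = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assignments


scores = [-0.8, -0.6, -0.7, 0.0, 0.1, 0.7, 0.9, 0.8]  # toy sentiment scores
centroids, assignments = kmeans_1d(scores, k=3)
print(assignments)  # [0, 0, 0, 1, 1, 2, 2, 2]
```

In this toy run the cluster with the highest centroid groups the most positive scores.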
Conclusion and Future Scope
u From the above analysis, we have obtained insights into the overall
sentiment of the people whose tweets were examined.
u This sentiment analysis provides the basis for hate/love speech
recognition.
u Going further, we can train a model on our newly tagged
tweets and predict the occurrence of hate speech in a new set of tweets.