Text and Sentiment Analysis
• Text Analytics
• Sentiment Analysis
• Web Mining
• Information Retrieval
Text Analytics
• The amount of information available on the Web has increased
rapidly (Information-explosion era)
– World’s data doubles every 18 months
• Users demand useful and reliable information from the Web in the
shortest time possible
• Obstacles to fulfilling this demand include:
– Language barriers, diversified users
– Users may provide only vague specifications of the information
they want
• We must perform searching and extracting information from the
Web texts using NLP technologies
Text Analytics
• Data-mining: Extraction of interesting information (or patterns) from
structured data.
• 80-90% of all data is held in various unstructured formats
• Useful information can be derived from this unstructured data
• Intelligence in text mining is based on NLP techniques
• NLP can be used as a preprocessing technique to harvest data and
get an initial understanding of the patterns that exist in the data
• Text Mining = Statistical NLP (turning text into structured data) + Data
mining (pattern discovery)
Text Analytics
• Text Preprocessing
– Syntactic/Semantic text analysis
• Features Generation
– Bag of words
• Features Selection
– Simple counting
– Statistics
• Data Mining
– Classification (Supervised) / Clustering (Unsupervised)
• Analyzing results
Text Analytics: Text Preprocessing
Removal of punctuation
Removal of numbers
Stemming
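The three preprocessing steps listed above can be sketched in Python; the sample sentence and the toy suffix list are invented for illustration (a real pipeline would use a proper stemmer such as Porter's):

```python
import re
import string

def preprocess(text):
    """Remove punctuation and numbers, then lowercase the text."""
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", "", text)                                   # remove numbers
    return text.lower()

def stem(word):
    """Toy stemmer: strip a few common English suffixes (illustration only)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

cleaned = preprocess("In 2023, sales grew 40% -- a record!")
stemmed = [stem(w) for w in cleaned.split()]
```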
Text Analytics
• Feature Generation
– Text document is represented by the words it contains
(and their occurrences)
- Order of words is not that important for certain applications
(Bag of words)
– Stemming: identifies a word by its root
- Reduce dimensionality
• Stop words: The common words unlikely to help text mining
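A bag-of-words feature generator, with stopword removal, can be sketched as follows; the stopword list here is a small invented sample, not a standard list:

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "is", "and", "to"}  # small illustrative list

def bag_of_words(text):
    """Represent a document by its word counts, ignoring word order and stopwords."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t not in STOPWORDS)

bow = bag_of_words("the cat sat and the dog sat")
```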
Text Analytics
• Feature Selection
– Reduce dimensionality
- Learners have difficulty addressing tasks with high dimensionality
- Only interested in the information relevant to what is being analyzed
– Irrelevant features
- Not all features help
Text Analytics
• Supervised learning (classification)
– The training data is labeled indicating the class
– New data is classified based on the training set
– Correct classification: the known label of a test sample matches the class
predicted by the classification model
• Unsupervised learning (clustering)
– The class labels of training data are unknown
– Establish the existence of classes or clusters in the data
– Good clustering method: high intra-cluster similarity and low inter-cluster
similarity
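The "high intra-cluster, low inter-cluster similarity" criterion can be illustrated with cosine similarity on toy word-count vectors (all documents and counts below are invented):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy word-count vectors over the vocabulary (game, team, recipe, oven)
sports1 = [3, 2, 0, 0]
sports2 = [2, 3, 0, 1]
cooking = [0, 0, 4, 2]

intra = cosine(sports1, sports2)  # similarity within the "sports" cluster
inter = cosine(sports1, cooking)  # similarity across clusters
```

A good clustering should give a clearly larger `intra` than `inter`.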
Text Analytics
• Descriptive: understanding underlying processes or behavior
– Web Mining (Opinion extraction, Sentiment analysis)
– Clustering (Blogs, Patterns and trends)
• Predictive: predict an unseen or unmeasured value
– Classification (Spam detection)
– Information Retrieval (Searching)
– Pattern and trend forecasting, Knowledge Acquisition from query logs
Text Analytics
• Statistical NLP
– POS Tagging
– Ambiguity
– Tokenization / Sentence Detection / Parsing
– Context
– Stemming
– Synonymy and polysemy
• Data Mining
– Massive amounts of data
– No training data available
Overview of Text Analytics
• In the data preparation section we discuss five steps to prepare texts for
analysis.
• The first step, importing text, covers the functions for reading texts
from various types of file formats (e.g., txt, csv, pdf) into a raw text
corpus in R.
• The steps string operations and preprocessing cover techniques for
manipulating raw texts and processing them into tokens (i.e., units of
text, such as words or word stems).
• The tokens are then used for creating the document-term matrix
(DTM), which is a common format for representing a bag-of-words type
corpus, that is used by many R text analysis packages.
• Finally, it is a common step to filter and weight the terms in the DTM.
Overview of Text Analytics
• Importing text
• To map all known characters to a single scheme, the Unicode standard was
proposed; it still requires a digital encoding format, such as UTF-8 (or
UTF-16 or UTF-32).
• String operations
• Digital text is represented as a sequence of characters, called a string.
• Strings are represented as objects called “character” types, which are vectors of
strings.
• The group of string operations refers to the low-level operations for working
with textual data.
• The most common string operations are joining, splitting, and extracting parts of
strings (collectively referred to as parsing) and the use of regular expressions to
find or replace patterns.
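The common string operations named above (joining, splitting, extracting, and regular-expression find/replace) can be sketched in Python; the example sentence is invented:

```python
import re

s = "Text mining, at scale, needs string operations."

parts = s.split(", ")                    # splitting
joined = " | ".join(parts)               # joining
words = re.findall(r"[A-Za-z]+", s)      # extracting parts with a regular expression
fixed = re.sub(r"\bneeds\b", "uses", s)  # find-and-replace by pattern
```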
Overview of Text Analytics
• Preprocessing
• Full texts must be tokenized into smaller, more specific text features, such as
words or word combinations.
• Also, the computational performance and accuracy of many text analysis
techniques can be improved by normalizing features, or by removing
“stopwords”: words designated in advance to be of no interest, and which are
therefore discarded prior to analysis.
• Tokenization
• Tokenization is the process of splitting a text into tokens.
• Most often tokens are words, because these are the most common semantically
meaningful components of texts.
• For many languages, splitting texts by words can mostly be done with low-level
string processing due to clear indicators of word boundaries, such as white
spaces, dots and commas.
• A good tokenizer, however, must also be able to handle certain exceptions, such
as the period in the title “Dr.”, which can be confused for a sentence boundary.
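The "Dr." exception can be handled with a small abbreviation list, as in this sketch of a sentence splitter (the abbreviation set is illustrative, not exhaustive):

```python
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof."}  # illustrative, not exhaustive

def split_sentences(text):
    """Split text on sentence-final periods, but not after known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

split_sentences("Dr. Smith arrived. The talk began.")
```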
Overview of Text Analytics
• Normalization: Lowercasing and stemming
• The process of normalization broadly refers to the transformation of words into a
more uniform form.
• This can be important if for a certain analysis a computer has to recognize when
two words have (roughly) the same meaning, even if they are written slightly
differently.
• Another advantage is that it reduces the size of the vocabulary (i.e., the full range
of features used in the analysis).
• A simple but important normalization technique is to make all text lower case.
• If we do not perform this transformation, then a computer will not recognize that
two words are identical if one of them was capitalized because it occurred at the
start of a sentence.
Overview of Text Analytics
• Normalization: stemming and lemmatization
• Another argument for normalization is that a base word might have different
morphological variations, such as the suffixes from conjugating a verb, or
making a noun plural. Example: break, breaks, breaking, broken, broke
• For purposes of analysis, we might wish to consider these variations as
equivalent because of their close semantic relation, and because reducing the
feature space is generally desirable when multiple features are in fact closely
related.
• A technique for achieving this is stemming, which is essentially a rule-based
algorithm that converts inflected forms of words into their base forms (stems).
• A more advanced technique is lemmatization, which uses a dictionary to replace
words with their morphological root form.
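The contrast between the two techniques can be sketched with toy versions of each; the suffix rules and the lemma dictionary below are invented miniatures of what Porter-style stemmers and real lemmatizers provide:

```python
def stem(word):
    """Rule-based: strip common inflectional suffixes (crude approximation)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Dictionary-based lemmatization maps irregular forms the rules cannot reach
LEMMA_DICT = {"broke": "break", "broken": "break",
              "breaks": "break", "breaking": "break"}

def lemmatize(word):
    """Dictionary-based: look up the morphological root; fall back to the word."""
    return LEMMA_DICT.get(word, word)
```

Note that the stemmer handles regular suffixes ("breaking" → "break") but leaves the irregular form "broke" untouched, while the dictionary lookup resolves it.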
Overview of Text Analytics
• Removing stopwords
• Common words such as “the” in the English language are rarely informative
about the content of a text.
• Filtering these words out has the benefit of reducing the size of the data,
reducing computational load, and in some cases also improving accuracy.
• Document-term matrix (DTM)
• DTM is one of the most common formats for representing a text corpus (i.e. a
collection of texts) in a bag-of-words format.
• A DTM is a matrix in which rows are documents, columns are terms, and cells
indicate how often each term occurred in each document.
• The advantage of this representation is that it allows the data to be analyzed with
vector and matrix algebra, effectively moving from text to numbers.
• Furthermore, with the use of special matrix formats for sparse matrices, text data
in a DTM format is very memory efficient and can be analyzed with highly
optimized operations.
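Constructing a (dense, toy-sized) DTM takes only a few lines; the corpus below is invented, and real packages would use a sparse matrix format instead:

```python
from collections import Counter

docs = ["the cat sat", "the dog sat on the cat", "dogs bark"]

# Vocabulary = columns; one row of term counts per document
tokenized = [d.split() for d in docs]
vocab = sorted(set(t for doc in tokenized for t in doc))
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]
```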
Overview of Text Analytics
• Filtering and weighting
• Not all terms are equally informative for text analysis.
• One way to deal with this is to remove these terms from the DTM.
• We have already discussed stopword lists for removing very common terms, but
a corpus will typically contain other highly frequent words, and these differ
from corpus to corpus.
• Furthermore, it can be useful to remove very rare terms for many tasks.
• This is especially useful for improving efficiency, because it can greatly reduce the
size of the vocabulary (i.e., the number of unique terms), but it can also improve
accuracy.
• A simple but effective method is to filter on document frequency (the number
of documents in which a term occurs)
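Document-frequency filtering can be sketched as follows; the tokenized corpus and the `min_df` threshold are invented for illustration:

```python
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "rare", "word"]]

# Document frequency: in how many documents does each term occur?
df = {}
for doc in docs:
    for term in set(doc):          # set() so a term counts once per document
        df[term] = df.get(term, 0) + 1

# Keep only terms that appear in at least min_df documents
min_df = 2
kept = {term for term, freq in df.items() if freq >= min_df}
```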
Overview of Text Analytics
• Filtering and weighting
• Instead of removing less informative terms, an alternative approach is to assign
them variable weights.
• Many text analysis techniques perform better if terms are weighted to take an
estimated information value into account, rather than directly using their
occurrence frequency.
• Given a sufficiently large corpus, we can use information about the distribution
of terms in the corpus to estimate this information value.
• A popular weighting scheme that does so is term frequency-inverse document
frequency (tf-idf), which down-weights terms that occur in many documents in
the corpus
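The tf-idf idea can be sketched directly from its definition; the toy corpus is invented, and real implementations often add smoothing terms:

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "bird", "flew"]]
N = len(docs)

def tf_idf(term, doc):
    """tf-idf = (term frequency in doc) * log(N / document frequency of term)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(N / df)

tf_idf("the", docs[0])  # 0.0: "the" occurs in every document, so idf = log(1) = 0
tf_idf("cat", docs[0])  # positive: "cat" is distinctive for the first document
```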
Overview of Text Analytics
• Term frequency: It tells us the number of occurrences of a word in a document.
It can be computed as
TF(t, d) = (count of term t in document d) / (total number of terms in d)
Limitations of Sentiment Analysis
• Sarcasm:
• Sarcasm is a popular form of mockery used to ridicule or convey insult.
• Sentiment analysis often fails to recognize this form of emotion and can prove
ineffective in such cases
• Efforts are being made to address this problem through the extensive use of
machine learning and artificial intelligence, so we may see an improved
version of sentiment analysis in the near future
• Dependency:
• Sentiment analysis largely depends on predefined words and their individual
scores,
• which leads to problems such as ambiguity in the context of a sentence
• A sentence that includes 'good' might carry no emotion at all, yet it will be
scored as positive by the analysis
Sentiment Analysis
• Despite its limitations, sentiment analysis is an extremely popular and
widely used analytical tool in business intelligence for social media
monitoring, brand health examination, measuring the effects of ad campaigns
or new product launches, and various research purposes
• It is frequently applied to Twitter data and customer reviews by
marketers and customer service teams to identify the feelings of
consumers
• Sentiment analysis has also started to gain popularity in areas such as
psychology, political science, and similar fields where textual data is
obtained and explored from books, transcripts, and reports