Bag-of-words

SENTIMENT ANALYSIS IN PYTHON

Violeta Misheva
Data Scientist
What is a bag-of-words (BOW)?

Describes the occurrence of words within a document or a collection of documents (corpus)

Builds a vocabulary of the words and a measure of their presence

Amazon product reviews

Sentiment analysis with BOW: Example
This is the best book ever. I loved the book and highly recommend it!!!

{'This': 1, 'is': 1, 'the': 2, 'best': 1, 'book': 2,
 'ever': 1, 'I': 1, 'loved': 1, 'and': 1, 'highly': 1,
 'recommend': 1, 'it': 1}

Lose word order and grammar rules!
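
The counts above can be reproduced with a minimal sketch using only the standard library. The regular-expression tokenizer is an assumption chosen to match the dictionary on this slide (case kept, punctuation dropped); it is not how CountVectorizer tokenizes.

```python
import re
from collections import Counter

review = "This is the best book ever. I loved the book and highly recommend it!!!"

# Split on runs of word characters: punctuation is dropped and case is kept,
# which matches the dictionary shown on the slide.
tokens = re.findall(r"\w+", review)

# Count how often each token occurs; word order and grammar are discarded.
bow = Counter(tokens)

print(bow["the"], bow["book"], bow["best"])
```

Because only the counts survive, any reordering of the same words produces the same bag.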

BOW end result
The output will look something like this:
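
As a rough illustration of that kind of output, the sketch below builds a small document-term table by hand. The three-review corpus and the helper name tokenize are hypothetical and are not taken from the course data.

```python
import re
from collections import Counter

# Hypothetical toy corpus, not the Amazon reviews used in the course.
corpus = [
    "I loved the book",
    "The book was boring",
    "I loved it",
]

def tokenize(text):
    # Lowercase the text and keep runs of word characters.
    return re.findall(r"\w+", text.lower())

# One Counter of token counts per document.
counts = [Counter(tokenize(doc)) for doc in corpus]

# Vocabulary: every distinct token, sorted to fix the column order.
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary word.
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one review and each column is one word of the vocabulary; most entries are zero, which is why such tables are usually stored as sparse matrices.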

CountVectorizer function
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(max_features=1000)
vect.fit(data.review)
X = vect.transform(data.review)

CountVectorizer output
X

<10000x1000 sparse matrix of type '<class 'numpy.int64'>'
    with 406668 stored elements in Compressed Sparse Row format>

Transforming the vectorizer
# Transform to an array
my_array = X.toarray()

# Transform back to a dataframe, assign column names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
X_df = pd.DataFrame(my_array, columns=vect.get_feature_names_out())

Let's practice!

Getting granular with n-grams

Context matters
I am happy, not sad.

I am sad, not happy.

Putting 'not' in front of a word (negation) is one example of how context matters.
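
A minimal sketch of why this is a problem for a plain bag-of-words: the two sentences above contain exactly the same words, so their single-word counts are identical, while counts of adjacent word pairs differ. The helper names tokens and bigrams are hypothetical.

```python
import re
from collections import Counter

def tokens(text):
    # Lowercase and keep runs of word characters.
    return re.findall(r"\w+", text.lower())

def bigrams(text):
    # Pairs of adjacent tokens.
    toks = tokens(text)
    return list(zip(toks, toks[1:]))

a = "I am happy, not sad."
b = "I am sad, not happy."

# Single-word counts cannot tell the sentences apart.
same_unigrams = Counter(tokens(a)) == Counter(tokens(b))

# Pairs such as ('not', 'sad') versus ('not', 'happy') can.
same_bigrams = Counter(bigrams(a)) == Counter(bigrams(b))

print(same_unigrams, same_bigrams)
```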

Capturing context with a BOW
Unigrams: single tokens

Bigrams: pairs of tokens

Trigrams: triples of tokens

n-grams: sequences of n tokens

Capturing context with BOW
The weather today is wonderful.

Unigrams: {The, weather, today, is, wonderful}

Bigrams: {The weather, weather today, today is, is wonderful}

Trigrams: {The weather today, weather today is, today is wonderful}
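
The same n-grams can be generated with a short sliding-window function. This is a sketch under the assumption that the sentence is split on whitespace with the final period dropped; the function name ngrams is hypothetical.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokens of the example sentence, with the final period dropped.
tokens = "The weather today is wonderful".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence of k tokens yields k - n + 1 n-grams, so here there are 5 unigrams, 4 bigrams and 3 trigrams.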

n-grams with the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(min_n, max_n))

# Only unigrams
ngram_range=(1, 1)

# Uni- and bigrams
ngram_range=(1, 2)

What is the best n?
Longer sequences of tokens:
Result in more features

Can give machine learning models higher precision

Carry a risk of overfitting

Specifying vocabulary size
CountVectorizer(max_features, max_df, min_df)

max_features: if specified, it will include only the most frequent words in the vocabulary, up to that number
If max_features = None, all words will be included

max_df: ignore terms with a higher than specified document frequency

If it is set to an integer, then it is an absolute count; if a float, then it is a proportion of documents

Default is 1.0, which means it does not ignore any terms

min_df: ignore terms with a lower than specified document frequency

If it is set to an integer, then it is an absolute count; if a float, then it is a proportion of documents

Default is 1, which means it does not ignore any terms
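
To make the meaning of these three parameters concrete, here is a hand-rolled sketch of document-frequency filtering. It is an illustration of the idea, not scikit-learn's implementation; the function select_vocabulary, the tokenizer and the four-document corpus are all hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def select_vocabulary(corpus, max_features=None, max_df=1.0, min_df=1):
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(t for doc in corpus for t in set(tokenize(doc)))
    # Term frequency: how often does each term appear across the corpus?
    term_freq = Counter(t for doc in corpus for t in tokenize(doc))

    # A float is read as a proportion of documents, an int as an absolute count.
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    low = min_df * n_docs if isinstance(min_df, float) else min_df

    kept = [t for t in doc_freq if low <= doc_freq[t] <= high]
    if max_features is not None:
        # Keep only the most frequent terms; ties are broken alphabetically.
        kept = sorted(kept, key=lambda t: (-term_freq[t], t))[:max_features]
    return sorted(kept)

corpus = [
    "the book is great",
    "the book is boring",
    "the film is great",
    "the plot",
]

print(select_vocabulary(corpus))                # defaults keep every term
print(select_vocabulary(corpus, min_df=2))      # drop terms seen in one document
print(select_vocabulary(corpus, max_df=0.75))   # drop terms in more than 75% of documents
print(select_vocabulary(corpus, max_features=2))
```

In this toy corpus "the" appears in all four documents, so max_df=0.75 removes it, while min_df=2 removes the words that occur in only one review.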

Let's practice!

Build new features from text

Goal of the video

Goal: Enrich the existing dataset with features related to the text column (capturing the sentiment)

Product reviews data
reviews.head()

Features from the review column

How long is each review?

How many sentences does it contain?

What parts of speech are involved?

How many punctuation marks?

Tokenizing a string
from nltk import word_tokenize

anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'

word_tokenize(anna_k)

['Happy','families','are', 'all','alike',',',
'every','unhappy', 'family', 'is','unhappy','in',
'its','own','way','.']

Tokens from a column
# General form of list comprehension:
# [expression for item in iterable]

word_tokens = [word_tokenize(review) for review in reviews.review]

type(word_tokens)

list

type(word_tokens[0])

list

Tokens from a column
len_tokens = []

# Iterate over the word_tokens list
for i in range(len(word_tokens)):
    len_tokens.append(len(word_tokens[i]))

# Create a new feature for the length of each review
reviews['n_tokens'] = len_tokens
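
The loop above can also be written as a list comprehension. The sketch below assumes a plain list of strings instead of the reviews DataFrame, and uses a hypothetical regex-based simple_tokenize as a stand-in for word_tokenize so that it runs without nltk; the two tokenizers will not always agree.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for nltk's word_tokenize:
    # runs of word characters, plus each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical list of reviews standing in for the reviews.review column.
reviews = [
    "Happy families are all alike, every unhappy family is unhappy in its own way.",
    "Loved it!",
]

word_tokens = [simple_tokenize(review) for review in reviews]

# Same result as the explicit loop: one token count per review.
len_tokens = [len(tokens) for tokens in word_tokens]

print(len_tokens)
```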

Dealing with punctuation
We did not address punctuation, but you can exclude it

A feature that measures the number of punctuation signs

A review with many punctuation signs could signal a very emotionally charged opinion
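
One possible way to build such a feature is to count the characters that appear in Python's string.punctuation. This is a sketch; the function name count_punctuation and the two sample reviews are assumptions for illustration.

```python
import string

def count_punctuation(text):
    # Count characters that belong to the ASCII punctuation set.
    return sum(1 for ch in text if ch in string.punctuation)

# Hypothetical reviews; the first is the example sentence used earlier.
reviews = [
    "This is the best book ever. I loved the book and highly recommend it!!!",
    "It was fine",
]

n_punct = [count_punctuation(review) for review in reviews]

print(n_punct)
```

Note that string.punctuation covers ASCII marks only, so characters such as typographic quotes would need to be handled separately.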

Reviews with a feature for the length
reviews.head()

Let's practice!

Can you guess the language?

Language of a string in Python
from langdetect import detect_langs
foreign = 'Este libro ha sido uno de los mejores libros que he leido.'

detect_langs(foreign)

[es:0.9999945352697024]

Language of a column
Problem: Detect the language of each of the strings and capture the most likely language in a new column

import pandas as pd
from langdetect import detect_langs

reviews = pd.read_csv('product_reviews.csv')

reviews.head()

Building a feature for the language
languages = []

# Detect the candidate languages of the review text in each row
for row in range(len(reviews)):
    languages.append(detect_langs(reviews.iloc[row, 1]))

languages
[it:0.9999982541301151],
[es:0.9999954153640488],
[es:0.7142833997345875, en:0.2857160465706441],
[es:0.9999942365605781],
[es:0.999997956049055] ...

Building a feature for the language
# Transform the first list to a string and split on a colon
str(languages[0]).split(':')
['[es', '0.9999954153640488]']

str(languages[0]).split(':')[0]
'[es'

str(languages[0]).split(':')[0][1:]
'es'

Building a feature for the language
languages = [str(lang).split(':')[0][1:] for lang in languages]

reviews['language'] = languages
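
The string manipulation can be checked in isolation. The sketch below applies the same split-and-slice steps to literal strings copied from the output shown earlier, so it runs without langdetect; the function name top_language is hypothetical.

```python
def top_language(detected):
    """Return the first language code from text such as '[es:0.99, en:0.01]'."""
    # Convert to a string, keep everything before the first colon,
    # then drop the leading '[' character.
    return str(detected).split(':')[0][1:]

# Literal strings standing in for the printed form of detect_langs results.
examples = [
    "[es:0.9999954153640488]",
    "[es:0.7142833997345875, en:0.2857160465706441]",
]

codes = [top_language(e) for e in examples]

print(codes)
```

When several candidates are returned, the first one listed is kept, which in the output shown earlier is the one with the highest probability.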

Let's practice!