
Chapter 2


Bag-of-words

SENTIMENT ANALYSIS IN PYTHON

Violeta Misheva
Data Scientist
What is a bag-of-words (BOW)?

Describes the occurrence of words within a document or a collection of documents (corpus)

Builds a vocabulary of the words and a measure of their presence

Amazon product reviews

Sentiment analysis with BOW: Example
This is the best book ever. I loved the book and highly recommend it!!!

{'This': 1, 'is': 1, 'the': 2, 'best': 1, 'book': 2,
 'ever': 1, 'I': 1, 'loved': 1, 'and': 1, 'highly': 1,
 'recommend': 1, 'it': 1}

Word order and grammar rules are lost!
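
The counting step in this example can be sketched with the standard library alone. This is only an illustration of the idea, not how CountVectorizer is implemented; the regular expression used to split the sentence into words is an assumption made for this sketch.

```python
import re
from collections import Counter

text = "This is the best book ever. I loved the book and highly recommend it!!!"

# Assumed tokenization rule for this sketch: runs of letters or apostrophes
tokens = re.findall(r"[A-Za-z']+", text)

# Count how often each token occurs; word order is discarded
bow = Counter(tokens)

print(bow["the"], bow["book"], len(bow))
```

Only the counts survive: the two sentences could be shuffled word by word and the resulting dictionary would be the same.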

BOW end result
The output will look something like this:

CountVectorizer function
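import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# data is a DataFrame with a text column named 'review'
# Keep only the 1000 most frequent tokens in the vocabulary
vect = CountVectorizer(max_features=1000)
vect.fit(data.review)
# Transform the reviews into a sparse matrix of token counts
X = vect.transform(data.review)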

CountVectorizer output
X

<10000x1000 sparse matrix of type '<class 'numpy.int64'>'
    with 406668 stored elements in Compressed Sparse Row format>

Transforming the vectorizer
# Transform to an array
my_array = X.toarray()

# Transform back to a dataframe, assign column names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
X_df = pd.DataFrame(my_array, columns=vect.get_feature_names_out())

Let's practice!
Getting granular with n-grams

Context matters
I am happy, not sad.

I am sad, not happy.

Putting 'not' in front of a word (negation) is one example of how context matters.

Capturing context with a BOW
Unigrams: single tokens

Bigrams: pairs of tokens

Trigrams: triples of tokens

n-grams: sequences of n tokens

Capturing context with BOW
The weather today is wonderful.

Unigrams: {The, weather, today, is, wonderful}

Bigrams: {The weather, weather today, today is, is wonderful}

Trigrams: {The weather today, weather today is, today is wonderful}
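
The lists above can be reproduced with a few lines of plain Python. The helper `ngrams` is a hypothetical name introduced for this sketch, and splitting on whitespace is a simplifying assumption rather than a real tokenizer.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace split is a stand-in for a proper tokenizer
tokens = "The weather today is wonderful".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence of k tokens yields k - n + 1 n-grams, which is why five tokens give four bigrams and three trigrams.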

n-grams with the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# General form: ngram_range=(min_n, max_n)

# Only unigrams (the default)
vect = CountVectorizer(ngram_range=(1, 1))

# Uni- and bigrams
vect = CountVectorizer(ngram_range=(1, 2))

What is the best n?
A longer sequence of tokens:

Results in more features

Can give higher precision of machine learning models

Increases the risk of overfitting

Specifying vocabulary size
CountVectorizer(max_features=None, max_df=1.0, min_df=1)

max_features: if specified, the vocabulary includes only the max_features most frequent terms
If max_features = None, all terms are included

max_df: ignore terms that appear in more documents than the specified threshold

If it is an integer, it is an absolute document count; if a float, a proportion of documents

Default is 1.0, which means no terms are ignored

min_df: ignore terms that appear in fewer documents than the specified threshold

If it is an integer, it is an absolute document count; if a float, a proportion of documents

Default is 1, which means no terms are ignored

Let's practice!
Build new features from text

Goal of the video

Goal: Enrich the existing dataset with features related to the text column (capturing the sentiment)

Product reviews data
reviews.head()

Features from the review column

How long is each review?

How many sentences does it contain?

What parts of speech are involved?

How many punctuation marks?
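
A rough sketch of the first two questions, using only the standard library. Splitting on whitespace and on runs of '.', '!' and '?' are simplifying assumptions made here; a proper tokenizer handles abbreviations, quotes and similar cases better.

```python
import re

review = "This is the best book ever. I loved the book and highly recommend it!!!"

# Length in characters and in whitespace-separated words
n_chars = len(review)
n_words = len(review.split())

# Approximate sentence count: split on runs of ., ! or ? and drop empty pieces
sentences = [s for s in re.split(r"[.!?]+", review) if s.strip()]
n_sentences = len(sentences)

print(n_words, n_sentences)
```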

Tokenizing a string
from nltk import word_tokenize

anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'

word_tokenize(anna_k)

['Happy','families','are', 'all','alike',',',
'every','unhappy', 'family', 'is','unhappy','in',
'its','own','way','.']

Tokens from a column
# General form of a list comprehension:
# [expression for item in iterable]

# Tokenize every review in the column
word_tokens = [word_tokenize(review) for review in reviews.review]

type(word_tokens)

list

type(word_tokens[0])

list

Tokens from a column
len_tokens = []

# Iterate over the word_tokens list and record each review's token count
for tokens in word_tokens:
    len_tokens.append(len(tokens))

# Create a new feature for the length of each review
reviews['n_tokens'] = len_tokens

Dealing with punctuation
Punctuation was not handled above, so it is counted as tokens; it can be excluded

Alternatively, build a feature that counts the punctuation marks

A review with many punctuation marks could signal a very emotionally charged opinion
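
One possible way to build such a feature, sketched with the standard library. `count_punctuation` is a hypothetical helper, the two strings in `toy_reviews` are made up, and treating every character in `string.punctuation` as a punctuation mark is an assumption of this sketch.

```python
import string

def count_punctuation(text):
    """Count the characters of text that appear in string.punctuation."""
    return sum(1 for ch in text if ch in string.punctuation)

# Made-up reviews for illustration
toy_reviews = [
    "This is the best book ever. I loved the book and highly recommend it!!!",
    "It was fine",
]

n_punct = [count_punctuation(review) for review in toy_reviews]
print(n_punct)
```

The resulting list can be assigned to a new column in the same way as the token counts.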

Reviews with a feature for the length
reviews.head()

Let's practice!
Can you guess the language?

Language of a string in Python
from langdetect import detect_langs
foreign = 'Este libro ha sido uno de los mejores libros que he leido.'

detect_langs(foreign)

[es:0.9999945352697024]

Language of a column
Problem: Detect the language of each of the strings and capture the most likely language in a new column

import pandas as pd
from langdetect import detect_langs

reviews = pd.read_csv('product_reviews.csv')

reviews.head()

Building a feature for the language
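languages = []

# Detect the candidate languages of each review
# (the review text is taken from the second column, index 1)
for row in range(len(reviews)):
    languages.append(detect_langs(reviews.iloc[row, 1]))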

languages
[it:0.9999982541301151],
[es:0.9999954153640488],
[es:0.7142833997345875, en:0.2857160465706441],
[es:0.9999942365605781],
[es:0.999997956049055] ...

Building a feature for the language
# Transform the first list to a string and split on a colon
str(languages[0]).split(':')
['[es', '0.9999954153640488]']

str(languages[0]).split(':')[0]
'[es'

str(languages[0]).split(':')[0][1:]
'es'

Building a feature for the language
languages = [str(lang).split(':')[0][1:] for lang in languages]

reviews['language'] = languages
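
The string manipulation works because each detected language prints as `code:probability`. As a sketch of an alternative, the same result can be read from attributes, assuming the objects returned by detect_langs expose `lang` and `prob` as langdetect's Language class does. To keep the example runnable without langdetect, `FakeLanguage` below is a hypothetical stand-in, and the sample values are copied from the output shown on the earlier slide.

```python
from dataclasses import dataclass

@dataclass(repr=False)
class FakeLanguage:
    """Hypothetical stand-in mimicking langdetect's Language (lang, prob)."""
    lang: str
    prob: float

    def __repr__(self):
        # Same 'code:probability' form that the slides rely on
        return f"{self.lang}:{self.prob}"

# Stand-in for the list built with detect_langs
languages = [
    [FakeLanguage("es", 0.9999954153640488)],
    [FakeLanguage("es", 0.7142833997345875), FakeLanguage("en", 0.2857160465706441)],
    [FakeLanguage("it", 0.9999982541301151)],
]

# String-based extraction, as on the slides
by_string = [str(lang).split(':')[0][1:] for lang in languages]

# Attribute-based extraction of the first (most likely) candidate
by_attribute = [candidates[0].lang for candidates in languages]

print(by_string, by_attribute)
```

Both approaches keep only the first candidate, so a review such as the second one, where English is a plausible runner-up, is still labeled with a single language.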

Let's practice!