Chapter 2
SENTIMENT ANALYSIS IN PYTHON
Violeta Misheva
Data Scientist
What is a bag-of-words (BOW)?
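from sklearn.feature_extraction.text import CountVectorizer
# Build a vocabulary of at most 1000 words from the review column
vect = CountVectorizer(max_features=1000)
vect.fit(data.review)
# Transform the reviews into a sparse matrix of word counts
X = vect.transform(data.review)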
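To make the bag-of-words idea concrete, here is a minimal sketch on a toy corpus. The three reviews are hypothetical stand-ins for data.review, and the default CountVectorizer settings are assumed (lowercasing, tokens of two or more characters).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for data.review
reviews = [
    "This book is great",
    "This book is not great",
    "Great great story",
]

vect = CountVectorizer()
vect.fit(reviews)
X = vect.transform(reviews)  # sparse matrix: one row per review, one column per word

# Words ordered by their column index in X
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
counts = X.toarray()

print(vocab)
print(counts)
```

Each row counts how often each vocabulary word occurs in one review. Word order is discarded, which is why it is called a "bag" of words.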
Context matters
I am happy, not sad.
Putting 'not' in front of a word (negation) is one example of how context matters.
# Only unigrams
ngram_range=(1, 1)
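A small sketch of how ngram_range changes the vocabulary, using the example sentence above. It assumes the default CountVectorizer tokenizer, which lowercases text and drops single-character tokens such as "I".

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["I am happy, not sad."]

# Only unigrams
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(sentence)
# Unigrams and bigrams
uni_and_bi = CountVectorizer(ngram_range=(1, 2)).fit(sentence)
# Only bigrams
bigrams_only = CountVectorizer(ngram_range=(2, 2)).fit(sentence)

unigram_vocab = sorted(unigrams.vocabulary_)
uni_and_bi_vocab = sorted(uni_and_bi.vocabulary_)
bigram_vocab = sorted(bigrams_only.vocabulary_)

print(unigram_vocab)
print(uni_and_bi_vocab)
print(bigram_vocab)
```

With bigrams, "not sad" becomes a feature of its own, so the negation is no longer lost.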
max_features: if specified, the vocabulary will include only that many of the most frequent words
If max_features = None, all words will be included
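The effect of max_features can be sketched on a toy corpus. The reviews below are hypothetical; words are ranked by their total frequency across all reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "good" appears 5 times, "book" twice, the rest once
reviews = [
    "good good good book",
    "good book bad ending",
    "good plot",
]

# Keep only the 2 most frequent words
top2 = CountVectorizer(max_features=2).fit(reviews)
# Keep every word
everything = CountVectorizer(max_features=None).fit(reviews)

top2_vocab = sorted(top2.vocabulary_)
full_vocab = sorted(everything.vocabulary_)

print(top2_vocab)
print(full_vocab)
```

Capping the vocabulary keeps the feature matrix small, at the cost of dropping rare words.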
Goal of the video
Goal: Enrich the existing dataset with features related to the text column (capturing the sentiment)
from nltk import word_tokenize
anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'
# Split the string into word and punctuation tokens
word_tokenize(anna_k)
['Happy', 'families', 'are', 'all', 'alike', ',',
 'every', 'unhappy', 'family', 'is', 'unhappy', 'in',
 'its', 'own', 'way', '.']
type(word_tokens)
list
type(word_tokens[0])
list
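One way to turn tokens into a feature is to count them per review. The sketch below is an illustration under stated assumptions: add_token_count and the n_tokens column are hypothetical names, and a plain whitespace split stands in for the tokenizer so the example runs without NLTK data. With NLTK set up, word_tokenize could be passed as the tokenize argument instead.

```python
import pandas as pd

def add_token_count(df, text_col, tokenize=str.split):
    """Return a copy of df with an n_tokens column (hypothetical helper)."""
    out = df.copy()
    tokens = out[text_col].apply(tokenize)  # one list of tokens per row
    out["n_tokens"] = tokens.apply(len)     # number of tokens in each review
    return out

# Hypothetical two-row dataset standing in for the reviews data
reviews = pd.DataFrame({
    "review": ["Happy families are all alike", "I am happy, not sad."]
})

enriched = add_token_count(reviews, "review")
print(enriched)
```

Note that a whitespace split keeps punctuation attached to words ("happy,"), while word_tokenize returns punctuation as separate tokens, so the counts will differ between the two.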
Language of a string in Python
from langdetect import detect_langs
foreign = 'Este libro ha sido uno de los mejores libros que he leido.'
detect_langs(foreign)
[es:0.9999945352697024]
reviews.head()
languages
[it:0.9999982541301151],
[es:0.9999954153640488],
[es:0.7142833997345875, en:0.2857160465706441],
[es:0.9999942365605781],
[es:0.999997956049055] ...
str(languages[0]).split(':')[0]
'[es'
str(languages[0]).split(':')[0][1:]
'es'
reviews['language'] = languages
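The string manipulation above can be wrapped in a small helper and applied to every detection result. This is a sketch only: primary_language is a hypothetical name, and the inputs are the printed forms of two results shown above, so the example runs without langdetect installed.

```python
def primary_language(detected):
    """Return the first language code from text like '[es:0.71, en:0.29]'."""
    # Take everything before the first ':' and drop the leading '['
    return str(detected).split(':')[0][1:]

# Printed forms of two detection results, copied from the output above
raw = [
    "[it:0.9999982541301151]",
    "[es:0.7142833997345875, en:0.2857160465706441]",
]

codes = [primary_language(r) for r in raw]
print(codes)
```

A list of clean codes like this can then be assigned as the language column, in place of the raw detection results.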