Bag of words
1. Bag of words:
The dog is on the table
2. Embeddings (~word2vec):
King
Woman
Bag of words
CountVectorizer
hi  every  one  I’m  so  excited  about  this  course
1   1      1    0    0   0        0      0     0
0   0      0    1    1   1        1      1     1
0   0      0    1    2   3        0      0     0
sklearn.feature_extraction.text.CountVectorizer
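A minimal sketch of how such a count matrix could be produced with CountVectorizer. The three documents are illustrative assumptions, not the slide's exact texts. Note that with default settings the vectorizer lowercases the text and ignores one-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents (assumed for this sketch).
docs = [
    "hi every one",
    "so excited about this course",
    "so excited, so so excited",
]

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each term to its column index; order terms by column.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = x.toarray()

print(vocab)
print(counts)
```

Each row is one document and each column one vocabulary term, so word order inside a document is lost.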
Bag of words: TF-IDF
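Term frequency (normalize each row so that a document's counts sum to 1)
tf = 1 / x.sum(axis=1)[:, None]
x = x * tf
Inverse document frequency (down-weight terms that occur in many documents)
idf = np.log(x.shape[0] / (x > 0).sum(axis=0))
x = x * idf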
sklearn.feature_extraction.text.TfidfVectorizer
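The two steps above can be sketched end to end with NumPy. The count matrix here is a small hypothetical example (3 documents, 4 terms), chosen so that one term occurs in every document.

```python
import numpy as np

# Hypothetical count matrix: rows are documents, columns are terms.
x = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 2, 0, 2],
], dtype=float)

# Term frequency: divide each row by its total count.
tf = x / x.sum(axis=1)[:, None]

# Inverse document frequency: log(number of documents / documents containing the term).
idf = np.log(x.shape[0] / (x > 0).sum(axis=0))

tfidf = tf * idf
print(np.round(tfidf, 2))
```

The second term occurs in all three documents, so its idf is log(1) = 0 and its whole column is zeroed out. TfidfVectorizer will not reproduce these numbers with default settings, because it uses a smoothed idf and L2-normalizes each row.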
Bag of words: TF
hi    every  one   I’m   so    excited  about  this  course
0.33  0.33   0.33  0     0     0        0      0     0
0     0      0     0.16  0.16  0.16     0.16   0.16  0.16
0     0      0     0.16  0.33  0.5      0      0     0
Bag of words: TF+IDF
hi    every  one   I’m   so    excited  about  this  course
0.36  0.36   0     0     0     0        0      0     0
0     0      0     0.06  0.06  0        0.18   0.18  0.18
0     0      0     0.06  0.13  0        0      0     0
N-grams
N = 1: unigrams: this, is, a, sentence
N = 2: bigrams: this is, is a, a sentence
N = 3: trigrams: this is a, is a sentence
sklearn.feature_extraction.text.CountVectorizer:
ngram_range, analyzer
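A short sketch of both parameters, using the example sentence from the n-gram lists. The custom token_pattern is an assumption added here so that the one-character word "a" is kept; the default pattern would drop it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Word n-grams for N = 1..3.
word_vec = CountVectorizer(ngram_range=(1, 3), token_pattern=r"(?u)\b\w+\b")
word_vec.fit(["this is a sentence"])
word_ngrams = sorted(word_vec.vocabulary_)

# Character trigrams: analyzer="char" builds n-grams from characters, not words.
char_vec = CountVectorizer(analyzer="char", ngram_range=(3, 3))
char_vec.fit(["sentence"])
char_ngrams = sorted(char_vec.vocabulary_)

print(word_ngrams)
print(char_ngrams)
```

The vocabulary grows quickly with N, so in practice ngram_range is usually kept small.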
Text preprocessing
1. Lowercase
2. Lemmatization
3. Stemming
4. Stopwords
Text preprocessing: lowercase
                     Very  very  Sunny  sunny
Very, very sunny.    1     1     0      1
Sunny... Sunny!      0     0     2      0
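The effect of lowercasing on the two example texts can be sketched in plain Python. The helper count_tokens and its regex tokenizer are assumptions made for this illustration.

```python
import re
from collections import Counter

def count_tokens(text, lowercase):
    """Count word tokens, optionally lowercasing the text first."""
    if lowercase:
        text = text.lower()
    return Counter(re.findall(r"\w+", text))

docs = ["Very, very sunny.", "Sunny... Sunny!"]

cased = [count_tokens(d, lowercase=False) for d in docs]
lowered = [count_tokens(d, lowercase=True) for d in docs]

print(cased)    # "Very"/"very" and "Sunny"/"sunny" are separate features
print(lowered)  # each pair is merged into one feature
```

Without lowercasing, the two texts share no features at all. After lowercasing, both contain "sunny". CountVectorizer lowercases by default.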
Text preprocessing: lemmatization and stemming
Stemming:
democracy, democratic, democratization -> democr
saw -> s
Lemmatization:
democracy, democratic, democratization -> democracy
saw -> see (verb) or saw (noun), depending on context
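A deliberately crude sketch of the difference. Stemming chops suffixes by rule, while lemmatization looks a word up using its context. The suffix list, the lemma table, and both helper functions are hypothetical toys written for this example. A real project would more likely rely on a library such as NLTK or spaCy, whose outputs may differ from the ones shown here.

```python
# Toy suffix list (assumed for this sketch), checked longest first.
SUFFIXES = ("atization", "atic", "acy")

def toy_stem(word):
    """Strip the first matching suffix; no dictionary or context is used."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma table keyed by (word, part of speech); a real lemmatizer
# relies on a full vocabulary and morphological analysis.
TOY_LEMMAS = {("saw", "verb"): "see", ("saw", "noun"): "saw"}

def toy_lemmatize(word, pos):
    """Look up the dictionary form of a word given its part of speech."""
    return TOY_LEMMAS.get((word.lower(), pos), word.lower())

stems = [toy_stem(w) for w in ("democracy", "democratic", "democratization")]
print(stems)
print(toy_lemmatize("Saw", "verb"), toy_lemmatize("saw", "noun"))
```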
Text preprocessing: stopwords
Examples:
1. Articles or prepositions
2. Very common words
sklearn.feature_extraction.text.CountVectorizer:
max_df
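A small sketch of two ways to drop stopwords with CountVectorizer: by document frequency via max_df, or with the built-in English list via stop_words="english". The documents and the 0.7 threshold are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents (assumed for this sketch).
docs = [
    "the dog is on the table",
    "the cat sat on the mat",
    "the bird flew over the house",
]

# max_df=0.7 drops terms that occur in more than 70% of the documents.
by_frequency = CountVectorizer(max_df=0.7).fit(docs)

# stop_words="english" drops words from scikit-learn's built-in list.
by_list = CountVectorizer(stop_words="english").fit(docs)

print(sorted(by_frequency.vocabulary_))
print(sorted(by_list.vocabulary_))
```

Here "the" occurs in every document and is removed by max_df, while "on" occurs in two of three documents and survives. The built-in list removes both.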
Conclusion