Bag of Words

1. Feature extraction from texts can represent text as bag-of-words (BOW) using tools like CountVectorizer and TfidfVectorizer, which build vector representations of documents from word counts and frequencies.
2. BOW pipelines preprocess texts through steps like lowercasing, lemmatization, stemming, and stopword removal, and can use n-grams to capture local context.
3. TFiDF is a common postprocessing step for BOW: term frequency (TF) weights words by how often they occur within a document, and inverse document frequency (IDF) weights words by their rarity across documents.


Feature extraction from texts and images

Competitions with only text/images


Feature extraction from texts and images

Common features + text


Titanic dataset
Feature extraction from texts and images

Common features + images/text


Text -> vector

1. Bag of words:
   "The dog is on the table"

2. Embeddings (~word2vec):
   [Figure: word vectors illustrating the classic analogy King - Man ≈ Queen - Woman]
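The analogy in the figure can be sketched with hand-picked toy vectors. The numbers below are made up purely for illustration, not learned word2vec embeddings:

```python
import numpy as np

# Hand-picked 3-d "word vectors" chosen so that the classic analogy
# king - man + woman ≈ queen holds exactly.  Real word2vec vectors
# are learned from large text corpora.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.8, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
}

# Vector arithmetic: remove the "male" direction, add the "female" one.
result = vectors["king"] - vectors["man"] + vectors["woman"]
print(result)  # lands on the "queen" vector for these toy numbers
```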
Bag of words

Example documents:
  Doc 1: "Hi everyone! (excited)"
  Doc 2: "I'm so excited about this course!"
  Doc 3: "So excited. SO EXCITED. EXCITED, I AM!"

CountVectorizer counts:

          everyone  hi   i'm  so   excited  about  this  course
  Doc 1      1      1     0    0      1       0      0      0
  Doc 2      0      0     1    1      1       1      1      1
  Doc 3      0      0     1    2      3       0      0      0

sklearn.feature_extraction.text.CountVectorizer
Bag of words: TFiDF

Term frequency:
tf = 1 / x.sum(axis=1)[:, None]
x = x * tf

Inverse document frequency:
idf = np.log(x.shape[0] / (x > 0).sum(0))
x = x * idf

sklearn.feature_extraction.text.TfidfVectorizer
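The two formulas above can be run end-to-end on a toy count matrix. This is only a sketch of the slide's formulas; TfidfVectorizer itself uses a smoothed IDF and extra normalization, so its numbers differ:

```python
import numpy as np

# Toy document-term count matrix: 3 documents x 4 terms.
x = np.array([[1., 1., 0., 1.],
              [0., 2., 3., 1.],
              [1., 0., 0., 0.]])

# Term frequency: scale each row so a document's terms sum to 1.
tf = 1 / x.sum(axis=1)[:, None]
x = x * tf

# Inverse document frequency: log(n_documents / document frequency).
idf = np.log(x.shape[0] / (x > 0).sum(0))
x = x * idf
```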
Bag of words: TF

  Doc 1: "Hi everyone! (excited)"
  Doc 2: "I'm so excited about this course!"
  Doc 3: "So excited. SO EXCITED. EXCITED, I AM!"

          everyone  hi    i'm   so    excited  about  this  course
  Doc 1    0.33    0.33    0     0     0.33      0      0     0
  Doc 2    0       0      0.16  0.16   0.16     0.16   0.16  0.16
  Doc 3    0       0      0.16  0.33   0.5       0      0     0
Bag of words: TF+iDF

After IDF weighting, "excited" (present in every document) is zeroed out, while words unique to one document get the largest weights:

          everyone  hi    i'm   so    excited  about  this  course
  Doc 1    0.36    0.36    0     0      0        0      0     0
  Doc 2    0       0      0.06  0.06    0       0.18   0.18  0.18
  Doc 3    0       0      0.06  0.13    0        0      0     0
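Plugging the example documents' counts into the tf and idf formulas from the previous slide reproduces the table above up to rounding:

```python
import numpy as np

# Counts for the three example documents over the vocabulary
# [everyone, hi, i'm, so, excited, about, this, course]
x = np.array([[1., 1., 0., 0., 1., 0., 0., 0.],
              [0., 0., 1., 1., 1., 1., 1., 1.],
              [0., 0., 1., 2., 3., 0., 0., 0.]])

tf = x / x.sum(axis=1)[:, None]
idf = np.log(x.shape[0] / (x > 0).sum(0))
tfidf = tf * idf

# "excited" occurs in every document, so its IDF (and weight) is 0;
# "hi" and "everyone" occur in one document and get the largest weights.
print(np.round(tfidf, 2))
```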
N-grams

Example sentence: "this is a sentence"

N = 1 (unigrams): this, is, a, sentence
N = 2 (bigrams):  this is, is a, a sentence
N = 3 (trigrams): this is a, is a sentence

sklearn.feature_extraction.text.CountVectorizer:
ngram_range, analyzer
Texts preprocessing

1. Lowercase
2. Lemmatization
3. Stemming
4. Stopwords
Texts preprocessing: lowercase

Without lowercasing, case variants of the same word become separate features:

                        Very  very  Sunny  sunny
  "Very, very sunny."    1     1     0      1
  "Sunny... Sunny!"      0     0     2      0
Texts preprocessing: lemmatization and stemming

I had a car  -> I have car
We have cars -> We have car

Stemming:
democracy, democratic, and democratization -> democr
saw -> s

Lemmatization:
democracy, democratic, and democratization -> democracy
saw -> see or saw (depending on context)
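A sketch with NLTK's Porter stemmer. Exact truncations vary by algorithm, and Porter is milder than the slide's examples suggest (it keeps more of each word than "democr" or "s"):

```python
from nltk.stem import PorterStemmer

# Stemming chops suffixes by rule, so outputs need not be real words.
stemmer = PorterStemmer()
for word in ["cars", "excited", "democracy", "democratic"]:
    print(word, "->", stemmer.stem(word))

# Lemmatization maps to dictionary forms instead; NLTK's version needs
# the WordNet data downloaded once:
#   import nltk; nltk.download('wordnet')
#   from nltk.stem import WordNetLemmatizer
#   WordNetLemmatizer().lemmatize("cars")  # -> "car"
```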
Texts preprocessing: stopwords

Examples:
1. Articles and prepositions
2. Very common words

NLTK, the Natural Language Toolkit library for Python

sklearn.feature_extraction.text.CountVectorizer:
max_df
Conclusion

Pipeline of applying BOW

1. Preprocessing:
   lowercase, stemming, lemmatization, stopwords
2. Bag of words:
   n-grams can help to capture local context
3. Postprocessing: TFiDF