Bag of Words

1. Feature extraction from texts can represent each document as a bag of words (BOW): tools such as CountVectorizer and TfidfVectorizer turn documents into vectors based on word counts and frequencies.
2. Before building a BOW representation, texts are usually preprocessed (lowercasing, lemmatization, stemming, stopword removal), and n-grams can be used to capture local context.
3. TF-IDF is commonly applied to BOW vectors as postprocessing: term frequency (TF) normalizes word counts within each document, and inverse document frequency (IDF) down-weights words that appear in many documents.

Uploaded by

ambigus9

Feature extraction from texts and images

Solely text/images competitions

Common features + text
Titanic dataset

Common features + images/text

Text -> vector

1. Bag of words:
The dog is on the table

2. Embeddings (~word2vec):
Word vectors: King, Man, Queen, Woman
Bag of words

Texts:
1. (excited) Hi everyone!
2. I’m so excited about this course!
3. So excited. SO EXCITED. EXCITED, I AM!

CountVectorizer

        hi  everyone  I’m  so  excited  about  this  course
Text 1   1         1    0   0        1      0     0       0
Text 2   0         0    1   1        1      1     1       1
Text 3   0         0    1   2        3      0     0       0

sklearn.feature_extraction.text.CountVectorizer
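The counting step can be sketched in plain Python. This is only an illustration of the idea, not how CountVectorizer is implemented: the helper name `bag_of_words` and the simple regular-expression tokenizer are assumptions made for this example. Unlike the slide, this tokenizer keeps "I" and "AM" as separate words instead of merging them into "I'm".

```python
import re
from collections import Counter

def bag_of_words(texts):
    """Return (vocabulary, count_matrix) for a list of texts."""
    # Lowercase, then treat runs of letters/apostrophes as tokens
    # (a simplifying assumption, not a real tokenizer).
    tokenized = [re.findall(r"[a-z']+", text.lower()) for text in texts]
    # Vocabulary: every distinct token, sorted for a stable column order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per text, one column per vocabulary word.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

texts = [
    "(excited) Hi everyone!",
    "I'm so excited about this course!",
    "So excited. SO EXCITED. EXCITED, I AM!",
]
vocabulary, matrix = bag_of_words(texts)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one text and each column is one vocabulary word, so the third text gets a count of 3 in the "excited" column and 2 in the "so" column.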
Bag of words: TFiDF

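import numpy as np
# x is assumed to be a dense array of word counts with shape (n_documents, n_words)

Term frequency
tf = 1 / x.sum(axis=1)[:, None]  # one over each document's total word count
x = x * tf

Inverse Document Frequency
idf = np.log(x.shape[0] / (x > 0).sum(0))  # log(n_documents / number of documents containing the word)
x = x * idf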

sklearn.feature_extraction.text.TfidfVectorizer
Bag of words: TF

Texts:
1. (excited) Hi everyone!
2. I’m so excited about this course!
3. So excited. SO EXCITED. EXCITED, I AM!

          hi  everyone   I’m    so  excited  about  this  course
Text 1  0.33      0.33     0     0     0.33      0     0       0
Text 2     0         0  0.16  0.16     0.16   0.16  0.16    0.16
Text 3     0         0  0.16  0.33      0.5      0     0       0
Bag of words: TF+iDF

Texts:
1. (excited) Hi everyone!
2. I’m so excited about this course!
3. So excited. SO EXCITED. EXCITED, I AM!

          hi  everyone   I’m    so  excited  about  this  course
Text 1  0.36      0.36     0     0        0      0     0       0
Text 2     0         0  0.06  0.06        0   0.18  0.18    0.18
Text 3     0         0  0.06  0.13        0      0     0       0
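The TF and IDF steps can be checked by hand with a small plain-Python sketch. The count matrix below is typed in from the example (with "I AM" counted as "I'm", as on the slide); the function name `tf_idf` is an assumption for this illustration. It uses the natural logarithm and assumes every word occurs in at least one text, since otherwise the IDF would divide by zero.

```python
import math

words = ["hi", "everyone", "i'm", "so", "excited", "about", "this", "course"]
counts = [
    [1, 1, 0, 0, 1, 0, 0, 0],  # (excited) Hi everyone!
    [0, 0, 1, 1, 1, 1, 1, 1],  # I'm so excited about this course!
    [0, 0, 1, 2, 3, 0, 0, 0],  # So excited. SO EXCITED. EXCITED, I AM!
]

def tf_idf(counts):
    n_docs = len(counts)
    n_words = len(counts[0])
    # Term frequency: divide each row by its total word count.
    tf = [[c / sum(row) for c in row] for row in counts]
    # Document frequency: number of texts that contain each word.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]
    # Inverse document frequency: natural log of n_docs / df.
    idf = [math.log(n_docs / d) for d in df]
    return [[tf[i][j] * idf[j] for j in range(n_words)] for i in range(n_docs)]

weights = tf_idf(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

"excited" appears in all three texts, so its IDF is log(3/3) = 0 and its weight vanishes everywhere. The other values agree with the table once truncated to two decimals (for example, log(3)/3 is about 0.366). Note that TfidfVectorizer in scikit-learn smooths the IDF and L2-normalizes each row by default, so its output will not match these numbers exactly.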
N-grams

N = 1: unigrams: this, is, a, sentence
N = 2: bigrams: this is, is a, a sentence
N = 3: trigrams: this is a, is a sentence

sklearn.feature_extraction.text.CountVectorizer:
ngram_range, analyzer
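Word n-grams are just sliding windows over the token list. A minimal sketch, where the helper name `ngrams` is an assumption for this example:

```python
def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    # Slide a window of length n over the tokens; too-short inputs give [].
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is a sentence".split()
for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
```

In CountVectorizer, ngram_range=(1, 2) would use unigrams and bigrams together as features, and the analyzer parameter switches between word-level and character-level n-grams.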
Texts preprocessing

1. Lowercase
2. Lemmatization
3. Stemming
4. Stopwords
Texts preprocessing: lowercase

                   Very  very  Sunny  sunny
Very, very sunny.     1     1      0      1
Sunny... Sunny!       0     0      2      0
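Without lowercasing, "Very" and "very" (or "Sunny" and "sunny") end up in different columns. A small sketch of the effect, using a hypothetical helper `count_words` and a letters-only tokenizer chosen for this example:

```python
import re
from collections import Counter

def count_words(text, lowercase):
    """Count words in a text, optionally lowercasing it first."""
    if lowercase:
        text = text.lower()
    # Tokens are runs of ASCII letters (a simplifying assumption).
    return Counter(re.findall(r"[A-Za-z]+", text))

for text in ["Very, very sunny.", "Sunny... Sunny!"]:
    print(count_words(text, lowercase=False), count_words(text, lowercase=True))
```

After lowercasing, both texts share the single feature "sunny", so the two rows become comparable. CountVectorizer lowercases by default (lowercase=True).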
Texts preprocessing: lemmatization and stemming

I had a car -> I have car
We have cars -> We have car

Stemming:
democracy, democratic, and democratization -> democr
Saw -> s

Lemmatization:
democracy, democratic, and democratization -> democracy
Saw -> see or saw (depending on context)
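The difference between the two can be illustrated with a toy sketch: stemming chops off word endings by rule, while lemmatization looks words up and maps them to a dictionary form. Everything below (the `LEMMAS` table, the `SUFFIXES` list, and both functions) is hypothetical and hand-picked to reproduce part of the example. Real stemmers and lemmatizers, such as those shipped with NLTK, use much larger rule sets and dictionaries.

```python
# Hypothetical lookup table; a real lemmatizer uses a dictionary
# and part-of-speech information.
LEMMAS = {"had": "have", "has": "have", "cars": "car"}

# Hypothetical suffix list, longest first; real stemmers apply ordered rules.
SUFFIXES = ["atization", "atic", "acy", "s"]

def lemmatize(word):
    """Map a word to its dictionary form if it is in the toy table."""
    word = word.lower()
    return LEMMAS.get(word, word)

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([lemmatize(w) for w in "I had a car".split()])
print([lemmatize(w) for w in "We have cars".split()])
print([stem(w) for w in ["democracy", "democratic", "democratization"]])
```

The stemmer reduces all three "democr..." words to the same non-word stem, whereas the lemmatizer can only return forms it knows, which is why context-dependent cases such as "saw" need more than a suffix rule.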
Texts preprocessing: stopwords

Examples:
1. Articles or prepositions
2. Very common words

NLTK, the Natural Language Toolkit library for Python

sklearn.feature_extraction.text.CountVectorizer:
max_df
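Both approaches can be sketched in a few lines: filtering against a fixed stopword list, and dropping words that occur in too many documents, which is the idea behind max_df. The tiny `STOPWORDS` set, the helper names, and the two extra example sentences are assumptions for this illustration.

```python
import re

# Tiny illustrative stopword list (an assumption; libraries such as
# NLTK ship much longer ones).
STOPWORDS = {"a", "an", "the", "is", "on", "in", "of"}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the fixed stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def frequent_words(texts, max_df):
    """Words whose document frequency (fraction of texts) is strictly above max_df."""
    doc_sets = [set(tokenize(t)) for t in texts]
    vocabulary = set().union(*doc_sets)
    return {w for w in vocabulary
            if sum(w in d for d in doc_sets) / len(texts) > max_df}

texts = [
    "The dog is on the table",
    "The cat is in the garden",
    "A dog chased the cat",
]
print(remove_stopwords(tokenize(texts[0])))
print(sorted(frequent_words(texts, max_df=0.7)))
```

With max_df=0.7 only "the" is flagged, because it is the only word present in more than 70% of the texts. In CountVectorizer, a float max_df is likewise treated as a proportion of documents, and terms with a strictly higher document frequency are ignored.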
Conclusion

Pipeline of applying BOW

1. Preprocessing:
Lowercase, stemming, lemmatization, stopwords
2. Bag of words:
N-grams can help capture local context
3. Postprocessing: TFiDF
