Module 2 Feature Engineering and Text Representation
Text Representation
NLP Process
Example:
One Hot Encoding in Scikit-Learn
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative DataFrame (the original slides assume df already exists)
df = pd.DataFrame({'name': ['alice', 'bob', 'alice'], 'kind': ['cat', 'dog', 'dog']})
oh_enc = OneHotEncoder()
oh_enc.fit(df[['name', 'kind']])
oh_enc.transform(df[['name', 'kind']]).todense()
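To check which output column corresponds to which category, the fitted encoder can list its generated feature names (a brief follow-up sketch; get_feature_names_out is available in recent scikit-learn releases):

# Column labels of the encoded matrix, e.g. 'name_alice', 'kind_cat'
print(oh_enc.get_feature_names_out())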
Example:
Bag of Words Model using Scikit-Learn
from sklearn.feature_extraction.text import CountVectorizer

sample_text = ['This is the first document.',
               'This document is the second document.',
               'This is the third document.']
vectorizer = CountVectorizer(stop_words="english")
vectorizer.fit(sample_text)
print("Words:", list(enumerate(vectorizer.get_feature_names_out())))
N-gram Encoding
Extracts features from text while capturing local word order by counting
sliding windows of n consecutive words
Example (n = 2):
N-gram Encoding using Scikit-Learn
from sklearn.feature_extraction.text import CountVectorizer

sample_text = ['This is the first document.',
               'This document is the second document.',
               'This is the third document.']
bigram = CountVectorizer(ngram_range=(2, 2))  # count windows of two consecutive words
bigram.fit(sample_text)
print("Words:", list(enumerate(bigram.get_feature_names_out())))
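In practice, unigrams and bigrams are often combined by widening ngram_range; a minimal sketch, reusing sample_text from above:

# ngram_range=(1, 2) keeps single words and adds pairs of consecutive words
uni_bi = CountVectorizer(ngram_range=(1, 2))
uni_bi.fit(sample_text)
print(len(uni_bi.get_feature_names_out()), "features (unigrams + bigrams)")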
TFIDF Vectorizer
Converts a collection of raw documents to a matrix of TFIDF features
TFIDF stands for term frequency-inverse document frequency; it represents
text data by weighting each word by its importance relative to the other words in
the corpus
2 Parts:
TFIDF = TF * IDF
TF (term frequency) measures how often the word occurs within a single document
IDF (inverse document frequency) measures how rare the word is across the corpus;
classically IDF(t) = log(N / DF(t)), where N is the number of documents and DF(t)
is the number of documents containing t, so rare words receive higher weight
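To make the formula concrete, here is a hand-rolled sketch of the classical TF * IDF computation on the sample corpus used earlier (illustrative only; the variable names are ours, and scikit-learn's implementation differs slightly, as noted below):

import math

# Tokenize the sample corpus from the Bag of Words example
docs = [doc.lower().replace('.', '').split() for doc in sample_text]
N = len(docs)

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)         # term frequency within one document
    df = sum(1 for d in docs if term in d)  # documents containing the term
    return tf * math.log(N / df)            # classical TF * IDF

print(tfidf('second', docs[1]))    # rare word: positive weight
print(tfidf('document', docs[1]))  # appears in every document: log(3/3) = 0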
TFIDF Vectorizer Encoding using Scikit-Learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses sample_text from the Bag of Words example above
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sample_text)
print(vectorizer.get_feature_names_out())
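Note that scikit-learn does not use the classical formula verbatim: by default TfidfVectorizer smooths the IDF (idf(t) = ln((1 + N) / (1 + DF(t))) + 1) and L2-normalizes each row, so its weights differ slightly from the hand computation above.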