TextFeatureEngineering – NLP Lec 2

The document discusses feature engineering for Natural Language Processing (NLP), emphasizing its importance in developing NLP applications. It outlines various types of features, including meta features and text-based features, and explains techniques such as TF-IDF, Word2Vec, and FastText for transforming text into numerical representations. Additionally, it presents problem formulations for tasks like text classification and duplicate question detection on platforms like Stack Overflow.

Feature Engineering for

Natural Language Processing

Susan Li
https://medium.com/@actsusanli
https://www.linkedin.com/in/susanli/
Sr. Data Scientist
What are features?

When we predict New York City taxi fare:

• Distance between pickup and drop-off location
• Time of the day
• Day of the week
• Whether it is a holiday or not
What are the features for NLP?
How does a computer perceive text?
Meta features
• Number of words in the text
• Number of unique words in the text
• Number of characters in the text
• Number of stopwords
• Number of punctuations
• Number of upper case words
• Number of title case words
• Average length of the words
• Text distance
• Language of text
• Page rank

Text based features
• Vectorization (CountVectorizer, TfidfVectorizer, HashingVectorizer)
• Tokenization
• Lemmatization
• Stemming
• N-grams
• Part of speech tagging
• Parsing
• Named entity and key phrase extraction
• Capitalization pattern
Meta features
Text based features
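Not from the original slides: a minimal sketch of computing a few of the meta features above with pandas; the DataFrame and its text column are illustrative names.

import string
import pandas as pd

# Hypothetical data; in practice df["text"] holds the documents.
df = pd.DataFrame({"text": ["How to plot a dataframe bar graph?",
                            "NLP multilabel classification problem"]})

df["num_words"] = df["text"].str.split().str.len()
df["num_unique_words"] = df["text"].apply(lambda t: len(set(t.lower().split())))
df["num_chars"] = df["text"].str.len()
df["num_punctuations"] = df["text"].apply(lambda t: sum(c in string.punctuation for c in t))
df["num_upper_words"] = df["text"].apply(lambda t: sum(w.isupper() for w in t.split()))
df["num_title_words"] = df["text"].apply(lambda t: sum(w.istitle() for w in t.split()))
df["avg_word_length"] = df["text"].apply(lambda t: sum(len(w) for w in t.split()) / len(t.split()))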
Feature Engineering for NLP

• Feature engineering is the most important part of developing NLP applications.
• Feature engineering is the most creative aspect of Data Science (art and skill).
• Domain knowledge / brainstorming sessions.
• Check / revisit what worked before.
What Is Not Feature Engineering

• Data collection.
• Removing stopwords, removing non-alphanumeric characters, removing HTML tags, lowercasing: these are data preprocessing.
• Creating the target variable (labeling the data).
• Scaling or normalization.
• PCA.
• Hyperparameter optimization or tuning.

Feature Engineering for NLP

• TF-IDF: simple & complicated features
• Word2vec
• FastText: the future
• Topic Modeling
Problem 1: Predict label of a
Stack Overflow question
https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv
Problem formulation

Our problem is best formulated as multi-class, single-label text classification which assigns a given Stack Overflow question to one of a set of target labels.
Example tag: “python”
Our machine learning algorithm looks at the textual content before applying an SVM to the word vectors.
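Not from the original slides: a minimal sketch of the kind of pipeline described above (TF-IDF features fed to a linear SVM). The column names "post" and "tags" for the CSV from the previous slide are assumptions.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Assumed column names for the Stack Overflow dataset.
df = pd.read_csv("stack-overflow-data.csv").dropna()

X_train, X_test, y_train, y_test = train_test_split(
    df["post"], df["tags"], test_size=0.2, random_state=0)

# TF-IDF word vectors -> linear SVM (multi-class, single label).
clf = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))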
N-grams

• Unigrams: “nlp multilabel classification problem” -> [“nlp”, “multilabel”, “classification”, “problem”]
• Bigrams: “nlp multilabel classification problem” -> [“nlp multilabel”, “multilabel classification”, “classification problem”]
• Trigrams: [“nlp multilabel classification”, “multilabel classification problem”]
• Character-grams: “multilabel” -> [“mul”, “ult”, “lti”, “til”, “ila”, “lab”, “abe”, “bel”]
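As an illustration (not from the slides), a minimal sketch of generating these n-grams with scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

text = ["nlp multilabel classification problem"]

# Word unigrams, bigrams and trigrams.
word_ngrams = CountVectorizer(ngram_range=(1, 3))
word_ngrams.fit(text)
print(word_ngrams.get_feature_names_out())   # get_feature_names() on older scikit-learn

# Character 3-grams, e.g. for "multilabel".
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(3, 3))
char_ngrams.fit(["multilabel"])
print(char_ngrams.get_feature_names_out())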
TF-IDF computation of the word “python”
TF-IDF sparse matrix example

Without going into the math, TF-IDF scores are word-frequency weights that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.
Coding TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# Word 1- to 3-grams, accents stripped, English stopwords removed, rare terms (df < 3) dropped.
vectorizer = TfidfVectorizer(min_df=3, max_features=None,
                             token_pattern=r'\w{1,}', strip_accents='unicode',
                             analyzer='word', ngram_range=(1, 3), stop_words='english')

tfidf_matrix = vectorizer.fit_transform(questions_train)
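The fitted matrix is sparse, with one row per question and one column per n-gram in the learned vocabulary; a quick sanity check (questions_train is the training text assumed on this slide):

print(tfidf_matrix.shape)                       # (n_questions, n_vocabulary_terms)
print(vectorizer.get_feature_names_out()[:10])  # first few learned n-grams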
“… extract from the sentence a rich set of hand-designed features which are then fed to a classical shallow classification algorithm, e.g. a Support Vector Machine (SVM), often with a linear kernel.”

“You shall know a word by the company it keeps.”
(Firth, J. R. 1957:11)

Word2vec

• Word2vec is not deep learning.
• Word2vec turns input text into a numerical form that deep neural networks can process as inputs.
• We can either download one of the pre-trained models (e.g. GloVe), or train a Word2Vec model from scratch with gensim.
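A minimal sketch of the pre-trained route, using gensim's downloader API; the specific model name is an illustrative choice, not one named in the slides.

import gensim.downloader as api

# Downloads the pre-trained vectors on first use.
glove = api.load("glove-wiki-gigaword-100")

print(glove["python"][:5])                    # 100-dimensional vector, first 5 entries
print(glove.most_similar("python", topn=3))   # nearest neighbours in vector space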
Word2vec

• CBOW: if we have the phrase “how to plot dataframe bar graph”, the parameters/features of {how, to, plot, bar, graph} are used to predict {dataframe}. Predicts the current word given the neighboring words.
• Skip-gram: if we have the phrase “how to plot dataframe bar graph”, the parameters/features of {dataframe} are used to predict {how, to, plot, bar, graph}. Predicts the neighboring words given the current word.
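In gensim the choice between the two is a single flag; a minimal sketch, using the same gensim 3.x-style keyword arguments as the training code later in this deck (the toy corpus is assumed).

from gensim.models import Word2Vec

# Assumed toy corpus: a list of tokenized sentences.
corpus = [["how", "to", "plot", "dataframe", "bar", "graph"],
          ["nlp", "multilabel", "classification", "problem"]]

# sg=0 -> CBOW: predict the current word from its neighbors.
cbow_model = Word2Vec(corpus, size=100, window=5, min_count=1, sg=0)

# sg=1 -> skip-gram: predict the neighbors from the current word.
skipgram_model = Word2Vec(corpus, size=100, window=5, min_count=1, sg=1)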
https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
Word2vec

https://www.slideshare.net/TraianRebedea/what-is-word2vec
Airbnb

https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e
Uber

https://eng.uber.com/uber-eats-query-understanding/
Word2Vec
• Treats each word as the smallest unit to train on.
• Does not perform well for rare words.
• Can't generate a word embedding if a word does not appear in the training corpus.
• Training is faster.

FastText
• An extension of Word2vec.
• Treats each word as composed of character n-grams.
• Generates better word embeddings for rare words.
• Can construct the vector for a word even if it does not appear in the training corpus.
• Training is slower.
Word2vec

from gensim.models import word2vec

# size (vector_size in gensim >= 4): embedding dimension; window: context window size.
model = word2vec.Word2Vec(corpus, size=100, window=20, min_count=500, workers=4)

FastText

from gensim.models import FastText

model_FastText = FastText(corpus, size=100, window=20, min_count=500, workers=4)
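To see the practical difference listed above, query a word that never appears in the training corpus; a sketch, assuming the two models trained as on this slide (the query word is made up).

oov_word = "pandasdataframe"  # hypothetical word not in the training corpus

try:
    print(model.wv[oov_word])            # Word2vec: raises KeyError for unseen words
except KeyError:
    print("Word2vec has no vector for", oov_word)

print(model_FastText.wv[oov_word][:5])   # FastText: composed from character n-grams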
Problem 2: Topic Modeling
(Unsupervised)
Topic Modeling

from sklearn.decomposition import LatentDirichletAllocation

# df_vectorized: a document-term matrix (e.g. from CountVectorizer).
lda_model = LatentDirichletAllocation(n_components=20, learning_method='online',
                                      random_state=0, n_jobs=-1)

lda_output = lda_model.fit_transform(df_vectorized)
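Not from the slides: a minimal sketch of inspecting the fitted topics; it assumes df_vectorized came from a fitted CountVectorizer named count_vectorizer, which is an assumed name.

import numpy as np

n_top_words = 8
feature_names = count_vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn

for topic_idx, topic in enumerate(lda_model.components_):
    top_words = [feature_names[i] for i in np.argsort(topic)[::-1][:n_top_words]]
    print(f"Topic {topic_idx}: {' '.join(top_words)}")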
Example topics: “sql”, “python”
Problem 3: Auto detect
duplicate Stack Overflow
questions
Problem formulation

Our problem is a binary classification problem in which we identify and classify whether question pairs are duplicates or not.
Meta features
• The word counts of each question.
• The character length of each question.
• The number of common words of these two questions.
• ...

Fuzzy features
• Fuzzy string matching related features (https://github.com/seatgeek/fuzzywuzzy).

Word2vec features
• Word2vec vectors for each question.
• Word mover's distance (https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html).
• Cosine distance between vectors of question1 and question2.
• Manhattan distance between vectors of question1 and question2.
• Euclidean distance between vectors of question1 and question2.
• Jaccard similarity between vectors of question1 and question2.
• ...
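A minimal sketch (not from the slides) of a few of these pair features; the example questions, the fuzzywuzzy dependency, and the placeholder question vectors are assumptions.

import numpy as np
from fuzzywuzzy import fuzz
from scipy.spatial.distance import cosine, euclidean, cityblock

q1 = "How do I plot a dataframe bar graph?"
q2 = "How to plot a bar chart from a dataframe?"

# Meta features
word_count_1, word_count_2 = len(q1.split()), len(q2.split())
char_len_1, char_len_2 = len(q1), len(q2)
common_words = len(set(q1.lower().split()) & set(q2.lower().split()))

# Fuzzy features
fuzz_ratio = fuzz.ratio(q1, q2)
fuzz_token_sort = fuzz.token_sort_ratio(q1, q2)

# Word2vec features: in practice these vectors would come from averaging each
# question's word embeddings; random placeholders here.
q1_vec, q2_vec = np.random.rand(100), np.random.rand(100)
cosine_dist = cosine(q1_vec, q2_vec)
manhattan_dist = cityblock(q1_vec, q2_vec)
euclidean_dist = euclidean(q1_vec, q2_vec)

print(common_words, fuzz_ratio, round(cosine_dist, 3))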
Debug end-to-end Auto ML models

• LIME

• SHAP
LIME

“printf” is positive for c, but negative for Other.

If we remove the word “printf” from the document, we expect the model to predict c with probability 0.97 - 0.28 = 0.69.
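A minimal sketch of producing an explanation like the one described, using LimeTextExplainer; the pipeline clf is assumed to be a fitted text classifier exposing predict_proba (e.g. TfidfVectorizer + LogisticRegression), and the sample question is made up.

from lime.lime_text import LimeTextExplainer

class_names = list(clf.classes_)  # e.g. the Stack Overflow tags
explainer = LimeTextExplainer(class_names=class_names)

question = "How do I format output with printf in my program?"
exp = explainer.explain_instance(question, clf.predict_proba, num_features=6)
print(exp.as_list())  # words with their positive/negative weights for the predicted class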
SHAP

“php” is the biggest signal word used by our model, contributing most to “php” predictions.
It’s unlikely you’d see the word “php” used in a python question.
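A sketch of the kind of summary behind this observation, under the assumption of a binary “php vs. not-php” linear model on TF-IDF features; the model, matrices, and vectorizer are assumed names, not code from the slides.

import shap

# Assumed: a fitted linear model (e.g. LogisticRegression) on TF-IDF features.
explainer = shap.LinearExplainer(model, X_train_tfidf)
shap_values = explainer.shap_values(X_test_tfidf)

# Ranks words by their overall contribution to the predictions,
# e.g. "php" showing up as the strongest signal for the php class.
shap.summary_plot(shap_values, X_test_tfidf.toarray(),
                  feature_names=vectorizer.get_feature_names_out())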
Good features are the backbone of any machine learning model.

And good feature creation often needs domain knowledge, creativity, and lots of time.
Resources:
• GloVe: https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/
• Word2vec: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
• https://arxiv.org/pdf/1301.3781.pdf - Mikolov et al. 2013
• https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf - Mikolov et al. 2013
• https://arxiv.org/pdf/1411.2738v3.pdf - Rong 2014
• https://arxiv.org/pdf/1402.3722v1.pdf - Goldberg and Levy 2014
