TextFeatureEngineering – NLP Lec 2

The document discusses feature engineering for Natural Language Processing (NLP), emphasizing its importance in developing NLP applications. It outlines various types of features, including meta features and text-based features, and explains techniques such as TF-IDF, Word2Vec, and FastText for transforming text into numerical representations. Additionally, it presents problem formulations for tasks like text classification and duplicate question detection on platforms like Stack Overflow.

Feature Engineering for

Natural Language Processing

Susan Li
https://medium.com/@actsusanli
https://www.linkedin.com/in/susanli/
Sr. Data Scientist
What are features?

When we predict New York City taxi fare:

• Distance between pickup and drop-off location
• Time of the day
• Day of the week
• Whether it is a holiday or not
What are the features for NLP?
How does a computer perceive text?
Meta features
• Number of words in the text
• Number of unique words in the text
• Number of characters in the text
• Number of stopwords
• Number of punctuations
• Number of upper case words
• Number of title case words
• Average length of the words
• Text distance
• Language of text
• Page rank

Text based features
• Vectorization (CountVectorizer, TfidfVectorizer, HashingVectorizer)
• Tokenization
• Lemmatization
• Stemming
• N-grams
• Part of speech tagging
• Parsing
• Named entity and key phrase extraction
• Capitalization pattern
Meta features
Text based features
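Not from the original slides: a minimal sketch of computing a few of the meta features above with pandas; the DataFrame and its text column are illustrative names.

import string
import pandas as pd

# Hypothetical data; in practice df["text"] holds the documents.
df = pd.DataFrame({"text": ["How to plot a dataframe bar graph?",
                            "NLP multilabel classification problem"]})

df["num_words"] = df["text"].str.split().str.len()
df["num_unique_words"] = df["text"].apply(lambda t: len(set(t.lower().split())))
df["num_chars"] = df["text"].str.len()
df["num_punctuations"] = df["text"].apply(lambda t: sum(c in string.punctuation for c in t))
df["num_upper_words"] = df["text"].apply(lambda t: sum(w.isupper() for w in t.split()))
df["num_title_words"] = df["text"].apply(lambda t: sum(w.istitle() for w in t.split()))
df["avg_word_length"] = df["text"].apply(lambda t: sum(len(w) for w in t.split()) / len(t.split()))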
Feature Engineering for NLP

• Feature engineering is the most important part of developing NLP applications.
• Feature engineering is the most creative aspect of Data Science (art and skill).
• Domain knowledge / brainstorming sessions.
• Check / revisit what worked before.
What Is Not Feature Engineering

• Data collection.
• Removing stopwords, removing non-alphanumeric characters, removing HTML tags, lowercasing: these are data preprocessing.
• Creating the target variable (labeling the data).
• Scaling or normalization.
• PCA.
• Hyperparameter optimization or tuning.

Feature Engineering for NLP

• TF-IDF: simple & complicated features
• Word2vec
• FastText: the future
• Topic Modeling
Problem 1: Predict label of a
Stack Overflow question
https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv
Problem formulation

Our problem is best formulated as multi-class, single-label text classification which assigns a given Stack Overflow question to one of a set of target labels.
Example tag: “python”
Our machine learning algorithm looks at the textual content before applying an SVM to the word vectors.
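Not from the original slides: a minimal sketch of the kind of pipeline described above (TF-IDF features fed to a linear SVM). The column names "post" and "tags" for the CSV from the previous slide are assumptions.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Assumed column names for the Stack Overflow dataset.
df = pd.read_csv("stack-overflow-data.csv").dropna()

X_train, X_test, y_train, y_test = train_test_split(
    df["post"], df["tags"], test_size=0.2, random_state=0)

# TF-IDF word vectors -> linear SVM (multi-class, single label).
clf = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))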
N-grams

• Unigrams: “nlp multilabel classification problem” -> [“nlp”, “multilabel”, “classification”, “problem”]
• Bigrams: “nlp multilabel classification problem” -> [“nlp multilabel”, “multilabel classification”, “classification problem”]
• Trigrams: [“nlp multilabel classification”, “multilabel classification problem”]
• Character-grams: “multilabel” -> [“mul”, “ult”, “lti”, “til”, “ila”, “lab”, “abe”, “bel”]
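As an illustration (not from the slides), a minimal sketch of generating these n-grams with scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

text = ["nlp multilabel classification problem"]

# Word unigrams, bigrams and trigrams.
word_ngrams = CountVectorizer(ngram_range=(1, 3))
word_ngrams.fit(text)
print(word_ngrams.get_feature_names_out())   # get_feature_names() on older scikit-learn

# Character 3-grams, e.g. for "multilabel".
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(3, 3))
char_ngrams.fit(["multilabel"])
print(char_ngrams.get_feature_names_out())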
TF-IDF computation of the word “python”
TF-IDF sparse matrix example

Without going into the math, TF-IDF scores are word-frequency weights that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.
Coding TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# Word 1- to 3-grams, accents stripped, English stopwords removed, rare terms (df < 3) dropped.
vectorizer = TfidfVectorizer(min_df=3, max_features=None,
                             token_pattern=r'\w{1,}', strip_accents='unicode',
                             analyzer='word', ngram_range=(1, 3), stop_words='english')

tfidf_matrix = vectorizer.fit_transform(questions_train)
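The fitted matrix is sparse, with one row per question and one column per n-gram in the learned vocabulary; a quick sanity check (questions_train is the training text assumed on this slide):

print(tfidf_matrix.shape)                       # (n_questions, n_vocabulary_terms)
print(vectorizer.get_feature_names_out()[:10])  # first few learned n-grams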
“… extract from the sentence a rich set of hand-designed features which are then fed to a classical shallow classification algorithm, e.g. a Support Vector Machine (SVM), often with a linear kernel.”

“You shall know a word by the company it keeps.”
(Firth, J. R. 1957:11)

Word2vec

• Word2vec is not deep learning.
• Word2vec turns input text into a numerical form that deep neural networks can process as inputs.
• We can either download one of the pre-trained models (e.g. GloVe), or train a Word2Vec model from scratch with gensim.
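A minimal sketch of the pre-trained route, using gensim's downloader API; the specific model name is an illustrative choice, not one named in the slides.

import gensim.downloader as api

# Downloads the pre-trained vectors on first use.
glove = api.load("glove-wiki-gigaword-100")

print(glove["python"][:5])                    # 100-dimensional vector, first 5 entries
print(glove.most_similar("python", topn=3))   # nearest neighbours in vector space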
Word2vec

• CBOW: if we have the phrase “how to plot dataframe bar graph”, the parameters/features of {how, to, plot, bar, graph} are used to predict {dataframe}. Predicts the current word given the neighboring words.
• Skip-gram: if we have the phrase “how to plot dataframe bar graph”, the parameters/features of {dataframe} are used to predict {how, to, plot, bar, graph}. Predicts the neighboring words given the current word.
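In gensim the choice between the two is a single flag; a minimal sketch, using the same gensim 3.x-style keyword arguments as the training code later in this deck (the toy corpus is assumed).

from gensim.models import Word2Vec

# Assumed toy corpus: a list of tokenized sentences.
corpus = [["how", "to", "plot", "dataframe", "bar", "graph"],
          ["nlp", "multilabel", "classification", "problem"]]

# sg=0 -> CBOW: predict the current word from its neighbors.
cbow_model = Word2Vec(corpus, size=100, window=5, min_count=1, sg=0)

# sg=1 -> skip-gram: predict the neighbors from the current word.
skipgram_model = Word2Vec(corpus, size=100, window=5, min_count=1, sg=1)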
https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
Word2vec

https://www.slideshare.net/TraianRebedea/what-is-word2vec
Airbnb

https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e
Uber

https://eng.uber.com/uber-eats-query-understanding/
Word2Vec
• Treats each word as the smallest unit to train on.
• Does not perform well for rare words.
• Can't generate a word embedding if a word does not appear in the training corpus.
• Training is faster.

FastText
• An extension of Word2vec.
• Treats each word as composed of character n-grams.
• Generates better word embeddings for rare words.
• Can construct the vector for a word even if it does not appear in the training corpus.
• Training is slower.
Word2vec

from gensim.models import word2vec

# size (vector_size in gensim >= 4): embedding dimension; window: context window size.
model = word2vec.Word2Vec(corpus, size=100, window=20, min_count=500, workers=4)

FastText

from gensim.models import FastText

model_FastText = FastText(corpus, size=100, window=20, min_count=500, workers=4)
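To see the practical difference listed above, query a word that never appears in the training corpus; a sketch, assuming the two models trained as on this slide (the query word is made up).

oov_word = "pandasdataframe"  # hypothetical word not in the training corpus

try:
    print(model.wv[oov_word])            # Word2vec: raises KeyError for unseen words
except KeyError:
    print("Word2vec has no vector for", oov_word)

print(model_FastText.wv[oov_word][:5])   # FastText: composed from character n-grams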
Problem 2: Topic Modeling
(Unsupervised)
Topic Modeling

from sklearn.decomposition import LatentDirichletAllocation

# df_vectorized: a document-term matrix (e.g. from CountVectorizer).
lda_model = LatentDirichletAllocation(n_components=20, learning_method='online',
                                      random_state=0, n_jobs=-1)

lda_output = lda_model.fit_transform(df_vectorized)
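Not from the slides: a minimal sketch of inspecting the fitted topics; it assumes df_vectorized came from a fitted CountVectorizer named count_vectorizer, which is an assumed name.

import numpy as np

n_top_words = 8
feature_names = count_vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn

for topic_idx, topic in enumerate(lda_model.components_):
    top_words = [feature_names[i] for i in np.argsort(topic)[::-1][:n_top_words]]
    print(f"Topic {topic_idx}: {' '.join(top_words)}")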
Example topics: “sql”, “python”
Problem 3: Auto detect
duplicate Stack Overflow
questions
Problem formulation

Our problem is a binary classification problem in which we identify and classify whether question pairs are duplicates or not.
Meta features
• The word counts of each question.
• The character length of each question.
• The number of common words of these two questions.
• ...

Fuzzy features
• Fuzzy string matching related features (https://github.com/seatgeek/fuzzywuzzy).

Word2vec features
• Word2vec vectors for each question.
• Word mover's distance (https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html).
• Cosine distance between vectors of question1 and question2.
• Manhattan distance between vectors of question1 and question2.
• Euclidean distance between vectors of question1 and question2.
• Jaccard similarity between vectors of question1 and question2.
• ...
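A minimal sketch (not from the slides) of a few of these pair features; the example questions, the fuzzywuzzy dependency, and the placeholder question vectors are assumptions.

import numpy as np
from fuzzywuzzy import fuzz
from scipy.spatial.distance import cosine, euclidean, cityblock

q1 = "How do I plot a dataframe bar graph?"
q2 = "How to plot a bar chart from a dataframe?"

# Meta features
word_count_1, word_count_2 = len(q1.split()), len(q2.split())
char_len_1, char_len_2 = len(q1), len(q2)
common_words = len(set(q1.lower().split()) & set(q2.lower().split()))

# Fuzzy features
fuzz_ratio = fuzz.ratio(q1, q2)
fuzz_token_sort = fuzz.token_sort_ratio(q1, q2)

# Word2vec features: in practice these vectors would come from averaging each
# question's word embeddings; random placeholders here.
q1_vec, q2_vec = np.random.rand(100), np.random.rand(100)
cosine_dist = cosine(q1_vec, q2_vec)
manhattan_dist = cityblock(q1_vec, q2_vec)
euclidean_dist = euclidean(q1_vec, q2_vec)

print(common_words, fuzz_ratio, round(cosine_dist, 3))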
Debug end-to-end Auto ML models

• LIME

• SHAP
LIME

“printf” is positive for c, but negative for Other.

If we remove the word “printf” from the document, we expect the model to predict c with probability 0.97 - 0.28 = 0.69.
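A minimal sketch of producing an explanation like the one described, using LimeTextExplainer; the pipeline clf is assumed to be a fitted text classifier exposing predict_proba (e.g. TfidfVectorizer + LogisticRegression), and the sample question is made up.

from lime.lime_text import LimeTextExplainer

class_names = list(clf.classes_)  # e.g. the Stack Overflow tags
explainer = LimeTextExplainer(class_names=class_names)

question = "How do I format output with printf in my program?"
exp = explainer.explain_instance(question, clf.predict_proba, num_features=6)
print(exp.as_list())  # words with their positive/negative weights for the predicted class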
SHAP

“php” is the biggest signal word used by our model, contributing most to “php” predictions.
It’s unlikely you’d see the word “php” used in a python question.
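A sketch of the kind of summary behind this observation, under the assumption of a binary “php vs. not-php” linear model on TF-IDF features; the model, matrices, and vectorizer are assumed names, not code from the slides.

import shap

# Assumed: a fitted linear model (e.g. LogisticRegression) on TF-IDF features.
explainer = shap.LinearExplainer(model, X_train_tfidf)
shap_values = explainer.shap_values(X_test_tfidf)

# Ranks words by their overall contribution to the predictions,
# e.g. "php" showing up as the strongest signal for the php class.
shap.summary_plot(shap_values, X_test_tfidf.toarray(),
                  feature_names=vectorizer.get_feature_names_out())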
Good features are the backbone of any machine learning model.

And good feature creation often needs domain knowledge, creativity, and lots of time.
Resources:
• GloVe: https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/
• Word2vec: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
• https://arxiv.org/pdf/1301.3781.pdf - Mikolov et al. 2013
• https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf - Mikolov et al. 2013
• https://arxiv.org/pdf/1411.2738v3.pdf - Rong 2014
• https://arxiv.org/pdf/1402.3722v1.pdf - Goldberg and Levy 2014
