Text Feature Engineering - NLP Lec 2
Susan Li
https://medium.com/@actsusanli
https://www.linkedin.com/in/susanli/
Sr. Data Scientist
What are features?
Example: the features we might use when predicting New York City taxi fare.
Feature Engineering for NLP: domain knowledge, brainstorming sessions.
Data collection
Removing stopwords, removing non-alphanumeric characters, removing HTML tags, and lowercasing are data preprocessing steps.
Scaling or normalization
PCA
Simple & complicated features
Topic Modeling
The future
Problem 1: Predict label of a
Stack Overflow question
https://storage.googleapis.com/tensorflow-workshop-examples/stack-overflow-data.csv
Problem formulation: our problem is best formulated as multi-class, single-label text classification, which assigns a given Stack Overflow question to one of a set of target labels.
python
Our machine learning algorithm looks at the textual content of each question, converts it to a word vector, and then applies an SVM to that vector.
N-grams
• Unigrams: “nlp multilabel classification problem” -> [“nlp”, “multilabel”, “classification”, “problem”]
• Bigrams: “nlp multilabel classification problem” -> [“nlp multilabel”, “multilabel classification”, “classification problem”]
• Trigrams: “nlp multilabel classification problem” -> [“nlp multilabel classification”, “multilabel classification problem”]
• Character 3-grams: “multilabel” -> [“mul”, “ult”, “lti”, “til”, “ila”, “lab”, “abe”, “bel”]
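The examples above can be reproduced in a few lines of plain Python (a minimal sketch; in practice you would usually let a vectorizer such as sklearn's CountVectorizer generate n-grams via its ngram_range parameter):

```python
def word_ngrams(text, n):
    """Word n-grams of a whitespace-tokenized string."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(word, n):
    """Character n-grams of a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(word_ngrams("nlp multilabel classification problem", 2))
# -> ['nlp multilabel', 'multilabel classification', 'classification problem']
print(char_ngrams("multilabel", 3))
# -> ['mul', 'ult', 'lti', 'til', 'ila', 'lab', 'abe', 'bel']
```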
TF-IDF computation of the word “python”
TF-IDF sparse matrix example
Without going into the math, TF-IDF is a word-frequency score that tries to highlight words that are more interesting, i.e. frequent in a document but not across documents.
Coding TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=3, max_features=None,
                             token_pattern=r'\w{1,}',
                             strip_accents='unicode', analyzer='word',
                             ngram_range=(1, 3), stop_words='english')
tfidf_matrix = vectorizer.fit_transform(questions_train)
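To make the computation for a word like “python” concrete, here is a hand-rolled sketch of the classic formula tf × log(N / df) on a toy three-document corpus (one of several tf-idf variants; sklearn's TfidfVectorizer uses a smoothed version, so its numbers differ slightly):

```python
import math

# Toy corpus standing in for the Stack Overflow questions (hypothetical examples).
docs = [
    "how do i sort a list in python",
    "python pandas dataframe merge",
    "sql join two tables",
]

def tf_idf(term, doc_index, docs):
    """tf-idf with raw term frequency and log(N / df) inverse document frequency."""
    tokens = [d.split() for d in docs]
    tf = tokens[doc_index].count(term)
    df = sum(1 for t in tokens if term in t)   # number of documents containing the term
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

# "python" occurs once in doc 0 and appears in 2 of 3 documents: 1 * log(3/2)
print(tf_idf("python", 0, docs))
```

A word that appears in every document gets idf = log(1) = 0, which is exactly the "frequent across documents is uninteresting" behaviour described above.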
The classical approach: extract from the sentence a rich set of hand-designed features, which are then fed to a shallow classification algorithm, e.g. a Support Vector Machine (SVM), often with a linear kernel.
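Putting the two pieces together, the classical recipe is a TF-IDF vectorizer feeding a linear-kernel SVM. A minimal runnable sketch on a hypothetical four-question corpus (the training questions and labels here are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for the Stack Overflow questions and their tags.
questions = [
    "how do i merge two dicts in python",
    "python list comprehension syntax",
    "select rows from a table in sql",
    "sql inner join vs left join",
]
labels = ["python", "python", "sql", "sql"]

# Hand-designed TF-IDF features fed to a linear SVM: the shallow pipeline above.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svm", LinearSVC()),
])
clf.fit(questions, labels)
print(clf.predict(["how to join tables in sql"]))
```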
You shall know a word by the company it keeps
(Firth, J. R. 1957:11)
https://www.slideshare.net/TraianRebedea/what-is-word2vec
Airbnb
https://medium.com/airbnb-engineering/listing-embeddings-for-similar-listing-recommendations-and-real-time-personalization-in-search-601172f7603e
Uber
https://eng.uber.com/uber-eats-query-understanding/
Word2Vec FastText
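Word2vec and FastText learn their vectors by training a shallow neural network, but Firth's distributional idea can be illustrated with a much simpler sketch: represent each word by the counts of its neighbours in a toy corpus, and compare those count vectors with cosine similarity (this is only the intuition, not word2vec's actual training procedure):

```python
import math
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()

def context_counts(target, tokens, window=1):
    """Count the words appearing within `window` positions of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

cat, dog, mat = (context_counts(w, corpus) for w in ("cat", "dog", "mat"))
# "cat" and "dog" keep the same company ("the ... sat"), so they end up more
# similar to each other than either is to "mat".
print(cosine(cat, dog), cosine(cat, mat))
```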
Topic Modeling:

from sklearn.decomposition import LatentDirichletAllocation

lda_model = LatentDirichletAllocation(n_components=20,
                                      learning_method='online',
                                      random_state=0, n_jobs=-1)
lda_output = lda_model.fit_transform(df_vectorized)
Example topics discovered: “sql”, “python”.
Problem 3: Auto detect
duplicate Stack Overflow
questions
Problem formulation: a binary classification problem in which we identify and classify whether question pairs are duplicates or not.
Meta features
• The word count of each question.
• The character length of each question.
• The number of common words between the two questions.
• ...

Fuzzy features
• Fuzzy string matching related features.
  (https://github.com/seatgeek/fuzzywuzzy)

Word2vec features
• Word2vec vectors for each question.
• Word mover’s distance.
  (https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html)
• Cosine distance between vectors of question1 and question2.
• Manhattan distance between vectors of question1 and question2.
• Euclidean distance between vectors of question1 and question2.
• Jaccard similarity between vectors of question1 and question2.
• ...
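The meta and distance features above can be sketched in plain Python. The vectors here are toy lists standing in for averaged word2vec question embeddings, and Jaccard similarity is usually computed on token sets rather than real-valued vectors, so it is omitted:

```python
import math

def meta_features(q1: str, q2: str) -> dict:
    """Simple count-based features for a question pair."""
    w1, w2 = q1.split(), q2.split()
    return {
        "word_count_q1": len(w1),
        "word_count_q2": len(w2),
        "char_len_q1": len(q1),
        "char_len_q2": len(q2),
        "common_words": len(set(w1) & set(w2)),
    }

def distance_features(v1, v2) -> dict:
    """Vector distances between two (toy) question embeddings."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return {
        "cosine_distance": 1 - dot / (n1 * n2),
        "manhattan": sum(abs(a - b) for a, b in zip(v1, v2)),
        "euclidean": math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2))),
    }

print(meta_features("how to sort a list", "how to sort a dict"))
print(distance_features([1.0, 0.0], [0.0, 1.0]))
```

Each pair of questions then becomes one feature row; stacking these rows gives the input matrix for the binary classifier.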
Debug end-to-end Auto ML models
• LIME
• SHAP
LIME
“php” is the biggest signal word used by our model, contributing most to “php” predictions.
It’s unlikely you’d see the word “php” used in a python question.
Good Features are the backbone of
any machine learning model.
And good feature creation often
needs domain knowledge, creativity,
and lots of time.
Resources:
• GloVe: https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/
• Word2vec: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
• Mikolov et al. 2013: https://arxiv.org/pdf/1301.3781.pdf
• Mikolov et al. 2013: https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
• Rong 2014: https://arxiv.org/pdf/1411.2738v3.pdf
• Goldberg and Levy 2014: https://arxiv.org/pdf/1402.3722v1.pdf