Lect_06_Feature_Engineering_and_Selection
Machine Learning Pipeline
Why is Feature Engineering Important?
Feature types
• Numerical
• Uniform, Quantile, and Clustered Discretization (Binning or Categorization)
• Categorical
• Ordinal, One-Hot, and Dummy Variable Encoding
• Date time
• Text
• Other domain specific features
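As a quick illustration of the binning and encoding options above, here is a minimal sketch using scikit-learn; the DataFrame and its column names ("income", "size") are made up for the example.

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, OrdinalEncoder

# Toy data (assumed): one numeric and one categorical feature
df = pd.DataFrame({
    "income": [25_000, 48_000, 52_000, 90_000, 120_000],
    "size":   ["S", "M", "M", "L", "S"],
})

# Discretization: strategy can be "uniform", "quantile", or "kmeans" (clustered)
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
df["income_bin"] = binner.fit_transform(df[["income"]]).ravel()

# Ordinal encoding imposes an order; one-hot encoding does not
df["size_ordinal"] = OrdinalEncoder(categories=[["S", "M", "L"]]).fit_transform(df[["size"]]).ravel()
size_onehot = OneHotEncoder().fit_transform(df[["size"]]).toarray()

print(df)
print(size_onehot)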
Feature Extraction from Date/Time
• Break up dates and time into individual features
• Date/Time -> Year, Month, Day, Hour, Min
• Pandas: df['date'].dt.year, …
• df['date'].dt.day_name() (the older dt.weekday_name accessor has been removed from pandas)
• Beginning of time (Unix epoch): 00:00:00 UTC on 1 January 1970
• Convert Time Zone
• df['date'].dt.tz_localize('Africa/Abidjan')
• df['date'].dt.tz_convert('Europe/London')
• Create new features
• Evening, Noon, Night
• Business hours or not
• Business quarter or season of the year
• Daylight savings or not, Public holiday or not
• Purchases_last_month, Purchases_last_week
• Time_Left – Time_Arrived
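A minimal pandas sketch of the accessors listed above; the timestamps and derived column names are invented for illustration.

import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2023-01-15 08:30", "2023-07-04 22:10"])})

# Break the timestamp into individual features
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df["hour"] = df["date"].dt.hour
df["weekday"] = df["date"].dt.day_name()        # replacement for the old dt.weekday_name

# Time-zone handling: localize naive timestamps, then convert
df["date_abidjan"] = df["date"].dt.tz_localize("Africa/Abidjan")
df["date_london"] = df["date_abidjan"].dt.tz_convert("Europe/London")

# Simple derived features
df["is_business_hours"] = df["hour"].between(9, 17)   # rough 9-to-5 definition
df["quarter"] = df["date"].dt.quarter

print(df)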
Feature Extraction from Date/Time – Time Series
• Lag Features
• Transform the time series problem into a supervised learning problem
• e.g., predict the value at the next time t + 1 given the value at the current time t
• Rolling Window Statistics
• Calculate summary statistics across the values in the sliding window and
include these as features in our dataset
• df.rolling(window=2).mean()
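A small sketch, on an invented series, of how lag and rolling-window features are built with pandas shift() and rolling().

import pandas as pd

s = pd.Series([10, 12, 11, 15, 14, 18], name="value")
df = s.to_frame()

df["lag_1"] = df["value"].shift(1)                    # value at time t-1, used to predict t
df["roll_mean_2"] = df["value"].rolling(window=2).mean()
df["roll_min_3"] = df["value"].rolling(window=3).min()

# Drop rows whose lags/windows are undefined before training a model
df = df.dropna()
print(df)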
Feature Engineering – Create New Features
• Polynomial features are created by raising existing features to an
exponent, usually 2 (squared) or 3 (cubed)
• Sklearn: for input [A, B], the degree-2 polynomial features are [1, A, B, A^2, AB, B^2]
• Interaction features: add new variables that represent the interaction
between features
• If you have features A and B create features A*B, A+B, A/B, A-B
• This explodes the feature space; for example,
• if you have 10 features and consider two-variable interactions:
• C(10, 2) = 10! / (2! (10 − 2)!) = 45 (combinations of two features)
• 45 × 4 operations = 180 new features are included in your model
• Crossing features are created by computing the cross product A×B to encode non-linearity
source: DataVedas
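A hedged sketch of the expansion described above using scikit-learn's PolynomialFeatures; the feature names A and B are just labels for the two input columns.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3],
              [4, 5]])                     # columns A and B (toy values)

poly = PolynomialFeatures(degree=2, include_bias=True)
X_poly = poly.fit_transform(X)             # [1, A, B, A^2, A*B, B^2]
print(poly.get_feature_names_out(["A", "B"]))
print(X_poly)

# Interaction terms only (A*B), without the squared terms
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(inter.fit_transform(X))              # columns: A, B, A*B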
Feature Engineering – Create New Features
• Ex. 2: add the interaction feature x3 = x1 x2, so the model y = b + w1 x1 + w2 x2 becomes y = b + w1 x1 + w2 x2 + w3 x3
• Ex. 3: [figure] data separated by a circle of radius r in the (x1, x2) plane is not linearly separable in the original features, but becomes separable once non-linear features are added
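A small illustration of the idea on this slide, with synthetic data (the coefficients are made up): adding x3 = x1·x2 lets a plain linear regression capture the interaction.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + 5.0 * x1 * x2          # true relationship includes x1*x2

X_linear = np.column_stack([x1, x2])
X_interact = np.column_stack([x1, x2, x1 * x2])         # add x3 = x1 * x2

print(LinearRegression().fit(X_linear, y).score(X_linear, y))       # misses the x1*x2 term
print(LinearRegression().fit(X_interact, y).score(X_interact, y))   # fits almost perfectly (R^2 ~ 1)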
Feature Engineering from Text (Unstructured Data)
BOW/TF – Vector
Source: A. Zheng and A. Casari (2018), Feature Engineering for Machine Learning
BOW/TF – Preprocessing for Cleaner Features
• Stopwords
• Frequency-Based Filtering
• Rare and/or frequent words
• Stemming
• Correcting spelling and grammar
• Removing character repetitions
• Etc.
Source: A. Zheng and A. Casari (2018), Feature Engineering for Machine Learning
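A minimal sketch of bag-of-words counts with two of the cleaning steps listed above (stop-word removal and frequency-based filtering); the toy corpus is an assumption.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "it is a puppy and it is extremely cute",
    "the cat is cute",
    "it is a kitten",
]

vectorizer = CountVectorizer(
    stop_words="english",   # drop common stopwords
    min_df=1,               # drop words rarer than this document frequency
    max_df=0.9,             # drop words that appear in almost every document
)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())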
Term Frequency – Inverse Document Frequency (tf.idf)
• It is essentially a feature scaling technique intended to reflect how important a word is to a document in a corpus
• tf.idf: a normalized count where each word count is divided by the number of documents in which this word appears
• tf(t, d): number of times a term t appears in a document d
• idf(t) = N / df(t), where N is the total number of documents and df(t) is the number of documents in which term t appears
• tf.idf(t) = tf · (N / df(t))
• tf.idf(t) = tf · log(N / df(t))
• Various smoothing formulas, e.g., tf · log((1 + N) / (1 + df(t)))
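A hedged sketch with scikit-learn's TfidfVectorizer on a toy corpus; note that by default it uses a smoothed variant, log((1 + N) / (1 + df(t))) + 1, rather than the plain formulas above.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "it is a puppy",
    "it is a cat",
    "it is a kitten",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)

# Words appearing in every document get low weight; distinctive words get high weight
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))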
Example: TF vs. TF.IDF
The weight of
• “is” is reduced to 0
• “puppy” is increased to 1.38
• “cat” is increased to 1.38
Traditional Feature Engineering - Limitations
Distributed Representation
Dense Vector Representation
Dense Vector Representation – Illustrative Examples
• Good at predicting other words appearing in its context
• Commonly used dimensions: 50, 100, 200, 300
Word2Vec: Continuous BOW (CBOW)
Word2Vec: Skip-Gram
Which one to choose?
Skip-Gram:
✓ Works well with a small amount of training data
✓ Represents even rare words or phrases well
CBOW:
✓ Much faster to train
✓ Better accuracy for frequent words
You need to conduct several experiments and pick what works best for your application (see the sketch below)
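A sketch of training both variants with gensim (assumed to be installed); the tiny corpus is made up, and real embeddings need far more data.

from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# sg=0 -> CBOW (faster, better for frequent words)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> Skip-Gram (better for small corpora and rare words)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)                   # 50-dimensional dense vector
print(skipgram.wv.most_similar("cat", topn=3))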
GloVe - Global Vectors for Word Representation
- An extension to word2vec for efficiently learning word vectors, developed by Pennington et al. at Stanford
- GloVe combines the global statistics of matrix factorization with the local context-based learning of word2vec
• Classical vector space model representations of words were developed using matrix factorization techniques such as Latent Semantic Analysis (LSA)
• Good at using global text statistics
• Not as good at capturing meaning and analogies
- GloVe constructs an explicit word co-occurrence matrix using
statistics across the whole text corpus, which can provide a more
global context
• words that co-occur frequently in a corpus are likely to have similar meanings
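In practice GloVe vectors are usually downloaded pretrained rather than trained from scratch; here is a sketch using gensim's downloader API (assumes the gensim-data package and an internet connection).

import gensim.downloader as api

# 100-dimensional GloVe vectors trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-100")

print(glove["bank"].shape)
print(glove.most_similar("river", topn=5))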
FastText
• Word2vec and GloVe struggle to get good representations of rare words or
words that were not present in the training corpus
• FastText is another extension of the word2vec model developed at
Facebook, which works well with rare words
• Instead of learning vectors for words directly, it represents each word as a bag of character n-grams
• For instance, the fastText representation of “artificial”, with n=3, is:
<ar, art, rti, tif, ifi, fic, ici, ial, al>
• This helps capture the meaning of shorter words and allows the embeddings to understand suffixes and prefixes
• Once the word has been represented using character n-grams, a skip-gram
model is trained to learn the embeddings
• It is a bag-of-n-grams model, since the order of the n-grams within a word is not taken into account
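A hedged sketch of gensim's FastText implementation on a made-up corpus, showing that the character n-grams let it produce a vector even for an out-of-vocabulary word.

from gensim.models import FastText

sentences = [
    ["artificial", "intelligence", "is", "fascinating"],
    ["machine", "learning", "uses", "artificial", "features"],
]

# min_n/max_n control the character n-gram length (here n=3, as in the example above)
model = FastText(sentences, vector_size=50, window=3, min_count=1, min_n=3, max_n=3, sg=1)

print(model.wv["artificial"].shape)
# Out-of-vocabulary word: its vector is assembled from its character n-grams
print(model.wv["artificially"].shape)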
Issues with Distributed Representation
• Word ambiguity
• The word bank could mean a land sloping down to a river or a financial
institution
• Representing each word with one vector cannot capture its different meanings
• Contextual word embeddings, which give a word a different vector in each context, address this
• Similarity is different from relatedness
• male and man are similar
• computer and keyboard are related but dissimilar
• Most evaluation datasets don’t distinguish between word similarity and
relatedness
• Bias inherited and amplified
• Gender, racial, and other biases are captured by the word embeddings from the data
• Bias gets amplified: word embeddings are used in downstream applications
Dimensionality Reduction
Why is Dimensionality Reduction Useful?
• Yield a more compact representation of the data
• Provide a more interpretable representation of the target concept
• Allow the model to focus attention on the most relevant variables
• Make it cheaper to collect fewer features (variables)
• Alleviate the "curse of dimensionality"
Curse of Dimensionality
When the dimensionality (number of features) increases, the volume of the space increases so fast that the available data points become sparse
• In a 1D feature space, our training data covers 20% of the population of cats and dogs
• In 2D we need to cover 45% of the population in each dimension (0.45^2 ≈ 20%)
• In 3D we need to cover 58% of the population in each dimension (0.58^3 ≈ 20%)
• For a fixed amount of data, adding dimensions leads to overfitting
• If we keep adding dimensions, the amount of training data needs to grow exponentially fast to maintain the same coverage and to avoid overfitting
Dimensionality Reduction Techniques
• Feature Selection Methods
• Matrix Factorization
• Manifold Learning
• Autoencoder Methods
Dimensionality Reduction — Feature Selection
• Filter (statistical) methods: use univariate measures of the relationship between each feature and the target to select the most predictive subset of features
• Examples include Pearson's correlation and the Chi-Squared test (see the sketch below)
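A minimal sketch of a filter method with scikit-learn's SelectKBest, using the Iris dataset only as a stand-in: each feature is scored against the target and only the top k are kept.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Chi-squared test (requires non-negative features); f_classif (ANOVA F-test) is another option
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_.round(2))      # univariate score per feature
print(X_selected.shape)               # (150, 2): only the 2 best features are kept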
Dimensionality Reduction – Matrix Factorization
• Matrix factorization is used to decompose a high-dimensional matrix
into two or more lower-dimensional matrices
• Reduce the number of features in a dataset while retaining as much
useful information as possible
• Common matrix factorization methods for dimensionality reduction:
• Principal Component Analysis (PCA)
• Singular Value Decomposition (SVD)
• Non-negative Matrix Factorization (NMF)
• Independent Component Analysis (ICA)
Dimensionality Reduction – PCA
• Find a set of orthogonal axes called principal components (PC) that
explain the maximum variance in the data
• Project the dataset onto these principal components to obtain a
lower-dimensional representation.
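A short sketch of PCA with scikit-learn on the Iris dataset (used here only as a stand-in): standardize first, then project onto the first two principal components.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

X_scaled = StandardScaler().fit_transform(X)       # PCA is sensitive to feature scale
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)               # variance captured per component
print(X_pca.shape)                                 # (150, 2)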
Dimensionality Reduction – PCA
After the transformation, the data points become mixed, making classification difficult (PCA maximizes variance, not class separability).
Dimensionality Reduction – Manifold Learning
• Manifold learning algorithms try to find a lower-dimensional
representation of the data that retains the essential information
• The projection is designed to both create a low-dimensional
representation of the dataset while preserving the
structure/relationship of the data
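As an illustration, here is a sketch with t-SNE, one of the manifold learning methods available in scikit-learn (Isomap and LocallyLinearEmbedding are others); the digits dataset is used only as a stand-in.

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)                # 64-dimensional digit images

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)                                  # (1797, 2): low-dimensional embedding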
Which Dimensionality Reduction Technique?
• No overall best technique for dimensionality reduction
• Use experiments to discover which techniques, when paired with your model of choice, result in the best performance on your validation set
• Typically, linear algebra and manifold learning methods assume that all input features have the same scale or distribution
• It is good practice to normalize or standardize the data prior to using these methods if the input variables have differing scales or units
• Dimensionality reduction is typically performed after data cleaning and data scaling and before training a predictive model
• It must also be applied to validation and test datasets, and to new data before making predictions in production (see the sketch below)
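A sketch of this workflow with a scikit-learn Pipeline (the dataset and model choice are placeholders): scaling and PCA are fitted on the training split only and then reused unchanged on the test split, as they would be in production.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=2)),
    ("model", LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)                 # transforms are learned on training data only
print(pipe.score(X_test, y_test))          # the same transforms are reused on test data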