
MSBA 315

ML & Predictive Analytics

Lecture 06 – Feature Engineering


Wael Khreich
[email protected]
Learning Outcomes
• Feature Engineering
• Structure data
• Feature extraction from date/time
• Time Series
• Creating new features
• Unstructured text data
• Traditional Representation
• One-Hot Encoding, TF, TF.IDF
• Distributed Representation
• Word embeddings: Word2Vec, GloVe, FastText
• Feature Selection
• Wrapper/Scoring (RFE, forward/backward selection)
• Embedded/Intrinsic (Feature importance)
• Dimensionality Reduction (PCA, t-SNE)

Machine Learning Pipeline

Why Is Feature Engineering Important?

“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”
—Andrew Ng (deeplearning.ai and Landing AI)

“Feature engineering is the art part of data science.”


—Sergey Yurgenson (Kaggle Grandmaster)

Feature types
• Numerical
• Uniform, Quantile, and Clustered Discretization (Binning or Categorization)
• Categorical
• Ordinal, One-Hot, and Dummy Variable Encoding
• Date time
• Text
• Other domain specific features
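A minimal pandas/scikit-learn sketch of the numerical and categorical options above. The DataFrame and column names are made up for illustration, and OneHotEncoder's sparse_output argument is named sparse in older scikit-learn versions.

    import pandas as pd
    from sklearn.preprocessing import KBinsDiscretizer, OrdinalEncoder, OneHotEncoder

    df = pd.DataFrame({"income": [25_000, 40_000, 58_000, 90_000, 120_000],
                       "city": ["Beirut", "Paris", "Beirut", "Tokyo", "Paris"]})

    # Uniform / quantile / clustered (k-means) discretization of a numerical feature
    binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")  # or "uniform", "kmeans"
    df["income_bin"] = binner.fit_transform(df[["income"]]).ravel()

    # Ordinal and one-hot encoding of a categorical feature
    df["city_ordinal"] = OrdinalEncoder().fit_transform(df[["city"]]).ravel()
    city_onehot = OneHotEncoder(sparse_output=False).fit_transform(df[["city"]])

    # Dummy-variable encoding: drop one level to avoid redundancy
    dummies = pd.get_dummies(df["city"], drop_first=True)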

Feature Extraction from Date/Time
• Break up dates and time into individual features
• Date/Time -> Year, Month, Day, Hour, Min
• Pandas: df['date'].dt.year, … (see the sketch below)
• df['date'].dt.day_name() (formerly .dt.weekday_name)
• Beginning of time (Unix epoch): 00:00:00 UTC on 1 January 1970
• Convert time zone
• df['date'].dt.tz_localize('Africa/Abidjan')
• df['date'].dt.tz_convert('Europe/London')
• Create new features
• Evening, Noon, Night
• Business hours or not
• Business quarter or season of the year
• Daylight savings or not, Public holiday or not
• Purchases_last_month, Purchases_last_week
• Time_Left - Time_Arrived
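A small pandas sketch of the extraction and derived flags listed above; the column names and the business-hours threshold are illustrative assumptions.

    import pandas as pd

    df = pd.DataFrame({"date": pd.to_datetime(["2024-01-15 09:30", "2024-07-04 22:10"])})

    df["year"] = df["date"].dt.year
    df["month"] = df["date"].dt.month
    df["day"] = df["date"].dt.day
    df["hour"] = df["date"].dt.hour
    df["weekday"] = df["date"].dt.day_name()      # replaces the older .dt.weekday_name
    df["quarter"] = df["date"].dt.quarter

    # Simple derived flags
    df["is_business_hours"] = df["hour"].between(9, 17)
    df["is_weekend"] = df["date"].dt.dayofweek >= 5

    # Time-zone handling: the timestamps are naive, so localize first, then convert
    df["date_utc"] = df["date"].dt.tz_localize("UTC")
    df["date_london"] = df["date_utc"].dt.tz_convert("Europe/London")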
Feature Extraction from Date/Time

Time series

• Lag Features
• Transform the time series problem into a self-supervised learning problem
• E.g., predict the value at the next time t+1 given the value at the current time t
• Rolling Window Statistics
• Calculate summary statistics across the values in the sliding window and
include these as features in our dataset
• df.rolling(window=2).mean()
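A pandas sketch of lag features and rolling-window statistics for a toy series; the series values are made up, and shifting before rolling is one common way to avoid leaking the current value into its own features.

    import pandas as pd

    frame = pd.DataFrame({"y": [3, 4, 5, 4, 6, 7]})

    # Lag features: predict y(t) from y(t-1), y(t-2), ...
    frame["lag_1"] = frame["y"].shift(1)
    frame["lag_2"] = frame["y"].shift(2)

    # Rolling-window statistics over previous values
    frame["roll_mean_2"] = frame["y"].shift(1).rolling(window=2).mean()
    frame["roll_max_3"] = frame["y"].shift(1).rolling(window=3).max()

    # Rows with NaNs created by shifting/rolling are usually dropped before training
    frame = frame.dropna()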

Feature Engineering – Create New Features
• Polynomial features are created by raising existing features to an
exponent, usually 2 (squared) or 3 (cubed)
• Sklearn: input [A, B] degree-2 polynomial features are [1, A, B, A^2, AB, B^2]
• Interaction features: add new variables that represent the interaction
between features
• If you have features A and B create features A*B, A+B, A/B, A-B
• This explodes the feature space; for example,
• if you have 10 features and consider two-variable interactions:
• C(10,2) = 10! / (2! (10−2)!) = 45 (pairs of features)
• 45 × 4 operations = 180 new features are included in your model
• Crossing features (feature crosses) are created by multiplying features together, e.g., A×B
• The resulting feature can encode non-linearity that a linear model cannot capture from A and B alone
• Feature crosses can also be applied to one-hot encoded categorical features (crossing their categories)
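A scikit-learn sketch matching the [1, A, B, A^2, AB, B^2] layout above; the small input matrix is illustrative.

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.array([[2.0, 3.0],
                  [1.0, 4.0]])          # columns play the role of A and B

    poly = PolynomialFeatures(degree=2, include_bias=True)
    X_poly = poly.fit_transform(X)       # -> [1, A, B, A^2, A*B, B^2]
    print(poly.get_feature_names_out(["A", "B"]))

    # Interaction terms only (no squares), often used for crosses of numeric columns
    crosses = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
    X_cross = crosses.fit_transform(X)   # -> [A, B, A*B]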


Feature Engineering – Feature Crossing

source: DataVedas

Feature Engineering – Create New Features
Ex. 1: Polar-coordinate features: replace (x1, x2) with r = sqrt(x1^2 + x2^2) and theta = arctan(x2 / x1)

Ex. 2: Cross feature: add x3 = x1*x2, so the linear model y = b + w1*x1 + w2*x2 becomes y = b + w1*x1 + w2*x2 + w3*x3

Ex. 3: House price predictor: cross feature [latitude × num_bedrooms]
Feature Engineering from Text
(unstructured data)

BOW/TF – Vector

• A text document is converted into a vector of counts
• The vector contains an entry for every
possible word in the vocabulary
• BOW converts a text document into a
flat vector
• Ignores ordering, structure, etc.
• Improvement: bag-of-n-grams
• Inefficient memory representation
• Improvement: Sparse vector representation
Source: A. Zheng and A. Casari (2018), Feature Engineering for Machine Learning
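A scikit-learn sketch of the bag-of-words (and bag-of-n-grams) representation described above; the toy corpus is made up.

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ["it is a puppy", "it is a cat", "it is a kitten",
              "that is a dog and this is a pen"]

    bow = CountVectorizer()                        # unigram counts, stored as a sparse matrix
    X = bow.fit_transform(corpus)
    print(bow.get_feature_names_out())
    print(X.toarray())

    bigrams = CountVectorizer(ngram_range=(1, 2))  # bag-of-n-grams improvement
    X2 = bigrams.fit_transform(corpus)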
BOW/TF – Matrix

Source: A. Zheng and A. Casari (2018), Feature Engineering for Machine Learning
BOW/TF – Preprocessing for Cleaner Features
• Stopwords
• Frequency-Based Filtering
• Rare and/or frequent words
• Stemming
• Correcting spelling and grammar
• Removing character repetitions
• Etc.

BOW could be extended to Bag-of-n-Grams

Source: A. Zheng and A. Casari (2018), Feature Engineering for Machine Learning
Term Frequency-Inverse Document Frequency (tf.idf)
• It is essentially a feature scaling technique intended to reflect how
important a word is to a document in a corpus
• tf.idf: a normalized count where each word count is divided by the number of documents in which that word appears
• tf(t, d): number of times a term t appears in a document d
• df(t): number of documents in which term t appears; N: total number of documents
• idf(t) = N / df(t)
• tf.idf(t, d) = tf(t, d) × N / df(t)
• tf.idf(t, d) = tf(t, d) × log(N / df(t))
• Various smoothing formulas, e.g., tf(t, d) × log((1 + N) / (1 + df(t)))
Example: TF vs. TF.IDF
The weight of
• “is” is reduced to 0
• “puppy” is increased to 1.38
• “cat” is increased to 1.38

tf.idf = tf × log(N / df(t))

tf.idf(puppy) = 1 × log(4/1) = 1.38
tf.idf(cat) = 1 × log(4/1) = 1.38
tf.idf(is) = 1 × log(4/4) = 0
Source: A. Zheng and A. Casari (2018), Feature Engineering for Machine Learning
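A short Python check of the numbers above using the lecture's formula tf × log(N/df(t)) with the natural logarithm; note that scikit-learn's TfidfVectorizer applies a smoothed variant by default, so its values differ slightly.

    import math

    N = 4                                        # total number of documents in the toy corpus
    df_counts = {"puppy": 1, "cat": 1, "is": 4}  # number of documents containing each term
    tf = 1                                       # each term appears once in the document considered

    for term, df_t in df_counts.items():
        print(term, round(tf * math.log(N / df_t), 2))
    # puppy 1.39, cat 1.39, is 0.0  (log(4) is about 1.386, shown as 1.38 on the slide)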
TF.IDF

• Term Frequency: reduces bias toward long documents (normalized count)
• Inverse Document Frequency: reduces bias toward common tokens
• TF-IDF measures the originality of a word
• TF-IDF: Can be used
• to identify most important tokens in a document
• to remove unimportant tokens
• or as a preprocessing step to dimensionality reduction

Traditional Feature Engineering - Limitations

• A bag of unstructured words ignores
• Semantics
• Structure
• Sequence
• Context
• Example: “No, I have money.” and “I have no money.” have the same BOW representation
• Huge sparse matrix of word counts
• Example vocabulary sizes: 20k (speech), 500k (big vocab), and 13M (Google 1T)

Distributed Representation

“You shall know a word by the company it keeps.”


–J.R. Firth 1957

Distributed Representation

Define the meaning of a word by understanding its context

Dense Vector Representation

• Instead of capturing co-occurrence counts directly, we predict the surrounding words of every word

• Learn low-dimensional representations of words by framing a prediction task: using context to predict words in a surrounding window

• Transform the unsupervised learning problem into a self-supervised prediction task

Dense Vector Representation – Illustrative Examples
• Good at predicting other words appearing in its context

• Commonly used dimensions: 50, 100, 200, 300

Source: David Rozado


Word2Vec
• Created by Google in 2013
• Predict every word from its context words
• Compute and generate high quality, distributed, and continuous
dense vector representations of words that capture contextual and
semantic similarity
• Take in massive textual corpora, create a vocabulary of possible
words, and generate dense word embeddings for each word in the
vector space representing that vocabulary
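A minimal training sketch, assuming the gensim library (4.x API); the toy corpus and hyperparameters are illustrative.

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "sat", "on", "the", "rug"]]   # toy tokenized corpus

    model = Word2Vec(sentences,
                     vector_size=100,   # embedding dimension (50-300 is typical)
                     window=5,          # context window size
                     min_count=1,       # keep rare words in this tiny corpus
                     sg=0)              # 0 = CBOW, 1 = Skip-Gram

    vec = model.wv["cat"]                    # dense vector for a word
    similar = model.wv.most_similar("cat")   # nearest neighbours in the embedding space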

Word2Vec: Continuous BOW (CBOW)

Word2Vec: Skip-Gram

Which one to choose?

Skip-Gram
✓ Works well with a small amount of training data
✓ Represents rare words or phrases well

CBOW
✓ Much faster to train
✓ Better accuracy for frequent words

• Each application has its own requirements
• You need to conduct several experiments and pick what is best for your application

GloVe - Global Vectors for Word Representation
- An extension to word2vec for efficiently learning word vectors,
developed by Pennington, et al. at Stanford
- GloVe combines both the global statistics of matrix factorization with
the local context-based learning (similar to word2vec)
• Classical vector space model representations of words developed using
matrix factorization techniques such as Latent Semantic Analysis (LSA)
• Good at using global text statistics
• Not as good at capturing meaning and analogies
- GloVe constructs an explicit word co-occurrence matrix using
statistics across the whole text corpus, which can provide a more
global context
• words that co-occur frequently in a corpus are likely to have similar meanings

FastText
• Word2vec and GloVe struggle to get good representations of rare words or
words that were not present in the training corpus
• FastText is another extension of the word2vec model developed at
Facebook, which works well with rare words
• Instead of learning vectors for words directly, it represents each word as an
n-gram of characters
• For instance, the fastText representation of “artificial”, with n=3, is:
<ar, art, rti, tif, ifi, fic, ici, cia, ial, al>
• Helps capture the meaning of shorter words and allows the embeddings to understand suffixes and prefixes
• Once the word has been represented using character n-grams, a skip-gram
model is trained to learn the embeddings
• It is effectively a bag-of-n-grams model: the order of the character n-grams within a word is not taken into account
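A few lines of Python that generate the character n-grams described above, with "<" and ">" marking the word boundaries.

    def char_ngrams(word, n=3):
        padded = f"<{word}>"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(char_ngrams("artificial"))
    # ['<ar', 'art', 'rti', 'tif', 'ifi', 'fic', 'ici', 'cia', 'ial', 'al>']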
Issues with Distributed Representation
• Word ambiguity
• The word bank could mean a land sloping down to a river or a financial
institution
• Representing each word with one vector cannot capture its different meaning
• Contextual Word Embeddings
• Similarity is different from relatedness
• male and man are similar
• computer and keyboard are related but dissimilar
• Most evaluation datasets don’t distinguish between word similarity and
relatedness
• Bias inherited and amplified
• Gender, racial, etc. bias are captured by the word embedding from the data
• Bias gets amplified: Word embedding are used in downstream applications
Dimensionality Reduction

Reducing the number of input features (dimensions)

Why Is Dimensionality Reduction Useful?
• Yield more compact representation of the data
• Provide more interpretable representation of the target concept
• Allows the model to focus attention on the most relevant variables
• Cheaper to collect fewer features (variables)
• Alleviate the “curse of dimensionality”

Curse of Dimensionality
When the dimensionality increases (number of features), the volume of the space increases so fast that the available data points become sparse

Source: Vincent Spruyt

• In a 1D feature space, suppose our training data covers 20% of the population of cats and dogs
• In 2D we need to cover 45% of the population in each dimension (0.45^2 ≈ 20%)
• In 3D we need to cover 58% of the population in each dimension (0.58^3 ≈ 20%)
• For a fixed amount of data, adding dimensions leads to overfitting
• If we keep adding dimensions, the amount of training data needs to grow exponentially fast to maintain the same coverage and to avoid overfitting
Dimensionality Reduction Techniques
• Feature Selection Methods
• Matrix Factorization
• Manifold Learning
• Autoencoder Methods

Dimensionality Reduction — Feature Selection
• Filter or Statistical methods: use univariate correlation between
features and target, to select a subset of features (most predictive)
• Examples include Pearson's correlation and Chi-Squared test

• Wrapper or Scoring methods: train and evaluate the model with different subsets of features and select the best subset (on a validation set)
• Examples: Recursive Feature Elimination (RFE), forward/backward selection (greedy)

• Embedded or Intrinsic methods: feature selection is part of the model training process
• Examples: feature importance of decision trees, ensembles of trees, and regression models
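A scikit-learn sketch of the wrapper approach with RFE; the synthetic dataset and the choice of 5 features are illustrative.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

    rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
    rfe.fit(X, y)

    print(rfe.support_)    # boolean mask of the selected features
    print(rfe.ranking_)    # 1 = selected; higher numbers were eliminated earlier
    X_selected = rfe.transform(X)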

Dimensionality Reduction – Matrix Factorization
• Matrix factorization is used to decompose a high-dimensional matrix
into two or more lower-dimensional matrices
• Reduce the number of features in a dataset while retaining as much
useful information as possible
• Common matrix factorization methods for dimensionality reduction:
• Principal Component Analysis (PCA)
• Singular Value Decomposition (SVD)
• Non-negative Matrix Factorization (NMF)
• Independent Component Analysis (ICA)

Dimensionality Reduction – PCA
• Find a set of orthogonal axes called principal components (PC) that
explain the maximum variance in the data
• Project the dataset onto these principal components to obtain a
lower-dimensional representation.
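A scikit-learn sketch projecting standardized data onto the first two principal components; the Iris dataset is used only for illustration.

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_iris(return_X_y=True)

    X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X_scaled)

    print(pca.explained_variance_ratio_)   # variance explained by each principal component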

Dimensionality Reduction – PCA
• (Figure) After the transformation, the data points became mixed, making classification difficult.
• (Figure) After the transformation, the data remained linearly separable.

Dimensionality Reduction – Manifold Learning
• Manifold learning algorithms try to find a lower-dimensional
representation of the data that retains the essential information
• The projection is designed to both create a low-dimensional
representation of the dataset while preserving the
structure/relationship of the data

• Examples of manifold learning:


• Self-Organizing Map (SOM)
• Locally Linear Embedding (LLE)
• Isometric Mapping (Isomap)
• Uniform Manifold Approximation and Projection (UMAP)
• t-distributed Stochastic Neighbor Embedding (t-SNE)
Source: Wikipedia

• Projected features have little relationship with the original features
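A scikit-learn sketch of a 2-D t-SNE embedding for visualization; the dataset and perplexity value are illustrative, and the resulting coordinates are not reusable as features for new samples.

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    X, y = load_digits(return_X_y=True)

    tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
    X_embedded = tsne.fit_transform(X)     # shape (n_samples, 2), typically used for plotting

    print(X_embedded.shape)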


Dimensionality Reduction – Autoencoders
• Use a deep neural network architecture to reduce the feature dimension
• Self-supervised learning: force a model to reproduce its input correctly
• A network model is used to compress the data flow to a bottleneck
layer with far fewer dimensions than the original input data

Source: Samyak Kala
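A minimal autoencoder sketch, assuming TensorFlow/Keras is available; the layer sizes and the 3-dimensional bottleneck are illustrative.

    from tensorflow import keras
    from tensorflow.keras import layers

    input_dim, bottleneck_dim = 30, 3

    inputs = keras.Input(shape=(input_dim,))
    encoded = layers.Dense(16, activation="relu")(inputs)
    bottleneck = layers.Dense(bottleneck_dim, activation="relu")(encoded)  # compressed representation
    decoded = layers.Dense(16, activation="relu")(bottleneck)
    outputs = layers.Dense(input_dim, activation="linear")(decoded)

    autoencoder = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, bottleneck)      # reuse the encoder for dimensionality reduction
    autoencoder.compile(optimizer="adam", loss="mse")

    # Self-supervised: the input is also the target
    # autoencoder.fit(X_train, X_train, epochs=50, batch_size=32)
    # X_reduced = encoder.predict(X_train)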

Which Dimensionality Reduction Technique?
• No overall best technique for dimensionality reduction
• Use experiments to discover which techniques, when paired with
your model of choice, result in the best performance on your validation set
• Typically, linear algebra and manifold learning methods assume that
all input features have the same scale or distribution
• A good practice to either normalize or standardize data prior to using these
methods if the input variables have differing scales or units
• Dimensionality reduction is typically performed after data cleaning
and data scaling, and before training a predictive model
• Must also be performed on validation and test datasets, and before
making a prediction in production

