
Introduction to machine learning

Feature Selection / Feature Engineering

1
Introduction to machine learning

Feature Selection (TOC)


S.No | Topic | Scope | Objective

1. Introduction to Feature Selection – Scope: discuss what feature selection is and why it is important. Objective: distinguish between good and not-so-good features; understand information / signal, noise and SNR.

2. Features, Signal & Noise – Scope: discuss the meaning of the signal and noise associated with a feature. Objective: understand what SNR is; good features have a large SNR, resulting in high accuracy and generalization.

3. Feature Selection Methods – Scope: exploratory data analysis using descriptive statistics; assess given features using visual tools; filter, wrapper and embedded methods; create a feature report. Objective: learn different ways of identifying good features, including how to use a pair plot to identify good features visually.

4. Feature Engineering Methods – Scope: fix missing values and outliers; generate new features from existing ones. Objective: transform features and generate new features to improve SNR at the model level.

5. Code walk-through on feature engineering and feature selection – Scope: missing value and outlier treatment; generating polynomial features. Objective: familiarize with the appropriate libraries and function calls used for feature engineering and selection.

6. Classwork 1 and 2 – Scope: feature engineering / selection. Objective: familiarize with feature selection and engineering techniques.

7. Advantages / disadvantages of feature selection / engineering – Scope: feature selection / engineering techniques. Objective: appreciate the need for feature selection.

2
Introduction to machine learning

Feature Selection (Contd…)


S.No | Topic | Scope | Objective

8. Business applications – Scope: case studies from the real world. Objective: understand real-world applications.

9. Interview questions on feature selection and engineering – Scope: revisit the key points of feature selection and feature engineering. Objective: interpret the questions, apply the concepts and respond accurately.

10. Online assets – Scope: techniques of feature selection and engineering covered. Objective: expand horizons beyond what is covered in class in this area.

11. Practice work – Scope: techniques of feature selection and engineering covered. Objective: learn feature selection and engineering through practice on other data sets.

3
Introduction to machine learning

Introduction to Feature Selection

4
Introduction to machine learning

Introduction to Feature Selection

1. A strategy to focus on things that matter and ignore those that don’t (author name
unknown)

2. It is the process of filtering out, from all the features given in the data, those that are not good predictors

3. Establishing data reliability is important and equally important is to assess whether all the
data given (in form of features / attributes) are relevant for the task at hand

4. The power of a model is a function of the power of the algorithm and the quality of the data used.
a. Any ML algorithm with poor quality data = poor model
b. Any ML algorithm with good quality data = powerful model
c. Powerful algorithm with not bad data = powerful model

d. Powerful algorithms are those that can handle complex distributions by employing mathematical tricks such as projecting data into higher dimensions to achieve the task, e.g., kernel SVMs (KSVM) or deep neural networks (DNN)

e. Good quality data is when most of the features are strong predictors of the target
f. Not bad data is when not all the features are poor predictors of the target
g. Poor quality data is when all the features are poor predictors of the target

5
Introduction to machine learning

Introduction to Feature Selection

1. Datasets often carry some irrelevant, unimportant features. These features cause a number of problems which in turn prevent efficient predictive modeling -
a. These features add more noise to the data than information
b. They lead towards sub-optimal solutions due to unnecessary overfitting
c. Training the model takes more time and other resources

6
Introduction to machine learning

Features Signal & Noise

7
Introduction to machine learning

Introduction to Feature Selection

1. Every feature in a given data set contributes to the SNR (signal-to-noise ratio) at the model level, i.e., it contains both information (a.k.a. signal) and noise (unexplainable variance)

2. Powerful features add more signal than noise, while weak ones add less signal and more noise to a model

3. Including powerful features while filtering out weak ones increases the SNR and thus maximizes the information at the model level

4. One indirect way of assessing SNR is the adjusted R-squared used in linear regression, the information gain in decision trees, or the AIC / BIC metrics

5. A higher SNR leads to better accuracy and generalizability of the models. It also helps keep models simple.

6. Filtering out poor features also reduces the load on computational resources. Hence, feature selection plays an important role in building good models

8
Introduction to machine learning

Feature Selection Methods

9
Introduction to machine learning

Feature Selection : Methods of feature selection

1. There are three broad categories of methods for feature selection


a. Filter methods
b. Wrapper methods
c. Embedded methods

2. Filter methods – remove features using statistical methods that score each feature for its relevance, e.g., measuring the correlation of each variable independently with the target, or using p-values from a statistical model

3. Wrapper methods – evaluate and compare different combinations of features and assign each a score. To score a combination, a predictive model is run on it, and the combination with the highest accuracy gets the highest score.

4. Embedded methods – some models, such as decision trees and random forests, have built-in feature evaluation and selection mechanisms

5. Exploratory Data Analytics (EDA) is the first step towards feature selection

6. Visual assessment – Visual analysis of the features using tools such as Kernel Density
Estimates, scatter plots
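
A minimal sketch of a filter-style check, assuming a pandas DataFrame whose "target" column holds the label (the file and column names are illustrative, not from the course notebooks):

import pandas as pd

df = pd.read_csv("my_dataset.csv")              # illustrative file name

# Filter method: score each feature by its absolute correlation with the target
scores = df.corr()["target"].drop("target").abs().sort_values(ascending=False)
print(scores)                                   # higher scores suggest stronger univariate predictors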

10
Introduction to machine learning

Feature Selection : Methods of feature selection

Exploratory Data Analysis – some of the key activities performed in EDA include:

1. Give the attributes meaningful, standardized names to address any ambiguities that can lead to mistakes. For example, does "mileage" mean miles per gallon or miles covered on the odometer? Is "blood pressure" systolic or diastolic? Each has a different range.

2. Gather meta-information about the data: what it is, how it was collected, units of measurement, frequency of measurement, possible range of values, etc. This gives an idea of what kind of challenges are likely in using the dataset.

3. Address the challenges in the data in its existing form, e.g., missing values, outliers, data shift, sampling bias.

4. Descriptive statistics – central values, spread, skew, tails, mixtures of Gaussians.

5. Data distribution across different target classes (if in classification domain)
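
A few pandas calls that cover most of the activities above; the file name and the 'class' target column are illustrative:

import pandas as pd

pima = pd.read_csv("pima-indians-diabetes.csv")   # illustrative file name

print(pima.describe().T)                # central values, spread, quartiles per column
print(pima.skew(numeric_only=True))     # strong positive skew hints at a long right tail
print(pima.isnull().sum())              # missing values per column
print(pima.groupby("class").mean())     # feature distributions across the target classes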

11
Introduction to machine learning
Feature Selection : Methods of feature selection
Exploratory Data Analytics PIMA Diabetic Data –

12
Introduction to machine learning
Feature Selection : Methods of feature selection
Exploratory data analytics (EDA)

Key observations from PIMA diabetes dataset EDA

1. Preg: mean > median; max − Q3 > Q1 − min; Q3 − Q2 ≈ Q2 − Q1 – the body of the distribution is almost symmetric, with a long tail on the right side
2. Test: mean > median; max − Q3 >> Q1 − min; Q3 − Q2 >> Q2 − Q1 – the body of the distribution is also skewed
3. Most attributes have a long tail on one side or the other. This calls for an investigation of these points to establish the root cause
4. In their raw form, these features are not in a state where they can be used for modelling
13
Introduction to machine learning
Feature Selection : Methods of feature selection

1. Long tails are seen on the 'test' and 'pedi' columns

2. On all the columns, the distributions of the two classes almost completely eclipse one another

3. There are minor differences in the distributions of 'age' and 'plas'

4. None of the columns looks like a good differentiator on its own

14
Introduction to machine learning
Feature Selection : Methods of feature selection
Exploratory data analytics (EDA)

Key observations from PIMA diabetes dataset EDA

1. On most of the given features, the two classes (diabetic and non-diabetic) have overlapping distributions, as in the case of the blood pressure and skin test columns
a. Such attributes fail to discriminate between the two classes and hence are of little use for classification

2. On many attributes we see a long thin tail; some, such as 'test' and 'pedi', have a small bump in the tail
a. A long tail indicates the presence of extreme outliers; a bump indicates many outlier values in that column
b. Outliers impact the accuracy of all models, some more seriously than others (e.g., logistic regression vs. SVC with soft margins)

15
Introduction to machine learning
Feature Selection : Methods of feature selection
Exploratory Data Analytics Auto_mpg Data –

1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each
instance)

16
Introduction to machine learning
Feature Selection : Methods of feature selection

Exploratory data analytics (EDA)

Key observations from auto-mpg dataset EDA

1. mpg: mean ≈ median; median − Q1 ≈ Q3 − median; max − Q3 = 17.6 > Q1 − min – this indicates the distribution of this column is approximately Gaussian, except for a long tail on the right side (confirm visually)
2. cyl: categorical variable; median (4) < mean (5.5) indicates a long tail on the right side, possibly due to a few very large cars which have a large number of cylinders
3. disp: median (148.5) < mean (193.4); median − Q1 = 36 < Q3 − median = 164.5, so the body is asymmetric; max − Q3 >> Q1 − min, i.e., a long tail on the right side and the presence of extremely large cars (check visually)

17
Introduction to machine learning
Feature Selection : Methods of feature selection

1. MPG, the target variable, has an approximately Gaussian distribution

2. Some features show multi-modal distributions

3. We are likely to find long, thick tails (many outliers)

4. Except for 'acceleration', most features show a strong relationship with the target feature (observe the first row of the pair plot)

5. There is strong collinearity between independent features

6. An OLS model with linear and quadratic terms may serve the purpose

18
Introduction to machine learning
Feature Selection : Methods of feature selection

Exploratory data analytics (EDA)

Key observations from auto-mpg dataset EDA

1. Strong collinearity among multiple attributes

2. All attributes except 'acceleration' are strong predictors of mpg

3. Multiple Gaussians (multi-modal distributions) on various attributes

19
Introduction to machine learning
Feature Selection : Methods of feature selection

1. Univariate selection using statistical tests

a. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features
b. A statistical approach using p-values in statsmodels

2. Recursive feature elimination (wrapper methods)

a. Use sklearn's RFE (Recursive Feature Elimination) with a given algorithm

3. Ensemble methods for feature selection (embedded methods)

a. Use the model.feature_importances_ attribute to identify the good features

FeatureSelectionMethods.ipynb

Statsmodel_lm_diff_distributions.ipynb
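
A minimal sketch of the three approaches above; the dataset, the value of k and the estimators are illustrative, not those used in the notebooks:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# 1. Filter: univariate statistical test, keep the 5 best-scoring features
X_best = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# 2. Wrapper: recursive feature elimination around a chosen estimator
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=5).fit(X, y)
print(rfe.support_)                     # boolean mask of the selected features

# 3. Embedded: tree ensembles expose built-in importance scores
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(rf.feature_importances_)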

20
Introduction to machine learning

Feature Engineering

21
Introduction to machine learning

Feature Engineering

1. Feature engineering is the process of transforming features from the state they are in (with missing values, outliers, skews) to a state where they are likely to contribute to model accuracy

2. Feature engineering consists of imputing missing values, fixing outliers, applying mathematical transformations, and generating new features from existing ones

3. After each transformation it is mandatory to ensure we have not introduced bias or variance errors
a. Bias errors can creep in due to sampling, elimination of features, the chosen complexity of the model, and the strategy used to fix missing values and outliers
b. Variance errors can creep in through generation of features from existing features, or through deleting records with missing values or outliers

22
Introduction to machine learning

Feature Engineering PIMA Diabetes

Missing Values
pima.describe()

Relatively long tail on the higher side of the central values.
Mean and median differ, but not too badly, except in 'test' and 'pedi'.
Few extreme outliers, except in 'test' and 'pedi', which seem to have many outliers.

Fancyimputerknn_pima.ipynb
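
The notebook above uses a KNN-based imputer; a minimal sketch of the same idea with scikit-learn's KNNImputer is shown below. The file and column names are illustrative; in PIMA, physiologically impossible zeros act as missing values in several columns:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

pima = pd.read_csv("pima-indians-diabetes.csv")       # illustrative file name

# Zeros in these columns are treated as missing values
cols = ["plas", "pres", "skin", "test", "mass"]
pima[cols] = pima[cols].replace(0, np.nan)

# Each missing value is filled from the 5 nearest rows that do have the value
pima[cols] = KNNImputer(n_neighbors=5).fit_transform(pima[cols])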

23
Introduction to machine learning
Outlier Analysis

What is an outlier? Some definitions:

It is an observation that :
1. deviates so much from other observations as to arouse suspicion that it was
generated by a different mechanism
2. appears to deviate markedly from other members of the sample in which it occurs
3. appears to be inconsistent with the remainder of that set of data

Outlier detection methods:

Can be broadly divided into two categories:

1. Univariate methods – identify outliers on a single attribute using statistical analysis such as the IQR (inter-quartile range) or standard deviation
2. Multivariate methods – identify outliers based on trends across the data, using measures such as the Mahalanobis distance

FeatureSelectionMethods.ipynb
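
A minimal univariate (IQR-based) detection sketch, assuming a pandas DataFrame with a 'test' column (file and column names are illustrative):

import pandas as pd

def iqr_outliers(series, k=1.5):
    # Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whisker rule
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

pima = pd.read_csv("pima-indians-diabetes.csv")       # illustrative file name
mask = iqr_outliers(pima["test"])
print(mask.sum(), "univariate outliers detected on the 'test' column")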

24
Introduction to machine learning

Outlier detection methods:

Can be broadly divided into two categories based on number of attributes :


1. Univariate methods – identify outliers on a single attribute using statistical analysis
such as IQR (Inter Quartile Range) or standard deviation

2. Multivariate methods – identify outliers based on the trends in the data, use methods
such as Mahalanobis distance

25
Introduction to machine learning

Histogram and Box plot for univariate outlier detection:

26
Introduction to machine learning

Histogram and Box plot for univariate outlier detection:

27
Introduction to machine learning

What is an outlier? It is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism; appears to deviate markedly from other members of the sample in which it occurs; or appears to be inconsistent with the remainder of that set of data.

Some examples of outliers –

28
Introduction to machine learning

Feature Engineering (Outlier Treatment)


SN | Strategy | When to use | Advantage | Disadvantage

1. Replace outliers with central values – When to use: when we have few data points with values just outside the limits. Advantage: simple, and not likely to spoil the data if handled appropriately. Disadvantage: will change the spread if a few outliers have extreme values, or if there are a lot of outliers just beyond the limits.

2. Cap the outliers to the first / third quartile, based on which side of the central value they lie – When to use: when a few outliers are close to the limits. Advantage: simple approach, easy to implement. Disadvantage: may create multiple modes if there are too many outliers; may change the central values significantly.

3. Replace outliers with predicted values – When to use: when there are too many outliers with varying values and the columns are strongly correlated. Advantage: simple and logical. Disadvantage: relationships between attributes may not be strong enough, or the relations may be a statistical chance.
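
A minimal sketch of strategies 1 and 2 from the table above (the file and column names are illustrative, not from the course notebooks):

import pandas as pd

pima = pd.read_csv("pima-indians-diabetes.csv")   # illustrative file name
col = pima["test"]

q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = ~col.between(low, high)

# Strategy 1: replace outliers with a central value (here the median)
pima["test_median_fix"] = col.where(~is_outlier, col.median())

# Strategy 2: cap outliers back to the nearest quartile
capped = col.copy()
capped[is_outlier & (col > q3)] = q3
capped[is_outlier & (col < q1)] = q1
pima["test_capped"] = capped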

29
Introduction to machine learning
Feature Engineering (Generating new features)

Feature generation from existing features requires domain expertise (which is required anyway) and also a good understanding of the distributions among the various attributes (pair plot analysis).

Observe the off-diagonal distributions in the pair plot to understand the relationships:
a. In classification, are the classes clustered along some attributes?
b. In regression, is there significant collinearity between the attributes, and is it linear, quadratic or of higher order?

30
Introduction to machine learning
Feature Engineering (Generating new features)

In the PIMA diabetes dataset we observe the following distribution pattern across the BMI and test columns:

1. The scales are standardized (Z-scored). When BMI is less than 0 and test is less than 0, the density of non-diabetic points (blue) is higher than that of diabetic points (orange).

2. When BMI > 0 and test > 0 (top right-hand quadrant), the density of diabetic points is greater than that of non-diabetic points. The flag functions below encode this observation.

def genbmi(row):
    # 1 if the (Z-scored) BMI is above the mean, else 0
    if row["BMI"] < 0:
        return "0"
    else:
        return "1"

def gentest(row):
    # 1 if the (Z-scored) test value is above the mean, else 0
    if row["test"] < 0:
        return "0"
    else:
        return "1"
Ref: PIMA_Feature Engineering
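
A sketch of how these flag functions might be applied to the Z-scored PIMA DataFrame; the frame name pima_z and the new column names are illustrative:

# New binary features derived from the quadrant observation above
pima_z["BMI_flag"] = pima_z.apply(genbmi, axis=1)
pima_z["test_flag"] = pima_z.apply(gentest, axis=1)

# Combined flag: "11" marks the top-right quadrant where diabetics dominate
pima_z["bmi_test_quadrant"] = pima_z["BMI_flag"] + pima_z["test_flag"]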

31
Introduction to machine learning
Feature Engineering (Generating polynomial features)
In the car-mpg.csv dataset we notice that many features have:

1. A strong positive / negative relationship (collinearity) amongst themselves

2. In some cases the attributes are collinear and the relationship is curved

3. Some attributes have a geometrically non-linear relation with the target (MPG)

32
Introduction to machine learning
Feature Engineering (Generating polynomial features)

1. Given these observations, we can use the sklearn polynomial feature generator to generate new features automatically

2. These new features will embody these observations and hopefully give us a better model, both in terms of accuracy and generalizability

3. For example, if an input sample is two-dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2]

4. A polynomial of degree n is a function of the form f(x) = a_n x^n + a_(n-1) x^(n-1) + … + a_1 x + a_0, where the a's are real numbers (sometimes called the coefficients of the polynomial)

5. For example, f(x) = 4x^3 − 3x^2 + 2 is a polynomial of degree 3, as 3 is the highest power of x in the formula. This is called a cubic polynomial, or just a cubic

6. Generating polynomial features from existing features is equivalent to projecting the data from the given dimensions into a higher-dimensional space
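
A minimal sketch of the scikit-learn generator mentioned above, reproducing the [1, a, b, a^2, ab, b^2] expansion (feature naming requires a recent scikit-learn version):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3]])                      # one sample of the form [a, b]
poly = PolynomialFeatures(degree=2)

print(poly.fit_transform(X))                # [[1. 2. 3. 4. 6. 9.]] i.e. [1, a, b, a^2, ab, b^2]
print(poly.get_feature_names_out())         # ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']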

33
Introduction to machine learning
Feature Engineering (Generating polynomial features)

It is evident in the pair plot that some attributes have a linear relationship with mpg. Others seem to have a quadratic relationship, at least in part. Still others seem to have a quadratic relationship with a linear component.

34
Introduction to machine learning
Feature Engineering (Generating polynomial features)

With polynomial features

Source: https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0092248.g004

35
Introduction to machine learning
Feature Engineering (Generating polynomial features)

A simple linear model with no polynomial features may be an underfit, as it under-represents the true relation between the attributes and the target.

With quadratic features, i.e. a degree-2 polynomial, we may get a better model, as it embodies the polynomial distribution of points across the attributes and the target variable.

But with many original features, the polynomial feature generator can grow the feature-space dimensionality exponentially, leading to the curse of dimensionality and making our model overfit.

Image source: http://www.gitta.info/SpatChangeAna/en/html/spatial_dist_an_TrendSurf.html

36
Introduction to machine learning
Feature Engineering (Generating polynomial features)

To avoid the curse of dimensionality, i.e. overfitting, we regularize the optimization function.

Regularization prevents the optimization function from generating large-magnitude coefficients to represent the relation between the attributes and the target.

Large-magnitude coefficients are a tell-tale sign of the curse of dimensionality, where most of the feature space is empty.

Given a free run, the algorithms will in such cases generate surfaces with many sharp peaks and valleys, with each point sitting in its own valley or on its own peak.

Ridge_Lasso_poly_AutoMPG.ipynb
37
Introduction to machine learning

Preprocessing attribute data

One of the most common transformations on numeric data concerns the scale / measurement units of the data. The learning processes (e.g., gradient descent) under the hood of various algorithms benefit from this.

1. Rescale – when the attributes have different scales, transform them onto a single scale. For this we can use MinMaxScaler or Z-scores. Z-scores also centre the data, i.e., the mean of every numerical attribute becomes 0. Helpful for algorithms based on distance calculations and for gradient-descent-based algorithms.

2. Standardize – transform attributes that have Gaussian distributions with different means and standard deviations into distributions with mean 0 and standard deviation 1. Useful for algorithms that expect a Gaussian distribution.

Ref: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
38
Introduction to machine learning

Preprocessing attribute data

3. Normalize – rescale the attribute values so that each observation (row) has a vector length of 1. Used when attributes have different scales and the data is sparse. Can be useful for algorithms using distance calculations and learning processes such as gradient descent.

4. Binarize – convert a given attribute / target variable to 0/1. For example, it can be used to modify thresholds in probability-based models such as logistic regression to improve class-level predictions.

Ref: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
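
A minimal sketch of the four transformations above using scikit-learn (toy data, not from the course notebooks):

import numpy as np
from sklearn.preprocessing import Binarizer, MinMaxScaler, Normalizer, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])                       # two columns on very different scales

print(MinMaxScaler().fit_transform(X))             # rescale each column to [0, 1]
print(StandardScaler().fit_transform(X))           # standardize each column: mean 0, std 1
print(Normalizer().fit_transform(X))               # rescale each row to unit length
print(Binarizer(threshold=2.5).fit_transform(X))   # 1 where value > threshold, else 0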
39
Introduction to machine learning

Feature Engineering on Numeric data

1. Integers and floats are the most common data types that are directly used in building
models. Transforming them before modelling may yield better results!

2. Feature engineering on numerical columns may take the form of:
a. scaling the data, when using algorithms that involve similarity measurements based on distance calculations
b. transforming the distributions using mathematical techniques such as PCA
c. generating polynomial features from existing ones

3. PCA helps improve the signal-to-noise ratio by taking into account the covariance between features

4. Interaction and polynomial features enrich models, especially linear models
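
A minimal PCA sketch on standardized data; the dataset and component count are illustrative:

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# PCA rotates the standardized features onto uncorrelated components,
# ordered by the share of total variance they explain
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=5).fit(X_std)
print(pca.explained_variance_ratio_)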

40
Introduction to machine learning

Regularization for Feature Selection

Linear Model with binning and polynomial features

1. The SVM polynomial kernel generates polynomial features according to the degree specified (default degree = 3)
2. The gamma factor controls the flexibility of the curve
3. Since SVM uses the kernel trick, it does not explicitly generate all the synthetic features; however, it gets the same benefit
4. The kernel trick saves computation time and resources, which may otherwise prove too costly when working with explicit polynomial features
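
A minimal sketch of a polynomial-kernel SVC; the dataset and parameter values are illustrative, not taken from the course notebooks:

from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# degree controls the implicit polynomial expansion (default 3);
# gamma controls the flexibility of the decision surface;
# the kernel trick avoids materializing the synthetic features
model = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, gamma="scale", C=1.0))
print(model.fit(X, y).score(X, y))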

41
Introduction to machine learning

Linear Regression Model -

Lab 1 – Estimating mileage based on features of a second-hand car

Description – sample data is available at https://archive.ics.uci.edu/ml/datasets/Auto+MPG

The dataset has 9 attributes, listed below:
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)

Sol : Ridge_Lasso_Regression.ipynb

42
Introduction to machine learning

Regularising Linear Models (Shrinkage methods)

When we have too many parameters and are exposed to the curse of dimensionality, we resort to dimensionality-reduction techniques such as transforming to principal components and eliminating the components with the smallest eigenvalues. This can be a laborious process before we find the right number of principal components. Instead, we can employ shrinkage methods.

Shrinkage methods attempt to shrink the coefficients of the attributes and lead us towards simpler yet effective models. The two shrinkage methods are:

1. Ridge regression is similar to linear regression, where the objective is to find the best-fit surface. The difference is in the way the best coefficients are found: unlike linear regression, whose cost function is the SSE, ridge adds a penalty term.

Linear regression cost function:  J = Σ (yᵢ − ŷᵢ)²
Ridge regression cost function:   J = Σ (yᵢ − ŷᵢ)² + λ Σ βⱼ²

2. The λ term acts as a penalty on large-magnitude coefficients: when it is set to a high value, the coefficients are suppressed significantly; when it is set to 0, the cost function becomes the same as the linear regression cost function.
43
Introduction to machine learning

Regularising Linear Models (Shrinkage methods)

Why should we be interested in shrinking the coefficients? How does it help?

When we have a large number of dimensions and few data points, the models are likely to become complex, overfit and be prone to variance errors. When you print out the coefficients of the attributes of such a complex model, you will notice that the magnitudes of the coefficients are large.

Large coefficients indicate a case where, for a unit change in the input variable, the magnitude of change in the target column is very large.

Coefficients for a simple linear regression model with 10 dimensions:

1. The coefficient for cyl is 2.5059518049385052
2. The coefficient for disp is 2.5357082860560483
3. The coefficient for hp is -1.7889335736325294
4. The coefficient for wt is -5.551819873098725
5. The coefficient for acc is 0.11485734803440854
6. The coefficient for yr is 2.931846548211609
7. The coefficient for car_type is 2.977869737601944
8. The coefficient for origin_america is -0.5832955290166003
9. The coefficient for origin_asia is 0.3474931380432235
10. The coefficient for origin_europe is 0.3774164680868855

Coefficients after adding polynomial features (the feature count shoots up to 57 from 10) – very large coefficients:

-9.67853872e-13 -1.06672046e+12 -4.45865268e+00 -2.24519565e+00 -2.96922206e+00 -1.56882955e+00 3.00019063e+00 -1.42031640e+12 -5.46189566e+11 3.62350196e+12 -2.88818173e+12 -1.16772461e+00 -1.43814087e+00 -7.49492645e-03 2.59439087e+00 -1.92409515e+00 -3.41759793e+12 -6.27534905e+12 -2.44065576e+12 -2.32961194e+12 3.97766113e-01 1.94046021e-01 -4.26086426e-01 3.58203125e+00 -2.05296326e+00 -7.51019934e+11 -6.18967069e+11 -5.90805593e+11 2.47863770e-01 -6.68518066e-01 -1.92150879e+00 -7.37030029e-01 -1.01183732e+11 -8.33924574e+10 -7.95983063e+10 -1.70394897e-01 5.25512695e-01 -3.33097839e+00 1.56301740e+12 1.28818991e+12 1.22958044e+12 5.80200195e-01 1.55352783e+00 3.64527008e+11 3.00431724e+11 2.86762821e+11 3.97644043e-01 8.58604718e+10 7.07635073e+10 6.75439422e+10 -7.25449332e+11 1.00689540e+12 9.61084146e+11 2.18532428e+11 -4.81675252e+12 2.63818648e+12

Ref: Ridge_Lasso_Regression.ipynb

44
Introduction to machine learning
Regularising Linear Models (Shrinkage methods)

Z = f(x, y)

1. The curse of dimensionality results in large-magnitude coefficients, which produce a complex, undulating surface / model.

2. This complex surface has the data points occupying its peaks and valleys.

3. The model gives near-100% accuracy in training but poor results in testing, and the testing scores also vary a lot from one sample to another.

4. The model has, in effect, absorbed the noise in the data distribution!

5. Large-magnitude coefficients give the least SSE, at times SSE = 0 – a model that fits the training set 100%!

6. Such models do not generalize.

45
Introduction to machine learning
Regularising Linear Models (Shrinkage methods)

1. In ridge regression, the algorithm, while trying to find the combination of coefficients that minimizes the SSE on the training data, is constrained by the penalty term.

2. The penalty term is akin to a cost on the magnitude of the coefficients: the higher the magnitude, the greater the cost. Thus, to minimize the cost, the coefficients are suppressed.

3. The resulting surface therefore tends to be much smoother than the unconstrained surface. This means we have settled for a model which will make some errors on the training data.

4. This is fine as long as those errors can be attributed to random fluctuations, i.e., they occur because the model does not absorb the random fluctuations in the data.

5. Such a model will perform equally well on unseen (test) data, i.e., it will generalize better than the complex model.

46
Introduction to machine learning

Regularising Linear Models (Shrinkage methods)

Impact of Ridge Regression on the coefficients of the 56 attributes


Ridge model: [[ 0. 3.73512981 -2.93500874 -2.13974194 -3.56547812 -1.28898893 3.01290805
2.04739082 0.0786974 0.21972225 -0.3302341 -1.46231096 -1.17221896 0.00856067 2.48054694
-1.67596093 0.99537516 -2.29024279 4.7699338 -2.08598898 0.34009408 0.35024058 -0.41761834
3.06970569 -2.21649433 1.86339518 -2.62934278 0.38596397 0.12088534 -0.53440382 -1.88265835
-0.7675926 -0.90146842 0.52416091 0.59678246 -0.26349448 0.5827378 -3.02842915 -0.36548074
0.5956112 -0.15941014 0.49168856 1.45652375 -0.43819158 -0.20964198 0.77665496 0.36489921
-0.4750838 0.3551047 0.23188557 -1.42941282 2.06831543 -0.34986402 -0.32320394 0.39054656 0.06283411]]

Large coefficients have been suppressed, almost close to 0 in many cases.

Ref: Ridge_Lasso_Regression.ipynb

47
Introduction to machine learning

Regularising Linear Models (Shrinkage methods)

1. Lasso regression is similar to ridge regression, with a difference in the penalty term: unlike ridge, the penalty term here is raised to power 1 (the absolute values of the coefficients). This is also known as the L1 norm.

2. The λ term continues to be the input parameter that decides how heavily the coefficients are penalized: the larger the value, the more the coefficients are diminished.

3. Unlike ridge regression, where the coefficients are driven towards zero but may never become zero, the lasso penalty will make many of the coefficients exactly 0 – in other words, it literally drops those dimensions.

48
Introduction to machine learning

Regularising Linear Models (Shrinkage methods)

Impact of Lasso Regression on the coefficients of the 56 attributes

Lasso model: [ 0. 0.52263805 -0.5402102 -1.99423315 -4.55360385 -0.85285179 2.99044036 0.00711821 -0. 0.76073274 -0. -0. -0.19736449
0. 2.04221833 -1.00014513 0. -0. 4.28412669 -0. 0. 0.31442062 -0. 2.13894094 -1.06760107 0. -0. 0. 0. -0.44991392 -1.55885506 -0. -0.68837902 0.
0.17455864 -0.34653644 0.3313704 -2.84931966 0. -0.34340563 0.00815105 0.47019445 1.25759712 -0.69634581 0. 0.55528147 0.2948979 -0.67289549
0.06490671 0. -1.19639935 1.06711702 0. -0.88034391 0. -0. ]

Large coefficients have been suppressed, to 0 in many cases, making those dimensions useless i.e. dropped
from the model.

Ref: Ridge_Lasso_Regression.ipynb
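
A minimal sketch contrasting the two penalties on a generic regression dataset (the dataset, degree and alpha values are illustrative, not those used in Ridge_Lasso_Regression.ipynb):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_diabetes(return_X_y=True)

for name, model in [("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.05, max_iter=50000))]:
    pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), model).fit(X, y)
    coefs = pipe[-1].coef_
    # Ridge shrinks coefficients towards zero; Lasso drives many of them exactly to zero
    print(name, "zero coefficients:", int((coefs == 0).sum()), "of", coefs.size)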

49
Introduction to machine learning

Regularising Linear Models (Comparing The Methods)

To compare ridge and lasso, let us first transform our error function (which is a quadratic / convex function) into a contour graph.

1. Every ring on the error function represents a combination of coefficients (m1 and m2 in the image) which results in the same quantum of error, i.e., the same SSE.

2. Let us convert that to a 2-D contour plot. In the contour plot, every ring represents one quantum of error.

3. The innermost ring (the bull's eye) is the combination of coefficients that gives the least SSE.

50
Introduction to machine learning

Regularising Linear Models (Ridge Constraint)


1. The yellow circle is the ridge constraint region, representing the ridge penalty (sum of squared coefficients).

2. Any combination of m1 and m2 that falls within the yellow region is a possible solution.

3. The most optimal of all combinations is the one which satisfies the constraint and also minimizes the SSE (the smallest possible red ring). The lowest-SSE ring itself violates the constraint, while combinations inside the constraint away from that point are sub-optimal: they meet the constraint but do not give the minimum possible SSE within it.

4. Thus the optimal m1 and m2 are found where the yellow circle touches a red ring.

The point to note is that the red rings and the yellow circle will never be tangential (touch) on the axes representing the coefficients. Hence ridge can drive coefficients close to zero but never exactly to zero. You may notice some coefficients becoming zero, but that is due to round-off.
51
Introduction to machine learning
Regularising Linear Models (Ridge Constraint)
1. As the lambda value (shown here as alpha) increases, the coefficients have to become smaller and smaller to minimize the penalty term in the cost function.

2. The larger the lambda, the smaller the sum of squared coefficients must be, and as a result the tighter the constraint region.

3. The tighter the constraint region, the larger the red ring in the contour diagram that is tangent to the boundary of the yellow region.

4. Thus, the higher the lambda, the stronger the shrinkage: the coefficients are shrunk more strongly and the surface / model becomes smoother.

5. The smoother the surface, the more likely the model is to perform equally well in production.

6. When we move away from a model with sharp peaks and valleys (a complex model) to a smoother surface (a simpler model), we reduce the variance errors but the bias errors go up.

7. Using grid search, we have to find the value of lambda which results in the right fit, neither too complex nor too simple a model.
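
A minimal grid-search sketch for the regularization strength (lambda, exposed as alpha in scikit-learn); the dataset and grid values are illustrative:

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_diabetes(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()),
                 ("poly", PolynomialFeatures(degree=2)),
                 ("ridge", Ridge())])

grid = GridSearchCV(pipe, {"ridge__alpha": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)   # the alpha giving the best cross-validated fit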

52
Introduction to machine learning

Regularising Linear Models (Lasso Constraint)


1. The yellow rectangle (diamond) is the lasso constraint region, representing the lasso penalty (sum of the absolute values of the coefficients).

2. Any combination of m1 and m2 that falls within the yellow region is a possible solution.

3. The most optimal of all solutions is the one which satisfies the constraint and also minimizes the SSE (the smallest possible red ring). The lowest-SSE ring itself violates the constraint, while combinations inside the constraint away from that point are sub-optimal.

4. Thus the optimal m1 and m2 are found where the yellow region touches a red ring.

The beauty of lasso is that the red ring may touch the constraint region on an attribute axis! In the picture above the ring touches the yellow region on the m1 axis, and at that point the m2 coefficient is 0, which means that dimension has been dropped from the analysis. Thus lasso performs dimensionality reduction, which ridge does not.
53
Introduction to machine learning

Classwork – 1
Objective - Learn to interpret the pair plots, Handle missing values and outliers. Appreciate
the need for good features for a great model
1. Load the data_banknote_authentication.txt file**
2. Analyze the data column wise
3. Handle missing values if any
4. Detect and handle outliers if any

5. Compare the distributions with the Pima diabetes distributions. Why does this dataset give such a high degree of accuracy while the Pima dataset does not?

6. Could we have done better in handling the outliers instead of replacing them with central values? (How about class-wise outlier analysis and replacing values beyond 2 standard deviations?)

** Owner of database: Volker Lohweg (University of Applied Sciences, Ostwestfalen-


Lippe, volker.lohweg '@' hs-owl.de) , Donor of database: Helene Dörksen (University of
Applied Sciences, Ostwestfalen-Lippe, helene.doerksen '@' hs-owl.de)
Date received: August, 2012

54
Introduction to machine learning

Classwork – 1 Banknote authentication (Contd…)

1. Do you think replacing outliers with central values was a good strategy? Could we have done something different?
2. What changes do you notice in the pair plot on the different dimensions after the outlier handling?
3. Do you think the score will be higher than the model on the original data?
4. Look at the scales of the pair plot before and after we converted to Z-scores; do you think there was a need to do that?
5. Both in the Pima dataset and this dataset, the classes overlap on every dimension, yet models on this data give much better results. Why?
Ref:SVC_Bank_Note_Kernels.ipynb

55
Introduction to machine learning

Classwork – 2 Concrete Strength Prediction

Objective - To familiarize with feature evaluation and selection and feature engineering

1. Build a simple linear regression to predict the strength. Note down the accuracy
2. Select the features you think are good and re-build the model. Any improvement in accuracy?
3. Generate second-degree polynomial features and build the model again. Did the accuracy improve?
4. Regularize the model using Lasso and Ridge. Did it improve?
5. What is your overall view of the features? Rank the features by importance using a decision tree regressor
6. Does the list of top-three features match the top three features from the Lasso and Ridge models?

Ref:Linear_Regression_concrete.ipynb

56
Introduction to machine learning

Advantages of Feature Selection and Feature Engineering

57
Introduction to machine learning

Advantages & disadvantages of Feature Selection & Feature Engineering

1. Positive impact on the model's performance in production, in the form of improved accuracy and generalization, because of the increase in SNR

2. Optimal use of computational resources such as CPU cycles, memory and storage

3. Sometimes models become simpler, such as in the case of feature selection through univariate analysis

4. Feature selection can introduce bias errors, while feature generation could lead to variance errors!

5. Models may also become more complex, such as with polynomial feature generation

6. Models may become uninterpretable, such as in the case of PCA transformations

58
Introduction to machine learning

Real World Case Studies

59
Introduction to machine learning

Applications of Feature Selection & Engineering in Real World

1. Customer analysis / segmentation – by understanding what motivates customers to make a purchase, brands can build their business around providing solutions to those needs. To understand these needs, one must identify and represent customer behaviour in terms of quantitative features. Ref: https://www.brandwatch.com/blog/how-to-write-customer-analysis/

2. Recommendation systems – using matrix decomposition techniques on a customer / product rating matrix, one can identify latent features that connect customers to products. Using this knowledge, one can predict the rating a person is likely to give a product, and thus decide which product to push to which customer.

3. Medical diagnosis – feature selection and classification techniques for the diagnosis and prediction of chronic diseases. Ref: https://www.sciencedirect.com/science/article/pii/S1110866517300294

60
Introduction to machine learning

Interview Questions on Feature Selection and Feature Engineering

61
Introduction to machine learning

Interview Questions on Feature Selection & Feature Engineering

1. What is the purpose of feature selection and feature engineering?

2. How do you decide which features are important in a classification or regression case?

3. What are the different categories of techniques for feature importance assessment?

4. What is feature engineering? How does generating a feature out of the given features help in modeling?

5. What is SNR?

6. What could be the problem with keeping all the given features in a model?

7. What are the advantages / disadvantages of feature selection and engineering?

8. How can feature selection and engineering impact bias and variance errors?

62
Introduction to machine learning

Online Resources & Books

63
Introduction to machine learning

Books and online resources

1. Feature Engineering for Machine Learning – O'Reilly – Alice Zheng, Amanda Casari

2. Python Data Science Handbook – O'Reilly – Jake VanderPlas

3. Python: Deeper Insights into Machine Learning – Packt – Sebastian Raschka, David Julian, John Hearty

4. https://livebook.manning.com/book/real-world-machine-learning/chapter-5/1

64
Introduction to machine learning

Practise Work

65
Introduction to machine learning
Project Flow
Project flow at a glance:
1. Exploratory data analysis
2. Feature engineering and selection
3. Compare several machine learning models on a performance metric
4. Perform hyperparameter tuning on the best model
5. Evaluate the best model on the testing set
6. Interpret the model results
7. Draw conclusions and document work

Seq no | Milestone | Purpose | Key tasks

1. Define the project – Purpose: be clear on the objective of the project. Key tasks: name the project; define the purpose and source of data.

2. Data loading and check – Purpose: understand the data: column names and meanings, potential problems. Key tasks: what do the columns represent; how and when was the data collected; expected problems.

3. Exploratory data analysis – Purpose: understand the data in terms of central values, spread, extreme values, missing values, sampling errors, interactions, mixtures of Gaussians. Key tasks: univariate analysis using descriptive statistics; bi-variate analysis; pair plot analysis.

4. Data preparation – Purpose: missing value treatment, outlier treatment, analysis of the Gaussians (if they exist), segregating data for training, validation and testing. Key tasks: decide the appropriate strategy to identify and handle both missing values and outliers.

5. Establish reliability of the data – Purpose: does it represent the current process? Has the strategy for missing data and outlier handling adversely impacted the dataset's representativeness? After the split, do the training, validation and test sets represent the process? Key tasks: statistical tests such as the t-test, normal-deviate Z-test, analysis of variance.

6. Feature engineering and selection – Purpose: identify the columns from the given dataset that have the potential to be good features, explore the possibility of creating new powerful features, scale the data. Key tasks: bi-variate / multi-variate analysis; exploring the pair plot; scaling method.

7. Select ML algorithms – Purpose: identify suitable algorithm(s), create a base model, identify the hyperparameters to evaluate and set. Key tasks: given the purpose and the data distributions (from the pair plot), identify suitable algorithms.

66
Introduction to machine learning
Project Flow

Seq no | Milestone | Purpose | Key tasks

8. Explore the model architecture – Purpose: if using deep neural networks, select the right input and output layer sizes, number of hidden layers, activation functions, loss function, optimizer function and variants. Key tasks: input / output layer size; hidden layers and their size.

9. Build and compile the model – Purpose: build the models. Key tasks: build the model in Python with minimal coding (use functions).

10. Hyperparameter tuning – Purpose: for the selected algorithms, evaluate the hyperparameters. Key tasks: use random grid search to identify optimal values of the hyperparameters.

11. Evaluate the models – Purpose: assess the reliability of the model and estimate the range of accuracy at 95% confidence for each model. Key tasks: k-fold cross-validation / bootstrap sampling on test data; use the confusion matrix and classification metrics to ensure the model will generalize; power test.

12. Ensemble the models – Purpose: select the top few models based on their 95%-confidence performance and ensemble them. Key tasks: select appropriate ensemble techniques – bagging, boosting, random forest.

13. Create the pipeline – Purpose: standardize the steps for data transformation that were employed to create the model. Key tasks: make a pipeline; check the pipeline on the test set.

14. Deploy the model – Purpose: deploy the model in the production environment. Key tasks: integrate with existing processes / re-design the process flows.

15. Project report – Purpose: document the steps and methodologies. Key tasks: document the steps taken for every milestone (the documentation should be done in the code to explain the steps, and also separately in detail at the project level).

67
Introduction to machine learning
Project Description
1. Project Name – Green Building Rating

2. Objective – Build a model to assess energy rating of a building from the given parameters

3. Purpose - Replace manual evaluation method with a reliable automated way of evaluation

4. Data Source – publicly available building energy data about New York City at https://www1.nyc.gov/html/gbee/html/plan/ll84_scores.shtml

68
Introduction to machine learning
About The Data

Local Law 84 of 2009, or the NYC Benchmarking Law, requires annual benchmarking and disclosure of energy and water usage information. Covered properties include tax lots with a single building with a gross floor area greater than 50,000 square feet (sq ft) and tax lots having more than one building with a gross floor area of more than 100,000 sq ft. Starting in 2018, the NYC Benchmarking Law will also include properties greater than 25,000 sq ft. Metrics are calculated by the Environmental Protection Agency's ENERGY STAR Portfolio Manager tool, and data is self-reported by building owners. The public availability of data allows for local and national comparison of buildings' performance, incentivizes the most accurate benchmarking of energy usage, and informs energy management decisions.

Source: https://www1.nyc.gov/html/gbee/downloads/misc/nyc_benchmarking_disclosure_data_definitions_2017.pdf

Understand what the data is about and what the columns mean; ignoring this may result in missing important information that would have made the model powerful.

69
Introduction to machine learning
Understand the data

Understand what the data is about, what the columns mean… ignoring this, may result in missing
important information that would have made a model powerful.

1. NYC law requires properties of certain size and above to report their energy usage
2. Energy Star Score is a 1 – 100 percentile ranking based on the energy consumption annually
3. Energy Star Score is a relative measure for comparing energy efficiency of the buildings

https://www1.nyc.gov/html/gbee/downloads/misc/nyc_benchmarking_disclosure_data_definitions_2017.pdf

70
Introduction to machine learning
Load Data


71
Introduction to machine learning

Thanks

72
