Feature Selection and Feature Engineering
9. Interview questions on feature selection and engineering
   Topic: revisit the key points.
   Purpose: interpret the questions, apply the concept, and respond accurately.
10. Online Assets
   Topic: techniques of feature selection and engineering covered.
   Purpose: expand horizons beyond what is covered in class in this area.
11. Practice work
   Topic: techniques of feature selection and engineering covered.
   Purpose: learn feature selection and engineering through practice on other data sets.
1. A strategy to focus on things that matter and ignore those that don't (author name unknown).
2. It is the process of filtering out, from among all the given features in the data, those that are not good predictors.
3. Establishing data reliability is important, and it is equally important to assess whether all the data given (in the form of features / attributes) is relevant for the task at hand.
4. The power of a model is a function of the power of the algorithm and the quality of the data used:
a. Any ML algorithm with poor quality data = poor model
b. Any ML algorithm with good quality data = powerful model
c. A powerful algorithm with not-bad data = powerful model
d. Powerful algorithms are those that can handle complex distributions, employing mathematical tricks such as projecting data into higher dimensions to achieve the task, e.g. kernel SVM (KSVM) or deep neural networks (DNN).
e. Good quality data is when most of the features are strong predictors of the target.
f. Not-bad data is when not all the features are poor predictors of the target.
g. Poor quality data is when all the features are poor predictors of the target.
1. Datasets often carry irrelevant, unimportant features. These features cause a number of problems which in turn prevent efficient predictive modeling:
a. These features add more noise to the data than information.
b. They lead towards sub-optimal solutions due to unnecessary overfitting.
c. Training the model takes more time and other resources.
1. Every feature in a given data set contributes to the SNR (signal-to-noise ratio) at the model level, i.e. it contains both information (a.k.a. signal) and noise (unexplainable variance).
2. Powerful features add more signal than noise, while weak ones add less signal and more noise to a model.
3. Including powerful features while filtering out weak ones increases the SNR and thus maximizes information at the model level.
4. Indirect ways of assessing SNR include "adjusted R squared" used in linear regression, "information gain" in decision trees, or the AIC / BIC metrics.
5. Higher SNR leads to better accuracy and generalizability of the models. It also helps keep models simple.
6. Filtering out poor features reduces the stress on computational resources. Hence, feature selection plays an important role in building good models.
2. Filter based methods – remove features by using statistical methods to score each feature for its relevance, e.g. measuring the correlation of each variable independently with the target, or using p-values from statistical tests (a sketch follows this list).
3. Wrapper methods – evaluate and compare different combinations of features and assign each a score. To score, a predictive model is run on each combination of features, and the combination with the highest accuracy gets the highest score.
4. Embedded methods – some models, such as decision trees and random forests, have built-in feature evaluation and selection methods.
5. Exploratory Data Analysis (EDA) is the first step towards feature selection.
6. Visual assessment – visual analysis of the features using tools such as kernel density estimates and scatter plots.
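A minimal sketch of a filter-based approach with scikit-learn, assuming a Pima-style DataFrame with a binary target column named "class" (the file name and column names are assumptions):

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

pima = pd.read_csv("pima-indians-diabetes.csv")   # assumed file name
X, y = pima.drop(columns=["class"]), pima["class"]

# score each feature independently against the target (ANOVA F-test)
selector = SelectKBest(score_func=f_classif, k=4).fit(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)                                   # per-feature relevance scores
print(list(X.columns[selector.get_support()]))  # the 4 best-scoring features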
2. Meta information about the data – details such as what it is, how it was collected, units of measurement, frequency of measurement, possible range of values, etc. This will give an idea of what kind of challenges are likely in using the dataset.
3. Address the challenges in the data in its existing form, e.g. missing values, outliers, data shift, sampling bias.
Feature Selection : Methods of feature selection
Exploratory Data Analytics: PIMA Diabetes data
Feature Selection : Methods of feature selection
Exploratory data analytics (EDA)
1. Preg: mean > median; max - Q3 > Q1 - min; Q3 - Q2 is almost equal to Q2 - Q1; the body of the distribution is almost symmetric, with a long tail on the right side.
2. Test: mean > median; max - Q3 >> Q1 - min; Q3 - Q2 >> Q2 - Q1; the body of the distribution is also skewed.
3. Most attributes have a long tail on one side or the other. This calls for an investigation of these points to establish a root cause.
4. In their raw form, these features are not in a state where they can be used for modelling. (The quartile checks above can be reproduced numerically, as sketched below.)
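A small sketch of those numeric checks, assuming the data is loaded into a DataFrame named pima (the file name is an assumption):

import pandas as pd

pima = pd.read_csv("pima-indians-diabetes.csv")   # assumed file name
desc = pima.describe().T                          # one row per column

desc["mean - median"] = desc["mean"] - desc["50%"]
desc["right tail"] = desc["max"] - desc["75%"]    # max - Q3
desc["left tail"] = desc["25%"] - desc["min"]     # Q1 - min
print(desc[["mean - median", "left tail", "right tail"]])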
Feature Selection : Methods of feature selection
2. On all the columns, the distributions of the two classes are completely eclipsing one another.
3. There are minor differences in the distributions of 'age' and 'plas'.
4. None of the columns can be good differentiators.
Feature Selection : Methods of feature selection
Exploratory data analytics (EDA)
1. On most of the given features, the two classes (diabetic and non-diabetic) have overlapping distributions, as in the case of the blood pressure and skin test columns.
a. Such attributes fail to discriminate between the two classes and hence are of no use for classification.
2. On many attributes we see a long thin tail, and some, such as Test and Pedi, have a small bump in the tail.
a. The presence of a long tail indicates extreme outliers; a bump indicates the presence of many outlier values in the data on that column.
b. Outliers impact the accuracy of all models, some more seriously than others, e.g. Logistic Regression vs. SVC with soft margins.
Feature Selection : Methods of feature selection
Exploratory Data Analytics: Auto_mpg data
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
Feature Selection : Methods of feature selection
1. Mpg: mean = median; Q1 - median = Q3 - median; max - Q3 = 17.6 > Q1 - min. This indicates the distribution on this column is approximately gaussian, except for a long tail on the right side (confirm visually).
2. Cyl: categorical variable; median (4) < mean (5.5) indicates a long tail on the right side, maybe due to a few very large cars which have a large number of cylinders.
3. Disp: median (148.5) < mean (193.4); median - Q1 = 36 < Q3 - median = 164.5, so the body is asymmetric; max - Q3 >> Q1 - min, a long tail on the right side, suggesting the presence of extremely large cars (check visually).
Feature Selection : Methods of feature selection
FeatureSelectionMethods.ipynb
Statsmodel_lm_diff_distributions.ipynb
Feature Engineering
Feature Engineering
b. Variance errors can creep in through the generation of new features from existing features, or through deleting records with missing values or outliers.
Missing Values
pima.describe()
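In the Pima data, zeros in several medical columns behave like missing readings. A sketch of one common treatment (the file and column names are assumptions about the actual data set):

import numpy as np
import pandas as pd

pima = pd.read_csv("pima-indians-diabetes.csv")      # assumed file name
cols = ["Plas", "Pres", "skin", "test", "mass"]      # assumed column names

pima[cols] = pima[cols].replace(0, np.nan)           # make the gaps explicit
print(pima[cols].isna().sum())                       # missing count per column
pima[cols] = pima[cols].fillna(pima[cols].median())  # impute with the median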
Outlier Analysis
It is an observation that:
1. deviates so much from other observations as to arouse suspicion that it was
generated by a different mechanism
2. appears to deviate markedly from other members of the sample in which it occurs
3. appears to be inconsistent with the remainder of that set of data
FeatureSelectionMethods.ipynb
2. Multivariate methods – identify outliers based on the trends in the data, using methods such as the Mahalanobis distance (see the sketch below).
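A minimal sketch of the Mahalanobis-distance approach, assuming the Pima features are in an all-numeric DataFrame from the earlier steps (the 99.5% cutoff is an illustrative choice):

import numpy as np
from scipy.spatial.distance import mahalanobis
from scipy.stats import chi2

X = pima.drop(columns=["class"])      # assumes `pima` is already loaded
mu = X.mean().values
cov_inv = np.linalg.inv(np.cov(X.values, rowvar=False))

# distance of each observation from the multivariate centre
d = X.apply(lambda row: mahalanobis(row.values, mu, cov_inv), axis=1)

# squared Mahalanobis distance is approx. chi-square with p degrees of freedom
cutoff = np.sqrt(chi2.ppf(0.995, df=X.shape[1]))
print(len(X[d > cutoff]), "multivariate outliers flagged")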
2. Cap the outliers to the 1st / 3rd quartile, based on which side of the central value they lie (a sketch of this capping follows below).
   When to use: when there are few outliers, close to the limits.
   Pros: simple approach; easy to implement.
   Cons: may create multiple modal values if there are too many outliers; may change the central values significantly.
3. Replace outliers with predicted values.
   When to use: when there are too many outliers with varying values and the columns are strongly correlated.
   Pros: simple and logical.
   Cons: relationships between attributes may not be strong enough; relations may be statistical chance.
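A sketch of strategy 2, using the usual 1.5 * IQR fences to decide which values count as outliers (the column name is illustrative):

import pandas as pd

def cap_to_quartiles(s: pd.Series) -> pd.Series:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # pull low outliers up to Q1 and high outliers down to Q3
    return s.mask(s < lower, q1).mask(s > upper, q3)

pima["Preg"] = cap_to_quartiles(pima["Preg"])   # assumes `pima` is loaded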
Feature Engineering (Generating new features)
1. The scales are standardized (Z scored). When BMI is less than 0 and Test is less than 0, the density of non-diabetic (blue) is higher than that of diabetic (orange).
2. When BMI > 0 and Test > 0 (the top right-hand quadrant), we notice that the density of diabetic is greater than that of non-diabetic.
def genbmi(row):
    # flag whether the Z-scored BMI is above the mean
    if row["BMI"] < 0:
        return "0"
    else:
        return "1"

def gentest(row):
    # flag whether the Z-scored Test value is above the mean
    if row["test"] < 0:
        return "0"
    else:
        return "1"
Ref: PIMA_Feature Engineering
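These helpers would typically be applied row-wise to create the new binary features; a minimal usage sketch, assuming the Z-scored Pima data is in a DataFrame pima_z with columns "BMI" and "test":

pima_z["bmi_flag"] = pima_z.apply(genbmi, axis=1)    # "0" below mean, "1" above
pima_z["test_flag"] = pima_z.apply(gentest, axis=1)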
Feature Engineering (Generating polynomial features)
In the car-mpg.csv dataset we notice many features have a non-linear relationship with the target (mpg).
Feature Engineering (Generating polynomial features)
2. These new features will embody these observations and hopefully give us a better model, both in terms of accuracy and generalizability.
3. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2] (see the sketch after this list).
4. A polynomial of degree n is a function of the form f(x) = an x^n + ... + a1 x + a0, where the a's are real numbers (sometimes called the coefficients of the polynomial).
6. Generating polynomial features from existing features is equivalent to projecting the data from the given dimensions into a higher dimensional space.
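A minimal sketch of this expansion with scikit-learn, reproducing the [1, a, b, a^2, ab, b^2] example above:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])            # one sample: a = 2, b = 3
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))          # [[1. 2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out())   # ['1' 'x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']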
Feature Engineering (Generating polynomial features)
Figure: decision boundary with polynomial features. Source: https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0092248.g004
Feature Engineering (Transforming numeric data)
One of the most common transformations done on numeric data is changing the scale / measurement units of the data. The learning processes under the hood of various algorithms (gradient descent, for example) benefit from this.
1. Rescale – Z scores: when the attributes have different scales, transform them onto a single scale. For this we can use MinMaxScaler or Z scores. Z scores also help in centering the data, i.e. the mean values of all numerical attributes become 0. This is helpful for algorithms based on distance calculations and gradient descent based algorithms. (A sketch follows the reference below.)
Ref: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
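A sketch of both rescaling options, assuming an all-numeric feature matrix X (for example the car-mpg columns):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_minmax = MinMaxScaler().fit_transform(X)   # squeeze every column into [0, 1]
X_z = StandardScaler().fit_transform(X)      # Z scores: mean 0, std 1 per column
print(X_z.mean(axis=0).round(3))             # ~0 everywhere: data is now centred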
3. Normalize data – rescale all the attribute values such that each observation has a vector length of 1. Used when attributes have different scales and the data is sparse. Can be useful for algorithms using distance calculations and learning processes such as gradient descent.
4. Binarize data – a way to convert a given attribute / target variable to 0/1. For example, it can be used to modify thresholds in probability based models such as logistic regression to improve class level predictions. (Both transforms are sketched after the reference below.)
Ref: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
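A sketch of both transforms (the values are illustrative):

import numpy as np
from sklearn.preprocessing import Binarizer, Normalizer

X = np.array([[3.0, 4.0], [1.0, 0.0]])
# Normalize: each row (observation) becomes a unit vector
print(Normalizer(norm="l2").fit_transform(X))        # [[0.6 0.8] [1. 0.]]

# Binarize: threshold values into 0 / 1, e.g. to move a probability cutoff
probs = np.array([[0.2], [0.55], [0.8]])
print(Binarizer(threshold=0.6).fit_transform(probs)) # [[0.] [0.] [1.]]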
1. Integers and floats are the most common data types directly used in building models. Transforming them before modelling may yield better results!
3. PCA helps improve the signal-to-noise ratio by taking into account the covariance between features (see the sketch below).
4. Interaction and polynomial features enrich models; linear models especially benefit from using interaction and polynomial features.
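A sketch of PCA used this way, assuming a scaled (Z-scored) matrix X_z from the earlier step:

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)           # keep components explaining 95% variance
X_pca = pca.fit_transform(X_z)
print(pca.explained_variance_ratio_)   # share of variance (signal) per component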
1. SVM generates polynomial features based on the degree specified (default degree = 3).
2. The gamma factor controls the flexibility of the curve.
3. Since SVM uses the kernel trick, it does not actually generate all the synthetic features; however, it gets the benefits of the same.
4. The kernel trick saves computation time and resources, which may prove too costly when working with explicit polynomial features. (A sketch follows this list.)
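A sketch of the polynomial kernel in scikit-learn; X_train, y_train, X_test, y_test are assumed to come from an earlier train/test split:

from sklearn.svm import SVC

# behaves as if degree-3 polynomial features existed, without materializing them
svc = SVC(kernel="poly", degree=3, gamma="scale", C=1.0)
svc.fit(X_train, y_train)
print(svc.score(X_test, y_test))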
The dataset has the 9 attributes listed below:
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
Sol : Ridge_Lasso_Regression.ipynb
When we have too many parameters and are exposed to the curse of dimensionality, we resort to dimensionality reduction techniques, such as transforming the data to principal components (PCA) and eliminating the components with the smallest eigenvalue magnitudes. This can be a laborious process before we find the right number of principal components. Instead, we can employ the shrinkage methods.
Shrinkage methods attempt to shrink the coefficients of the attributes and lead us towards simpler yet effective models. The two shrinkage methods are:
1. Ridge regression is similar to linear regression, where the objective is to find the best-fit surface. The difference is in the way the best coefficients are found. Unlike linear regression, where the optimization function is the SSE, here it is slightly different:
Linear Regression cost function: SSE = Σ (yi - ŷi)²
Ridge Regression cost function: SSE + λ Σ mj², where the additional term penalizes the coefficient magnitudes mj
2. The λ term acts as a penalty used to suppress large-magnitude coefficients: when λ is set to a high number, the coefficients are suppressed significantly; when λ is set to 0, the cost function becomes the same as the linear regression cost function. (A sketch follows below.)
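A sketch of the effect of λ (called alpha in scikit-learn), assuming a train split of the scaled car-mpg data from the earlier steps:

from sklearn.linear_model import Ridge

for alpha in [0.01, 1.0, 100.0]:        # small alpha behaves like plain OLS
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print(alpha, ridge.coef_.round(2))  # coefficients shrink as alpha grows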
Large coefficients indicate a case where, for a unit change in the input variable, the magnitude of change in the target column is very large.
Coefficients of the simple linear regression model (10 dimensions):
1. The coefficient for cyl is 2.5059518049385052
2. The coefficient for disp is 2.5357082860560483
3. The coefficient for hp is -1.7889335736325294
4. The coefficient for wt is -5.551819873098725
5. The coefficient for acc is 0.11485734803440854
6. The coefficient for yr is 2.931846548211609
7. The coefficient for car_type is 2.977869737601944
8. The coefficient for origin_america is -0.5832955290166003
9. The coefficient for origin_asia is 0.3474931380432235
10. The coefficient for origin_europe is 0.3774164680868855
With polynomial features, the dimensions shoot up from 10 to 57, and many of the fitted coefficients explode to magnitudes on the order of 1e+11 to 1e+12 (for example -1.06672046e+12, 3.62350196e+12, -6.27534905e+12): a telltale sign of unstable, overfit coefficients.
Regularising Linear Models (Shrinkage methods)
Ref: Ridge_Lasso_Regression.ipynb
1. Lasso Regression is similar to Ridge regression, with a difference in the penalty term. Unlike Ridge, the penalty term here is the sum of the absolute values of the coefficients raised to power 1, λ Σ |mj|, also known as the L1 norm.
2. The λ term continues to be the input parameter that decides how heavy the penalties on the coefficients will be: the larger its value, the more diminished the coefficients will be.
3. Unlike Ridge regression, where the coefficients are driven towards zero but may never become zero, the Lasso penalty process will make many of the coefficients exactly 0, literally dropping those dimensions (see the sketch below).
Lasso model: [ 0. 0.52263805 -0.5402102 -1.99423315 -4.55360385 -0.85285179 2.99044036 0.00711821 -0. 0.76073274 -0. -0. -0.19736449
0. 2.04221833 -1.00014513 0. -0. 4.28412669 -0. 0. 0.31442062 -0. 2.13894094 -1.06760107 0. -0. 0. 0. -0.44991392 -1.55885506 -0. -0.68837902 0.
0.17455864 -0.34653644 0.3313704 -2.84931966 0. -0.34340563 0.00815105 0.47019445 1.25759712 -0.69634581 0. 0.55528147 0.2948979 -0.67289549
0.06490671 0. -1.19639935 1.06711702 0. -0.88034391 0. -0. ]
Large coefficients have been suppressed, to 0 in many cases, making those dimensions useless, i.e. dropped from the model.
Ref: Ridge_Lasso_Regression.ipynb
To compare Ridge and Lasso, let us first transform our error function (which is a quadratic / convex function) into a contour graph over the coefficients m1 and m2.
2. Any combination of m1 and m2 that falls within the yellow circle is a possible solution.
The point to note is that the red rings and the yellow circle will never be tangential (touch) on the axes representing the coefficients. Hence Ridge can make coefficients close to zero, but never exactly zero. You may notice some coefficients becoming zero, but that will be due to round-off.
Regularising Linear Models (Ridge Constraint)
1. As the lambda value (shown here as alpha) increases, the coefficients have to become smaller and smaller to minimize the penalty term in the cost function, i.e. they are squeezed towards zero.
The beauty of Lasso is that the red contour may touch the constraint region on an attribute axis! In the picture above, the contour is touching the yellow constraint region on the m1 axis, and at that point the m2 coefficient is 0! This means that dimension has been dropped from the analysis. Thus Lasso performs dimensionality reduction, which Ridge does not.
Classwork – 1
Objective – Learn to interpret the pair plots, handle missing values and outliers, and appreciate the need for good features for a great model.
1. Load the data_banknote_authentication.txt file
2. Analyze the data column-wise
3. Handle missing values, if any
4. Detect and handle outliers, if any
5. Compare the distributions with the Pima Diabetes distributions. Why does this data set give such a high degree of accuracy while the Pima data set does not?
6. Could we have done better in handling the outliers instead of replacing them with central values? (How about class-wise outlier analysis and replacing with 2 standard deviations?)
1. Do you think replacing outliers with central values was a good strategy? Could we have done something different?
2. What changes do you notice in the pair plot on the different dimensions after the outlier handling?
3. Do you think the score will be higher than the model on the original data?
4. Look at the scales of the pair plot before and after we converted to Z scores. Do you think there was a need to do that?
5. Both in the Pima dataset and in this dataset, the classes overlap on every dimension, but models on this data give much better results. Why?
Ref: SVC_Bank_Note_Kernels.ipynb
Objective – To familiarize yourself with feature evaluation, feature selection and feature engineering.
1. Build a simple linear regression to predict the strength. Note down the accuracy.
2. Select the features you think are good and re-build the model. Any improvement in accuracy?
3. Generate second degree polynomial features and build the model again. Did the accuracy improve?
4. Regularize the model using Lasso and Ridge. Did it improve?
5. What is your overall view of the features? Rank the features on importance based on a decision tree regressor.
6. Does the list of the top three ranked features match the top three features from the Lasso and Ridge models?
Ref: Linear_Regression_concrete.ipynb
4. Feature selection can introduce bias errors, while feature generation could lead to variance errors!
5. What is SNR?
6. What could be the problem of keeping all the given features in a model?
8. How can feature selection and engineering impact bias and variance errors?
4. https://livebook.manning.com/book/real-world-machine-learning/chapter-5/1
Practice Work
Project Flow
Overall approach:
1. Exploratory data analysis
2. Feature engineering and selection
3. Compare several machine learning models on a performance metric
4. Perform hyperparameter tuning on the best model
5. Evaluate the best model on the testing set
6. Interpret the model results
7. Draw conclusions and document work

Milestones (Seq no, Milestone, Purpose, Key Tasks):
1. Define the project
   Purpose: be clear on the objective of the project.
   Key tasks: name the project; define the purpose and source of data.
2. Data loading and check
   Purpose: understand the data; column names and meanings, potential problems.
   Key tasks: what do the columns represent; how the data was collected and when; expected problems.
3. Exploratory data analysis
   Purpose: understand the data in terms of central values, spread, extreme values, missing values, sampling errors, interactions, mix of gaussians.
   Key tasks: univariate analysis using descriptive statistics; bi-variate analysis; pair plot analysis.
4. Data Preparation
   Purpose: missing value treatment, outlier value treatment, analysis of the gaussians (if they exist), segregating data for training, validation and testing.
   Key tasks: decide the appropriate strategy to identify and handle both missing values and outliers.
5. Establish reliability of the data
   Purpose: does it represent the current process? Has the strategy for missing data and outlier handling adversely impacted the data set's representation? After the split, do the training, validation and test sets represent the process?
   Key tasks: statistical tests such as the T-test, normal deviate Z-test, analysis of variance.
6. Feature engineering and selection
   Purpose: identify the columns from the given dataset that have the potential to be good features; explore the possibility of creating new powerful features; scale the data.
   Key tasks: bi-variate / multi-variate analysis; exploring the pair plot; scaling method.
7. Select ML algorithms
   Purpose: identify suitable algorithm/s and create a base model; identify the hyper parameters to evaluate and set.
   Key tasks: given the purpose and the data distributions seen in the pair plot, identify suitable algorithms.
Project Flow
9. Build and compile the model
   Purpose: build the models.
   Key tasks: build the model in Python with minimal coding (use functions).
10. Hyper parameter tuning
   Purpose: for the selected algorithms, evaluate the hyper parameters.
   Key tasks: use random grid search to identify optimal values of the hyper parameters.
11. Evaluate the models
   Purpose: assess the reliability of the model and estimate the range of accuracy at 95% confidence for each model.
   Key tasks: K-fold cross validation / bootstrap sampling on test data; use confusion matrix and classification metrics to ensure the model will generalize; power test.
12. Ensemble the models
   Purpose: select the top few models based on their 95% confidence performance and ensemble them.
   Key tasks: select appropriate ensemble techniques, e.g. bagging, boosting, random forest.
13. Create the pipeline
   Purpose: standardize the steps for data transformation that were employed to create the model.
   Key tasks: make the pipeline; check the pipeline on the test set.
14. Deploy the model
   Purpose: deploy the model in the production environment.
   Key tasks: integrate with existing processes / re-design the process flows.
15. Project Report
   Purpose: document the steps and methodologies.
   Key tasks: document the steps taken for every milestone (the documentation should be done in the code to explain the steps, and also separately in detail at the project level).
Project Description
1. Project Name – Green Building Rating
2. Objective – Build a model to assess energy rating of a building from the given parameters
3. Purpose – Replace the manual evaluation method with a reliable, automated way of evaluation
4. Data Source – Publicly available building energy data about New York City at https://www1.nyc.gov/html/gbee/html/plan/ll84_scores.shtml
About The Data
Local Law 84 of 2009, or the NYC Benchmarking Law, requires annual benchmarking and disclosure of energy and water usage information. Covered properties include tax lots with a single building with a gross floor area greater than 50,000 square feet (sq ft) and tax lots having more than one building with a gross floor area of more than 100,000 sq ft. Starting in 2018, the NYC Benchmarking Law will also include properties greater than 25,000 sq ft. Metrics are calculated by the Environmental Protection Agency's tool ENERGY STAR Portfolio Manager, and the data is self-reported by building owners. The public availability of the data allows for local and national comparison of buildings' performance, incentivizes the most accurate benchmarking of energy usage, and informs energy management decisions.
Source: https://www1.nyc.gov/html/gbee/downloads/misc/nyc_benchmarking_disclosure_data_definitions_2017.pdf
Understand what the data is about and what the columns mean; ignoring this may result in missing important information that would have made the model powerful.
Understand the data
1. NYC law requires properties of a certain size and above to report their energy usage
2. The Energy Star Score is a 1 to 100 percentile ranking based on annual energy consumption
3. The Energy Star Score is a relative measure for comparing the energy efficiency of buildings
Source: https://www1.nyc.gov/html/gbee/downloads/misc/nyc_benchmarking_disclosure_data_definitions_2017.pdf
Load Data
Thanks