Variable Selection
Feature Preparation/Processing
Feature selection – selecting important variables:
➢ Basic Filters
➢ Statistical Filters
➢ Wrapper Methods
➢ Embedded Methods
Variable Selection Methods
Step | Linear Regression | Logistic Regression
1. Co-variates creation | Ratios, interactions | Ratios, interactions (e.g. LTV ratio, DTI ratio, Utilisation ratio)
2. Variable transformation | Lagged transformations, relative change | Weight of evidence transformations
3. Basic Filters | Constants, quasi-constants and duplicates | Constants, quasi-constants and duplicates
4. Statistical Filters | Correlation, Multicollinearity, Sign intuitiveness | WOE trend, Information Value, Gini, Chi-square test, Mutual information
5. Wrapper methods | Forward/Backward/Stepwise Regression, Exhaustive Model Search | Forward/Backward/Stepwise Regression, Exhaustive Model Search
6. Embedded Methods | Penalised Regression – Lasso, Ridge, Elastic Net | Penalised Regression – Lasso, Ridge, Elastic Net; Cost Sensitive Learning
Particular | Formula
1. Growth | (Xt − Xt-1) / Xt
2. Difference | Xt − Xt-1
3. MA | Average of n quarters
4. QoQ | Xt vs Xt-1
5. YoY | Xt vs Xt-4
6. Lag | Xt-1
7. Leading | Xt+1
8. Log Odds | Log( DR / (1 − DR) )
9. Vasicek | Z calculated by minimising squared errors
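A minimal pandas sketch of the transformations listed above for a quarterly series; the column names gdp and dr (default rate) and the data are illustrative, not from the source.

```python
import numpy as np
import pandas as pd

# Illustrative quarterly series; 'gdp' and 'dr' (default rate) are made-up column names
df = pd.DataFrame({
    "gdp": [100.0, 102.0, 101.5, 103.0, 104.2, 105.0, 104.1, 106.3],
    "dr":  [0.020, 0.022, 0.025, 0.024, 0.021, 0.019, 0.023, 0.022],
})

x = df["gdp"]
df["gdp_diff"]   = x - x.shift(1)                      # Difference: Xt - Xt-1
df["gdp_growth"] = (x - x.shift(1)) / x                # Growth, as defined in the table above
df["gdp_ma4"]    = x.rolling(4).mean()                 # MA: average of 4 quarters
df["gdp_qoq"]    = x / x.shift(1) - 1                  # QoQ comparison of Xt vs Xt-1 (as % change)
df["gdp_yoy"]    = x / x.shift(4) - 1                  # YoY comparison of Xt vs Xt-4 (as % change)
df["gdp_lag_1"]  = x.shift(1)                          # Lag: Xt-1
df["gdp_lead_1"] = x.shift(-1)                         # Leading: Xt+1
df["log_odds"]   = np.log(df["dr"] / (1 - df["dr"]))   # Log odds of the default rate
print(df.round(4))
```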
Basic Filter | Rationale
1. Constants | We want to explain variation in Y through variation in X. If X has no variation, it cannot explain the variance in Y.
2. Quasi-constants | Variables with very low variance are dropped for the same reason as above.
3. Duplicates | Duplicate variables add redundancy to the model and can also lead to problems like multicollinearity.
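A minimal sketch of these basic filters using scikit-learn's VarianceThreshold and pandas; the toy columns and the variance threshold of 0.01 are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Assumed toy feature matrix; in practice X would be the model development data
X = pd.DataFrame({
    "const":   [1, 1, 1, 1, 1],                 # constant -> dropped
    "quasi":   [0.0, 0.0, 0.0, 0.0, 0.1],       # quasi-constant (very low variance) -> dropped
    "ltv":     [0.6, 0.7, 0.8, 0.5, 0.9],
    "ltv_dup": [0.6, 0.7, 0.8, 0.5, 0.9],       # exact duplicate of ltv -> dropped
})

# 1-2. Constants and quasi-constants: drop features whose variance is below a small threshold
vt = VarianceThreshold(threshold=0.01)
vt.fit(X)
low_variance = X.columns[~vt.get_support()].tolist()

# 3. Duplicates: columns that are exact copies of an earlier column
duplicates = X.columns[X.T.duplicated().to_numpy()].tolist()

X_filtered = X.drop(columns=list(set(low_variance + duplicates)))
print("Dropped:", low_variance + duplicates)
print("Kept:", X_filtered.columns.tolist())
```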
Variable Clustering (for dimension reduction)
• The variable clustering procedure (PROC VARCLUS in SAS) is standard and widely used in the industry for variable selection. The VARCLUS procedure divides a set of numeric variables into either disjoint or hierarchical clusters. Associated with each cluster is a linear combination of the variables in the cluster, which may be either the first principal component or the centroid component. PROC VARCLUS displays the R² of each variable with its own cluster and with its nearest cluster. The lower the ratio (1 − R²own) / (1 − R²nearest) for a variable, the better it represents its cluster. Either the top 10 variables from each cluster or those with a ratio below some cut-off (e.g. 0.5) are selected to retain a significant set of variables. The cut-off is refined iteratively: the shortlisted variables (after further reduction techniques such as a correlation filter) are fed into stepwise regression, and the regression results, model performance and goodness-of-fit are examined. This technique provides a first line of defence against multicollinearity.
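PROC VARCLUS itself is a SAS procedure. As a rough Python stand-in (an assumption, not an equivalent implementation), variables can be hierarchically clustered on 1 − |correlation| and one representative kept per cluster, in the spirit of the 1 − R² ratio rule above.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Simulated candidate drivers (illustrative); a macro series and its lag form one cluster
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = pd.DataFrame({
    "gdp_growth": base[:, 0],
    "gdp_growth_lag_1": base[:, 0] + 0.1 * rng.normal(size=200),
    "unemployment": base[:, 1],
    "hpi_growth": base[:, 2],
})

# Cluster variables on 1 - |correlation| so highly correlated variables land together
corr = X.corr().abs()
d = (1.0 - corr).to_numpy()
Z = linkage(d[np.triu_indices_from(d, k=1)], method="average")
labels = fcluster(Z, t=0.5, criterion="distance")   # cut-off is illustrative

# Keep one representative per cluster: the variable with the highest average squared
# correlation with its own cluster's members (a rough proxy for R^2 with the cluster)
selected = []
for c in np.unique(labels):
    members = corr.columns[labels == c]
    if len(members) == 1:
        selected.append(members[0])
    else:
        r2_own = (corr.loc[members, members] ** 2).mean()
        selected.append(r2_own.idxmax())
print("Cluster representatives:", selected)
```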
Variable Clustering
Weight of Evidence & Information Value
Information Value | Predictive Power
< 0.02 | Useless for prediction
0.02 to 0.1 | Weak predictor
0.1 to 0.3 | Medium predictor
0.3 to 0.5 | Strong predictor
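A minimal sketch of how WoE and IV can be computed for one binned characteristic; the column names and data are illustrative, and WoE here uses the ln(%good / %bad) convention.

```python
import numpy as np
import pandas as pd

# Illustrative data: a binned characteristic and a default flag (1 = bad)
df = pd.DataFrame({
    "bureau_score_bin": ["low", "low", "low", "medium", "medium", "medium",
                         "high", "high", "high", "high"],
    "default":          [1, 1, 0, 1, 0, 0, 0, 0, 0, 1],
})

grp = df.groupby("bureau_score_bin")["default"]
bads = grp.sum()
goods = grp.count() - bads
dist_bad = bads / bads.sum()       # distribution of bads across bins
dist_good = goods / goods.sum()    # distribution of goods across bins

woe = np.log(dist_good / dist_bad)             # Weight of Evidence per bin
iv = ((dist_good - dist_bad) * woe).sum()      # Information Value of the characteristic
print(pd.DataFrame({"WoE": woe.round(3)}))
print("IV:", round(iv, 3))
```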
Wrapper methods
Technique | Explanation
1. Forward Selection | Start with a null model. Add one variable at a time, starting with the variable that gives the lowest AIC, and keep adding the variable that lowers AIC further. p-values, AUC, Marginal Information Value or Marginal Contributions can be used instead of AIC.
2. Backward Selection | Start with the full model. Eliminate one variable at a time, removing the variable whose removal gives the lowest AIC, and keep eliminating while AIC keeps improving.
3. Stepwise Regression | After adding a variable through forward selection, an existing variable can be eliminated if its p-value is above 5% or its VIF is above the threshold.
4. Sequential Forward | At every step, after adding a new variable, try eliminating each of the existing variables and check whether model performance improves. Model performance is checked on the testing set.
5. Sequential Backward | At every step, after eliminating a variable, try adding back each of the eliminated variables and check whether model performance improves. Model performance is checked on the testing set.
6. Recursive Feature Elimination | Rank all features by the absolute values of their beta coefficients, eliminate the lowest-ranked feature, and repeat the process on the remaining variables.
7. Exhaustive Model Search | Try all possible combinations of variables and choose the model that passes all the assumption tests and has sufficient accuracy.
Note – Other techniques, such as forced variable selection or controlling the selection sequence, can also be used. A short sketch of forward selection and recursive feature elimination follows.
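A sketch of two of the wrapper methods above with scikit-learn; the data is simulated, and the forward selection here scores on cross-validated AUC (one of the criteria listed in the table) rather than AIC.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Simulated data standing in for a scored portfolio (10 candidate drivers, 3 informative)
X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)

logit = LogisticRegression(max_iter=1000)

# Sequential forward selection: add one variable at a time, scoring on cross-validated AUC
sfs = SequentialFeatureSelector(logit, n_features_to_select=3, direction="forward",
                                scoring="roc_auc", cv=5)
sfs.fit(X, y)
print("Forward selection kept features:", sfs.get_support(indices=True))

# Recursive feature elimination: rank features by |beta| and drop the weakest each round
rfe = RFE(logit, n_features_to_select=3)
rfe.fit(X, y)
print("RFE kept features:", rfe.get_support(indices=True))
```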
Wrapper Methods
Which technique is the best?
Penalised Regression to reduce Overfitting
# Reducing the problem of Overfitting
• Preventing the algorithm from getting too complex requires a penalty (λ) for increases in complexity, together with proper data sampling achieved through validation (k-fold validation).
• When we add penalty terms to our regular regression model, it becomes a penalised regression model.
# Penalised Regression
• Penalised regression is useful for reducing a large number of features to a manageable set and for making good predictions in a variety of large data sets, especially when the features (X's) are correlated.
→ Penalised regression adds a constraint such that the regression coefficients are chosen to minimise the SSE plus a penalty term that increases with the number of included features. So, in penalised regression, a feature must make a significant contribution to the model fit to offset the penalty of including it; only the most important features for explaining Y remain in the penalised regression model.
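For reference, the penalised objectives behind the three methods named below take the standard textbook forms, with λ as the regularisation parameter (and two penalties, λ1 and λ2, for the elastic net):

```latex
\begin{align*}
\text{Lasso:}\quad       & \min_{\beta}\ \sum_{i}(y_i-\hat y_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \\
\text{Ridge:}\quad       & \min_{\beta}\ \sum_{i}(y_i-\hat y_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2} \\
\text{Elastic Net:}\quad & \min_{\beta}\ \sum_{i}(y_i-\hat y_i)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^{2}
\end{align*}
```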
# LASSO Regression
# Ridge Regression
# Elastic Net Regression (two penalty parameters, λ1 and λ2)

Q. How do we find λ (the regularisation parameter)?
We choose the level of λ for which the mean squared error on the validation set is the lowest.
→ Finding the MSE on the validation set requires K-fold cross validation.
Step 1: For the first fold, on the training data, take λ = say 0.1, run the penalised regression model and find the beta coefficients.
Step 2: Based on the estimated beta coefficients, find Ŷ for the first fold's validation set and hence the error terms; now calculate the mean squared error on the validation set.
Step 3: If K = 4, repeat the above two steps three more times and collect the MSE on validation folds 2, 3 and 4. Take the average MSE.
Step 4: Repeat all the above steps taking λ = say 0.3, and so on across candidate values.
Step 5: Plot the average validation MSE against the candidate λ values (lambdas) and choose the λ that gives the lowest average MSE on the validation sets.
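A sketch of the five steps above using scikit-learn's Lasso and KFold (note that scikit-learn calls the penalty `alpha`); the data and the λ grid are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Simulated data; in practice X, y would be the development sample
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0)

lambdas = [0.1, 0.3, 1.0, 3.0, 10.0]          # candidate values of the regularisation parameter
kf = KFold(n_splits=4, shuffle=True, random_state=0)

avg_mse = {}
for lam in lambdas:
    fold_mse = []
    for train_idx, val_idx in kf.split(X):
        # Step 1: fit the penalised regression on the training folds for this lambda
        model = Lasso(alpha=lam).fit(X[train_idx], y[train_idx])
        # Step 2: predict on the validation fold and record its mean squared error
        fold_mse.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
    # Step 3: average the MSE across the K validation folds
    avg_mse[lam] = np.mean(fold_mse)

# Steps 4-5: repeat for each candidate lambda and keep the one with the lowest average MSE
best_lambda = min(avg_mse, key=avg_mse.get)
print(avg_mse, "-> chosen lambda:", best_lambda)
```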
Regression Pipeline
Variable transformation list
_log
_lag_1, _lag_2, _lag_3, _lag_4
_lead_1, _lead_2, _lead_3, _lead_4
_qoq_diff, _qoq_diff_lag_1, _qoq_diff_lag_2, _qoq_diff_lag_3, _qoq_diff_lag_4
_yoy_diff, _yoy_diff_lag_1, _yoy_diff_lag_2, _yoy_diff_lag_3, _yoy_diff_lag_4
_qoq_log_growth, _qoq_log_growth_lag_1, _qoq_log_growth_lag_2, _qoq_log_growth_lag_3, _qoq_log_growth_lag_4
_qoq_simple_growth, _qoq_simple_growth_lag_1, _qoq_simple_growth_lag_2, _qoq_simple_growth_lag_3, _qoq_simple_growth_lag_4
_yoy_log_growth, _yoy_log_growth_lag_1, _yoy_log_growth_lag_2, _yoy_log_growth_lag_3, _yoy_log_growth_lag_4
_yoy_simple_growth, _yoy_simple_growth_lag_1, _yoy_simple_growth_lag_2, _yoy_simple_growth_lag_3, _yoy_simple_growth_lag_4
_qqma2_leading, _qqma3_leading, _qqma4_leading
_qqma2_lagging, _qqma3_lagging, _qqma4_lagging
[Chart: the "data" series plotted over roughly 60 periods, y-axis from -0.400 to 1.000]
Is Correlation sufficient to detect Multicollinearity?
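A simulated sketch of why pairwise correlation alone may not be enough: a variable can be almost a linear combination of several others while no single pairwise correlation looks alarming. The VIF mentioned in the stepwise rule above (and in the thresholds table later) picks this up; the data here is illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# x4 is almost a linear combination of x1, x2 and x3, yet each pairwise
# correlation with x4 is only around 0.58
rng = np.random.default_rng(1)
x1, x2, x3 = rng.normal(size=(3, 500))
x4 = x1 + x2 + x3 + 0.1 * rng.normal(size=500)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "x4": x4})

print(X.corr().round(2))          # pairwise correlations look moderate

# VIF regresses each variable on all the others and flags the hidden dependence
X_ = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_.values, i) for i in range(1, X_.shape[1])],
    index=X.columns,
)
print(vif.round(1))               # very large VIFs reveal the multicollinearity
```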
Exhaustive Model Search
Classification Pipeline
• Dimension Reduction – Variable Clustering
• Stability – Characteristic Stability Index (see the sketch below)
• Contemporaneous, Parsimonious and Stability – Fewer lags, fewer variables
• Explanatory Power & Contemporaneous – High IV, or Gini > 0.1
• Line of Business – Add back business-preferred variables
• Exhaustive Model Search – Run all possible models, either without regularisation (K-fold CV) or with regularisation (nested K-fold CV)
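The Characteristic Stability Index used in the pipeline (and the PSI threshold in the table at the end) follows the usual stability-index formula, sum over bins of (actual% − expected%) × ln(actual% / expected%). A minimal sketch with illustrative bin counts:

```python
import numpy as np

def stability_index(expected_counts, actual_counts):
    """PSI / CSI: sum over bins of (actual% - expected%) * ln(actual% / expected%)."""
    expected = np.asarray(expected_counts, dtype=float)
    actual = np.asarray(actual_counts, dtype=float)
    e_pct = expected / expected.sum()
    a_pct = actual / actual.sum()
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Illustrative bin counts of one characteristic at development vs. a recent snapshot
dev = [120, 300, 350, 180, 50]
recent = [100, 280, 360, 200, 60]
csi = stability_index(dev, recent)
print(round(csi, 4), "-> stable" if csi < 0.1 else "-> investigate")  # < 0.1 threshold per the slides
```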
What qualitative factors does the business consider?
Characteristic | Example
I – Implementable | LTV_time, if there is no objective way of obtaining valuations on a timely basis
M – Manipulative | Variables based on self-reported income
P – Policy or Legal Constraints | Using religion as a risk driver; alternative data
O – Objective | Length of employment may have subjective interpretations (full time or part time)
R – Recognisable | A variable like social media activity level is not recognisably related to credit risk
T – Transparency | The calculation methodology of a variable should be clear
A – Available | Data with many missing values, such as investment portfolio value
N – Necessary | Variables with low statistical performance but high business importance
T – Tangible | The intended use of a loan, although important, is not tangible
[Figure: binned plots for bureau_score, num_ccj and max_arrears_12m]
Equal Frequency Bins
Monotonic Bins
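A sketch of the two binning approaches named above, assuming a simulated score and default flag: pd.qcut gives equal-frequency bins, and the merge loop is a simplified illustration of forcing a monotonic bad rate (not a specific vendor algorithm).

```python
import numpy as np
import pandas as pd

# Simulated score and default flag; higher score -> lower default probability
rng = np.random.default_rng(0)
score = rng.normal(600, 50, size=2000)
default = (rng.random(2000) < 1 / (1 + np.exp((score - 600) / 25))).astype(int)
df = pd.DataFrame({"score": score, "default": default})

# Equal-frequency bins: each bin holds roughly the same number of accounts
df["bin"] = pd.qcut(df["score"], q=10, duplicates="drop")
print(df.groupby("bin", observed=True)["default"].mean())

# Naive monotonic binning: whenever the bad rate breaks monotonicity,
# merge the offending pair of bins and recompute
edges = list(df["score"].quantile(np.linspace(0, 1, 11)))
while True:
    cats = pd.cut(df["score"], bins=edges, include_lowest=True)
    rates = df.groupby(cats, observed=True)["default"].mean().to_numpy()
    breaks = np.where(np.diff(rates) > 0)[0]    # bad rate should fall as the score rises
    if len(breaks) == 0 or len(edges) <= 3:
        break
    edges.pop(breaks[0] + 1)                    # drop the edge between the two offending bins
print(np.round(rates, 3))
```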
Exhaustive Model Search
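A sketch of an exhaustive search over all candidate subsets, using statsmodels Logit with AIC and a 5% p-value screen as a stand-in for the "assumption tests" mentioned in the wrapper-methods table; the data and variable names are simulated, and this brute-force approach is only practical for a small shortlist of variables.

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated shortlist of WoE-transformed drivers and a default flag
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 4)),
                 columns=["bureau_score_woe", "ltv_woe", "dti_woe", "util_woe"])
y = (rng.random(1000) <
     1 / (1 + np.exp(-(0.8 * X["bureau_score_woe"] + 0.5 * X["ltv_woe"])))).astype(int)

results = []
for k in range(1, X.shape[1] + 1):
    for combo in itertools.combinations(X.columns, k):
        model = sm.Logit(y, sm.add_constant(X[list(combo)])).fit(disp=0)
        # Keep only models where every driver is significant at the 5% level
        if (model.pvalues.drop("const") < 0.05).all():
            results.append((model.aic, combo))

results.sort()  # lowest AIC first among the models that passed the screen
print("Best model by AIC:", results[0])
```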
Important Thresholds
Metric | Threshold
Characteristic Stability Index | < 0.1
Gini | > 0.1
Information Value | > 0.02
Multicollinearity – correlation cut-off | < 0.5 to 0.7
Multicollinearity – VIF cut-off | < 2 to 3
Model AUC | > 0.7
Model Gini | > 0.4 to 0.5
Model PSI | < 0.1