Variable Selection
Feature Preparation/Processing
Feature selection – selecting important variables:
➢ Basic Filters
➢ Statistical Filters
➢ Wrapper Methods
➢ Embedded Methods
Variable Selection Methods
Step | Linear Regression | Logistic Regression
1. Co-variates creation | Ratios, interactions | Ratios, interactions (e.g. LTV ratio, DTI ratio, Utilisation ratio)
2. Variable transformation | Lagged transformations, relative change | Weight of evidence transformations
3. Basic Filters | Constants, quasi-constants and duplicates | Constants, quasi-constants and duplicates
4. Statistical Filters | Correlation, Multicollinearity, Sign intuitiveness | WOE trend, Information Value, Gini, Chi-square test, Mutual information
5. Wrapper methods | Forward/Backward/Stepwise Regression, Exhaustive Model Search | Forward/Backward/Stepwise Regression, Exhaustive Model Search
6. Embedded Methods | Penalised Regression – Lasso, Ridge, Elastic Net | Penalised Regression – Lasso, Ridge, Elastic Net; Cost Sensitive Learning
Particular | Formula
1. Growth | (Xt − Xt-1) / Xt
2. Difference | Xt − Xt-1
3. MA | Average of n quarters
4. QoQ | Xt vs Xt-1
5. YoY | Xt vs Xt-4
6. Lag | Xt-1
7. Leading | Xt+1
8. Log Odds | Log( DR / (1 − DR) )
9. Vasicek | Z calculated by minimising squared errors
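A minimal pandas sketch of the transformations listed above for a quarterly series; the column names gdp and dr (default rate) and the data are illustrative, not from the source.

```python
import numpy as np
import pandas as pd

# Illustrative quarterly series; 'gdp' and 'dr' (default rate) are made-up column names
df = pd.DataFrame({
    "gdp": [100.0, 102.0, 101.5, 103.0, 104.2, 105.0, 104.1, 106.3],
    "dr":  [0.020, 0.022, 0.025, 0.024, 0.021, 0.019, 0.023, 0.022],
})

x = df["gdp"]
df["gdp_diff"]   = x - x.shift(1)                      # Difference: Xt - Xt-1
df["gdp_growth"] = (x - x.shift(1)) / x                # Growth, as defined in the table above
df["gdp_ma4"]    = x.rolling(4).mean()                 # MA: average of 4 quarters
df["gdp_qoq"]    = x / x.shift(1) - 1                  # QoQ comparison of Xt vs Xt-1 (as % change)
df["gdp_yoy"]    = x / x.shift(4) - 1                  # YoY comparison of Xt vs Xt-4 (as % change)
df["gdp_lag_1"]  = x.shift(1)                          # Lag: Xt-1
df["gdp_lead_1"] = x.shift(-1)                         # Leading: Xt+1
df["log_odds"]   = np.log(df["dr"] / (1 - df["dr"]))   # Log odds of the default rate
print(df.round(4))
```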
Basic Filter | Rationale
1. Constants | We want to explain variation in Y through variation in X. If X has no variation, it cannot explain the variance in Y.
2. Quasi-constants | Variables with very low variance are dropped for the same reason as above.
3. Duplicates | Duplicate variables add redundancy to the model and can also lead to problems like multicollinearity.
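A minimal sketch of these basic filters using scikit-learn's VarianceThreshold and pandas; the toy columns and the variance threshold of 0.01 are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Assumed toy feature matrix; in practice X would be the model development data
X = pd.DataFrame({
    "const":   [1, 1, 1, 1, 1],                 # constant -> dropped
    "quasi":   [0.0, 0.0, 0.0, 0.0, 0.1],       # quasi-constant (very low variance) -> dropped
    "ltv":     [0.6, 0.7, 0.8, 0.5, 0.9],
    "ltv_dup": [0.6, 0.7, 0.8, 0.5, 0.9],       # exact duplicate of ltv -> dropped
})

# 1-2. Constants and quasi-constants: drop features whose variance is below a small threshold
vt = VarianceThreshold(threshold=0.01)
vt.fit(X)
low_variance = X.columns[~vt.get_support()].tolist()

# 3. Duplicates: columns that are exact copies of an earlier column
duplicates = X.columns[X.T.duplicated().to_numpy()].tolist()

X_filtered = X.drop(columns=list(set(low_variance + duplicates)))
print("Dropped:", low_variance + duplicates)
print("Kept:", X_filtered.columns.tolist())
```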
Variable Clustering (for dimension reduction)
• The variable clustering procedure (PROC VARCLUS in SAS) is standard and widely used in the industry for variable selection. The VARCLUS procedure divides a set of numeric variables into either disjoint or hierarchical clusters. Associated with each cluster is a linear combination of the variables in the cluster, which may be either the first principal component or the centroid component. PROC VARCLUS displays the R² of each variable with its own cluster and with its nearest cluster. The lower the ratio (1 − R²own) / (1 − R²nearest) for a variable, the better it represents its cluster. Either the top 10 variables from each cluster or those with a ratio below some cut-off (e.g. 0.5) are selected to retain a significant set of variables. The cut-off is refined iteratively: the shortlisted variables (after further reduction techniques such as a correlation filter) are fed into stepwise regression, and the regression results, model performance and goodness-of-fit are examined. This technique provides a first line of defence against multicollinearity.
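PROC VARCLUS itself is a SAS procedure. As a rough Python stand-in (an assumption, not an equivalent implementation), variables can be hierarchically clustered on 1 − |correlation| and one representative kept per cluster, in the spirit of the 1 − R² ratio rule above.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Simulated candidate drivers (illustrative); a macro series and its lag form one cluster
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = pd.DataFrame({
    "gdp_growth": base[:, 0],
    "gdp_growth_lag_1": base[:, 0] + 0.1 * rng.normal(size=200),
    "unemployment": base[:, 1],
    "hpi_growth": base[:, 2],
})

# Cluster variables on 1 - |correlation| so highly correlated variables land together
corr = X.corr().abs()
d = (1.0 - corr).to_numpy()
Z = linkage(d[np.triu_indices_from(d, k=1)], method="average")
labels = fcluster(Z, t=0.5, criterion="distance")   # cut-off is illustrative

# Keep one representative per cluster: the variable with the highest average squared
# correlation with its own cluster's members (a rough proxy for R^2 with the cluster)
selected = []
for c in np.unique(labels):
    members = corr.columns[labels == c]
    if len(members) == 1:
        selected.append(members[0])
    else:
        r2_own = (corr.loc[members, members] ** 2).mean()
        selected.append(r2_own.idxmax())
print("Cluster representatives:", selected)
```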
Variable Clustering
Weight of Evidence & Information Value
Information Value | Predictive Power
< 0.02 | Useless for prediction
0.02 to 0.1 | Weak predictor
0.1 to 0.3 | Medium predictor
0.3 to 0.5 | Strong predictor
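A minimal sketch of how WoE and IV can be computed for one binned characteristic; the column names and data are illustrative, and WoE here uses the ln(%good / %bad) convention.

```python
import numpy as np
import pandas as pd

# Illustrative data: a binned characteristic and a default flag (1 = bad)
df = pd.DataFrame({
    "bureau_score_bin": ["low", "low", "low", "medium", "medium", "medium",
                         "high", "high", "high", "high"],
    "default":          [1, 1, 0, 1, 0, 0, 0, 0, 0, 1],
})

grp = df.groupby("bureau_score_bin")["default"]
bads = grp.sum()
goods = grp.count() - bads
dist_bad = bads / bads.sum()       # distribution of bads across bins
dist_good = goods / goods.sum()    # distribution of goods across bins

woe = np.log(dist_good / dist_bad)             # Weight of Evidence per bin
iv = ((dist_good - dist_bad) * woe).sum()      # Information Value of the characteristic
print(pd.DataFrame({"WoE": woe.round(3)}))
print("IV:", round(iv, 3))
```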
Wrapper methods
Technique | Explanation
1. Forward Selection | Start with a null model. Add one variable at a time, starting with the variable that gives the lowest AIC, and keep adding the variable that lowers AIC further. p-values, AUC, Marginal Information Value or Marginal Contributions can be used instead of AIC.
2. Backward Selection | Start with the full model. Eliminate one variable at a time, removing the variable whose removal gives the lowest AIC, and keep eliminating while AIC keeps improving.
3. Stepwise Regression | After adding a variable through forward selection, an existing variable can be eliminated if its p-value is above 5% or its VIF is above the threshold.
4. Sequential Forward | At every step, after adding a new variable, try eliminating each of the existing variables and check whether model performance improves. Model performance is checked on the testing set.
5. Sequential Backward | At every step, after eliminating a variable, try adding back each of the eliminated variables and check whether model performance improves. Model performance is checked on the testing set.
6. Recursive Feature Elimination | Rank all features by the absolute values of their beta coefficients, eliminate the lowest-ranked feature, and repeat the process on the remaining variables.
7. Exhaustive Model Search | Try all possible combinations of variables and choose the model that passes all the assumption tests and has sufficient accuracy.
Note – Other techniques, such as forced variable selection or controlling the selection sequence, can also be used. A short sketch of forward selection and recursive feature elimination follows.
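A sketch of two of the wrapper methods above with scikit-learn; the data is simulated, and the forward selection here scores on cross-validated AUC (one of the criteria listed in the table) rather than AIC.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Simulated data standing in for a scored portfolio (10 candidate drivers, 3 informative)
X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)

logit = LogisticRegression(max_iter=1000)

# Sequential forward selection: add one variable at a time, scoring on cross-validated AUC
sfs = SequentialFeatureSelector(logit, n_features_to_select=3, direction="forward",
                                scoring="roc_auc", cv=5)
sfs.fit(X, y)
print("Forward selection kept features:", sfs.get_support(indices=True))

# Recursive feature elimination: rank features by |beta| and drop the weakest each round
rfe = RFE(logit, n_features_to_select=3)
rfe.fit(X, y)
print("RFE kept features:", rfe.get_support(indices=True))
```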
Wrapper Methods
Which technique is the best?
Penalised Regression to reduce Overfitting
# Reducing the problem of Overfitting
• Preventing the algorithm from getting too complex requires a penalty (λ) for increases in complexity, together with proper data sampling achieved through validation (k-fold validation).
• When we add penalty terms to our regular regression model, it becomes a penalised regression model.
# Penalised Regression
• Penalised regression is useful for reducing a large number of features to a manageable set and for making good predictions in a variety of large data sets, especially when the features (X's) are correlated.
→ Penalised regression adds a constraint such that the regression coefficients are chosen to minimise the SSE plus a penalty term that increases with the number of included features. So, in penalised regression, a feature must make a significant contribution to the model fit to offset the penalty of including it; only the most important features for explaining Y remain in the penalised regression model.
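For reference, the penalised objectives behind the three methods named below take the standard textbook forms, with λ as the regularisation parameter (and two penalties, λ1 and λ2, for the elastic net):

```latex
\begin{align*}
\text{Lasso:}\quad       & \min_{\beta}\ \sum_{i}(y_i-\hat y_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \\
\text{Ridge:}\quad       & \min_{\beta}\ \sum_{i}(y_i-\hat y_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2} \\
\text{Elastic Net:}\quad & \min_{\beta}\ \sum_{i}(y_i-\hat y_i)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^{2}
\end{align*}
```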
# LASSO Regression
# Ridge Regression
# Elastic Net Regression (two penalty parameters, λ1 and λ2)

Q. How do we find λ (the regularisation parameter)?
We choose the level of λ for which the mean squared error on the validation set is the lowest.
→ Finding the MSE on the validation set requires K-fold cross validation.
Step 1: For the first fold, on the training data, take λ = say 0.1, run the penalised regression model and find the beta coefficients.
Step 2: Based on the estimated beta coefficients, find Ŷ for the first fold's validation set and hence the error terms; now calculate the mean squared error on the validation set.
Step 3: If K = 4, repeat the above two steps three more times and collect the MSE on validation folds 2, 3 and 4. Take the average MSE.
Step 4: Repeat all the above steps taking λ = say 0.3, and so on across candidate values.
Step 5: Plot the average validation MSE against the candidate λ values (lambdas) and choose the λ that gives the lowest average MSE on the validation sets.
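A sketch of the five steps above using scikit-learn's Lasso and KFold (note that scikit-learn calls the penalty `alpha`); the data and the λ grid are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Simulated data; in practice X, y would be the development sample
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0)

lambdas = [0.1, 0.3, 1.0, 3.0, 10.0]          # candidate values of the regularisation parameter
kf = KFold(n_splits=4, shuffle=True, random_state=0)

avg_mse = {}
for lam in lambdas:
    fold_mse = []
    for train_idx, val_idx in kf.split(X):
        # Step 1: fit the penalised regression on the training folds for this lambda
        model = Lasso(alpha=lam).fit(X[train_idx], y[train_idx])
        # Step 2: predict on the validation fold and record its mean squared error
        fold_mse.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
    # Step 3: average the MSE across the K validation folds
    avg_mse[lam] = np.mean(fold_mse)

# Steps 4-5: repeat for each candidate lambda and keep the one with the lowest average MSE
best_lambda = min(avg_mse, key=avg_mse.get)
print(avg_mse, "-> chosen lambda:", best_lambda)
```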
Regression Pipeline
Variable transformation list
_log
_lag_1, _lag_2, _lag_3, _lag_4
_lead_1, _lead_2, _lead_3, _lead_4
_qoq_diff, _qoq_diff_lag_1, _qoq_diff_lag_2, _qoq_diff_lag_3, _qoq_diff_lag_4
_yoy_diff, _yoy_diff_lag_1, _yoy_diff_lag_2, _yoy_diff_lag_3, _yoy_diff_lag_4
_qoq_log_growth, _qoq_log_growth_lag_1, _qoq_log_growth_lag_2, _qoq_log_growth_lag_3, _qoq_log_growth_lag_4
_qoq_simple_growth, _qoq_simple_growth_lag_1, _qoq_simple_growth_lag_2, _qoq_simple_growth_lag_3, _qoq_simple_growth_lag_4
_yoy_log_growth, _yoy_log_growth_lag_1, _yoy_log_growth_lag_2, _yoy_log_growth_lag_3, _yoy_log_growth_lag_4
_yoy_simple_growth, _yoy_simple_growth_lag_1, _yoy_simple_growth_lag_2, _yoy_simple_growth_lag_3, _yoy_simple_growth_lag_4
_qqma2_leading, _qqma3_leading, _qqma4_leading
_qqma2_lagging, _qqma3_lagging, _qqma4_lagging
[Chart: the "data" series plotted over roughly 60 periods, y-axis from -0.400 to 1.000]
Is Correlation sufficient to detect Multicollinearity?
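A simulated sketch of why pairwise correlation alone may not be enough: a variable can be almost a linear combination of several others while no single pairwise correlation looks alarming. The VIF mentioned in the stepwise rule above (and in the thresholds table later) picks this up; the data here is illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# x4 is almost a linear combination of x1, x2 and x3, yet each pairwise
# correlation with x4 is only around 0.58
rng = np.random.default_rng(1)
x1, x2, x3 = rng.normal(size=(3, 500))
x4 = x1 + x2 + x3 + 0.1 * rng.normal(size=500)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "x4": x4})

print(X.corr().round(2))          # pairwise correlations look moderate

# VIF regresses each variable on all the others and flags the hidden dependence
X_ = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_.values, i) for i in range(1, X_.shape[1])],
    index=X.columns,
)
print(vif.round(1))               # very large VIFs reveal the multicollinearity
```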
Exhaustive Model Search
Classification Pipeline
• Dimension Reduction – Variable Clustering
• Stability – Characteristic Stability Index (see the sketch below)
• Contemporaneous, Parsimonious and Stability – Fewer lags, fewer variables
• Explanatory Power & Contemporaneous – High IV, or Gini > 0.1
• Line of Business – Add back business-preferred variables
• Exhaustive Model Search – Run all possible models, either without regularisation (K-fold CV) or with regularisation (nested K-fold CV)
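The Characteristic Stability Index used in the pipeline (and the PSI threshold in the table at the end) follows the usual stability-index formula, sum over bins of (actual% − expected%) × ln(actual% / expected%). A minimal sketch with illustrative bin counts:

```python
import numpy as np

def stability_index(expected_counts, actual_counts):
    """PSI / CSI: sum over bins of (actual% - expected%) * ln(actual% / expected%)."""
    expected = np.asarray(expected_counts, dtype=float)
    actual = np.asarray(actual_counts, dtype=float)
    e_pct = expected / expected.sum()
    a_pct = actual / actual.sum()
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# Illustrative bin counts of one characteristic at development vs. a recent snapshot
dev = [120, 300, 350, 180, 50]
recent = [100, 280, 360, 200, 60]
csi = stability_index(dev, recent)
print(round(csi, 4), "-> stable" if csi < 0.1 else "-> investigate")  # < 0.1 threshold per the slides
```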
What qualitative factors does the business consider?
Characteristic | Example
I – Implementable | LTV_time, if there is no objective way of obtaining valuations on a timely basis
M – Manipulative | Variables based on self-reported income
P – Policy or Legal Constraints | Using religion as a risk driver; alternative data
O – Objective | Length of employment may have subjective interpretations (full time or part time)
R – Recognisable | A variable like social media activity level is not recognisably related to credit risk
T – Transparency | The calculation methodology of a variable should be clear
A – Available | Data with many missing values, such as investment portfolio value
N – Necessary | Variables with low statistical performance but high business importance
T – Tangible | The intended use of a loan, although important, is not tangible
[Figure: binned plots for bureau_score, num_ccj and max_arrears_12m]
Equal Frequency Bins
Monotonic Bins
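A sketch of the two binning approaches named above, assuming a simulated score and default flag: pd.qcut gives equal-frequency bins, and the merge loop is a simplified illustration of forcing a monotonic bad rate (not a specific vendor algorithm).

```python
import numpy as np
import pandas as pd

# Simulated score and default flag; higher score -> lower default probability
rng = np.random.default_rng(0)
score = rng.normal(600, 50, size=2000)
default = (rng.random(2000) < 1 / (1 + np.exp((score - 600) / 25))).astype(int)
df = pd.DataFrame({"score": score, "default": default})

# Equal-frequency bins: each bin holds roughly the same number of accounts
df["bin"] = pd.qcut(df["score"], q=10, duplicates="drop")
print(df.groupby("bin", observed=True)["default"].mean())

# Naive monotonic binning: whenever the bad rate breaks monotonicity,
# merge the offending pair of bins and recompute
edges = list(df["score"].quantile(np.linspace(0, 1, 11)))
while True:
    cats = pd.cut(df["score"], bins=edges, include_lowest=True)
    rates = df.groupby(cats, observed=True)["default"].mean().to_numpy()
    breaks = np.where(np.diff(rates) > 0)[0]    # bad rate should fall as the score rises
    if len(breaks) == 0 or len(edges) <= 3:
        break
    edges.pop(breaks[0] + 1)                    # drop the edge between the two offending bins
print(np.round(rates, 3))
```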
Exhaustive Model Search
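A sketch of an exhaustive search over all candidate subsets, using statsmodels Logit with AIC and a 5% p-value screen as a stand-in for the "assumption tests" mentioned in the wrapper-methods table; the data and variable names are simulated, and this brute-force approach is only practical for a small shortlist of variables.

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated shortlist of WoE-transformed drivers and a default flag
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 4)),
                 columns=["bureau_score_woe", "ltv_woe", "dti_woe", "util_woe"])
y = (rng.random(1000) <
     1 / (1 + np.exp(-(0.8 * X["bureau_score_woe"] + 0.5 * X["ltv_woe"])))).astype(int)

results = []
for k in range(1, X.shape[1] + 1):
    for combo in itertools.combinations(X.columns, k):
        model = sm.Logit(y, sm.add_constant(X[list(combo)])).fit(disp=0)
        # Keep only models where every driver is significant at the 5% level
        if (model.pvalues.drop("const") < 0.05).all():
            results.append((model.aic, combo))

results.sort()  # lowest AIC first among the models that passed the screen
print("Best model by AIC:", results[0])
```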
Important Thresholds
Metric | Threshold
Characteristic Stability Index | < 0.1
Gini | > 0.1
Information Value | > 0.02
Multicollinearity – correlation cut-off | < 0.5 to 0.7
Multicollinearity – VIF cut-off | < 2 to 3
Model AUC | > 0.7
Model Gini | > 0.4 to 0.5
Model PSI | < 0.1