MI - Unit 5
Prabhu K, AP(SS)/CSE
• The process of selecting the proper level of flexibility for a model is known as
model selection
Resampling
• Two common methods of resampling are
• Cross-validation
  • The validation set approach
  • K-fold cross-validation
  • Leave-one-out cross-validation (LOOCV)
• Bootstrapping
Cross-Validation
The Validation Set Approach (Holdout Cross-Validation)
• For example, we randomly split 392 observations into two sets: a training set
containing 196 of the observations, and a validation set containing the
remaining 196 observations.
• Advantage
• We can simply choose the method with the best validation score.
• Disadvantages
• The validation estimate of the test error rate can be highly variable, depending
on precisely which observations are included in the training set and which are
included in the validation set.
• Only a subset of the observations (those in the training set) is used to fit the
model, so the validation set error rate may tend to overestimate the test error
rate for the model fit on the entire data set.
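As a rough illustration of the holdout idea, here is a minimal sketch using scikit-learn on synthetic data; the data set, the model, and the 50/50 split are illustrative assumptions rather than part of the slides.

```python
# Validation set (holdout) approach: split once, fit on one half, score on the other.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(392, 1))                               # synthetic predictor
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=392)         # synthetic linear response

# Randomly split the 392 observations into a training half and a validation half.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=1)

model = LinearRegression().fit(X_train, y_train)
val_mse = mean_squared_error(y_val, model.predict(X_val))
print(f"Validation-set estimate of test MSE: {val_mse:.3f}")
```

Rerunning with a different random_state changes the estimate, which is exactly the variability noted in the disadvantages above.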
Leave-One-Out Cross-Validation (LOOCV)
• A single observation (x1, y1) is used for the validation set, the remaining n − 1
observations make up the training set, and MSE1 = (y1 − ŷ1)² provides an
estimate of the test error.
• Though MSE1 is unbiased for the test error, it is a poor estimate because it is
highly variable, since it is based upon a single observation (x1, y1).
• We repeat the procedure by selecting (x2, y2) for the validation data, training the
statistical learning procedure on the remaining n − 1 observations, and computing
MSE2; this continues until every observation has been held out once.
• The LOOCV estimate for the test MSE is the average of these n test error
estimates:
$$\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\mathrm{MSE}_i$$
Leave-One-Out Cross-Validation (LOOCV)
• Advantages
• It has far less bias than the validation set approach, since each training set
contains n − 1 observations.
• The LOOCV approach tends not to overestimate the test error rate.
• Because there is no randomness in the training/validation splits, performing
LOOCV multiple times will always yield the same results.
• Disadvantage
• LOOCV can be expensive to implement, since the model has to be fit n times.
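A minimal LOOCV sketch with scikit-learn's LeaveOneOut splitter, again on synthetic data (the data and model are assumptions for illustration only).

```python
# LOOCV: the model is refit n times, each time holding out a single observation.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(size=100)                    # synthetic data

# Each fold returns the negative squared error for its single held-out point;
# negating the mean gives CV(n) = (1/n) * sum_i MSE_i.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_squared_error")
print(f"LOOCV estimate of test MSE: {-scores.mean():.3f}")
```

Because there is no random splitting, the printed estimate is identical on every run.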
Holdout (Validation Set) Approach vs LOOCV
• Holdout / validation set approach
• Advantage: cheap, since the model is fit only once.
• Disadvantage: variance, giving an unreliable estimate of future performance.
• LOOCV
• Advantage: low bias, and the estimate does not depend on a random split.
• Disadvantage: expensive, since the model must be fit n times.
The Bootstrap
• The bootstrap is a flexible and powerful statistical tool that can be used to
quantify the uncertainty associated with a given estimator or statistical
learning method.
• Example: assume a small data set x = {3, 5, 2, 1, 7}, and suppose we want to use
the bootstrap to compute the bias and variance of the sample mean $\hat\mu = 3.6$.
• Suppose three bootstrap samples (each drawn with replacement from x) have
means 3.2, 3.4 and 3.0, so the average bootstrap mean is 3.2.
• $\mathrm{Bias}(\hat\mu) = 3.2 - 3.6 = -0.4$
• $\mathrm{Var}(\hat\mu) = \frac{1}{2}\big[(3.2 - 3.2)^2 + (3.4 - 3.2)^2 + (3.0 - 3.2)^2\big] = 0.04$
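The sketch below repeats the same calculation by actually drawing bootstrap samples; note that it uses B = 1000 bootstrap data sets rather than the three used in the arithmetic above, so the numbers will differ slightly.

```python
# Bootstrap estimates of the bias and variance of the sample mean of x = {3, 5, 2, 1, 7}.
import numpy as np

x = np.array([3, 5, 2, 1, 7])
sample_mean = x.mean()                                      # 3.6

rng = np.random.default_rng(42)
B = 1000                                                    # number of bootstrap data sets
boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                       for _ in range(B)])

bias = boot_means.mean() - sample_mean                      # average bootstrap mean minus sample mean
var = boot_means.var(ddof=1)                                # sample variance of the bootstrap means
print(f"bias = {bias:.3f}, variance = {var:.3f}")
```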
A graphical illustration of the bootstrap approach on a small sample containing n = 3
observations. Each bootstrap data set contains n observations, sampled with
replacement from the original data set. Each bootstrap data set is used to obtain an
estimate of α
A simple example
• Suppose we wish to invest a fixed sum of money in two financial assets that yield
returns of X and Y, respectively, where X and Y are random quantities.
• We will invest a fraction α of our money in X, and will invest the remaining
1 − α in Y.
• We choose α to minimize the total risk, or variance, of our investment, i.e. we
minimize Var(αX + (1 − α)Y).
• The value of α that minimizes the risk is given by
$$\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}$$
where $\sigma_X^2 = \mathrm{Var}(X)$, $\sigma_Y^2 = \mathrm{Var}(Y)$ and $\sigma_{XY} = \mathrm{Cov}(X, Y)$.
• We can then estimate the value of α that minimizes the variance of our
investment using
$$\hat\alpha = \frac{\hat\sigma_Y^2 - \hat\sigma_{XY}}{\hat\sigma_X^2 + \hat\sigma_Y^2 - 2\hat\sigma_{XY}}$$
Each panel displays 100 simulated returns for investments X and Y . From left
to right and top to bottom, the resulting estimates for α are 0.576, 0.532,
0.657, and 0.651.
Example (contd.)
• The standard deviation of the 1000 simulated estimates $\hat\alpha_r$ is
$$\sqrt{\frac{1}{1000 - 1}\sum_{r=1}^{1000}\left(\hat\alpha_r - \bar\alpha\right)^2} = 0.083$$
where $\bar\alpha$ is the average of the 1000 estimates.
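A sketch of the same idea with the bootstrap instead of repeated simulation: estimate α from one simulated data set, then resample that data set with replacement to obtain a standard error. The simulated return distribution (unit variances, correlation 0.5) is an assumption chosen only so the code runs.

```python
# Bootstrap standard error of alpha_hat = (s2_Y - s_XY) / (s2_X + s2_Y - 2 s_XY).
import numpy as np

def alpha_hat(x, y):
    cov = np.cov(x, y)                                      # 2x2 sample covariance matrix
    return (cov[1, 1] - cov[0, 1]) / (cov[0, 0] + cov[1, 1] - 2 * cov[0, 1])

rng = np.random.default_rng(0)
n = 100
# One simulated data set of n paired returns (X, Y).
X, Y = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=n).T

B = 1000
boot_alphas = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)                        # resample observations with replacement
    boot_alphas[b] = alpha_hat(X[idx], Y[idx])

print(f"alpha_hat = {alpha_hat(X, Y):.3f}")
print(f"bootstrap SE(alpha_hat) = {boot_alphas.std(ddof=1):.3f}")
```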
Linear Model Selection
• In the standard linear model Y = β0 + β1X1 + ··· + βpXp + ε, the response is
modelled as a linear function of the predictors X1, X2, ..., Xp.
• This simple linear model can be improved by replacing plain least squares fitting
with alternative fitting procedures.
• Alternative fitting procedures can yield better prediction accuracy and better
model interpretability.
Prediction Accuracy
• If the true relationship between the response and the predictors is approximately
linear, the least squares estimates will have low bias
• If n >> p, where n is the number of observations and p is the number of variables,
the least squares estimates have low variance and will perform well on test
observations.
• If n is not much larger than p, there can be a lot of variability in the least squares
fit, resulting in overfitting and consequently poor predictions on future
observations not used in model training.
• If p > n, there is no unique least squares coefficient estimate: the variance is
infinite, so the method cannot be used at all.
Linear Model Selection
• Model Interpretability
• some or many of the variables used in a multiple regression model are not
associated with the response
• Including such irrelevant variables leads to unnecessary complexity in the
resulting model
• remove these variables by setting the corresponding coefficient estimates to
zero
• least squares is extremely unlikely to yield any coefficient estimates that are
exactly zero
• We need approaches for automatically performing feature selection or variable
selection, that is, for excluding irrelevant variables from a multiple regression model.
• Alternatives to fitting by plain least squares are
• Subset Selection
• This approach involves identifying a subset of the p predictors that we believe
to be related to the response
• Fit a model using least squares on the reduced set of variables.
• Shrinkage or regularization
• This approach involves fitting a model involving all p predictors
• the estimated coefficients are shrunken towards zero relative to the least
squares estimates
• Depending on the type of shrinkage, some coefficients may be estimated to be
exactly zero, so shrinkage can also perform variable selection.
• Dimension Reduction.
• involves projecting the p predictors into an M-dimensional subspace, where M < p
• Achieved by computing M different linear combinations, or projections, of the
variables
• these M projections are used as predictors to fit a linear regression model by
least squares
Subset Selection
• Involves identifying a subset of the p predictors that we believe to be related
to the response
• Then fit a model using least squares on the reduced set of variables
• 1. Let M0 denote the null model, which contains no predictors. This model
simply predicts the sample mean for each observation.
• 2. For k = 1, 2, ..., p:
• (a) Fit all $\binom{p}{k}$ models that contain exactly k predictors.
• (b) Pick the best among these models, and call it Mk. Here best is
defined as having the smallest RSS, or equivalently the largest R².
• 3. Select a single best model from among M0, ..., Mp using cross-validated
prediction error, Cp (AIC), BIC, or adjusted R².
• Computational limitations
• The number of possible models that must be considered grows rapidly as
p increases
• There are 2^p models that involve subsets of p predictors.
• If p = 10, there are approximately 1,000 possible models to be considered.
• If p = 20, then there are over one million possibilities
• Computationally infeasible for values of p greater than around 40
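A small sketch of best subset selection on synthetic data with p = 6, so the 2^p search is still cheap; the validation-set step in part 3 is one of the options mentioned above, and all names and data here are illustrative.

```python
# Best subset selection: for each k, keep the size-k model with the smallest training RSS,
# then compare the resulting models M_1..M_p on a validation set.
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 1.5 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)      # only predictors 0 and 2 matter

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

best_per_k = {}
for k in range(1, p + 1):
    best_rss, best_subset = np.inf, None
    for subset in itertools.combinations(range(p), k):      # all C(p, k) models of size k
        cols = list(subset)
        fit = LinearRegression().fit(X_tr[:, cols], y_tr)
        rss = ((y_tr - fit.predict(X_tr[:, cols])) ** 2).sum()
        if rss < best_rss:
            best_rss, best_subset = rss, cols
    best_per_k[k] = best_subset                             # M_k

# Step 3: pick among M_1..M_p using validation MSE (cross-validation would also work).
for k, cols in best_per_k.items():
    fit = LinearRegression().fit(X_tr[:, cols], y_tr)
    print(k, cols, round(mean_squared_error(y_val, fit.predict(X_val[:, cols])), 3))
```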
Stepwise Selection
• For computational reasons, best subset selection cannot be applied with
very large p
• The larger the search space, the higher the chance of finding models that
look good on the training data
• an enormous search space can lead to overfitting and high variance of the
coefficient estimates.
• For both of these reasons, stepwise methods, which explore a far more
restricted set of models, are attractive alternatives to best subset selection.
• Stepwise Selection methods
• Forward Stepwise Selection
• Backward Stepwise Selection
Forward Stepwise Selection
• at each step the variable that gives the greatest additional improvement
to the fit is added to the model
Algorithm - Forward stepwise selection
• 1. Let M0 denote the null model, which contains no predictors.
• 2. For k = 0, ..., p − 1:
• (a) Consider all p − k models that augment the predictors in Mk with one
additional predictor.
• (b) Choose the best among these p − k models, and call it Mk+1. Here
best is defined as having the smallest RSS or the highest R².
• 3. Select a single best model from among M0, ..., Mp using cross-validated
prediction error, Cp (AIC), BIC, or adjusted R².
• It involves fitting one null model, along with p − k models in the kth
iteration, for k = 0, ..., p − 1.
• This amounts to a total of $1 + \sum_{k=0}^{p-1}(p - k) = 1 + p(p+1)/2$ models.
• When p = 20, best subset selection requires fitting 1,048,576 models,
whereas forward stepwise selection requires fitting only 211 models.
• It is not guaranteed to find the best possible model out of all 2^p models
containing subsets of the p predictors.
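A minimal forward stepwise sketch on synthetic data; it builds the sequence M_1, ..., M_p by greedily adding the predictor that most reduces the training RSS (choosing among these models by Cp, BIC, or cross-validation is left out for brevity).

```python
# Forward stepwise selection: start from the null model and add one predictor at a time.
import numpy as np
from sklearn.linear_model import LinearRegression

def rss(cols, X, y):
    fit = LinearRegression().fit(X[:, cols], y)
    return ((y - fit.predict(X[:, cols])) ** 2).sum()

rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 1] + 1.0 * X[:, 4] + rng.normal(size=n)      # synthetic example

selected, remaining, path = [], list(range(p)), []
for k in range(p):
    # Add the predictor whose inclusion gives the smallest training RSS.
    best_j = min(remaining, key=lambda j: rss(selected + [j], X, y))
    selected.append(best_j)
    remaining.remove(best_j)
    path.append(list(selected))                             # predictor sets of M_1, ..., M_p

print(path)
```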
Backward Stepwise Selection
• It begins with the full least squares model containing all p predictors, and
then iteratively removes the least useful predictor one-at-a-time
• 1. Let Mp denote the full model, which contains all p predictors.
• 2. For k = p, p − 1, ..., 1:
• (a) Consider all k models that contain all but one of the predictors in
Mk, for a total of k − 1 predictors.
• (b) Choose the best among these k models, and call it Mk−1. Here best is
defined as having the smallest RSS or the highest R².
• 3. Select a single best model from among M0, ..., Mp using cross-validated
prediction error, Cp (AIC), BIC, or adjusted R².
• Hybrid approaches add variables sequentially, as in forward selection; however,
after adding each new variable, the method may also remove any variables that
no longer provide an improvement in the model fit.
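For symmetry, a backward stepwise sketch under the same illustrative assumptions: start from the full model and greedily drop the predictor whose removal hurts the training fit the least.

```python
# Backward stepwise selection: start from the full model M_p and remove one predictor at a time.
import numpy as np
from sklearn.linear_model import LinearRegression

def rss(cols, X, y):
    fit = LinearRegression().fit(X[:, cols], y)
    return ((y - fit.predict(X[:, cols])) ** 2).sum()

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 1.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)      # synthetic data

current, path = list(range(p)), [list(range(p))]            # M_p: the full model
while len(current) > 1:
    # Drop the predictor whose removal yields the model with the smallest RSS.
    drop = min(current, key=lambda j: rss([c for c in current if c != j], X, y))
    current.remove(drop)
    path.append(list(current))                              # M_{k-1}

print(path)
```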
Choosing the Optimal Model
• Two common approaches to select the best model with respect to test
error are
• Indirectly estimate test error by making an adjustment to the training
error to account for the bias due to overfitting.
• Directly estimate the test error, using either a validation set approach
or a cross-validation approach
Choosing the Optimal Model
• Cp
• AIC (Akaike information criterion)
• BIC (Bayesian information criterion)
• Adjusted R²
• For a fitted least squares model containing d predictors, the Cp estimate of test
MSE is
$$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat\sigma^2\right)$$
where $\hat\sigma^2$ is an estimate of the variance of the error ε associated with each
response measurement in Y = β0 + β1X1 + ··· + βpXp + ε.
• The Cp statistic adds a penalty of $2d\hat\sigma^2$ to the training RSS in order to adjust
for the fact that the training error tends to underestimate the test error.
• The penalty increases as the number of predictors in the model increases; this is
intended to adjust for the corresponding decrease in training RSS.
• We choose the model with the lowest Cp value.
Choosing the Optimal Model
• The AIC criterion is defined for a large class of models fit by maximum
likelihood. For a least squares model with d predictors (and Gaussian errors),
AIC is given, up to an additive constant, by
$$\mathrm{AIC} = \frac{1}{n\hat\sigma^2}\left(\mathrm{RSS} + 2d\hat\sigma^2\right)$$
• For least squares models, Cp and AIC are therefore proportional to each other.
• BIC is derived from a Bayesian point of view, but ends up looking similar to
Cp and AIC.
• For the least squares model with d predictors, BIC is given by
$$\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d\hat\sigma^2\right)$$
• Since log n > 2 for n > 7, BIC generally places a heavier penalty on models with
many variables than Cp does.
• BIC will tend to take on a small value for a model with a low test error, so we
select the model that has the lowest BIC value.
Choosing the Optimal Model
• For a least squares model with d variables, the adjusted R² statistic is
calculated as
$$\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}$$
where TSS is the total sum of squares; a large value of adjusted R² indicates a
model with a small test error.
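The sketch below computes Cp, AIC, BIC, and adjusted R² for a sequence of nested least squares models, using the formulas above with σ̂² estimated from the full model; the data and the nesting are illustrative assumptions.

```python
# Adjustment criteria for a least squares fit with d predictors (formulas as above).
import numpy as np
from sklearn.linear_model import LinearRegression

def criteria(X_sub, y, sigma2_hat):
    n, d = X_sub.shape
    fit = LinearRegression().fit(X_sub, y)
    rss = ((y - fit.predict(X_sub)) ** 2).sum()
    tss = ((y - y.mean()) ** 2).sum()
    cp = (rss + 2 * d * sigma2_hat) / n
    aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)     # additive constant omitted
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    return cp, aic, bic, adj_r2

rng = np.random.default_rng(0)
n, p = 150, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)                      # synthetic data

full = LinearRegression().fit(X, y)
sigma2_hat = ((y - full.predict(X)) ** 2).sum() / (n - p - 1)   # error variance from full model

for d in range(1, p + 1):                                   # nested models using the first d columns
    print(d, [round(v, 3) for v in criteria(X[:, :d], y, sigma2_hat)])
```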
• The subset selection methods involve using least squares to fit a linear
model that contains a subset of the predictors.
• As an alternative, we can fit a model containing all p predictors using a
technique that shrinks the coefficient estimates towards zero.
• The two best-known shrinkage techniques are ridge regression and the lasso.
Ridge Regression
• Ridge regression is very similar to least squares, except that the
coefficients are estimated by minimizing a slightly different quantity
• The ridge regression coefficient estimates $\hat\beta^R$ are the values that minimize
$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = \mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2$$
• The tuning parameter λ serves to control the relative impact of these two
terms on the regression coefficient estimates.
• When λ = 0, the penalty term has no effect, and ridge regression will
produce the least squares estimates
• λ→ ∞, the impact of the shrinkage penalty grows, and the ridge
regression coefficient estimates will approach zero
• Ridge regression will produce a different set of coefficient estimates, $\hat\beta^R_\lambda$,
for each value of λ.
The standardized ridge regression coefficients are displayed for the Credit data
set, as a function of λ and $\|\hat\beta^R_\lambda\|_2 / \|\hat\beta\|_2$.
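A brief ridge sketch with scikit-learn, where the tuning parameter λ is called alpha; the standardization step and the synthetic data are illustrative assumptions.

```python
# Ridge regression: as lambda grows, the coefficients shrink towards zero.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 100, 10
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)      # synthetic data

for lam in [0.01, 1.0, 10.0, 100.0, 1000.0]:
    ridge = Ridge(alpha=lam).fit(X, y)
    # The L2 norm of the coefficient vector shrinks as the penalty grows.
    print(f"lambda={lam:>7}: ||beta||_2 = {np.linalg.norm(ridge.coef_):.3f}")
```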
Why Does Ridge Regression Improve Over Least Squares?
• Ridge regression's advantage over least squares is rooted in the bias-variance
trade-off: as λ increases, the flexibility of the fit decreases, leading to decreased
variance but increased bias.
The Lasso
• The lasso coefficient estimates $\hat\beta^L_\lambda$ are the values that minimize
$$\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = \mathrm{RSS} + \lambda\sum_{j=1}^{p}|\beta_j|$$
• When λ is sufficiently large, the L1 penalty forces some of the coefficient
estimates to be exactly zero, so the lasso also performs variable selection.
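The same setup with the lasso, again using scikit-learn's alpha for λ; unlike ridge, some coefficients become exactly zero as the penalty grows.

```python
# Lasso: the L1 penalty sets some coefficients exactly to zero (variable selection).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 100, 10
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)      # only 2 of 10 predictors matter

for lam in [0.01, 0.1, 0.5, 1.0]:
    lasso = Lasso(alpha=lam).fit(X, y)
    print(f"lambda={lam}: {np.sum(lasso.coef_ != 0)} nonzero coefficients")
```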
Dimension Reduction Methods
• Dimension reduction methods transform the predictors and then fit a least
squares model using the transformed variables.
• Let Z1, Z2, ..., ZM represent M < p linear combinations of the original p
predictors:
$$Z_m = \sum_{j=1}^{p} \phi_{jm} X_j$$
for some constants φ1m, φ2m, ..., φpm, m = 1, ..., M.
• The term dimension reduction comes from the fact that this approach reduces
the problem of estimating the p + 1 coefficients β0, β1, ..., βp to the simpler
problem of estimating the M + 1 coefficients θ0, θ1, ..., θM, where M < p.
• If p is large relative to n, choosing M ≪ p can significantly reduce the variance
of the fitted coefficients.
• If M = p, and all the Zm are linearly independent, no dimension reduction
occurs.
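A sketch of dimension reduction in regression (principal components regression): build M < p linear combinations Z_m with PCA and regress the response on them by least squares. The choice M = 3 and the synthetic data are assumptions for illustration.

```python
# Principal components regression: regress y on Z_1, ..., Z_M instead of X_1, ..., X_p.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p, M = 200, 10, 3
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
y = X @ rng.normal(size=p) + rng.normal(size=n)             # synthetic response

pca = PCA(n_components=M)
Z = pca.fit_transform(X)                                    # Z[:, m] = sum_j phi_jm * X_j
fit = LinearRegression().fit(Z, y)                          # only M + 1 coefficients to estimate
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
print("theta_hat:", np.round(fit.coef_, 3))
```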
Dimensionality reduction
• Principal components analysis
• PCA; also called the Karhunen-Loeve, or K-L, method
• Given N data vectors in k dimensions, find c ≤ k orthogonal vectors (principal
components) that can best be used to represent the data
• Steps
• Normalize input data: Each attribute falls within the same range
• Compute c orthonormal (unit) vectors, i.e., principal components
• Each input data (vector) is a linear combination of the c principal component
vectors
• The principal components are sorted in order of decreasing “significance” or
strength
• Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance. (i.e., using the
strongest principal components, it is possible to reconstruct a good
approximation of the original data)
• Works for numeric data only
• Used for handling sparse data
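A short sketch of the PCA steps listed above (normalize, compute orthonormal components sorted by strength, keep the strongest, reconstruct an approximation), using scikit-learn on synthetic correlated data.

```python
# PCA: standardize, extract sorted principal components, and approximate the data with the top c.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))     # correlated numeric data

X_std = StandardScaler().fit_transform(X)                   # step 1: put attributes on the same scale
pca = PCA().fit(X_std)                                      # step 2: orthonormal principal components
print(np.round(pca.explained_variance_ratio_, 3))           # sorted by decreasing variance

c = 2                                                       # keep only the strongest components
scores = pca.transform(X_std)[:, :c]
X_approx = scores @ pca.components_[:c, :]                  # reconstruct an approximation of the data
print("mean squared reconstruction error:", round(float(np.mean((X_std - X_approx) ** 2)), 3))
```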
High Dimensional Data