MI - Unit 5

Unit 5 of the Machine Intelligence course covers resampling techniques such as cross-validation and bootstrapping for model selection and assessment. It discusses various methods including the validation set approach, leave-one-out cross-validation, and k-fold cross-validation, highlighting their advantages and disadvantages. Additionally, it addresses linear model selection strategies like subset selection, shrinkage methods, and dimension reduction to improve prediction accuracy and interpretability.


19CSCN1602 – Machine Intelligence

Unit –5
Prabhu K, AP(SS)/CSE
Unit -V

Resampling and Model Selection

Resampling: Cross Validation, Bootstrapping – Linear Model Selection: Subset selection, Shrinkage methods, Dimension reduction methods – High Dimensional data.
Resampling: Cross Validation
Resampling

• Resampling involves repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.
• It involves fitting the same statistical method multiple times using different subsets of the training data.
• It can be computationally expensive, because the same statistical method must be fit multiple times on different subsets of the training data.

• The process of evaluating a model’s performance is known as model assessment.

• The process of selecting the proper level of flexibility for a model is known as model selection.
Resampling
• Two common methods of Resampling are
• Cross Validation
• The Validation Set Approach
• K-Fold cross validation
• Leave one out cross validation
• Bootstrapping
Cross-Validation

• Cross-validation is used to estimate the test error associated with a model, in order to evaluate its performance.
• Different methods under Cross Validation are
• The Validation Set Approach or Holdout cross-validation
• Leave one out cross validation
• K-Fold cross validation
The Validation Set Approach or Holdout cross-
validation
• It involves randomly dividing the available set of observations into two
parts, a training set and a validation set or hold-out set
• The model is fit on the training set and the fitted model is used to make
predictions on the validation set.
• The resulting validation set error provides an estimate of the test error, assessed using the MSE in the case of a quantitative response and the misclassification rate in the case of a qualitative (discrete) response.
Example: automobile data

• We want to compare linear vs higher-order polynomial terms in a linear regression.

• We randomly split the 392 observations into two sets, a training set
containing 196 of the data points, and a validation set containing the
remaining 196 observations.
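A minimal sketch of this split in Python, assuming scikit-learn is available; the synthetic horsepower/mpg-style data below merely stands in for the Auto data set, which is not bundled with these notes:

```python
# Validation set approach: random 196/196 split, compare polynomial degrees by validation MSE.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(10, 50, size=(392, 1))                                  # stand-in for horsepower
y = 60 - 1.5 * X[:, 0] + 0.02 * X[:, 0] ** 2 + rng.normal(0, 3, 392)    # stand-in for mpg

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=196, random_state=1)

for degree in (1, 2, 3):
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    mse = mean_squared_error(y_val, model.predict(poly.transform(X_val)))
    print(f"degree {degree}: validation MSE = {mse:.2f}")
```

Rerunning with a different random_state changes which observations land in each half, which is exactly the variability discussed on the next slide.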
The Validation Set Approach or Holdout
cross-validation
• Advantages
• Very simple and easy to implement
• We can simply choose the method with the best validation score
• Disadvantages
• The validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which are included in the validation set.
• Only a subset of the observations — those included in the training set rather than in the validation set — are used to fit the model, so the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.
Leave-one-out-cross-validation(LOOCV)

• LOOCV is a better option than the validation set approach.
• Instead of splitting the entire dataset into two halves, only one observation is used for validation and the rest is used to fit the model.
• The statistical learning method is fit on the n − 1 training observations {(x2, y2), ..., (xn, yn)}.
• A prediction ŷ1 is made for the excluded observation using its value x1, and MSE1 = (y1 − ŷ1)² is computed.
• Though MSE1 is approximately unbiased for the test error, it is a poor estimate because it is highly variable, since it is based upon a single observation (x1, y1).
• We repeat the procedure by selecting (x2, y2) for the validation data, training the statistical learning procedure on the n − 1 observations {(x1, y1), (x3, y3), ..., (xn, yn)}, and computing MSE2 = (y2 − ŷ2)².
• Repeating this approach n times produces n squared errors, MSE1, ..., MSEn.
• The LOOCV estimate for the test MSE is the average of these n test error estimates:

  CV(n) = (1/n) ∑i=1..n MSEi
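A short LOOCV sketch in Python, assuming scikit-learn; the simulated data and the linear model are illustrative placeholders:

```python
# LOOCV: fit the model n times, each time holding out one observation,
# and average the n squared errors MSE_1, ..., MSE_n.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(0, 0.5, 50)

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    y_hat = model.predict(X[test_idx])
    errors.append((y[test_idx][0] - y_hat[0]) ** 2)   # MSE_i for the single held-out point

cv_n = np.mean(errors)                                # LOOCV estimate of the test MSE
print(f"LOOCV estimate of test MSE: {cv_n:.3f}")
```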
Leave-one-out-cross-validation
• Advantages
• it has far less bias
• the LOOCV approach tends not to overestimate the test error rate
• performing LOOCV multiple times will always yield the same results
• Disadvantage
• LOOCV is expensive to implement since the model has to be fit n times
Hold out or Validation set approach Vs LOOCV

Method                            | Advantage          | Disadvantage
Holdout / Validation set approach | Cheap              | Variance: unreliable estimate of future performance
Leave one out                     | Doesn’t waste data | Expensive

k-fold cross-validation
• This approach involves randomly dividing the set of observations into k folds of
nearly equal size.
• The first fold is treated as a validation set and the model is fit on the remaining
folds.
• The procedure is then repeated k times, where a different group each time is
treated as the validation set.
• The k-fold CV estimate is computed by averaging these k values: CV(k) = (1/k) ∑i=1..k MSEi
• LOOCV is a special case of k-fold CV in which k is set to equal n
• Computational feasibility is the advantage of using k = 5 or k = 10 rather than k = n
• Performing 10-fold CV requires fitting the learning procedure only ten times, which may be much more feasible
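A possible k-fold CV sketch, assuming scikit-learn's KFold; the synthetic regression data is only for illustration:

```python
# k-fold CV with k = 5 and k = 10: fit the model k times and average the per-fold MSEs.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(0, 1.0, 200)

for k in (5, 10):
    fold_mses = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=1).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        resid = y[test_idx] - model.predict(X[test_idx])
        fold_mses.append(np.mean(resid ** 2))          # MSE on this fold
    print(f"{k}-fold CV estimate of test MSE: {np.mean(fold_mses):.3f}")
```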
How many folds are needed?
• With a large number of folds
• The bias of the true error rate estimator will be small
• The variance of the true error rate estimator will be large
• The computational time will be very large
• With a small number of folds
• The number of experiments and computation time are reduced
• The variance of the estimator will be small
• The bias of the estimator will be large
• In practice the choice of the number of folds depends on the size of the dataset
• For larger datasets, even 3-fold cross validation will be quite accurate
• For very sparse datasets, we may have to use leave one out in order to train on as many examples as possible
• A common choice for K-fold cross validation is K=10
Bias-Variance Trade-Off for k-Fold Cross-
Validation
• Validation set approach
• Overestimates the test error rate as training set contains only half
the observations of the entire data set
• LOOCV
• unbiased estimates of the test error as each training set contains n −
1 observations
• k-fold CV
• k = 5 or k = 10 intermediate level of bias as each training set
contains (k − 1)n/k observations
Bias-Variance Trade-Off for k-Fold Cross-
Validation
• LOOCV
• averaging the outputs of n fitted models, each of which is trained on an
almost identical set of observations therefore, outputs are highly
(positively) correlated with each other
• k-fold CV with k<n
• averaging the outputs of k fitted models that are somewhat less correlated
with each other , since the overlap between the training sets in each model is
smaller
• The mean of many highly correlated quantities has higher variance than does the
mean of many quantities that are not as highly correlated
• LOOCV therefore has higher variance than k-fold CV.
• k-fold cross-validation with k = 5 or k = 10 suffers neither from excessively high bias nor from very high variance.
Bias-Variance Trade-Off for k-Fold Cross-
Validation
Method                  | Bias                                             | Variance
Validation set approach | Overestimates the test error rate                | –
LOOCV                   | Approximately unbiased estimate of the test error | Higher variance
k-fold CV (k = 5 or 10) | Intermediate level of bias                       | Neither high bias nor high variance
Cross-Validation on Classification
Problems

In classification, the LOOCV error rate takes the form

  CV(n) = (1/n) ∑i=1..n Erri,  where Erri = I(yi ≠ ŷi)
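A sketch of this classification version, assuming scikit-learn; the logistic regression model and the simulated labels are illustrative choices rather than part of the original material:

```python
# LOOCV error rate for a classifier: Err_i = I(y_i != y_hat_i), averaged over all n points.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 100) > 0).astype(int)

errs = []
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    errs.append(int(clf.predict(X[test_idx])[0] != y[test_idx][0]))   # Err_i

print(f"LOOCV error rate: {np.mean(errs):.3f}")
```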
Bootstrapping
The Bootstrap

• The bootstrap is a flexible and powerful statistical tool that can be used to
quantify the uncertainty associated with a given estimator or statistical
learning method.

• For example, it can be used to estimate the standard errors of the


coefficients from a linear regression fit
Bootstrap
• The bootstrap is a resampling technique with replacement
• From a dataset with N examples:
• Randomly select (with replacement) N examples and use this set for training
• The remaining examples that were not selected for training are used for testing
• The number of such left-out examples is likely to change from fold to fold
• Repeat this process for a specified number of folds (k)
• The true error is estimated as the average error rate on the test data
Example

• Assume a small dataset x = {3, 5, 2, 1, 7} and we want to compute the bias and variance of the sample mean α̂ = 3.6
• We generate a number of bootstrap samples (three in this case)
• Assume that the first bootstrap yields the dataset {7, 3, 2, 3, 1}
• Compute its sample mean α̂*1 = 3.2
• The second bootstrap yields the dataset {5, 1, 1, 3, 7}
• Compute its sample mean α̂*2 = 3.4
• The third bootstrap yields the dataset {2, 2, 7, 1, 3}
• Compute its sample mean α̂*3 = 3.0
• Average these estimates to obtain the bootstrap mean α̂* = 3.2
• Bias(α̂) = 3.2 − 3.6 = −0.4
• Var(α̂) = 1/2 [(3.2 − 3.2)² + (3.4 − 3.2)² + (3.0 − 3.2)²] = 0.04
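The same calculation can be sketched in NumPy. Note that freshly drawn bootstrap samples will generally differ from the three listed above, so the printed numbers will not match the slide exactly:

```python
# Bootstrap estimate of the bias and variance of the sample mean of x = {3, 5, 2, 1, 7}.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([3, 5, 2, 1, 7])
alpha_hat = x.mean()                                   # 3.6

B = 3                                                  # three bootstrap samples, as in the slide
boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean() for _ in range(B)])

bias = boot_means.mean() - alpha_hat
var = np.sum((boot_means - boot_means.mean()) ** 2) / (B - 1)
print(f"bootstrap means: {boot_means}, bias = {bias:.2f}, variance = {var:.3f}")
```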
A graphical illustration of the bootstrap approach on a small sample containing n = 3
observations. Each bootstrap data set contains n observations, sampled with
replacement from the original data set. Each bootstrap data set is used to obtain an
estimate of α
A simple example

• To invest a fixed sum of money in two financial assets that yield returns of X
and Y , respectively, where X and Y are random quantities.
• We will invest a fraction α of our money in X, and will invest the remaining
1 − α in Y .
• choose α to minimize the total risk, or variance, of our investment. i.e
minimize Var(αX + (1 − α)Y ).
• The value that minimizes the risk is given by
𝞼 2 𝑌 − 𝞼 𝑋𝑌
𝞪=
𝞼 2 𝑋+𝞼 2 𝑌 − 2𝞼 𝑋𝑌

where σ 2 X = Var(X), σ2 Y = Var(Y ), and σXY = Cov(X, Y ).


Example Contd

• But the values of σ²X, σ²Y, and σXY are unknown.

• We can compute estimates for these quantities, σ̂²X, σ̂²Y, and σ̂XY, using a data set that contains measurements for X and Y.

• We can then estimate the value of α that minimizes the variance of our investment using

  α̂ = (σ̂²Y − σ̂XY) / (σ̂²X + σ̂²Y − 2σ̂XY)
Each panel displays 100 simulated returns for investments X and Y . From left
to right and top to bottom, the resulting estimates for α are 0.576, 0.532,
0.657, and 0.651.
Example Contd

• To estimate the standard deviation of α̂, we repeated the process of simulating 100 paired observations of X and Y, and estimating α, 1,000 times.

• We thereby obtained 1,000 estimates for α, which we can call α̂1, α̂2, ..., α̂1000.

• For these simulations the parameters were set to σ²X = 1, σ²Y = 1.25, and σXY = 0.5, and so we know that the true value of α is 0.6.


Example Contd

• The mean over all 1,000 estimates for α is very close to α = 0.6, and the standard deviation of the estimates is

  √( 1/(1000 − 1) ∑r=1..1000 (α̂r − ᾱ)² ) = 0.083

  where ᾱ denotes the mean of the 1,000 estimates.

• This gives us a very good idea of the accuracy of α̂: SE(α̂) ≈ 0.083.

• So roughly speaking, for a random sample from the population, we would expect α̂ to differ from α by approximately 0.08, on average.
Left: A histogram of the estimates of α obtained by generating 1,000 simulated
data sets from the true population. Center: A histogram of the estimates of α
obtained from 1,000 bootstrap samples from a single data set. Right: The estimates
of α displayed in the left and center panels are shown as boxplots. In each panel,
the pink line indicates the true value of α
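A rough NumPy sketch of this experiment run as a bootstrap on a single simulated data set; the population parameters follow the slide, while the function and variable names are made up for illustration:

```python
# Bootstrap estimate of SE(alpha_hat): resample the n paired observations of (X, Y)
# with replacement and recompute alpha_hat on each bootstrap data set.
import numpy as np

rng = np.random.default_rng(0)
n, B = 100, 1000
cov = np.array([[1.0, 0.5], [0.5, 1.25]])     # sigma2_X = 1, sigma2_Y = 1.25, sigma_XY = 0.5
data = rng.multivariate_normal([0.0, 0.0], cov, size=n)

def alpha_hat(d):
    s = np.cov(d[:, 0], d[:, 1])              # 2x2 sample covariance matrix
    return (s[1, 1] - s[0, 1]) / (s[0, 0] + s[1, 1] - 2 * s[0, 1])

boot = np.array([alpha_hat(data[rng.integers(0, n, size=n)]) for _ in range(B)])
print(f"alpha_hat = {alpha_hat(data):.3f}, bootstrap SE = {boot.std(ddof=1):.3f}")
# The bootstrap SE should come out in the neighbourhood of the simulated value 0.083.
```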
Linear Model Selection
Linear Model Selection

• In the regression setting, the standard linear model

• Y = β0 + β1X1 + ··· + βpXp + ℇ

• describes the relationship between a response Y and a set of variables

X1, X2,...,Xp

• This simple linear model can be improved, by replacing plain least squares fitting

with some alternative fitting procedures.

• Alternative fitting procedures can yield better prediction accuracy and model

interpretability
Linear Model Selection
Prediction Accuracy
• If the true relationship between the response and the predictors is approximately
linear, the least squares estimates will have low bias
• If n >> p
• where n is the number of observations and p is the number of variables
• least squares estimates have low variance and will perform well on test observations
• If n is not much larger than p
• the least squares fit can have a lot of variability, resulting in overfitting
• and consequently poor predictions on future observations not used in model training
• If p > n
• there is no longer a unique least squares coefficient estimate
• the variance is infinite
Linear Model Selection
• Model Interpretability
• some or many of the variables used in a multiple regression model are not
associated with the response
• Including such irrelevant variables leads to unnecessary complexity in the
resulting model
• remove these variables by setting the corresponding coefficient estimates to
zero
• least squares is extremely unlikely to yield any coefficient estimates that are exactly zero
• We need approaches for automatically performing feature selection or variable selection, that is, for excluding irrelevant variables from a multiple regression model
• Alternatives to using least squares to fit are
• Subset Selection
• This approach involves identifying a subset of the p predictors that we believe
to be related to the response
• Fit a model using least squares on the reduced set of variables.
• Shrinkage or regularization
• This approach involves fitting a model involving all p predictors
• the estimated coefficients are shrunken towards zero relative to the least
squares estimates
• Depending on the type of shrinkage performed, it can also perform variable selection by setting some coefficients exactly to zero
• Dimension Reduction.
• involves projecting the p predictors into a M-dimensional subspace where M <p
• Achieved by computing M different linear combinations, or projections, of the
variables
• these M projections are used as predictors to fit a linear regression model by
least squares
Subset Selection
• Involves identifying a subset of the p predictors that we believe to be related to the response

• Then fit a model using least squares on the reduced set of variables

• Best subset selection
• Fit a separate least squares regression for each possible combination of the p predictors
• Fit all p models that contain exactly one predictor, all (p choose 2) = p(p − 1)/2 models that contain exactly two predictors, and so forth
• Then look at all of the resulting models, with the goal of identifying the one that is best
Algorithm-Best subset selection

• 1. Let M0 denote the null model, which contains no predictors. This model
simply predicts the sample mean for each observation.

• 2. For k = 1, 2, ..., p:

• (a) Fit all (p choose k) models that contain exactly k predictors.

• (b) Pick the best among these models, and call it Mk. Here best is defined as having the smallest RSS, or equivalently the largest R2.

• 3. Select a single best model from among M0, ..., Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R2.


For each possible model containing a subset of the ten predictors in the
Credit data set, the RSS and R2 are displayed. The red frontier tracks the
best model for a given number of predictors, according to RSS and R2.
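A small illustration of best subset selection, assuming scikit-learn and a synthetic data set with p = 5 predictors (with this few predictors the full 2^p enumeration is still cheap):

```python
# Best subset selection: for each k, fit all C(p, k) least-squares models and keep the
# one with the smallest RSS; Cp, BIC, adjusted R2, or CV would then compare across k.
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(0, 1.0, n)   # only predictors 0 and 3 matter

for k in range(1, p + 1):
    best_rss, best_subset = np.inf, None
    for subset in itertools.combinations(range(p), k):
        cols = list(subset)
        model = LinearRegression().fit(X[:, cols], y)
        rss = np.sum((y - model.predict(X[:, cols])) ** 2)
        if rss < best_rss:
            best_rss, best_subset = rss, subset
    print(f"k = {k}: best subset {best_subset}, RSS = {best_rss:.1f}")
```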
Drawback of Subset Selection

• Computational limitations
• The number of possible models that must be considered grows rapidly as
p increases
• There are 2^p models that involve subsets of p predictors
• If p = 10, there are approximately 1,000 possible models to be considered
• If p = 20, then there are over one million possibilities
• Computationally infeasible for values of p greater than around 40
Stepwise Selection
• For computational reasons, best subset selection cannot be applied with
very large p
• The larger the search space, the higher the chance of finding models that
look good on the training data
• an enormous search space can lead to overfitting and high variance of the
coefficient estimates.
• For both of these reasons, stepwise methods, which explore a far more
restricted set of models, are attractive alternatives to best subset selection.
• Stepwise Selection methods
• Forward Stepwise Selection
• Backward Stepwise Selection
Forward Stepwise Selection

• Forward stepwise selection is a computationally efficient alternative to


best subset selection

• Forward stepwise selection begins with a model containing no predictors

• then adds predictors to the model, one-at-a-time, until all of the


predictors are in the model

• at each step the variable that gives the greatest additional improvement
to the fit is added to the model
Algorithm - Forward stepwise selection

• 1. Let M0 denote the null model, which contains no predictors.

• 2. For k = 0,...,p − 1:

• (a) Consider all p − k models that augment the predictors in Mk with


one additional predictor.

• (b) Choose the best among these p − k models, and call it Mk+1.Here
best is defined as having smallest RSS or highest R2.

• 3. Select a single best model from among M0, ..., Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R2.
Forward Stepwise Selection

• It involves fitting one null model, along with p − k models in the kth iteration, for k = 0, ..., p − 1
• This amounts to a total of 1 + ∑k=0..p−1 (p − k) = 1 + p(p + 1)/2 models
• When p = 20, best subset selection requires fitting 2^20 = 1,048,576 models, whereas forward stepwise selection requires fitting only 211 models
• It is not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors
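A compact sketch of forward stepwise selection on synthetic data (scikit-learn assumed; the data and variable names are illustrative):

```python
# Forward stepwise selection: start from the null model and, at each step, add the single
# predictor giving the largest drop in RSS, so only p - k candidate models are fit per step.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = 4 * X[:, 1] + 2 * X[:, 4] + rng.normal(0, 1.0, n)

selected, remaining = [], list(range(p))
while remaining:
    rss_for = {}
    for j in remaining:
        cols = selected + [j]
        model = LinearRegression().fit(X[:, cols], y)
        rss_for[j] = np.sum((y - model.predict(X[:, cols])) ** 2)
    best_j = min(rss_for, key=rss_for.get)              # greedy choice at this step
    selected.append(best_j)
    remaining.remove(best_j)
    print(f"step {len(selected)}: added X{best_j}, RSS = {rss_for[best_j]:.1f}")
# A criterion such as Cp, BIC, adjusted R2, or CV error would then pick among M_1, ..., M_p.
```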
Example

The first four selected models for best subset selection


and forward stepwise selection on the Credit data set.
Backward Stepwise Selection

• It begins with the full least squares model containing all p predictors, and
then iteratively removes the least useful predictor one-at-a-time

• The backward selection approach searches through only 1+p(p+ 1)/2


models

• Backward stepwise selection is not guaranteed to yield the best model


containing a subset of the p predictors
Algorithm - Backward stepwise selection

• 1. Let Mp denote the full model, which contains all p predictors.

• 2. For k = p, p − 1,..., 1:

• (a) Consider all k models that contain all but one of the predictors in
Mk, for a total of k − 1 predictors.

• (b) Choose the best among these k models, and call it Mk−1. Here best is
defined as having smallest RSS or highest R2.

• 3. Select a single best model from among M0, ..., Mp using cross-validated prediction error, Cp (AIC), BIC, or adjusted R2.
Forward stepwise selection Vs Backward
stepwise selection

• Backward selection requires that the number of samples n is larger


than the number of variables p

• Forward stepwise can be used even when n<p


Hybrid Approaches

• Variables are added to the model sequentially, in analogy to forward


selection

• After adding each new variable, the method may also remove any
variables that no longer provide an improvement in the model fit
Choosing the Optimal Model

• Two common approaches to select the best model with respect to test
error are
• Indirectly estimate test error by making an adjustment to the training
error to account for the bias due to overfitting.
• Directly estimate the test error, using either a validation set approach
or a cross-validation approach
Choosing the Optimal Model
• Cp
• AIC(Akaike information criterion)
• BIC(Bayesian information criterion)
• Adjusted R2

• For a least squares model containing d predictors, Cp = (1/n)(RSS + 2dσ̂²), where σ̂² is an estimate of the variance of the error ℇ associated with each response measurement in Y = β0 + β1X1 + ··· + βpXp + ℇ
• The Cp statistic adds a penalty of 2dσ̂² to the training RSS in order to adjust for the fact that the training error tends to underestimate the test error
• The penalty increases as the number of predictors in the model increases; this is intended to adjust for the corresponding decrease in training RSS
• Choose the model with the lowest Cp value
Choosing the Optimal Model

• The AIC criterion is defined for a large class of models fit by maximum likelihood. For least squares models it is given (up to irrelevant constants) by AIC = (1/(nσ̂²))(RSS + 2dσ̂²)

• For least squares models, Cp and AIC are proportional to each other

• BIC is derived from a Bayesian point of view, but ends up looking similar to Cp and AIC.
• For the least squares model with d predictors, BIC = (1/n)(RSS + log(n) dσ̂²)

• BIC will tend to take on a small value for a model with a low test error, so we select the model that has the lowest BIC value
Choosing the Optimal Model
• For a least squares model with d variables, the adjusted R2 statistic is calculated as

  Adjusted R2 = 1 − (RSS/(n − d − 1)) / (TSS/(n − 1))

• Unlike Cp, AIC, and BIC, a large value of adjusted R2 indicates a model with a small test error
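A sketch that computes these criteria for a sequence of nested least squares fits, using the ISLR-style formulas quoted above; σ̂² is estimated from the full model, and the nested models (first d predictors) are chosen only for illustration:

```python
# Cp, BIC, and adjusted R^2 for d-predictor least-squares fits on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(0, 1.0, n)

full = LinearRegression().fit(X, y)
sigma2_hat = np.sum((y - full.predict(X)) ** 2) / (n - p - 1)   # error-variance estimate
tss = np.sum((y - y.mean()) ** 2)

for d in range(1, p + 1):
    cols = list(range(d))                                       # first d predictors, for illustration
    model = LinearRegression().fit(X[:, cols], y)
    rss = np.sum((y - model.predict(X[:, cols])) ** 2)
    cp = (rss + 2 * d * sigma2_hat) / n
    bic = (rss + np.log(n) * d * sigma2_hat) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    print(f"d = {d}: Cp = {cp:.3f}, BIC = {bic:.3f}, adjusted R2 = {adj_r2:.3f}")
```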
Validation and Cross-Validation
• compute the validation set error or the cross-validation error for each model
under consideration, and then select the model for which the resulting estimated
test error is smallest
• it provides a direct estimate of the test error, and makes fewer assumptions about
the true underlying model
Shrinkage Methods
Shrinkage Methods

• The subset selection methods involve using least squares to fit a linear
model that contains a subset of the predictors

• As an alternative, we can fit a model containing all p predictors using a


technique that constrains or regularizes the coefficient estimates , or
equivalently, that shrinks the coefficient estimates towards zero

• The two best-known techniques for shrinking the regression coefficients


towards zero are
• ridge regression

• lasso
Ridge Regression
• Ridge regression is very similar to least squares, except that the
coefficients are estimated by minimizing a slightly different quantity
• The ridge regression coefficient estimates β̂R are the values that minimize

  RSS + λ ∑j=1..p βj²

• where λ ≥ 0 is a tuning parameter
• the second term, λ ∑j=1..p βj², called a shrinkage penalty, is small when β1, ..., βp are close to zero
• it has the effect of shrinking the estimates of βj towards zero


Ridge Regression
  RSS + λ ∑j=1..p βj²

• The tuning parameter λ serves to control the relative impact of these two
terms on the regression coefficient estimates.
• When λ = 0, the penalty term has no effect, and ridge regression will
produce the least squares estimates
• λ→ ∞, the impact of the shrinkage penalty grows, and the ridge
regression coefficient estimates will approach zero
• ridge regression will produce a different set of coefficient estimates, β̂Rλ, for each value of λ
The standardized ridge regression coefficients are displayed for the Credit data set, as a function of λ and ‖β̂Rλ‖2 / ‖β̂‖2.
Why Does Ridge Regression Improve Over Least
Squares?

• As λ increases, the flexibility of the ridge regression fit decreases, leading


to decreased variance but increased bias

• When λ = 0, the variance is high but there is no bias
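A brief ridge sketch, assuming scikit-learn (where the tuning parameter λ is called alpha); the data is synthetic and the λ grid is arbitrary:

```python
# Ridge regression over a grid of lambda values: coefficients shrink as lambda grows.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 100, 4
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))     # standardize predictors first
y = 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 1.0, n)

for lam in (0.01, 1.0, 10.0, 100.0, 1000.0):
    coefs = Ridge(alpha=lam).fit(X, y).coef_
    print(f"lambda = {lam:7.2f}: coefficients = {np.round(coefs, 3)}")
# Near-zero lambda essentially reproduces the least squares fit; large lambda shrinks every
# coefficient towards zero, but none of them is set exactly to zero.
```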


The Lasso
• Disadvantage of Ridge regression

• ridge regression will include all p predictors in the final model

• The penalty λ ∑j βj² will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero

• it can create a challenge in model interpretation in settings ,in


which the number of variables p is quite large

• Increasing the value of λ will tend to reduce the magnitudes of


the coefficients, but will not result in exclusion of any of the
variables.
The Lasso
• The lasso is a relatively recent alternative to ridge regression
that over comes this disadvantage.
• The lasso coefficients, β̂L, minimize the quantity

  RSS + λ ∑j=1..p |βj|
The Lasso

• In the case of the lasso, the ℓ1 penalty has the effect of


forcing some of the coefficient estimates to be exactly
equal to zero when the tuning parameter λ is
sufficiently large.
• like best subset selection, the lasso performs variable
selection
• depending on the value of λ, the lasso can produce a
model involving any number of variables
The standardized lasso coefficients on the Credit data set are shown as a function of λ and ‖β̂Lλ‖1 / ‖β̂‖1.
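A matching lasso sketch under the same assumptions (scikit-learn, synthetic data, arbitrary λ grid), showing coefficients being set exactly to zero:

```python
# Lasso over a grid of lambda values: the l1 penalty zeroes out some coefficients,
# so the lasso performs variable selection.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 100, 6
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
y = 4 * X[:, 0] - 2 * X[:, 3] + rng.normal(0, 1.0, n)           # only X0 and X3 matter

for lam in (0.01, 0.1, 0.5, 1.0, 2.0):
    coefs = Lasso(alpha=lam).fit(X, y).coef_
    n_zero = int(np.sum(coefs == 0.0))
    print(f"lambda = {lam:4.2f}: {n_zero} coefficients exactly zero, coefs = {np.round(coefs, 3)}")
```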
Dimension Reduction Methods
Dimension Reduction Methods

• Transform the predictors and then fit a least squares model using the transformed variables
• Let Z1, Z2, ..., ZM represent M < p linear combinations of the original p predictors:

  Zm = ∑j=1..p ϕjm Xj

• for some constants ϕ1m, ϕ2m, ..., ϕpm, m = 1, ..., M

• then fit the linear regression model

  yi = θ0 + ∑m=1..M θm zim + ϵi,  i = 1, ..., n


Dimension Reduction Methods

• The term dimension reduction comes from the fact that this approach reduces the problem of estimating the p + 1 coefficients β0, β1, ..., βp to the simpler problem of estimating the M + 1 coefficients θ0, θ1, ..., θM, where M < p
• the dimension of the problem has been reduced from p + 1 to M + 1
• In situations where p is large relative to n, selecting M << p can significantly reduce the variance of the fitted coefficients
• If M = p and all the Zm are linearly independent, no dimension reduction occurs
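One way to sketch this is principal components regression, assuming scikit-learn; M = 3 and the synthetic data are arbitrary choices:

```python
# Dimension reduction: build M < p linear combinations Z_1, ..., Z_M of the predictors
# with PCA, then regress y on them by least squares (principal components regression).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p, M = 150, 10, 3
X = StandardScaler().fit_transform(rng.normal(size=(n, p)))
y = X @ rng.normal(size=p) + rng.normal(0, 1.0, n)

pca = PCA(n_components=M)
Z = pca.fit_transform(X)                       # the M projections Z_1, ..., Z_M
pcr = LinearRegression().fit(Z, y)             # estimates theta_0, theta_1, ..., theta_M
print("explained variance ratios:", np.round(pca.explained_variance_ratio_, 3))
print("theta estimates:", np.round(np.r_[pcr.intercept_, pcr.coef_], 3))
```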
Dimensionality reduction
• Principal components analysis
• PCA; also called the Karhunen-Loeve, or K-L, method
• Given N data vectors from k-dimensions, find c ≤ k orthogonal vectors (Principal
components) that can be best used to represent data
• Steps
• Normalize input data: Each attribute falls within the same range
• Compute c orthonormal (unit) vectors, i.e., principal components
• Each input data (vector) is a linear combination of the c principal component
vectors
• The principal components are sorted in order of decreasing “significance” or
strength
• Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance. (i.e., using the
strongest principal components, it is possible to reconstruct a good
approximation of the original data)
• Works for numeric data only
• Used for handling sparse data
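A from-scratch NumPy sketch of these PCA steps; the simulated data, the choice c = 2, and the variable names are illustrative:

```python
# PCA steps from the slide: normalize the data, compute orthonormal component directions
# (eigenvectors of the covariance matrix), sort them by decreasing variance, and keep the
# strongest c components to approximate the original data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))        # correlated numeric data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)                   # 1. normalize each attribute

cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)                         # 2. orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]                              # 3. sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

c = 2                                                          # 4. keep the c strongest components
scores = X_std @ eigvecs[:, :c]                                # each row: c linear combinations
X_approx = scores @ eigvecs[:, :c].T                           # reconstruct an approximation
print("variance explained by first", c, "components:",
      round(eigvals[:c].sum() / eigvals.sum(), 3))
```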
High Dimensional data

High Dimensional data

Data sets containing more features than observations are often referred to as high-dimensional.

Reference(s):

• James G, Witten D, Hastie T and Tibshirani R, “An Introduction to Statistical Learning with Applications in R”, Springer, 2013.
