
Cross-Validation

Cross-validation is a technique for validating a model's performance by training it on a subset of the input data and testing it on a previously unseen subset of that data.

The basic steps of cross-validation are:

• Reserve a subset of the dataset as a validation set.
• Train the model using the training dataset.
• Evaluate the model's performance using the validation set.
Methods used for Cross-Validation

There are some common methods used for cross-validation. These methods are given below:
1. Validation Set Approach
2. Leave-P-out cross-validation
3. Leave-one-out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation
1. Validation Set Approach

• In the validation set approach, we divide the input dataset into a training set and a test (validation) set, with each subset receiving 50% of the data.
• One big disadvantage is that only 50% of the dataset is used to train the model, so the model may fail to capture important information in the data. It also tends to produce an underfitted model.
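As a sketch, the 50/50 split described above can be implemented in plain Python (the function name `validation_set_split` is a hypothetical helper, not a standard API):

```python
import random

def validation_set_split(data, seed=0):
    # Shuffle the data, then give half to training and half to validation.
    items = list(data)
    random.Random(seed).shuffle(items)
    half = len(items) // 2
    return items[:half], items[half:]

train, valid = validation_set_split(range(10))
print(len(train), len(valid))  # 5 5
```

Note that only the first half ever trains the model, which is exactly the 50% limitation described above.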
2. Leave-P-out cross-validation

• In this method, p data points are left out of the training data.
• If there are n data points in total, then n − p data points are used as the training set and the remaining p data points as the validation set.
• This process is repeated for all possible combinations of p points, and the average error is calculated to measure the effectiveness of the model.
• The disadvantage of this technique is that it can be computationally prohibitive for large p.
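A minimal sketch of how the n − p / p splits can be enumerated with `itertools.combinations` (the function name is hypothetical); the C(n, p) split count is what makes the method expensive as p grows:

```python
from itertools import combinations

def leave_p_out_splits(n, p):
    # Every size-p subset of the indices serves exactly once as the validation set.
    all_indices = set(range(n))
    for test in combinations(range(n), p):
        yield sorted(all_indices - set(test)), list(test)

splits = list(leave_p_out_splits(4, 2))
print(len(splits))  # C(4, 2) = 6
```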
3. Leave one out cross-validation

• This is similar to leave-p-out cross-validation, but instead of p points, we take only 1 data point out of the training set.
• For each learning set, only one data point is reserved, and the remaining data is used to train the model.
• This process repeats for each data point. Hence, for n samples we get n different training sets and n test sets.
• In this approach, bias is minimal because nearly all the data points are used for training.
• The process is executed n times; hence the execution time is high.
• This approach leads to high variance in the estimate of the model's effectiveness.
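The points above can be sketched with a toy model that predicts the mean of its training values (a hypothetical illustration, not a standard API): n training sets of size n − 1 are built, and the squared errors on the held-out points are averaged:

```python
def loocv_mean_model(values):
    # Hold out each point in turn; "train" the mean-predictor model on the
    # remaining n-1 points and record the squared error on the held-out point.
    n = len(values)
    errors = []
    for i in range(n):
        train = [v for j, v in enumerate(values) if j != i]
        pred = sum(train) / len(train)
        errors.append((values[i] - pred) ** 2)
    return sum(errors) / n  # average error over all n splits

print(loocv_mean_model([1.0, 2.0, 3.0]))  # 1.5
```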
4. K-Fold Cross-Validation

• The k-fold cross-validation approach divides the input dataset into k groups of samples of equal size.
• These groups are called folds.
• For each learning set, the prediction function is trained on k − 1 folds, and the remaining fold is used as the test set.
• This is a very popular CV approach because it is easy to understand, and its output is less biased than that of other methods.
The steps for k-fold cross-validation are:
Split the input dataset into k groups.
For each group:
• Take one group as the reserve or test dataset.
• Use the remaining groups as the training dataset.
• Fit the model on the training set and evaluate its performance using the test set.
An example of 5-fold cross-validation:
The dataset is grouped into 5 folds.
On the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train it.
On the 2nd iteration, the second fold is used to test the model, and the rest are used to train it.
This process continues until each fold has been used as the test fold.
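The iteration pattern above can be sketched as index splits (a hypothetical helper; libraries such as scikit-learn provide an equivalent `KFold` utility):

```python
def k_fold_splits(n, k):
    # Partition indices 0..n-1 into k contiguous folds; each fold serves as
    # the test set once while the remaining k-1 folds form the training set.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

for train, test in k_fold_splits(10, 5):
    print(test)  # each fold of size 2 appears as the test set exactly once
```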
5. Stratified k-fold cross-validation
• This technique is similar to k-fold cross-validation, with small changes.
• It works on the concept of stratification: rearranging the data to ensure that each fold or group is a good representative of the complete dataset.
• It is one of the best approaches for dealing with bias and variance.
• It can be understood with the example of housing prices, where the prices of some houses can be much higher than those of others.
• A stratified k-fold cross-validation technique is useful for dealing with such situations.
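A minimal sketch of the stratification idea: sample indices are grouped by class label and dealt round-robin into folds, so each fold preserves the class proportions (the function name is hypothetical; scikit-learn's `StratifiedKFold` offers a full implementation):

```python
from collections import defaultdict

def stratified_folds(labels, k):
    # Group sample indices by class, then deal each class's indices
    # round-robin across the k folds to preserve class proportions.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds

labels = ["low", "low", "low", "low", "high", "high", "high", "high"]
print(stratified_folds(labels, 2))  # each fold holds 2 "low" and 2 "high" samples
```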
There are some limitations of the cross-validation technique:
• Under ideal conditions it provides optimal results, but with inconsistent data it may produce misleading results.
• In predictive modeling, the data evolves over time, which may create differences between the training and validation sets.
Model Selection
• Model selection is the task of selecting a model from various candidates on the basis of performance criteria.
• A pre-existing set of data is considered.
• It involves designing experiments so that the collected data is well suited to the problem of model selection.
• Model selection also refers to the problem of selecting a few representative models from a large set of models for the purpose of decision making.
Principles of Model selection
• Simple vs Complex
• Overfitting and Underfitting
• Bias-Variance Tradeoff
Stepwise Regression
• Stepwise regression is the step-by-step iterative construction of a regression model.
• It involves selecting the independent variables to be used in the final model.
• Variables are added or removed, and statistical significance is tested after each iteration.
Types of stepwise Regression

• Forward selection – An iterative approach that starts with an empty set of features and, at each iteration, adds the feature that best improves the model. It stops when adding a new variable no longer improves the model's performance.
• Backward elimination – An iterative approach that starts with all features and, at each iteration, removes the least significant feature. It stops when no improvement in the model's performance is observed after a feature is removed.
• Bi-directional elimination – This method combines forward selection and backward elimination to reach one unique solution.
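Forward selection, the first method above, can be sketched as a greedy loop; `score` here is a hypothetical user-supplied evaluation function (e.g. cross-validated accuracy), with higher values being better:

```python
def forward_selection(features, score, tol=1e-9):
    # Start with no features; at each step add the feature whose inclusion
    # most improves the score. Stop when no addition improves it.
    selected, best = [], score([])
    remaining = list(features)
    while remaining:
        top_score, top_feat = max((score(selected + [f]), f) for f in remaining)
        if top_score <= best + tol:
            break  # stopping criterion: no further improvement
        selected.append(top_feat)
        remaining.remove(top_feat)
        best = top_score
    return selected

# Toy score for illustration: only features "a" and "b" help the model.
useful = {"a": 2.0, "b": 1.0}
score = lambda subset: sum(useful.get(f, 0.0) for f in subset)
print(forward_selection(["a", "b", "c"], score))  # ['a', 'b']
```

Backward elimination follows the mirror-image loop: start from all features and greedily drop the one whose removal hurts the score least.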
