ML Unit4 Notes

UNIT-IV

Model Validation in Classification: Cross Validation - Holdout Method, K-Fold, Stratified K-Fold, Leave-One-Out Cross Validation. Bias-Variance Tradeoff, Regularization, Overfitting, Underfitting. Ensemble Methods: Boosting, Bagging, Random Forest.

4.1 CROSS VALIDATION

To test the performance of a classifier, we need a number of training/validation set pairs from a dataset X. If the sample X is large enough, we can randomly divide it into K parts, then divide each part randomly into two and use one half for training and the other half for validation. Unfortunately, datasets are rarely large enough to do this. So we reuse the same data, split differently; this is called cross-validation. Cross-validation is a technique to evaluate predictive models by partitioning the original sample into a training set to train the model and a test set to evaluate it.

During the evaluation of machine learning (ML) models, the following questions might arise:

Is this model the best one available from the hypothesis space of the algorithm in terms of generalization error on an unknown/future data set?

What training and testing techniques are used for the model?

Which model should be selected from the available ones?

4.2 Methods used for Cross-Validation:


4.2.1 Holdout Method

Consider training a model using an algorithm on a given dataset. Using the same training data, you determine that the trained model has an accuracy of 95% or even 100%. What does this mean? Can this model be used for prediction? No. This is because your model has been trained on that data, i.e. it knows the data and has fit it very closely. In contrast, when you try to predict on a new set of data, it will most likely give you poor accuracy, because it has never seen that data before and thus cannot generalize well to it. To deal with such problems, hold-out methods can be employed.

The hold-out method involves splitting the data into multiple parts and using one part for training the model and the rest for validating and testing it. It can be used for both model evaluation and model selection.

If every piece of data is used for training the model, there remains the problem of selecting the best model from a list of possible models. Primarily, we want to identify which model has the lowest generalization error, i.e. which model makes better predictions on future or unseen datasets than all of the others. There needs to be a mechanism that allows the model to be trained on one set of data and tested on another set of data. This is where hold-out comes into play.

Hold-Out Method for Model Evaluation

Model evaluation using the hold-out method entails splitting the dataset into training and test datasets, evaluating model performance, and determining the best model. The dataset is divided into two parts: one split is held aside as a training set, and the other is held back for testing or evaluation of the model. The split percentage is determined based on the amount of training data available. A typical split is 70–30%, in which 70% of the dataset is used for training and 30% is used for testing the model.

The objective of this technique is to select the best model based on its accuracy on the test dataset, comparing it with other models. There is, however, the possibility that the model becomes well fitted to the test data when this technique is repeated. In other words, models end up being tuned to improve accuracy on the test dataset under the assumption that the test dataset represents the population. As a result, the test error becomes an optimistic estimate of the generalization error. Obviously, this is not what we want. Since the final model is tuned to fit well (or overfit) the test data, it won't generalize well to unknown or future datasets.

Follow the steps below to use the hold-out method for model evaluation:

1. Split the dataset in two (preferably 70–30%; however, the split percentage can vary and should be random).
2. Train the model on the training dataset, selecting some fixed set of hyperparameters.
3. Use the hold-out test dataset to evaluate the model.
4. Use the entire dataset to train the final model so that it can generalize better on future datasets.

In this process, the dataset is split into training and test sets, and a fixed set of hyperparameters is used to evaluate the model. There is another process in which the data is split into three sets, and these sets are used to select a model or to tune hyperparameters. A minimal sketch of the two-way split is shown below.
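A minimal illustrative sketch of the hold-out steps above, using scikit-learn; the dataset and classifier here are assumptions chosen only to keep the example self-contained:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumed example data; any (X, y) pair works here.
X, y = load_iris(return_X_y=True)

# Step 1: random 70-30 split into training and hold-out test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Step 2: train with one fixed set of hyperparameters.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 3: evaluate on the held-out test data.
print("hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 4: retrain the chosen model on the full dataset before deployment.
model.fit(X, y)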

Hold-Out Method for Model Selection

Sometimes the model selection process is referred to as hyperparameter tuning. During the hold-out method of selecting a model, the dataset is separated into three sets: training, validation, and test.

Follow the steps below to use the hold-out method for model selection:

1. Divide the dataset into three parts: training dataset, validation dataset, and test dataset.
2. Train different models using different machine learning algorithms. For example, train a classification model using logistic regression, random forest, and XGBoost.
3. Tune the hyperparameters for the models trained with the different algorithms. Change the hyperparameter settings for each algorithm mentioned in step 2 and produce multiple models.
4. Test the performance of each of these models (one per algorithm and hyperparameter setting) on the validation dataset.
5. Choose the best model from those tested on the validation dataset; it will be the one with the best hyperparameters. Continuing the example above, suppose the XGBoost model with the best hyperparameters is selected.
6. Finally, test the performance of the selected model on the test dataset. A minimal sketch of this three-way split is given below.
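A minimal illustrative sketch of the three-way split described above; the candidate algorithms here (logistic regression and random forest) are stand-ins chosen only to keep the example self-contained and runnable:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                       # assumed example data

# Step 1: split into 60% train, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Steps 2-3: candidate models built with different algorithms/hyperparameters.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf_100": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Steps 4-5: pick the model with the best validation accuracy.
best_name, best_model, best_acc = None, None, -1.0
for name, model in candidates.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_name, best_model, best_acc = name, model, acc

# Step 6: report the selected model's performance on the untouched test set.
print(best_name, "test accuracy:", accuracy_score(y_test, best_model.predict(X_test)))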

4.2.2 K-Fold Cross-Validation


The k-fold cross-validation approach divides the input dataset into k groups of samples of roughly equal size. These groups are called folds. For each learning set, the prediction function is trained on k-1 folds, and the remaining fold is used as the test set. This is a very popular CV approach because it is easy to understand and the resulting estimate is less biased than with other methods. The steps for k-fold cross-validation are:

o Split the input dataset into k groups (folds).

o For each group:

o Take one group as the reserve or test data set.

o Use the remaining groups as the training dataset.

o Fit the model on the training set and evaluate its performance using the test set.

Let's take the example of 5-fold cross-validation, where the dataset is grouped into 5 folds. In the 1st iteration, the first fold is reserved for testing the model, and the rest are used for training. In the 2nd iteration, the second fold is used to test the model, and the rest are used for training. This process continues until each fold has been used as the test fold exactly once. A minimal sketch is given below.
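A minimal illustrative sketch of 5-fold cross-validation with scikit-learn, assuming a generic (X, y) dataset:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)                # assumed example data
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])        # train on k-1 folds
    # evaluate on the single held-out fold
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print("per-fold accuracy:", np.round(scores, 3))
print("mean accuracy:", np.mean(scores))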
4.2.3 Stratified k-fold cross-validation:

This technique is similar to k-fold cross-validation with some small changes. It is based on the concept of stratification, a process of rearranging the data to ensure that each fold or group is a good representative of the complete dataset. It is one of the best approaches for dealing with bias and variance.

It can be understood with the example of housing prices, where the price of some houses can be much higher than that of others, or of a classification dataset in which one class is much rarer than another. To handle such situations, the stratified k-fold cross-validation technique is useful; a minimal sketch follows.
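A minimal illustrative sketch, assuming a classification dataset in which each fold should preserve the class proportions of the full dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)                # assumed example data

# Each fold keeps (approximately) the same class proportions as the full dataset.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print("stratified 5-fold accuracy per fold:", scores)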

4.2.4 Leave-one-out cross-validation

This method is similar to leave-p-out cross-validation, but instead of p points, we leave only 1 data point out of training. In this approach, for each learning set, only one data point is reserved for testing, and the remaining dataset is used to train the model. This process repeats for each data point. Hence, for n samples, we get n different training sets and n test sets. It has the following features:

o In this approach, the bias is minimal, as nearly all the data points are used for training.

o The process is executed n times; hence, the execution time is high.

o This approach leads to high variance in the estimate of model effectiveness, as each test is made against a single data point.

A minimal sketch is given below.
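An illustrative sketch only, using scikit-learn's LeaveOneOut splitter on an assumed small dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)            # assumed example data (n = 150)

# n iterations: each one trains on n-1 points and tests on the single point left out.
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print("number of fits:", len(scores))        # equals n
print("LOOCV accuracy:", scores.mean())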
4.3 Bias-Variance Tradeoff

It is important to understand prediction errors (bias and variance) when it comes to accuracy in any machine learning algorithm. There is a tradeoff between a model's ability to minimize bias and its ability to minimize variance; balancing the two guides choices such as the value of the regularization constant. A proper understanding of these errors helps to avoid overfitting and underfitting a dataset while training the algorithm.

Bias
Bias is the difference between the predictions of the ML model and the correct values. High bias gives a large error on the training as well as the test data. It is recommended that an algorithm be low-bias in order to avoid the problem of underfitting. With high bias, the predicted values follow a straight-line (overly simple) form that does not fit the data in the dataset accurately. Such fitting is known as underfitting of the data. This happens when the hypothesis is too simple or linear in nature.

In such a problem, the hypothesis is typically a very simple function, e.g. a straight line hθ(x) = θ0 + θ1x.


Variance
Variance is the variability of the model's prediction for a given data point; it tells us the spread of our predictions. A model with high variance fits the training data with a very complex function and is thus not able to predict accurately on data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data. When a model has high variance, it is said to be overfitting the data. Overfitting means fitting the training set very accurately with a complex curve and high-order hypothesis, but this is not the desired solution because the error on unseen data is high. While training a model, the variance should be kept low.

In such a problem, the hypothesis is typically a high-degree polynomial, e.g. hθ(x) = θ0 + θ1x + θ2x² + ⋯ + θmx^m for large m.

Bias-Variance Tradeoff
If the algorithm is too simple (a hypothesis with a linear equation), it may have high bias and low variance and thus be error-prone. If the algorithm fits too complex a hypothesis (one with a high-degree equation), it may have high variance and low bias. In the latter condition, the model will not perform well on new entries. There is a sweet spot between these two conditions, known as the trade-off or bias-variance trade-off.

This trade-off in complexity is why there is a trade-off between bias and variance: an algorithm cannot be more complex and less complex at the same time. The best fit is given by the hypothesis at the trade-off point on the error-versus-complexity curve. This point is the best choice for training the algorithm, as it gives low error on the training data as well as the test data.
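For reference, a standard way to make this trade-off precise is the decomposition of the expected squared prediction error at a point:

Expected error = Bias² + Variance + Irreducible error, i.e. E[(y − ŷ)²] = (Bias[ŷ])² + Var[ŷ] + σ²

Reducing model complexity lowers variance but raises bias, and increasing complexity does the opposite; the best point is where their sum is smallest.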
4.4 Regularization:

Regularization is one of the most important concepts of machine learning. It is a technique to prevent the model from overfitting by adding extra information to it. Sometimes a machine learning model performs well on the training data but does not perform well on the test data: the model is not able to predict the output when dealing with unseen data, because it has also fitted the noise, and such a model is called overfitted. This problem can be dealt with using a regularization technique.

This technique allows us to keep all the variables or features in the model while reducing the magnitude of their coefficients. Hence, it maintains accuracy as well as the generalization of the model. It mainly regularizes, i.e. shrinks, the coefficients of the features toward zero. In simple words, "in a regularization technique, we reduce the magnitude of the coefficients while keeping the same number of features."

Regularization works by adding a penalty or complexity term to the complex model. Let's consider the simple linear regression equation:

y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b

In the above equation, y represents the value to be predicted, x1, x2, …, xn are the features for y, and β0, β1, …, βn are the weights or magnitudes attached to the features, respectively. Here β0 represents the bias of the model, and b represents the intercept.

Linear regression models try to optimize the coefficients and b to minimize the cost function. We add a loss function and optimize the parameters to build a model that can predict the accurate value of y. The loss function for linear regression is called RSS, the residual sum of squares, given below.
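In standard form (matching the notation above), the residual sum of squares is

RSS = Σ ( yi − ŷi )², where ŷi = β0 + β1xi1 + β2xi2 + ⋯ + βnxin

and the sum runs over all training examples i.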

Techniques of Regularization

There are mainly two types of regularization techniques, which are given below:

o Ridge Regression
o Lasso Regression

Ridge Regression
o Ridge regression is a type of linear regression in which a small amount of bias is introduced so that we can get better long-term predictions.
o Ridge regression is a regularization technique used to reduce the complexity of the model. It is also called L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the ridge regression penalty. We calculate it by multiplying the regularization parameter lambda by the squared weight of each individual feature.
o The equation for the cost function in ridge regression is given below.
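In standard form (reconstructed here to match the notation above), the ridge cost function is

Cost = Σ ( yi − ŷi )² + λ Σ βj² = RSS + λ Σ βj²

where the second sum runs over the feature weights β1, …, βn.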

o In the above equation, the penalty term regularizes the coefficients of the model; hence, ridge regression reduces the amplitudes of the coefficients, which decreases the complexity of the model.

o As we can see from the above equation, as the value of λ tends to zero, the equation becomes the cost function of the plain linear regression model. Hence, for a minimal value of λ, the model resembles the linear regression model.

o A general linear or polynomial regression will fail if there is high collinearity between the independent variables; to solve such problems, ridge regression can be used.

o It also helps to solve problems where we have more parameters than samples.

Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to ridge regression, except that the penalty term contains the absolute values of the weights instead of their squares.
o Since it takes absolute values, it can shrink a slope all the way to 0, whereas ridge regression can only shrink it close to 0.
o It is also called L1 regularization. The equation for the cost function of lasso regression is given below.
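In standard form (reconstructed here to match the notation above), the lasso cost function is

Cost = Σ ( yi − ŷi )² + λ Σ |βj| = RSS + λ Σ |βj|

where the second sum again runs over the feature weights.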

o In this technique, some of the feature coefficients shrink to exactly zero, so those features are completely neglected by the model.
o Hence, lasso regression can help us to reduce overfitting in the model and also performs feature selection.
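A minimal illustrative sketch (using an assumed synthetic dataset) of the difference described above: lasso drives some coefficients to exactly zero, while ridge only shrinks them:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Assumed synthetic data: 10 features, only 3 of which are informative.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)       # L2 penalty: lambda * sum(beta_j^2)
lasso = Lasso(alpha=1.0).fit(X, y)       # L1 penalty: lambda * sum(|beta_j|)

print("ridge coefficients:", np.round(ridge.coef_, 2))   # small but non-zero
print("lasso coefficients:", np.round(lasso.coef_, 2))   # often exactly 0 for uninformative features
print("features dropped by lasso:", int(np.sum(lasso.coef_ == 0)))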

4.5 Overfitting and Underfitting:
To train our machine learning model, we give it some data to learn from.
The process of plotting a series of data points and drawing the best fit line
to understand the relationship between the variables is called Data Fitting.
Our model is the best fit when it captures all the necessary patterns in our data and ignores the random data points and unnecessary patterns, called noise.
4.5.1 Overfitting
When a model performs very well on training data but has poor performance on test data (new data), it is known as overfitting. In this case, the machine learning model learns the details and noise in the training data to the extent that it negatively affects the performance of the model on test data. Overfitting happens due to low bias and high variance.

Reasons for Overfitting

Data used for training is not cleaned and contains noise (garbage values) in it

The model has a high variance

The size of the training dataset used is not enough

The model is too complex

Ways to Tackle Overfitting

Using K-fold cross-validation


Using Regularization techniques such as Lasso and Ridge
Training model with sufficient data

Adopting ensembling techniques


4.5.2 Underfitting:
When a model has not learned the patterns in the training data well and is unable to
generalize well on the new data, it is known as underfitting. An underfit model has poor
performance on the training data and will result in unreliable predictions. Underfitting
occurs due to high bias and low variance.

Reasons for Underfitting


Data used for training is not cleaned and contains noise (garbage values) in it

The model has a high bias


The size of the training dataset used is not enough

The model is too simple

Ways to Tackle Underfitting
Increase the number of features in the dataset
Increase model complexity
Reduce noise in the data
Increase the duration of training the model

4.6 Ensemble Methods:


When you want to purchase a new car, will you walk up to the first car shop and purchase one based on the advice of the dealer? It's highly unlikely.

You would likely browse a few web portals where people have posted their reviews and compare different car models, checking their features and prices. You will also probably ask your friends and colleagues for their opinions. In short, you wouldn't directly reach a conclusion, but will instead make a decision after considering the opinions of other people as well.

Ensemble models in machine learning operate on a similar idea. They combine the
decisions from multiple models to improve the overall performance.

Advantage: Improvement in predictive accuracy.

Disadvantage: It is difficult to understand an ensemble of classifiers.

Ensembles overcome three problems


Statistical Problem –

The Statistical Problem arises when the hypothesis space is too large
for the amount of available data. Hence, there are many hypotheses
with the same accuracy on the data and the learning algorithm chooses only
one of them! There is a risk that the accuracy of the chosen hypothesis is low
on unseen data!

Computational Problem –
The Computational Problem arises when the learning algorithm cannot guarantee finding the best hypothesis.

Representational Problem –
The Representational Problem arises when the hypothesis space does not
contain any good approximation of the target class(es).
Types of Ensemble Classifier –
1)Bagging
2)Boosting
3)Random Forest

4.6.1 Bagging:
Bagging stands for Bootstrap AGGregating; it gets its name because it combines bootstrapping and aggregation to form one ensemble model. Given a sample of data, multiple bootstrapped subsamples are drawn. A decision tree is formed on each of the bootstrapped subsamples. After each subsample's decision tree has been formed, the trees' predictions are aggregated (e.g. by majority vote) to form the most efficient predictor. A minimal sketch is given below.
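An illustrative sketch using scikit-learn's BaggingClassifier, whose default base learner is a decision tree; the dataset is an assumption chosen only to keep the example self-contained:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                       # assumed example data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 50 base learners (decision trees by default), each trained on a bootstrap
# sample of the training set; their predictions are aggregated by voting.
bag = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X_train, y_train)
print("bagging test accuracy:", bag.score(X_test, y_test))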

4.6.2 Boosting :
Unlike bagging, which aggregates prediction results at the end, boosting aggregates results at each step. The results are combined using weighted averaging. Weighted averaging involves giving the models different weights depending on their predictive power; in other words, more weight is given to the model with the higher predictive power, because that learner is considered the most important.

Boosting works with the following steps:

1. We sample m-number of subsets from an initial training dataset.

2. Using the first subset, we train the first weak learner.

3. We test the trained weak learner using the training data. As a result of

the testing, some data points will be incorrectly predicted.

4. Each data point with the wrong prediction is sent into the second subset of data,

and this subset is updated.

5. Using this updated subset, we train and test the second weak learner.

6. We continue with the following subset until the total number of subsets is reached.

7. We now have the total prediction. The overall prediction has already been aggregated at each step, so there is no need to calculate it at the end. A minimal sketch of boosting is given below.
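An illustrative sketch only; AdaBoost is used here as one concrete boosting algorithm (with decision stumps as the weak learners), and the dataset is an assumption chosen to keep the example self-contained:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)              # assumed example data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Each round fits a weak learner, up-weights the misclassified points,
# and the final prediction is a weighted vote over all rounds.
boost = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
boost.fit(X_train, y_train)
print("boosting test accuracy:", boost.score(X_test, y_test))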

4.6.3 Random Forest Algorithm


Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It
is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, it outputs the final prediction.

A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.

Why use Random Forest?


Below are some points that explain why we should use the Random Forest algorithm:


o It takes less training time as compared to other algorithms.

o It predicts output with high accuracy; even for a large dataset, it runs efficiently.

o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?


Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make a prediction with each tree created in the first phase.

The working process can be explained in the steps below:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For a new data point, find the prediction of each decision tree and assign the new data point to the category that wins the majority of votes.
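An illustrative sketch using scikit-learn's RandomForestClassifier on an assumed example dataset:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                       # assumed example data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# N = 100 trees, each grown on a bootstrap sample with random feature subsets;
# the final class is decided by majority vote across the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("random forest test accuracy:", forest.score(X_test, y_test))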
Max Voting
• The max voting method is generally used for classification problems.
• In this technique, multiple models are used to make predictions for each data point.
• The prediction by each model is counted as a 'vote'; the class that receives the majority of the votes is taken as the final prediction.
Averaging
• In this method, we take an average of the predictions from all the models and use it to make the final prediction.
• Averaging can be used for making predictions in regression problems or for calculating probabilities in classification problems.
Weighted Averaging
• All models are assigned different weights defining the importance of each model for prediction; the final output is the weighted average of the individual predictions.
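A minimal illustrative sketch of max voting, averaging, and weighted averaging over hypothetical predictions from three models:

import numpy as np

# Hypothetical predicted classes from three classifiers for 5 data points.
pred_a = np.array([1, 0, 1, 1, 0])
pred_b = np.array([1, 1, 1, 0, 0])
pred_c = np.array([0, 0, 1, 1, 1])

# Max voting: the class predicted by the majority of models wins.
votes = np.stack([pred_a, pred_b, pred_c])
max_vote = (votes.sum(axis=0) >= 2).astype(int)
print("max voting:", max_vote)

# Hypothetical predicted probabilities (or regression outputs) from the same models.
prob_a = np.array([0.9, 0.4, 0.8, 0.7, 0.2])
prob_b = np.array([0.6, 0.5, 0.9, 0.4, 0.3])
prob_c = np.array([0.3, 0.2, 0.7, 0.6, 0.8])

# Simple averaging: equal weight for every model.
print("averaging:", (prob_a + prob_b + prob_c) / 3)

# Weighted averaging: weights reflect each model's assumed importance (they sum to 1).
weights = np.array([0.5, 0.3, 0.2])
print("weighted averaging:", weights[0]*prob_a + weights[1]*prob_b + weights[2]*prob_c)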
