ML Unit4 Notes
4.1 CROSS VALIDATION
During the evaluation of machine learning (ML) models, the following question might arise:
Is this model the best one available from the hypothesis space of the algorithm, in terms of generalization error on unseen data?
Consider training a model using an algorithm on a given dataset. Using the same training data, you determine that the trained model has an accuracy of 95% or even 100%. What does this mean? Can this model be used for prediction?
No. This is because your model has been trained on this exact data; i.e., it knows the data and has simply fit (or memorized) it very well. In contrast, when you try to predict on a new set of data, it will most likely give you very poor accuracy, because it has never seen that data before and cannot generalize well over it. To deal with such problems, hold-out methods can be employed.
The hold-out method involves splitting the data into multiple parts, using one part for training the model and the rest for validating and testing it. It can be used for both model evaluation and model selection.
When every piece of data is used for training the model, there remains the problem of selecting the best model from a list of possible models. Primarily, we want to identify which model has the lowest generalization error, i.e., which model makes better predictions on future or unseen datasets than all of the others. We therefore need a mechanism that allows the model to be trained on one set of data and tested on another set of data. This is where the hold-out method comes into play.
Model evaluation using the hold-out method entails splitting the dataset into training and test datasets, evaluating model performance on the test set, and determining the most optimal model.
There are two parts to the dataset in this method. One split is held aside as a training set. The other is held back for testing or evaluation of the model. The split percentage is determined based on the amount of training data available. A typical split of 70-30% is used, in which 70% of the dataset is used for training and 30% is used for testing the model.
The objective of this technique is to select the best model based on its accuracy on the testing dataset, comparing it with other models. There is, however, the possibility that the model ends up well fitted to the test data when this technique is used: models get tuned to improve their accuracy on the test dataset, based on the assumption that the test dataset represents the population. As a result, the test error becomes an optimistic estimate of the generalization error. Obviously, this is not what we want: since the final model is tuned to fit (or overfit) the test data, it won't generalize well to unknown or future datasets.
Follow the steps below for using the hold-out method for model evaluation:
1. Split the dataset in two (preferably 70-30%; however, the split percentage can vary and should be random).
2. Train the model on the training dataset, using some fixed set of hyperparameters.
3. Evaluate the performance of the trained model on the test dataset.
4. Use the entire dataset to train the final model so that it can generalize better on future datasets.
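A minimal sketch of these steps using scikit-learn; the dataset (iris), the classifier (logistic regression), and the 70-30 split are illustrative assumptions, not part of the notes:

# Hold-out method: 70-30 train/test split (illustrative sketch)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: random 70-30 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Step 2: train with a fixed set of hyperparameters
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)

# Step 3: evaluate on the held-out test set
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 4: retrain the final model on the full dataset
final_model = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)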
In this process, the dataset is split into training and test sets, and a fixed set of
hyperparameters is used to evaluate the model. There is another process in
which data can also be split into three sets, and these sets can be used to select a
model or to tune hyperparameters.
Follow the steps below for using the hold-out method for model selection:
1. Divide the dataset into three parts: training dataset, validation dataset, and test
dataset.
2. Now, different machine learning algorithms can be used to train different models.
You can train your classification model, for example, using logistic regression,
random forest, and XGBoost.
3. Tune the hyperparameters for models trained with different algorithms.
Change the hyperparameter settings for each algorithm mentioned in step 2
and come up with multiple models.
4. On the validation dataset, test the performance of each of these models (associated with each of the algorithms).
5. Choose the most optimal model from those tested on the validation dataset. The
most optimal model will be set up with the most optimal hyperparameters. Using the
example above, let’s suppose the model trained with XGBoost with the
most optimal hyperparameters is selected.
6. Finally, on the test dataset, test the performance of the most optimal
model.
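A rough sketch of hold-out based model selection with scikit-learn; the dataset, the candidate models (logistic regression and two random forest settings stand in for the logistic regression / random forest / XGBoost example above), and the 60-20-20 split are assumptions for illustration:

# Hold-out method for model selection: train / validation / test split (sketch)
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Step 1: three-way split (60% train, 20% validation, 20% test)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Steps 2-4: train candidate models (different algorithms / hyperparameters)
# and compare them on the validation set
candidates = {
    "logreg": LogisticRegression(max_iter=5000),
    "rf_100": RandomForestClassifier(n_estimators=100, random_state=0),
    "rf_300": RandomForestClassifier(n_estimators=300, random_state=0),
}
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))

# Step 5: pick the model with the best validation accuracy
best_name = max(val_scores, key=val_scores.get)

# Step 6: report its performance on the untouched test set
best_model = candidates[best_name]
print(best_name, accuracy_score(y_test, best_model.predict(X_test)))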
K-fold cross-validation:
K-fold cross-validation divides the input dataset into K groups of samples of equal size, called folds. In each iteration, K-1 folds are used for training, and the remaining fold is used as the test set. This approach is a very popular CV approach because it is easy to understand, and the output is less biased than other methods.
o Split the dataset into K folds.
o In each iteration, take one fold as the test set and use the remaining folds as the training set.
o Fit the model on the training set and evaluate the performance of the model using the test set.
Let's take an example of 5-fold cross-validation, so the dataset is grouped into 5 folds. In the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train it. In the 2nd iteration, the second fold is used to test the model, and the rest are used to train it. This process continues until each fold has been used once as the test fold.
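A minimal 5-fold cross-validation sketch in scikit-learn; the dataset and classifier are illustrative assumptions:

# 5-fold cross-validation: each fold serves once as the test set
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])  # train on the other 4 folds
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))  # test on the held-out fold

print("Fold accuracies:", scores)
print("Mean CV accuracy:", np.mean(scores))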
4.2.3 Stratified k-fold cross-validation:
This technique is similar to k-fold cross-validation with some small changes. It works on the concept of stratification, which is the process of rearranging the data so that each fold or group is a good representative of the complete dataset. It is one of the best approaches for dealing with bias and variance.
It can be understood with an example of housing prices: the prices of some houses can be much higher than those of other houses. To handle such situations, a stratified k-fold cross-validation technique is useful.
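A minimal stratified k-fold sketch; note that scikit-learn's StratifiedKFold stratifies on class labels, so a classification dataset is assumed here (for a continuous target such as house prices, one would typically bin the target first):

# Stratified 5-fold CV: each fold preserves the class distribution of the full dataset
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=skf)
print("Stratified CV accuracies:", scores)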
A related variant, leave-one-out cross-validation, uses a single data point as the test set in each iteration:
o In this approach, the bias is minimal, as all the data points are used.
o This approach leads to high variation in testing the effectiveness of the model, as we iteratively check against one data point.
4.3 Bias-Variance Trade-off
Bias
Bias is the difference between the predictions of the ML model and the correct values. High bias gives a large error on the training as well as the test data. It is recommended that an algorithm always be low-biased, to avoid the problem of underfitting. With high bias, the predicted values follow a straight-line (overly simple) pattern and thus do not fit the data in the dataset accurately. Such fitting is known as underfitting of the data. This happens when the hypothesis is too simple or linear in nature.
Variance
Variance is the amount by which the model's predictions would change if it were trained on a different training dataset. High variance means the model fits the training data too closely and performs poorly on new, unseen data.
Bias Variance Tradeoff
If the algorithm is too simple (a hypothesis with a linear equation), then it may be in a high-bias, low-variance condition and thus be error-prone. If the algorithm fits something too complex (a hypothesis with a high-degree equation), then it may be in a high-variance, low-bias condition; in the latter case, the model will not perform well on new entries. There is something between both of these conditions, known as the Trade-off, or the Bias-Variance Trade-off.
This tradeoff in complexity is why there is a tradeoff between bias and variance: an algorithm can't be more complex and less complex at the same time. The perfect tradeoff is the point of model complexity at which both errors are balanced; this is the best point chosen for training the algorithm, giving low error on the training as well as the test data.
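As a supporting point (a standard decomposition, not written out explicitly in these notes), the expected test error at a point can be split into the two competing terms plus irreducible noise:

Expected Error = Bias² + Variance + Irreducible Error
i.e., E[(y − f̂(x))²] = (Bias[f̂(x)])² + Var[f̂(x)] + σ²

The best point referred to above is the model complexity at which this sum is smallest.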
4.4 Regularization:
This technique allows us to keep all the variables or features in the model while reducing the magnitude of their coefficients. Hence, it maintains accuracy as well as the generalization of the model. It mainly regularizes, or shrinks, the coefficients of the features toward zero. In simple words, "in the regularization technique, we reduce the magnitude of the features' coefficients while keeping the same number of features."
Consider a linear regression model:
y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b
Linear regression tries to optimize β0 and b (and the other coefficients) to minimize the cost function. The loss function for linear regression is called RSS, the residual sum of squares, i.e., the sum of squared differences between the actual and predicted values:
RSS = Σi ( yi − (β0 + β1xi1 + β2xi2 + ⋯ + βnxin) )²
Regularization adds a penalty term to this loss function and optimizes the parameters so that the model can predict accurate values of y without overfitting.
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of bias is introduced so that we can get better long-term predictions.
o Ridge regression is a regularization technique used to reduce the complexity of the model. It is also called L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the ridge regression penalty. We can calculate it by multiplying lambda by the squared weight of each individual feature.
o The equation for the cost function in ridge regression is (a code sketch follows this list):
Cost = Σi (yi − ŷi)² + λ Σj βj²
o In the above equation, the penalty term regularizes the coefficients of the model; hence ridge regression reduces the magnitudes of the coefficients, which decreases the complexity of the model.
o As we can see from the above equation, if the value of λ tends to zero, the equation becomes the cost function of the ordinary linear regression model. Hence, for a minimal value of λ, the model resembles the linear regression model.
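A minimal ridge regression sketch using scikit-learn; the diabetes dataset and the alpha value (alpha plays the role of λ above) are illustrative assumptions:

# Ridge (L2) regression: larger alpha => smaller coefficient magnitudes
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge

X, y = load_diabetes(return_X_y=True)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha is the L2 penalty strength

print("Unregularized coefficients:", plain.coef_.round(1))
print("Ridge coefficients:       ", ridge.coef_.round(1))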
Lasso Regression:
o Lasso regression is another regularization technique used to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to ridge regression except that the penalty term contains the absolute values of the weights instead of the squares of the weights.
o Since it takes absolute values, it can shrink a slope exactly to 0, whereas ridge regression can only shrink it close to 0.
o It is also called L1 regularization. The equation for the cost function of lasso regression is:
Cost = Σi (yi − ŷi)² + λ Σj |βj|
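A matching lasso sketch under the same assumptions; note how some coefficients are shrunk exactly to zero:

# Lasso (L1) regression: some coefficients become exactly zero (feature selection)
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
lasso = Lasso(alpha=1.0).fit(X, y)  # alpha is the L1 penalty strength

print("Lasso coefficients:", lasso.coef_.round(1))
print("Features dropped (coefficient == 0):", int((lasso.coef_ == 0).sum()))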
Reasons for Overfitting
o Data used for training is not cleaned and contains noise (garbage values).
Ways to Tackle Underfitting
o Increase the number of features in the dataset.
o Increase model complexity.
o Reduce noise in the data.
o Increase the duration of training.
4.6 Ensemble Learning:
Suppose you want to buy a new car. You would likely browse a few web portals where people have posted their reviews, and compare different car models, checking their features and prices. You will also probably ask your friends and colleagues for their opinions. In short, you wouldn't directly reach a conclusion, but would instead make a decision considering the opinions of other people as well.
Ensemble models in machine learning operate on a similar idea. They combine the decisions from multiple models to improve the overall performance.
Statistical Problem –
The Statistical Problem arises when the hypothesis space is too large for the amount of available data. Hence, there are many hypotheses with the same accuracy on the data, and the learning algorithm chooses only one of them! There is a risk that the accuracy of the chosen hypothesis is low on unseen data!
Computational Problem –
The Computational Problem arises when the learning algorithm cannot guarantee finding the best hypothesis.
Representational Problem –
The Representational Problem arises when the hypothesis space does not
contain any good approximation of the target class(es).
Types of Ensemble Classifier –
1) Bagging
2) Boosting
3) Random Forest
4.6.1 Bagging:
Bagging, or Bootstrap AGGregating, gets its name because it combines Bootstrapping and Aggregation to form one ensemble model. Given a sample of data, multiple bootstrapped subsamples are pulled. A decision tree is formed on each of the bootstrapped subsamples. After each subsample's decision tree has been formed, an algorithm is used to aggregate over the decision trees to form the final, most efficient predictor.
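A minimal bagging sketch using scikit-learn's BaggingClassifier over decision trees; the dataset and the number of estimators are assumptions (recent scikit-learn versions take the base learner as estimator; older versions call it base_estimator):

# Bagging: decision trees fit on bootstrapped subsamples, predictions aggregated by voting
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # base learner fit on each bootstrap sample
    n_estimators=50,                     # number of bootstrapped trees
    bootstrap=True,
    random_state=0,
)
print("Bagging CV accuracy:", cross_val_score(bag, X, y, cv=5).mean())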
4.6.2 Boosting:
Unlike bagging, which aggregates the prediction results at the end, boosting aggregates the results at each step. The results are aggregated using weighted averaging.
Weighted averaging involves giving the models different weights depending on their predictive power. In other words, it gives more weight to the model with the higher predictive power, because the learner with the highest predictive power is considered the most important.
1. We sample a number of subsets from the initial training dataset.
2. Using the first subset, we train the first weak learner.
3. We test the trained weak learner using the training data. As a result of the testing, some data points will be incorrectly predicted.
4. Each data point with a wrong prediction is sent into the second subset of data, and this subset is updated.
5. Using this updated subset, we train and test the second weak learner.
6. We continue with the following subsets until the total number of subsets is reached.
7. We now have the total prediction. The overall prediction has already been aggregated at each step, so there is no need to calculate it separately at the end.
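A minimal boosting sketch; AdaBoost is used here as a concrete stand-in for the generic procedure above (it re-weights wrongly predicted points rather than literally copying them into new subsets), and the dataset and hyperparameters are illustrative assumptions:

# AdaBoost: weak learners trained sequentially, misclassified points get more weight,
# and the per-step predictions are combined by a weighted vote
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

boost = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
print("AdaBoost CV accuracy:", cross_val_score(boost, X, y, cv=5).mean())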
4.6.3 Random Forest:
As the name suggests, "random forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of predictions, predicts the final output.
A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
<="" li="">
o It predicts output with high accuracy; even for a large dataset, it runs efficiently.
The working process can be explained in the steps below:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Steps 1 and 2 until N trees have been built.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority of votes.
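A minimal random forest sketch in scikit-learn; the dataset, split, and N = 100 trees are illustrative assumptions:

# Random forest: many trees on random subsets, majority vote for the final class
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # N = 100 trees
forest.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, forest.predict(X_test)))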
Max Voting
• The max voting method is generally used for classification problems.
• In this technique, multiple models are used to make predictions for each data point.
• The prediction by each model is considered as a 'vote', and the class predicted by the majority of the models is used as the final prediction.
Averaging
• In this method, we take an average of predictions from all the models and use it to make the final
prediction.
• Averaging can be used for making predictions in regression problems or while calculating
probabilities for classification problems.
Weighted Averaging
• All models are assigned different weights defining the importance of each model for prediction.
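A small NumPy sketch of max voting, averaging, and weighted averaging over the class-1 probabilities of three hypothetical models (all numbers and weights are made up for illustration):

# Max voting, averaging, and weighted averaging over three models' predictions
import numpy as np

# predicted probability of class 1 from three hypothetical models, for 4 samples
probs = np.array([
    [0.9, 0.2, 0.6, 0.4],   # model A
    [0.8, 0.4, 0.7, 0.3],   # model B
    [0.4, 0.1, 0.8, 0.6],   # model C
])

# Max voting: each model casts a hard vote (probability >= 0.5 -> class 1),
# and the class with the majority of the 3 votes wins
votes = (probs >= 0.5).astype(int)
max_voting = (votes.sum(axis=0) >= 2).astype(int)

# Averaging: mean of the predicted probabilities
averaging = probs.mean(axis=0)

# Weighted averaging: better models get larger weights (weights sum to 1)
weights = np.array([0.5, 0.3, 0.2])
weighted_avg = np.average(probs, axis=0, weights=weights)

print(max_voting, averaging.round(2), weighted_avg.round(2))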